Due to the COVID-19 pandemic we were forced to turn to e-learning. I must admit that for me it was fun. As an academic teacher I learned a lot. We recorded videos of how to use the instruments, and then I edited them in Shotcut. The editing came down to adding some text, speeding up some fragments, cutting and merging. But since these were laboratory classes, the students still had to gather some data, analyse it and draw some conclusions.
One of the classes is Geiger-Muller tube voltage calibration. During the class students vary the Geiger-Muller counter's voltage and note how many counts the counter records at each setting. Some explanation: thanks to the plateau, the number of counts does not vary that much if the voltage changes slightly during or between the measurements (for example due to a power supply malfunction).
Knowing that radioactive decay follows the Poisson distribution, I generated the data using Python. The Poisson distribution is a one-parameter distribution in which the parameter is both the mean and the variance. So I took some results from previous years' classes and dived into the code. I do not impose any particular way of processing the data; students are free to use any kind of software. I tried to encourage them to use Python, R, Matlab, Scilab or whatever else there is in the world.
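To see this property in action, here is a quick check (just a sketch; a rate of 300 counts is an arbitrary value in the ballpark of the plateau counts further below):
import numpy as np
rng = np.random.default_rng(42)  # arbitrary seed
sample = rng.poisson(lam=300, size=100_000)  # lam is both the mean and the variance
print(sample.mean(), sample.var())  # both come out close to 300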
Well... they all use MS Excel, so I generated so much data that Excel cannot read it. Scroll down to check the code and some plots.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
plt.style.use("dark_background")
rcParams["font.size"] = 25
rcParams["figure.figsize"] = (15,8)
volts = "340;420;430;440;460;500;520;540;560;580;600;620;640;660;680;700;720;740;760;780;800;820;840;860;880;900;920;940"
cnts = "0.0;0.0;0.0;277;294;306;316;259;320;356;284;257;291;316;302;304;325;360;346;369;372;425;476;519;647;761;884;1221"
voltages = [int(x) for x in volts.split(";")]
counts = [int(float(x)) for x in cnts.split(";")]
basic_data = {voltage: count for voltage, count in zip([f"{x} V" for x in voltages], counts)}
basic_data = pd.DataFrame(basic_data, index=["mean"])
basic_data
|  | 340 V | 420 V | 430 V | 440 V | 460 V | 500 V | 520 V | 540 V | 560 V | 580 V | ... | 760 V | 780 V | 800 V | 820 V | 840 V | 860 V | 880 V | 900 V | 920 V | 940 V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0 | 0 | 0 | 277 | 294 | 306 | 316 | 259 | 320 | 356 | ... | 346 | 369 | 372 | 425 | 476 | 519 | 647 | 761 | 884 | 1221 |

1 rows × 28 columns
fig, ax = plt.subplots()
sns.lineplot(x=voltages, y=basic_data.iloc[0,:], ax=ax, linewidth=3)
ax.set(xlabel="voltage", ylabel="counts")
ax.tick_params(axis="x", rotation=45)
Well, although the data is not perfect, the plateau region is clearly visible.
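To put a number on it we can also estimate the usual plateau slope figure, quoted in percent per 100 V. This is just a quick sketch, assuming the plateau spans roughly 440-800 V:
i1, i2 = voltages.index(440), voltages.index(800)  # assumed plateau limits
n1, n2 = counts[i1], counts[i2]
v1, v2 = voltages[i1], voltages[i2]
slope = (n2 - n1) / n1 * 100 / (v2 - v1) * 100  # percent change per 100 V
print(f"plateau slope: {slope:.1f} % per 100 V")
A figure like this also helps when arguing about the working point; a common rule of thumb is to set the operating voltage somewhere around the middle of the plateau.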
num_measurements = 100000
# the first three voltages (340-430 V) recorded zero counts, so keep them as plain zeros
pd0 = pd.DataFrame(np.zeros((num_measurements, 3)))
# for every remaining voltage draw 100k Poisson samples, using the measured mean as lambda
pd1 = basic_data[basic_data.columns[3:]].apply(np.random.poisson, size=num_measurements)
data = pd.concat([pd0, pd1], axis=1)
data.columns = [f"{x} V" for x in voltages]
data.head()
|  | 340 V | 420 V | 430 V | 440 V | 460 V | 500 V | 520 V | 540 V | 560 V | 580 V | ... | 760 V | 780 V | 800 V | 820 V | 840 V | 860 V | 880 V | 900 V | 920 V | 940 V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 264 | 298 | 298 | 309 | 263 | 331 | 349 | ... | 339 | 359 | 347 | 398 | 471 | 498 | 670 | 752 | 902 | 1271 |
| 1 | 0.0 | 0.0 | 0.0 | 257 | 286 | 316 | 352 | 290 | 308 | 364 | ... | 326 | 359 | 396 | 386 | 497 | 474 | 686 | 751 | 880 | 1216 |
| 2 | 0.0 | 0.0 | 0.0 | 287 | 303 | 292 | 323 | 247 | 317 | 358 | ... | 350 | 381 | 384 | 442 | 455 | 519 | 670 | 766 | 845 | 1148 |
| 3 | 0.0 | 0.0 | 0.0 | 292 | 296 | 319 | 289 | 275 | 305 | 347 | ... | 350 | 349 | 356 | 387 | 490 | 555 | 627 | 749 | 843 | 1187 |
| 4 | 0.0 | 0.0 | 0.0 | 305 | 285 | 328 | 326 | 266 | 351 | 368 | ... | 359 | 367 | 381 | 447 | 468 | 487 | 683 | 763 | 927 | 1183 |

5 rows × 28 columns
melted_data = data.melt(var_name="Voltage [V]", value_name="Counts")
melted_data.head()
|  | Voltage [V] | Counts |
|---|---|---|
| 0 | 340 V | 0.0 |
| 1 | 340 V | 0.0 |
| 2 | 340 V | 0.0 |
| 3 | 340 V | 0.0 |
| 4 | 340 V | 0.0 |
melted_data["Voltage [V]"] = melted_data["Voltage [V]"].apply(lambda x: int(x.split()[0]))
melted_data.head()
|  | Voltage [V] | Counts |
|---|---|---|
| 0 | 340 | 0.0 |
| 1 | 340 | 0.0 |
| 2 | 340 | 0.0 |
| 3 | 340 | 0.0 |
| 4 | 340 | 0.0 |
Now we can generate a plot with a confidence interval: seaborn aggregates the repeated measurements at each voltage and draws the spread around the mean. One of the requirements to pass the class is to produce exactly such a plot and indicate what the optimal voltage for the G-M tube used during the measurements should be.
fig, ax = plt.subplots()
sns.lineplot(x="Voltage [V]", y="Counts", data=melted_data, ci="sd", linewidth=3)
ax.tick_params(axis="x", rotation=45)
data_transposed = data.T
data.shape, melted_data.shape, data_transposed.shape
((100000, 28), (2800000, 2), (28, 100000))
We now have the same data organised into three different shapes:
- data - 100k rows, 28 columns; this fits in MS Excel (the sheet limit is 1,048,576 rows)
- melted_data - 2.8 million rows, 2 columns; far too many rows for MS Excel
- data_transposed - 28 rows, 100k columns; far more than the 16,384 columns MS Excel can handle
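Since the MS Excel limits are fixed at 1,048,576 rows and 16,384 columns per sheet, it is easy to check which of the three frames would still open (a small sketch using the frames defined above):
EXCEL_MAX_ROWS, EXCEL_MAX_COLS = 1_048_576, 16_384
for name, df in [("data", data), ("melted_data", melted_data), ("data_transposed", data_transposed)]:
    rows, cols = df.shape
    fits = rows <= EXCEL_MAX_ROWS and cols <= EXCEL_MAX_COLS
    print(f"{name}: {rows} x {cols} -> {'fits in' if fits else 'too big for'} one Excel sheet")
Only data would open cleanly; the other two hit one of the limits.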
data.to_csv("data_ready.csv") # 11.9MB
melted_data.to_csv("data2_ready.csv") # 54.5MB
data_transposed.to_csv("data3_ready.csv") # 16.9MB
But now, after reading this article, I would use the .parquet file format. To do that we must first change the column names of the transposed data to strings, because parquet accepts only string column names. You may also need to install pyarrow or fastparquet to use the parquet file format.
data_transposed.columns = [f"measurement_{x}" for x in range(num_measurements)]
data.to_parquet("data_ready.parquet") # 2.5MB
melted_data.to_parquet("data2_ready.parquet") # 2.8MB
data_transposed.to_parquet("data3_ready.parquet") # 87.8MB
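A quick round-trip check (assuming pyarrow is installed) to make sure nothing is lost along the way; parquet keeps the dtypes and the index, so reading the file back should give an identical frame:
check = pd.read_parquet("data_ready.parquet")
print(check.equals(data))  # should print True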
As you can see, the transposed .parquet file takes up more disk space than its .csv counterpart (parquet is a columnar format, so 100k columns holding only 28 values each carry a lot of per-column overhead), so in this case I would stick to .csv.