Due to the COVID-19 pandemic we were forced to turn to e-learning. I must admit that for me it was fun. As an academic teacher I learned a lot. We recorded videos of how to use the instruments, and then I edited them in Shotcut. The editing came down to adding some text, speeding up some fragments, cutting and merging. But since these were laboratory classes, the students still had to gather some data, analyse it and draw some conclusions.
One of the classes is Geiger-Muller tube voltage calibration. During the class students vary the Geiger-Muller counter's voltage and note how many counts the counter records at each setting. Some explanation: thanks to the plateau, the number of counts does not vary that much if the voltage changes slightly during or between the measurements (for example due to a power supply malfunction).
Knowing that radioactive decay follows the Poisson distribution, I generated the data using Python. The Poisson distribution is a one-parameter distribution in which the parameter is both the mean and the variance. So I took some results from previous years' classes and dived into the code. I do not impose any particular way of processing the data; students are free to use any kind of software. I tried to encourage them to use Python, R, Matlab, Scilab or whatever else there is in the world.
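To see this property in action, here is a quick check (just a sketch; a rate of 300 counts is an arbitrary value in the ballpark of the plateau counts further below):
import numpy as np
rng = np.random.default_rng(42)  # arbitrary seed
sample = rng.poisson(lam=300, size=100_000)  # lam is both the mean and the variance
print(sample.mean(), sample.var())  # both come out close to 300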
Well... they all use MS Excel, so I generated so much data that Excel cannot read it. Scroll down to check the code and some plots.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
plt.style.use("dark_background")
rcParams["font.size"] = 25
rcParams["figure.figsize"] = (15,8)
volts = "340;420;430;440;460;500;520;540;560;580;600;620;640;660;680;700;720;740;760;780;800;820;840;860;880;900;920;940"
cnts = "0.0;0.0;0.0;277;294;306;316;259;320;356;284;257;291;316;302;304;325;360;346;369;372;425;476;519;647;761;884;1221"
voltages = [int(x) for x in volts.split(";")]
counts = [int(float(x)) for x in cnts.split(";")]
basic_data = {voltage: count for voltage, count in zip([f"{x} V" for x in voltages], counts)}
basic_data = pd.DataFrame(basic_data, index=["mean"])
basic_data
|  | 340 V | 420 V | 430 V | 440 V | 460 V | 500 V | 520 V | 540 V | 560 V | 580 V | ... | 760 V | 780 V | 800 V | 820 V | 840 V | 860 V | 880 V | 900 V | 920 V | 940 V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0 | 0 | 0 | 277 | 294 | 306 | 316 | 259 | 320 | 356 | ... | 346 | 369 | 372 | 425 | 476 | 519 | 647 | 761 | 884 | 1221 |

1 rows × 28 columns
fig, ax = plt.subplots()
sns.lineplot(x=voltages, y=basic_data.iloc[0,:], ax=ax, linewidth=3)
ax.set(xlabel="voltage", ylabel="counts")
ax.tick_params(axis="x", rotation=45)
Well, although the data is not perfect, the plateau region is clearly visible.
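To put a number on it we can also estimate the usual plateau slope figure, quoted in percent per 100 V. This is just a quick sketch, assuming the plateau spans roughly 440-800 V:
i1, i2 = voltages.index(440), voltages.index(800)  # assumed plateau limits
n1, n2 = counts[i1], counts[i2]
v1, v2 = voltages[i1], voltages[i2]
slope = (n2 - n1) / n1 * 100 / (v2 - v1) * 100  # percent change per 100 V
print(f"plateau slope: {slope:.1f} % per 100 V")
A figure like this also helps when arguing about the working point; a common rule of thumb is to set the operating voltage somewhere around the middle of the plateau.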
num_measurements = 100000
# the first three voltages (340-430 V) recorded zero counts, so keep them as plain zeros
pd0 = pd.DataFrame(np.zeros((num_measurements, 3)))
# for every remaining voltage draw 100k Poisson samples, using the measured mean as lambda
pd1 = basic_data[basic_data.columns[3:]].apply(np.random.poisson, size=num_measurements)
data = pd.concat([pd0, pd1], axis=1)
data.columns = [f"{x} V" for x in voltages]
data.head()
|  | 340 V | 420 V | 430 V | 440 V | 460 V | 500 V | 520 V | 540 V | 560 V | 580 V | ... | 760 V | 780 V | 800 V | 820 V | 840 V | 860 V | 880 V | 900 V | 920 V | 940 V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 264 | 298 | 298 | 309 | 263 | 331 | 349 | ... | 339 | 359 | 347 | 398 | 471 | 498 | 670 | 752 | 902 | 1271 |
| 1 | 0.0 | 0.0 | 0.0 | 257 | 286 | 316 | 352 | 290 | 308 | 364 | ... | 326 | 359 | 396 | 386 | 497 | 474 | 686 | 751 | 880 | 1216 |
| 2 | 0.0 | 0.0 | 0.0 | 287 | 303 | 292 | 323 | 247 | 317 | 358 | ... | 350 | 381 | 384 | 442 | 455 | 519 | 670 | 766 | 845 | 1148 |
| 3 | 0.0 | 0.0 | 0.0 | 292 | 296 | 319 | 289 | 275 | 305 | 347 | ... | 350 | 349 | 356 | 387 | 490 | 555 | 627 | 749 | 843 | 1187 |
| 4 | 0.0 | 0.0 | 0.0 | 305 | 285 | 328 | 326 | 266 | 351 | 368 | ... | 359 | 367 | 381 | 447 | 468 | 487 | 683 | 763 | 927 | 1183 |

5 rows × 28 columns
melted_data = data.melt(var_name="Voltage [V]", value_name="Counts")
melted_data.head()
|  | Voltage [V] | Counts |
|---|---|---|
| 0 | 340 V | 0.0 |
| 1 | 340 V | 0.0 |
| 2 | 340 V | 0.0 |
| 3 | 340 V | 0.0 |
| 4 | 340 V | 0.0 |
melted_data["Voltage [V]"] = melted_data["Voltage [V]"].apply(lambda x: int(x.split()[0]))
melted_data.head()
|  | Voltage [V] | Counts |
|---|---|---|
| 0 | 340 | 0.0 |
| 1 | 340 | 0.0 |
| 2 | 340 | 0.0 |
| 3 | 340 | 0.0 |
| 4 | 340 | 0.0 |
Now we can generate a plot with a confidence interval: seaborn aggregates the repeated measurements at each voltage and draws the spread around the mean. One of the requirements to pass the class is to produce exactly such a plot and indicate what the optimal voltage for the G-M tube used during the measurements should be.
fig, ax = plt.subplots()
sns.lineplot(x="Voltage [V]", y="Counts", data=melted_data, ci="sd", linewidth=3)
ax.tick_params(axis="x", rotation=45)
data_transposed = data.T
data.shape, melted_data.shape, data_transposed.shape
((100000, 28), (2800000, 2), (28, 100000))
We now have the same data organised into three different shapes:
- data - 100k rows, 28 columns; this fits in MS Excel (the sheet limit is 1,048,576 rows)
- melted_data - 2.8 million rows, 2 columns; far too many rows for MS Excel
- data_transposed - 28 rows, 100k columns; far more than the 16,384 columns MS Excel can handle
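Since the MS Excel limits are fixed at 1,048,576 rows and 16,384 columns per sheet, it is easy to check which of the three frames would still open (a small sketch using the frames defined above):
EXCEL_MAX_ROWS, EXCEL_MAX_COLS = 1_048_576, 16_384
for name, df in [("data", data), ("melted_data", melted_data), ("data_transposed", data_transposed)]:
    rows, cols = df.shape
    fits = rows <= EXCEL_MAX_ROWS and cols <= EXCEL_MAX_COLS
    print(f"{name}: {rows} x {cols} -> {'fits in' if fits else 'too big for'} one Excel sheet")
Only data would open cleanly; the other two hit one of the limits.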
data.to_csv("data_ready.csv") # 11.9MB
melted_data.to_csv("data2_ready.csv") # 54.5MB
data_transposed.to_csv("data3_ready.csv") # 16.9MB
But now, after reading this article, I would use the .parquet file format. To do that we must first change the column names of the transposed data to strings, because parquet accepts only string column names. You may also need to install pyarrow or fastparquet to use the parquet file format.
data_transposed.columns = [f"measurement_{x}" for x in range(num_measurements)]
data.to_parquet("data_ready.parquet") # 2.5MB
melted_data.to_parquet("data2_ready.parquet") # 2.8MB
data_transposed.to_parquet("data3_ready.parquet") # 87.8MB
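A quick round-trip check (assuming pyarrow is installed) to make sure nothing is lost along the way; parquet keeps the dtypes and the index, so reading the file back should give an identical frame:
check = pd.read_parquet("data_ready.parquet")
print(check.equals(data))  # should print True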
As you can see, the transposed .parquet file takes up more disk space than its .csv counterpart (parquet is a columnar format, so 100k columns holding only 28 values each carry a lot of per-column overhead), so in this case I would stick to .csv.