Sampling of multivariate random variables
The “Ozone Level Detection Data Set” was used in this work. The columns 'WSR_PK' and 'WSR_AV' (peak and average wind speed), 'T_PK' and 'T_AV' (peak and average temperature), 'KI' (K-index), 'TT' (T-Totals) and 'Precp' (precipitation) were used as features. The 'SLP' (sea level pressure) column was chosen as the target. All variables used are continuous.
First, 10 features were selected from the original dataset; 3 of them served as target variables and the remaining 7 as predictors:
- T85: continuous. Temperature at the 850 hPa level (about 1500 m height)
- U70: continuous. U wind: east-west direction wind at 700 hPa
- HT50: continuous. Geopotential height at 500 hPa; it is about the same as height at low altitude
- T70: continuous. Temperature at the 700 hPa level
- T8: continuous. 8th measured temperature of the day
- V85: continuous. V wind: north-south direction wind at 850 hPa
- KI: continuous. K-Index
- T_AV: continuous. Average temperature
- T_PK: continuous. Peak temperature
- T0: continuous. First measured temperature of the day
In the first part of the second step, the target values were sampled using the inverse transform method. Figure 1 shows histograms of the target value distributions: blue represents the original data from the dataset, and orange represents the values generated by inverse transform sampling.
Figure 1 - results of sampling target values using inverse transform method (blue - original data, orange - sampled data)
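The idea behind inverse transform sampling can be illustrated with a minimal sketch. It assumes the inverse CDF is approximated by linear interpolation between the sorted observations (an empirical quantile function); the report does not state which approximation was used. The function name `inverse_transform_sample` and the synthetic `original` column are hypothetical stand-ins, not the actual dataset values.

```python
import random


def inverse_transform_sample(data, n, rng):
    """Draw n values by inverting the empirical CDF of `data`.

    Assumption: the quantile function is approximated by linear
    interpolation between neighbouring order statistics.
    """
    xs = sorted(data)
    last = len(xs) - 1
    samples = []
    for _ in range(n):
        # u ~ U(0, 1) is mapped to a fractional position in the sorted data
        pos = rng.random() * last
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        samples.append(xs[lo] + frac * (xs[hi] - xs[lo]))
    return samples


# Hypothetical stand-in for one target column (not the real ozone data)
rng = random.Random(0)
original = [rng.gauss(10.0, 2.0) for _ in range(500)]
generated = inverse_transform_sample(original, 1000, rng)
```

Because the generated values are interpolated between observed ones, they never leave the range of the original data, which is one limitation of this empirical variant.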
Next, in the second part of the second step, the target values were sampled using the Accept-Reject method. This required choosing a function h(x) similar in shape to the target value distribution t(x) and multiplying it by a Scale Factor M so that the target distribution lies entirely below M·h(x). Figure 2 shows the target distributions and the functions chosen for the Accept-Reject method.
Figure 2 - target distributions t(x) (blue) and chosen functions h(x) (orange); ‘M’ in the title stands for the Scale Factor value
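The Accept-Reject procedure can be sketched as follows. This is an illustration under stated assumptions, not the code used in the report: the target density t(x) = 6x(1 - x) on [0, 1], the uniform proposal h(x) and the value M = 1.5 are hypothetical choices made only so that the example is self-contained.

```python
import random


def accept_reject(t, h_pdf, h_sample, M, n, rng):
    """Draw n values from density t using proposal h and scale factor M.

    Requires M * h_pdf(x) >= t(x) for every x in the support.
    """
    out = []
    while len(out) < n:
        x = h_sample(rng)      # candidate drawn from the proposal h
        u = rng.random()       # uniform height under M * h(x)
        # accept the candidate with probability t(x) / (M * h(x))
        if u * M * h_pdf(x) <= t(x):
            out.append(x)
    return out


# Hypothetical target and proposal (assumptions, not from the dataset)
def t(x):
    return 6.0 * x * (1.0 - x)   # density on [0, 1] with maximum 1.5


def h_pdf(x):
    return 1.0                   # uniform density on [0, 1]


def h_sample(rng):
    return rng.random()


M = 1.5  # smallest scale factor that keeps t(x) below M * h(x)

rng = random.Random(1)
samples = accept_reject(t, h_pdf, h_sample, M, 2000, rng)
```

The expected acceptance rate is 1/M, so a proposal that follows the target closely (a small M) wastes fewer candidates.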
At the third step, the relationship between the target and predictor values was assessed. For this, a heatmap (Figure 4) of the pairwise correlations between all considered variables was built.
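The values behind such a heatmap form a pairwise correlation matrix. The sketch below assumes the Pearson coefficient (the report does not say which correlation measure was used) and runs on synthetic columns that only borrow the names 'T_AV', 'T_PK' and 'KI'; the numbers are not the real dataset values.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Synthetic stand-ins (assumption): T_PK depends on T_AV, KI is independent
rng = random.Random(2)
t_av = [rng.gauss(20.0, 5.0) for _ in range(1000)]
columns = {
    "T_AV": t_av,
    "T_PK": [v + 3.0 + rng.gauss(0.0, 1.0) for v in t_av],
    "KI": [rng.gauss(25.0, 10.0) for _ in range(1000)],
}
corr = correlation_matrix(columns)
```

Each cell `corr[a][b]` corresponds to one cell of the heatmap; values near 1 or -1 indicate a strong linear relationship.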
Step 4. Building a Bayesian network for a chosen set of variables. Structure is based on multivariate analysis.
At the fourth step, a Bayesian network was built based on a multivariate analysis of the selected features. The data for analysis was taken from the previous step, namely from the heatmap with correlation values. Figure 5 shows a graph that reflects the structure of the constructed Bayesian network.
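One possible way to turn correlation values into a network structure is sketched below. It is an assumption, not the procedure documented in the report: edges are kept where the absolute correlation reaches a threshold, and they are oriented along a fixed variable order so that the graph stays acyclic. The function name, the threshold and the correlation values are hypothetical.

```python
def edges_from_correlations(corr, order, threshold=0.5):
    """Return directed edges (parent, child) for strongly correlated pairs.

    Every edge points from the earlier to the later variable in `order`,
    which guarantees that the resulting graph is a DAG.
    """
    edges = []
    for i, parent in enumerate(order):
        for child in order[i + 1:]:
            if abs(corr[parent][child]) >= threshold:
                edges.append((parent, child))
    return edges


# Hypothetical correlation values, not read from the report's heatmap
corr = {
    "T0": {"T0": 1.0, "T_AV": 0.9, "KI": 0.3},
    "T_AV": {"T0": 0.9, "T_AV": 1.0, "KI": 0.6},
    "KI": {"T0": 0.3, "T_AV": 0.6, "KI": 1.0},
}
order = ["T0", "T_AV", "KI"]
edges = edges_from_correlations(corr, order)
print(edges)  # [('T0', 'T_AV'), ('T_AV', 'KI')]
```

The choice of `order` encodes the assumed direction of influence, so in practice it would come from domain knowledge, for example placing predictors before targets.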
Step 5. Building a Bayesian network for the same set of variables using 2 algorithms for structural learning
The fifth step was to build a Bayesian network using algorithms for structural learning. The first network was built using the Hill Climbing algorithm with K2 as the score. Figure 6 shows a graph that reflects the structure of the constructed Bayesian network.
The second network was built using an evolutionary algorithm with mutual information (MI) as the score. Figure 7 shows a graph that reflects the structure of the constructed Bayesian network.
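To make the MI criterion more concrete, the sketch below estimates mutual information between two continuous variables after equal-width discretization. The binning scheme, the number of bins and the synthetic data are assumptions for illustration; the report does not describe how MI was computed inside the structure-learning algorithm.

```python
import math
import random
from collections import Counter


def discretize(values, bins=5):
    """Map continuous values to equal-width bin indices 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant column
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y):
    """Plug-in estimate of MI (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


# Synthetic data (assumption): b depends on a, c is independent of a
rng = random.Random(3)
a = [rng.gauss(0.0, 1.0) for _ in range(1000)]
b = [v + rng.gauss(0.0, 0.3) for v in a]
c = [rng.gauss(0.0, 1.0) for _ in range(1000)]

mi_dep = mutual_information(discretize(a), discretize(b))
mi_ind = mutual_information(discretize(a), discretize(c))
```

A score-based search favors edges between variables with high MI, so here the pair (a, b) would be a much stronger edge candidate than (a, c).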
Step 6. Analyzing quality of sampled target variables from the point of view of synthetic generation
At the sixth step, the quality of the sampled target variables was analyzed by building histograms of the original and generated data. Figures 8 and 9 show histograms of the original and synthetically generated target values.
Figure 9 - Results of sampling target values using Bayesian network made with structure-learning methods
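A visual comparison of histograms can be complemented by a numeric measure. The sketch below uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs); this is a swapped-in technique for illustration, not a metric reported in this work, and the samples are synthetic.

```python
import bisect
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(v) - F_b(v)|."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:
        # empirical CDFs evaluated at every observed point
        fa = bisect.bisect_right(a, v) / len(a)
        fb = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(fa - fb))
    return d


# Synthetic samples (assumption): `good` matches the original, `bad` is shifted
rng = random.Random(4)
original = [rng.gauss(0.0, 1.0) for _ in range(500)]
good = [rng.gauss(0.0, 1.0) for _ in range(500)]
bad = [rng.gauss(3.0, 1.0) for _ in range(500)]

d_good = ks_statistic(original, good)
d_bad = ks_statistic(original, bad)
```

A value close to 0 means the generated sample follows the original distribution closely, while a value close to 1 means the two samples barely overlap.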
In this work, several sampling methods were investigated and implemented for the ozone dataset. Comparing the histograms of the original and generated data showed that each method produced high-quality synthetic data for the dataset under consideration.