These are the practical assignments from an AI-related course at my university, taught by the department of applied mathematics.
This source code, like the rest of the materials, can be used freely and without crediting the original author -
feel free to copy from it!
Original repository
Free datasets from Gapminder and Kaggle were used; the other datasets were provided with the tasks, so their original sources are unknown.
- Practice 1 - Getting started with Python language
- Practice 2 - Various data visualization libraries
- Practice 3 - Various methods of statistical research
- Practice 4 - Methods for calculating correlation and linear regression, conducting analysis of variance
- Practice 5 - Applying machine learning algorithms to solve classification problems
- Practice 6 - Applying machine learning algorithms to solve clusterization problems
- Practice 7 - Ensemble learning methods
- Practice 8 - Learning methods based on association rules
Install Python - Seriously!
Write a program that calculates the area of a figure whose parameters are supplied as input. Supported figures: triangle, rectangle, circle. The result is a dictionary where the key is the figure name and the value is its area.
P1T2
Output
t2 {'triangle': 1.0, 'rectangle': 12.0, 'circle': 78.53981633974483}
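A minimal sketch of how such a program might look (the input format and the concrete figure parameters are assumptions, not taken from the original P1T2 source):

import math

def areas(triangle_base, triangle_height, rect_a, rect_b, circle_r):
    # Return a dictionary: figure name -> its area.
    return {
        'triangle': 0.5 * triangle_base * triangle_height,
        'rectangle': rect_a * rect_b,
        'circle': math.pi * circle_r ** 2,
    }

# Example matching the output above: triangle 1x2, rectangle 3x4, circle of radius 5.
print('t2', areas(1, 2, 3, 4, 5))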
Write a program that takes two numbers and the operation to apply to them as input. The following operations must be implemented: +, -, /, //, abs (absolute value), pow or ** (exponentiation).
P1T3
Output
t3 3.0 2.0 1.0
Write a program that reads numbers from the console (one per line) until the sum of the entered numbers equals 0, and then prints the sum of the squares of all the numbers read.
P1T4
Output
t4:
1
2
-3
t4 14
Write a program that prints a number sequence of length N in which each number is repeated as many times as its own value. A non-negative integer N is passed to the program as input. For example, if N = 7, the program should print 1 2 2 3 3 3 4. Print list elements separated by spaces with print(*list).
P1T5
Output
t5 [1, 2, 2, 3, 3, 3, 4]
Given two lists: A = [1, 2, 3, 4, 2, 1, 3, 4, 5, 6, 5, 4, 3, 2] and B = ['a', 'b', 'c', 'c', 'c', 'b', 'a', 'c', 'a', 'a', 'b', 'c', 'b', 'a']. Create a dictionary in which the keys are the contents of list B and the value for each key is the sum of the elements of list A at the positions where that letter occurs in list B. Example program result: {'a': 10, 'b': 15, 'c': 6}.
P1T6
Output
t6 {'a': 17, 'b': 11, 'c': 17}
Tasks seven to twelve were combined into one file due to their small size. 7. Load the California housing data using the sklearn library. 8. Use the info() method. 9. Find out whether there are missing values using isna().sum(). 10. Output the records where the average age of houses in the area is more than 50 years and the population is more than 2500 people, using loc[]. 11. Find the maximum and minimum median house price values. 12. Using the apply() method, print each feature's name and its mean value.
P1T7_12
Output
t7:
MedInc HouseAge AveRooms ... AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 ... 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 ... 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 ... 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 ... 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 ... 2.181467 37.85 -122.25
... ... ... ... ... ... ... ...
20635 1.5603 25.0 5.045455 ... 2.560606 39.48 -121.09
20636 2.5568 18.0 6.114035 ... 3.122807 39.49 -121.21
20637 1.7000 17.0 5.205543 ... 2.325635 39.43 -121.22
20638 1.8672 18.0 5.329513 ... 2.123209 39.43 -121.32
20639 2.3886 16.0 5.254717 ... 2.616981 39.37 -121.24
[20640 rows x 8 columns]
t8:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
dtypes: float64(8)
memory usage: 1.3 MB
t9:
MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
dtype: int64
t10:
MedInc HouseAge AveRooms ... AveOccup Latitude Longitude
460 1.4012 52.0 3.105714 ... 9.534286 37.87 -122.26
4131 3.5349 52.0 4.646119 ... 5.910959 34.13 -118.20
4440 2.6806 52.0 4.806283 ... 4.007853 34.08 -118.21
5986 1.8750 52.0 4.500000 ... 21.333333 34.10 -117.71
7369 3.1901 52.0 4.730942 ... 4.182735 33.97 -118.21
8227 2.3305 52.0 3.488860 ... 3.955439 33.78 -118.20
13034 6.1359 52.0 8.275862 ... 230.172414 38.69 -121.15
15634 1.8295 52.0 2.628169 ... 4.164789 37.80 -122.41
15652 0.9000 52.0 2.237474 ... 2.237474 37.80 -122.41
15657 2.5166 52.0 2.839075 ... 1.621520 37.79 -122.41
15659 1.7240 52.0 2.278566 ... 1.780142 37.79 -122.41
15795 2.5755 52.0 3.402576 ... 2.108696 37.77 -122.42
15868 2.8135 52.0 4.584329 ... 3.966799 37.76 -122.41
[13 rows x 8 columns]
t11:
15.0001 0.4999
t12:
MedInc 3.870671
HouseAge 28.639486
AveRooms 5.429000
AveBedrms 1.096675
Population 1425.476744
AveOccup 3.070655
Latitude 35.631861
Longitude -119.569704
dtype: float64
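A hedged sketch of how tasks 7-12 could be solved; the column names follow the sklearn dataset, but the exact printing format and thresholds are assumptions rather than the original P1T7_12 code:

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
df = data.data            # 7. the 8 feature columns as a DataFrame
prices = data.target      # median house values

print(df)                                        # 7. show the data
df.info()                                        # 8. column types and non-null counts
print(df.isna().sum())                           # 9. missing values per column
print(df.loc[(df['HouseAge'] > 50) & (df['Population'] > 2500)])  # 10. old, populous areas
print(prices.max(), prices.min())                # 11. max and min median house price
print(df.apply(lambda col: col.mean()))          # 12. mean of every feature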
Find and load multidimensional data (with a large number of features, i.e. columns) using the pandas library. Describe the data found in the report.
Display information about the data using the .info() and .head() methods. Check the data for empty values. If present, remove those rows or interpolate the missing values. If necessary, additionally preprocess the data for further work.
t1-t2
Output
t2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 income 195 non-null float64
1 life_exp 195 non-null float64
2 population 195 non-null float64
3 year 197 non-null int64
4 country 197 non-null object
5 four_regions 193 non-null object
6 six_regions 193 non-null object
7 eight_regions 193 non-null object
8 world_bank_region 193 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 14.0+ KB
t2:
income life_exp ... eight_regions world_bank_region
0 1910.0 61.0 ... asia_west South Asia
1 11100.0 78.1 ... europe_east Europe & Central Asia
2 11100.0 74.7 ... africa_north Middle East & North Africa
3 46900.0 81.9 ... europe_west Europe & Central Asia
4 7680.0 60.8 ... africa_sub_saharan Sub-Saharan Africa
[5 rows x 9 columns]
Plot a bar chart (go.Bar) using the graph_objs module from the Plotly library with the following parameters (a possible sketch follows after this list):
- On the X-axis indicate the date or name, on the Y-axis indicate the quantitative indicator.
- Make the column take on a color depending on the value of the indicator (marker=dict(color=attribute, coloraxis="coloraxis")).
- Make sure that the borders of each column are highlighted with a black line with a thickness of 2.
- Display the chart title, centered at the top, with text size 20.
- Add labels for the X and Y axes with a text size of 16. For the x-axis, rotate the labels so that they are read at an angle of 315.
- Make the text size of the axis labels equal to 14.
- Place the graph across the entire width of the work area and set the height to 700 pixels.
- Add a grid to the graph, make its color 'ivory' and thickness equal to 2. (You can do this when setting the axes using gridwidth=2, gridcolor='ivory').
- Remove extra padding along the edges.
t3
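A possible sketch of such a chart with plotly.graph_objs; df and the columns 'country' and 'income' are assumed to be the Gapminder-style frame loaded in t1-t2, and the title texts are placeholders:

import plotly.graph_objs as go

fig = go.Figure(go.Bar(
    x=df['country'], y=df['income'],
    marker=dict(color=df['income'], coloraxis='coloraxis',        # color depends on the indicator
                line=dict(color='black', width=2)),               # black borders, width 2
))
fig.update_layout(
    title=dict(text='Income by country', x=0.5, font=dict(size=20)),  # centered title, size 20
    height=700, margin=dict(l=10, r=10, t=60, b=10),                  # full width, small padding
)
fig.update_xaxes(title=dict(text='Country', font=dict(size=16)),
                 tickangle=315, tickfont=dict(size=14),
                 gridwidth=2, gridcolor='ivory')
fig.update_yaxes(title=dict(text='Income', font=dict(size=16)),
                 tickfont=dict(size=14), gridwidth=2, gridcolor='ivory')
fig.show()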
Create a pie chart (go.Pie) using the data and design style from the previous graph. Make sure the border of each slice is highlighted with a black line of thickness 2 and that the pie chart categories are readable (for example, combine some of the objects).
t4
Construct line plots using the matplotlib library: take one of the parameters and examine its relationship with several other (2 to 5) indicators. Draw a conclusion. Make a plot with lines and markers: line color 'crimson', point color 'white', point border color 'black', point border thickness 2. Add a grid to the plot, make its color 'mistyrose' and its width 2. (You can do this when configuring the axes using linewidth=2, color='mistyrose').
t5
Conclusion
The first graph shows that the higher the income, the longer the life expectancy. The second graph shows that most income is concentrated in a small minority of people, and that most people have incomes that do not exceed $20,000.
Visualize multidimensional data using t-SNE. Use the MNIST or Fashion-MNIST dataset (other ready-made datasets where the objects separate into clusters may also be used). Examine the visualization results for different perplexity values.
t6
Output
t6: label 1x1 1x2 1x3 1x4 1x5 ... 28x23 28x24 28x25 28x26 28x27 28x28
0 5 0 0 0 0 0 ... 0 0 0 0 0 0
1 0 0 0 0 0 0 ... 0 0 0 0 0 0
2 4 0 0 0 0 0 ... 0 0 0 0 0 0
3 1 0 0 0 0 0 ... 0 0 0 0 0 0
4 9 0 0 0 0 0 ... 0 0 0 0 0 0
[5 rows x 785 columns]
t6: Elapsed time: 1.5147051811218262 seconds
Conclusion
From the resulting graphs it follows that the higher the perplexity value, the larger the clusters become. Perplexity (a tunable parameter) describes the expected density around each point. Low values focus the algorithm on a few nearest neighbors, while high values take more neighbors into account and produce fewer, more densely packed groups.
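A rough sketch of the t-SNE experiment; the CSV path, the 1000-row subset and the perplexity values are assumptions that merely mirror the output above:

import time
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

df = pd.read_csv('mnist_train.csv')                 # assumed file: 'label' column + pixel columns
X, y = df.drop(columns='label')[:1000], df['label'][:1000]

start = time.time()
for i, perplexity in enumerate([5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=42).fit_transform(X)
    plt.subplot(1, 3, i + 1)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap='tab10', s=5)
    plt.title(f'perplexity={perplexity}')
print('Elapsed time:', time.time() - start, 'seconds')
plt.show()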
Visualize multidimensional data using UMAP with different n_neighbors and min_dist parameters. Calculate the running time of the algorithm using the time library and compare it with the running time of t-SNE.
t7
Output
t7: Elapsed time: 1.9380676746368408 seconds
Conclusion
Based on the obtained graphs, we can draw the following conclusion: small values of
the n_neighbors parameter mean that the algorithm is limited to a small neighborhood
around each point - it tries to capture the local structure of the data. Large values
retain the global structure but lose detail. The min_dist parameter determines the
minimum distance allowed between points in the new space: low values emphasize the
division of the data into clusters, while high values emphasize the structure of the
data as a whole. Although in theory UMAP should be faster than t-SNE, the practical
measurements showed the opposite, with a difference of less than a second. The likely
reason is that only the first 1000 data items are used, so both methods are fast;
with more data, UMAP becomes the faster method.
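A sketch of the UMAP comparison, assuming the umap-learn package and the same 1000-sample subset X, y as in the t-SNE sketch above; the parameter grid is an assumption:

import time
import umap   # pip install umap-learn

start = time.time()
for n_neighbors in (5, 15, 50):
    for min_dist in (0.1, 0.5):
        reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist, random_state=42)
        emb = reducer.fit_transform(X)
        # ...plot emb[:, 0] vs emb[:, 1] colored by y, as in the t-SNE example...
print('Elapsed time:', time.time() - start, 'seconds')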
Load data from file
Use the describe() method to view statistics on the data. Draw conclusions.
t1-t2
Output
t2:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
age bmi children charges
count 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 30.663397 1.094918 13270.422265
std 14.049960 6.098187 1.205493 12110.011237
min 18.000000 15.960000 0.000000 1121.873900
25% 27.000000 26.296250 0.000000 4740.287150
50% 39.000000 30.400000 1.000000 9382.033000
75% 51.000000 34.693750 2.000000 16639.912515
max 64.000000 53.130000 5.000000 63770.428010
--------------------------------------------------
Conclusion
The describe() output shows the count of every attribute, and also that age ranges from 18 to 64, bmi (body mass index) from 15.96 to 53.13, children (number of children) from 0 to 5, and charges (expenses) from 1121.9 to 63770. The mean of each attribute is: 39 for age, 30.7 for bmi, 1 for children, 13270 for charges. The standard deviation estimates how much the data varies around the mean: 14 for age, 6 for bmi, 1.2 for children, 12110 for charges. Moving from quartile to quartile (25%, 50%, 75%, max), the charges attribute grows the most. The count row is the same for every attribute.
Construct histograms for numerical indicators. Draw conclusions.
t3
Conclusion
The x-axis shows the values of the variable and the y-axis shows how often values fall into each interval; the number of bins was set to 15. Reading the plots from left to right and top to bottom: charges values close to zero occur most often; the most common age values are the lowest ones, while the rest are fairly evenly distributed; the bmi values follow a roughly normal distribution; the most common value of children is zero, and the larger the number of children, the less often it occurs.
Find measures of central tendency and measures of dispersion for body mass index (bmi) and charges (charges). Display results as text and in histograms (3 vertical lines). Add a legend to graphs. Draw conclusions.
t4
Output
t4:
Mean BMI = 30.663397
Mode BMI: 32.3
Median BMI = 30.400000
Mean Charges = 13270.422265
Mode Charges: 1639.5631
Median Charges = 9382.033000
Standard Deviation of charges: 12110.011236694001
Range of charges: 62648.554110000005
Quarter range of charges using numpy: 11879.80148
Quarter range of charges with scipy: 11879.80148
Standard Deviation of bmi: 6.098186911679014
Range of bmi: 37.17
Quarter range of bmi using numpy: 8.384999999999998
Quarter range of bmi with scipy: 8.384999999999998
--------------------------------------------------
Conclusion
The bmi plot shows that its values follow an approximately normal distribution. In the charges plot, values become rarer from left to right. The bmi plot also shows that the mode and median are close to each other - the mean and central values are almost the same, and the mode is the most common value. In charges, the mode (on the left) is the most frequent value, and there is much more variability: the mean differs from the central value. Measures of dispersion include the standard deviation, the range, and the interquartile range (the difference between the 1st and 3rd quartiles, where the most common values lie). The range of the charges attribute (max - min) is very large.
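A sketch of how the t4 measures might be computed (df is assumed to hold insurance.csv; the "quarter range" in the output corresponds to the interquartile range):

import numpy as np
from scipy import stats

for col in ('bmi', 'charges'):
    values = df[col]
    print('Mean', col, '=', values.mean())
    print('Mode', col, ':', values.mode()[0])
    print('Median', col, '=', values.median())
    print('Standard Deviation of', col, ':', values.std())
    print('Range of', col, ':', values.max() - values.min())
    print('Quarter range using numpy:', np.percentile(values, 75) - np.percentile(values, 25))
    print('Quarter range with scipy:', stats.iqr(values))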
Construct a box-plot for numerical indicators. The names of the graphs must correspond to the names of the features. Draw conclusions.
t5
Conclusion
In bmi and charges, the points outside the "whiskers" (which extend from the 1st and 3rd quartiles, i.e. beyond the middle 50%) are outliers - values that differ strongly from the rest and occur very rarely. The orange line inside the "box" (the cluster of central values) is the median; outliers lie too far outside to characterize the category. The plot shows the distribution of each feature: age, bmi, charges and children. Only the bmi and charges attributes contain outliers, all of them strictly above the upper whisker; the other attributes have none. In charges, a large share of the values are outliers.
Using the charges or bmi attribute, check whether the central limit theorem holds. Use different sample sizes n. Number of samples = 300. Display the result in the form of histograms. Find the standard deviation and mean of the resulting distributions. Draw conclusions.
t6
Output
t6:
Mean of n=1 12198.327287 Std of n=1 10855.966798
Mean of n=10 13357.920196 Std of n=10 4062.840514
Mean of n=50 13118.524016 Std of n=50 1833.114858
Mean of n=100 13378.232319 Std of n=100 1197.308316
Mean of n=150 13309.708941 Std of n=150 800.090288
Mean of n=200 13260.895946 Std of n=200 735.038002
Standard Deviation: 12110.011236694001
Range: 62648.554110000005
Quarter range using numpy: 11879.80148
Quarter range with scipy: 11879.80148
--------------------------------------------------
Conclusion
The dataset satisfies the central limit theorem in general and for the various sample sizes (except n = 1) in particular. The larger n (the sample size), the closer the distribution of sample means is to an ideal normal distribution. All the means are around 13 thousand. The larger the sample size, the smaller the standard deviation and the closer the histogram is to a precise normal shape.
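A sketch of the central limit theorem check; the sample sizes and the 300 samples follow the output above, while the plotting details are assumptions:

import numpy as np
import matplotlib.pyplot as plt

charges = df['charges'].values          # df is assumed to hold insurance.csv
for n in (1, 10, 50, 100, 150, 200):
    # 300 samples of size n; keep the mean of each sample
    means = [np.random.choice(charges, size=n).mean() for _ in range(300)]
    print(f'Mean of n={n}', np.mean(means), f'Std of n={n}', np.std(means))
    plt.hist(means, bins=30, alpha=0.5, label=f'n={n}')
plt.legend()
plt.show()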
Construct 95% and 99% confidence intervals for the mean expenditure and mean BMI.
t7
Output
t7:
90% confidence interval for Charges: (12725.864762144516, 13814.979768137997)
95% confidence interval for Charges: (12621.54197822916, 13919.302552053354)
90% confidence interval for BMI: (30.389176352638128, 30.93761736933497)
95% confidence interval for BMI: (30.336642971534822, 30.990150750438275)
--------------------------------------------------
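One way the t7 intervals could have been computed, using Student's t distribution from scipy (a sketch only; the original code and the exact confidence levels may differ, since the output above shows 90% and 95%):

from scipy import stats

def t_interval(values, confidence):
    # Confidence interval for the mean based on the t distribution.
    return stats.t.interval(confidence, df=len(values) - 1,
                            loc=values.mean(), scale=stats.sem(values))

for col in ('charges', 'bmi'):
    for conf in (0.95, 0.99):
        print(f'{int(conf * 100)}% confidence interval for {col}:', t_interval(df[col], conf))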
Check the distribution of the following characteristics for normality: body mass index, expenses. Formulate the null and alternative hypotheses. For each characteristic, use the KS test and q-q plot. Draw conclusions based on the obtained p-values.
t8
Output
t8:
KstestResult(statistic=0.02613962682509635, pvalue=0.31453976932347394, statistic_location=28.975, statistic_sign=1)
KstestResult(statistic=0.18846204110424236, pvalue=4.39305730768502e-42, statistic_location=13470.86, statistic_sign=1)
--------------------------------------------------
Conclusion
The task is to test null and alternative hypotheses: the null hypothesis states that there is no difference (or no significant difference), the alternative that significant differences exist. The choice is made using the p-value: if it is less than 0.05, the null hypothesis is rejected and the alternative is accepted; if it is greater, the opposite. Normality here means comparing the original sample values against the values of an ideal normal distribution. On a q-q plot, if the points follow the line exactly, the values are normally distributed; if they deviate above the line, the values are larger than expected under normality, and vice versa. Conclusions from the plots: the bmi plot shows a fairly normal distribution, while the charges plot differs strongly from a normal distribution. The x-axis shows the quantiles of the standard normal distribution, and the y-axis shows the quantiles of the sample under study. Null hypothesis: there is no difference between the ideal normal distribution and the distribution of our original values (charges, for example). Alternative hypothesis: significant differences exist between the sample values and the normal distribution. For the bmi characteristic the null hypothesis was accepted and the alternative rejected; for charges it was the other way around. The essence of the KS test is to assess the significance of the differences between two distributions, just like the previous (q-q) comparison, and here too the hypothesis is chosen based on the p-value. For the bmi feature the p-value is above 0.05, so the null hypothesis is accepted: the sample is consistent with a normal distribution. The charges attribute has a much smaller p-value, so the null hypothesis is rejected: the sample does not follow a normal distribution.
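A sketch of the normality check with a KS test against a fitted normal distribution and a q-q plot; scipy's probplot is used here for the q-q plot, although the original t8 code may have used a different helper:

import matplotlib.pyplot as plt
from scipy import stats

for col in ('bmi', 'charges'):
    values = df[col]
    # H0: the sample comes from N(mean, std); reject if pvalue < 0.05
    print(stats.kstest(values, 'norm', args=(values.mean(), values.std())))
    stats.probplot(values, dist='norm', plot=plt)   # q-q plot against the normal distribution
    plt.title(f'Q-Q plot for {col}')
    plt.show()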
Load data from file
t9
Output
t9:
dateRep day month year cases deaths countriesAndTerritories \
0 14/12/2020 14 12 2020 746 6 Afghanistan
1 13/12/2020 13 12 2020 298 9 Afghanistan
2 12/12/2020 12 12 2020 113 11 Afghanistan
3 12/12/2020 12 12 2020 113 11 Afghanistan
4 11/12/2020 11 12 2020 63 10 Afghanistan
... ... ... ... ... ... ... ...
61899 25/03/2020 25 3 2020 0 0 Zimbabwe
61900 24/03/2020 24 3 2020 0 1 Zimbabwe
61901 23/03/2020 23 3 2020 0 0 Zimbabwe
61902 22/03/2020 22 3 2020 1 0 Zimbabwe
61903 21/03/2020 21 3 2020 1 0 Zimbabwe
geoId countryterritoryCode popData2019 continentExp \
0 AF AFG 38041757.0 Asia
1 AF AFG 38041757.0 Asia
2 AF AFG 38041757.0 Asia
3 AF AFG 38041757.0 Asia
4 AF AFG 38041757.0 Asia
... ... ... ... ...
61899 ZW ZWE 14645473.0 Africa
61900 ZW ZWE 14645473.0 Africa
61901 ZW ZWE 14645473.0 Africa
61902 ZW ZWE 14645473.0 Africa
61903 ZW ZWE 14645473.0 Africa
Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
0 9.013779
1 7.052776
2 6.868768
3 6.868768
4 7.134266
... ...
61899 NaN
61900 NaN
61901 NaN
61902 NaN
61903 NaN
[61904 rows x 12 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61904 entries, 0 to 61903
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dateRep 61904 non-null object
1 day 61904 non-null int64
2 month 61904 non-null int64
3 year 61904 non-null int64
4 cases 61904 non-null int64
5 deaths 61904 non-null int64
6 countriesAndTerritories 61904 non-null object
7 geoId 61629 non-null object
8 countryterritoryCode 61781 non-null object
9 popData2019 61781 non-null float64
10 continentExp 61904 non-null object
11 Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
59025 non-null float64
dtypes: float64(2), int64(5), object(5)
memory usage: 5.7+ MB
--------------------------------------------------
Check the data for missing values. Display the number of missing values as a percentage. Remove the two features that have the most missing values. For the remaining features, handle the gaps: fill categorical features with a default value (for example, 'other') and numeric features with the median. Show that there are no more gaps in the data.
t10
Output
t10:
dateRep : 0.0%
day : 0.0%
month : 0.0%
year : 0.0%
cases : 0.0%
deaths : 0.0%
countriesAndTerritories : 0.0%
geoId : 0.4%
countryterritoryCode : 0.2%
popData2019 : 0.2%
continentExp : 0.0%
Cumulative_number_for_14_days_of_COVID-19_cases_per_100000 : 4.7%
dateRep day month year cases deaths countriesAndTerritories \
0 14/12/2020 14 12 2020 746 6 Afghanistan
1 13/12/2020 13 12 2020 298 9 Afghanistan
2 12/12/2020 12 12 2020 113 11 Afghanistan
3 12/12/2020 12 12 2020 113 11 Afghanistan
4 11/12/2020 11 12 2020 63 10 Afghanistan
... ... ... ... ... ... ... ...
61899 25/03/2020 25 3 2020 0 0 Zimbabwe
61900 24/03/2020 24 3 2020 0 1 Zimbabwe
61901 23/03/2020 23 3 2020 0 0 Zimbabwe
61902 22/03/2020 22 3 2020 1 0 Zimbabwe
61903 21/03/2020 21 3 2020 1 0 Zimbabwe
countryterritoryCode popData2019 continentExp
0 AFG 38041757.0 Asia
1 AFG 38041757.0 Asia
2 AFG 38041757.0 Asia
3 AFG 38041757.0 Asia
4 AFG 38041757.0 Asia
... ... ... ...
61899 ZWE 14645473.0 Africa
61900 ZWE 14645473.0 Africa
61901 ZWE 14645473.0 Africa
61902 ZWE 14645473.0 Africa
61903 ZWE 14645473.0 Africa
[61904 rows x 10 columns]
dateRep : 0.0%
day : 0.0%
month : 0.0%
year : 0.0%
cases : 0.0%
deaths : 0.0%
countriesAndTerritories : 0.0%
countryterritoryCode : 0.0%
popData2019 : 0.0%
continentExp : 0.0%
--------------------------------------------------
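A sketch of the t10 gap handling; the two dropped columns follow the percentages above, and the 'other' default is the one suggested in the task:

# Percentage of missing values per column
for col in df.columns:
    print(col, ':', round(df[col].isna().mean() * 100, 1), '%')

# Drop the two columns with the most gaps (per the output above)
df = df.drop(columns=['geoId',
                      'Cumulative_number_for_14_days_of_COVID-19_cases_per_100000'])

# Fill the rest: default value for the categorical column, median for the numeric one
df['countryterritoryCode'] = df['countryterritoryCode'].fillna('other')
df['popData2019'] = df['popData2019'].fillna(df['popData2019'].median())

print(df.isna().sum())   # no gaps remain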
View statistics on data using describe(). Draw conclusions about which features contain outliers. See for which countries the number of deaths per day exceeded 3000 and how many such days there were.
t11
Output
t11:
day month year cases deaths \
count 61904.000000 61904.000000 61904.000000 61904.000000 61904.000000
mean 15.629232 7.067104 2019.998918 1155.079026 26.053987
std 8.841624 2.954816 0.032881 6779.010824 131.222948
min 1.000000 1.000000 2019.000000 -8261.000000 -1918.000000
25% 8.000000 5.000000 2020.000000 0.000000 0.000000
50% 15.000000 7.000000 2020.000000 15.000000 0.000000
75% 23.000000 10.000000 2020.000000 273.000000 4.000000
max 31.000000 12.000000 2020.000000 234633.000000 4928.000000
popData2019
count 6.190400e+04
mean 4.091909e+07
std 1.529798e+08
min 8.150000e+02
25% 1.324820e+06
50% 7.169456e+06
75% 2.851583e+07
max 1.433784e+09
0 False
1 False
2 False
3 False
4 False
...
61899 False
61900 False
61901 False
61902 False
61903 False
Name: deaths, Length: 61904, dtype: bool
There are 11 days where deaths >= 3000
dateRep day month year cases deaths countriesAndTerritories \
2118 02/10/2020 2 10 2020 14001 3351 Argentina
16908 07/09/2020 7 9 2020 -8261 3800 Ecuador
37038 09/10/2020 9 10 2020 4936 3013 Mexico
44888 14/08/2020 14 8 2020 9441 3935 Peru
44909 24/07/2020 24 7 2020 4546 3887 Peru
59007 12/12/2020 12 12 2020 234633 3343 United_States_of_America
59009 10/12/2020 10 12 2020 220025 3124 United_States_of_America
59016 03/12/2020 3 12 2020 203311 3190 United_States_of_America
59239 24/04/2020 24 4 2020 26543 3179 United_States_of_America
59245 18/04/2020 18 4 2020 30833 3770 United_States_of_America
59247 16/04/2020 16 4 2020 30148 4928 United_States_of_America
countryterritoryCode popData2019 continentExp
2118 ARG 44780675.0 America
16908 ECU 17373657.0 America
37038 MEX 127575529.0 America
44888 PER 32510462.0 America
44909 PER 32510462.0 America
59007 USA 329064917.0 America
59009 USA 329064917.0 America
59016 USA 329064917.0 America
59239 USA 329064917.0 America
59245 USA 329064917.0 America
59247 USA 329064917.0 America
--------------------------------------------------
Conclusion
Outliers are present in the cases and deaths features, since their minima are negative (visible from describe()) - values below the meaningful minimum of zero. The overall plot also shows outliers in the year and popData2019 features, with the latter containing more of them than the others. In total, 11 days were found on which the number of deaths exceeded 3000. The countries where such days were recorded are Argentina, Ecuador, Mexico, Peru and the United States of America.
Find data duplication. Remove duplicates.
t12
Output
t12:
dateRep day month year cases deaths countriesAndTerritories \
3 12/12/2020 12 12 2020 113 11 Afghanistan
218 12/05/2020 12 5 2020 285 2 Afghanistan
48010 29/05/2020 29 5 2020 0 0 Saint_Lucia
48073 28/03/2020 28 3 2020 0 0 Saint_Lucia
countryterritoryCode popData2019 continentExp
3 AFG 38041757.0 Asia
218 AFG 38041757.0 Asia
48010 LCA 182795.0 America
48073 LCA 182795.0 America
dateRep day month year cases deaths countriesAndTerritories \
0 14/12/2020 14 12 2020 746 6 Afghanistan
1 13/12/2020 13 12 2020 298 9 Afghanistan
2 12/12/2020 12 12 2020 113 11 Afghanistan
4 11/12/2020 11 12 2020 63 10 Afghanistan
5 10/12/2020 10 12 2020 202 16 Afghanistan
... ... ... ... ... ... ... ...
61899 25/03/2020 25 3 2020 0 0 Zimbabwe
61900 24/03/2020 24 3 2020 0 1 Zimbabwe
61901 23/03/2020 23 3 2020 0 0 Zimbabwe
61902 22/03/2020 22 3 2020 1 0 Zimbabwe
61903 21/03/2020 21 3 2020 1 0 Zimbabwe
countryterritoryCode popData2019 continentExp
0 AFG 38041757.0 Asia
1 AFG 38041757.0 Asia
2 AFG 38041757.0 Asia
4 AFG 38041757.0 Asia
5 AFG 38041757.0 Asia
... ... ... ...
61899 ZWE 14645473.0 Africa
61900 ZWE 14645473.0 Africa
61901 ZWE 14645473.0 Africa
61902 ZWE 14645473.0 Africa
61903 ZWE 14645473.0 Africa
[61900 rows x 10 columns]
Empty DataFrame
Columns: [dateRep, day, month, year, cases, deaths, countriesAndTerritories, countryterritoryCode, popData2019, continentExp]
Index: []
--------------------------------------------------
Load data from the file 'bmi.csv'. Take two samples from it: one sample is the body mass index of people from the northwest region, the second is the body mass index of people from the southwest region. Compare the means of these samples using Student's t-test. First check the samples for normality (Shapiro-Wilk test) and homogeneity of variance (Bartlett's test).
t13
Output
t13:
bmi region
0 22.705 northwest
1 28.880 northwest
2 27.740 northwest
3 25.840 northwest
4 28.025 northwest
bmi region
0 22.705 northwest
1 28.880 northwest
2 27.740 northwest
3 25.840 northwest
4 28.025 northwest
.. ... ...
320 26.315 northwest
321 31.065 northwest
322 25.935 northwest
323 30.970 northwest
324 29.070 northwest
[325 rows x 2 columns]
bmi region
325 27.9 southwest
326 34.4 southwest
327 24.6 southwest
328 40.3 southwest
329 35.3 southwest
.. ... ...
645 20.6 southwest
646 38.6 southwest
647 33.4 southwest
648 44.7 southwest
649 25.8 southwest
[325 rows x 2 columns]
The variance of both data groups: 26.305165492071005 32.29731162130177
TtestResult(statistic=-3.2844171500398582, pvalue=0.001076958496307695, df=648.0)
(-3.2844171500398667, 0.0010769584963076643, 648.0)
ShapiroResult(statistic=0.9954646825790405, pvalue=0.4655335247516632)
ShapiroResult(statistic=0.9949268698692322, pvalue=0.3629520535469055)
BartlettResult(statistic=3.4000745256459286, pvalue=0.06519347353581818)
--------------------------------------------------
Conclusion
Null hypothesis: there is no significant difference between the mean bmi values of the northwest and southwest regions; alternative hypothesis: there is a difference. Since 0.001 (the t-test p-value) is less than 0.05, the null hypothesis must be rejected - there is a significant difference between the mean bmi values of the two regions. Normality: in both Shapiro-Wilk tests the p-value is above 0.05, so the null hypothesis is accepted - bmi in both regions is normally distributed. Homogeneity means testing the equality of variances in the two samples. Null hypothesis: the samples come from populations with the same variance; the alternative hypothesis is the opposite. Since 0.065 (Bartlett) > 0.05, we accept the null hypothesis - the variances of the samples are the same, i.e. there is no significant difference between the variances of bmi in the two regions.
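A sketch of the t13 comparison; the column names 'bmi' and 'region' follow the printed data:

import pandas as pd
from scipy import stats

data = pd.read_csv('bmi.csv')
nw = data.loc[data['region'] == 'northwest', 'bmi']
sw = data.loc[data['region'] == 'southwest', 'bmi']

print('The variance of both data groups:', nw.var(), sw.var())
print(stats.shapiro(nw))        # normality of each sample
print(stats.shapiro(sw))
print(stats.bartlett(nw, sw))   # homogeneity of variance
print(stats.ttest_ind(nw, sw))  # compare the means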
A die was rolled 600 times and the following results were obtained (see Listing 13). Use the chi-square test to check whether the resulting distribution is uniform. Use the scipy.stats.chisquare() function.
t14
Output
t14:
N Observed Expected
0 1 97 100
1 2 98 100
2 3 109 100
3 4 95 100
4 5 97 100
5 6 104 100
Power_divergenceResult(statistic=1.44, pvalue=0.9198882077437889)
--------------------------------------------------
Conclusion
The null hypothesis is that the distribution of roll outcomes is uniform. Since 0.92 > 0.05, we accept the null hypothesis - the distribution is uniform.
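A sketch of the uniformity check with scipy.stats.chisquare; the observed counts are copied from the listing, the expected counts are 600 / 6 = 100:

from scipy import stats

observed = [97, 98, 109, 95, 97, 104]   # die rolls, faces 1..6
expected = [100] * 6                    # uniform expectation for 600 rolls
print(stats.chisquare(observed, f_exp=expected))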
Use the Chi-square test to test whether the variables are dependent. Create a dataframe using the following code (see Listing 14). Use the scipy.stats.chi2_contingency() function. Does marital status affect employment?
t15
Output
t15:
Married Civil marriage Isn't in relationships
Full working day 89 80 35
Part-time employment 17 22 44
Temporary doesn't work 11 20 35
On the household 43 35 6
Retired 22 6 8
1.7291616900960234e-21
--------------------------------------------------
Conclusion
The null hypothesis is that marital status does not affect employment; the alternative is that it does (there is a significant relationship). Since the p-value is very small (< 0.05), we reject the null hypothesis and accept the alternative - there is a relationship, i.e. marital status affects employment.
Practice 4 - Methods for calculating correlation and linear regression, conducting analysis of variance
Define two vectors representing the number of cars parked over 5 working days at a business center: in the street parking lot and in the underground garage.
- Find and interpret the correlation between the variables “Street” and “Garage” (calculate the Pearson correlation).
- Construct a scatter plot for the above variables.
t1
Output
t1:
[[ 1. -1.]
[-1. 1.]]
-0.9999999999999998
--------------------------------------------------
Conclusion
From the correlation matrix and the separately derived correlation coefficient, it is clear that there is a
relationship - the correlation is almost -1, which means there is a strong negative correlation.
Find and load data. Output, preprocess and describe the features.
- Construct a correlation matrix for one target variable. Determine the most correlated variable and continue working with it in the next step.
- Implement regression manually; display the slope, intercept and MSE.
- Visualize the regression on a graph.
t2
Output
t2:
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 Adelie Torgersen 39.1 18.7 181.0
1 Adelie Torgersen 39.5 17.4 186.0
2 Adelie Torgersen 40.3 18.0 195.0
3 Adelie Torgersen NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0
.. ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0
341 Gentoo Biscoe 50.4 15.7 222.0
342 Gentoo Biscoe 45.2 14.8 212.0
343 Gentoo Biscoe 49.9 16.1 213.0
body_mass_g sex
0 3750.0 MALE
1 3800.0 FEMALE
2 3250.0 FEMALE
3 NaN NaN
4 3450.0 FEMALE
.. ... ...
339 NaN NaN
340 4850.0 FEMALE
341 5750.0 MALE
342 5200.0 FEMALE
343 5400.0 MALE
[344 rows x 7 columns]
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 0 0 39.1 18.7 181.0
1 0 0 39.5 17.4 186.0
2 0 0 40.3 18.0 195.0
3 0 0 NaN NaN NaN
4 0 0 36.7 19.3 193.0
.. ... ... ... ... ...
339 2 1 NaN NaN NaN
340 2 1 46.8 14.3 215.0
341 2 1 50.4 15.7 222.0
342 2 1 45.2 14.8 212.0
343 2 1 49.9 16.1 213.0
body_mass_g sex
0 3750.0 0
1 3800.0 1
2 3250.0 1
3 NaN -1
4 3450.0 1
.. ... ...
339 NaN -1
340 4850.0 1
341 5750.0 0
342 5200.0 1
343 5400.0 0
[344 rows x 7 columns]
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 0.0 0.0 0.254545 0.666667 0.152542
1 0.0 0.0 0.269091 0.511905 0.237288
2 0.0 0.0 0.298182 0.583333 0.389831
3 0.0 0.0 0.167273 0.738095 0.355932
4 0.0 0.0 0.261818 0.892857 0.305085
.. ... ... ... ... ...
337 1.0 0.5 0.549091 0.071429 0.711864
338 1.0 0.5 0.534545 0.142857 0.728814
339 1.0 0.5 0.665455 0.309524 0.847458
340 1.0 0.5 0.476364 0.202381 0.677966
341 1.0 0.5 0.647273 0.357143 0.694915
body_mass_g sex
0 0.291667 0.333333
1 0.305556 0.666667
2 0.152778 0.666667
3 0.208333 0.666667
4 0.263889 0.333333
.. ... ...
337 0.618056 0.666667
338 0.597222 0.666667
339 0.847222 0.333333
340 0.694444 0.666667
341 0.750000 0.333333
[342 rows x 7 columns]
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 0.0 0.0 0.254545 0.666667 0.152542
1 0.0 0.0 0.269091 0.511905 0.237288
2 0.0 0.0 0.298182 0.583333 0.389831
3 0.0 0.0 0.167273 0.738095 0.355932
4 0.0 0.0 0.261818 0.892857 0.305085
.. ... ... ... ... ...
337 1.0 0.5 0.549091 0.071429 0.711864
338 1.0 0.5 0.534545 0.142857 0.728814
339 1.0 0.5 0.665455 0.309524 0.847458
340 1.0 0.5 0.476364 0.202381 0.677966
341 1.0 0.5 0.647273 0.357143 0.694915
body_mass_g sex
0 0.291667 0.333333
1 0.305556 0.666667
2 0.152778 0.666667
3 0.208333 0.666667
4 0.263889 0.333333
.. ... ...
337 0.618056 0.666667
338 0.597222 0.666667
339 0.847222 0.333333
340 0.694444 0.666667
341 0.750000 0.333333
[342 rows x 7 columns]
[[1. 0.75049112]
[0.75049112 1. ]]
0.7504911189081507
species island culmen_length_mm culmen_depth_mm \
species 1.000 0.005 0.731 -0.744
island 0.005 1.000 0.223 0.180
culmen_length_mm 0.731 0.223 1.000 -0.235
culmen_depth_mm -0.744 0.180 -0.235 1.000
flipper_length_mm 0.854 -0.145 0.656 -0.584
body_mass_g 0.750 -0.189 0.595 -0.472
sex 0.012 0.047 -0.269 -0.323
flipper_length_mm body_mass_g sex
species 0.854 0.750 0.012
island -0.145 -0.189 0.047
culmen_length_mm 0.656 0.595 -0.269
culmen_depth_mm -0.584 -0.472 -0.323
flipper_length_mm 1.000 0.871 -0.197
body_mass_g 0.871 1.000 -0.347
sex -0.197 -0.347 1.000
[0.93208977] 0.10126324431123246
0.013649956706809902
--------------------------------------------------
Conclusion
- A pair of two variables was selected, a correlation matrix was calculated for them, and a coefficient of 0.7504911189081507 was obtained - a fairly strong positive correlation. Then, from the full correlation table, the pair of variables with the highest correlation was determined: body_mass_g and flipper_length_mm with a coefficient of 0.871.
- The regression (best-fit line - a line that captures the bulk of the points and shows how the data are related) was calculated. The slope (model1.coef_) equals 0.93208977 and the intercept (model1.intercept_) equals 0.10126324431123246; this slope corresponds to an angle of roughly 43 degrees. Finally, the MSE (mean squared error, a metric that shows how accurate the predictions are and how large the deviations from the actual values are) was found to be 0.013649956706809902; since values closer to zero indicate a better model, at this value (0.01) the constructed model is nearly ideal.
- A regression plot was constructed, which shows a positive relationship - the larger one variable, the larger the other.
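A sketch of the "manual" regression between flipper_length_mm and body_mass_g using the least-squares formulas (the original t2 code apparently also used a sklearn model, judging by model1.coef_ in the conclusion; df is assumed to hold the normalized penguin data):

import numpy as np

x = df['flipper_length_mm'].values
y = df['body_mass_g'].values

# Least-squares slope and intercept: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

predictions = slope * x + intercept
mse = np.mean((y - predictions) ** 2)    # mean squared error
print(slope, intercept, mse)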
Load data: 'insurance.csv'. Output and preprocess it. List the unique regions.
- Perform a one-way ANOVA test to test the effect of region on body mass index (BMI) using the first method through the Scipy library.
- Perform a one-way ANOVA test to test the effect of region on body mass index (BMI) using the second method, using the anova_lm() function from the statsmodels library.
- Using Student's t-test, go through all pairs of regions. Apply the Bonferroni correction. Draw conclusions.
- Perform Tukey's post-hoc tests and plot the graph.
- Run a two-way ANOVA test to test the effect of region and gender on body mass index (BMI) using the anova_lm() function from the statsmodels library.
- Perform Tukey's post-hoc tests and plot the graph.
t3
Output
t3:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
... ... ... ... ... ... ... ...
1333 50 male 30.970 3 no northwest 10600.54830
1334 18 female 31.920 0 no northeast 2205.98080
1335 18 female 36.850 0 no southeast 1629.83350
1336 21 female 25.800 0 no southwest 2007.94500
1337 61 female 29.070 0 yes northwest 29141.36030
[1338 rows x 7 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
age : 0.0%
sex : 0.0%
bmi : 0.0%
children : 0.0%
smoker : 0.0%
region : 0.0%
charges : 0.0%
['southwest' 'southeast' 'northwest' 'northeast']
F_onewayResult(statistic=39.49505720170283, pvalue=1.881838913929143e-24)
sum_sq df F PR(>F)
region 4055.880631 3.0 39.495057 1.881839e-24
Residual 45664.319755 1334.0 NaN NaN
northeast northwest
TtestResult(statistic=-0.060307727183293185, pvalue=0.951929170821864, df=647.0)
5.711575024931184
northeast southeast
TtestResult(statistic=-8.790905562598699, pvalue=1.186014937424813e-17, df=686.0)
7.116089624548878e-17
northeast southwest
TtestResult(statistic=-3.1169000930045923, pvalue=0.0019086161671573072, df=647.0)
0.011451697002943843
northwest southeast
TtestResult(statistic=-9.25649013552548, pvalue=2.643571405230106e-19, df=687.0)
1.5861428431380637e-18
northwest southwest
TtestResult(statistic=-3.2844171500398582, pvalue=0.001076958496307695, df=648.0)
0.006461750977846171
southeast southwest
TtestResult(statistic=5.908373821545118, pvalue=5.4374009639680636e-09, df=687.0)
3.2624405783808385e-08
30.66339686098655
30.4
Multiple Comparison of Means - Tukey HSD, FWER=0.05
==========================================================
group1 group2 meandiff p-adj lower upper reject
----------------------------------------------------------
northeast northwest 0.0263 0.9999 -1.1552 1.2078 False
northeast southeast 4.1825 0.0 3.033 5.332 True
northeast southwest 1.4231 0.0107 0.2416 2.6046 True
northwest southeast 4.1562 0.0 3.0077 5.3047 True
northwest southwest 1.3968 0.0127 0.2162 2.5774 True
southeast southwest -2.7594 0.0 -3.9079 -1.6108 True
----------------------------------------------------------
df sum_sq mean_sq F PR(>F)
region 3.0 4055.880631 1351.960210 39.602259 1.636858e-24
sex 1.0 86.007035 86.007035 2.519359 1.126940e-01
region:sex 3.0 174.157808 58.052603 1.700504 1.650655e-01
Residual 1330.0 45404.154911 34.138462 NaN NaN
Multiple Comparison of Means - Tukey HSD, FWER=0.05
======================================================================
group1 group2 meandiff p-adj lower upper reject
----------------------------------------------------------------------
northeastfemale northeastmale -0.2998 0.9998 -2.2706 1.6711 False
northeastfemale northwestfemale -0.0464 1.0 -2.0142 1.9215 False
northeastfemale northwestmale -0.2042 1.0 -2.1811 1.7728 False
northeastfemale southeastfemale 3.3469 0.0 1.41 5.2839 True
northeastfemale southeastmale 4.6657 0.0 2.7634 6.568 True
northeastfemale southwestfemale 0.7362 0.9497 -1.2377 2.71 False
northeastfemale southwestmale 1.8051 0.1007 -0.1657 3.776 False
northeastmale northwestfemale 0.2534 0.9999 -1.7083 2.2152 False
northeastmale northwestmale 0.0956 1.0 -1.8752 2.0665 False
northeastmale southeastfemale 3.6467 0.0 1.7159 5.5775 True
northeastmale southeastmale 4.9655 0.0 3.0695 6.8614 True
northeastmale southwestfemale 1.036 0.7515 -0.9318 3.0037 False
northeastmale southwestmale 2.1049 0.0258 0.1402 4.0697 True
northwestfemale northwestmale -0.1578 1.0 -2.1257 1.81 False
northwestfemale southeastfemale 3.3933 0.0 1.4656 5.321 True
northwestfemale southeastmale 4.712 0.0 2.8192 6.6049 True
northwestfemale southwestfemale 0.7825 0.9294 -1.1822 2.7473 False
northwestfemale southwestmale 1.8515 0.0806 -0.1103 3.8132 False
northwestmale southeastfemale 3.5511 0.0 1.6141 5.4881 True
northwestmale southeastmale 4.8698 0.0 2.9676 6.7721 True
northwestmale southwestfemale 0.9403 0.8354 -1.0335 2.9142 False
northwestmale southwestmale 2.0093 0.042 0.0385 3.9801 True
southeastfemale southeastmale 1.3187 0.3823 -0.542 3.1795 False
southeastfemale southwestfemale -2.6108 0.0011 -4.5446 -0.6769 True
southeastfemale southwestmale -1.5418 0.2304 -3.4726 0.389 False
southeastmale southwestfemale -3.9295 0.0 -5.8286 -2.0304 True
southeastmale southwestmale -2.8606 0.0001 -4.7565 -0.9646 True
southwestfemale southwestmale 1.069 0.7201 -0.8988 3.0367 False
----------------------------------------------------------------------
Conclusion
There are 4 unique regions in total: southwest, southeast, northwest, northeast.
- The result of the one-way ANOVA test (analysis of variance, a statistical procedure for comparing the mean values of a variable across two or more independent groups) shows that the p-value is 1.881838913929143e-24, which is less than 0.05, so the region feature has a statistically significant effect on the bmi (body mass index) feature.
- The result of the second test (PR(>F) is the p-value) coincides with the result of the previous one. In this test, unlike the previous one, the data does not need to be split into the 4 regions beforehand.
- The Student's t-test (which compares pairs of groups and shows whether there is an effect) and the Bonferroni correction (the simplest and best-known way to control the family-wise error rate) were computed for all pairs of regions. Only the northeast-northwest pair had a p-value greater than 0.05, so for that pair the null hypothesis of no significant difference is accepted. The Bonferroni-corrected p-value is the Student's t-test p-value multiplied by the number of pairs.
- Tukey's post-hoc tests (the most popular post-hoc tests; they check which specific differences make the overall effect significant) determined that for all pairs except the first (northeast-northwest) the null hypothesis should be rejected (the reject column) and the alternative accepted - there is a significant effect. The graph also shows that southwest and southeast have no overlap (there is a significant difference), while northwest and northeast (p-adj of 0.9999 > 0.05) overlap (shown by the horizontal lines: if one line lies directly under another, they overlap) - there is no difference between them. The larger the overlap, the smaller the difference. The red line on the graph is the mean bmi value. meandiff is the difference between the mean values of each pair; the black dots are the means; the horizontal black lines have the same length because the group sizes are about the same; lower and upper are the bounds of the interval; p-adj is the adjusted (corrected) p-value.
- The two-way test shows that only the region factor has a significant effect on bmi, since its p-value (PR(>F)) of 1.636858e-24 is less than 0.05; the sex factor and the interaction between the two factors (region and sex) do not significantly affect bmi. A two-way ANOVA (one more factor is added) estimates the influence of factors a and b as well as their interaction, checking that the result is not simply due to a and b influencing each other.
- Finally, a combination of the region and sex features is created and its effect on bmi is examined. Tukey's test finds all the unique pairs of groups and compares them. On the graph, the last 4 groups differ little from each other, while the first 4 differ significantly both among themselves and from the rest.
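A sketch of the main t3 steps with scipy and statsmodels; the formula syntax and imports are standard, but the exact original code may differ:

from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One-way ANOVA, method 1: scipy
groups = [df.loc[df['region'] == r, 'bmi'] for r in df['region'].unique()]
print(stats.f_oneway(*groups))

# One-way ANOVA, method 2: statsmodels
print(anova_lm(ols('bmi ~ region', data=df).fit(), typ=2))

# Tukey's post-hoc test for region
print(pairwise_tukeyhsd(endog=df['bmi'], groups=df['region'], alpha=0.05))

# Two-way ANOVA: region, sex and their interaction
print(anova_lm(ols('bmi ~ region + sex + region:sex', data=df).fit()))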
Find data for classification. Pre-process the data if necessary.
t1
Output
t1:
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 Adelie Torgersen 39.1 18.7 181.0
1 Adelie Torgersen 39.5 17.4 186.0
2 Adelie Torgersen 40.3 18.0 195.0
3 Adelie Torgersen NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0
.. ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0
341 Gentoo Biscoe 50.4 15.7 222.0
342 Gentoo Biscoe 45.2 14.8 212.0
343 Gentoo Biscoe 49.9 16.1 213.0
body_mass_g sex
0 3750.0 MALE
1 3800.0 FEMALE
2 3250.0 FEMALE
3 NaN NaN
4 3450.0 FEMALE
.. ... ...
339 NaN NaN
340 4850.0 FEMALE
341 5750.0 MALE
342 5200.0 FEMALE
343 5400.0 MALE
[344 rows x 7 columns]
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 Adelie Torgersen 39.1 18.7 181.0
1 Adelie Torgersen 39.5 17.4 186.0
2 Adelie Torgersen 40.3 18.0 195.0
4 Adelie Torgersen 36.7 19.3 193.0
5 Adelie Torgersen 39.3 20.6 190.0
.. ... ... ... ... ...
338 Gentoo Biscoe 47.2 13.7 214.0
340 Gentoo Biscoe 46.8 14.3 215.0
341 Gentoo Biscoe 50.4 15.7 222.0
342 Gentoo Biscoe 45.2 14.8 212.0
343 Gentoo Biscoe 49.9 16.1 213.0
body_mass_g sex
0 3750.0 MALE
1 3800.0 FEMALE
2 3250.0 FEMALE
4 3450.0 FEMALE
5 3650.0 MALE
.. ... ...
338 4925.0 FEMALE
340 4850.0 FEMALE
341 5750.0 MALE
342 5200.0 FEMALE
343 5400.0 MALE
[334 rows x 7 columns]
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 0 0 39.1 18.7 181.0
1 0 0 39.5 17.4 186.0
2 0 0 40.3 18.0 195.0
4 0 0 36.7 19.3 193.0
5 0 0 39.3 20.6 190.0
.. ... ... ... ... ...
338 2 1 47.2 13.7 214.0
340 2 1 46.8 14.3 215.0
341 2 1 50.4 15.7 222.0
342 2 1 45.2 14.8 212.0
343 2 1 49.9 16.1 213.0
body_mass_g sex
0 3750.0 0
1 3800.0 1
2 3250.0 1
4 3450.0 1
5 3650.0 0
.. ... ...
338 4925.0 1
340 4850.0 1
341 5750.0 0
342 5200.0 1
343 5400.0 0
[334 rows x 7 columns]
--------------------------------------------------
Draw a histogram that shows the balance of classes. Draw conclusions.
t2
Output
t2:
species
0 146
1 146
2 146
Name: count, dtype: int64
--------------------------------------------------
Conclusion
The data is divided into 3 classes (by the number of unique penguin species). Originally the class with the most elements was class zero, class one had the fewest, and class two was in between. The classes were unbalanced; to fix this, oversampling was used - similar samples were added to the first and second classes to equalize their sizes and thus balance all the classes. There are also two other approaches: undersampling, which reduces every class to the size of the smallest one, and synthetic data generation, which adds artificially generated samples (for example, produced by neural networks).
Divide the sample into training and test sets: the training set is used to train the model, the test set to check its quality.
t3
Output
t3:
Size of Predictor Train set (350, 6)
Size of Predictor Test set (88, 6)
Size of Target Train set (350,)
Size of Target Test set (88,)
--------------------------------------------------
Conclusion
X_train.shape is the size of the training-set features, X_test.shape the size of the test-set features, y_train.shape the size of the training-set target, and y_test.shape the size of the test-set target. Predictors are the columns/features; the target is what the model has to learn to predict (the penguin species, removed from X). X is the data without the species column. The test sample (20% of the data) differs from the training sample (80% of the data) only in size. The model learns relationships from the training set, and the test set is used to check the relationships it has found.
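A sketch of the split; the 80/20 ratio follows the conclusion above, and df is assumed to be the resampled penguin DataFrame:

from sklearn.model_selection import train_test_split

X = df.drop(columns='species')   # predictors: everything except the target
y = df['species']                # target: the penguin species

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Size of Predictor Train set', X_train.shape)
print('Size of Predictor Test set', X_test.shape)
print('Size of Target Train set', y_train.shape)
print('Size of Target Test set', y_test.shape)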
Apply classification algorithms: logistic regression, SVM, KNN. Construct an error matrix based on the results of the models (use confusion_matrix from sklearn.metrics).
t4
Output
t4:
Prediction values:
[2 1 0 2 1 0 2 1 2 1 1 0 1 2 0 1 2 1 0 1 2 1 0 1 2 0 1 0 2 0 0 1 2 0 1 2 1
0 2 0 0 1 0 2 2 0 2 0 2 1 0 0 2 2 1 1 2 1 1 2 1 1 1 0 2 0 0 1 0 1 1 1 1 2
2 0 1 2 1 2 2 0 1 1 0 1 0 1]
Target values:
[2 1 0 2 1 0 2 1 2 1 1 0 1 2 0 1 2 1 0 1 2 1 0 1 2 0 1 0 2 0 0 1 2 0 1 2 1
0 2 0 0 1 0 2 2 0 2 0 2 1 0 0 2 2 1 1 2 1 1 2 1 1 1 0 2 0 0 1 0 1 1 1 1 2
0 0 1 2 1 2 2 0 1 1 0 1 0 1]
0.9872727272727272
0.9866666666666667
GridSearchCV(cv=6, estimator=SVC(),
param_grid={'kernel': ('linear', 'rbf', 'poly', 'sigmoid')})
linear
0.8430742255990649
KNeighborsClassifier(n_neighbors=3)
--------------------------------------------------
Conclusion
The logistic regression method classified the data almost perfectly; the algorithm made only one error. The SVM method gave a perfect result (several runs were carried out and the result was the same in all of them). The KNN method gave an average result overall and the worst among all the models.
A model performs better when the classes have the same number of elements. macro avg is the arithmetic mean of a metric across the classes (useful when there is an imbalance, since it shows how accurately the small class is predicted): it takes the per-class values, sums them and divides by the number of classes (3). Precision is the share of correct predictions for a class, recall is the share of the class's true objects that were found (both should be close to one), f1-score is a combined metric that balances precision and recall, and support shows how many true objects of each class there are.
Logistic regression models the dependence of the class on one or more other variables (factors, regressors, independent variables) through a linear function of the features.
SVM (support vector machine) separates the classes with a dividing line and works with the distances from that line to the points, analyzing them as vectors. If a point lies close to a cluster of other points on one side of the line, it is assigned the same class as that cluster. Having separated the classes, it draws the line, finds the distances from the line to the points and performs the vector analysis; it learns the split from the labelled answers.
There are 4 kernels: linear (a simple dot product), radial basis function (an exponential), polynomial ((a * b + c)^d) and sigmoid (based on the hyperbolic tangent). GridSearch is a method for finding better parameters for a model: it runs through all the given parameter values and selects the best one. grid_search_svm.best_estimator_.kernel shows the best kernel (in this case, linear). SVM performed better than logistic regression and better than KNN.
KNN is the simplest classification algorithm: it finds the nearest neighbours of a point, and the class of those neighbours becomes the class of the point. It also uses the labelled answers, but does not build an explicit class boundary, and it performed worst in terms of accuracy.
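A sketch of the three classifiers and the confusion matrices; the grid of SVM kernels and cv=6 follow the GridSearchCV output above, the rest is assumed:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

# Logistic regression
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, log_reg.predict(X_test)))

# SVM with a kernel chosen by grid search
grid_search_svm = GridSearchCV(SVC(), {'kernel': ('linear', 'rbf', 'poly', 'sigmoid')}, cv=6)
grid_search_svm.fit(X_train, y_train)
print(grid_search_svm.best_estimator_.kernel)
print(confusion_matrix(y_test, grid_search_svm.predict(X_test)))

# K nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(confusion_matrix(y_test, knn.predict(X_test)))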
Compare classification results using accuracy, precision, recall and f1-measure (you can use classification_report from sklearn.metrics). Draw conclusions.
t5
Output
t5:
precision recall f1-score support
0 1.00 0.96 0.98 28
1 1.00 1.00 1.00 35
2 0.96 1.00 0.98 25
accuracy 0.99 88
macro avg 0.99 0.99 0.99 88
weighted avg 0.99 0.99 0.99 88
precision recall f1-score support
0 1.00 1.00 1.00 28
1 1.00 1.00 1.00 35
2 1.00 1.00 1.00 25
accuracy 1.00 88
macro avg 1.00 1.00 1.00 88
weighted avg 1.00 1.00 1.00 88
precision recall f1-score support
0 0.64 0.82 0.72 22
1 0.86 0.81 0.83 37
2 0.96 0.83 0.89 29
accuracy 0.82 88
macro avg 0.82 0.82 0.81 88
weighted avg 0.84 0.82 0.82 88
--------------------------------------------------
Conclusion
The best classification method is SVM, and the worst is KNN; this was described in more detail in the conclusion of task 4. In the first two reports (logistic regression and SVM) almost every metric equals one, which indicates a high quality of class prediction - these models work very well. The KNN method showed an accuracy of around 82%, which is also good, but if the data had not been balanced its result would have been worse - during testing an accuracy of around 50% was observed.
Find data for clustering. If the features in the data have very different scales, then the data must first be normalized.
t1
Output
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 Adelie Torgersen 39.1 18.7 181.0
1 Adelie Torgersen 39.5 17.4 186.0
2 Adelie Torgersen 40.3 18.0 195.0
3 Adelie Torgersen NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0
.. ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0
341 Gentoo Biscoe 50.4 15.7 222.0
342 Gentoo Biscoe 45.2 14.8 212.0
343 Gentoo Biscoe 49.9 16.1 213.0
body_mass_g sex
0 3750.0 MALE
1 3800.0 FEMALE
2 3250.0 FEMALE
3 NaN NaN
4 3450.0 FEMALE
.. ... ...
339 NaN NaN
340 4850.0 FEMALE
341 5750.0 MALE
342 5200.0 FEMALE
343 5400.0 MALE
[344 rows x 7 columns]
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 0 0 39.1 18.7 181.0
1 0 0 39.5 17.4 186.0
2 0 0 40.3 18.0 195.0
4 0 0 36.7 19.3 193.0
5 0 0 39.3 20.6 190.0
.. ... ... ... ... ...
338 2 1 47.2 13.7 214.0
340 2 1 46.8 14.3 215.0
341 2 1 50.4 15.7 222.0
342 2 1 45.2 14.8 212.0
343 2 1 49.9 16.1 213.0
body_mass_g sex
0 3750.0 0
1 3800.0 1
2 3250.0 1
4 3450.0 1
5 3650.0 0
.. ... ...
338 4925.0 1
340 4850.0 1
341 5750.0 0
342 5200.0 1
343 5400.0 0
[334 rows x 7 columns]
Perform data clustering using the k-means algorithm. Use the elbow rule and silhouette coefficient to find the optimal number of clusters.
t2
Output
[[1.74809160e+00 1.03816794e+00 4.71885496e+01 1.57244275e+01
2.14786260e+02 5.06660305e+03 3.96946565e-01]
[3.89162562e-01 1.34975369e+00 4.19330049e+01 1.80871921e+01
1.92128079e+02 3.65566502e+03 5.66502463e-01]]
cluster
1 203
0 131
Name: count, dtype: int64
Conclusion
The task is to understand, with the help of the graphs, how many classes to define and whether the dataset can be cleanly divided into clusters. We select the number of clusters through graphical analysis and then assign each point to a cluster.
The first graph (score1) shows the value of the cost function. Applying the elbow rule, we look for the break point of the line (where it switches from falling steeply to flattening out); here the first change in slope from the left is at the value 3 on the X-axis, so the number of clusters is 3. The elbow method is not precise, so another method is also used. The second graph uses the silhouette coefficient: we look for its maximum, and the corresponding value on the X-axis is the optimal number of clusters (in our case 2). This method is more accurate, though still not ideal, since there are also ways to determine the number of clusters automatically.
The k-means method is the most commonly used clustering algorithm. The algorithm takes random points as cluster centers (centroids), assigns every other point to the nearest centroid so that each centroid corresponds to a set of closest points, then moves the centroids to the centers of those sets, and the process repeats. cluster_centers_ are the centroids; value_counts shows how many elements are in each cluster.
We create a KMeans model, fit it to the data and obtain the centroids; each point belongs to the class of the centroid it is closest to. The algorithm requires the number of clusters to be specified. We create a cluster column by adding labels_ to the dataset; these are essentially the centroid numbers and are used to color each cluster. The coordinates of the three-dimensional plots are the most important predictors/features, and they are chosen subjectively.
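A sketch of the k-means step with the elbow and silhouette checks; the range of cluster counts is an assumption, and X is the feature matrix prepared in t1:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, silhouettes = [], []
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)                    # cost function for the elbow plot
    silhouettes.append(silhouette_score(X, model.labels_))

# Final model with the chosen number of clusters (2, per the silhouette plot)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)
df['cluster'] = kmeans.labels_
print(df['cluster'].value_counts())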
Perform data clustering using a hierarchical clustering algorithm.
t3
Conclusion
Hierarchical agglomerative clustering starts from individual points and progressively merges them, going from smaller clusters to larger ones (that is what "agglomerative" denotes); it is not necessarily better than k-means. In this method the clusters are nested within each other and form a tree structure, so hierarchical clustering is useful for examining relationships between objects, in the same way that tree models capture structure that regression-style models miss. Here we again choose 2 clusters.
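A short sketch of the hierarchical step under the same assumptions (`X` is the numeric feature frame from the previous sketch; the dendrogram part is an illustrative addition):

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Dendrogram: shows how smaller clusters are nested inside larger ones.
Z = linkage(X, method="ward")
dendrogram(Z, truncate_mode="level", p=4)
plt.show()

# Flat clustering with the chosen number of clusters (2, as above).
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
df["cluster"] = agg.fit_predict(X)
print(df["cluster"].value_counts())
```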
Perform data clustering using the DBSCAN algorithm.
t4
Output
22
cluster
-1 195
0 12
8 10
6 9
3 9
5 8
12 7
1 7
4 6
2 6
9 6
13 6
14 6
15 6
16 6
7 5
11 5
10 5
18 5
17 5
20 5
19 5
Name: count, dtype: int64
Conclusion
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is based on density. The algorithm groups together points that lie close to each other (high density) and marks points in low-density regions as outliers (which are then ignored). The algorithm determines the number of clusters itself (22 labels in this case), although the data actually contains 3 classes, matching the number of penguin species (the species field).
The eps parameter (both parameters can be changed) is the maximum distance at which two points are still considered neighbors; in our case two points closer than 11 are neighbors. The min_samples parameter is the minimum number of neighbors a point must have to be considered a core point.
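A sketch of the DBSCAN step; eps = 11 follows the text above, while min_samples = 5 is only an assumed value:

```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=11, min_samples=5).fit(X)   # min_samples is an assumption
df["cluster"] = db.labels_                  # label -1 marks low-density outliers
print(len(set(db.labels_)))                 # number of distinct labels (22 in the output above)
print(df["cluster"].value_counts())
```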
Visualize clustered data using t-SNE or UMAP if necessary. If the data is three-dimensional, then a three-dimensional scatter plot can be used.
t5
Conclusion
Visualization was done with the t-SNE method; the rest was visualized with 3D scatter plots. Strictly speaking, t-SNE or UMAP is not required here, since the data had already been visualized another way, but the t-SNE projection reads better than the previous plots. t-SNE and UMAP are conceptually similar: both build the embedding from the distances between points.
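A sketch of the t-SNE projection colored by the cluster labels (2 output dimensions assumed):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

emb = TSNE(n_components=2, random_state=42).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=df["cluster"], s=15, cmap="viridis")
plt.title("t-SNE projection colored by cluster")
plt.show()
```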
Find data for a classification task or for a regression task.
t1
Output
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 Adelie Torgersen 39.1 18.7 181.0
1 Adelie Torgersen 39.5 17.4 186.0
2 Adelie Torgersen 40.3 18.0 195.0
3 Adelie Torgersen NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0
.. ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0
341 Gentoo Biscoe 50.4 15.7 222.0
342 Gentoo Biscoe 45.2 14.8 212.0
343 Gentoo Biscoe 49.9 16.1 213.0
body_mass_g sex
0 3750.0 MALE
1 3800.0 FEMALE
2 3250.0 FEMALE
3 NaN NaN
4 3450.0 FEMALE
.. ... ...
339 NaN NaN
340 4850.0 FEMALE
341 5750.0 MALE
342 5200.0 FEMALE
343 5400.0 MALE
[344 rows x 7 columns]
species island culmen_length_mm culmen_depth_mm flipper_length_mm \
0 0 0 39.1 18.7 181.0
1 0 0 39.5 17.4 186.0
2 0 0 40.3 18.0 195.0
4 0 0 36.7 19.3 193.0
5 0 0 39.3 20.6 190.0
.. ... ... ... ... ...
338 2 1 47.2 13.7 214.0
340 2 1 46.8 14.3 215.0
341 2 1 50.4 15.7 222.0
342 2 1 45.2 14.8 212.0
343 2 1 49.9 16.1 213.0
body_mass_g sex
0 3750.0 0
1 3800.0 1
2 3250.0 1
4 3450.0 1
5 3650.0 0
.. ... ...
338 4925.0 1
340 4850.0 1
341 5750.0 0
342 5200.0 1
343 5400.0 0
[334 rows x 7 columns]
Size of Predictor Train set (267, 6)
Size of Predictor Test set (67, 6)
Size of Target Train set (267,)
Size of Target Test set (67,)
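A sketch of the preprocessing behind the output above: drop incomplete rows, encode the categorical columns and split 80/20 (the file name, the encoder and the random seed are assumptions; the exact label mapping may differ from the one printed):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("penguins_size.csv")                  # hypothetical file name
print(df)
df = df.dropna()                                       # 344 -> 334 rows
for col in ["species", "island", "sex"]:
    df[col] = LabelEncoder().fit_transform(df[col])    # categorical -> integer codes
print(df)

X = df.drop(columns=["species"])                       # 6 predictors
y = df["species"]                                      # target: penguin species
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Size of Predictor Train set", X_train.shape)
print("Size of Predictor Test set", X_test.shape)
print("Size of Target Train set", y_train.shape)
print("Size of Target Test set", y_test.shape)
```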
Implement bagging.
t2
Output
Elapsed time: 0.13482189178466797 seconds
F1 metric for training set 0.9955112808599441
F1 metric for test set 0.98200460347353
Elapsed time: 5.649371862411499 seconds
F1 metric for training set 0.9910658307210031
F1 metric for test set 0.98200460347353
Conclusion
First we create a plain tree ensemble (RandomForest is the most frequently used) without setting any parameters, letting it try everything. This checks whether a tree-based model works on our data at all: if the accuracy is below about 60%, a neural network would be a better fit. We get two results: F1 on the training set shows how well the model fit the training data (not the most important number), and F1 on the test set shows how well it works on new data (it may be worse, but here it is about the same, ~0.98). Tree parameters: max_depth is the maximum depth of a tree, min_samples_split is the minimum number of samples needed to split an internal node. Regression tries to find a linear relationship (x grows, so y grows), whereas a tree checks many possible splits, each node testing a threshold on a feature, which is why trees tend to do better than regression here.
We set value ranges in param_grid and run GridSearchCV: it enumerates all combinations, runs each one and returns the most accurate. estimator is the model to tune; scoring is the metric to optimize (f1_macro, the unweighted mean of per-class F1); cv is the number of cross-validation folds for each combination. best_estimator_ gives the best model, which is then used for prediction; average='macro' in f1_score averages the per-class F1 scores without weighting.
Bagging here is paired with tuning for the trees: the bagging model (a random forest) is tuned with grid search, all parameter combinations are evaluated and the best one is selected. We choose the value ranges not by eye but following the documentation and the manual. The test-set score can even get slightly worse after tuning because of the randomness in the trees. Tuning is worth doing when the baseline score is 60% or less. The train set contains the answers the model learns from, while the (more important) test set is used only for checking; if the result does not change after tuning, the baseline was already the best achievable.
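A sketch of the two steps: a baseline random forest and then bagging-style tuning with GridSearchCV (the grid values are illustrative, chosen in the spirit of "by the documentation"):

```python
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

# Baseline forest with default parameters.
start = time.time()
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("Elapsed time:", time.time() - start, "seconds")
print("F1 metric for training set", f1_score(y_train, rf.predict(X_train), average="macro"))
print("F1 metric for test set", f1_score(y_test, rf.predict(X_test), average="macro"))

# Grid search over the tree parameters mentioned above.
param_grid = {"max_depth": [3, 5, 7, None], "min_samples_split": [2, 5, 10]}
start = time.time()
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      scoring="f1_macro", cv=5).fit(X_train, y_train)
best_model = search.best_estimator_
print("Elapsed time:", time.time() - start, "seconds")
print("F1 metric for training set", f1_score(y_train, best_model.predict(X_train), average="macro"))
print("F1 metric for test set", f1_score(y_test, best_model.predict(X_test), average="macro"))
```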
Implement boosting on the same data that was used for bagging.
t3
Output
Learning rate set to 0.019854
0: learn: 1.0712628 total: 64.1ms remaining: 3m 12s
1: learn: 1.0446230 total: 100ms remaining: 2m 30s
2: learn: 1.0206709 total: 152ms remaining: 2m 32s
3: learn: 0.9960718 total: 202ms remaining: 2m 31s
...
2997: learn: 0.0028512 total: 44s remaining: 29.4ms
2998: learn: 0.0028501 total: 44s remaining: 14.7ms
2999: learn: 0.0028491 total: 44s remaining: 0us
Elapsed time: 46.70735192298889 seconds
Boosted F1 metric for train set 1.0
Boosted F1 metric for test set 0.98200460347353
Conclusion
Boosting builds many trees at once. It uses trees as its base learners and most often works on heterogeneous data. Boosting helps find relationships that a single model misses: ordinary trees (RandomForest) are supposed to find them but do not always succeed, and regression cannot find relationships that are "not visible to the eye." Here CatBoost is used; it is not necessarily the best choice, and alternatives such as AdaBoost and XGBoost can perform better.
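A sketch of the boosting step with CatBoost; iterations=3000 matches the log above, the rest are defaults or assumptions:

```python
import time
import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score

start = time.time()
cb = CatBoostClassifier(iterations=3000, random_seed=42)
cb.fit(X_train, y_train)                       # prints the per-iteration log shown above
print("Elapsed time:", time.time() - start, "seconds")
print("Boosted F1 metric for train set",
      f1_score(y_train, np.ravel(cb.predict(X_train)), average="macro"))
print("Boosted F1 metric for test set",
      f1_score(y_test, np.ravel(cb.predict(X_test)), average="macro"))
```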
Compare the results of the algorithms (working time and quality of models). Draw conclusions.
Conclusion
The algorithm outputs, their running times and the quality of the models are given in the results of the corresponding tasks.
First we built a simple tree ensemble, then applied bagging with tuning, and finally boosting. Boosting gives the best quality and is expected to be the slowest, since 3000 trees are built, even though each individual tree is built quickly. Bagging with grid search is slower than building a single untuned model but faster than boosting.
Trees work better than regression (mostly for classification), boosting works better than bagging, and bagging works better than a single tree. The first level is a tree with parameters chosen "by eye", the second level is bagging with tuning, and the third level is boosting; bagging searches over parameter intervals while boosting relies on iterations. The "F1 metric for test set" matters much more than the "F1 metric for training set" (the machine learns relationships on the train set, and the test set checks them), so the test score is the one we judge by.
Execution on a graphics processor (GPU) is faster than on a central processing unit (CPU).
Load data.
t1
Output
shrimp almonds avocado vegetables mix \
0 burgers meatballs eggs NaN
1 chutney NaN NaN NaN
2 turkey avocado NaN NaN
3 mineral water milk energy bar whole wheat rice
4 low fat yogurt NaN NaN NaN
... ... ... ... ...
7495 butter light mayo fresh bread NaN
7496 burgers frozen vegetables eggs french fries
7497 chicken NaN NaN NaN
7498 escalope green tea NaN NaN
7499 eggs frozen smoothie yogurt cake low fat yogurt
green grapes whole weat flour yams cottage cheese energy drink \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 green tea NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
... ... ... ... ... ...
7495 NaN NaN NaN NaN NaN
7496 magazines green tea NaN NaN NaN
7497 NaN NaN NaN NaN NaN
7498 NaN NaN NaN NaN NaN
7499 NaN NaN NaN NaN NaN
tomato juice low fat yogurt green tea honey salad mineral water salmon \
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
7495 NaN NaN NaN NaN NaN NaN NaN
7496 NaN NaN NaN NaN NaN NaN NaN
7497 NaN NaN NaN NaN NaN NaN NaN
7498 NaN NaN NaN NaN NaN NaN NaN
7499 NaN NaN NaN NaN NaN NaN NaN
antioxydant juice frozen smoothie spinach olive oil
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
... ... ... ... ...
7495 NaN NaN NaN NaN
7496 NaN NaN NaN NaN
7497 NaN NaN NaN NaN
7498 NaN NaN NaN NaN
7499 NaN NaN NaN NaN
[7500 rows x 20 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 shrimp 7500 non-null object
1 almonds 5746 non-null object
2 avocado 4388 non-null object
3 vegetables mix 3344 non-null object
4 green grapes 2528 non-null object
5 whole weat flour 1863 non-null object
6 yams 1368 non-null object
7 cottage cheese 980 non-null object
8 energy drink 653 non-null object
9 tomato juice 394 non-null object
10 low fat yogurt 255 non-null object
11 green tea 153 non-null object
12 honey 86 non-null object
13 salad 46 non-null object
14 mineral water 24 non-null object
15 salmon 7 non-null object
16 antioxydant juice 3 non-null object
17 frozen smoothie 3 non-null object
18 spinach 2 non-null object
19 olive oil 0 non-null float64
dtypes: float64(1), object(19)
memory usage: 1.1+ MB
Visualize data (display relative and actual frequency of occurrence for the 20 most popular products on histograms).
t2
Output
mineral water 1787
eggs 1348
spaghetti 1306
french fries 1282
chocolate 1230
green tea 990
milk 972
ground beef 737
frozen vegetables 715
pancakes 713
burgers 654
cake 608
cookies 603
escalope 595
low fat yogurt 573
shrimp 535
tomatoes 513
olive oil 493
frozen smoothie 474
turkey 469
Name: count, dtype: int64
mineral water 0.238267
eggs 0.179733
spaghetti 0.174133
french fries 0.170933
chocolate 0.164000
green tea 0.132000
milk 0.129600
ground beef 0.098267
frozen vegetables 0.095333
pancakes 0.095067
burgers 0.087200
cake 0.081067
cookies 0.080400
escalope 0.079333
low fat yogurt 0.076400
shrimp 0.071333
tomatoes 0.068400
olive oil 0.065733
frozen smoothie 0.063200
turkey 0.062533
Name: count, dtype: float64
Conclusion
The stack() function flattens the columns into a single column (stacking values top to bottom), and value_counts() finds the most frequent entries, i.e. the most often repeated products; we take the top ones from this stack. Normalization can also be applied to obtain relative frequencies and to bring the values closer together, otherwise the least popular products would be invisible on the plot: large values are scaled down and small ones effectively raised. shape gives the total number of transactions, and each count is divided by it.
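A sketch of the frequency computation (`df` is the transaction DataFrame from t1):

```python
import matplotlib.pyplot as plt

items = df.stack()                          # flatten all items into one Series (NaN dropped)
counts = items.value_counts().head(20)      # absolute frequency of the 20 most popular products
relative = counts / df.shape[0]             # relative frequency per transaction

print(counts)
print(relative)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
counts.plot(kind="bar", ax=ax1, title="Top-20 products (absolute)")
relative.plot(kind="bar", ax=ax2, title="Top-20 products (relative)")
plt.tight_layout()
plt.show()
```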
Apply the Apriori algorithm using 3 different libraries (apriori_python, apyori, efficient_apriori). Select hyperparameters for the algorithms so that about 10 best rules are output.
t3
Output
burgers ['burgers', 'meatballs', 'eggs']
first:
[[{'frozen vegetables', 'spaghetti', 'turkey', 'milk'}, {'mineral water'}, 0.9], [{'chocolate', 'frozen vegetables', 'olive oil', 'shrimp'}, {'mineral water'}, 0.9], [{'pasta', 'eggs', 'mineral water'}, {'shrimp'}, 0.9090909090909091], [{'herb & pepper', 'rice', 'mineral water'}, {'ground beef'}, 0.9090909090909091], [{'pancakes', 'ground beef', 'whole wheat rice'}, {'mineral water'}, 0.9090909090909091], [{'red wine', 'soup'}, {'mineral water'}, 0.9333333333333333], [{'pasta', 'mushroom cream sauce'}, {'escalope'}, 0.95], [{'french fries', 'pasta', 'mushroom cream sauce'}, {'escalope'}, 1.0], [{'cake', 'olive oil', 'shrimp'}, {'mineral water'}, 1.0], [{'meatballs', 'cake', 'mineral water'}, {'milk'}, 1.0], [{'olive oil', 'ground beef', 'light cream'}, {'mineral water'}, 1.0]]
end
second:
[RelationRecord(items=frozenset({'pasta', 'escalope', 'mushroom cream sauce'}), support=0.002533333333333333, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta', 'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.95, lift=11.974789915966385)]), RelationRecord(items=frozenset({'red wine', 'soup', 'mineral water'}), support=0.0018666666666666666, ordered_statistics=[OrderedStatistic(items_base=frozenset({'red wine', 'soup'}), items_add=frozenset({'mineral water'}), confidence=0.9333333333333333, lift=3.917179630665921)]), RelationRecord(items=frozenset({'meatballs', 'cake', 'mineral water', 'milk'}), support=0.0010666666666666667, ordered_statistics=[OrderedStatistic(items_base=frozenset({'meatballs', 'cake', 'mineral water'}), items_add=frozenset({'milk'}), confidence=1.0, lift=7.71604938271605)]), RelationRecord(items=frozenset({'shrimp', 'cake', 'olive oil', 'mineral water'}), support=0.0012, ordered_statistics=[OrderedStatistic(items_base=frozenset({'cake', 'olive oil', 'shrimp'}), items_add=frozenset({'mineral water'}), confidence=1.0, lift=4.196978175713486)]), RelationRecord(items=frozenset({'pasta', 'shrimp', 'eggs', 'mineral water'}), support=0.0013333333333333333, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta', 'eggs', 'mineral water'}), items_add=frozenset({'shrimp'}), confidence=0.9090909090909091, lift=12.744265080713678)]), RelationRecord(items=frozenset({'pasta', 'french fries', 'escalope', 'mushroom cream sauce'}), support=0.0010666666666666667, ordered_statistics=[OrderedStatistic(items_base=frozenset({'french fries', 'pasta', 'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=1.0, lift=12.605042016806722)]), RelationRecord(items=frozenset({'herb & pepper', 'rice', 'ground beef', 'mineral water'}), support=0.0013333333333333333, ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb & pepper', 'rice', 'mineral water'}), items_add=frozenset({'ground beef'}), confidence=0.9090909090909091, lift=9.251264339459725)]), RelationRecord(items=frozenset({'olive oil', 'mineral water', 'ground beef', 'light cream'}), support=0.0012, ordered_statistics=[OrderedStatistic(items_base=frozenset({'olive oil', 'ground beef', 'light cream'}), items_add=frozenset({'mineral water'}), confidence=1.0, lift=4.196978175713486)]), RelationRecord(items=frozenset({'pancakes', 'ground beef', 'whole wheat rice', 'mineral water'}), support=0.0013333333333333333, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pancakes', 'ground beef', 'whole wheat rice'}), items_add=frozenset({'mineral water'}), confidence=0.9090909090909091, lift=3.8154347051940785)])]
end
second beautified:
frozenset({'pasta', 'mushroom cream sauce'}) frozenset({'escalope'})
Support: 0.002533333333333333; Confidence: 0.95; Lift: 11.974789915966385;
frozenset({'red wine', 'soup'}) frozenset({'mineral water'})
Support: 0.0018666666666666666; Confidence: 0.9333333333333333; Lift: 3.917179630665921;
frozenset({'meatballs', 'cake', 'mineral water'}) frozenset({'milk'})
Support: 0.0010666666666666667; Confidence: 1.0; Lift: 7.71604938271605;
frozenset({'cake', 'olive oil', 'shrimp'}) frozenset({'mineral water'})
Support: 0.0012; Confidence: 1.0; Lift: 4.196978175713486;
frozenset({'pasta', 'eggs', 'mineral water'}) frozenset({'shrimp'})
Support: 0.0013333333333333333; Confidence: 0.9090909090909091; Lift: 12.744265080713678;
frozenset({'french fries', 'pasta', 'mushroom cream sauce'}) frozenset({'escalope'})
Support: 0.0010666666666666667; Confidence: 1.0; Lift: 12.605042016806722;
frozenset({'herb & pepper', 'rice', 'mineral water'}) frozenset({'ground beef'})
Support: 0.0013333333333333333; Confidence: 0.9090909090909091; Lift: 9.251264339459725;
frozenset({'olive oil', 'ground beef', 'light cream'}) frozenset({'mineral water'})
Support: 0.0012; Confidence: 1.0; Lift: 4.196978175713486;
frozenset({'pancakes', 'ground beef', 'whole wheat rice'}) frozenset({'mineral water'})
Support: 0.0013333333333333333; Confidence: 0.9090909090909091; Lift: 3.8154347051940785;
end
third:
{mushroom cream sauce, pasta} -> {escalope} (conf: 0.950, supp: 0.003, lift: 11.975, conv: 18.413)
{red wine, soup} -> {mineral water} (conf: 0.933, supp: 0.002, lift: 3.917, conv: 11.426)
{cake, meatballs, mineral water} -> {milk} (conf: 1.000, supp: 0.001, lift: 7.716, conv: 870400000.000)
{cake, olive oil, shrimp} -> {mineral water} (conf: 1.000, supp: 0.001, lift: 4.197, conv: 761733333.333)
{eggs, mineral water, pasta} -> {shrimp} (conf: 0.909, supp: 0.001, lift: 12.744, conv: 10.215)
{french fries, mushroom cream sauce, pasta} -> {escalope} (conf: 1.000, supp: 0.001, lift: 12.605, conv: 920666666.667)
{herb & pepper, mineral water, rice} -> {ground beef} (conf: 0.909, supp: 0.001, lift: 9.251, conv: 9.919)
{ground beef, light cream, olive oil} -> {mineral water} (conf: 1.000, supp: 0.001, lift: 4.197, conv: 761733333.333)
{ground beef, pancakes, whole wheat rice} -> {mineral water} (conf: 0.909, supp: 0.001, lift: 3.815, conv: 8.379)
{chocolate, frozen vegetables, olive oil, shrimp} -> {mineral water} (conf: 0.900, supp: 0.001, lift: 3.777, conv: 7.617)
{frozen vegetables, milk, spaghetti, turkey} -> {mineral water} (conf: 0.900, supp: 0.001, lift: 3.777, conv: 7.617)
end
Conclusion
The task is to teach the machine to find associations between items (searching for relationships and associations). For example, ten people go to a grocery store, and the task is to infer connections between customers and purchases. The dataset consists of people's transactions, and we need to find associations in them.
ARL algorithms (association rule learning) learn such association rules.
First algorithm: rules are the algorithm's responses. minSup is the minimum support, from 0 to 1, a measure of how reliably an association rule links a condition to a consequence; it is kept low because the goal is to demand as little as possible and find the point to start from. minConf is the minimum confidence, an indicator of how confident the algorithm must be that the association from a to b really is a rule.
Every dataset needs its own values: they must be chosen so that the number of rules is neither too small nor too large. The smaller both thresholds are, the longer the computation takes; if they are too large (closer to 1), connections may not be found at all, so a balance is needed, namely the smallest values at which associations still come out. minSup has the greater influence. Both values must lie between 0 and 1; minConf is set high here because we want the 10 most reliable rules.
The first algorithm outputs an association (rule) with a confidence value, e.g. [{'chocolate', 'frozen vegetables', 'olive oil', 'shrimp'}, {'mineral water'}, 0.9]: customers who bought these four items also bought water, and the algorithm is 90% confident in this (confidence 0.9, with at least 80% required by minConf).
Second algorithm: min_lift is added; it excludes independent rules, so a value greater than 1 is required to filter out answers whose lift is below 1 (which indicate independence). Lift is the ratio of the observed co-occurrence to what independence would predict: equal to 1 means the items are independent, greater than 1 means a positive dependence (the larger, the better), less than 1 means a negative influence. The output format also differs: a frozenset is an immutable set. Only 9 rules are printed here, since some associations tie on confidence and exactly 10 do not come out.
Third algorithm: conv in the output is conviction, a measure of how often the rule would be wrong (how often, say, beer is bought without diapers and vice versa); the further above 1, the better. The rule {mushroom cream sauce, pasta} -> {escalope} means the algorithm is confident that escalope is also bought together with cream sauce and pasta.
In short: minSup is how often the itemset occurs in the data, minConf is how often the rule fires when its condition holds, and minLift is how much better the rule is than random co-occurrence.
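A sketch of the three library calls (the threshold values are illustrative; they have to be tuned per dataset until roughly 10 rules come out, as discussed above):

```python
from apriori_python import apriori as apriori_p
from apyori import apriori as apriori_a
from efficient_apriori import apriori as apriori_e

# Build the transaction list: keep only the string items in each row of the DataFrame.
transactions = [tuple(item for item in row if isinstance(item, str)) for row in df.values]

# 1) apriori_python: returns (frequent itemsets, rules) as [antecedent, consequent, confidence].
_, rules1 = apriori_p(transactions, minSup=0.001, minConf=0.9)
print(rules1)

# 2) apyori: returns RelationRecord objects; min_lift filters out independent rules.
rules2 = list(apriori_a(transactions, min_support=0.001, min_confidence=0.9, min_lift=1.1))
for r in rules2:
    for stat in r.ordered_statistics:
        print(stat.items_base, stat.items_add)
        print(f"Support: {r.support}; Confidence: {stat.confidence}; Lift: {stat.lift};")

# 3) efficient_apriori: Rule objects print conf, supp, lift and conv as in the output above.
_, rules3 = apriori_e(transactions, min_support=0.001, min_confidence=0.9)
for r in rules3:
    print(r)
```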
Apply the FP-Growth algorithm from the fpgrowth_py library. Select hyperparameters for the algorithm so that about 10 best rules are output.
t4
Output
[{'ground beef', 'light cream', 'olive oil'}, {'mineral water'}, 1.0]
[{'mineral water', 'pasta', 'eggs'}, {'shrimp'}, 0.9090909090909091]
[{'mushroom cream sauce', 'pasta', 'french fries'}, {'escalope'}, 1.0]
[{'mushroom cream sauce', 'pasta'}, {'escalope'}, 0.95]
[{'rice', 'mineral water', 'herb & pepper'}, {'ground beef'}, 0.9090909090909091]
[{'cake', 'meatballs', 'mineral water'}, {'milk'}, 1.0]
[{'red wine', 'soup'}, {'mineral water'}, 0.9333333333333333]
[{'ground beef', 'pancakes', 'whole wheat rice'}, {'mineral water'}, 0.9090909090909091]
[{'shrimp', 'olive oil', 'cake'}, {'mineral water'}, 1.0]
Conclusion
FP-Growth is a newer algorithm that works with a prefix tree; here it turned out to be the slowest, because its advantage is offset by the fairly good optimization of the Apriori implementations in the libraries used. minSupRatio plays the role of minSup. In the rule [{'light cream', 'olive oil', 'ground beef'}, {'mineral water'}, 1.0], the value 1.0 is the confidence.
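A sketch of the FP-Growth call (threshold values illustrative, `transactions` as in the previous sketch):

```python
from fpgrowth_py import fpgrowth

# minSupRatio plays the same role as minSup in the Apriori libraries.
freq_itemsets, rules = fpgrowth(transactions, minSupRatio=0.001, minConf=0.9)
for rule in rules:
    print(rule)          # [antecedent set, consequent set, confidence]
```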
Compare the execution time of all algorithms and build a histogram.
t5
Output
[39.639390939000805, 3.898362795999674, 0.28718328700051643, 59.603256348000286]
Time for apriori from apriori_python: 39.639390939000805
Time for apriori from apryori: 3.898362795999674
Time for apriori from efficient_apriori: 0.28718328700051643
Time for fpgrowth: 59.603256348000286
Conclusion
The fastest algorithm is efficient_apriori. All Apriori implementations scan the entire dataset and count supports at every level, which is very heavy on RAM for large datasets, yet they still run relatively quickly. The first three algorithms are essentially the same: each library implements the same Apriori algorithm, just more or less efficiently in the mathematical details. fpgrowth uses a different algorithm. "A priori" means knowledge obtained before deduction and independently of experience, i.e. knowledge known in advance.
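A sketch of the timing comparison and the bar chart, reusing the aliases from the sketches above (the `timed` helper is an illustrative addition):

```python
import time
import matplotlib.pyplot as plt

def timed(fn, *args, **kwargs):
    """Return how many seconds a single call to fn takes."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start

times = [
    timed(apriori_p, transactions, minSup=0.001, minConf=0.9),
    timed(lambda t: list(apriori_a(t, min_support=0.001, min_confidence=0.9)), transactions),
    timed(apriori_e, transactions, min_support=0.001, min_confidence=0.9),
    timed(fpgrowth, transactions, minSupRatio=0.001, minConf=0.9),
]
print(times)
plt.bar(["apriori_python", "apyori", "efficient_apriori", "fpgrowth_py"], times)
plt.ylabel("seconds")
plt.show()
```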
In the subsequent tasks everything is the same but with different data; the algorithm implementations are generalized into separate functions and reused for convenience below.
Load data.
t6
Output
Bread Unnamed: 1 Unnamed: 2 Unnamed: 3 \
0 Scandinavian Scandinavian NaN NaN
1 Hot chocolate Jam Cookies NaN
2 Muffin NaN NaN NaN
3 Coffee Pastry Bread NaN
4 Medialuna Pastry Muffin NaN
... ... ... ... ...
9525 Bread NaN NaN NaN
9526 Truffles Tea Spanish Brunch Christmas common
9527 Muffin Tacos/Fajita Coffee Tea
9528 Coffee Pastry NaN NaN
9529 Smoothies NaN NaN NaN
Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 \
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
9525 NaN NaN NaN NaN NaN NaN
9526 NaN NaN NaN NaN NaN NaN
9527 NaN NaN NaN NaN NaN NaN
9528 NaN NaN NaN NaN NaN NaN
9529 NaN NaN NaN NaN NaN NaN
Unnamed: 10 Unnamed: 11
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
9525 NaN NaN
9526 NaN NaN
9527 NaN NaN
9528 NaN NaN
9529 NaN NaN
[9530 rows x 12 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9530 entries, 0 to 9529
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bread 9207 non-null object
1 Unnamed: 1 5840 non-null object
2 Unnamed: 2 2959 non-null object
3 Unnamed: 3 1505 non-null object
4 Unnamed: 4 596 non-null object
5 Unnamed: 5 245 non-null object
6 Unnamed: 6 91 non-null object
7 Unnamed: 7 36 non-null object
8 Unnamed: 8 13 non-null object
9 Unnamed: 9 9 non-null object
10 Unnamed: 10 4 non-null object
11 Unnamed: 11 1 non-null object
dtypes: object(12)
memory usage: 893.6+ KB
Visualize data (display relative and actual frequency of occurrence for the 20 most popular products on histograms).
t7
Output
Coffee 5471
Bread 3324
Tea 1435
Cake 1025
Pastry 856
Sandwich 771
Medialuna 616
Hot chocolate 590
Cookies 540
Brownie 379
Farm House 374
Muffin 370
Alfajores 369
Juice 369
Soup 342
Scone 327
Toast 318
Scandinavian 277
Truffles 193
Coke 185
Name: count, dtype: int64
Coffee 0.574082
Bread 0.348793
Tea 0.150577
Cake 0.107555
Pastry 0.089822
Sandwich 0.080902
Medialuna 0.064638
Hot chocolate 0.061910
Cookies 0.056663
Brownie 0.039769
Farm House 0.039244
Muffin 0.038825
Alfajores 0.038720
Juice 0.038720
Soup 0.035887
Scone 0.034313
Toast 0.033368
Scandinavian 0.029066
Truffles 0.020252
Coke 0.019412
Name: count, dtype: float64
Apply the Apriori algorithm using 3 different libraries (apriori_python, apyori, efficient_apriori). Select hyperparameters for the algorithms so that about 10 best rules are output.
t8
Output
Scandinavian ['Scandinavian', 'Scandinavian']
first:
[[{'Salad', 'Cake'}, {'Coffee'}, 0.7692307692307693], [{'Juice', 'Pastry'}, {'Coffee'}, 0.7727272727272727], [{'Scone', 'Cookies'}, {'Coffee'}, 0.7894736842105263], [{'Keeping It Local'}, {'Coffee'}, 0.8095238095238095], [{'Extra Salami or Feta'}, {'Coffee'}, 0.8157894736842105], [{'Salad', 'Sandwich'}, {'Coffee'}, 0.8333333333333334], [{'Vegan mincepie', 'Cake'}, {'Coffee'}, 0.8333333333333334], [{'Sandwich', 'Hearty & Seasonal'}, {'Coffee'}, 0.8571428571428571], [{'Pastry', 'Toast'}, {'Coffee'}, 0.8666666666666667], [{'Salad', 'Extra Salami or Feta'}, {'Coffee'}, 0.875]]
end
second:
[RelationRecord(items=frozenset({'Extra Salami or Feta', 'Coffee'}), support=0.0032528856243441762, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Extra Salami or Feta'}), items_add=frozenset({'Coffee'}), confidence=0.8157894736842104, lift=1.7169774037567413)]), RelationRecord(items=frozenset({'Coffee', 'Keeping It Local'}), support=0.005351521511017838, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Keeping It Local'}), items_add=frozenset({'Coffee'}), confidence=0.8095238095238095, lift=1.7037901733131415)]), RelationRecord(items=frozenset({'Salad', 'Coffee', 'Cake'}), support=0.001049317943336831, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Salad', 'Cake'}), items_add=frozenset({'Coffee'}), confidence=0.7692307692307692, lift=1.6189861375373742)]), RelationRecord(items=frozenset({'Coffee', 'Cake', 'Vegan mincepie'}), support=0.001049317943336831, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Vegan mincepie', 'Cake'}), items_add=frozenset({'Coffee'}), confidence=0.8333333333333334, lift=1.7539016489988222)]), RelationRecord(items=frozenset({'Coffee', 'Scone', 'Cookies'}), support=0.0015739769150052466, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Scone', 'Cookies'}), items_add=frozenset({'Coffee'}), confidence=0.7894736842105262, lift=1.6615910358936208)]), RelationRecord(items=frozenset({'Extra Salami or Feta', 'Coffee', 'Salad'}), support=0.0014690451206715634, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Extra Salami or Feta', 'Salad'}), items_add=frozenset({'Coffee'}), confidence=0.875, lift=1.8415967314487631)]), RelationRecord(items=frozenset({'Sandwich', 'Coffee', 'Hearty & Seasonal'}), support=0.0012591815320041973, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Sandwich', 'Hearty & Seasonal'}), items_add=frozenset({'Coffee'}), confidence=0.8571428571428572, lift=1.8040131246845028)]), RelationRecord(items=frozenset({'Coffee', 'Juice', 'Pastry'}), support=0.0017838405036726128, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Juice', 'Pastry'}), items_add=frozenset({'Coffee'}), confidence=0.7727272727272728, lift=1.6263451654352716)]), RelationRecord(items=frozenset({'Coffee', 'Pastry', 'Toast'}), support=0.0013641133263378805, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Pastry', 'Toast'}), items_add=frozenset({'Coffee'}), confidence=0.8666666666666667, lift=1.8240577149587751)]), RelationRecord(items=frozenset({'Salad', 'Coffee', 'Sandwich'}), support=0.0015739769150052466, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Salad', 'Sandwich'}), items_add=frozenset({'Coffee'}), confidence=0.8333333333333333, lift=1.753901648998822)])]
end
second beautified:
frozenset({'Extra Salami or Feta'}) frozenset({'Coffee'})
Support: 0.0032528856243441762; Confidence: 0.8157894736842104; Lift: 1.7169774037567413;
frozenset({'Keeping It Local'}) frozenset({'Coffee'})
Support: 0.005351521511017838; Confidence: 0.8095238095238095; Lift: 1.7037901733131415;
frozenset({'Salad', 'Cake'}) frozenset({'Coffee'})
Support: 0.001049317943336831; Confidence: 0.7692307692307692; Lift: 1.6189861375373742;
frozenset({'Vegan mincepie', 'Cake'}) frozenset({'Coffee'})
Support: 0.001049317943336831; Confidence: 0.8333333333333334; Lift: 1.7539016489988222;
frozenset({'Scone', 'Cookies'}) frozenset({'Coffee'})
Support: 0.0015739769150052466; Confidence: 0.7894736842105262; Lift: 1.6615910358936208;
frozenset({'Extra Salami or Feta', 'Salad'}) frozenset({'Coffee'})
Support: 0.0014690451206715634; Confidence: 0.875; Lift: 1.8415967314487631;
frozenset({'Sandwich', 'Hearty & Seasonal'}) frozenset({'Coffee'})
Support: 0.0012591815320041973; Confidence: 0.8571428571428572; Lift: 1.8040131246845028;
frozenset({'Juice', 'Pastry'}) frozenset({'Coffee'})
Support: 0.0017838405036726128; Confidence: 0.7727272727272728; Lift: 1.6263451654352716;
frozenset({'Pastry', 'Toast'}) frozenset({'Coffee'})
Support: 0.0013641133263378805; Confidence: 0.8666666666666667; Lift: 1.8240577149587751;
frozenset({'Salad', 'Sandwich'}) frozenset({'Coffee'})
Support: 0.0015739769150052466; Confidence: 0.8333333333333333; Lift: 1.753901648998822;
end
third:
{Extra Salami or Feta} -> {Coffee} (conf: 0.816, supp: 0.003, lift: 1.717, conv: 2.849)
{Keeping It Local} -> {Coffee} (conf: 0.810, supp: 0.005, lift: 1.704, conv: 2.756)
{Cake, Salad} -> {Coffee} (conf: 0.769, supp: 0.001, lift: 1.619, conv: 2.274)
{Cake, Vegan mincepie} -> {Coffee} (conf: 0.833, supp: 0.001, lift: 1.754, conv: 3.149)
{Cookies, Scone} -> {Coffee} (conf: 0.789, supp: 0.002, lift: 1.662, conv: 2.493)
{Extra Salami or Feta, Salad} -> {Coffee} (conf: 0.875, supp: 0.001, lift: 1.842, conv: 4.199)
{Hearty & Seasonal, Sandwich} -> {Coffee} (conf: 0.857, supp: 0.001, lift: 1.804, conv: 3.674)
{Juice, Pastry} -> {Coffee} (conf: 0.773, supp: 0.002, lift: 1.626, conv: 2.309)
{Pastry, Toast} -> {Coffee} (conf: 0.867, supp: 0.001, lift: 1.824, conv: 3.937)
{Salad, Sandwich} -> {Coffee} (conf: 0.833, supp: 0.002, lift: 1.754, conv: 3.149)
end
Apply the FP-Growth algorithm from the fpgrowth_py library. Select hyperparameters for the algorithm so that about 10 best rules are output.
t9
Output
[{'Mighty Protein'}, {'Coffee'}, 0.8181818181818182]
[{'Extra Salami or Feta', 'Salad'}, {'Coffee'}, 0.875]
[{'Extra Salami or Feta'}, {'Coffee'}, 0.8157894736842105]
[{'Extra Salami or Feta'}, {'Coffee'}, 0.8157894736842105]
[{'Vegan mincepie', 'Cake'}, {'Coffee'}, 0.8333333333333334]
[{'Salad', 'Sandwich'}, {'Coffee'}, 0.8333333333333334]
[{'Sandwich', 'Hearty & Seasonal'}, {'Coffee'}, 0.8571428571428571]
[{'Pastry', 'Toast'}, {'Coffee'}, 0.8666666666666667]
[{'Sandwich', 'Cake', 'Soup'}, {'Coffee'}, 0.8181818181818182]
[{'Sandwich', 'Hot chocolate', 'Cookies'}, {'Coffee'}, 0.8571428571428571]
[{'Sandwich', 'Pastry'}, {'Coffee'}, 0.8181818181818182]
Compare the execution time of all algorithms and build a histogram.
t10
Output
[1.8197862629995143, 0.06134650999956648, 0.016998387000057846, 3.7827288939997743]
Time for apriori from apriori_python: 1.8197862629995143
Time for apriori from apryori: 0.06134650999956648
Time for apriori from efficient_apriori: 0.016998387000057846
Time for fpgrowth: 3.7827288939997743
Conclusion
Detailed findings for each task type are described above. Overall, the efficient_apriori implementation works many times faster than the apriori_python and fpgrowth_py implementations, while the apyori implementation is only slightly slower than efficient_apriori. All algorithms produce rules/associations of sufficient quality and confirm each other's results, which indicates that they work correctly.