diff --git a/notebooks/01-ams-inital_data_exploration.html b/notebooks/01-ams-inital_data_exploration.html
new file mode 100644
index 0000000..e93dc38
--- /dev/null
+++ b/notebooks/01-ams-inital_data_exploration.html
@@ -0,0 +1,9259 @@
+Pandas Profiling Report

Overview

Dataset statistics

Number of variables: 7
Number of observations: 2064
Missing cells: 2034
Missing cells (%): 14.1%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 113.0 KiB
Average record size in memory: 56.1 B
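
The derived figures in the dataset statistics follow from the raw counts. A minimal sketch of that arithmetic, using only the numbers reported in this overview (the variable names are illustrative and not taken from the report's code):

```python
# Figures copied from the "Dataset statistics" overview.
n_variables = 7
n_observations = 2064
missing_cells = 2034
total_size_bytes = 113.0 * 1024  # 113.0 KiB expressed in bytes

# Every variable contributes one cell per observation.
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells
avg_record_bytes = total_size_bytes / n_observations

print(f"total cells: {total_cells}")
print(f"missing cells: {missing_pct:.1f}%")
print(f"average record size: {avg_record_bytes:.1f} B")
```

Almost all of the missing cells come from a single column: Top Ten accounts for 2032 of them and Style for the remaining 2.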

Variable types

Numeric: 1
Categorical: 6

Alerts

Brand has a high cardinality: 324 distinct values (High cardinality)
Variety has a high cardinality: 1945 distinct values (High cardinality)
Review # is highly correlated with Country and 1 other field (High correlation)
Country is highly correlated with Review # and 1 other field (High correlation)
Top Ten is highly correlated with Review # and 1 other field (High correlation)
Top Ten has 2032 (98.4%) missing values (Missing)
Variety is uniformly distributed (Uniform)
Top Ten is uniformly distributed (Uniform)
Review # has unique values (Unique)

Reproduction

Analysis started: 2021-11-21 22:40:05.312515
Analysis finished: 2021-11-21 22:40:07.357062
Duration: 2.04 seconds
Software version: pandas-profiling v3.1.0
Download configuration: config.json

Variables

Review #
Real number (ℝ≥0)

HIGH CORRELATION
UNIQUE

Distinct: 2064
Distinct (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1295.616279
Minimum: 1
Maximum: 2580
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 16.2 KiB

Quantile statistics

Minimum: 1
5-th percentile: 127.15
Q1: 646.75
Median: 1298.5
Q3: 1945.25
95-th percentile: 2456.85
Maximum: 2580
Range: 2579
Interquartile range (IQR): 1298.5

Descriptive statistics

Standard deviation: 748.8958659
Coefficient of variation (CV): 0.57802289
Kurtosis: -1.208799672
Mean: 1295.616279
Median Absolute Deviation (MAD): 650.5
Skewness: -0.005266261832
Sum: 2674152
Variance: 560845.018
Monotonicity: Not monotonic
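
Several of the Review # statistics are derived from one another. A short sketch of those relationships, using only the values reported above (the variable names are illustrative):

```python
# Figures copied from the Review # statistics.
mean = 1295.616279
std = 748.8958659
minimum, maximum = 1, 2580
q1, q3 = 646.75, 1945.25

cv = std / mean                  # coefficient of variation
variance = std ** 2              # variance is the squared standard deviation
value_range = maximum - minimum  # range
iqr = q3 - q1                    # interquartile range

print(f"CV: {cv:.8f}")
print(f"variance: {variance:.3f}")
print(f"range: {value_range}, IQR: {iqr}")
```

The near-zero skewness and negative kurtosis are consistent with a roughly flat distribution of values, as expected for an identifier-like column.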
Histogram with fixed size bins (bins=50)
Value | Count | Frequency (%)
284 | 1 | < 0.1%
1960 | 1 | < 0.1%
1439 | 1 | < 0.1%
1502 | 1 | < 0.1%
155 | 1 | < 0.1%
2295 | 1 | < 0.1%
864 | 1 | < 0.1%
516 | 1 | < 0.1%
403 | 1 | < 0.1%
1731 | 1 | < 0.1%
Other values (2054) | 2054 | 99.5%

Value | Count | Frequency (%)
1 | 1 | < 0.1%
3 | 1 | < 0.1%
4 | 1 | < 0.1%
6 | 1 | < 0.1%
9 | 1 | < 0.1%
10 | 1 | < 0.1%
11 | 1 | < 0.1%
12 | 1 | < 0.1%
13 | 1 | < 0.1%
14 | 1 | < 0.1%

Value | Count | Frequency (%)
2580 | 1 | < 0.1%
2579 | 1 | < 0.1%
2578 | 1 | < 0.1%
2577 | 1 | < 0.1%
2576 | 1 | < 0.1%
2575 | 1 | < 0.1%
2574 | 1 | < 0.1%
2573 | 1 | < 0.1%
2572 | 1 | < 0.1%
2571 | 1 | < 0.1%

Brand
Categorical

HIGH CARDINALITY

Distinct: 324
Distinct (%): 15.7%
Missing: 0
Missing (%): 0.0%
Memory size: 16.2 KiB

Nissin: 314
Nongshim: 75
Maruchan: 60
Mama: 57
Paldo: 52
Other values (319): 1506

Length

Max length: 31
Median length: 7
Mean length: 7.769864341
Min length: 1

Characters and Unicode

Total characters: 0
Distinct characters: 0
Distinct categories: 0
Distinct scripts: 0
Distinct blocks: 0
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 111
Unique (%): 5.4%

Sample

1st row: Nissin
2nd row: Nissin
3rd row: Nissin
4th row: Chikara
5th row: SuperMi

Common Values

Value | Count | Frequency (%)
Nissin | 314 | 15.2%
Nongshim | 75 | 3.6%
Maruchan | 60 | 2.9%
Mama | 57 | 2.8%
Paldo | 52 | 2.5%
Indomie | 43 | 2.1%
Myojo | 42 | 2.0%
Samyang Foods | 41 | 2.0%
Ottogi | 36 | 1.7%
Vina Acecook | 29 | 1.4%
Other values (314) | 1315 | 63.7%

Length

Histogram of lengths of the category

Value | Count | Frequency (%)
nissin | 314 | 11.1%
mama | 83 | 2.9%
foods | 79 | 2.8%
nongshim | 75 | 2.6%
maruchan | 60 | 2.1%
noodle | 60 | 2.1%
samyang | 59 | 2.1%
paldo | 55 | 1.9%
wai | 44 | 1.6%
indomie | 43 | 1.5%
Other values (391) | 1962 | 69.2%

Most occurring characters

ValueCountFrequency (%)
No values found.

Most occurring categories

ValueCountFrequency (%)
No values found.

Most frequent character per category

Most occurring scripts

ValueCountFrequency (%)
No values found.

Most frequent character per script

Most occurring blocks

ValueCountFrequency (%)
No values found.

Most frequent character per block

Variety
Categorical

HIGH CARDINALITY
UNIFORM

Distinct: 1945
Distinct (%): 94.2%
Missing: 0
Missing (%): 0.0%
Memory size: 16.2 KiB

Beef: 7
Yakisoba: 6
Vegetable: 6
Artificial Chicken: 6
Miso Ramen: 5
Other values (1940): 2034

Length

Max length: 96
Median length: 28
Mean length: 29.5494186
Min length: 3

Characters and Unicode

Total characters: 0
Distinct characters: 0
Distinct categories: 0
Distinct scripts: 0
Distinct blocks: 0
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 1858
Unique (%): 90.0%

Sample

1st row: Cup Noodles Seafood
2nd row: Donbei Tensoba
3rd row: Demae Ramen Shoyu
4th row: Shrimp Udon
5th row: Mi Keriting Rasa Ayam Bawang

Common Values

Value | Count | Frequency (%)
Beef | 7 | 0.3%
Yakisoba | 6 | 0.3%
Vegetable | 6 | 0.3%
Artificial Chicken | 6 | 0.3%
Miso Ramen | 5 | 0.2%
Artificial Beef Flavor | 4 | 0.2%
Artificial Spicy Beef | 4 | 0.2%
Chicken | 4 | 0.2%
Imitation Chicken Vegetarian | 3 | 0.1%
Soy Sauce | 3 | 0.1%
Other values (1935) | 2016 | 97.7%

Length

Histogram of lengths of the category

Value | Count | Frequency (%)
noodles | 550 | 5.7%
noodle | 411 | 4.3%
instant | 364 | 3.8%
flavour | 323 | 3.3%
ramen | 274 | 2.8%
chicken | 267 | 2.8%
flavor | 258 | 2.7%
spicy | 213 | 2.2%
beef | 188 | 1.9%
soup | 157 | 1.6%
Other values (1278) | 6654 | 68.9%

Most occurring characters

ValueCountFrequency (%)
No values found.

Most occurring categories

ValueCountFrequency (%)
No values found.

Most frequent character per category

Most occurring scripts

ValueCountFrequency (%)
No values found.

Most frequent character per script

Most occurring blocks

ValueCountFrequency (%)
No values found.

Most frequent character per block

Style
Categorical

Distinct: 7
Distinct (%): 0.3%
Missing: 2
Missing (%): 0.1%
Memory size: 16.2 KiB

Pack: 1247
Bowl: 366
Cup: 358
Tray: 84
Box: 5
Other values (2): 2

Length

Max length: 4
Median length: 4
Mean length: 3.822987391
Min length: 3

Characters and Unicode

Total characters: 0
Distinct characters: 0
Distinct categories: 0
Distinct scripts: 0
Distinct blocks: 0
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 2
Unique (%): 0.1%

Sample

1st row: Cup
2nd row: Bowl
3rd row: Pack
4th row: Pack
5th row: Pack

Common Values

Value | Count | Frequency (%)
Pack | 1247 | 60.4%
Bowl | 366 | 17.7%
Cup | 358 | 17.3%
Tray | 84 | 4.1%
Box | 5 | 0.2%
Can | 1 | < 0.1%
Bar | 1 | < 0.1%
(Missing) | 2 | 0.1%

Length

Histogram of lengths of the category

Pie chart

Value | Count | Frequency (%)
pack | 1247 | 60.5%
bowl | 366 | 17.7%
cup | 358 | 17.4%
tray | 84 | 4.1%
box | 5 | 0.2%
can | 1 | < 0.1%
bar | 1 | < 0.1%

Most occurring characters

ValueCountFrequency (%)
No values found.

Most occurring categories

ValueCountFrequency (%)
No values found.

Most frequent character per category

Most occurring scripts

ValueCountFrequency (%)
No values found.

Most frequent character per script

Most occurring blocks

ValueCountFrequency (%)
No values found.

Most frequent character per block

Country
Categorical

HIGH CORRELATION

Distinct: 38
Distinct (%): 1.8%
Missing: 0
Missing (%): 0.0%
Memory size: 16.2 KiB

Japan: 269
USA: 258
South Korea: 243
Taiwan: 186
Thailand: 162
Other values (33): 946

Length

Max length: 13
Median length: 7
Mean length: 6.861434109
Min length: 2

Characters and Unicode

Total characters: 0
Distinct characters: 0
Distinct categories: 0
Distinct scripts: 0
Distinct blocks: 0
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 3
Unique (%): 0.1%

Sample

1st row: Hong Kong
2nd row: Japan
3rd row: Japan
4th row: USA
5th row: Indonesia

Common Values

Value | Count | Frequency (%)
Japan | 269 | 13.0%
USA | 258 | 12.5%
South Korea | 243 | 11.8%
Taiwan | 186 | 9.0%
Thailand | 162 | 7.8%
China | 130 | 6.3%
Malaysia | 120 | 5.8%
Hong Kong | 110 | 5.3%
Indonesia | 104 | 5.0%
Singapore | 92 | 4.5%
Other values (28) | 390 | 18.9%

Length

Histogram of lengths of the category

Value | Count | Frequency (%)
japan | 269 | 11.1%
usa | 258 | 10.7%
south | 243 | 10.0%
korea | 243 | 10.0%
taiwan | 186 | 7.7%
thailand | 162 | 6.7%
china | 130 | 5.4%
malaysia | 120 | 5.0%
hong | 110 | 4.5%
kong | 110 | 4.5%
Other values (31) | 587 | 24.3%

Most occurring characters

ValueCountFrequency (%)
No values found.

Most occurring categories

ValueCountFrequency (%)
No values found.

Most frequent character per category

Most occurring scripts

ValueCountFrequency (%)
No values found.

Most frequent character per script

Most occurring blocks

ValueCountFrequency (%)
No values found.

Most frequent character per block

Stars
Categorical

Distinct: 39
Distinct (%): 1.9%
Missing: 0
Missing (%): 0.0%
Memory size: 16.2 KiB

4: 316
5: 302
3.75: 275
3.5: 268
3.25: 145
Other values (34): 758

Length

Max length: 7
Median length: 3
Mean length: 2.523255814
Min length: 1

Characters and Unicode

Total characters: 0
Distinct characters: 0
Distinct categories: 0
Distinct scripts: 0
Distinct blocks: 0
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 10
Unique (%): 0.5%

Sample

1st row: 4.5
2nd row: 4
3rd row: 4
4th row: 4.5
5th row: 3.75

Common Values

Value | Count | Frequency (%)
4 | 316 | 15.3%
5 | 302 | 14.6%
3.75 | 275 | 13.3%
3.5 | 268 | 13.0%
3.25 | 145 | 7.0%
3 | 143 | 6.9%
4.5 | 110 | 5.3%
4.25 | 105 | 5.1%
2.75 | 72 | 3.5%
4.75 | 53 | 2.6%
Other values (29) | 275 | 13.3%

Length

Histogram of lengths of the category

Value | Count | Frequency (%)
4 | 316 | 15.3%
5 | 302 | 14.6%
3.75 | 275 | 13.3%
3.5 | 268 | 13.0%
3.25 | 145 | 7.0%
3 | 143 | 6.9%
4.5 | 110 | 5.3%
4.25 | 105 | 5.1%
2.75 | 72 | 3.5%
4.75 | 53 | 2.6%
Other values (29) | 275 | 13.3%

Most occurring characters

ValueCountFrequency (%)
No values found.

Most occurring categories

ValueCountFrequency (%)
No values found.

Most frequent character per category

Most occurring scripts

ValueCountFrequency (%)
No values found.

Most frequent character per script

Most occurring blocks

ValueCountFrequency (%)
No values found.

Most frequent character per block

Top Ten
Categorical

HIGH CORRELATION
MISSING
UNIFORM

Distinct: 31
Distinct (%): 96.9%
Missing: 2032
Missing (%): 98.4%
Memory size: 16.2 KiB

(blank): 2
2012 #7: 1
2014 #5: 1
2012 #3: 1
2014 #7: 1
Other values (26): 26

Length

Max length: 8
Median length: 7
Mean length: 6.71875
Min length: 1

Characters and Unicode

Total characters: 2
Distinct characters: 1
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 30
Unique (%): 93.8%

Sample

1st row: 2016 #7
2nd row: 2013 #4
3rd row: 2014 #4
4th row: 2014 #6
5th row: 2013 #1

Common Values

Value | Count | Frequency (%)
(blank) | 2 | 0.1%
2012 #7 | 1 | < 0.1%
2014 #5 | 1 | < 0.1%
2012 #3 | 1 | < 0.1%
2014 #7 | 1 | < 0.1%
2012 #4 | 1 | < 0.1%
2013 #2 | 1 | < 0.1%
2013 #3 | 1 | < 0.1%
2015 #8 | 1 | < 0.1%
2012 #1 | 1 | < 0.1%
Other values (21) | 21 | 1.0%
(Missing) | 2032 | 98.4%

Length

Histogram of lengths of the category

Value | Count | Frequency (%)
2012 | 9 | 15.0%
2014 | 7 | 11.7%
2015 | 6 | 10.0%
2013 | 5 | 8.3%
7 | 4 | 6.7%
4 | 4 | 6.7%
1 | 4 | 6.7%
8 | 3 | 5.0%
6 | 3 | 5.0%
10 | 3 | 5.0%
Other values (5) | 12 | 20.0%

Most occurring characters

Value | Count | Frequency (%)
(blank) | 2 | 100.0%

Most occurring categories

Value | Count | Frequency (%)
Control | 2 | 100.0%

Most frequent character per category

Control
Value | Count | Frequency (%)
(blank) | 2 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
Common | 2 | 100.0%

Most frequent character per script

Common
Value | Count | Frequency (%)
(blank) | 2 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
ASCII | 2 | 100.0%

Most frequent character per block

ASCII
Value | Count | Frequency (%)
(blank) | 2 | 100.0%

Interactions


Correlations


Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
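
The calculation described above amounts to Pearson's r applied to ranks. A minimal pure-Python sketch of that idea, assuming average ranks for ties; the helper names and the sample data are illustrative and are not taken from pandas-profiling:

```python
from math import sqrt

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: y = x**3 is monotonic but not linear.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
rho = spearman(x, y)
r = pearson(x, y)
print(rho, r)
```

On this data ρ is 1 up to floating-point rounding while r stays below 1, which is the sense in which ρ catches nonlinear monotonic relationships.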

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
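
A small sketch of this formula, together with a check of the invariance property mentioned above. The function name and the sample data are illustrative assumptions, not the report's implementation:

```python
from math import sqrt

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 3.0, 7.0, 8.0, 15.0]

r = pearson(x, y)
# Shift and rescale each variable separately (positive scale factors).
r_transformed = pearson([3 * a + 10 for a in x], [0.5 * b - 4 for b in y])
print(r, r_transformed)
```

The two values agree up to rounding. A negative scale factor on one variable would flip the sign of r but not its magnitude.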

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the discordant pairs divided by the total number of pairs.
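
The pair-counting definition above can be sketched directly. This is the plain form, often called tau-a, which divides by the total number of pairs; library implementations may use a tie-corrected variant, so treat this as an illustration rather than the report's exact method. The function name and data are hypothetical:

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs; ties count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical data: two adjacent swaps in an otherwise increasing sequence.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau_a(x, y)
print(tau)
```

Here 8 of the 10 pairs are concordant and 2 are discordant, so τ = (8 - 2) / 10.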

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use a bias-corrected measure proposed by Bergsma in 2013.
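
A sketch of the bias-corrected estimator attributed to Bergsma, written from the commonly cited formula rather than from the pandas-profiling source, so the details should be read as an assumption. The function name and the sample data are hypothetical:

```python
from collections import Counter
from math import sqrt

def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equal-length sequences of labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias corrections for phi^2 and for the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0

# Hypothetical labels: the second column is fully determined by the first.
style = ["Pack", "Cup", "Bowl"] * 20
group = ["A", "B", "C"] * 20
v_perfect = cramers_v_corrected(style, group)

# Hypothetical labels: every combination occurs equally often.
left = ["a", "a", "b", "b"] * 5
right = ["u", "v", "u", "v"] * 5
v_independent = cramers_v_corrected(left, right)
print(v_perfect, v_independent)
```

The first pair yields a value of 1 up to rounding and the second yields 0, matching the two ends of the range described above.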

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. There is extensive documentation available here.

Missing values

A simple visualization of nullity by column.
Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.
The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap.

Sample

First rows

Index | Review # | Brand | Variety | Style | Country | Stars | Top Ten
0 | 284 | Nissin | Cup Noodles Seafood | Cup | Hong Kong | 4.5 | NaN
1 | 976 | Nissin | Donbei Tensoba | Bowl | Japan | 4 | NaN
2 | 74 | Nissin | Demae Ramen Shoyu | Pack | Japan | 4 | NaN
3 | 425 | Chikara | Shrimp Udon | Pack | USA | 4.5 | NaN
4 | 870 | SuperMi | Mi Keriting Rasa Ayam Bawang | Pack | Indonesia | 3.75 | NaN
5 | 1503 | Nongshim | Bowl Noodle Soup Shrimp Habanero Lime Flavor | Bowl | USA | 3.25 | NaN
6 | 647 | Sunlee | Tom Yum Shrimp Noodle | Bowl | Thailand | 3.5 | NaN
7 | 2254 | Nissin | Disney Cuties Instant Noodle Seaweed Flavour | Cup | Thailand | 3 | NaN
8 | 1181 | Samyang Foods | Star Popeye Ramyun Snack | Pack | South Korea | 4 | NaN
9 | 1211 | Nissin | Demae Iccho Instant Noodle With Soup Base Artificial Chicken Flavour | Bowl | Hong Kong | 3 | NaN

Last rows

Index | Review # | Brand | Variety | Style | Country | Stars | Top Ten
2054 | 2533 | Nongshim | Shin Ramyun Black | Pack | South Korea | 5 | NaN
2055 | 419 | Myojo | Ramen Desse Shio | Bowl | Japan | 4.25 | NaN
2056 | 2484 | Nissin | Demae Ramen Tokyo Soy Sauce | Pack | Germany | 4 | NaN
2057 | 819 | Maruchan | Tempura Soba | Pack | Japan | 4 | NaN
2058 | 987 | Trident | Singapore Soft Noodles | Pack | Australia | 2.75 | NaN
2059 | 1433 | Maggi | 2 Minute Noodles Curry Flavour | Pack | Singapore | 3.75 | NaN
2060 | 426 | Vifon | Tu quy Chicken | Pack | Vietnam | 3 | NaN
2061 | 814 | Indomie | Beef | Pack | Indonesia | 3.5 | NaN
2062 | 1458 | Nissin | Premium Instant Noodles Roasted Beef Flavour | Bowl | Singapore | 3.75 | NaN
2063 | 1234 | Sainsbury's | Barbecue Beef Flavour Instant Noodles | Pack | UK | 2.75 | NaN
\ No newline at end of file
diff --git a/notebooks/01-ams-inital_data_exploration.ipynb b/notebooks/01-ams-inital_data_exploration.ipynb
index ed11d0a..2729d15 100644
--- a/notebooks/01-ams-inital_data_exploration.ipynb
+++ b/notebooks/01-ams-inital_data_exploration.ipynb
@@ -23,7 +23,7 @@
 "#rerouting to data folder for training dataset\n",
 "path = os.getcwd()\n",
 "\n",
- "path = path.replace('src', 'data')"
+ "path = path.replace('notebooks', 'data')"
 ]
 },
 {
@@ -421,7 +421,7 @@
 {
 "data": {
 "application/vnd.jupyter.widget-view+json": {
- "model_id": "f41fe588c15b4042b0048e93496ee58c",
+ "model_id": "e82d812f9e414a45bd70cec0ff90537c",
 "version_major": 2,
 "version_minor": 0
 },
@@ -435,7 +435,7 @@
 {
 "data": {
 "application/vnd.jupyter.widget-view+json": {
- "model_id": "2e5ac07d28544758abdbe69494e3e6f1",
+ "model_id": "9b7ac900806b4ff985e09327759bae95",
 "version_major": 2,
 "version_minor": 0
 },
@@ -449,7 +449,7 @@
 {
 "data": {
 "application/vnd.jupyter.widget-view+json": {
- "model_id": "4eb959a19019402981f313587ce8dae2",
+ "model_id": "de09abe410f5425ba61d2d258fe670e2",
 "version_major": 2,
 "version_minor": 0
 },
@@ -463,7 +463,7 @@
 {
 "data": {
 "text/html": [
- "