Skip to content

Latest commit

 

History

History
1733 lines (1315 loc) · 34.1 KB

Readme.md

File metadata and controls

1733 lines (1315 loc) · 34.1 KB

Wine Quality Data Analysis

https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Citation Request: This dataset is public available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

  1. Title: Wine Quality

  2. Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

  3. Past Usage:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

  1. Relevant Information:

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are munch more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

  2. Number of Instances: red wine - 1599; white wine - 4898.

  3. Number of Attributes: 11 + output attribute

    Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

  4. Attribute information:

    For more information, read [Cortez et al., 2009].

    Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10)

  5. Missing Attribute Values: None

content of the current directory:

% ls
Readme.md                    winequality-red.csv
Wine Quality Analysis.ipynb  winequality-white.csv
Wine+Quality+Analysis.html   winequality.names.txt
# import libs
%matplotlib inline
import pandas as pd
import seaborn as sns; sns.set(style="whitegrid", palette="muted")
import numpy as np
import matplotlib.pyplot as plt

Create DataFrames for white and red wines

white_wine_df = pd.read_csv('winequality-white.csv', sep=";")
red_wine_df = pd.read_csv('winequality-red.csv', sep=";")

DataFrames for red and white wines combined

ww = white_wine_df.loc[:]
ww["color"] = "white"
rw = red_wine_df.loc[:]
rw["color"] = "red"
wine_df = pd.concat([ww, rw], ignore_index=True)

Data

white_wine_df.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality color
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6 white
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6 white
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6 white
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6 white
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6 white
red_wine_df.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality color
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 red
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5 red
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5 red
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6 red
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 red
assert white_wine_df.columns.all() == red_wine_df.columns.all()
",".join(list(white_wine_df.columns))
'fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color'

test for null values and check correct datatypes

assert white_wine_df.notnull().all().all()
white_wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
color                   4898 non-null object
dtypes: float64(11), int64(1), object(1)
memory usage: 497.5+ KB

no null values in white wine dataframe found

assert red_wine_df.notnull().all().all()
red_wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 13 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
color                   1599 non-null object
dtypes: float64(11), int64(1), object(1)
memory usage: 162.5+ KB

no null values in red wine dataframe found

All datatypes are numeric.

Build categoricals

# this can be crucial :)
white_wine_df["color"] = white_wine_df["color"].astype("category")
red_wine_df["color"] = red_wine_df["color"].astype("category")
wine_df["color"] = wine_df["color"].astype("category")

Means

White Wines:

white_wine_df.mean()
fixed acidity             6.854788
volatile acidity          0.278241
citric acid               0.334192
residual sugar            6.391415
chlorides                 0.045772
free sulfur dioxide      35.308085
total sulfur dioxide    138.360657
density                   0.994027
pH                        3.188267
sulphates                 0.489847
alcohol                  10.514267
quality                   5.877909
dtype: float64

Red Wines:

red_wine_df.mean()
fixed acidity            8.319637
volatile acidity         0.527821
citric acid              0.270976
residual sugar           2.538806
chlorides                0.087467
free sulfur dioxide     15.874922
total sulfur dioxide    46.467792
density                  0.996747
pH                       3.311113
sulphates                0.658149
alcohol                 10.422983
quality                  5.636023
dtype: float64

Differences between red and white wine means that are greater than 1.0

mean_diff = white_wine_df.mean() - red_wine_df.mean()
mean_diff_abs = mean_diff.apply(lambda x: abs(x))
mean_diff[mean_diff_abs >= 1.0]
fixed acidity           -1.464850
residual sugar           3.852609
free sulfur dioxide     19.433163
total sulfur dioxide    91.892865
dtype: float64

Distribution of Quality

sns.countplot(data=white_wine_df, x="quality")
sns.plt.title("White Wines")
<matplotlib.text.Text at 0x1112b1ac8>

png

white_wine_df.quality.describe()
count    4898.000000
mean        5.877909
std         0.885639
min         3.000000
25%         5.000000
50%         6.000000
75%         6.000000
max         9.000000
Name: quality, dtype: float64
x = sns.countplot(data=red_wine_df, x="quality")
sns.plt.title("Red Wines")
<matplotlib.text.Text at 0x11144de80>

png

red_wine_df.quality.describe()
count    1599.000000
mean        5.636023
std         0.807569
min         3.000000
25%         5.000000
50%         6.000000
75%         6.000000
max         8.000000
Name: quality, dtype: float64

What may be important for a high quality rating?

To find out, the percentual mean differences for low quality to high quality wines over the total mean are calculated, resulting in percentual changes.

For white wines:

x = white_wine_df.groupby(["quality"]).mean()
lower_quals = x.loc[:4].mean()
higher_quals = x.loc[7:].mean()
ww_perc_means = (higher_quals - lower_quals) / white_wine_df.mean() * 100
ww_perc_means
alcohol                 14.068993
chlorides              -38.372538
citric acid              7.758984
density                 -0.254610
fixed acidity           -6.235608
free sulfur dioxide    -10.177344
pH                       1.934262
quality                       NaN
residual sugar          -8.100321
sulphates                1.999155
total sulfur dioxide   -18.439304
volatile acidity       -27.979143
dtype: float64

Comparing low quality means to high quality ones, the following attributes differ more than 5 per cent:

ww_perc_means[abs(ww_perc_means) > 5]
alcohol                 14.068993
chlorides              -38.372538
citric acid              7.758984
fixed acidity           -6.235608
free sulfur dioxide    -10.177344
residual sugar          -8.100321
total sulfur dioxide   -18.439304
volatile acidity       -27.979143
dtype: float64

Comparing low quality means to high quality ones, the following attributes differ more than 10 per cent:

ww_perc_means[abs(ww_perc_means) > 10]
alcohol                 14.068993
chlorides              -38.372538
free sulfur dioxide    -10.177344
total sulfur dioxide   -18.439304
volatile acidity       -27.979143
dtype: float64

For red wines:

x = red_wine_df.groupby(["quality"]).mean()
lower_quals = x.loc[:4].mean()
higher_quals = x.loc[7:].mean()
rw_perc_means = (higher_quals - lower_quals) / red_wine_df.mean() * 100
rw_perc_means
alcohol                 16.023546
chlorides              -38.955960
citric acid             77.707371
density                 -0.134937
fixed acidity            7.811538
free sulfur dioxide     12.783852
pH                      -3.345302
quality                       NaN
residual sugar          -0.609713
sulphates               26.028988
total sulfur dioxide     7.875629
volatile acidity       -71.161438
dtype: float64

Comparing low quality means to high quality ones, the following attributes differ more than 5 per cent:

rw_perc_means[abs(rw_perc_means) > 5]
alcohol                 16.023546
chlorides              -38.955960
citric acid             77.707371
fixed acidity            7.811538
free sulfur dioxide     12.783852
sulphates               26.028988
total sulfur dioxide     7.875629
volatile acidity       -71.161438
dtype: float64

Comparing low quality means to high quality ones, the following attributes differ more than 10 per cent:

rw_perc_means[abs(rw_perc_means) > 10]
alcohol                16.023546
chlorides             -38.955960
citric acid            77.707371
free sulfur dioxide    12.783852
sulphates              26.028988
volatile acidity      -71.161438
dtype: float64

What will be taken a closer look at:

  • Alcohol
  • Chlorides
  • Citric Acid
  • Sulphates
  • Sulfur Dioxides
  • Volatile Acidity

Sulfur Dioxides and Quality

White Wines

fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(14.5, 4.5)
fig.dpi = 300
sns.stripplot(data=white_wine_df, x="quality", y="total sulfur dioxide", jitter=True, ax=ax1)
sns.stripplot(data=white_wine_df, x="quality", y="free sulfur dioxide", jitter=True, ax=ax2)
<matplotlib.axes._subplots.AxesSubplot at 0x114f6d4e0>

png

high_qual_ww_tsd_mean = white_wine_df[white_wine_df["quality"] >= 7]["total sulfur dioxide"].mean()
high_qual_ww_tsd_mean = format(high_qual_ww_tsd_mean, '.1f')
print(f"The mean for higher quality white wines (quality >= 7) is {high_qual_ww_tsd_mean}")
The mean for higher quality white wines (quality >= 7) is 125.2

Interpretation White Wines

Both plots show, that higher quality white wines tend to have less total sulfur dioxide in it.

red wine

fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(14.5, 4.5)
fig.dpi = 300
sns.stripplot(data=red_wine_df, x="quality", y="total sulfur dioxide", jitter=True, ax=ax1)
sns.stripplot(data=red_wine_df, x="quality", y="free sulfur dioxide", jitter=True, ax=ax2)
<matplotlib.axes._subplots.AxesSubplot at 0x114e00cc0>

png

high_qual_rw_tsd_mean = red_wine_df[red_wine_df["quality"] >= 7]["total sulfur dioxide"].mean()
high_qual_rw_tsd_mean = format(high_qual_rw_tsd_mean, '.1f')
print(f"The mean for higher quality red wines (quality >= 7) is {high_qual_rw_tsd_mean}")
The mean for higher quality red wines (quality >= 7) is 34.9

Interpretation Red Wines

For the red wines, there are much lower concentrations of sulfur dioxides. Additionally, there seems to be no direct correlation between sulfur dioxide concentration and percepted quality.

Conclusion: Sulfur Dioxides and Quality

Regarding high quality white wines (>= 7), those wines have a mean of sulfur dioxides of around 125. Respectively high quality Red Wines (>=7) have a mean concentration of sulfur dioxide of 35.

Sulphates and Quality

sns.stripplot(data=wine_df, x="quality", y="sulphates", jitter=True, hue="color", split=True)
<matplotlib.axes._subplots.AxesSubplot at 0x117849240>

png

Alcohol in Wine

White Wine

white_wine_df.groupby("quality")["alcohol"].describe()
count mean std min 25% 50% 75% max
quality
3 20.0 10.345000 1.224089 8.0 9.55 10.45 11.00 12.6
4 163.0 10.152454 1.003217 8.4 9.40 10.10 10.75 13.5
5 1457.0 9.808840 0.847065 8.0 9.20 9.50 10.30 13.6
6 2198.0 10.575372 1.147776 8.5 9.60 10.50 11.40 14.0
7 880.0 11.367936 1.246536 8.6 10.60 11.40 12.30 14.2
8 175.0 11.636000 1.280138 8.5 11.00 12.00 12.60 14.0
9 5.0 12.180000 1.013410 10.4 12.40 12.50 12.70 12.9

Red Wine

red_wine_df.groupby("quality")["alcohol"].describe()
count mean std min 25% 50% 75% max
quality
3 10.0 9.955000 0.818009 8.4 9.725 9.925 10.575 11.0
4 53.0 10.265094 0.934776 9.0 9.600 10.000 11.000 13.1
5 681.0 9.899706 0.736521 8.5 9.400 9.700 10.200 14.9
6 638.0 10.629519 1.049639 8.4 9.800 10.500 11.300 14.0
7 199.0 11.465913 0.961933 9.2 10.800 11.500 12.100 14.0
8 18.0 12.094444 1.224011 9.8 11.325 12.150 12.875 14.0

Plotting Alcohol to Quality

sns.barplot(data=wine_df, x="quality", y="alcohol", hue="color")
<matplotlib.axes._subplots.AxesSubplot at 0x114e342e8>

png

sns.lmplot(data=wine_df, x="quality", y="alcohol", hue="color")
<seaborn.axisgrid.FacetGrid at 0x118bd3160>

png

Alcohol to Quality relation for Wines equal or greater than 7

hq_wines = wine_df[wine_df.quality >= 7]
sns.lmplot(data=hq_wines, x="quality", y="alcohol", hue="color")
sns.plt.title("Quality >= 7")
<matplotlib.text.Text at 0x118fe82e8>

png

Heatmap Alcohol to Quality

heat_table = wine_df[["quality", "alcohol"]].copy()
heat_table["alcohol"] = heat_table.alcohol.apply(func=lambda x: round(x * 2) / 2)
heat_table = heat_table.groupby(["quality", "alcohol"])["alcohol"].count().reset_index(name='counts')
sns.heatmap(heat_table.pivot("quality", "alcohol", "counts"))
<matplotlib.axes._subplots.AxesSubplot at 0x118e0eb70>

png

Chlorides

sns.barplot(data=wine_df, hue="color", x="quality", y="chlorides")
<matplotlib.axes._subplots.AxesSubplot at 0x1192a8c18>

png

The less chlorides in a wine the higher the quality.

Chlorides and Alcohol

g = sns.PairGrid(wine_df[["alcohol", "chlorides", "quality"]], hue="quality")
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
g = g.add_legend()

png

Acids

sns.pairplot(white_wine_df[["volatile acidity", "citric acid", "quality"]], hue="quality")
<seaborn.axisgrid.PairGrid at 0x1194682b0>

png

Bringing the relevant attributes together

sns.pairplot(white_wine_df[["volatile acidity", "citric acid", "quality", "free sulfur dioxide", "chlorides"]], hue="quality")
<seaborn.axisgrid.PairGrid at 0x11a544128>

png

white_wine_df[["volatile acidity", "citric acid", "quality"]].groupby("quality").describe(percentiles=[])
citric acid volatile acidity
count mean std min 50% max count mean std min 50% max
quality
3 20.0 0.336000 0.081460 0.21 0.345 0.47 20.0 0.333250 0.140827 0.17 0.26 0.640
4 163.0 0.304233 0.163857 0.00 0.290 0.88 163.0 0.381227 0.173463 0.11 0.32 1.100
5 1457.0 0.337653 0.140814 0.00 0.320 1.00 1457.0 0.302011 0.100066 0.10 0.28 0.905
6 2198.0 0.338025 0.119325 0.00 0.320 1.66 2198.0 0.260564 0.088142 0.08 0.25 0.965
7 880.0 0.325625 0.079183 0.01 0.310 0.74 880.0 0.262767 0.091106 0.08 0.25 0.760
8 175.0 0.326514 0.085439 0.04 0.320 0.74 175.0 0.277400 0.108029 0.12 0.26 0.660
9 5.0 0.386000 0.082037 0.29 0.360 0.49 5.0 0.298000 0.057619 0.24 0.27 0.360
white_wine_df[["quality", "free sulfur dioxide", "chlorides"]].groupby("quality").describe(percentiles=[])
chlorides free sulfur dioxide
count mean std min 50% max count mean std min 50% max
quality
3 20.0 0.054300 0.046468 0.022 0.041 0.244 20.0 53.325000 69.420776 5.0 33.5 289.0
4 163.0 0.050098 0.025888 0.013 0.046 0.290 163.0 23.358896 20.391349 3.0 18.0 138.5
5 1457.0 0.051546 0.026496 0.009 0.047 0.346 1457.0 36.432052 18.145991 2.0 35.0 131.0
6 2198.0 0.045217 0.020453 0.015 0.043 0.255 2198.0 35.650591 15.735679 3.0 34.0 112.0
7 880.0 0.038191 0.010697 0.012 0.037 0.135 880.0 34.125568 13.244737 5.0 33.0 108.0
8 175.0 0.038314 0.013164 0.014 0.036 0.121 175.0 36.720000 16.203675 6.0 35.0 105.0
9 5.0 0.027400 0.007436 0.018 0.031 0.035 5.0 33.400000 13.427584 24.0 28.0 57.0

Final Conclusion

no attribute alone is strong enough to define a high quality wine, but as the figures show. For a wine to score high, having the acids and sulfur dioxide values all within in a certain range can help.

Best vs. Worst

when comparing the best (8,9) vs. worst (3,4) we can see that they well overlap each other.

qual3 = white_wine_df[white_wine_df["quality"] == 3]
qual4 = white_wine_df[white_wine_df["quality"] == 4].copy()
qual4.quality = 3
qual8 = white_wine_df[white_wine_df["quality"] == 8]
qual9 = white_wine_df[white_wine_df["quality"] == 9].copy()
qual9.quality = 8
white_wine_sample = pd.concat([qual3, qual4, qual8, qual9], ignore_index=True)
white_wine_sample
sns.pairplot(white_wine_sample[["volatile acidity", "citric acid", "quality"]], hue="quality")
<seaborn.axisgrid.PairGrid at 0x11cc89ef0>

png

sns.pairplot(white_wine_sample[["volatile acidity", "citric acid", "quality", "sulphates", "chlorides"]], hue="quality")
<seaborn.axisgrid.PairGrid at 0x11d07cb00>

png

Seems like labs can't measure a wine's inner spirit (yet).

But if you have to pick a wine only based on specs, i would suggest white wines close to this values:

qual8[["quality", "chlorides", "alcohol", "citric acid", "sulphates"]].describe(percentiles=[])
quality chlorides alcohol citric acid sulphates
count 175.0 175.000000 175.000000 175.000000 175.000000
mean 8.0 0.038314 11.636000 0.326514 0.486229
std 0.0 0.013164 1.280138 0.085439 0.147073
min 8.0 0.014000 8.500000 0.040000 0.250000
50% 8.0 0.036000 12.000000 0.320000 0.460000
max 8.0 0.121000 14.000000 0.740000 0.950000

and red wines close to this values:

rqual8 = red_wine_df[red_wine_df["quality"] >= 8]
rqual8[["quality", "chlorides", "alcohol", "citric acid", "sulphates"]].describe(percentiles=[])
quality chlorides alcohol citric acid sulphates
count 18.0 18.000000 18.000000 18.000000 18.000000
mean 8.0 0.068444 12.094444 0.391111 0.767778
std 0.0 0.011678 1.224011 0.199526 0.115379
min 8.0 0.044000 9.800000 0.030000 0.630000
50% 8.0 0.070500 12.150000 0.420000 0.740000
max 8.0 0.086000 14.000000 0.720000 1.100000