This project predicts the quality of wines from their physicochemical features using linear regression. The dataset used for this project is sourced from Kaggle, a popular platform for data science competitions and datasets.
The dataset can be found on Kaggle under the name "Red Wine Quality". It contains a collection of red wine samples with their corresponding quality ratings, along with the chemical features that describe the properties of each sample. The dataset can be downloaded from the following link: Kaggle Dataset
The original data consists of two datasets, related to the red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be treated as either classification or regression tasks. The classes are ordered and imbalanced (e.g. there are many more normal wines than excellent or poor ones).
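A quick way to check the imbalance described above is to tabulate the share of each quality score. The sketch below is illustrative only: the `quality_shares` helper and the ten scores are made up for the example and are not taken from the dataset.

```python
import pandas as pd


def quality_shares(scores):
    """Return the fraction of samples at each quality score, ordered by score."""
    return pd.Series(scores).value_counts(normalize=True).sort_index()


# Made-up scores, used only to show the call; not real wine ratings.
example_scores = [3, 5, 5, 5, 5, 6, 6, 6, 7, 8]
shares = quality_shares(example_scores)
print(shares)
```

With the real data, the same call could be applied to the `quality` column of the loaded DataFrame.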
To run this project, the following dependencies are required:
- Python 3.6 or above
- NumPy
- Pandas
- Matplotlib
- Scikit-learn
You can install the required packages using pip:
pip install numpy pandas matplotlib scikit-learn
- Download the dataset from the provided Kaggle link and save it in the project directory.
- Run the Red_Wine_Quality.ipynb file, which contains the code for data preprocessing, model training, and evaluation.
- The notebook will load the dataset, preprocess the data, split it into training and testing sets, train a linear regression model, and evaluate its performance.
- After the model training is completed, it will predict the wine quality for the testing set and display the evaluation metrics such as mean absolute error, mean squared error, and R-squared score.
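The steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the `train_and_evaluate` helper, the 80/20 split, and the random seed are assumptions, no preprocessing is applied, and a small synthetic DataFrame stands in for the Kaggle file so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(df, target="quality", test_size=0.2, seed=42):
    """Fit a linear regression on every column except `target` and score it."""
    X = df.drop(columns=[target])
    y = df[target]
    # Hold out part of the data for testing (split ratio is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "mae": mean_absolute_error(y_test, y_pred),
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }


# Synthetic stand-in data so the sketch runs without the Kaggle file.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.uniform(8.0, 15.0, 200),
    "volatile acidity": rng.uniform(0.1, 1.2, 200),
})
demo["quality"] = (
    0.5 * demo["alcohol"]
    - 2.0 * demo["volatile acidity"]
    + rng.normal(0.0, 0.1, 200)
)
metrics = train_and_evaluate(demo)
print(metrics)
```

To use the real data, the synthetic frame could be replaced with `pd.read_csv(...)` on the downloaded CSV, assuming it has a `quality` column.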
Input variables (based on physicochemical tests):
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
Output variable (based on sensory data): quality (score between 0 and 10)
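As an illustration of how these variables map onto model inputs and output, the sketch below separates a DataFrame into features and target. It assumes the CSV column names match the variable names listed above; the `split_features_target` helper is hypothetical, and the single row is made up only to show the call.

```python
import pandas as pd

# Column names assumed to match the variable names listed above.
FEATURE_COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol",
]
TARGET_COLUMN = "quality"


def split_features_target(df):
    """Return (X, y) after checking the expected columns and score range."""
    missing = [c for c in FEATURE_COLUMNS + [TARGET_COLUMN] if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    y = df[TARGET_COLUMN]
    if not y.between(0, 10).all():
        raise ValueError("quality scores must lie between 0 and 10")
    return df[FEATURE_COLUMNS], y


# One made-up row, only to show the call; the values are not real measurements.
row = {name: 0.0 for name in FEATURE_COLUMNS}
row[TARGET_COLUMN] = 5
X, y = split_features_target(pd.DataFrame([row]))
print(X.shape, y.tolist())
```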
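Here are three common evaluation metrics for regression problems, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$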
Comparing these metrics:
- MAE is the easiest to understand because it’s the average absolute error.
- MSE is more popular than MAE because MSE “punishes” larger errors, which tends to be useful in the real world.
- RMSE is even more popular than MSE because RMSE is interpretable in the “y” units.
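The three metrics compared above can be computed directly with NumPy. This is a small sketch for illustration: the `regression_errors` helper is hypothetical and the scores are made up, chosen so that one large error shows how MSE and RMSE weight it more heavily than MAE does.

```python
import numpy as np


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))   # average absolute error
    mse = np.mean(errors ** 2)      # average squared error
    rmse = np.sqrt(mse)             # back in the units of y
    return mae, mse, rmse


# Made-up scores: errors of 0, 1, -1, and one large miss of 4.
mae, mse, rmse = regression_errors([5, 6, 5, 7], [5, 5, 6, 3])
print(mae, mse, rmse)
```

Here the single miss of 4 contributes 16 of the 18 total squared error, which is why the MSE (4.5) is much larger than the MAE (1.5).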
The results of the wine quality prediction are displayed after the model training and evaluation. These results include the evaluation metrics and can be used to assess the accuracy and performance of the linear regression model.
The dataset used in this project is subject to the licensing terms provided by Kaggle. Please refer to the dataset's license for more details.
Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request.
For any questions or inquiries, please contact [[email protected]].