This repository contains the code and resources for a supervised machine learning project aimed at predicting bike share demand in Seoul, South Korea. The dataset used is SeoulBikeData.csv.
Bike share demand prediction is a critical aspect of urban transportation planning. This project focuses on using machine learning techniques to predict bike rental demand in Seoul, aiding in efficient resource allocation and city planning.
The dataset SeoulBikeData.csv is included in the 📁 data directory. It contains information about bike rentals, including weather conditions, temperature, humidity, and other relevant features.
The SeoulBikeData.csv file contains the following columns:
- Date: Year-Month-Day
- Rented Bike Count: Count of bikes rented at each hour
- Hour: Hour of the day
- Temperature: Temperature in Celsius
- Humidity: %
- Windspeed: m/s
- Visibility: in units of 10 m
- Dew point temperature: Celsius
- Solar radiation: MJ/m2
- Rainfall: mm
- Snowfall: cm
- Seasons: Winter, Spring, Summer, Autumn
- Holiday: Holiday/No holiday
- Functional Day: NonFunctional/Functional Day
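As a rough sketch of how the file might be loaded and inspected with Pandas, the snippet below reads a two-row sample in the column layout listed above. The sample values are made up for illustration, `load_bike_data` is a hypothetical helper, and the header names are assumed to match the list exactly; the real file may use different headers or need an explicit `encoding` argument to `read_csv`.

```python
import io

import pandas as pd

# Two made-up rows in the column layout described above (illustrative values only).
SAMPLE_CSV = """Date,Rented Bike Count,Hour,Temperature,Humidity,Seasons,Holiday,Functional Day
2018-01-01,100,0,-2.0,40,Winter,No holiday,Functional Day
2018-01-01,80,1,-2.5,42,Winter,No holiday,Functional Day
"""


def load_bike_data(source):
    """Hypothetical helper: read the rental CSV from a path or file-like object."""
    return pd.read_csv(source)


# With the real data this would be a path to SeoulBikeData.csv in the data directory.
df = load_bike_data(io.StringIO(SAMPLE_CSV))

print(df.dtypes)  # column types as inferred by Pandas
print(df.head())  # first rows, to check that the columns parsed as expected
```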
The project is developed using Python and relies on the following libraries:
- NumPy
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
The project involves the following steps:
- Data Cleaning and Preparation
- Exploratory Data Analysis
- Visualization and Insights
- Hypothesis Testing
- Feature Engineering & Data Pre-processing
- ML Model Training, Implementation and Evaluation
The first step in this project involves cleaning and preparing the data. This includes checking for missing data, removing duplicates, and converting data types. Some of the specific tasks involved in this step include:
- Handling missing data
- Removing duplicates
- Converting data types
- Time series analysis
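The cleaning tasks above could look roughly like the following in Pandas. This is a minimal sketch on a small made-up frame, not the project's actual notebook code; the column names and the Year-Month-Day date format are assumptions taken from the column list.

```python
import numpy as np
import pandas as pd

# Made-up rows: one exact duplicate and one missing rental count.
raw = pd.DataFrame({
    "Date": ["2018-01-01", "2018-01-01", "2018-01-01", "2018-01-02"],
    "Hour": [0, 1, 1, 0],
    "Rented Bike Count": [120, 95, 95, np.nan],
})

# Handling missing data: count missing values per column, then drop incomplete rows.
print(raw.isna().sum())

# Removing duplicates: keep the first occurrence of each identical row.
clean = raw.drop_duplicates().copy()
clean = clean.dropna(subset=["Rented Bike Count"])

# Converting data types: parse dates and store counts as integers.
clean["Date"] = pd.to_datetime(clean["Date"], format="%Y-%m-%d")
clean["Rented Bike Count"] = clean["Rented Bike Count"].astype(int)

# Time series analysis usually starts from calendar features derived from the date.
clean["Month"] = clean["Date"].dt.month
clean["Weekday"] = clean["Date"].dt.day_name()

print(clean)
```

Dropping rows with missing values is only one option; imputation may be a better fit depending on how much data is affected.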
The next step in the project is to conduct exploratory data analysis. This involves examining the data to understand its distribution, central tendencies, and correlations between variables.
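A first pass at this analysis might use summary statistics and a correlation matrix, as sketched below. The data is synthetic and generated only so the snippet runs on its own; with the real dataset, `demo` would be replaced by the cleaned DataFrame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration, not the real dataset).
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(15, 10, n)
demo = pd.DataFrame({
    "Temperature": temperature,
    "Humidity": rng.uniform(20, 90, n),
    "Rented Bike Count": np.clip(500 + 20 * temperature + rng.normal(0, 50, n), 0, None),
})

# Distribution and central tendencies: mean, median (50%), and spread per column.
summary = demo.describe()
print(summary.loc[["mean", "50%", "std"]])

# Pairwise Pearson correlations, sorted by strength against the target.
corr = demo.corr()
print(corr["Rented Bike Count"].sort_values(ascending=False))
```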
Hypothesis testing is a statistical method used to make inferences about a population based on a sample of data. To test a hypothesis on the SeoulBikeData.csv dataset, we start with a null hypothesis (H0) and an alternative hypothesis (H1), then use a statistical test to determine whether there is enough evidence to reject the null hypothesis in favor of the alternative.
The general steps for hypothesis testing on a dataset like SeoulBikeData.csv are:
- Define the Hypotheses
- Choose a Significance Level (α)
- Select the Test
- Perform the Test
- Analyze the Results
- Draw Conclusions
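The steps above can be sketched with a two-sample test. The example below asks whether mean hourly rentals differ between holidays and non-holidays, using Welch's t-test from SciPy. Both the hypothesis and the test choice are illustrative assumptions, SciPy is not in the library list above (it is installed as a dependency of Scikit-learn), and the samples are synthetic.

```python
import numpy as np
from scipy import stats

# Synthetic samples standing in for hourly rental counts (illustrative only).
rng = np.random.default_rng(1)
holiday = rng.normal(600, 100, 100)
non_holiday = rng.normal(700, 100, 400)

# 1. Define the hypotheses.
#    H0: mean rentals are equal on holidays and non-holidays.
#    H1: mean rentals differ.
# 2. Choose a significance level.
alpha = 0.05

# 3-4. Select and perform the test. Welch's t-test does not assume equal variances.
t_stat, p_value = stats.ttest_ind(holiday, non_holiday, equal_var=False)

# 5-6. Analyze the result and draw a conclusion.
reject_h0 = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```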
Feature engineering and data pre-processing cover the following tasks:
- Handling Missing Values
- Handling Outliers
- Label Encoding
- Textual Data Preprocessing
- Feature Manipulation & Selection
- Data Transformation
- Data Scaling
- Dimensionality Reduction
- Data Splitting
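A few of these tasks (encoding, scaling, and splitting) can be combined with Scikit-learn as sketched below. The data is synthetic and the column names are assumptions based on the column list. Note that the sketch uses one-hot encoding for the categorical column rather than the label encoding named above, since Scikit-learn's `LabelEncoder` is intended for target labels.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(42)
n = 100
data = pd.DataFrame({
    "Temperature": rng.normal(15, 10, n),
    "Humidity": rng.uniform(20, 90, n),
    "Seasons": rng.choice(["Winter", "Spring", "Summer", "Autumn"], n),
    "Rented Bike Count": rng.integers(0, 2000, n),
})

X = data.drop(columns="Rented Bike Count")
y = data["Rented Bike Count"]

# Data splitting: hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Temperature", "Humidity"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Seasons"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training split only, so no test information leaks into the scaler.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the transformer on the training split and reusing it on the test split keeps the evaluation honest.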
The dependent variable, Rented Bike Count, is continuous, so regression algorithms are used to train models that predict it.
The models are trained with the following ML algorithms:
- Ridge Regression (Linear Regression + L2 Regularization)
- Decision Tree Regression
- Random Forest Regression
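Training and comparing the three regressors might look like the sketch below. The features and target are synthetic, and the hyperparameters (`alpha`, `max_depth`, `n_estimators`) are placeholder values, not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the prepared features and target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 5, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Placeholder hyperparameters; in practice these would be tuned.
models = {
    "Ridge Regression": Ridge(alpha=1.0),
    "Decision Tree Regression": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest Regression": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
    r2 = float(r2_score(y_test, predictions))
    results[name] = {"rmse": rmse, "r2": r2}
    print(f"{name}: RMSE = {rmse:.2f}, R2 = {r2:.3f}")
```

RMSE and R2 are used here as common regression metrics; they are an assumption, since the evaluation metrics are not stated above.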