Principles of Data Analytics Project: Housing Prices Prediction

Problem Statements

Predicting housing prices is a critical task for understanding market trends and making informed investment decisions. This project uses nominal and rent price data across different locations to build a model that forecasts housing prices.

Project Overview

This project leverages data preprocessing, feature scaling, and machine learning techniques to predict housing prices using nominal and rent data. The dataset includes prices across multiple locations and time periods, with a focus on data cleaning, handling missing values, and performing exploratory data analysis (EDA). Various models, including clustering with k-means, are employed to uncover insights about housing price trends.

Key features of the project include:

Handling and cleaning of real-world data
Time series data analysis for price prediction
K-means clustering for identifying market patterns
Visualizations including bar plots, box plots, and correlation plots for better data understanding

This project can be used by real estate professionals, data scientists, or analysts who aim to gain insights into housing price trends and build predictive models.

Key Features

Data Preprocessing: Removal of irrelevant columns, handling missing data, and feature scaling.
Exploratory Data Analysis (EDA): Visualizations using ggplot2, box plots, and correlation analysis.
Machine Learning: K-means clustering for identifying patterns in housing prices across different locations and time.
Time Series Analysis: Focusing on quarterly data to understand trends and predict future values.

Technologies Used

R Programming Language for data processing and visualization
ggplot2 for data visualization
caTools for data splitting
Hmisc and corrplot for correlation analysis
k-means clustering for unsupervised learning

Data Preparation

Data Preparation and Preprocessing

Original Dataset:

Figure 1: Original dataset
Dataframe That Has Undergone Preprocessing:

Figure 2: Dataframe that has undergone preprocessing

Figure 1 shows the data read from the CSV file and stored in the dataframe df. It contains 698 entries and 8 columns. Figure 2 shows df after preprocessing; it now has 90 entries and 10 columns.
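The preprocessing steps named in this README (removing irrelevant columns, handling missing data, feature scaling) can be sketched as follows. This is a minimal illustration on a toy data frame, not the project's actual script: the column names, the choice to drop incomplete rows, and the use of scale() are all assumptions.

```r
# Toy stand-in for the raw CSV; every column name here is hypothetical.
raw <- data.frame(
  id       = 1:6,
  location = c("A", "A", "B", "B", "C", "C"),
  quarter  = rep(c("2022-Q1", "2022-Q2"), 3),
  nominal  = c(101.2, NA, 98.5, 99.1, 110.4, 112.0),
  rent     = c(100.3, 100.9, NA, 97.8, 105.6, 106.1),
  notes    = NA_character_,
  stringsAsFactors = FALSE
)

# 1. Drop columns that carry no information for the analysis.
df <- raw[, !(names(raw) %in% c("id", "notes"))]

# 2. Handle missing values; here rows with any NA are simply dropped
#    (imputation would be an equally valid choice).
df <- df[complete.cases(df), ]

# 3. Scale the numeric columns to zero mean and unit variance.
num_cols <- c("nominal", "rent")
df[num_cols] <- as.data.frame(scale(df[num_cols]))

dim(df)
```

Dropping rows is the simplest way to handle missing values; whether it is appropriate depends on how many entries are affected.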
Locations and Total Numbers of Missing Values:

Figure 3: Locations and total numbers of missing values
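One possible way to obtain the locations and total count of missing values in base R is shown below. The data frame and its column names are made up for illustration.

```r
# Toy data frame; column names are hypothetical.
df <- data.frame(
  nominal = c(101.2, NA, 98.5, 99.1),
  rent    = c(100.3, 100.9, NA, NA)
)

# Row/column positions of every missing value.
na_locations <- which(is.na(df), arr.ind = TRUE)

# Total number of missing values, and the count per column.
na_total   <- sum(is.na(df))
na_per_col <- colSums(is.na(df))

print(na_locations)
print(na_total)
print(na_per_col)
```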
Structure of the Dataframe Before Preprocessing:

Figure 4: Structure of the dataframe before preprocessing
Structure of the Dataframe After Preprocessing:

Figure 5: Structure of the dataframe after preprocessing
First Few Rows of the Dataframe for df3, Training_Set and Test_Set:

Figure 6: First few rows of the dataframe for df3, training_set and test_set
Summary of the Dataframe for df3, Training_Set and Test_Set:

Figure 7: Summary of the dataframe for df3, training_set and test_set
Training Set for Housing_Prices After Preprocessing:

Figure 8: Training set for housing_prices after preprocessing
Figure 8 shows the training set. It has 72 entries with 9 columns.
Test Set for Housing_Prices After Preprocessing:

Figure 9: Test set for housing_prices after preprocessing
Figure 9 shows the test set. It has 18 entries with 10 columns. The preprocessed dataframe is split into training and test sets at a ratio of 0.8 to 0.2.
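A rough sketch of an 80/20 split is given below. The README lists caTools for data splitting; this example swaps in base R's sample() so that it runs without extra packages. The data frame contents and the seed are placeholders, and only the row count of 90 comes from the text above.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Stand-in for the preprocessed data frame: 90 rows, as stated above.
# The columns are placeholders, not the project's real columns.
df3 <- data.frame(index = seq_len(90), value = rnorm(90))

# Randomly pick 80% of the row numbers for training.
train_idx <- sample(nrow(df3), size = round(0.8 * nrow(df3)))

training_set <- df3[train_idx, ]
test_set     <- df3[-train_idx, ]

c(train = nrow(training_set), test = nrow(test_set))
```

With 90 rows this yields 72 training rows and 18 test rows, matching the sizes reported for Figures 8 and 9.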

Exploratory Data Analysis (EDA)

Barplot for the Value of Nominal and Rent Housing Prices by Locations in 2020-Q4:

Figure 10: Barplot for the value of nominal and rent housing prices by locations in 2020-q4
Boxplot for the Housing Value Prices by Months for Nominal and Rent:

Figure 11: Boxplot for the housing value prices by months for nominal
Figure 12: Boxplot for the housing value prices by months for rent
Correlation Value and the P-Value of Nominal and Rent:

Figure 13: Correlation value and the p-value of nominal and rent
Correlation Value and the P-Value of Nominal Only:

Figure 14: Correlation value and the p-value of nominal only
Correlation Value and the P-Value of Rent Only:

Figure 15: Correlation value and the p-value of rent only
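Correlation values and their p-values, as reported in Figures 13 to 15, can be computed along the following lines. The README lists Hmisc and corrplot for this step; the sketch below uses base R's cor() and cor.test() instead, on synthetic data with hypothetical variable names.

```r
set.seed(1)  # arbitrary seed; the data below are synthetic

# Synthetic stand-ins for two price series (hypothetical names).
nominal <- rnorm(30, mean = 100, sd = 5)
rent    <- 0.6 * nominal + rnorm(30, sd = 2)

# Pearson correlation coefficient and the p-value for the null
# hypothesis that the true correlation is zero.
ct      <- cor.test(nominal, rent, method = "pearson")
r_value <- unname(ct$estimate)
p_value <- ct$p.value

# Correlation matrix for several columns at once.
cor_matrix <- cor(cbind(nominal, rent))

print(r_value)
print(p_value)
print(cor_matrix)
```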

Correlation Plot for the Value of Nominal and Rent, Nominal Only, and Rent Only:

Figure 16: Correlation plot for the value of nominal and rent

Figure 17: Correlation plot for the value of nominal only

Figure 18: Correlation plot for the value of rent only

Prediction of the Housing Prices Hotspot Location to Invest via Clustering

We picked k-means clustering for the clustering model because it scales well to large datasets and has low time complexity. The index variables for 2022 Q1, Q2, and Q3 are selected for this cluster model to identify the best location in which to invest. Using the within-cluster sum of squares (WSS), the ideal value of k was determined to be 10.
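Choosing k from the within-cluster sum of squares is commonly done with an elbow plot. The sketch below shows the general idea with base R's kmeans(). The data are synthetic, and the column names, seed, range of k, and nstart value are assumptions rather than the project's settings.

```r
set.seed(123)  # arbitrary seed; k-means starts from random centres

# Synthetic stand-in: one row per location, one column per quarter.
index_data <- data.frame(
  q1 = rnorm(60, mean = 100, sd = 15),
  q2 = rnorm(60, mean = 102, sd = 15),
  q3 = rnorm(60, mean = 104, sd = 15)
)

# Total within-cluster sum of squares for k = 1..10.
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(index_data, centers = k, nstart = 10)$tot.withinss
})

# Elbow plot: look for the k after which WSS stops dropping sharply.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is read off the plot by eye, so the chosen k is a judgment call rather than a computed optimum.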

Graph of Squares against Number of Clusters:

Figure 19: Graph of squares against number of clusters

The between_ss/total_ss ratio for the clustering model is 96.4%, which indicates both strong internal cohesion and strong external separation: the centroids are close to the data points within their clusters, while each cluster is far from the others. Cluster graphs are displayed in the figures below.
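For reference, the ratio can be read directly from the object that kmeans() returns, since the total sum of squares decomposes into the within-cluster and between-cluster parts. The example below uses synthetic data and an arbitrary seed, so its ratio will not match the 96.4% reported here.

```r
set.seed(123)  # arbitrary seed; the data below are synthetic

# Synthetic stand-in: one row per location, one column per quarter.
index_data <- data.frame(
  q1 = rnorm(60, mean = 100, sd = 15),
  q2 = rnorm(60, mean = 102, sd = 15),
  q3 = rnorm(60, mean = 104, sd = 15)
)

km <- kmeans(index_data, centers = 10, nstart = 10)

# totss = tot.withinss + betweenss, so the ratio lies between 0 and 1.
ratio <- km$betweenss / km$totss

km$withinss       # within-cluster sum of squares, one value per cluster
km$tot.withinss   # their total
km$betweenss      # between-cluster sum of squares
round(100 * ratio, 1)
```

A ratio near 100% means most of the variance is explained by cluster membership, though it always rises as k grows and so should be read alongside the elbow plot.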

km within ss, Total of km within ss and km Between ss:

Figure 20: km within ss, total of km within ss and km between ss
between_ss/total_ss Ratio:

Figure 21: between_ss/total_ss ratio
Cluster Graphs for 2022-Q1, 2022-Q2 and 2022-Q3

Figure 22: Cluster graph for 2022-Q1
Figure 23: Cluster graph for 2022-Q2
Figure 24: Cluster graph for 2022-Q3

From the figures above, we can see that the hotspots with the highest index values are the locations around 15-25.

derekgan08/housing-prices-prediction
