Predicting housing prices is a critical task for understanding market trends and making informed investment decisions. This project uses nominal and rent price data across different locations to build a model that forecasts housing prices.
This project leverages data preprocessing, feature scaling, and machine learning techniques to predict housing prices using nominal and rent data. The dataset includes prices across multiple locations and time periods, with a focus on data cleaning, handling missing values, and performing exploratory data analysis (EDA). Various models, including clustering with k-means, are employed to uncover insights about housing price trends.
Key features of the project include:
- Handling and cleaning of real-world data
- Time series data analysis for price prediction
- K-means clustering for identifying market patterns
- Visualizations including bar plots, box plots, and correlation plots for better data understanding
This project can be used by real estate professionals, data scientists, or analysts who aim to gain insights into housing price trends and build predictive models.
- Data Preprocessing: Removal of irrelevant columns, handling missing data, and feature scaling.
- Exploratory Data Analysis (EDA): Visualizations using ggplot2, box plots, and correlation analysis.
- Machine Learning: K-means clustering for identifying patterns in housing prices across different locations and time.
- Time Series Analysis: Focusing on quarterly data to understand trends and predict future values.
- R Programming Language for data processing and visualization
- ggplot2 for data visualization
- caTools for data splitting
- Hmisc and corrplot for correlation analysis
- k-means clustering for unsupervised learning
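The preprocessing steps listed above (removing irrelevant columns, handling missing data, feature scaling) might look roughly like the following in base R. This is a minimal sketch on toy data, not the project's actual script: the column names are hypothetical, and mean imputation is an assumption, since the source does not say how missing values were handled.

```r
# Toy data standing in for the real CSV; all column names are hypothetical.
df <- data.frame(
  location = c("A", "B", "C", "D"),
  note     = c("x", "y", "z", "w"),   # an irrelevant column to drop
  q1       = c(100, NA, 120, 110),
  q2       = c(102, 98, NA, 115)
)

# 1. Remove irrelevant columns.
df$note <- NULL

# 2. Handle missing data: replace each NA with its column mean
#    (an assumption; dropping rows with na.omit() is another option).
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

# 3. Feature scaling: centre each numeric column to mean 0 and
#    scale it to unit standard deviation.
df[num_cols] <- scale(df[num_cols])

print(df)
```

Scaling matters for the later clustering step because k-means relies on Euclidean distance, so columns with larger ranges would otherwise dominate.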
- Original Dataset:
- Dataframe That Has Undergone Preprocessing:
Figure 2: Dataframe that has undergone preprocessing
Figure 1 shows the data read from the CSV file and stored in the dataframe df. It contains 698 entries with 8 columns. Figure 2 shows df after preprocessing; it now has 90 entries with 10 columns.
- Locations and Total Numbers of Missing Values:
- Structure of the Dataframe Before Preprocessing:
- Structure of the Dataframe After Preprocessing:
- First Few Rows of the Dataframe for df3, Training_Set and Test_Set:
Figure 6: First few rows of the dataframe for df3, training_set and test_set
- Summary of the Dataframe for df3, Training_Set and Test_Set:
Figure 7: Summary of the dataframe for df3, training_set and test_set
- Training Set for Housing_Prices After Preprocessing:
Figure 8: Training set for housing_prices after preprocessing
- Test Set for Housing_Prices After Preprocessing:
Figure 9 shows the test set, which has 18 entries with 10 columns. The preprocessed dataframe is split into training and test sets at a ratio of 0.8 to 0.2.
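As an illustration of the 0.8/0.2 split described above, here is a small base R sketch. The project lists caTools for data splitting; this example uses `sample()` instead so that it runs without extra packages, and the columns of the stand-in `df3` are hypothetical.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the preprocessed dataframe (90 rows).
df3 <- data.frame(id = 1:90, value = rnorm(90))

# Draw 80% of the row indices at random for the training set;
# the remaining 20% form the test set.
train_idx    <- sample(nrow(df3), size = round(0.8 * nrow(df3)))
training_set <- df3[train_idx, ]
test_set     <- df3[-train_idx, ]

nrow(training_set)  # 72
nrow(test_set)      # 18
```

With 90 rows, an 80/20 split yields 72 training rows and 18 test rows, which matches the 18-entry test set reported above.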
- Barplot for the Value of Nominal and Rent Housing Prices by Locations in 2020-Q4:
Figure 10: Barplot for the value of nominal and rent housing prices by locations in 2020-Q4
- Boxplot for the Housing Value Prices by Months for Nominal and Rent:
Figure 11: Boxplot for the housing value prices by months for nominal
Figure 12: Boxplot for the housing value prices by months for rent
- Correlation Value and the P-Value of Nominal and Rent:
Figure 13: Correlation value and the p-value of nominal and rent
- Correlation Value and the P-Value of Nominal Only:
Figure 14: Correlation value and the p-value of nominal only
- Correlation Value and the P-Value of Rent Only:
- Correlation Plot for the Value of Nominal and Rent, Nominal Only, and Rent Only:
Figure 16: Correlation plot for the value of nominal and rent
Figure 17: Correlation plot for the value of nominal only
Figure 18: Correlation plot for the value of rent only
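The correlation values and p-values shown in the figures above come from the project's own analysis with Hmisc and corrplot. For readers who want to see the idea on a single pair of variables, the base R function `cor.test()` returns both quantities. The series below are synthetic and hypothetical, not the project's data.

```r
set.seed(1)  # arbitrary seed for the synthetic data

# Synthetic, hypothetical series standing in for nominal and rent values.
nominal <- rnorm(50, mean = 100, sd = 10)
rent    <- 0.5 * nominal + rnorm(50, sd = 3)

# Pearson correlation test: returns the coefficient and its p-value.
res <- cor.test(nominal, rent, method = "pearson")

res$estimate  # correlation coefficient
res$p.value   # p-value for the null hypothesis of zero correlation
```

A small p-value suggests that the observed correlation is unlikely to arise by chance if the two series were truly uncorrelated.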
We chose k-means as the clustering model because it scales well to large datasets and has low time complexity. The index values for 2022-Q1, 2022-Q2, and 2022-Q3 are used as the clustering variables to identify the best locations to invest in. Using the within-cluster sum of squares (WSS), the ideal value of k was determined to be 10.
Graph of Squares against Number of Clusters:
Figure 19: Graph of squares against number of clusters
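A curve like the one in Figure 19 is typically produced by running k-means for a range of k values and recording the total within-cluster sum of squares each time. The sketch below shows this on synthetic data; the column names, the range 1 to 10, and `nstart = 10` are assumptions, not settings taken from the project.

```r
set.seed(42)  # arbitrary seed; kmeans() starts from random centroids

# Synthetic stand-in for the three quarterly index columns
# (hypothetical names and values).
x <- data.frame(
  q1 = rnorm(90, mean = 100, sd = 15),
  q2 = rnorm(90, mean = 100, sd = 15),
  q3 = rnorm(90, mean = 100, sd = 15)
)

# Total within-cluster sum of squares for k = 1, ..., 10.
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

print(wss)
```

Calling `plot(1:10, wss, type = "b")` on the result draws the curve; the usual heuristic is to pick the k after which adding more clusters reduces WSS only slightly.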
The between SS/total SS ratio for the clustering model is 96.4%, indicating that the clusters are both internally cohesive and well separated: data points lie close to the centroid of their own cluster, and the clusters lie far from one another. The cluster graphs are displayed in the figures below.
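For reference, the ratio can be read directly off the object returned by `kmeans()`, which stores `betweenss`, `tot.withinss`, and `totss`. The example below is a sketch on synthetic, well-separated groups with three clusters, so its value will differ from the 96.4% reported for the project's model.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Three well-separated synthetic groups (hypothetical data).
x <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2),
  matrix(rnorm(60, mean = 20), ncol = 2)
)

km <- kmeans(x, centers = 3, nstart = 10)

# totss = tot.withinss + betweenss, so the ratio lies between 0 and 1.
ratio <- km$betweenss / km$totss
round(100 * ratio, 1)  # share of the total sum of squares, in percent
```

A ratio close to 100% means most of the total variation lies between clusters rather than within them. The ratio generally rises as k grows, so it is best read together with the WSS curve.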
- km within ss, Total of km within ss and km Between ss:
Figure 20: km within ss, total of km within ss and km between ss
- between_ss/total_ss Ratio:
- Cluster Graphs of 2022-Q1, 2022-Q2 and 2022-Q3:
Figure 22: Cluster graph of 2022-Q1
Figure 23: Cluster graph of 2022-Q2
Figure 24: Cluster graph of 2022-Q3
From the figures above, we can see that the hotspots with the highest index values are the locations around 15-25.