Predicting housing prices with nominal and rent data for different locations using K-Means Clustering

derekgan08/housing-prices-prediction

Principles of Data Analytics Project: Housing Prices Prediction

Problem Statement

Predicting housing prices is a critical task for understanding market trends and making informed investment decisions. This project uses nominal and rent price data across different locations to build a model that forecasts housing prices.

Project Overview

This project leverages data preprocessing, feature scaling, and machine learning techniques to predict housing prices using nominal and rent data. The dataset includes prices across multiple locations and time periods, with a focus on data cleaning, handling missing values, and performing exploratory data analysis (EDA). Various models, including clustering with k-means, are employed to uncover insights about housing price trends.

Key features of the project include:

  • Handling and cleaning of real-world data
  • Time series data analysis for price prediction
  • K-means clustering for identifying market patterns
  • Visualizations including bar plots, box plots, and correlation plots for better data understanding

This project can be used by real estate professionals, data scientists, or analysts who aim to gain insights into housing price trends and build predictive models.

Key Features

  • Data Preprocessing: Removal of irrelevant columns, handling missing data, and feature scaling.
  • Exploratory Data Analysis (EDA): Visualizations using ggplot2, box plots, and correlation analysis.
  • Machine Learning: K-means clustering for identifying patterns in housing prices across different locations and time.
  • Time Series Analysis: Focusing on quarterly data to understand trends and predict future values.
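As a rough illustration of the feature scaling step, the sketch below standardizes one column to zero mean and unit standard deviation. The project itself is written in R; this is a Python sketch for illustration only, the `standardize` helper and the sample values are hypothetical, and the README does not say which scaling method the project actually uses.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Hypothetical quarterly index values for one location (not project data).
index_values = [98.0, 100.0, 102.0, 104.0, 106.0]
scaled = standardize(index_values)
print([round(z, 3) for z in scaled])
```

Scaling matters for k-means because the algorithm is distance based, so a column with a larger numeric range would otherwise dominate the clustering.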

Technologies Used

  • R Programming Language for data processing and visualization
  • ggplot2 for data visualization
  • caTools for data splitting
  • Hmisc and corrplot for correlation analysis
  • k-means clustering for unsupervised learning

Data Preparation

Data Preparation and Preprocessing

  1. Original Dataset:

    Figure 1: Original dataset

  2. Dataframe After Preprocessing:

    Figure 2: Dataframe after preprocessing

    Figure 1 shows the data read from the CSV file and stored in the dataframe df. It contains 698 entries with 8 columns. Figure 2 shows df after preprocessing; it now has 90 entries with 10 columns.

  3. Locations and Total Numbers of Missing Values:

    Figure 3: Locations and total numbers of missing values

  4. Structure of the Dataframe Before Preprocessing:

    Figure 4: Structure of the dataframe before preprocessing

  5. Structure of the Dataframe After Preprocessing:

    Figure 5: Structure of the dataframe after preprocessing

  6. First Few Rows of the Dataframe for df3, Training_Set and Test_Set:

    Figure 6: First few rows of the dataframe for df3, training_set and test_set

  7. Summary of the Dataframe for df3, Training_Set and Test_Set:

    Figure 7: Summary of the dataframe for df3, training_set and test_set

  8. Training Set for Housing_Prices After Preprocessing:

    Figure 8: Training set for housing_prices after preprocessing

    Figure 8 shows the training set. It has 72 entries with 9 columns.

  9. Test Set for Housing_Prices After Preprocessing:

    Figure 9: Test set for housing_prices after preprocessing

    Figure 9 shows the test set. It has 18 entries with 10 columns. The preprocessed dataframe is split into training and test sets at a ratio of 0.8 to 0.2.
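The 0.8/0.2 split described above can be sketched as follows. The project performs its split in R with caTools; this Python version is only an illustration of the arithmetic, using a plain shuffled split. The `train_test_split` helper, the seed, and the stand-in rows are assumptions, not the project's code.

```python
import random

def train_test_split(rows, train_ratio=0.8, seed=123):
    """Shuffle row indices and cut them at train_ratio (simple random split)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(len(rows) * train_ratio)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

# Stand-in for the 90 rows of the preprocessed dataframe.
rows = list(range(90))
training_set, test_set = train_test_split(rows)
print(len(training_set), len(test_set))  # 72 18
```

With 90 preprocessed rows, a 0.8 ratio yields the 72 training rows and 18 test rows reported in Figures 8 and 9.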

Exploratory Data Analysis (EDA)

  1. Barplot for the Value of Nominal and Rent Housing Prices by Locations in 2020-Q4:

    Figure 10: Barplot for the value of nominal and rent housing prices by locations in 2020-Q4

  2. Boxplot for the Housing Value Prices by Months for Nominal and Rent:

    Figure 11: Boxplot for the housing value prices by months for nominal
    Figure 12: Boxplot for the housing value prices by months for rent

  3. Correlation Value and the P-Value of Nominal and Rent:

    Figure 13: Correlation value and the p-value of nominal and rent

  4. Correlation Value and the P-Value of Nominal Only:

    Figure 14: Correlation value and the p-value of nominal only

  5. Correlation Value and the P-Value of Rent Only:

    Figure 15: Correlation value and the p-value of rent only

  6. Correlation Plot for the Value of Nominal and Rent, Nominal Only, and Rent Only:

    Figure 16: Correlation plot for the value of nominal and rent
    Figure 17: Correlation plot for the value of nominal only
    Figure 18: Correlation plot for the value of rent only
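For readers unfamiliar with how a correlation value and its significance are obtained, the sketch below computes a Pearson correlation coefficient and the associated t-statistic. The project uses Hmisc and corrplot in R for this; the Python code here is illustrative only, the helper names are hypothetical, and the two short series are made-up values rather than project data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t-statistic for testing r != 0; assumes |r| < 1 and n > 2."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Made-up index values for illustration only.
nominal = [100.0, 104.0, 109.0, 115.0, 118.0]
rent = [100.0, 101.0, 103.0, 104.0, 107.0]

r = pearson_r(nominal, rent)
t = t_statistic(r, len(nominal))
print(round(r, 3), round(t, 3))
```

The p-value is then read from a t distribution with n - 2 degrees of freedom; that lookup is left out here to keep the sketch within the standard library.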

Prediction of the Housing Prices Hotspot Location to Invest via Clustering

We picked k-means clustering as the clustering model because it scales well to large datasets and has low time complexity. The index values for 2022 Q1, Q2, and Q3 are selected as the clustering variables to identify the best locations to invest in. Using the within-cluster sum of squares (WSS), the ideal value of k was determined to be 10.

Graph of Within Sum of Squares against Number of Clusters:

Figure 19: Graph of within sum of squares against number of clusters
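The idea behind choosing k from the WSS curve can be sketched as below: run k-means for several values of k and record the total within-cluster sum of squares for each. This is a Python illustration using plain Lloyd's algorithm, which may differ from the algorithm used by the project's R code (R's `kmeans` defaults to Hartigan-Wong). The synthetic three-column data stands in for the 2022 Q1 to Q3 index values, and all helper names are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

def total_wss(points, centroids, labels):
    """Total within-cluster sum of squares for a given clustering."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Synthetic stand-in data: three groups of (Q1, Q2, Q3) index values.
rng = random.Random(42)
points = [tuple(base + rng.uniform(-2, 2) for _ in range(3))
          for base in (90, 120, 150) for _ in range(10)]

wss_by_k = {k: total_wss(points, *kmeans(points, k)) for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 1))
```

The value of k is usually chosen at the "elbow", where adding another cluster no longer reduces the WSS by much.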

The between_SS/total_SS ratio for the clustering model is 96.4%, indicating strong internal cohesion and external separation: data points lie close to the centroids of their own clusters, while the clusters themselves are far apart. Cluster graphs are displayed in the figures below.

  1. km within ss, Total of km within ss and km Between ss:

    Figure 20: km within ss, total of km within ss and km between ss

  2. between_ss/total_ss Ratio:

    Figure 21: between_ss/total_ss ratio

  3. Clusters Graphs of 2022-Q1, 2022-Q2 and 2022-Q3:

    Figure 22: Clusters graphs of 2022-Q1
    Figure 23: Clusters graphs of 2022-Q2
    Figure 24: Clusters graphs of 2022-Q3

    From the figures above, we can see that the hotspots with the highest index values are the locations around 15-25.
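The between_SS/total_SS ratio reported for the model can be understood through the identity total_SS = within_SS + between_SS. The sketch below computes the ratio for a tiny hand-made clustering; it is a Python illustration rather than the project's R code, and the points, labels, and helper names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def between_total_ratio(points, labels):
    """Share of the total sum of squares explained by the clustering."""
    grand = centroid(points)
    total_ss = sum(sq_dist(p, grand) for p in points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    within_ss = sum(sq_dist(p, centroid(members))
                    for members in clusters.values() for p in members)
    between_ss = total_ss - within_ss  # total_SS = within_SS + between_SS
    return between_ss / total_ss

# Two tight, well-separated clusters (made-up values).
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]
labels = [0, 0, 1, 1]

ratio = between_total_ratio(points, labels)
print(f"{ratio:.1%}")  # 94.1%
```

A ratio close to 100% means most of the variation lies between clusters rather than inside them, which is the sense in which the 96.4% figure indicates well-separated clusters.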
