
# Predicting Houses Built Before 1980

Lucas Soto

## Project Objective

The main purpose of this project is to build a model that predicts whether a house was built before 1980, using dwelling data from the state of Colorado. I used a Decision Tree Classifier for this project.

## Sample Data View

```
parcel abstrprd livearea finbsmnt basement yrbuilt totunits stories nocars numbdrm numbaths sprice deduct netprice tasp smonth syear condition_AVG condition_Excel condition_Fair condition_Good condition_VGood quality_A quality_B quality_C quality_D quality_X gartype_Att gartype_Att/Det gartype_CP gartype_Det gartype_None gartype_att/CP gartype_det/CP arcstyle_BI-LEVEL arcstyle_CONVERSIONS arcstyle_END UNIT arcstyle_MIDDLE UNIT arcstyle_ONE AND HALF-STORY arcstyle_ONE-STORY arcstyle_SPLIT LEVEL arcstyle_THREE-STORY arcstyle_TRI-LEVEL arcstyle_TRI-LEVEL WITH BASEMENT arcstyle_TWO AND HALF-STORY arcstyle_TWO-STORY qualified_Q qualified_U status_I status_V before1980
0 00102-08-065-065 1130 1346 0 0 2004 1 2 2 2 2 100000 0 100000 100000 2 2012 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0
1 00102-08-073-073 1130 1249 0 0 2005 1 1 1 2 2 94700 0 94700 94700 4 2011 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0
2 00102-08-078-078 1130 1346 0 0 2005 1 2 1 2 2 89500 0 89500 89500 10 2010 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0
```

## Finding Relationships

Two charts that evaluate potential relationships between `livearea`, `netprice`, and the `before1980` target.

I created two charts using variables that could plausibly relate to the target. The first chart shows no meaningful relationship between `livearea` and `before1980`: the distribution of living area is very similar across all years, with some exceptions after 2000, where a few outliers appear. The second chart shows that the selling price variable is not very informative either: before the year 2000, most house prices are below $2 million.
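The "most pre-2000 prices are below $2 million" observation can be checked directly with a pandas filter. Here is a minimal sketch on a hypothetical mini-sample that reuses the `yrbuilt` and `netprice` column names from `dwellings_ml`; the values are made up for illustration:

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as dwellings_ml
df = pd.DataFrame({
    "yrbuilt":  [1950, 1975, 1985, 1999, 2004, 2005],
    "netprice": [90_000, 150_000, 2_500_000, 400_000, 100_000, 94_700],
})

# Restrict to pre-2000 houses and measure the share priced under $2M
pre2000 = df[df["yrbuilt"] < 2000]
share_under_2m = (pre2000["netprice"] < 2_000_000).mean()
print(f"{share_under_2m:.0%} of pre-2000 houses are below $2M")
```

Running the same two lines against the full dataset reports the actual share.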

### Technical Details

```python
# First graph: living area vs. year built
graph = (alt.Chart(dwellings_ml).encode(
    x=alt.X('yrbuilt', scale=alt.Scale(zero=False),
            axis=alt.Axis(title='Year the House was Built')),
    y=alt.Y('livearea',  # alt.Y, not alt.X, for the vertical axis
            axis=alt.Axis(title='Square Footage that is Liveable'))
).mark_point().properties(title='Living Area as an Indicator for Houses Before 1980')
)
graph.save('graph1.png')

# Second graph: net price vs. year built
graph2 = (alt.Chart(dwellings_ml).encode(
    x=alt.X('yrbuilt', scale=alt.Scale(zero=False),
            axis=alt.Axis(title='Year the House was Built')),
    y=alt.Y('netprice', axis=alt.Axis(title='House Prices'))
).mark_bar().properties(title='House Price as an Indicator for Houses Before 1980')
)
graph2.save('graph2.png')
```

## Model

### Classification Model with 90% Accuracy

The first step in building this model is to split the data into two parts: the target and the features. The features will help us predict the target. The second step is to split the data again into training and test sets. For this model I chose a test size of 0.34, which means that 66% of the data will be used to train the model and 34% will be used to test it. Finally, I used the decision tree classifier because the features are categorical and discrete, and I found that this model works well with these types of features.

### Technical Details

```python
# Split the data into features (x) and target (y)
x = dwellings_ml.drop(['yrbuilt', 'parcel', 'before1980'], axis=1)
y = dwellings_ml.filter(['before1980'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.34,
    random_state=76,
)

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train.values.ravel())  # ravel() flattens the single-column target
y_predictions = clf.predict(X_test)

# Look at the accuracy on the test set
metrics.accuracy_score(y_test, y_predictions)
```
  

### The Quality of the Classification Model Using Evaluation Metrics

The first and simplest way to evaluate the model is accuracy: the number of correct predictions divided by the total number of data points. In this case the accuracy is 0.90. Another useful metric is precision: the model's ability to identify only the relevant data points. It is calculated by dividing the number of true positives by the sum of true positives and false positives. In this case the precision is 0.93.
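The two formulas above can be verified by hand from a confusion matrix. A small sketch on toy labels (1 = built before 1980), not the project's actual predictions:

```python
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

# Toy true/predicted labels for illustration only
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # true positives / predicted positives

# The hand computations match sklearn's helper functions
assert accuracy == accuracy_score(y_true, y_pred)
assert precision == precision_score(y_true, y_pred)
```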

```python
print(classification_report(y_test, y_predictions))
```

### Detailing the Most Important Features in the Model

The following chart shows the features with the most influence on the model.

### Technical Details

```python
# Rank the features by importance
df_features = pd.DataFrame(
    {'f_names': X_train.columns,
     'f_values': clf.feature_importances_}).sort_values('f_values', ascending=False)

# Keep only the features that contribute more than 2% of the importance
df_features_top = df_features.query('f_values > 0.02')
df_features_top

variablesChart = alt.Chart(df_features_top).mark_bar(color='black').encode(
    x=alt.X('f_values', axis=alt.Axis(title='How Much Each Feature Affects the Model')),
    y=alt.Y('f_names', axis=alt.Axis(title='Feature Name'), sort='-x')
).properties(
    title='How Does Each Feature Affect the Model?'
)
variablesChart.save('variable.png')
```

## Appendix A (Python Code)

```python
#%%
import pandas as pd
import numpy as np
import seaborn as sns
import altair as alt

#%%
# the "from" imports
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

alt.data_transformers.enable('json')

#%%
# Read and import the data
dwellings_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv")
dwellings_ml.head(5)

#%%
# First graph: living area vs. year built
graph = (alt.Chart(dwellings_ml).encode(
    x=alt.X('yrbuilt', scale=alt.Scale(zero=False),
            axis=alt.Axis(title='Year the House was Built')),
    y=alt.Y('livearea',  # alt.Y, not alt.X, for the vertical axis
            axis=alt.Axis(title='Square Footage that is Liveable'))
).mark_point().properties(title='Living Area as an Indicator for Houses Before 1980')
)
graph.save('graph1.png')

# %%
# Second graph: net price vs. year built
graph2 = (alt.Chart(dwellings_ml).encode(
    x=alt.X('yrbuilt', scale=alt.Scale(zero=False),
            axis=alt.Axis(title='Year the House was Built')),
    y=alt.Y('netprice', axis=alt.Axis(title='House Prices'))
).mark_bar().properties(title='House Price as an Indicator for Houses Before 1980')
)
graph2.save('graph2.png')

## Building a classification model
#%%
# Set the variables
x = dwellings_ml.drop(['yrbuilt', 'parcel', 'before1980'], axis=1)
y = dwellings_ml.filter(['before1980'], axis=1)

#%%
X_train, X_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.34,
    random_state=76,
)

#%%
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train.values.ravel())  # ravel() flattens the single-column target
y_predictions = clf.predict(X_test)

#%%
# Look at the accuracy on the test set
metrics.accuracy_score(y_test, y_predictions)

#%%
# Answer question 3
# Show which variables have the most impact on the model
df_features = pd.DataFrame(
    {'f_names': X_train.columns,
     'f_values': clf.feature_importances_}).sort_values('f_values', ascending=False)

df_features_top = df_features.query('f_values > 0.02')
df_features_top

#%%
# Graph how each variable affected the model
variablesChart = alt.Chart(df_features_top).mark_bar(color='black').encode(
    x=alt.X('f_values', axis=alt.Axis(title='How Much Each Feature Affects the Model')),
    y=alt.Y('f_names', axis=alt.Axis(title='Feature Name'), sort='-x')
).properties(
    title='How Does Each Feature Affect the Model?'
)
variablesChart.save('variable.png')

#%%
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
metrics.RocCurveDisplay.from_estimator(clf, X_test, y_test)

#%%
# Answer question 4
print(classification_report(y_test, y_predictions))
```