Lucas Soto
The main purpose of this project is to build a model that predicts whether a house was built before 1980. I used housing data from the state of Colorado and a decision tree classifier.
index | parcel | abstrprd | livearea | finbsmnt | basement | yrbuilt | totunits | stories | nocars | numbdrm | numbaths | sprice | deduct | netprice | tasp | smonth | syear | condition_AVG | condition_Excel | condition_Fair | condition_Good | condition_VGood | quality_A | quality_B | quality_C | quality_D | quality_X | gartype_Att | gartype_Att/Det | gartype_CP | gartype_Det | gartype_None | gartype_att/CP | gartype_det/CP | arcstyle_BI-LEVEL | arcstyle_CONVERSIONS | arcstyle_END UNIT | arcstyle_MIDDLE UNIT | arcstyle_ONE AND HALF-STORY | arcstyle_ONE-STORY | arcstyle_SPLIT LEVEL | arcstyle_THREE-STORY | arcstyle_TRI-LEVEL | arcstyle_TRI-LEVEL WITH BASEMENT | arcstyle_TWO AND HALF-STORY | arcstyle_TWO-STORY | qualified_Q | qualified_U | status_I | status_V | before1980 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 00102-08-065-065 | 1130 | 1346 | 0 | 0 | 2004 | 1 | 2 | 2 | 2 | 2 | 100000 | 0 | 100000 | 100000 | 2 | 2012 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 00102-08-073-073 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | 2 | 94700 | 0 | 94700 | 94700 | 4 | 2011 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
2 | 00102-08-078-078 | 1130 | 1346 | 0 | 0 | 2005 | 1 | 2 | 1 | 2 | 2 | 89500 | 0 | 89500 | 89500 | 10 | 2010 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
Two charts that evaluate potential relationships between livearea, price, and the before1980 variables
I created two charts using variables that could show some relationship with the target. The first chart shows no clear relationship between livearea and before1980: the distribution of living area is very similar across all the years, with some exceptions after the year 2000, where there are some outliers. In the second chart, the selling price (netprice) does not show a clear relationship either. We can observe that before the year 2000 most of the house prices are below $2 million.
## First graph
graph=(alt.Chart(dwellings_ml).encode(
x=alt.X('yrbuilt', scale=alt.Scale(zero=False),axis=alt.Axis(title='Year the House was Built')),
y=alt.Y('livearea', axis=alt.Axis(title='Square Footage that is Liveable'))
).mark_point().properties(title="Living Area as an Indicator for Houses Before 1980")
)
graph.save('graph1.png')
## Second graph
graph2=(alt.Chart(dwellings_ml).encode(
x=alt.X('yrbuilt',scale=alt.Scale(zero=False),axis=alt.Axis(title="Year the House was Built")),
y=alt.Y('netprice', axis=alt.Axis(title="House Prices"))
).mark_bar().properties(title="House Price as an Indicator for Houses Before 1980")
)
graph2.save("graph2.png")
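Because both charts plot against yrbuilt rather than the target itself, a grouped summary can serve as a more direct check of whether a variable separates the two classes. The following is only a sketch on a tiny made-up table (the `demo` values are hypothetical, not rows from the dwellings data); the same `groupby` call could be applied to `dwellings_ml`.

```python
import pandas as pd

# Hypothetical stand-in rows; these are NOT values from the dwellings data.
demo = pd.DataFrame({
    "livearea": [1200, 1500, 1800, 1300, 1600, 2000],
    "before1980": [1, 1, 1, 0, 0, 0],
})

# Summarize living area separately for each class of the target.
# Similar means and quartiles across the two rows would suggest that the
# variable carries little information about before1980 on its own.
summary = demo.groupby("before1980")["livearea"].describe()
print(summary[["count", "mean", "50%"]])
```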
The first step in building this model is to split the data into two parts, the target and the features. The features will help us predict the target. The second step is to split the data again into training and test sets. For this model I decided to use a test size of 0.34, which means that 66% of the data will be used to train the model and 34% will be used to test it. Finally, I used the decision tree classifier because many of the features are categorical or discrete, and I found that this model works better with these types of features.
x=dwellings_ml.drop(['yrbuilt','parcel', 'before1980'],axis=1)
y=dwellings_ml.filter(['before1980' ], axis=1)
#%%
X_train, X_test, y_train, y_test= train_test_split(
x,
y,
test_size=0.34,
random_state= 76
)
#%%
clf=tree.DecisionTreeClassifier(criterion="entropy", random_state= 0)
clf.fit(X_train,y_train)
y_predictions=clf.predict(X_test)
#%%
## looking at the accuracy on the test set
metrics.accuracy_score(y_test, y_predictions)
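To make the 66/34 split concrete, the sketch below runs `train_test_split` with the same `test_size` and `random_state` on a small synthetic array and checks the resulting sizes. The names `X_demo` and `y_demo` and the 100-row data are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, and a binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Same split settings as in the report.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.34, random_state=76
)

n_train, n_test = len(X_tr), len(X_te)
print(f"train rows: {n_train}, test rows: {n_test}")
print(f"test share: {n_test / (n_train + n_test):.2f}")
```

Fixing `random_state` makes the split reproducible, so the reported accuracy can be regenerated from the same rows.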
The first and simplest way to evaluate the performance of the model is accuracy, which is the number of correct predictions divided by the total number of data points. In this case the accuracy is 0.90. Another useful metric is precision, which is the ability of the model to identify only the relevant data points. It is calculated by dividing the number of true positives by the sum of the true positives and the false positives. In this case the precision is 0.93.
print(classification_report(y_test,y_predictions))
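The two definitions above can be written out directly from confusion-matrix counts. This is a minimal sketch; the counts below are hypothetical and are not the counts produced by the model in this report.

```python
def accuracy(tp, fp, tn, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """True positives divided by everything predicted positive."""
    return tp / (tp + fp)


# Hypothetical counts, chosen only to illustrate the arithmetic.
tp, fp, tn, fn = 90, 10, 80, 20

acc = accuracy(tp, fp, tn, fn)   # (90 + 80) / 200
prec = precision(tp, fp)         # 90 / (90 + 10)
print(f"accuracy = {acc:.2f}, precision = {prec:.2f}")
```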
df_features = pd.DataFrame(
{'f_names': X_train.columns,
'f_values': clf.feature_importances_}).sort_values('f_values', ascending = False)
df_features_top= df_features.query('f_values > 0.02')
df_features_top
variablesChart = alt.Chart(df_features_top).mark_bar(color='black').encode(
x=alt.X('f_values', axis=alt.Axis(title='Relative Importance of Each Feature')),
y=alt.Y('f_names', axis= alt.Axis(title='Feature Name'),sort='-x')
).properties(
title='How Much Each Feature Affects the Model'
)
variablesChart.save("variable.png")
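The values in `feature_importances_` are fractions that sum to 1 across all features, so the 0.02 cutoff above keeps features that account for more than 2% of the total importance. The sketch below illustrates this on synthetic data; the column names `signal` and `noise` and the data itself are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the target depends only on "signal"; "noise" is irrelevant.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_demo = (X_demo["signal"] > 0).astype(int)

clf_demo = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf_demo.fit(X_demo, y_demo)

# Pair each importance with its column name, largest first.
importances = pd.Series(
    clf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
total = importances.sum()

print(importances)
print("sum of importances:", total)
```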
#%%
import pandas as pd
import numpy as np
import seaborn as sns
import altair as alt
#%%
# the from imports
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
alt.data_transformers.enable('json')
#%%
#reading and importing the data.
dwellings_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv")
dwellings_ml.head(5)
#%%
graph=(alt.Chart(dwellings_ml).encode(
x=alt.X('yrbuilt', scale=alt.Scale(zero=False),axis=alt.Axis(title='Year the House was Built')),
y=alt.Y('livearea', axis=alt.Axis(title='Square Footage that is Liveable'))
).mark_point().properties(title="Living Area as an Indicator for Houses Before 1980")
)
graph.save('graph1.png')
# %%
graph2=(alt.Chart(dwellings_ml).encode(
x=alt.X('yrbuilt',scale=alt.Scale(zero=False),axis=alt.Axis(title="Year the House was Built")),
y=alt.Y('netprice', axis=alt.Axis(title="House Prices"))
).mark_bar().properties(title="House Price as an Indicator for Houses Before 1980")
)
graph2.save("graph2.png")
## building a classification model
#%%
## set the variables
x=dwellings_ml.drop(['yrbuilt','parcel', 'before1980'],axis=1)
y=dwellings_ml.filter(['before1980' ], axis=1)
#%%
X_train, X_test, y_train, y_test= train_test_split(
x,
y,
test_size=0.34,
random_state= 76
)
#%%
clf=tree.DecisionTreeClassifier(criterion="entropy", random_state= 0)
clf.fit(X_train,y_train)
y_predictions=clf.predict(X_test)
#%%
## looking at the accuracy on the test set
metrics.accuracy_score(y_test, y_predictions)
#%%
#ANSWER QUESTION 3
## this shows which variables have the most impact on the model.
df_features = pd.DataFrame(
{'f_names': X_train.columns,
'f_values': clf.feature_importances_}).sort_values('f_values', ascending = False)
df_features_top= df_features.query('f_values > 0.02')
df_features_top
#%%
## now create a graph to show how much each variable affected the model.
##WHY ALL MY GRAPHS DONT WORK ?
variablesChart = alt.Chart(df_features_top).mark_bar(color='black').encode(
x=alt.X('f_values', axis=alt.Axis(title='Relative Importance of Each Feature')),
y=alt.Y('f_names', axis= alt.Axis(title='Feature Name'),sort='-x')
).properties(
title='How Much Each Feature Affects the Model'
)
variablesChart.save("variable.png")
#%%
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it.
metrics.RocCurveDisplay.from_estimator(clf, X_test, y_test)
#%%
#ANSWER QUESTION 4
print(classification_report(y_test,y_predictions))
#%%