
# Machine Learning {#ml}


## Machine Learning Overview

In machine learning the goal is usually prediction: predicting a numeric outcome (regression) or a category (classification).

Machine learning is divided into supervised learning and unsupervised learning.

In supervised learning the data set contains the outcome variable as well as the other variables of interest, called features.

The __outcome variable__ is also called the __target variable__, the __variable of interest__, or the __dependent variable__. We choose this variable from the data set: we decide which variable is the best choice for the question we are trying to answer.

The __features__ (computer science lingo) are also called __covariates__, __explanatory variables__, __predictor variables__, or __independent variables__ (all statistical lingo).

In data science there is always a real-world question we want to answer with the data we have. We look at the data and decide which variable in the data set is the best option for answering that question.

We have to work out how to formulate the real-world problem as a machine learning question that the data can answer.

Just as we choose the __variable of interest__, i.e. the __target variable__, we also choose which variables in the data to use as the __features__ or __explanatory variables__.

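There are two main approaches to machine learning / statistical machine learning: parametric and non-parametric. Parametric methods assume a functional form for the relationship between the features and the target and estimate its parameters; non-parametric methods make no such assumption. Both have advantages and disadvantages.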
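
To make the distinction concrete, here is a minimal sketch in plain Python. It assumes simple linear regression as the parametric example and k-nearest-neighbours averaging as the non-parametric one; the function names `fit_line` and `knn_predict` and the toy data are made up for illustration.

```python
# Parametric: assume a functional form (here a straight line) and estimate
# its parameters. Non-parametric: assume no form and predict directly from
# the nearby training points.

def fit_line(x, y):
    """Least-squares estimates of intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def knn_predict(x, y, x_new, k=3):
    """Average the outcomes of the k training points closest to x_new."""
    nearest = sorted(zip(x, y), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(yi for _, yi in nearest) / k


# Toy data (hypothetical): y is exactly 2 * x + 1
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [1.0, 3.0, 5.0, 7.0, 9.0]

b0, b1 = fit_line(x_train, y_train)
line_pred = b0 + b1 * 2.5                            # needs only the two parameters
knn_pred = knn_predict(x_train, y_train, 2.5, k=2)   # needs the stored training data
```

Once fitted, the parametric model is summarized by two numbers, while the non-parametric one has to keep the training data around to make a prediction.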

### Trade-off Between Prediction Accuracy and Model Interpretability

### Trade-off Between Bias and Variance

The bias-variance trade-off can be thought of in terms of how flexible the model is. A more flexible model has lower bias: it can follow the training data more closely, so the training error goes down. But it also has higher variance: its fit changes more from one training sample to the next. Beyond some point the extra flexibility means the model is fitting noise, and the testing or production error goes up even though the training error keeps falling.

So one reason to prefer a less flexible model is to reduce overfitting.

Another reason to prefer a less flexible model is when the goal is inference and interpretation. Less flexible models are more interpretable.
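
The effect of flexibility on training and test error can be sketched with a small simulation. This is an illustration under assumptions, not a general result: the data-generating process (a quadratic plus Gaussian noise), the sample sizes, and the choice of k-nearest-neighbours averaging as the model are all made up here. In k-nearest neighbours, a small k gives a very flexible model and a large k a rigid one.

```python
import random


def knn_predict(x_train, y_train, x_new, k):
    """Average the outcomes of the k training points closest to x_new."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(yi for _, yi in nearest) / k


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def simulate(n, rng):
    """Hypothetical data-generating process: y = x^2 + noise."""
    x = [rng.uniform(-2, 2) for _ in range(n)]
    y = [xi ** 2 + rng.gauss(0, 0.5) for xi in x]
    return x, y


rng = random.Random(0)
x_train, y_train = simulate(50, rng)
x_test, y_test = simulate(50, rng)

results = {}
for k in (1, 5, 25, 50):  # from most flexible (k = 1) to least flexible (k = n)
    train_pred = [knn_predict(x_train, y_train, xi, k) for xi in x_train]
    test_pred = [knn_predict(x_train, y_train, xi, k) for xi in x_test]
    results[k] = (mse(y_train, train_pred), mse(y_test, test_pred))
    print(k, results[k])  # (training MSE, test MSE)
```

With k = 1 every training point is its own nearest neighbour, so the training MSE is exactly zero, yet the test MSE is not. With k = 50 the model predicts the overall mean everywhere and cannot follow the curve at all. Comparing the two columns of output across values of k shows why training error alone is a poor guide to choosing flexibility.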


### Model Assessment

The most common performance metric for regression models is the mean squared error (MSE): the average of the squared differences between the value predicted by the model and the actual value of the outcome variable.


$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 $$
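
As a quick check on the formula, here is a minimal sketch of the MSE in plain Python. The function name `mse` and the example numbers are chosen for illustration; `y_hat` plays the role of the predictions $\hat{f}(x_i)$.

```python
def mse(y, y_hat):
    """Mean squared error between observed values y and predictions y_hat."""
    if len(y) != len(y_hat):
        raise ValueError("y and y_hat must have the same length")
    n = len(y)
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / n


y = [3.0, 5.0, 7.0]       # actual values of the outcome variable
y_hat = [2.0, 5.0, 9.0]   # values predicted by some model
print(mse(y, y_hat))      # (1 + 0 + 4) / 3
```

Because the differences are squared, a single large miss contributes much more to the MSE than several small ones.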


## Machine Learning Coding Conventions


The data set, which has dimensions N x (p + 1), that is, N rows with p features plus the target variable, is first split into three groups: train, test, and validate. In code we typically label them dfTrain, dfTest, and dfVal, where dfTrain has n_train rows, dfTest has n_test rows, and dfVal has n_val rows.

Therefore n_train + n_test + n_val = N

A typical split is:

1. n_train = N * .5
2. n_test  = N * .25
3. n_val   = N * .25
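
The row split can be sketched in plain Python as follows. The helper `split_rows`, the shuffling step, the seed, and the toy data are assumptions added for illustration; the names dfTrain, dfTest, and dfVal follow the convention used in this chapter, although here they are plain lists rather than data frames.

```python
import random


def split_rows(rows, seed=0):
    """Shuffle rows and split them 50% / 25% / 25% into train, test, val."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # shuffling first is an assumption of this sketch
    N = len(rows)
    n_train = N // 2
    n_test = N // 4
    dfTrain = rows[:n_train]
    dfTest = rows[n_train:n_train + n_test]
    dfVal = rows[n_train + n_test:]    # the remainder, so the sizes always sum to N
    return dfTrain, dfTest, dfVal


# Hypothetical data set: N = 20 rows, each with p = 2 features plus the target
data = [(i, 2 * i, 3 * i + 1) for i in range(20)]
dfTrain, dfTest, dfVal = split_rows(data)
print(len(dfTrain), len(dfTest), len(dfVal))  # 10 5 5
```

Giving the validation group whatever rows remain keeps n_train + n_test + n_val equal to N even when N is not divisible by four.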

Each of the three groups is then separated into the __target variable__ and the __features__, which we label in the code as follows:

1. y_train (n_train x 1) and X_train (n_train x p)
2. y_test (n_test x 1) and X_test (n_test x p)
3. y_val (n_val x 1) and X_val (n_val x p)


The y_* variables are vectors, i.e. one-dimensional columns. Each has the same number of rows as the corresponding group of the __cleaned and transformed__ data set, so the dimensions of y_* are n x 1, where n is the number of rows in that group.

X_* has dimensions n x p, where p is the number of __features__ or __explanatory variables__ we want to use to predict the __target variable__.
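
The separation into target and features, and the resulting dimensions, can be sketched like this. The helper `split_target`, the toy rows, and the assumption that the target sits in the last column are all hypothetical.

```python
def split_target(rows, target_index):
    """Separate each row into the target value (y) and the p feature values (X)."""
    y = [row[target_index] for row in rows]
    X = [[v for j, v in enumerate(row) if j != target_index] for row in rows]
    return y, X


# Hypothetical training group: n_train = 3 rows, p = 2 features, target in the last column
dfTrain = [
    (1.0, 2.0, 10.0),
    (3.0, 4.0, 20.0),
    (5.0, 6.0, 30.0),
]
y_train, X_train = split_target(dfTrain, target_index=2)

n_train, p = len(X_train), len(X_train[0])
print(len(y_train), (n_train, p))  # 3 (3, 2)
```

The same call applied to dfTest and dfVal would give y_test, X_test, y_val, and X_val with matching row counts.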

As for naming, we code the __target variable__ as


## Algorithms