
# Classification

## Some approaches for classification

  1. Linear regression of an indicator matrix
  2. Linear Discriminant Analysis
  3. Quadratic Discriminant Analysis
  4. Regularized Discriminant Analysis [Source: ESLR Page-112]
  5. Logistic Regression

## What is log odds?

  1. Log Odds Definition
  2. What and Why of Log Odds

The odds of an event is the probability of success divided by the probability of failure. As an equation, that's P(A)/P(-A), where P(A) is the probability of A and P(-A) is the probability of 'not A' (i.e. the complement of A).

Taking the logarithm of the odds gives us the log odds of A, which can be written as

log odds(A) = log( P(A) / P(-A) )

Since the probability of the event not happening, P(-A), is equal to 1 − P(A), we can write the log odds as

log odds = log( p / (1 − p) )

where p = the probability of the event happening and 1 − p = the probability of the event not happening.
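
A minimal sketch of these definitions (the probability value 0.8 is just an arbitrary example): compute the odds and log odds from p, then recover p again with the sigmoid function.

```python
import numpy as np

# Arbitrary example probability of the event A happening
p = 0.8

odds = p / (1 - p)          # P(A) / P(-A)  -> 4.0
log_odds = np.log(odds)     # log(p / (1 - p)) -> ~1.386

# The logistic (sigmoid) function inverts the log odds back to a probability
p_recovered = 1 / (1 + np.exp(-log_odds))   # ~0.8

print(odds, log_odds, p_recovered)
```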

## MLE Estimation for Logistic Regression

Although we could use (non-linear) least squares to fit the logistic model, the more general method of maximum likelihood is preferred, since it has better statistical properties. In logistic regression, we model the class probability with the logistic function,

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)) ... (1)

so that

1 − p(X) = 1 / (1 + e^(β0 + β1X)) ... (2)

Dividing (1) by (2) and taking the log on both sides shows that the log odds is linear in X:

log( p(X) / (1 − p(X)) ) = β0 + β1X

Let us make the following parametric assumption:

P(y = 1 | x; w) = σ(wᵀx) = 1 / (1 + e^(−wᵀx)), i.e. y | x is Bernoulli with this success probability.

MLE is used to find the model parameters by maximizing

P(observed data | model parameters)

For logistic regression, we need to find the model parameter w that maximizes the conditional likelihood,

w* = argmax_w ∏ᵢ P(yᵢ | xᵢ; w), or equivalently its log, ∑ᵢ log P(yᵢ | xᵢ; w)
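
The sketch below illustrates this idea with plain gradient ascent on the conditional log-likelihood; the toy data, learning rate, and iteration count are assumptions made purely for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_mle(X, y, lr=0.1, n_iter=5000):
    """Maximize sum_i log P(y_i | x_i; w) by gradient ascent.
    X: (n, d) design matrix (include a column of ones for the intercept),
    y: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)             # P(y = 1 | x; w)
        grad = X.T @ (y - p)           # gradient of the log-likelihood
        w += lr * grad / len(y)
    return w

# Toy data generated from a known w (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_w)).astype(float)

X_design = np.hstack([np.ones((200, 1)), X])   # add intercept column
w_hat = fit_logistic_mle(X_design, y)
print(w_hat)   # should roughly recover [0, 1.5, -2.0]
```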

Resources to understand MLE estimation for Logistic Regression

  1. lecture05.pdf (zstevenwu.com)
  2. Logit.dvi (rutgers.edu)
  3. ADAfaEPoV (cmu.edu)
  4. A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation (machinelearningmastery.com)
  5. Logistic Regression and Maximum Likelihood Estimation Function | by Puja P. Pathak | CodeX | Medium

## Difference b/w Logistic Regression & Linear Discriminant Analysis

[Source: ISLR Page-151]

| Logistic Regression | Linear Discriminant Analysis |
| --- | --- |
| Parameters are estimated using maximum likelihood estimation. | Parameters are estimated using the means and variances estimated from the assumed normal distributions. |
| Decision boundary: linear | Decision boundary: linear |
| Logistic regression can outperform LDA if the Gaussian assumptions are not met. | LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, and so can provide some improvement over logistic regression when this assumption approximately holds. |
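
As a rough illustration of how similar the two methods behave when a linear boundary is adequate, here is a hedged scikit-learn sketch; the synthetic dataset and cross-validation setup are arbitrary choices, not part of the sources cited above.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-class data (assumed for illustration)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)

for name, clf in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                  ("LDA", LinearDiscriminantAnalysis())]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```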

## Difference b/w Linear & Quadratic Discriminant Analysis

[Source: ESLR Page-109] and lecture9-stanford

| Linear DA | Quadratic DA |
| --- | --- |
| All classes share a common covariance matrix: Σk = Σ ∀ k | Each class has its own covariance matrix Σk |
| Decision boundary: linear | Decision boundary: quadratic |
| Discriminant function: δk(x) = xᵀΣ⁻¹μk − ½ μkᵀΣ⁻¹μk + log πk | Discriminant function: δk(x) = −½ log\|Σk\| − ½ (x − μk)ᵀΣk⁻¹(x − μk) + log πk |
| Since the covariance matrix is common to all classes, no such problem arises. | Since a separate covariance matrix must be estimated for each class, the number of parameters increases dramatically when p (#features) is large. |
| [Source: ISLR Page-142] The LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and a common variance σ². | [Source: ISLR Page-142] The QDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and covariance matrix Σk. |
| With p predictors, estimating the single shared covariance matrix requires estimating p(p+1)/2 parameters. | With p predictors and K classes, estimating the K covariance matrices requires estimating K·p(p+1)/2 parameters. |
| LDA is a much less flexible classifier. | QDA is a more flexible classifier. |
| Lower variance, higher bias | Higher variance, lower bias |
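
A small sketch of this flexibility trade-off: on toy data where the two classes have clearly different covariance matrices, QDA's quadratic boundary should help, and the p(p+1)/2 vs K·p(p+1)/2 counts show why QDA becomes expensive for large p. The data-generating choices and example values of p and K below are assumptions for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Two Gaussian classes with different covariance matrices (toy data),
# i.e. the setting where QDA's assumptions hold and LDA's do not.
n = 500
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n)
X1 = rng.multivariate_normal([1, 1], [[1.0, -0.8], [-0.8, 1.0]], size=n)
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    print(name, "mean accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Covariance parameter counts: p(p+1)/2 for LDA vs K*p(p+1)/2 for QDA
p, K = 50, 3
print("LDA params:", p * (p + 1) // 2, " QDA params:", K * p * (p + 1) // 2)
```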

## What happens when the classes are well separated in Logistic Regression?

When the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable; linear discriminant analysis does not suffer from this problem. [Source: ESLR, Page-128] If the data in a two-class logistic regression model can be perfectly separated by a hyperplane, the maximum likelihood estimates of the parameters are undefined (i.e., infinite; see Exercise 4.5). The LDA coefficients for the same data will be well defined, since the marginal likelihood will not permit these degeneracies.

Further reading:

  1. https://stats.stackexchange.com/questions/224863/understanding-complete-separation-for-logistic-regression
  2. https://stats.stackexchange.com/questions/239928/is-there-any-intuitive-explanation-of-why-logistic-regression-will-not-work-for
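
The instability is easy to reproduce on a tiny, perfectly separable toy dataset: the unpenalized MLE is infinite, and the only reason scikit-learn returns a finite coefficient is its default L2 penalty, which the sketch below progressively weakens.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable 1-D toy data: every x < 0 is class 0, every x > 0 is class 1.
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# As the regularization is weakened (C -> infinity), the fitted coefficient
# keeps growing without bound, reflecting the undefined (infinite) MLE.
for C in [1, 100, 10_000, 1_000_000]:
    clf = LogisticRegression(C=C, max_iter=10_000).fit(X, y)
    print(f"C={C:>9}: coef = {clf.coef_[0][0]:.2f}")
```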

## Compare SVM & Logistic Regression

[Source: ISLR Page-357] The SVM (hinge) loss is exactly zero for observations that are on the correct side of the margin. In contrast, the loss function for logistic regression is not exactly zero anywhere, but it is very small for observations that are far from the decision boundary. Due to the similarity of their loss functions, logistic regression and the support vector classifier often give very similar results. When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred.
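
A small numerical comparison of the two loss shapes as a function of the margin y·f(x) (a sketch of the behaviour described above, not tied to any particular dataset): the hinge loss is exactly zero once the margin reaches 1, while the logistic loss only approaches zero.

```python
import numpy as np

# Margins y * f(x): positive values mean the correct side of the boundary.
margins = np.linspace(-2, 3, 11)

hinge_loss = np.maximum(0.0, 1.0 - margins)       # exactly zero for margin >= 1
logistic_loss = np.log(1.0 + np.exp(-margins))    # never exactly zero, but -> 0

for m, h, l in zip(margins, hinge_loss, logistic_loss):
    print(f"margin={m:5.1f}  hinge={h:6.3f}  logistic={l:6.3f}")
```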