- Linear regression of an indicator matrix
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- Regularized Discriminant Analysis [Source: ESLR Page-112]
- Logistic Regression
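As a quick illustration of how a few of the methods listed above are used in practice, here is a minimal scikit-learn sketch; the toy dataset from make_classification and the train/test split are illustrative choices, not from the sources cited here.

```python
# Sketch: fit LDA, QDA, and logistic regression on a toy dataset (scikit-learn assumed available).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis()),
                  ("Logistic", LogisticRegression())]:
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))   # test-set accuracy
```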
The odds of an event A are the probability of success divided by the probability of failure. As an equation, that is P(A)/P(-A), where P(A) is the probability of A and P(-A) is the probability of 'not A' (i.e. the complement of A).
Taking the logarithm of the odds gives us the log odds of A, which can be written as
log odds of A = log(P(A)/P(-A)). Since P(-A), the probability of the event not happening, is equal to 1 - P(A), we can write the log odds as
log odds = log(p / (1 - p)),
where p = the probability of the event happening and 1 - p = the probability of the event not happening.
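A small numeric check of these definitions (the value p = 0.8 is just an example):

```python
import numpy as np

p = 0.8                      # P(A): probability of the event happening
odds = p / (1 - p)           # P(A) / P(-A) = 0.8 / 0.2 = 4.0
log_odds = np.log(odds)      # log(p / (1 - p)) ~= 1.386
print(odds, log_odds)
```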
Although we could use (non-linear) least squares to fit the logistic model, the more general method of maximum likelihood is preferred, since it has better statistical properties. In logistic regression, we model the class probability with the logistic function,
p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)),   (1)
so that
1 - p(X) = 1 / (1 + e^(β0 + β1X)).   (2)
Dividing (1) by (2) and taking the log of both sides gives the log odds as a linear function of X:
log(p(X) / (1 - p(X))) = β0 + β1X.
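The derivation can be verified numerically; in this sketch the coefficients β0 = -1, β1 = 2 and the input x = 0.7 are arbitrary example values:

```python
import numpy as np

beta0, beta1, x = -1.0, 2.0, 0.7                                   # example coefficients and input
p = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))    # logistic function, eq. (1)
one_minus_p = 1 / (1 + np.exp(beta0 + beta1 * x))                  # its complement, eq. (2)
logit = np.log(p / one_minus_p)                                    # dividing (1) by (2), then log
print(logit, beta0 + beta1 * x)                                    # both equal 0.4
```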
Let us make the following parametric assumption:
P(Y = 1 | x; w) = 1 / (1 + e^(-w·x)),
i.e. the conditional probability of the positive class is the logistic (sigmoid) function of the linear score w·x.
MLE finds the model parameters by maximizing
P(observed data | model parameters).
For logistic regression, we need to find the model parameter w that maximizes the conditional probability of the observed labels given the inputs,
∏_i P(y_i | x_i; w),
or equivalently its logarithm, ∑_i log P(y_i | x_i; w).
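A hedged sketch of that maximization using plain gradient ascent on the conditional log-likelihood; the synthetic data, learning rate, and iteration count are illustrative choices, not taken from the sources below.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([1.5, -2.0])
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)   # labels drawn from the assumed model

w = np.zeros(2)
for _ in range(2000):
    p = sigmoid(X @ w)            # P(Y=1 | x; w) under the parametric assumption
    grad = X.T @ (y - p)          # gradient of the conditional log-likelihood
    w += 0.01 * grad              # ascend: increase sum_i log P(y_i | x_i; w)

print(w)                          # should land close to w_true
```

Each update moves w in the direction that increases ∑_i log P(y_i | x_i; w); there is no closed-form solution, which is why iterative methods (gradient ascent here, Newton/IRLS in most software) are used.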
- lecture05.pdf (zstevenwu.com)
- Logit.dvi (rutgers.edu)
- ADAfaEPoV (cmu.edu)
- A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation (machinelearningmastery.com)
- Logistic Regression and Maximum Likelihood Estimation Function | by Puja P. Pathak | CodeX | Medium
[Source: ISLR Page-151]
[Source: ESLR Page-109; lecture9-stanford]
When the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable; linear discriminant analysis does not suffer from this problem. [Source: ESLR, Page-128] If the data in a two-class logistic regression model can be perfectly separated by a hyperplane, the maximum likelihood estimates of the parameters are undefined (i.e., infinite; see Exercise 4.5). The LDA coefficients for the same data will be well defined, since the marginal likelihood will not permit these degeneracies.
- https://stats.stackexchange.com/questions/224863/understanding-complete-separation-for-logistic-regression
- https://stats.stackexchange.com/questions/239928/is-there-any-intuitive-explanation-of-why-logistic-regression-will-not-work-for
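The complete-separation behaviour is easy to reproduce. In this sketch (scikit-learn assumed; the tiny 1-D dataset and the large C used to approximate unregularized maximum likelihood are illustrative choices), the logistic coefficient is driven to a large value while the LDA coefficient stays well defined:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Two perfectly separated 1-D classes.
X = np.array([[0.0], [1.0], [2.0], [2.5], [3.5], [4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# With essentially no regularization (very large C), the MLE is unbounded:
# the solver pushes the coefficient to a large value, and a looser tolerance
# or more iterations would push it further still.
logit = LogisticRegression(C=1e10, max_iter=10_000).fit(X, y)
print("logistic coef:", logit.coef_.ravel())

# LDA's estimates come from class means and a pooled covariance,
# so they remain moderate and well defined on the same data.
lda = LinearDiscriminantAnalysis().fit(X, y)
print("LDA coef:", lda.coef_.ravel())
```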
[Source: ISLR Page-357] The SVM loss function is exactly zero for observations that lie on the correct side of the margin. In contrast, the loss function for logistic regression is not exactly zero anywhere, but it is very small for observations that are far from the decision boundary. Because of the similarities between their loss functions, logistic regression and the support vector classifier often give very similar results. When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred.
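A short sketch of that loss-function comparison, evaluating the hinge loss max(0, 1 - y·f(x)) and the logistic loss log(1 + e^(-y·f(x))) at a few example margins:

```python
import numpy as np

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0, 5.0])   # y * f(x), example values

hinge = np.maximum(0.0, 1.0 - margins)       # SVM loss: exactly zero once the margin reaches 1
logistic = np.log(1.0 + np.exp(-margins))    # logistic loss: small far from the boundary, never zero

for m, h, l in zip(margins, hinge, logistic):
    print(f"margin={m:5.1f}  hinge={h:.4f}  logistic={l:.4f}")
```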