This is a concept that is well known in the context of supervised learning, where we have some labelled data and want to estimate an unknown function.
Wikipedia's definitions for bias and variance are as follows:
- The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting).
Consider how machine learning algorithms are influenced differently by their assumptions (bias) about the input and output, and consequently have different error variances. Algorithms with high variance are strongly influenced by the characteristics of the training data: those characteristics determine the number and types of parameters used to characterise the hypothesis function [https://machinelearningmastery.com].
The bias-variance trade-off is a central problem in supervised learning. Ideally, one wants to choose a model that accurately captures the regularities in its training data and generalises well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with low variance typically produce simpler models that don't tend to overfit but may underfit their training data, failing to capture important regularities [Wikipedia].
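As a rough illustration, here is a minimal sketch using only NumPy (the polynomial degrees, sample sizes, and noise level are arbitrary choices for this illustration): a low-degree fit underfits the data (high bias), while a very high-degree fit chases the training noise (high variance) and generalises poorly.

```python
# Fit polynomials of increasing degree to a small noisy dataset and compare
# training error with error on held-out data.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)  # noisy sine curve
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)              # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")

# Typically, degree 1 shows high bias (both errors are high), while degree 15
# shows high variance (low training error, much higher test error).
```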
Usually, algorithms in Computer Science are analysed based on the size of the input and how much (memory) space they need to run. While this still holds for machine learning algorithms, there is one additional question to answer for them: how many input instances are necessary for the learning process to succeed. This is important to consider because, in practice, there may not be enough samples available to guarantee the intended results.
SVM (Support Vector Machine) is a supervised machine learning algorithm used for classification and regression analysis. The goal of the SVM algorithm is to find the best line (or hyperplane) that separates the data into two classes; multi-class problems are typically handled by combining several such binary classifiers.
The math behind SVM involves finding the optimal hyperplane that maximises the margin between the classes, meaning it has the largest distance to the nearest data points of each class. These nearest data points are the support vectors and play a crucial role in defining the optimal hyperplane.
Mathematically, the SVM algorithm tries to find the hyperplane that maximises the following objective function:

$$\arg\max_{\mathbf{w},\,b}\left\{\frac{1}{\lVert\mathbf{w}\rVert}\min_{n}\left[t_n\left(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n + b\right)\right]\right\}$$

where $\mathbf{w}$ is the normal vector to the hyperplane and $b$ is the bias term. The term $t_n$ represents the class label (-1 or 1) of the data point $\mathbf{x}_n$, and the inner expression divided by $\lVert\mathbf{w}\rVert$ is the distance of $\mathbf{x}_n$ from the hyperplane, so maximising this objective yields the largest possible margin between the hyperplane and the nearest data points.
In summary, SVM works by finding the hyperplane with the maximum margin that separates the classes, and the margin is defined by the support vectors which are the nearest data points to the hyperplane.
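For illustration, here is a minimal sketch of a linear SVM on a made-up two-class dataset, assuming scikit-learn is available; the data points and the value of `C` are arbitrary choices for this example.

```python
import numpy as np
from sklearn.svm import SVC

# Two small clusters, labelled -1 and +1 (toy data for illustration only).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])
t = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a large C approximates a hard margin
clf.fit(X, t)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # bias term
print("w =", w, ", b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width =", 2 / np.linalg.norm(w))
```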
L1 and L2 regularisation are methods to prevent overfitting in machine learning models by adding a penalty term to the loss function.
L1 regularisation, also known as Lasso, adds the absolute value of the coefficients as a penalty term to the loss function. This results in some coefficients being exactly equal to zero, effectively performing feature selection.
L2 regularisation, also known as Ridge Regression, adds the squared values of the coefficients as a penalty term to the loss function. This has the effect of shrinking the coefficients towards zero, but unlike L1 regularisation, it does not result in any coefficients being exactly equal to zero.
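To see the difference concretely, here is a minimal sketch, assuming scikit-learn is available, that fits Lasso (L1) and Ridge (L2) models to synthetic data in which only the first three features matter; the data and the penalty strengths (`alpha`) are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.normal(size=(n, d))
# Only the first three features actually influence the target; the rest are noise.
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * (d - 3))
y = X @ true_w + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: alpha * sum(|w_i|)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: alpha * sum(w_i^2)

print("Lasso coefficients:", np.round(lasso.coef_, 3))  # several exact zeros
print("Ridge coefficients:", np.round(ridge.coef_, 3))  # shrunk, but non-zero
```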
Regularisation is necessary in many cases because it helps to prevent overfitting. Overfitting occurs when a model fits the training data too well and becomes overly complex, capturing the noise and random fluctuations in the data rather than the underlying patterns. This results in poor generalisation performance on unseen data.
Regularisation helps to avoid overfitting by adding a penalty term to the loss function that discourages the model from assigning too much importance to any one feature. This forces the model to strike a balance between fitting the training data well and having a simple, generalisable structure.
A Decision Tree is a tree-based algorithm used for both classification and regression tasks. It works by recursively partitioning the data into smaller subsets based on the values of the features, with the goal of producing subsets that contain samples belonging to the same class or having similar target values.
The algorithm starts at the root of the tree and selects a feature to split the data on based on some criterion, such as information gain or Gini impurity. The samples are then divided into branches based on the values of the selected feature, and this process is repeated at each internal node of the tree until a stopping condition is reached.
For example, in a binary classification problem, if a certain feature is selected at a node, the samples are divided into two branches: one for samples with values below a certain threshold and another for samples with values above the threshold. This process continues until a node contains samples of only one class, or until a predetermined maximum depth is reached.
The final result of the Decision Trees algorithm is a tree-like model that can be used to make predictions for new data points. To make a prediction, the algorithm follows a path through the tree based on the values of the features in the sample, until it reaches a leaf node. The prediction is then the class label or target value associated with that leaf node.
In summary, a Decision Tree works by recursively dividing the data into smaller subsets based on feature values, with the goal of creating nodes that contain samples with the same class label or target value. The final result is a tree-like model that can be used for making predictions.
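As a concrete illustration, here is a minimal sketch, assuming scikit-learn is available, that trains a small decision tree on the bundled Iris dataset and prints the learned splits.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Each internal node tests one feature against a threshold; each leaf predicts
# the majority class of the training samples that reach it.
print(export_text(tree, feature_names=list(iris.feature_names)))
print("prediction for one sample:", tree.predict(iris.data[:1]))
```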
8. What are Features, Labels, Training samples, Validation samples, Test samples, Loss function, Hypothesis set and Examples?
The hashing trick (named by analogy with another useful technique, the kernel trick) is an efficient method of representing features (usually a very large collection of discrete features) as numerical vectors, a process called feature vectorisation.
To understand the main idea behind this method, consider this somewhat oversimplified example:
Assume we have two document samples called document A and document B containing a few words each:
| Document A | Document B |
|---|---|
| This is a short sentence. | This is a kinda longer sentence that I am writing here. |
Now we need a hash function and an array for each data sample. Our hash function maps an input string to an integer ranging from 1 to the size of the array.
For example, our hash function might look something like this:
| Input word | Output |
|---|---|
| This | 1 |
| is | 2 |
| a | 3 |
| short | 4 |
| sentence | 5 |
| kinda | 6 |
| longer | 7 |
| that | 8 |
| I | 9 |
| am | 10 |
| writing | 11 |
| here | 12 |
Now, if we hash each word (ignoring punctuation and whitespace) from document A and document B using our hash function and use the output integer as an index into the array, we can represent the two documents as two equally sized bit-arrays (arrays holding only 0s and 1s as values). If a word is present in the document, we put 1 into the array cell at the index corresponding to that word, and 0 otherwise.
Applying this scheme, the vectorised representations of documents A and B would look like this:
| Document | index 1 | index 2 | index 3 | index 4 | index 5 | index 6 | index 7 | index 8 | index 9 | index 10 | index 11 | index 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
As you can see, in our example we represented two documents containing words (a complex feature set) by two very compact bit-arrays (111110000000 for document A and 111011111111 for document B), each of which needs only 12 bits per document.
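Here is a minimal sketch of this scheme in plain Python. Instead of the hand-made lookup table above, it uses a real hash function (SHA-256 here, an arbitrary choice) reduced modulo the array size, so different words may collide on the same index; that is simply tolerated.

```python
import hashlib

def hash_vector(text, size=12):
    """Map a document to a fixed-size presence/absence vector via hashing."""
    vec = [0] * size
    for word in text.lower().replace(".", "").split():
        # Stable hash of the word, reduced to an index in [0, size).
        idx = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % size
        vec[idx] = 1   # mark presence; word counts could be stored instead
    return vec

doc_a = "This is a short sentence."
doc_b = "This is a kinda longer sentence that I am writing here."
print(hash_vector(doc_a))
print(hash_vector(doc_b))
```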
The hashing trick is mainly useful for dimensionality reduction on large datasets because i) it can easily vectorise complex features (e.g. words or terms in text); and ii) it can be very efficient when the feature space is so large that it is not feasible to hold everything in main memory while learning from or working with the data, since we can use smaller arrays and still get a good representation. One famous use-case of the hashing trick (also known as feature hashing) is its successful application to the problem of detecting spam emails, where it is used to embed each sample into a lower-dimensional space.
There are various strategies for implementing the hashing trick, including the use of different hash functions and of more complex values instead of just 0s and 1s; you can check out the Wikipedia article on the hashing trick and this article for more information.
33. Describe various strategies to tune up the hyper-parameters of a particular learning algorithm in general
51. What is the difference between In-sample evaluation and Holdout evaluation of a learning algorithm?
57. What is K-Scheme (Categorical Feature) Encoding? What are some drawbacks of one-hot categorical feature encoding?
77. How does Machine Learning modelling fare against old-school mathematical modelling such as modelling a real-world system via the use of differential equations?
A tensor is a high-dimensional array (3 or more dimensions; scalars, 1-dimensional arrays, and matrices are considered low-order tensors, where order means the number of dimensions an array has) of numerical values. In other words, tensors are the generalisation of matrices, so that the usual linear-algebraic rules and operations applicable to simple arrays and matrices can be used for working in higher dimensions.
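As a small illustration, here is a minimal NumPy sketch of a third-order tensor and a tensor contraction that generalises matrix multiplication.

```python
import numpy as np

T = np.arange(24, dtype=float).reshape(2, 3, 4)   # a 2 x 3 x 4 tensor (order 3)
M = np.ones((4, 5))                               # an ordinary matrix (order 2)

print("order (number of dimensions):", T.ndim)    # 3
print("shape:", T.shape)                          # (2, 3, 4)

# Contract (sum over) the last axis of T with the first axis of M, giving a
# 2 x 3 x 5 tensor; applied to two matrices, this reduces to ordinary
# matrix multiplication.
R = np.tensordot(T, M, axes=([2], [0]))
print("result shape:", R.shape)                   # (2, 3, 5)
```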
Tensors were introduced early in the 20th century but did not receive much attention in Computer Science until recently. Tensors are relatively low-level mathematical objects (much like 1-dimensional arrays and matrices); this article contains a self-contained and thorough introduction to some applications of tensors, demonstrating their usefulness in various setups (from finding unique matrix decompositions more easily to learning a generative latent- or hidden-variable model via tensor decomposition techniques).
Informally, Bonferroni's principle states that, given a rather large collection of samples representing a set of events, you can calculate the expected number of occurrences of each distinct event in the dataset under the assumption that events happen completely at random. If this expected number is significantly larger than the actual number of occurrences you hope to find, you can be quite sure that many of the events you do find are bogus or invalid (events that look like what you are searching for but are observed only because of the underlying randomness in the dataset).
This is a very good example of a practical Data Mining problem: when you try to detect a rare event in a huge dataset, you can easily be tricked into falsely believing that you have identified many more instances of that event than are actually present. Bonferroni's principle warns us that we should only search for events rare enough to be very unlikely in random data, so that we can be confident the occurrences we do find are not mere artifacts of randomness.
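As a back-of-the-envelope illustration, consider a hypothetical scenario (all numbers below are made-up assumptions) in which we look for pairs of people who were seen in the same hotel on two different days, hoping this signals that they secretly know each other. The sketch below computes how many such "suspicious" pairs pure randomness alone would produce.

```python
from math import comb

people = 10**9     # population size (assumption)
hotels = 10**5     # number of hotels (assumption)
days = 1000        # days of data (assumption)
p_visit = 0.01     # probability a given person visits some hotel on a given day

# Probability that two specific people are in the same hotel on a specific day...
p_same_hotel_one_day = (p_visit * p_visit) / hotels
# ...and in the same hotel on each of two specific days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2

expected_bogus_pairs = comb(people, 2) * comb(days, 2) * p_same_hotel_two_days
print(f"expected 'suspicious' pairs under pure randomness: {expected_bogus_pairs:,.0f}")

# Roughly 250,000 pairs are expected by chance alone under these assumptions, so
# finding a few hundred thousand such pairs in real data tells us almost nothing:
# most of them would be bogus.
```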
You can find more information about Bonferroni's principle in Mining of Massive Datasets.