Home
Below is a brief description of the contents of the repository.
-
Ridge and Lasso (L2 & L1 Regression) : A simple example of L1 and L2 regularization using the Boston Housing and Breast Cancer data-sets. Along with the complete workflow, the reason L1 regularization leads to feature selection is described in detail here. Codes.
-
Understanding PCA: An example of correctly applying PCA using the Breast Cancer data-set. How dimensionality reduction helps simplify problems is also discussed. The code is adapted from the original example given in Prof. Andreas Muller's book. More description here. Codes
-
Example of Pipeline in ML: Uses the Scikit-Learn Pipeline to apply a list of transforms followed by a final estimator. A simple example applies PCA + SVM in a pipeline to the red wine quality data-set for a classification task. Grid search is also applied to find the best parameters, and a detailed explanation of the workflow can be found here. Codes
-
Support Vector Machine: The theory of the Support Vector Machine algorithm is discussed in two separate posts in TDS, "Decision Rule" and "Mercer's Theorem and Kernels". Plotting the decision boundary together with the support vectors can be a little tricky, so this is also covered using the Breast Cancer data-set. For more, check TDS. Codes.
-
Classification using Density Based Clustering: The DBSCAN algorithm is used to spatially cluster weather stations in Canada. This was one of the projects in the IBM Data-Science course. The theory of the DBSCAN algorithm is described in detail in a TDS post, including the code. Codes.
-
Decision Tree : Using the Bank Marketing data-set, a Decision Tree algorithm is used to classify whether a client will subscribe to a term deposit or not. The classification is based on Gini impurity; a detailed description of the theory and an explanation of the code can be found in the TDS post. Folder.
-
Consumer Complaint Classification : Performs classification on the consumer complaint data-set to determine which class a complaint belongs to. Several classifiers (SVM, Random Forest, Logistic Regression, Multinomial Naive Bayes) were tested. TfidfVectorizer is used to convert the text to a TF-IDF feature matrix. The original post was written by Susan Li in TDS.
-
Bayesian Approach to ML :
-
Conjugate Priors : The concept of conjugate priors, and why and when they are useful, is discussed in detail in a TDS post. The notebook should be fairly self-explanatory.
-
Expectation Maximization Algorithm : Possibly one of the most important concepts in probabilistic ML. It is discussed in detail with reference to the Gaussian Mixture Model. For the detailed post, check TDS. The notebook should be self-explanatory too!
-
-
Data Cleaning : A basic-to-intermediate introduction to data cleaning using Pandas. The IMDB Movie data-set is used to show various examples.
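
For readers who want a quick feel for the Pipeline entry above (PCA + SVM with a grid search), here is a minimal sketch. It is not the repository's code: it assumes scikit-learn's built-in `load_wine` data-set as a stand-in for the red wine quality data, adds a `StandardScaler` step that the description does not mention, and uses illustrative step names and parameter grids.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: load_wine stands in for the red wine quality CSV used in the repo.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Transforms followed by a final estimator; the step names are hypothetical.
pipe = Pipeline([
    ("scaler", StandardScaler()),  # assumption: scaling before PCA and SVM
    ("pca", PCA()),
    ("svm", SVC(kernel="rbf")),
])

# Grid keys use the "<step name>__<parameter>" convention; values are illustrative.
param_grid = {
    "pca__n_components": [2, 5],
    "svm__C": [0.1, 1, 10],
    "svm__gamma": [0.01, 0.1],
}

# Each candidate is cross-validated on the training split only,
# so the scaler and PCA are refit inside every fold.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params)
print(f"test accuracy: {test_accuracy:.3f}")
```

Wrapping the transforms and the estimator in one pipeline means the grid search tunes them together and avoids leaking information from the validation folds into the fitted scaler or PCA.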