waqarahmed1/Data-Science

Data-Science

Covers the theoretical and mathematical underpinnings of ML and their practical application through PySpark, Pandas, scikit-learn, Matplotlib and TensorFlow.

Deep dive across four main sections:

    1. Python for Data Science
    2. Statistics and Probability
    3. Machine Learning Fundamentals
    4. Big Data and Spark

1 Python for Data Science

Covers the foundational Python tools used for data analysis:

  • Jupyter Notebooks
  • NumPy
  • Pandas (Series & DataFrames)
  • Matplotlib
  • scikit-learn
  • NLTK
  • Plus some Unix for data analysis

2 Statistics and Probability

Reviews the statistics and probability concepts that underpin machine learning:

  • Set Theory

  • Combinatorics

  • Probability (Axiomatic Formulations)

  • Conditional Probability

  • Random Variables, Cumulative Distribution Functions, Expectation, Variance and Covariance

  • Discrete Distribution Families

    • Bernoulli
    • Binomial
    • Poisson
    • Geometric
  • Continuous Distribution Families

    • Uniform
    • Exponential
    • Gaussian
  • Inequalities (Markov, Chebyshev, Moment Generating Functions and the Chernoff Bound) and Limit Theorems (Law of Large Numbers and Central Limit Theorem)

  • Parameter Estimation and Confidence Interval

  • Regression (Linear and Polynomial) and Principal Component Analysis (PCA)

  • Hypothesis Testing (P Values, Z and T tests)
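
As a small illustration of the inequalities topic above, the following sketch checks Chebyshev's inequality, P(|X - mu| >= k*sigma) <= 1/k^2, empirically on simulated data. The function name `chebyshev_check`, the exponential sample and the fixed seed are assumptions chosen for illustration; they are not taken from the repository's notebooks.

```python
import random
import statistics


def chebyshev_check(samples, k):
    """Return (empirical tail fraction, Chebyshev bound 1/k^2)."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # population standard deviation of the sample
    # Fraction of points lying at least k standard deviations from the mean.
    tail = sum(abs(x - mu) >= k * sigma for x in samples) / len(samples)
    return tail, 1.0 / k**2


rng = random.Random(0)  # fixed seed, purely for reproducibility
samples = [rng.expovariate(1.0) for _ in range(10_000)]  # illustrative data only

for k in (1.5, 2, 3):
    tail, bound = chebyshev_check(samples, k)
    print(f"k={k}: empirical tail {tail:.4f}, Chebyshev bound {bound:.4f}")
```

The bound is distribution-free, so the empirical tail is usually far below it; for a skewed sample such as this one the gap narrows but the inequality still holds.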

3 Machine Learning Fundamentals

Reviews foundational ML concepts from a mathematical perspective, followed by practical applications in Python:

  • K Nearest Neighbour (K-NN) and Distance Functions
  • Generative Modelling; Probability Spaces and Bivariate Gaussians
  • Further Generative Modelling; Multivariate Gaussians
  • Regression; Linear/Logistic and Regularisation
  • Optimisation: Unconstrained, Convexity, Positive Semidefinite Matrices
  • Linear Classification; Support Vector Machines (SVM) and Multiclass Linear Prediction
  • Further Classifiers; Kernels, Kernel SVM, Decision Trees, Boosting and Random Forests
  • Representational Learning; Clustering with K-Means, Gaussians and Hierarchical Clustering
  • Further Representational Learning; Linear Projections, Principal Component Analysis (PCA), Eigenvalues/Eigenvectors and Spectral Decomposition
  • Deep Learning; Autoencoders, Distributed Representation, Feedforward Neural Networks and Training
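
To make the first topic in this list concrete, here is a minimal sketch of K-NN classification with a pluggable distance function, written with the standard library only. The names `knn_predict` and `manhattan` and the toy data are hypothetical and serve only as illustration; the repository's own notebooks may well use scikit-learn instead.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3, distance=math.dist):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; `distance` is any
    function taking two vectors and returning a non-negative number.
    """
    neighbours = sorted(train, key=lambda pair: distance(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def manhattan(a, b):
    """L1 distance, as an alternative to the default Euclidean (L2) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy two-cluster data set (illustrative only).
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((4.0, 4.2), "b"), ((4.1, 3.9), "b"), ((3.8, 4.0), "b")]

print(knn_predict(train, (1.1, 1.0)))                      # Euclidean distance
print(knn_predict(train, (3.9, 4.1), distance=manhattan))  # Manhattan distance
```

Passing the distance as a parameter keeps the voting logic separate from the choice of metric, which is the part that most affects which neighbours are selected.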

4 Big Data and Spark

Reviews machine learning concepts and their practical application at scale using PySpark, RDDs, Spark SQL, DataFrames and TensorFlow.

  • Spark, PySpark infrastructure setup
  • MapReduce concepts and RDDs
  • Spark SQL and DataFrames
  • Covariance and Principal Component Analysis (PCA) with Python
  • PCA Coefficients and PCA Residuals with Python
  • K-Means Clustering and Intrinsic Dimensions
  • Decision Trees, Ensembles and Boosting
  • Neural Networks (NN) review and TensorFlow (Base and Estimator API)
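
The MapReduce idea listed above can be sketched without a cluster. The example below is plain Python, not the Spark API: the helper names `map_phase`, `shuffle` and `reduce_phase` are hypothetical and only mimic the three conceptual stages of a word count. In PySpark the same pattern is commonly expressed on an RDD with `flatMap`, `map` and `reduceByKey`.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key with an associative operation."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


lines = ["spark makes big data simple", "big data needs big tools"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["data"])  # prints "3 2"
```

Because addition is associative and commutative, the reduce step could be applied to partial groups on separate machines and merged afterwards, which is what makes the pattern distributable.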
