Essentials for Data Science (2024/2025)

A master's-level course, part of the Statistics and Data Science master's programme at Leiden University.

⚠️ ⚠️ ⚠️ Prepare your laptop as described in the installation section below.
⚠️ ⚠️ ⚠️ Charge your laptop battery before the lecture.

Teachers

Szymon M. Kiełbasa [LUMC/BDS], coordinator, [email protected]
Ramin Monajemi [LUMC/BDS]
Daniela Grandón Silva [LU/MATH]

Admission requirements

Elementary statistical skills and elements of linear algebra.

Description

The course offers a practical introduction to a few programming languages and tools currently used in data science:

Python is a general-purpose, high-level and easy to learn programming language. It provides a large number of data science libraries (e.g. machine learning, neural networks, data manipulation, data visualization).
SQL is a standard language used to create, query, update and manage relational databases. For example, such databases are used to store large tables with results of experiments.
Git is a tool for tracking changes in files during program development. It is the current standard for collaborative code development.

During the course you will develop Python programs of growing complexity.
You will use state-of-the-art Python data science libraries for data manipulation and visualization (e.g. pandas, Matplotlib). You will apply several standard machine learning methods.
After the course you will be able to program simple reproducible data analyses (consisting of data reading, cleaning, simple modelling, and reporting steps).
You will also learn the fundamentals of relational databases and of the SQL language, and you will practice this knowledge on an example database (SQLite).
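
As a rough illustration of the SQL practice described above, the sketch below uses Python's standard `sqlite3` module with an in-memory database. The table name `measurements`, its columns and the data are invented for this example; they are not taken from the course's example database.

```python
import sqlite3


def mean_value_per_sample(rows):
    """Load (sample, value) rows into a temporary SQLite table and
    return a list of (sample, mean value) tuples ordered by sample."""
    con = sqlite3.connect(":memory:")  # temporary database, nothing is written to disk
    try:
        con.execute("CREATE TABLE measurements (sample TEXT, value REAL)")
        # Placeholders (?) keep the data separate from the SQL text.
        con.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
        cursor = con.execute(
            "SELECT sample, AVG(value) FROM measurements "
            "GROUP BY sample ORDER BY sample"
        )
        return cursor.fetchall()
    finally:
        con.close()


# Hypothetical data, invented for this example.
example_rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
print(mean_value_per_sample(example_rows))
```

With the example rows, sample `a` gets the mean 2.0 and sample `b` the mean 10.0.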

First, you will work alone and practice code development. You will submit your assignments through GitHub.
Later, you will practice shared code development in groups. Group members will be asked to use git to track changes in their code and to share their code with other students through GitHub.

Course Objectives

During the course you will practice writing Python and SQL code. After the course you will be able to:

✍️ Use Python collections (list, tuple, set, dict).
✍️ Use Python flow control statements (if, for, while, exceptions), context managers (with) and define user functions.
✍️ Use Python standard libraries (reading/writing files in different formats; math, statistics, random).
✍️ Use common data science libraries (NumPy, pandas, Matplotlib).
✍️ Use common machine learning libraries to apply simple regression, classification, clustering and dimensionality reduction methods.
✍️ Understand Python classes (instance variables, methods, inheritance).
✍️ Understand relational databases and elementary SQL.
🚫 Use SQL to create, query, update a database.
✍️ Understand ideas behind project versioning with git and GitHub.
🚫 Use git and GitHub for individual and collaborative code development.
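
To give an impression of the first few objectives, here is a small sketch that combines a user function, a `for` loop, a context manager and the four built-in collection types. The function name `count_words` and the sample text are made up for this example; `io.StringIO` stands in for a real file so that nothing needs to exist on disk.

```python
import io


def count_words(lines):
    """Return a dict mapping each lowercased word to its number of occurrences."""
    counts = {}
    for line in lines:  # works for any iterable of strings, including open files
        for word in line.split():
            word = word.lower()
            counts[word] = counts.get(word, 0) + 1
    return counts


# The with statement closes the file-like object when the block ends.
with io.StringIO("Git Python SQL\npython sql\n") as handle:
    counts = count_words(handle)

unique_words = set(counts)       # set of distinct words
pairs = sorted(counts.items())   # list of (word, count) tuples
print(pairs)
print(len(unique_words))
```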

Timetable

The schedule given below might change:

The primary source for lecture, exam and retake dates/locations is Essentials for Data Science course 4433EDASCY schedule at https://rooster.universiteitleiden.nl/. The dates on this page are manually copied and may lag behind.
Important note: The currently assigned lecture room does not provide individual power sockets, so please bring your laptop with enough battery charge for a 4-hour session.
The order/content of the future lectures might be adjusted.
The dates of the assignments and the group assignment might be adjusted if the order of the lectures changes.

The schedule:

(01) Feb. 3rd, 2025 (Szymon/Ramin):
- General course introduction
- Python notebooks
- Python basics
- Python lists and tuples
- Memory organization
- Git/GitHub introduction
(02) Feb. 10th (Szymon/Daniela):
- Python sets and dictionaries
- Git/GitHub practice
- 📙 Assignment A (not graded): start
(03) Feb. 17th (Szymon/Daniela):
- Python flow control and user functions
- 📗 Assignment B (graded): start
(04) Feb. 24th (Szymon):
- Python object oriented programming
- Git/GitHub assignment preparation
- 📙 Assignment A: discussion of solutions
(05) Mar. 3rd (Szymon):
- Python standard libraries and scripts
(06) Mar. 10th (Ramin/Daniela):
- Data manipulation: NumPy [Exercises]
- 📗 Assignment B: deadline (end-of-day)
- 📘 Assignment C (graded): start
(07) Mar. 17th (Ramin/Daniela):
- Data manipulation: pandas [Exercises]
(08) Mar. 31st (Ramin):
- Data visualisation [Exercises]
- 📗 Assignment B: grades and feedback
- 📚 Group Assignment: create groups
(09) Apr. 7th (Szymon):
- Relational databases:
- SQL language:
  - Downloading and connecting to the example database
  - Querying and selecting data (SELECT, LIMIT, AS, ORDER BY, DISTINCT, WHERE, IN, BETWEEN, LIKE) [Exercises]
  - Grouping and summarising (GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT) [Exercises]
- 📚 Group Assignment (graded): start
(10) Apr. 14th (Szymon):
- Relational databases:
- SQL language:
  - Modification statements (UPDATE, INSERT, DELETE) [Exercises]
  - Data definition language (CREATE TABLE, DROP TABLE)
  - Joining tables 1 (INNER JOIN, LEFT JOIN, CREATE TEMP TABLE) [Exercises]
  - Joining tables 2 (UNION, EXCEPT, INTERSECT, self joins, CROSS JOIN, subqueries, EXISTS) [Exercises]
- 📘 Assignment C: deadline (end-of-day)
(11) Apr. 28th (Daniela+Szymon):
- 📘 Assignment C: grades and feedback
(12) May 12th (Daniela):
(13) May 19th (Daniela):
- Machine learning libraries (examples)
  - scikit-learn
  - Keras
- 📚 Group Assignment: deadline (end-of-day)
(14) May 26th (Szymon):
- Git branching and merging
- General Q&A, programming practice
(--) June 13th:
- 🏢 Exam
(--) July 4th:
- 🏢 Retake
Extra materials (in case of interest):
- Python SQL Toolkit and Object Relational Mapper (SQLAlchemy)

Assessment method

Two homework assignments (each 10% of the final grade), a group assignment (20%), and the final written exam (60%).

Components of the final grade:
- Assignments B, C (total weight 0.2):
  - Assignments B and C are separately graded.
  - The grade range is 1-10 for submissions before the deadline. The grade range is 1-7 for submissions after the deadline but before the feedback moment. No submissions will be accepted later (the grade is 1).
  - To pass the course, the mean of Assignment B and C grades must be greater than 5.5.
  - The Assignments B, C rounded mean grade has weight=0.2 in the final grade.
- Group Assignment (weight 0.2):
  - The grade range is 1-10 for submissions before the deadline. The grade range is 1-7 for submissions after the deadline but before the exam day. No submissions will be accepted later (the grade is 1).
  - To pass the course, the group assignment rounded grade must be greater than 5.5.
  - The group assignment rounded grade has weight=0.2 in the final grade.
- Exam/Retake (weight 0.6):
  - Usage of AI-based tools will be prohibited during the exam.
  - The exam consists of two parts: a pen-and-paper quiz and a programming part.
  - The grade range is 1-10.
  - To pass the course, the exam/retake grade must be greater than 5.5.
  - The exam/retake grade has weight=0.6 in the final grade.
  - The exam will cover the course objectives marked with ✍️.
  - The exam will not cover the course objectives marked with 🚫 - these objectives are evaluated in the group assignment.
Final grade:
- The final grade is calculated as a weighted mean of the component grades.
- To pass the course, the final grade needs to be greater than or equal to 6.0.
- The final grade is rounded to the nearest half integer.
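
The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official grading procedure: the function names are hypothetical, each component grade is passed in as a single number (the sketch does not round components itself and uses the same value for the pass check and the weighting), ties in the final rounding go up, and the 6.0 threshold is applied to the rounded final grade.

```python
import math

# Weights taken from the assessment rules above.
WEIGHTS = {"assignments": 0.2, "group": 0.2, "exam": 0.6}


def round_to_half(grade):
    """Round to the nearest multiple of 0.5 (ties go up; an assumption)."""
    return math.floor(grade * 2 + 0.5) / 2


def final_grade(assignments, group, exam):
    """Return (rounded final grade, passed) for the three component grades.

    assignments: mean grade of Assignments B and C
    group:       group assignment grade
    exam:        exam or retake grade
    """
    weighted = (
        WEIGHTS["assignments"] * assignments
        + WEIGHTS["group"] * group
        + WEIGHTS["exam"] * exam
    )
    final = round_to_half(weighted)
    # Every component must exceed 5.5 and the final grade must reach 6.0.
    # Checking the threshold on the rounded grade is an assumption.
    passed = min(assignments, group, exam) > 5.5 and final >= 6.0
    return final, passed


print(final_grade(assignments=7.0, group=8.0, exam=6.5))
```

For the example values the weighted mean is 0.2 × 7.0 + 0.2 × 8.0 + 0.6 × 6.5 = 6.9, which this sketch rounds to 7.0.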

Installation

For the course you will need to bring a laptop with properly installed Python and a development environment.
Install:

Visual Studio Code: A free source-code editor made by Microsoft for Windows, Linux and macOS. Follow the instructions at https://code.visualstudio.com/. Run the editor and install the extensions for Python development (you may then not need to install Python and pip separately).

You may additionally need to install:

Python (version >= 3.9, preferably >= 3.12): Follow the download instructions at https://www.python.org/.
pip: The Python Package Installer. It should already be installed during Python installation. If that is not the case, follow https://pip.pypa.io/en/stable/installation/.
git: A free and open-source distributed version control system. Follow the Downloads instructions provided at https://git-scm.com/. Visual Studio Code extensions for git are recommended.
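
After installation, a quick self-check along the following lines may help confirm that the laptop is ready. This is a minimal sketch, not an official course script: the helper names are hypothetical, and the minimum version (3, 9) is taken from the requirement stated above.

```python
import shutil
import sys

MINIMUM_PYTHON = (3, 9)  # minimum version named in the installation notes


def python_is_recent_enough(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def tool_is_on_path(name):
    """Return True if an executable called `name` can be found on PATH."""
    return shutil.which(name) is not None


print("Python", sys.version.split()[0], "ok:", python_is_recent_enough())
for tool in ("git", "pip"):
    print(tool, "found on PATH:", tool_is_on_path(tool))
```

If `pip` is not found on PATH, it may still be available as `python -m pip`.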

