Course materials for General Assembly's Data Science course in Washington, DC (8/18/15 - 10/29/15).
Instructor: Kevin Markham
Tuesday | Thursday |
---|---|
8/18: Introduction to Data Science | 8/20: Command Line and Version Control |
8/25: Data Reading and Cleaning | 8/27: Exploratory Data Analysis |
9/1: Visualization, Project Discussion Deadline | 9/3: Machine Learning, Project Question and Dataset Due |
9/8: Getting Data | 9/10: K-Nearest Neighbors |
9/15: Basic Model Evaluation | 9/17: Linear Regression |
9/22: First Project Presentation | 9/24: Logistic Regression |
9/29: Advanced Model Evaluation | 10/1: Naive Bayes and Text Data |
10/6: Natural Language Processing | 10/8: Kaggle Competition, Draft Paper Due |
10/13: Decision Trees | 10/15: Ensembling |
10/20: Regularization and Clustering, Peer Review Due | 10/22: Course Review and Bonus Topics |
10/27: Bonus Topics and Final Project Presentation | 10/29: Final Project Presentation |
- Install Git.
- Create an account on the GitHub website.
    - It is not necessary to download "GitHub for Windows" or "GitHub for Mac".
- Install the Anaconda distribution of Python 2.7.x.
    - If you choose not to use Anaconda, here is a list of the Python packages you will need to install during the course.
- We would like to check the setup of your laptop before the course begins:
    - You can have your laptop checked before the intermediate Python workshop on Tuesday 8/11 (5:30pm-6:30pm), at the 15th & K Starbucks on Saturday 8/15 (1pm-3pm), or before class on Tuesday 8/18 (5:30pm-6:30pm).
    - Alternatively, you can walk through the setup checklist yourself.
- Once you receive an email invitation from Slack, join our "DAT8 team" and add your photo.
- Practice Python using the resources below.
- Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
- DataQuest: Uses interactive exercises to teach Python in the context of data science.
- Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
- Introduction to Python: A series of IPython notebooks that do a great job explaining core Python concepts and data structures.
- Python for Informatics: A very beginner-oriented book, with associated slides and videos.
- A Crash Course in Python for Scientists: Read through the Overview section for a very quick introduction to Python.
- Python Quick Reference Guide: My beginner-oriented guide that demonstrates Python concepts through short, well-commented examples.
- Beginner and intermediate workshop code: Useful for review and reference.
- Python Tutor: Allows you to visualize the execution of Python code.
- Course overview (slides)
- Introduction to data science (slides)
- Discuss the course project: requirements and example projects
- Types of data (slides) and public data sources
- Welcome from General Assembly staff
Homework:
- Work through GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows).
- Read through this command line reference, and complete the pre-class exercise at the bottom. (There's nothing you need to submit once you're done.)
- Watch videos 1 through 8 (21 minutes) of Introduction to Git and GitHub, or read sections 1.1 through 2.2 of Pro Git.
- If your laptop has any setup issues, please work with us to resolve them by Thursday. If your laptop has not yet been checked, you should come early on Thursday, or just walk through the setup checklist yourself (and let us know you have done so).
Resources:
- For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
- For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
- Quora has a data science topic FAQ with lots of interesting Q&A.
- Keep up with local data-related events through the Data Community DC event calendar or weekly newsletter.
- Slack tour
- Review the command line pre-class exercise (code)
- Git and GitHub (slides)
- Intermediate command line
Homework:
- Complete the command line homework assignment with the Chipotle data.
- Review the code from the beginner and intermediate Python workshops. If you don't feel comfortable with any of the content (excluding the "requests" and "APIs" sections), you should spend some time this weekend practicing Python:
    - Introduction to Python does a great job explaining Python essentials and includes tons of example code.
    - If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
    - If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
    - If you have more time, try missions 2 and 3 from DataQuest's Learning Python course.
    - If you've already mastered these topics and want more of a challenge, try solving Python Challenge number 1 (decoding a message) and send me your code in Slack.
- To give you a framework for thinking about your project, watch "What is machine learning, and how does it work?" (10 minutes). (This is the IPython notebook shown in the video.) Alternatively, read "A Visual Introduction to Machine Learning", which focuses on a specific machine learning model called decision trees.
- Optional: Browse through some more example student projects, which may help to inspire your own project!
Git and Markdown Resources:
- Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
- If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
- If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
- GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
- Cracking the Code to GitHub's Growth explains why GitHub is so popular among developers.
- Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.
Command Line Resources:
- If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
- If you want to do more at the command line with CSV files, try out csvkit, which can be installed via `pip`.
- Git and GitHub assorted tips (slides)
- Review command line homework (solution)
- Python:
Homework:
- Complete the Python homework assignment with the Chipotle data, add a commented Python script to your GitHub repo, and submit a link using the homework submission form. You have until Tuesday (9/1) to complete this assignment.
Resources:
- "Want to understand Python's comprehensions? Think in Excel or SQL" may be helpful if you are still confused by list comprehensions.
- "My code isn't working" is a great flowchart explaining how to debug Python errors.
- PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.
- If you want to understand Python at a deeper level, Ned Batchelder's Loop Like A Native and Python Names and Values are excellent presentations.
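
If list comprehensions are still unclear, the following minimal sketch may help. It compares an ordinary loop with the equivalent comprehension, and the comments relate each part to a SQL-style query. The data and variable names (`prices`, `doubled_loop`, `doubled_comp`, `name_lengths`) are made up for illustration and are not taken from the course materials.

```python
# Hypothetical prices, invented for this example
prices = [8.5, 9.0, 2.5, 11.25]

# Loop version: build the result one element at a time
doubled_loop = []
for p in prices:
    if p > 5:                       # keep only prices above 5
        doubled_loop.append(p * 2)  # transform each kept price

# Comprehension version, roughly "SELECT p * 2 FROM prices WHERE p > 5":
# [expression  for item in iterable  if condition]
doubled_comp = [p * 2 for p in prices if p > 5]

# The same pattern builds a dictionary (a "dict comprehension")
name_lengths = {name: len(name) for name in ['bowl', 'chips']}

print(doubled_comp)
print(doubled_loop == doubled_comp)
```

Reading the comprehension as "expression, then source, then filter" is usually enough to translate between the two forms in either direction.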
- Pandas (code):
- Project question exercise
Homework:
- Complete "Exercise Three" from today's Pandas script. Note: You do not need to submit this assignment.
- Read "How Software in Half of NYC Cabs Generates $5.2 Million a Year in Extra Tips" for an excellent example of exploratory data analysis.
- The deadline for discussing your project ideas with an instructor is Tuesday (9/1), and your project question write-up is due Thursday (9/3).
Resources:
- Browsing or searching the Pandas API Reference is an excellent way to locate a function even if you don't know its exact name.
- "What I do when I get a new data set as told through tweets" is a fun (yet enlightening) look at the process of exploratory data analysis.
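
As a rough sketch of what a first pass at exploratory data analysis can look like in Pandas, the example below inspects a tiny DataFrame. It assumes Pandas is installed (it ships with Anaconda). The data and the column names `item` and `price` are invented for illustration and are not the course's dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this example
orders = pd.DataFrame({
    'item': ['burrito', 'bowl', 'burrito', 'chips'],
    'price': [8.5, 9.0, 8.5, 2.5],
})

shape = orders.shape                                  # (number of rows, number of columns)
dtypes = orders.dtypes                                # data type of each column
missing = orders.isnull().sum()                       # count of missing values per column
item_counts = orders['item'].value_counts()           # how often each item appears
mean_price = orders.groupby('item')['price'].mean()   # average price per item

print(shape)
print(item_counts)
print(mean_price)
```

Each of these calls (`isnull`, `value_counts`, `groupby`) is documented in the Pandas API Reference, which is a good place to look for related methods.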