- Packages to install
- Things to read
- Pandas: Your gateway to an Excel-free life
- NLTK: Natural language processing
- scikit-learn: Machine learning, classification
- IPython Notebook: Visuals and code sharing made easy
- Matplotlib: Data visualizations
You can also skip the Anaconda install and install these packages separately:
(please raise an issue or submit a pull request here if you find dependencies you need to install in addition to these)
pip install scikit-learn
pip install pandas
pip install "ipython[notebook]"
pip install matplotlib
pip install numpy
- Natural Language Processing with Python
- This Pandas tutorial
- Beautiful Visualization - PDF download
Pandas is a robust Data Science package built around a powerful data structure known as a DataFrame. The Pandas documentation is excellent and provides a wide variety of tutorials and examples. I recommend their 10 Minutes to pandas tutorial.
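To give a feel for the Excel-free life, here is a minimal sketch of DataFrame usage; the city and population values are made up for illustration:

```python
import pandas as pd

# Hypothetical spreadsheet-style data, loaded as a DataFrame.
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Chicago"],
    "population": [964_000, 676_000, 2_697_000],
})

# Filter rows and compute a summary, the way you might in a spreadsheet.
big = df[df["population"] > 700_000]
print(big["city"].tolist())   # ['Austin', 'Chicago']
print(df["population"].mean())
```

The 10 Minutes to pandas tutorial covers these filtering and aggregation patterns, plus reading real files with `pd.read_csv`.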
I strongly recommend using the Natural Language Toolkit during your introductory steps into Natural Language Processing. The Toolkit has an enjoyable book that teaches natural language processing and the Python language simultaneously, with tangible, nontrivial examples of both. The NLTK creators have also added a list of resources to their wiki, including a project ideas page.
The Natural Language Toolkit comes with a lot of features built in, along with many others you will probably never use (support for a multitude of languages, for example, or a series of tagged texts including parts of the Torah, Jane Austen's Emma, and logs from a 1990s online chatroom). You might use these corpora to train a classifier for a very specific kind of text (e.g., tweets), but because they are niche, you have to download them individually. Just call nltk.download() from an interactive Python session.
The scikit-learn package is exciting and deep. You can use it for almost any machine learning task you might be interested in. If you don't know what you're interested in, check out this chart to get an idea of the possibilities and learn what vocabulary you might need before (or while) moving forward. The scikit-learn documentation both requires and bestows a deep level of statistical knowledge. I usually mention Naive Bayes classification when talking Intro to Data Science. Did you catch it? Do you know what it is? If you haven't heard of it before, go read the Naive Bayes docs and get your Google engine warmed up: you're about to research a lot of new vocabulary. I've lost count of the number of times I've read the documentation for certain algorithms. I always learn something new, and can always put that to use in my work. As you decide what you want to accomplish, you will search through what scikit-learn has to offer until you find the right set of tools to use.
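Since Naive Bayes keeps coming up, here is a minimal sketch of a text classifier with scikit-learn; the spam/ham messages are a made-up toy dataset, not from any real corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: classify short messages as spam or ham.
texts = ["win money now", "cheap money win", "meeting at noon", "lunch at noon"]
labels = ["spam", "spam", "ham", "ham"]

# Turn raw text into word-count vectors, then fit a Naive Bayes model.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["win cheap money"])))  # ['spam']
```

The vectorize-then-fit pattern here is the same one you will use with real datasets; only the data loading changes.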
The easiest way to get started with Data Science is to download this Anaconda distribution. In doing so, you will get the most popular science packages, including IPython Notebook. Alternatively, you can install IPython on your own, following these simple directions. IPython allows you to share your code more easily, and has become a popular way for authors of coding books to share their code snippets as supplemental material. Many of the files inside this GitHub repository are IPython Notebook files (*.ipynb), which can be viewed directly on GitHub.
Visualizations are memorable, shareable, and, when done right, make data more understandable. Most PyData folks recommend using Matplotlib in conjunction with IPython Notebook to create your graphs on the fly. Matplotlib's website also has a ton of examples for anything you want to plot.
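A minimal sketch of a first plot; the data here is invented, and the output filename is just an example:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display; unnecessary inside IPython Notebook
import matplotlib.pyplot as plt

# Made-up data: plot x against x squared.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A first Matplotlib plot")
fig.savefig("squares.png")  # hypothetical output filename
```

In a notebook you would skip `savefig` and let the figure render inline under the cell.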
As you practice making visualizations with Matplotlib, I recommend you pair your efforts with Beautiful Visualization, which will teach you about the ethics, philosophy, and efficacy of various visualization types. Did you know that humans overestimate how much larger a circle needs to be to double in area? If your visualization compares population sizes using circle size, you will probably downplay the difference in size because of how humans reason spatially.