The CAR data science toolkit is a collection of common data science tools and algorithms, implemented and documented as simply as possible so that data journalists can learn from and understand them.
- Clustering algorithms: DBSCAN; k-means clustering
- Classification: Naive Bayes classifier; k-nearest neighbors
- Similarity metrics: Euclidean distance; Jaccard similarity; cosine similarity; Pearson similarity; Hamming distance
- MapReduce: a workflow that calculates pairwise document similarity based on TF-IDF weights
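
To give a feel for the similarity metrics listed above, here is a minimal sketch in plain Python. It is not the toolkit's own code: the function names and the choices for edge cases (for example, returning 0.0 when a vector has zero length or zero variance) are assumptions made for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with a zero vector is 0
    return dot / (norm_a * norm_b)


def pearson_similarity(a, b):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    dev_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    dev_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if dev_a == 0 or dev_b == 0:
        return 0.0  # assumption: a constant vector has no correlation
    return cov / (dev_a * dev_b)


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

Euclidean and Hamming are distances (smaller means more alike), while Jaccard, cosine and Pearson are similarities (larger means more alike). For example, `hamming_distance("karolin", "kathrin")` counts the three positions where the two strings differ, and `jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})` is 2 shared items out of 4 total.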