azampagl/ai-ml-clustering: Implementation of multiple clustering algorithms (K-means, Bisecting K-means, Agglomerative Hierarchical Clustering with Intra-Cluster Similarity (IST), Centroid Similarity (CST), and UPGMA) for performance comparisons on different data sets.


An implementation of the techniques from "A Comparison of Document Clustering Techniques", written for performance comparisons.

The original algorithms are based on:

Michael Steinbach, George Karypis, Vipin Kumar 
Department of Computer Science and Engineering, 
University of Minnesota 
Technical Report #00-034 
{steinbac, karypis, kumar}@cs.umn.edu
@see http://www.cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf

The code strictly follows the Python PEP 8 style guide.
@see http://www.python.org/dev/peps/pep-0008/

Each algorithm class is organized into three "modules":
1. Preprocess: occurs in the class's init method.
2. Cluster: occurs in the class's "execute" method.
3. Evaluate: occurs in the class's "evaluate" method.
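
The three stages above can be sketched as a single class. This is only an illustration of the layout, not the repository's code: the class name ToyClusterer, its constructor arguments, and the round-robin placeholder in "execute" are all assumptions made for the example.

```python
from collections import Counter


class ToyClusterer:
    """Hypothetical class showing the preprocess / cluster / evaluate layout."""

    def __init__(self, documents, k):
        # Preprocess: turn each raw document into a term-count vector.
        self.k = k
        self.vectors = [Counter(doc.lower().split()) for doc in documents]
        self.clusters = []

    def execute(self):
        # Cluster: a placeholder strategy (round-robin assignment) that
        # stands in for a real algorithm such as k-means.
        self.clusters = [[] for _ in range(self.k)]
        for index in range(len(self.vectors)):
            self.clusters[index % self.k].append(index)
        return self.clusters

    def evaluate(self):
        # Evaluate: report cluster sizes as a trivial stand-in metric.
        return [len(cluster) for cluster in self.clusters]


clusterer = ToyClusterer(["a b", "b c", "c d"], k=2)
clusterer.execute()
print(clusterer.evaluate())
```

Keeping preprocessing in the constructor means "execute" can be called repeatedly on the same vectors, which is convenient when comparing runs.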


============================================
Arguments for python main.py
============================================


	The following arguments are required:

-t: the topic file.
-a: the clustering algorithm (agg-upgma-k-means, agg-cst, k-means, bi-k-means-size, agg-ist, bi-k-means-sim, agg-upgma).
-k: the number of clusters.
-o: the TFIDF file.
-r: the result file.

	The following argument is also required for the bisecting k-means algorithms:

-i: the number of iterations.
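
	The argument rules above could be enforced with a parser along the following lines. This is a minimal sketch using the standard argparse module; it is not taken from main.py, and the function name parse_args, the "dest" names, and the rule that "-i" applies to algorithm names starting with "bi-k-means" are assumptions for illustration.

```python
import argparse

# Algorithm names as listed for the -a argument.
ALGORITHMS = [
    "agg-upgma-k-means", "agg-cst", "k-means", "bi-k-means-size",
    "agg-ist", "bi-k-means-sim", "agg-upgma",
]


def parse_args(argv=None):
    """Parse the documented flags (hypothetical helper, not from main.py)."""
    parser = argparse.ArgumentParser(description="Document clustering")
    parser.add_argument("-t", dest="topics", required=True,
                        help="the topic file")
    parser.add_argument("-a", dest="algorithm", required=True,
                        choices=ALGORITHMS, help="the clustering algorithm")
    parser.add_argument("-k", dest="k", type=int, required=True,
                        help="the number of clusters")
    parser.add_argument("-o", dest="tfidf", required=True,
                        help="the TFIDF file")
    parser.add_argument("-r", dest="results", required=True,
                        help="the result file")
    parser.add_argument("-i", dest="iterations", type=int,
                        help="number of iterations (bisecting k-means)")
    args = parser.parse_args(argv)

    # Assumed rule: -i is mandatory only for the bisecting variants.
    if args.algorithm.startswith("bi-k-means") and args.iterations is None:
        parser.error("-i is required for bisecting k-means algorithms")
    return args
```

	Calling parse_args() with no argument reads sys.argv; passing a list makes the function easy to exercise without a shell.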


============================================
Execution
============================================

	Execution is straightforward.  Given a topic file (-t), a clustering algorithm (-a), and the number of clusters (-k), the program writes the TFIDF vectors for each document to the TFIDF file (-o) and the clustering results to the result file (-r).
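
	For readers unfamiliar with TFIDF vectors, the sketch below shows one common way to compute them. The exact weighting used by this program is not described here, so the formula (raw term count times log(N / document frequency)), the whitespace tokenizer, and the function name tfidf_vectors are assumptions for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (illustrative only)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        # Weight = term count * log(N / df); a term that appears in
        # every document therefore gets weight 0.
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors


vectors = tfidf_vectors(["apple banana", "apple cherry"])
print(vectors[0])
```

	Terms shared by every document carry no weight under this scheme, which is why such vectors are useful for separating documents into clusters.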

======================
	Usage
======================

	The following is an example invocation.

> python main.py -t "../data/toy/toy-topics.txt" -a "k-means" -k 3 -o "../results/toy/tfidf.dat" -r "../results/toy/k-means.txt"