This code is based on an unfinished paper of mine that summarizes a corpus using a version of topological sorting, the Wu-Palmer similarity measure, and topological sentence selection. I've written a short script to demonstrate the methodology used in the paper.
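For readers unfamiliar with the Wu-Palmer measure, here is a minimal sketch of the standard formula, 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest ancestor shared by both concepts. The taxonomy and the names PARENT, path_to_root, depth and wu_palmer are hypothetical and only for illustration; this is not the code from v01.py, which may compute the measure differently (for example through WordNet in NLTK).

```python
# Hypothetical toy taxonomy (child -> parent); "entity" is the root.
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "mammal": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "oak": "plant",
}

def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking up from a, the first node that is also an ancestor of b
    # is the deepest common ancestor (the LCS).
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree, "dog" and "cat" share the ancestor "mammal" and score higher than "dog" and "oak", which only share the root.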
Coming Soon.
You need the following Python libraries installed on your computer.
You can install NLTK via pip by executing:
pip install nltk
Along with our summarizer script, I've included Summa, a summarizer that uses TextRank. We used it to compare the results generated by TextRank with those of our script.
You can install Summa via pip by executing:
pip install summa
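Once Summa is installed, a TextRank summary for comparison can be produced along these lines. This is a minimal sketch: the helper name textrank_summary is hypothetical, and the ratio of 0.5 is only chosen to mirror the 50% default of our script.

```python
def textrank_summary(text, ratio=0.5):
    """Return a TextRank summary keeping roughly `ratio` of the sentences."""
    # Imported inside the function so the helper can be defined
    # even on a machine where Summa is not installed yet.
    from summa import summarizer
    return summarizer.summarize(text, ratio=ratio)
```

Passing the contents of one of the corpus files to textrank_summary returns the summary as a single string, which can then be set next to the output of our script.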
Before running our summarizer script, you need to run the pre-processor script (which I haven't written yet). The pre-processor script performs the following tasks:
1. Remove stop words from the corpus using a standard list.
2. Prune punctuation and numeric values, as they do not affect the quality of sentence selection.
3. Remove symbolic short forms such as Mr., Ms., Dr., Rs., &, %, $, etc.
4. Expand textual short forms such as It's, That's, What's, etc.
5. Form a list of sentences from the corpus once steps 1 to 4 are complete.
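The steps above could be sketched roughly as follows. This is not the actual pre-processor. The stop word list, the contraction table and the function name preprocess are assumptions for illustration, and the sketch splits the text into sentences before stripping punctuation so that sentence boundaries are still visible.

```python
import re
import string

# Small illustrative stop word list (assumption); a real run would use
# a fuller standard list.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on",
              "to", "for", "and", "it", "that", "what"}

# Illustrative contraction table (assumption), used for step 4.
CONTRACTIONS = {"it's": "it is", "that's": "that is", "what's": "what is"}

# Abbreviations removed in step 3. Symbols such as &, % and $ are
# removed together with the punctuation in step 2.
ABBREVIATIONS = re.compile(r"\b(?:Mr|Ms|Dr|Rs)\.", re.IGNORECASE)

def preprocess(corpus):
    """Return a list of cleaned, lower-cased sentences."""
    # Step 3: drop abbreviations first so "Dr." does not end a sentence.
    text = ABBREVIATIONS.sub(" ", corpus)
    # Step 4: expand contractions.
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text,
                      flags=re.IGNORECASE)
    sentences = []
    # Step 5: split on whitespace that follows ., ! or ?
    for raw in re.split(r"(?<=[.!?])\s+", text):
        # Step 2: replace punctuation and digits with spaces.
        cleaned = re.sub(r"[%s\d]" % re.escape(string.punctuation), " ", raw)
        # Step 1: remove stop words.
        words = [w for w in cleaned.lower().split() if w not in STOP_WORDS]
        if words:
            sentences.append(" ".join(words))
    return sentences
```

For example, preprocess("Dr. Smith paid $20 for the book. It's a good book!") yields two cleaned sentences, one per original sentence.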
You can run the script by typing the following in a terminal:
python v01.py
I've already attached a few manually pre-processed text files in the 'Corpus-Collection' folder. However, you can use any passage of your choice as long as Python can read it into a string. To use a different file, edit the line:
file=open('Corpus-Collection/text4.txt','r')
By default, percentage_summarization is set to 0.5, indicating 50% summarization. You can change the factor by editing the line:
percentage_summarization=0.5
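To give a feel for what the factor means, here is a small sketch of how such a value could translate into a sentence count. It assumes the factor is the fraction of sentences kept and that the count is rounded up. The helper sentences_to_keep is hypothetical, and v01.py may round or interpret the factor differently.

```python
import math

def sentences_to_keep(total_sentences, percentage_summarization=0.5):
    """Number of sentences in the summary, assuming the factor is the
    fraction of sentences kept and the result is rounded up."""
    if not 0 < percentage_summarization <= 1:
        raise ValueError("percentage_summarization must be in (0, 1]")
    if total_sentences <= 0:
        return 0
    # Keep at least one sentence for any non-empty corpus.
    return max(1, math.ceil(total_sentences * percentage_summarization))
```

Under these assumptions, a 10-sentence corpus with the default factor gives a 5-sentence summary.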