News Context Explorer (NCE) helps you find the people, places and other nouns referenced in a text file. It was built with translators and people who read news in a second language in mind. After you upload a text file or provide an article URL, NCE finds names, places, organizations and other relevant words and links them to Wikipedia, along with images, maps and other contextual information in 10 languages.
- Uploads a txt file or processes a URL
- Supports English, German and Spanish as input languages
- Cleans HTML for text processing and display
- Provides the user with an editor to work on the source text and download it as HTML
- Highlights found entities
- Allows the user to explore entities in a target language (10 languages currently supported) and download the references to a CSV file
- Geocodes found locations and marks them on a Google map
- Retrieves photos of found entities and displays them in a gallery
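As a rough illustration of the geocoding step, the sketch below builds a request URL for the Google Geocoding API and reads the coordinates out of a JSON response. The function names, the `YOUR_API_KEY` placeholder and the sample response are assumptions made for this example; they are not taken from the NCE code.

```python
import json

try:  # Python 3
    from urllib.parse import urlencode
except ImportError:  # Python 2.7
    from urllib import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(place, api_key):
    """Return the Geocoding API request URL for a place name."""
    query = urlencode([("address", place), ("key", api_key)])
    return GEOCODE_URL + "?" + query


def extract_coordinates(response_text):
    """Return (lat, lng) from a Geocoding API JSON response, or None."""
    data = json.loads(response_text)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    location = data["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample in the shape of a Geocoding API response (not real output).
sample = ('{"status": "OK", "results": '
          '[{"geometry": {"location": {"lat": 52.52, "lng": 13.405}}}]}')
print(build_geocode_url("Berlin", "YOUR_API_KEY"))
print(extract_coordinates(sample))
```

Returning `None` for anything other than an `OK` status lets the caller skip a location that could not be geocoded instead of failing.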
NCE requires:
- Python 2.7.6 or later
- Flask
- Java (JRE), to run the Stanford NER and POS taggers
- A Google API key for geocoding and map display
- Memcache and memcached to store entities in cache
- Python libraries listed in requirements.txt
-
It is best to start with a virtual environment. To install virtualenv and create an environment:
$ sudo pip install virtualenv
$ cd ~/code/myproject/
$ virtualenv env
To activate the virtual environment:
$ source env/bin/activate
-
Once you have a virtual environment, use pip to install the required libraries from requirements.txt:
$ env/bin/pip install -r requirements.txt
-
Clone this repo into your project directory.
-
You need to add 2 keys:
- A Flask secret key in controller.py
- A Google API key in templates/base.html
-
Download the Stanford NER 3.5.0 and the Stanford POS English tagger 3.5.0 and unzip them in your project directory. I renamed the folders to stanford-ner and stanford-postagger inside the app, so you should double-check the paths in german_processing.py and spanish_processing.py.
-
Run the English NER classifier as a Java server on port 8080:
$ java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -loadClassifier classifiers/english.muc.7class.distsim.crf.ser.gz -port 8080 -outputFormat inlineXML
The German and Spanish NER models are not run as a server, so they are noticeably slower. You can adapt the code to run either way.
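To show what talking to the English NER server can look like, here is a minimal sketch that sends text to port 8080 over a socket and extracts entities from the inlineXML reply. It assumes the server answers one line of input per connection and then closes it. The names `tag_text` and `parse_inline_xml` are hypothetical and are not taken from the NCE code.

```python
import re
import socket

# Matches inline XML tags such as <PERSON>Angela Merkel</PERSON>.
ENTITY_RE = re.compile(r"<([A-Z]+)>(.+?)</\1>")


def parse_inline_xml(tagged_text):
    """Return a list of (entity_type, entity_text) pairs."""
    return [(m.group(1), m.group(2)) for m in ENTITY_RE.finditer(tagged_text)]


def tag_text(text, host="localhost", port=8080):
    """Send text (a unicode string) to the NER server and return its reply."""
    # Assumption: the server reads a single line, so collapse all whitespace.
    line = " ".join(text.split()) + "\n"
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(line.encode("utf-8"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # connection closed by the server: reply is complete
                break
            chunks.append(chunk)
    finally:
        sock.close()
    return b"".join(chunks).decode("utf-8")


# Hand-written sample in the inlineXML format (not real server output).
sample = ("<PERSON>Angela Merkel</PERSON> met reporters in "
          "<LOCATION>Berlin</LOCATION>.")
print(parse_inline_xml(sample))
```

With the server running, `parse_inline_xml(tag_text(article_text))` would chain the two steps together.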
-
Create 'uploads' and 'downloads' folders in your static directory.
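If you would rather create the folders from Python than by hand, a small helper along these lines would do it. `ensure_folders` is a hypothetical name, and the sketch assumes your static directory is called `static` and sits in the project root.

```python
import os


def ensure_folders(static_dir, names=("uploads", "downloads")):
    """Create each named folder under static_dir if it does not exist yet.

    Returns the list of paths that were actually created.
    """
    created = []
    for name in names:
        path = os.path.join(static_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created


# Example, run from the project root:
# ensure_folders("static")
```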
-
Download and install memcached, then run it. If memcached is not running, the app will still work, but it will be slower and will make more requests to the Wikipedia API.
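The cache-or-fetch behaviour described here can be sketched as follows. This is an illustration under assumptions, not NCE's actual code: `cache_key`, `cached_lookup` and the `fetch` callback are hypothetical names, and `cache` stands for any object with `get` and `set` methods, such as a `memcache.Client` from python-memcached.

```python
import hashlib


def cache_key(entity, language):
    """Build a memcached-safe key (no spaces, well under 250 characters)."""
    raw = u"{0}:{1}".format(language, entity).encode("utf-8")
    return "nce:" + hashlib.md5(raw).hexdigest()


def cached_lookup(cache, entity, language, fetch, ttl=3600):
    """Return data for an entity, calling fetch() only on a cache miss.

    cache is any object with get(key) and set(key, value, time=...),
    or None to skip caching entirely.
    """
    key = cache_key(entity, language)
    if cache is not None:
        hit = cache.get(key)
        if hit is not None:
            return hit
    # Miss (or no cache available): do the slow lookup, e.g. a Wikipedia request.
    value = fetch(entity, language)
    if cache is not None:
        cache.set(key, value, time=ttl)
    return value


# Possible wiring with python-memcached (assumes the default local port):
# import memcache
# cache = memcache.Client(["127.0.0.1:11211"])
```

Hashing the key avoids problems with entity names that contain spaces, which memcached keys do not allow.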
-
If you want to add or remove target languages, add or remove the item in the templates/editor.html dropdown menu and update the lancodes dictionary in controller.py to match. Wikipedia has articles in 128 locales.
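For orientation, a language-code dictionary of this kind might look like the sketch below, which also shows how a code could feed a MediaWiki API `langlinks` query. The entries and the `langlinks_url` helper are assumptions for illustration; check controller.py for the real contents of lancodes.

```python
try:  # Python 3
    from urllib.parse import urlencode
except ImportError:  # Python 2.7
    from urllib import urlencode

# Hypothetical excerpt: maps a dropdown label to a Wikipedia language code.
lancodes = {
    "German": "de",
    "Spanish": "es",
    "French": "fr",
}


def langlinks_url(title, target_language, source_code="en"):
    """Build a MediaWiki API URL asking for a title's link in another language."""
    params = [
        ("action", "query"),
        ("prop", "langlinks"),
        ("titles", title),
        ("lllang", lancodes[target_language]),
        ("format", "json"),
    ]
    return "https://{0}.wikipedia.org/w/api.php?{1}".format(
        source_code, urlencode(params))


print(langlinks_url("Berlin", "Spanish"))
```

Adding a language then means adding one dictionary entry whose key matches the new dropdown item.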
I used excellent code and examples from:
Stanford NLP. Wikipedia API. jQuery Highlight plugin. Medium editor. Magnific Popup. HTML sanitizer. Front page tutorial.
This project was completed during Hackbright, a 10-week engineering fellowship for women.
If you want to know more about this project, find me on Twitter @lenazun