StrepHit is a Natural Language Processing pipeline that understands human language, extracts facts from text and produces Wikidata statements with references.
StrepHit is an IEG project funded by the Wikimedia Foundation.
StrepHit will enhance the data quality of Wikidata by suggesting references to validate statements, and will help Wikidata become the gold-standard hub of the Open Data landscape.
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
https://www.mediawiki.org/wiki/StrepHit
- Web spiders to collect a biographical corpus from a list of reliable sources
- Corpus analysis to understand the most meaningful verbs
- Extraction of sentences and semi-structured data from a corpus
- Train an automatic classifier through crowdsourcing
- Extract facts from text in 2 ways: with a rule-based classifier and with a supervised classifier
- Several utilities, ranging from NLP tasks like tokenization and part-of-speech tagging, to facilities for parallel processing, caching and logging
- Corpus Harvesting
- Corpus Analysis
- Sentence Extraction
- N-ary Relation Extraction
- Dataset Serialization
- Install Python 2.7 and pip
- Clone the repository and create the output folder:
$ git clone https://github.com/Wikidata/StrepHit.git
$ mkdir StrepHit/output
- Install all the Python requirements (preferably in a virtualenv)
$ cd StrepHit
$ pip install -r requirements.txt
- Install TreeTagger
- Register for a free account on the Dandelion APIs
- Create the file strephit/commons/secret_keys.py with your API token, which you can find in your dashboard:
NEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1/'
NEX_TOKEN = 'your API token here'
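For orientation, the sketch below shows how these two settings could be combined into an entity linking request. It only builds the request URL and sends nothing over the network. The helper name build_nex_request is hypothetical, and the query parameters text, lang and token are assumptions based on the public Dandelion API, not code taken from StrepHit.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

# Same names as in strephit/commons/secret_keys.py
NEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1/'
NEX_TOKEN = 'your API token here'


def build_nex_request(text, lang, url=NEX_URL, token=NEX_TOKEN):
    """Hypothetical helper: return the GET URL for one entity linking call."""
    # Parameter names are assumed from the Dandelion documentation
    query = urlencode([('text', text), ('lang', lang), ('token', token)])
    return url + '?' + query


print(build_nex_request('Douglas Adams was born in Cambridge', 'en'))
```

If the printed URL still contains the placeholder token, the file has not been filled in with your own token yet.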
If you want to extract sentences via syntactic parsing, you will need to install:
- Java 8
- Stanford CoreNLP, through our utility:
$ python -m strephit commons download stanford_corenlp
You can run all the NLP pipeline components from the command line.
Run the module without arguments, or with --help, to see the available options.
Each command can have a set of sub-commands, depending on its granularity.
$ python -m strephit
Usage: __main__.py [OPTIONS] COMMAND [ARGS]...
Options:
--log-level <TEXT CHOICE>...
--cache-dir DIRECTORY
--help Show this message and exit.
Commands:
annotation Corpus annotation via crowdsourcing
classification Roles classification
commons Common utilities used by others
corpus_analysis Corpus analysis module
extraction Data extraction from the corpus
rule_based Unsupervised fact extraction
side_projects Side projects scripts
web_sources_corpus Corpus retrieval from the web
- Generate a dataset of Wikidata assertions (QuickStatements syntax) from semi-structured data in the corpus (this takes time and requires a good internet connection):
$ python -m strephit extraction process_semistructured -p 1 samples/corpus.jsonlines
- Produce a ranking of meaningful verbs:
$ python -m strephit commons pos_tag samples/corpus.jsonlines bio en
$ python -m strephit corpus_analysis rank_verbs output/pos_tagged.jsonlines bio en
- Extract sentences using the ranking and perform Entity Linking:
$ python -m strephit extraction extract_sentences samples/corpus.jsonlines output/verbs.json en
$ python -m strephit commons entity_linking -p 1 output/sentences.jsonlines en
- Extract facts with the rule-based classifier:
$ python -m strephit rule_based classify output/entity_linked.jsonlines samples/lexical_db.json en
- Train the supervised classifier and extract facts:
$ python -m strephit annotation parse_results samples/crowdflower_results.csv
$ python -m strephit classification train output/training_set.jsonlines en
$ python -m strephit classification classify output/entity_linked.jsonlines output/classifier_model.pkl en
- Serialize the supervised classification results into a dataset of Wikidata assertions (QuickStatements):
$ python -m strephit commons serialize -p 1 output/supervised_classified.jsonlines samples/lexical_db.json en
N.B.: you will find all the output files in the output folder.
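The steps above can also be chained from a small script. The following is a minimal sketch, not part of StrepHit: the names STEPS, build_commands and run_pipeline are hypothetical, and the arguments are copied verbatim from the commands listed above. Note that it runs both the rule-based and the supervised extraction, one after the other.

```python
import subprocess
import sys

# Sub-commands and arguments copied from the steps above, in order
STEPS = [
    ['extraction', 'process_semistructured', '-p', '1', 'samples/corpus.jsonlines'],
    ['commons', 'pos_tag', 'samples/corpus.jsonlines', 'bio', 'en'],
    ['corpus_analysis', 'rank_verbs', 'output/pos_tagged.jsonlines', 'bio', 'en'],
    ['extraction', 'extract_sentences', 'samples/corpus.jsonlines', 'output/verbs.json', 'en'],
    ['commons', 'entity_linking', '-p', '1', 'output/sentences.jsonlines', 'en'],
    ['rule_based', 'classify', 'output/entity_linked.jsonlines', 'samples/lexical_db.json', 'en'],
    ['annotation', 'parse_results', 'samples/crowdflower_results.csv'],
    ['classification', 'train', 'output/training_set.jsonlines', 'en'],
    ['classification', 'classify', 'output/entity_linked.jsonlines', 'output/classifier_model.pkl', 'en'],
    ['commons', 'serialize', '-p', '1', 'output/supervised_classified.jsonlines', 'samples/lexical_db.json', 'en'],
]


def build_commands(steps=STEPS, python=sys.executable):
    """Prefix every step with the interpreter and the strephit module."""
    return [[python, '-m', 'strephit'] + step for step in steps]


def run_pipeline(commands):
    """Run the commands one by one, stopping at the first failure."""
    for command in commands:
        print('$ ' + ' '.join(command))
        subprocess.check_call(command)  # raises CalledProcessError on a non-zero exit
```

Calling run_pipeline(build_commands()) from the repository root, with the requirements installed, would execute the whole sequence.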
By default, StrepHit uses as many processes as there are CPU cores on the machine where it runs.
Add the -p parameter to change this behavior; set -p 1 to disable parallel processing.
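The behavior described here can be illustrated with the standard multiprocessing module. This is a sketch under assumptions, not StrepHit's actual implementation: resolve_processes and parallel_map are hypothetical names, and the argument p only mimics the -p option.

```python
import multiprocessing


def resolve_processes(p=None):
    """Return the worker count: all CPU cores unless a value is given."""
    if p is None:
        return multiprocessing.cpu_count()
    if p < 1:
        raise ValueError('the number of processes must be at least 1')
    return p


def parallel_map(function, items, p=None):
    """Apply function to every item, serially when p is 1."""
    processes = resolve_processes(p)
    if processes == 1:
        # No pool at all: easier to debug and no inter-process overhead
        return [function(item) for item in items]
    pool = multiprocessing.Pool(processes)
    try:
        return pool.map(function, items)
    finally:
        pool.close()
        pool.join()
```

With p=1 the work stays in the calling process, which is also why -p 1 is convenient when debugging or when a remote service limits the request rate.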
The source code is released under the terms of the GNU General Public License, version 3.