A simple Python 2.7 script that adds references to a Wikidata .qs dump.
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
https://www.mediawiki.org/wiki/StrepHit
- Refresh URLs
- Refresh URLs in case of redirection
- Refresh URLs to HTTPS
- Refresh URLs as query params
- Add references
- Export the list of unmapped URLs
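The URL refresh features listed above could look roughly like the following. This is a minimal sketch and not the script's actual code: `to_https` and `refresh_url` are hypothetical names, and `resolve` stands in for whatever function performs the HTTP request and follows redirects.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2.7


def to_https(url):
    """Return url with an http scheme replaced by https; other URLs are unchanged."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return url
    return urlunsplit(("https",) + tuple(parts[1:]))


def refresh_url(url, resolve):
    """Return the refreshed URL, or None when the URL is unreachable.

    resolve is a hypothetical callable: it follows redirects and returns
    the final URL, or None if the request fails.
    """
    final = resolve(to_https(url))
    if final is None:
        final = resolve(url)  # fall back to the original scheme
    return final
```

Passing `resolve` as a parameter keeps the sketch testable offline; a dictionary's `get` method is enough to simulate redirects.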
This is the most direct way to start (and debug) the script. Customize main.py as you want, then launch:
sh compile.sh
or
python main.py
- Install Python 2.7 and pip
- Clone the repository and install requirements.txt (preferably in a virtualenv)
$ git clone https://github.com/EdoardoLenzi9/Wikipedia.StrepHit.git
$ cd Wikipedia.StrepHit
$ pip install -r requirements.txt
- Install setup.py
python -m pip install --editable <PATH to your folder>
strephit --help
strephit add_references
strephit --help
Usage: strephit [OPTIONS] COMMAND [ARGS]...
Simple script in python 2.7 that adds references to a wikidata .qs dump.
Assets:
(Default) INPUT FILE: assets\supervised_dataset.qs
(Default) OUTPUT FILE: supervised_dataset_output.qs
(Default) REFRESHED URLS LOG: supervised_dataset_refreshed_urls.json
(Default) ERRORS LOG FILE: supervised_dataset_errors.log
(Default) SOURCE MAPPINGS FILE: supervised_dataset_source_mappings.json
(Default) AUTOGENERATED MAPPINGS FILE: supervised_dataset_mappings.json
(Default) AUTOGENERATED UNKNOWN MAPPINGS FILE: supervised_dataset_unknown_mappings.json
Manage your configs in "domain/localizations.py":
* (bool) LOAD_MAPPINGS: loads autogenerated mappings from "assets/supervised_dataset_mappings.json" and "assets/supervised_dataset_unknown_mappings.json" files.
* (bool) MAP_ALL_RESPONSES: when you call the add_references procedure, for an unmapped 'reference URL' (P854), inserts a record with the result of the SPARQL query (business/queries/sitelink_queries.py) into "supervised_dataset_mappings.json".
* (bool) IS_ASYNC_MODE: when you call the add_references procedure, processes each row on a new thread.
* (bool) DELETE_ROW: when you call the refresh procedure, deletes rows with an unreachable 'reference URL' (P854).
* (bool) REFRESH_UNKNOWN_DOMAINS: when you call refresh procedure, replaces old urls with updated urls in case of site redirection.
* [Here you can also customize every default path of your assets]
Notes: * Async mode and MAP_ALL_RESPONSES mode need more testing
Options:
--help Show this message and exit.
Commands:
add_references For each row of your .qs input dump, analyzes the 'reference URL' (P854) property and generates a 'stated in' (P248) property (where possible)
export_unmapped For each row of your .qs input dump, checks whether the 'reference URL' (P854) is already mapped in your mapping files (and logs it if not)
refresh Refreshes the URLs of your .qs input dump: for each row, checks whether the 'reference URL' (P854) is reachable and, in case of a redirect, updates it
refresh_and_add Calls sequentially refresh and add_references commands
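The idea behind add_references and export_unmapped can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: it assumes tab-separated rows in which a reference URL appears as an `S854` field followed by a quoted URL and a 'stated in' reference is appended as `S248`, and it assumes a plain domain-to-item dictionary as the mapping. The function name and the item id `Q0` are placeholders.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2.7


def add_reference(row, mappings):
    """Append an S248 reference to a row whose S854 URL domain is mapped.

    Returns (row, unmapped_url). unmapped_url is None when the row was
    mapped or has no S854 field; otherwise it is the URL without a mapping.
    """
    row = row.rstrip("\n")
    fields = row.split("\t")
    if "S854" not in fields[:-1]:
        return row, None  # no reference URL on this row
    url = fields[fields.index("S854") + 1].strip('"')
    item = mappings.get(urlsplit(url).netloc)  # look up the source by domain
    if item is None:
        return row, url  # candidate for the unmapped URLs list
    return "\t".join(fields + ["S248", item]), None
```

Returning the unmapped URL alongside the row lets one loop serve both purposes: writing the output dump and collecting the URLs that still need a mapping.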
File | Path |
---|---|
input dump | ./assets/supervised_dataset.qs |
output dump | ./assets/supervised_dataset_output.qs |
error log | ./assets/supervised_dataset_errors.log |
json with mappings | ./assets/supervised_dataset_mappings.json |
json with unknown mappings | ./assets/supervised_dataset_unknown_mappings.json |
json with refreshed urls log | ./assets/supervised_dataset_refreshed_urls.json |
json with source mappings | ./assets/supervised_dataset_source_mappings.json |
Source mappings are defined by the user and never modified by the script; mappings and unknown mappings are autogenerated files, constantly updated by the script.
In source mappings, the boolean property "to_upper_case" specifies the format of the extracted output references.
Here you can manage all constants, strings, paths and URLs of the script.
For example:
MAP_ALL_RESPONSES = False  # this mode increases memory usage but also the speed of the script
                           # (if your mappings aren't complete)
IS_ASYNC_MODE = False  # this mode processes every row of the dataset on a new thread
                       # (this increases the speed of the script, but the order of the output rows can change)
LOAD_MAPPINGS = True
DELETE_ROW = False  # delete a row when refreshing a URL that is not reachable (needs a stable internet connection!)
REFRESH_UNKNOWN_DOMAINS = True
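To see why IS_ASYNC_MODE can change the order of the output rows, consider this minimal sketch of a thread-per-row loop. It is not the script's code: `process_rows` and `handle` are hypothetical names, and only the standard `threading` module is used.

```python
import threading


def process_rows(rows, handle, is_async_mode=False):
    """Apply handle to every row; in async mode each row gets its own thread."""
    if not is_async_mode:
        return [handle(row) for row in rows]  # input order is preserved

    results = []
    lock = threading.Lock()

    def worker(row):
        out = handle(row)
        with lock:
            # Appended in completion order, which may differ from input order.
            results.append(out)

    threads = [threading.Thread(target=worker, args=(row,)) for row in rows]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

If a stable order matters, one option is to have each worker store its result by row index instead of appending.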
1. async_mode and map_all_responses_mode need more testing and fixes