EdoardoLenzi9/Wikipedia.StrepHit

Wikipedia.StrepHit

A simple Python 2.7 script that adds references to a Wikidata .qs dump.

Official Project Page

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

Documentation

https://www.mediawiki.org/wiki/StrepHit

Features

  • Refresh URLs
    • Refresh URLs in case of redirection
    • Refresh URLs to HTTPS
    • Refresh URLs as query params
  • Add references
  • Export the list of unmapped URLs
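
The URL refresh features above can be sketched as follows. This is a minimal illustration, not the script's actual code: `to_https`, `refresh_url`, and the `resolve` callable are hypothetical names, and `resolve` is assumed to return the final URL after redirects, or `None` when the URL is unreachable.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit      # Python 2.7


def to_https(url):
    """Return url with an http scheme replaced by https; other URLs are unchanged."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return url
    return urlunsplit(("https",) + tuple(parts[1:]))


def refresh_url(url, resolve):
    """Hypothetical helper: try the https variant first, then the original URL.

    resolve(candidate) is assumed to return the final URL after any
    redirects, or None when the candidate cannot be reached.
    """
    candidates = [to_https(url)]
    if url not in candidates:
        candidates.append(url)
    for candidate in candidates:
        final = resolve(candidate)
        if final is not None:
            return final
    return None
```

Injecting `resolve` keeps the sketch testable without network access. In a real run it could wrap an HTTP client, for example `requests.head(candidate, allow_redirects=True).url` guarded by exception handling.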

Get Ready

Launch main.py directly

This is the most direct way to start (and debug) the script. Customize main.py as needed, then launch:

    sh compile.sh

or

    python main.py

Click and Virtualenv

    $ git clone https://github.com/EdoardoLenzi9/Wikipedia.StrepHit.git
    $ cd Wikipedia.StrepHit
    $ pip install -r requirements.txt

For Windows Users

  • Install the package from setup.py in editable mode:
    python -m pip install --editable <PATH to your folder>

Launch script

    strephit --help
    strephit add_references

Click Description

strephit --help
Usage: strephit [OPTIONS] COMMAND [ARGS]...

  Simple script in python 2.7 that adds references to a wikidata .qs dump.

  Assets:

      (Default) INPUT FILE:        assets\supervised_dataset.qs

      (Default) OUTPUT FILE:       supervised_dataset_output.qs

      (Default) REFRESHED URLS LOG:      supervised_dataset_refreshed_urls.json

      (Default) ERRORS LOG FILE:   supervised_dataset_errors.log

      (Default) SOURCE MAPPINGS FILE:      supervised_dataset_source_mappings.json

      (Default) AUTOGENERATED MAPPINGS FILE:      supervised_dataset_mappings.json

      (Default) AUTOGENERATED UNKNOWN MAPPINGS FILE:      supervised_dataset_unknown_mappings.json

  Manage your configs in "domain/localizations.py":

      * (bool) LOAD_MAPPINGS: loads autogenerated mappings from "assets/supervised_dataset_mappings.json" and "assets/supervised_dataset_unknown_mappings.json" files.

      * (bool) MAP_ALL_RESPONSES: when you call add_references procedure, in case of an unmapped 'reference URL' (P854) inserts in "supervised_dataset_mappings.json"  record of the result sparql-query (business/queries/sitelink_queries.py).

      * (bool) IS_ASYNC_MODE: when you call add_references procedure, processes each row on a new thread.

      * (bool) DELETE_ROW: when you call refresh procedure, deletes rows with unreachable 'reference URL' (P854).

      * (bool) REFRESH_UNKNOWN_DOMAINS: when you call refresh procedure, replaces old urls with updated urls in case of site redirection.

      * [Here you can also customize every default path of your assets]

  Notes:     * Async mode and MAP_ALL_RESPONSES mode need more testing

Options:
  --help  Show this message and exit.

Commands:
  add_references   For each row of your .qs input dump, analyzes the 'reference URL' (P854) property and generates a 'stated in' (P248) property (where possible)
  export_unmapped  For each row of your .qs input dump, checks if 'reference URL' (P854) is already mapped in your mapping files (and logs if not)
  refresh          Refreshes URLs of your .qs input dump: for each row, checks if 'reference URL' (P854) is reachable and, in case of redirect, updates it
  refresh_and_add  Calls sequentially refresh and add_references commands
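
As a rough illustration of what add_references does to one row, the sketch below assumes the dump uses tab-separated QuickStatements syntax, where reference properties are written with an `S` prefix (so P854 appears as `S854` and P248 as `S248`). The function name `add_reference`, the domain-to-item dictionary, and the item ID are hypothetical; the real script resolves sources through its mapping files and SPARQL queries.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2.7


def add_reference(row, domain_to_item):
    """Append a 'stated in' (S248) source to a row whose 'reference URL'
    (S854) belongs to a mapped domain; return other rows unchanged.

    domain_to_item is a hypothetical {domain: item id} dictionary.
    """
    fields = row.rstrip("\n").split("\t")
    # After the subject item, fields alternate between property and value.
    for prop, value in zip(fields[1::2], fields[2::2]):
        if prop == "S854":
            domain = urlsplit(value.strip('"')).netloc
            item = domain_to_item.get(domain)
            if item is not None:
                return "\t".join(fields + ["S248", item])
    return "\t".join(fields)


# Placeholder mapping and row, for illustration only.
mapping = {"www.example.org": "Q4115189"}
row = 'Q42\tP69\tQ691283\tS854\t"http://www.example.org/people/42"'
print(add_reference(row, mapping))
```

Rows whose domain is not in the mapping come back unchanged, which mirrors the "where possible" wording in the command description.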

Assets

| File | Path |
| --- | --- |
| input dump | ./assets/supervised_dataset.qs |
| output dump | ./assets/supervised_dataset_output.qs |
| error log | ./assets/supervised_dataset_errors.log |
| json with mappings | ./assets/supervised_dataset_mappings.json |
| json with unknown mappings | ./assets/supervised_dataset_unknown_mappings.json |
| json with refreshed urls log | ./assets/supervised_dataset_refreshed_urls.json |
| json with source mappings | ./assets/supervised_dataset_source_mappings.json |

Source mappings are defined by the user and never modified by the script; mappings and unknown mappings are autogenerated files, constantly updated by the script.

In source mappings, the boolean property "to_upper_case" specifies the format of the extracted references in the output.
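
The division of roles between the three mapping files can be pictured with the sketch below. It is only an assumption about the lookup order: the function `find_item`, the flat `{domain: item}` dictionaries, and the `resolve` callable (standing in for the SPARQL lookup) are hypothetical and do not reflect the real JSON layout.

```python
def find_item(domain, source_mappings, mappings, unknown_mappings, resolve):
    """Look up the source item for a domain.

    source_mappings is user-defined and only read here; mappings and
    unknown_mappings are the autogenerated dictionaries, updated in place.
    """
    if domain in source_mappings:
        return source_mappings[domain]
    if domain in mappings:
        return mappings[domain]
    if domain in unknown_mappings:
        return None  # already known to be unresolvable, skip the query
    item = resolve(domain)  # hypothetical stand-in for the SPARQL query
    if item is None:
        unknown_mappings[domain] = None
    else:
        mappings[domain] = item
    return item
```

Recording failed lookups in the unknown mappings avoids repeating the same query for every row that cites the same domain.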

domain/localizations.py

Here you can manage all constants, strings, paths and urls of the script.

For example:

    MAP_ALL_RESPONSES = False   # this mode increases memory usage but also the speed of the script
                                # (if your mappings aren't complete)
    IS_ASYNC_MODE = False       # this mode processes every row of the dataset on a new thread
                                # (this increases the speed of the script, but the order of the output rows can change)
    LOAD_MAPPINGS = True
    DELETE_ROW = False          # when refreshing, deletes a row whose URL is not reachable (needs a stable internet connection!)
    REFRESH_UNKNOWN_DOMAINS = True
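
The IS_ASYNC_MODE comment notes that processing each row on its own thread can change the order of the output rows. The sketch below shows one possible way to run one thread per row and still keep the input order, by writing each result into a slot indexed by row position. The names `process_rows` and `worker` are hypothetical, and this is not how the script itself is implemented.

```python
import threading


def process_rows(rows, worker, is_async_mode=False):
    """Apply worker to every row, optionally on one thread per row.

    Results are stored by row index, so the output order matches the
    input order even when threads finish at different times.
    """
    if not is_async_mode:
        return [worker(row) for row in rows]

    results = [None] * len(rows)

    def run(index, row):
        results[index] = worker(row)

    threads = [threading.Thread(target=run, args=(index, row))
               for index, row in enumerate(rows)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

For large dumps, one thread per row may be excessive; a bounded pool of workers would be a natural refinement.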

TODO

1. async_mode and map_all_responses_mode need more testing and fixes
