LeKonArD/bibl_parser

Bibliographic Reference Parser for German Humanities Journals

Disclaimer: The model currently supports only the extraction of author names from references. The annotations in gold/dvjs_annot_references_full.xml contain much more information.

Model

The model consists of two bidirectional GRUs and two dense layers. The first GRU receives the output of the last layer of a multilingual BERT model as input. A German-only BERT model is not sufficient, because it cannot be adapted to the mostly English data from the GROBID project. The multilingual model, by contrast, can be trained on both English and German data and achieves better results with the two combined.
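The layer stack described above can be sketched roughly as follows. This is a minimal PyTorch illustration, not the code from code/train_model.py: the framework, the class name `AuthorTagger`, the layer sizes, the ReLU activation, the ordering (GRUs first, then dense layers), and the per-token output with two labels are all assumptions. To keep the sketch self-contained, it takes the BERT last-layer hidden states as a ready-made input tensor instead of loading a BERT model.

```python
import torch
from torch import nn


class AuthorTagger(nn.Module):
    """Hypothetical sketch: two bidirectional GRUs followed by two dense layers."""

    def __init__(self, bert_dim=768, gru_dim=128, dense_dim=64, num_labels=2):
        super().__init__()
        # All sizes are illustrative assumptions, not values from the repository.
        self.gru1 = nn.GRU(bert_dim, gru_dim, batch_first=True, bidirectional=True)
        # A bidirectional GRU concatenates both directions, hence 2 * gru_dim.
        self.gru2 = nn.GRU(2 * gru_dim, gru_dim, batch_first=True, bidirectional=True)
        self.dense1 = nn.Linear(2 * gru_dim, dense_dim)
        self.dense2 = nn.Linear(dense_dim, num_labels)

    def forward(self, bert_hidden):
        # bert_hidden: (batch, sequence_length, bert_dim), i.e. the output of
        # the last layer of a multilingual BERT model.
        x, _ = self.gru1(bert_hidden)
        x, _ = self.gru2(x)
        x = torch.relu(self.dense1(x))
        # One score per token and label (assumed: author vs. not author).
        return self.dense2(x)


model = AuthorTagger()
dummy_hidden = torch.zeros(2, 10, 768)  # stand-in for real BERT output
logits = model(dummy_hidden)
```

Here `logits` has one row of label scores per input token, so author tokens could be read off with an argmax over the last dimension.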

Training

Training proceeds in two stages. The first stage uses the gold data from the GROBID project, which is adapted beforehand to resemble humanities references more closely (6817 references, gold/grobid_hum.tsv). To this end, typical markers such as "vgl." or "siehe dazu" are inserted, or the reference is embedded entirely in continuous text and divided into segments. The second stage then uses labelled data (341 references) from the Deutsche Vierteljahrsschrift für Literaturwissenschaft und Geistesgeschichte (DVJS). More details about the training can be found in the script (code/train_model.py).
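The two adaptation steps applied to the GROBID data might look roughly like the sketch below. It is an illustration only and not the procedure used to build gold/grobid_hum.tsv: the function names `add_marker` and `embed_and_segment`, the whitespace tokenization, the fixed segment length, and the 0/1 reference flags are all hypothetical. Only the two markers come from the description above.

```python
import random

# Markers named in the description above; the real list may be longer.
MARKERS = ["vgl.", "siehe dazu"]


def add_marker(reference, rng):
    """Prefix a reference with a randomly chosen humanities-style marker."""
    return f"{rng.choice(MARKERS)} {reference}"


def embed_and_segment(reference, before, after, segment_len=8):
    """Embed a reference in running text and cut the result into segments.

    Returns a list of (tokens, flags) pairs, where flags[i] is 1 if
    tokens[i] belongs to the reference and 0 otherwise.
    """
    parts = [(before, 0), (reference, 1), (after, 0)]
    tokens, flags = [], []
    for text, flag in parts:
        words = text.split()  # naive whitespace tokenization (assumption)
        tokens.extend(words)
        flags.extend([flag] * len(words))
    return [
        (tokens[i:i + segment_len], flags[i:i + segment_len])
        for i in range(0, len(tokens), segment_len)
    ]


rng = random.Random(0)
ref = "Müller, Hans: Ein Titel. Berlin 1990."  # made-up example reference
marked = add_marker(ref, rng)
segments = embed_and_segment(
    ref,
    before="Diese These wurde bereits früher vertreten, so etwa bei",
    after="und wird seither immer wieder aufgegriffen.",
    segment_len=8,
)
```

Keeping a flag per token means that labels attached to the original reference can still be aligned with the tokens after the text has been split into segments.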

Usage

See code/predict.py
