Accepted at PLOS DH:
https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000086 (or see citation)
DrNote is a simple yet effective text annotation tool for various purposes.
The annotation method is based on the OpenTapioca codebase and provides named entity linking on unstructured text data.
The project leverages data from Wikidata and Wikipedia without requiring any commercial components.
The annotation service provides a web-based UI as well as API-based access.
The processing of PDF files is supported. Linked entities can be injected as hyperlinks into the uploaded PDF file.
Different languages (de, en, es etc.) are supported.
Update on Results:
A bug was found in the evaluation pipeline that degraded the reported scores. See the updated scores in the Errata section.
Demo:
Our demo instance is available at:
https://drnote.misit-augsburg.de
Note: Upload of large PDF files is not supported. Uploaded data is discarded after processing.
CLI Demo:
# Enter text
text="Die Diagnosen sind Hypothyreose bei Autoimmunthyreoiditis, Diabetes mellitus mit diabetische Nephropathie und akutes Nierenversagen."
# Annotate
curl -k https://drnote.misit-augsburg.de/annotate \
-F "inputType=plaintext" \
-F "outputType=html" \
-F \
"filterOptions={
\"pipeline\": \"de_core_news_sm\",
\"rules\": [
\"any pos[NOUN,PROPN] require\",
\"all non_stopwords require\"
]
}" \
-F \
"plaintext=$text"
Detected issues:
- For the GSC EMEA/Medline datasets, the labels were not correctly filtered for the CHEM label class in all instances.
- Due to an overly strict regular expression, detected Chemical entries for PubTator were only considered if a MeSH code was given.
- For the GSC EMEA/Medline datasets, the UMLS tags in the cTAKES outputs were wrongly used instead of the MedicationMentions tags.
- The character spans of cTAKES may yield broken values due to unsupported umlaut characters. The broken character spans are now fixed using a workaround.
The evaluation was re-run with a revised evaluation pipeline. However, because Wikidata changes constantly, the results may vary. For instance, due to substantial changes in the Wikidata graph structure, the SPARQL query for finding medication entities was changed from the old query to the new query shown here:
(old SPARQL query)
SELECT DISTINCT ?entity WHERE
{
{?entity wdt:P279+ wd:Q12140 .}
UNION
{?entity wdt:P31+ wd:Q12140 .}
}
(new SPARQL query)
SELECT DISTINCT ?entity WHERE
{
{?entity wdt:P279+ wd:Q12140 .}
UNION
{?entity wdt:P31+ wd:Q12140 .}
UNION
{?entity wdt:P267 ?atccode .}
}
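To inspect what the new query returns, it can be sent to a SPARQL endpoint. The sketch below assumes the public Wikidata Query Service at https://query.wikidata.org/sparql, where the `wd:` and `wdt:` prefixes are predefined. The function names are hypothetical, and the evaluation pipeline may retrieve the entities in a different way. The transitive property paths can be expensive, so the public endpoint may time out on the full query.

```shell
# Print the new SPARQL query (copied from above, only re-indented).
medication_query() {
  cat <<'EOF'
SELECT DISTINCT ?entity WHERE
{
  {?entity wdt:P279+ wd:Q12140 .}
  UNION
  {?entity wdt:P31+ wd:Q12140 .}
  UNION
  {?entity wdt:P267 ?atccode .}
}
EOF
}

# Send the query to a SPARQL endpoint via HTTP GET and print the result as CSV.
# The default endpoint is an assumption of this sketch.
fetch_medication_entities() {
  local endpoint="${1:-https://query.wikidata.org/sparql}"
  curl --fail --silent --show-error --get "$endpoint" \
    -H "Accept: text/csv" \
    --data-urlencode "query=$(medication_query)"
}
```

Running `fetch_medication_entities | wc -l` gives a rough count of matching entities, including the CSV header line.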
For comparison, the (cached) original outputs from PubTator and cTAKES and the original pre-trained DrNote model and index store were used. The cached set of UMLS entities was used as well. The updated results (as of 31.07.2024) are as follows:
Dataset | Method | Precision | Recall | F1 score |
---|---|---|---|---|
GERNERMED | cTAKES | 0.858 | 0.512 | 0.641 |
GERNERMED | PubTator | 0.760 | 0.481 | 0.590 |
GERNERMED | DrNote | 0.935 | 0.624 | 0.749 |
Medline GSC | cTAKES | 0.806 | 0.307 | 0.444 |
Medline GSC | PubTator | 0.449 | 0.420 | 0.434 |
Medline GSC | DrNote | 0.693 | 0.139 | 0.232 |
EMEA GSC | cTAKES | 0.834 | 0.357 | 0.500 |
EMEA GSC | PubTator | 0.522 | 0.211 | 0.301 |
EMEA GSC | DrNote | 0.833 | 0.172 | 0.285 |
Medline GSC | DrNote (filtered) | 0.634 | 0.444 | 0.522 |
EMEA GSC | DrNote (filtered) | 0.604 | 0.636 | 0.620 |
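The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick sanity check, a row can be recomputed with the hypothetical helper `f1_score` below. Because the table lists rounded precision and recall values, a recomputed score may differ from the listed one in the last digit.

```shell
# Compute the F1 score from precision and recall, rounded to three decimals.
# Usage: f1_score <precision> <recall>
f1_score() {
  # LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
  LC_ALL=C awk -v p="$1" -v r="$2" 'BEGIN {
    if (p + r == 0) {
      print "0.000"          # F1 is defined as 0 when both inputs are 0
    } else {
      printf "%.3f\n", 2 * p * r / (p + r)
    }
  }'
}
```

For example, `f1_score 0.858 0.512` prints `0.641`, which matches the cTAKES row for GERNERMED.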
Steps to spawn the service using pre-trained data:
# Assumed: Docker, Docker-compose installed and user added to Docker group
# follow guide from https://docs.docker.com/engine/install/ubuntu/
# sudo apt-get install -y docker docker-compose
# sudo usermod -aG docker $USER
# Clone repository
git clone https://github.com/frankkramer-lab/DrNote
cd DrNote/
# Retrieve pre-trained data
wget -O build/pretrained_data.tar.gz https://myweb.rz.uni-augsburg.de/~freijoha/DrNote/pretrained_data.tar.gz
# Spawn annotation service
./04_start_annotation_service.sh
The annotation service should be available at:
https://<DOCKER_HOST>/
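The containers may need some time before the web UI responds. A minimal sketch for waiting until the service is reachable is a small retry loop. `retry_until` is a hypothetical helper introduced here, and the attempt count and delay in the usage example are arbitrary.

```shell
# Run a command repeatedly until it succeeds or the attempts are used up.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
retry_until() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `retry_until 30 2 curl -k --fail --silent --output /dev/null "https://<DOCKER_HOST>/"` polls the service up to 30 times, two seconds apart. The `-k` flag mirrors the CLI demo and skips certificate verification.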
Steps to automatically build the OpenTapioca data setup pipeline and spawn the annotation service.
Prestep: Set up the configuration:
- Modify the file ./cfg/opentapioca_profile.json.
- Modify the file ./cfg/load_config.json.
Note: The language code should match the entry in ./cfg/opentapioca_profile.json.
Steps:
- Check dependencies:
  - Run ./01_checkDependencies.sh
- Generate the NIF file:
  - Run ./02_loadNIFFile.sh
- Generate the OpenTapioca data:
  - Run ./03_processForOpenTapioca.sh
- Spawn the MISIT annotation service:
  - Run ./04_start_annotation_service.sh
The annotation service should be available at:
https://<DOCKER_HOST>/
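The four steps above can be chained so that a failing step stops the build. The following is a minimal sketch, assuming the scripts are executable and are meant to be run from the repository root. `run_pipeline` is a hypothetical helper, not part of the repository. If the last script stays in the foreground, the function returns only when that script exits.

```shell
# Run the four numbered scripts in order inside repo_dir; stop at the first failure.
# Usage: run_pipeline [repo_dir]   (defaults to the current directory)
run_pipeline() {
  local repo_dir="${1:-.}"
  local step
  for step in 01_checkDependencies.sh 02_loadNIFFile.sh \
              03_processForOpenTapioca.sh 04_start_annotation_service.sh; do
    echo "==> ${step}"
    # The subshell keeps the caller's working directory unchanged.
    (cd "$repo_dir" && "./${step}") || {
      echo "Step ${step} failed; aborting." >&2
      return 1
    }
  done
}
```

Called as `run_pipeline /path/to/DrNote`, it prints each step name before running it and returns a non-zero status as soon as one step fails.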
The paper is available at: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000086
If you use our work or want to reference it, use the following BibTeX entry:
@article{10.1371/journal.pdig.0000086,
doi = {10.1371/journal.pdig.0000086},
author = {Frei, Johann and Soto-Rey, Iñaki and Kramer, Frank},
journal = {PLOS Digital Health},
publisher = {Public Library of Science},
title = {DrNote: An open medical annotation service},
year = {2022},
month = {08},
volume = {1},
url = {https://doi.org/10.1371/journal.pdig.0000086},
pages = {1-18},
abstract = {In the context of clinical trials and medical research medical text mining can provide broader insights for various research scenarios by tapping additional text data sources and extracting relevant information that is often exclusively present in unstructured fashion. Although various works for data like electronic health reports are available for English texts, only limited work on tools for non-English text resources has been published that offers immediate practicality in terms of flexibility and initial setup. We introduce DrNote, an open source text annotation service for medical text processing. Our work provides an entire annotation pipeline with its focus on a fast yet effective and easy to use software implementation. Further, the software allows its users to define a custom annotation scope by filtering only for relevant entities that should be included in its knowledge base. The approach is based on OpenTapioca and combines the publicly available datasets from WikiData and Wikipedia, and thus, performs entity linking tasks. In contrast to other related work our service can easily be built upon any language-specific Wikipedia dataset in order to be trained on a specific target language. We provide a public demo instance of our DrNote annotation service at https://drnote.misit-augsburg.de/.},
number = {8},
}
- Annotation Service provides the web service for a given PDF or text input.
- Annotation NIF Generation extracts a NIF-compatible file from Wikipedia.
- OpenTapioca Wrapper wraps the OpenTapioca build/preprocessing/training pipeline (including Solr).
- PDF Processing Library implements the PDF text extraction and link editing functionality.
Not required for smaller queries: