Updates to the transformers conf docs and yaml file #1467

Merged
merged 3 commits on Oct 13, 2024
118 changes: 94 additions & 24 deletions docs/analyzer/nlp_engines/transformers.md
@@ -55,7 +55,75 @@ Then, also download a spaCy pipeline/model:
python -m spacy download en_core_web_sm
```

#### Creating a configuration file

### Configuring the NER pipeline

Once the models are downloaded, one option to configure them is to create a YAML configuration file.
Note that the configuration needs to contain both a `spaCy` pipeline name and a transformers model name.
In addition, different configurations for parsing the results of the transformers model can be added.

The NER model configuration can be done in a YAML file or in Python:

#### Configuring the NER pipeline via code

Example configuration in Python:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, TransformersNlpEngine

# Transformer model config
model_config = [
    {
        "lang_code": "en",
        "model_name": {
            "spacy": "en_core_web_sm",  # for tokenization, lemmatization
            "transformers": "StanfordAIMI/stanford-deidentifier-base",  # for NER
        },
    }
]

# Entity mapping between the model's labels and Presidio's entity types
mapping = dict(
    PER="PERSON",
    LOC="LOCATION",
    ORG="ORGANIZATION",
    AGE="AGE",
    ID="ID",
    EMAIL="EMAIL",
    DATE="DATE_TIME",
    PHONE="PHONE_NUMBER",
    PERSON="PERSON",
    LOCATION="LOCATION",
    GPE="LOCATION",
    ORGANIZATION="ORGANIZATION",
    NORP="NRP",
    PATIENT="PERSON",
    STAFF="PERSON",
    HOSP="LOCATION",
    PATORG="ORGANIZATION",
    TIME="DATE_TIME",
    HCW="PERSON",
    HOSPITAL="LOCATION",
    FACILITY="LOCATION",
    VENDOR="ORGANIZATION",
)

labels_to_ignore = ["O"]

ner_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
    alignment_mode="expand",  # "strict", "contract", "expand"
    aggregation_strategy="max",  # "simple", "first", "average", "max"
    labels_to_ignore=labels_to_ignore,
)

transformers_nlp_engine = TransformersNlpEngine(
    models=model_config,
    ner_model_configuration=ner_model_configuration,
)

# Transformer-based analyzer
analyzer = AnalyzerEngine(
    nlp_engine=transformers_nlp_engine,
    supported_languages=["en"],
)
```
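
For completeness, a short usage sketch of the analyzer configured above; the sample text is an arbitrary example:

```python
# Run PII detection with the transformers-based analyzer defined above
results = analyzer.analyze(
    text="My name is Dan and my phone number is 212-555-5555",
    language="en",
)

for result in results:
    print(result)  # entity type, character offsets, and confidence score
```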

#### Creating a YAML configuration file

Once the models are downloaded, one option to configure them is to create a YAML configuration file.
Note that the configuration needs to contain both a `spaCy` pipeline name and a transformers model name.
@@ -75,9 +143,9 @@ models:
ner_model_configuration:
  labels_to_ignore:
  - O
  aggregation_strategy: simple # "simple", "first", "average", "max"
  aggregation_strategy: max # "simple", "first", "average", "max"
  stride: 16
  alignment_mode: strict # "strict", "contract", "expand"
  alignment_mode: expand # "strict", "contract", "expand"
  model_to_presidio_entity_mapping:
    PER: PERSON
    LOC: LOCATION
@@ -92,33 +160,15 @@ ner_model_configuration:
    DATE: DATE_TIME
    PHONE: PHONE_NUMBER
    HCW: PERSON
    HOSPITAL: ORGANIZATION
    HOSPITAL: LOCATION
    VENDOR: ORGANIZATION

  low_confidence_score_multiplier: 0.4
  low_score_entity_names:
  - ID
```

Where:

- `model_name.spacy` is the name of a spaCy model/pipeline that wraps the transformers NER model. For example, `en_core_web_sm`.
- `model_name.transformers` is the full path of a Hugging Face model. Models can be found on the [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2`.

The `ner_model_configuration` section contains the following parameters:

- `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning.
- `aggregation_strategy`: The strategy to use when aggregating the results of the transformers model.
- `stride`: The length of the overlap between windows, in transformers tokenizer tokens.
- `alignment_mode`: The strategy to use when aligning the results of the transformers model to the original text.
- `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types.
- `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence.
- `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to.

For more information on these parameters, see the [spacy-huggingface-pipelines GitHub repo](https://github.com/explosion/spacy-huggingface-pipelines#token-classification).

Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information.

#### Calling the new model
##### Calling the new model

Once the configuration file is created, it can be used to create a new `TransformersNlpEngine`:

@@ -143,6 +193,26 @@ Once the configuration file is created, it can be used to create a new `TransformersNlpEngine`:
print(results_english)
```
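
For reference, here is a minimal sketch of that flow end to end. The file path `./transformers.yaml` is an assumption; `NlpEngineProvider` is the presidio-analyzer helper that builds an NLP engine from a configuration file:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Path to the YAML configuration created above (assumed location)
conf_file = "./transformers.yaml"

# Build the transformers-based NLP engine from the configuration file
provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()

# Pass the engine and its supported languages to the analyzer
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["en"],
)

results_english = analyzer.analyze(text="My name is Morris", language="en")
print(results_english)
```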

#### Explaining the configuration options

- `model_name.spacy` is the name of a spaCy model/pipeline that wraps the transformers NER model. For example, `en_core_web_sm`.
- `model_name.transformers` is the full path of a Hugging Face model. Models can be found on the [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2`.

The `ner_model_configuration` section contains the following parameters:

- `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning.
- `aggregation_strategy`: The strategy to use when aggregating the results of the transformers model.
- `stride`: The length of the overlap between windows, in transformers tokenizer tokens.
- `alignment_mode`: The strategy to use when aligning the results of the transformers model to the original text.
- `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types.
- `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence.
- `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to (see the sketch after this list).
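
As a rough illustration of the last two parameters: with `low_confidence_score_multiplier: 0.4`, an `ID` entity detected with confidence 0.9 would be reported with a score of roughly 0.36. A minimal sketch of that adjustment (illustrative only, not Presidio's actual implementation):

```python
# Illustrative values taken from the YAML example above
low_confidence_score_multiplier = 0.4
low_score_entity_names = {"ID"}

def adjust_score(entity_type: str, score: float) -> float:
    """Scale down the confidence of entity types flagged as low-score."""
    if entity_type in low_score_entity_names:
        return score * low_confidence_score_multiplier
    return score

print(adjust_score("ID", 0.9))      # ~0.36
print(adjust_score("PERSON", 0.9))  # 0.9 (unchanged)
```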

For more information on these parameters, see the [spacy-huggingface-pipelines GitHub repo](https://github.com/explosion/spacy-huggingface-pipelines#token-classification).

Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information.


### Training your own model

!!! note "Note"
3 changes: 2 additions & 1 deletion presidio-analyzer/presidio_analyzer/conf/transformers.yaml
@@ -36,8 +36,9 @@ ner_model_configuration:
    TIME: DATE_TIME
    PHONE: PHONE_NUMBER
    HCW: PERSON
    HOSPITAL: ORGANIZATION
    HOSPITAL: LOCATION
    FACILITY: LOCATION
    VENDOR: ORGANIZATION

  low_confidence_score_multiplier: 0.4
  low_score_entity_names: