Some spans are being skipped by spacy-huggingface-pipelines, resulting in poor anonymisation #1262
Comments
Hi @aayushisanghi
@aayushisanghi, as @VMD7 mentioned, a reproducible example would definitely help. Thanks!
@aayushisanghi, we'd be very interested to know more about this issue, especially as the result is poor anonymization. Any feedback would be valuable.
Hello! I am having the same issue. Here is my code:

```python
import transformers
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModelForTokenClassification
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

transformers_model = "Jean-Baptiste/camembert-ner-with-dates"
snapshot_download(repo_id=transformers_model)
AutoTokenizer.from_pretrained(transformers_model)
AutoModelForTokenClassification.from_pretrained(transformers_model)

conf_file = "/Users/thomasmoulin/Downloads/config_presidio_fr_transformer.yml"
provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["fr"]
)

result = analyzer.analyze(text="Je m'appelle Thomas Moulin", language="fr")
```
Thanks @thomas-moulin. Would you mind sharing your conf_file? Or is it standard?
Thanks for the quick reply! Sure!

```yaml
nlp_engine_name: transformers
```
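Only the first line of the shared configuration survives above. For context, a Presidio transformers NLP-engine config generally follows the schema below; this is a sketch based on Presidio's documented configuration format, and the model names and entity mappings are illustrative assumptions, not the reporter's actual file:

```yaml
nlp_engine_name: transformers
models:
  - lang_code: fr
    model_name:
      spacy: fr_core_news_sm  # spaCy pipeline used for tokenization (assumed)
      transformers: Jean-Baptiste/camembert-ner-with-dates  # HF NER model from the code above
ner_model_configuration:
  aggregation_strategy: simple  # how subword predictions are merged into spans
  alignment_mode: expand        # how spans are snapped to spaCy token boundaries
  model_to_presidio_entity_mapping:
    PER: PERSON
    LOC: LOCATION
    ORG: ORGANIZATION
    DATE: DATE_TIME
```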
@thomas-moulin is the warning the only thing that gets outputted? I tried to reproduce this and got
Yes, in my case only the warning gets outputted. The `result` variable is an empty list.
If you change
Yes @omri374, it works!
Great. Leaving the issue open as there still could be corner cases where there's a wrong output. |
Yes! On longer inputs (OCR of a French resume) I still get some warnings, but not as many as before.
Warnings are inevitable (they come from spacy-huggingface-pipelines), but I'd be interested to see if there are missing predictions.
Hello @omri374, @VMD7. I am currently experiencing the same issue. Below is the code that can reproduce the problem.

```python
from presidio_analyzer import AnalyzerEngine, RecognizerResult
from presidio_analyzer.nlp_engine import NerModelConfiguration, TransformersNlpEngine

model_config = [
    {
        "lang_code": "en",
        "model_name": {
            "spacy": "en_core_web_sm",
            "transformers": "lakshyakh93/deberta_finetuned_pii",
        },
    }
]

mapping = dict(
    USERNAME="USERNAME",
    EMAIL="EMAIL",
    KEY="KEY",
    PASSWORD="PASSWORD",
    IP_ADDRESS="IP_ADDRESS",
    FIRSTNAME="FIRSTNAME",
    LASTNAME="LASTNAME",
    MIDDLENAME="MIDDLENAME",
    IPV4="IP_ADDRESS",
    IPV6="IP_ADDRESS",
    IP="IP_ADDRESS",
    PHONE_NUMBER="PHONE_NUMBER",
    SSN="SSN",
    ACCOUNTNUMBER="ACCOUNTNUMBER",
    CREDITCARDNUMBER="CREDITCARDNUMBER",
    CREDITCARDISSUER="CREDITCARDISSUER",
    CREDITCARDCVV="CREDITCARDCVV",
)

ner_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
)

nlp_engine = TransformersNlpEngine(
    models=model_config,
    ner_model_configuration=ner_model_configuration,
)

engine = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["en"],
)

print(
    engine.analyze(
        "My name is Clara and I live in Berkeley. this is my ip address : 175.5.0.1. "
        "this is my password: sad$f-j?ss11FF. credit card is 1231-1231-1451-2134",
        language="en",
    )
)
```

It seems like there might be an issue with using spacy and transformers together. Related issue: explosion/spaCy#12998
@fml09 do you experience skipped entities or just warnings? |
@omri374 Result:
@omri374 any news? |
Looking into this. If we can't find a resolution, we will likely remove the dependency on spacy-huggingface-pipelines and call transformers directly. |
For your specific case, please try changing the aggregation strategy:

```python
ner_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
    aggregation_strategy="max",
)
```

This will result in the credit card number being fully identified.
Hi! I've been working on a transformer-based Presidio pipeline, and I noticed it was performing rather poorly. Upon inspecting the logs, I found this particular warning:
```
UserWarning: Skipping annotation, {'entity_group': 'PASSWORD', 'score': 0.25415105, 'word': '##bmh78', 'start': 157, 'end': 1623} is overlapping or can't be aligned for doc 'Standardized tests will be...'
```
The root cause of this issue is this line in the spacy-huggingface-pipelines package. I know this isn't directly Presidio-related, but is there any configuration change I can make in Presidio to prevent these spans from being skipped? Or is there another issue I'm not seeing here?
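The skipping behavior being described can be sketched in miniature. This is a toy model, not spaCy's API: spaCy's real `Doc.char_span` takes an `alignment_mode` argument (`"strict"`, `"contract"`, `"expand"`), and a strict lookup returns `None` when a predicted character span does not land exactly on token boundaries, which is when the annotation gets dropped. The token offsets below are invented:

```python
# Toy model of token-boundary alignment. Not spaCy's implementation;
# token offsets and the example span are invented for illustration.

def char_span(token_offsets, start, end, mode="strict"):
    """Return the span if it aligns to token boundaries, else handle per mode."""
    starts = {s for s, _ in token_offsets}
    ends = {e for _, e in token_offsets}
    if mode == "strict":
        # Misaligned span -> None, i.e. the annotation is skipped with a warning.
        return (start, end) if start in starts and end in ends else None
    # "expand": snap outward to the enclosing token boundaries instead of dropping.
    new_start = max(s for s in starts if s <= start)
    new_end = min(e for e in ends if e >= end)
    return (new_start, new_end)

tokens = [(0, 4), (5, 11)]  # boundaries of two tokens in a toy doc
print(char_span(tokens, 5, 10, "strict"))  # None -> span skipped
print(char_span(tokens, 5, 10, "expand"))  # (5, 11) -> span kept, widened
```

If Presidio's `NerModelConfiguration` exposes an `alignment_mode` setting in your version, switching it from strict to expand-style behavior may be worth trying for exactly this reason.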
I followed the online tutorial and am using a publicly available dataset to test different models. I'm not sure why this could be happening.
Any help to debug this will be super helpful, thanks!