Some spans are being skipped by spacy-huggingface-pipelines, resulting in poor anonymisation #1262
Comments
Hi @aayushisanghi
@aayushisanghi, as @VMD7 mentioned, a reproducible example would definitely help. Thanks!
@aayushisanghi, we'd be very interested to know more about this issue, especially as the result is poor anonymization. Any feedback would be valuable.
Hello! I am having the same issue. Here is my code:

```python
import transformers
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModelForTokenClassification
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

transformers_model = "Jean-Baptiste/camembert-ner-with-dates"
snapshot_download(repo_id=transformers_model)
AutoTokenizer.from_pretrained(transformers_model)
AutoModelForTokenClassification.from_pretrained(transformers_model)

conf_file = "/Users/thomasmoulin/Downloads/config_presidio_fr_transformer.yml"
provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["fr"]
)

result = analyzer.analyze(text="Je m'appelle Thomas Moulin", language="fr")
```
Thanks @thomas-moulin. Would you mind sharing your conf_file? Or is it standard?
Thanks for the quick reply! Sure!

```yaml
nlp_engine_name: transformers
```
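Only the first line of the shared configuration survives above. For context, a Presidio transformers NLP-engine config generally follows the schema below; this is a sketch based on Presidio's documented configuration format, and the model names and entity mappings are illustrative assumptions, not the reporter's actual file:

```yaml
nlp_engine_name: transformers
models:
  - lang_code: fr
    model_name:
      spacy: fr_core_news_sm  # spaCy pipeline used for tokenization (assumed)
      transformers: Jean-Baptiste/camembert-ner-with-dates  # HF NER model from the code above
ner_model_configuration:
  aggregation_strategy: simple  # how subword predictions are merged into spans
  alignment_mode: expand        # how spans are snapped to spaCy token boundaries
  model_to_presidio_entity_mapping:
    PER: PERSON
    LOC: LOCATION
    ORG: ORGANIZATION
    DATE: DATE_TIME
```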
@thomas-moulin is the warning the only thing that gets outputted? I tried to reproduce this and got
Yes, in my case only the warning gets outputted. The `result` variable is an empty list.
If you change
Yes @omri374, it works!
Great. Leaving the issue open as there still could be corner cases where there's a wrong output. |
Yes! On longer inputs (OCR of a French resume) I still get some warnings, but not as many as before.
Warnings are inevitable (they come from spacy-huggingface-pipelines), but I'd be interested to see if there are missing predictions.
Hello @omri374, @VMD7. I am currently experiencing the same issue. Below is the code that can reproduce the problem.

```python
from presidio_analyzer import AnalyzerEngine, RecognizerResult
from presidio_analyzer.nlp_engine import NerModelConfiguration, TransformersNlpEngine

model_config = [
    {
        "lang_code": "en",
        "model_name": {
            "spacy": "en_core_web_sm",
            "transformers": "lakshyakh93/deberta_finetuned_pii",
        },
    }
]

mapping = dict(
    USERNAME="USERNAME",
    EMAIL="EMAIL",
    KEY="KEY",
    PASSWORD="PASSWORD",
    IP_ADDRESS="IP_ADDRESS",
    FIRSTNAME="FIRSTNAME",
    LASTNAME="LASTNAME",
    MIDDLENAME="MIDDLENAME",
    IPV4="IP_ADDRESS",
    IPV6="IP_ADDRESS",
    IP="IP_ADDRESS",
    PHONE_NUMBER="PHONE_NUMBER",
    SSN="SSN",
    ACCOUNTNUMBER="ACCOUNTNUMBER",
    CREDITCARDNUMBER="CREDITCARDNUMBER",
    CREDITCARDISSUER="CREDITCARDISSUER",
    CREDITCARDCVV="CREDITCARDCVV",
)

ner_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
)

nlp_engine = TransformersNlpEngine(
    models=model_config,
    ner_model_configuration=ner_model_configuration,
)

engine = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["en"],
)

print(
    engine.analyze(
        "My name is Clara and I live in Berkeley. this is my ip address : 175.5.0.1. "
        "this is my password: sad$f-j?ss11FF. credit card is 1231-1231-1451-2134",
        language="en",
    )
)
```

It seems like there might be an issue with using spacy and transformers together. Related issue: explosion/spaCy#12998
@fml09 do you experience skipped entities or just warnings? |
@omri374 Result:
@omri374 any news? |
Looking into this. If we can't find a resolution, we will likely remove the dependency on spacy-huggingface-pipelines and call transformers directly. |
For your specific case, please try changing the aggregation strategy:

```python
ner_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
    aggregation_strategy="max",
)
```

This will result in the credit card number being fully identified.
Hi! I've been working on a transformer-based Presidio pipeline, and I noticed it was performing rather poorly. Upon inspecting the logs, I found this particular warning:
```
UserWarning: Skipping annotation, {'entity_group': 'PASSWORD', 'score': 0.25415105, 'word': '##bmh78', 'start': 157, 'end': 1623} is overlapping or can't be aligned for doc 'Standardized tests will be...'
```
The root cause of this issue is this line in the spacy-huggingface-pipelines package. I know this isn't directly Presidio-related, but is there any configuration change I can make in Presidio to prevent these spans from being skipped? Or is there another issue I'm not seeing here?
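The skipping behavior being described can be sketched in miniature. This is a toy model, not spaCy's API: spaCy's real `Doc.char_span` takes an `alignment_mode` argument (`"strict"`, `"contract"`, `"expand"`), and a strict lookup returns `None` when a predicted character span does not land exactly on token boundaries, which is when the annotation gets dropped. The token offsets below are invented:

```python
# Toy model of token-boundary alignment. Not spaCy's implementation;
# token offsets and the example span are invented for illustration.

def char_span(token_offsets, start, end, mode="strict"):
    """Return the span if it aligns to token boundaries, else handle per mode."""
    starts = {s for s, _ in token_offsets}
    ends = {e for _, e in token_offsets}
    if mode == "strict":
        # Misaligned span -> None, i.e. the annotation is skipped with a warning.
        return (start, end) if start in starts and end in ends else None
    # "expand": snap outward to the enclosing token boundaries instead of dropping.
    new_start = max(s for s in starts if s <= start)
    new_end = min(e for e in ends if e >= end)
    return (new_start, new_end)

tokens = [(0, 4), (5, 11)]  # boundaries of two tokens in a toy doc
print(char_span(tokens, 5, 10, "strict"))  # None -> span skipped
print(char_span(tokens, 5, 10, "expand"))  # (5, 11) -> span kept, widened
```

If Presidio's `NerModelConfiguration` exposes an `alignment_mode` setting in your version, switching it from strict to expand-style behavior may be worth trying for exactly this reason.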
I followed the online tutorial and am using a publicly available dataset to test different models. I'm not sure why this could be happening.
Any help to debug this will be super helpful, thanks!