Custom Entity Detection Issue in Presidio Version 2 #1305

Open
surajsonee opened this issue Feb 19, 2024 · 9 comments

Comments

@surajsonee

I'm encountering an issue with custom entity detection in Presidio Version 2. Despite defining custom entity rules for medical numbers, including Aadhar and Health Insurance Claim Numbers (HICNs), Presidio does not seem to recognize them correctly.

I've followed the steps to create a YAML file (medical_numbers.yaml) containing the custom entity rules and mounted it into the Presidio container. However, when I test the detection using sample text containing these medical numbers, Presidio either returns incorrect entity types or fails to detect them altogether.

I've verified that the regex patterns in the YAML file are correct and aligned with the requirements for detecting Aadhar and HICNs. Additionally, I've ensured that the YAML file is correctly mounted into the Presidio container and configured for recognition.

Please advise on how to troubleshoot and resolve this issue with custom entity detection in Presidio Version 2.

I followed these steps:

  1. Save the YAML file: Save the YAML file containing the custom entity rules (in this case, medical_numbers.yaml) to a location accessible by your Presidio container.

  2. Mount the YAML file into the container: When running the Presidio container, mount the directory containing the YAML file to the appropriate configuration directory inside the container (usually /presidio/config/).
    Command to run the Presidio container with the custom entity configuration file mounted:
    docker run -d \
      -v /path/to/medical_numbers.yaml:/presidio/config/medical_numbers.yaml \
      -p 8080:8080 \
      mcr.microsoft.com/presidio-analyzer:latest

  3. Configure Presidio to recognize the custom entity: Inside the container, ensure that Presidio is configured to load and recognize the custom entity rules from the mounted YAML file.

  4. Testing detection: After configuring Presidio, you can test the detection of medical numbers, including Aadhar, and Health Insurance Claim Numbers, in your text data.
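One way to separate regex problems from loading problems is to check the pattern outside Presidio first. The sketch below uses only Python's `re` module. The Aadhar pattern is an assumption for illustration (12 digits, optionally grouped 4-4-4, first digit 2 to 9) and is not the pattern from medical_numbers.yaml, which is not shown in this thread; the sample text and the helper name `find_matches` are also hypothetical.

```python
import re

# Assumed pattern for illustration only: 12 digits, optional spaces between
# groups of four, first digit restricted to 2-9.
AADHAAR_PATTERN = r"\b[2-9]\d{3}\s?\d{4}\s?\d{4}\b"


def find_matches(pattern, text):
    """Return (start, end, matched_text) for every match of pattern in text."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]


# Hypothetical sample: the first number fits the assumed format,
# the second starts with 0 and should be skipped.
sample = "Patient Aadhaar: 2345 6789 0123, claim ref 0123 4567 8901"

for start, end, matched in find_matches(AADHAAR_PATTERN, sample):
    print(f"{start}-{end}: {matched}")
```

If the pattern matches here but Presidio still misses the entity, the problem is more likely in how the recognizer is loaded than in the regex itself.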

@omri374
Contributor

omri374 commented Feb 20, 2024

@surajsonee can you please elaborate on how you did step 3? Was the YAML loaded and were the recognizers added? If you print the list of recognizers, are the ones from the YAML listed there?

@surajsonee
Author

surajsonee commented Feb 22, 2024

sure!
Here is the yaml file which I'm using:
https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/example_recognizers.yaml

  1. Access the Presidio Container: Use the docker exec command to access the running Presidio container's shell. For example:
    sudo docker exec -it <container_id> bash

  2. Navigate to Configuration Directory: Inside the container, navigate to the directory where Presidio's configuration files are stored. This is typically the /presidio/config/ directory.

  3. Check Mounted YAML File: Verify that the custom entity YAML file is correctly mounted in the container's configuration directory, using the ls command to list the files in the directory.

  4. Access Presidio Configuration Directory: First, navigate to the directory where Presidio's configuration files are stored, which is the /presidio/config/ directory within the Presidio container.

  5. Review Configuration Files: Look for configuration files or scripts that are used to initialize Presidio's analyzer engine. These files often have names like config.yaml or similar.

Content of config.yaml:
custom_entities:
  yaml_path: /presidio/config/example_recognizers.yaml

  6. Inspect Configuration Content: Open the configuration file using a text editor or command-line tools like cat or less. Look for sections or properties related to loading custom entity rules or YAML files.

  7. Add Recognizers to Registry: Add the created recognizers to Presidio's recognizer registry.
    Here's a Python example demonstrating how to add recognizers after loading the YAML:

from presidio_analyzer import Pattern, PatternRecognizer, AnalyzerEngine
import yaml

# Load custom entity rules from YAML file
with open('example_recognizers.yaml', 'r') as yaml_file:
    custom_entity_rules = yaml.safe_load(yaml_file)

# Create an instance of AnalyzerEngine
analyzer = AnalyzerEngine()

# Iterate over each entity in the custom entity rules
for entity_name, entity_config in custom_entity_rules.items():
    patterns = entity_config.get('patterns', [])
    
    # Create a recognizer for each pattern defined for the entity
    for pattern_config in patterns:
        name = pattern_config.get('name')
        regex = pattern_config.get('regex')
        score = pattern_config.get('score', 0.8)  # Default score
        
        # Create a Pattern object
        pattern = Pattern(name=name, regex=regex, score=score)
        
        # Create a PatternRecognizer with the Pattern object
        recognizer = PatternRecognizer(supported_entity=entity_name, patterns=[pattern])
        
        # Add the recognizer to Presidio's recognizer registry
        analyzer.registry.add_recognizer(recognizer)

In the example above, example_recognizers.yaml is the YAML file containing the custom entity rules. The script reads this file, extracts the entity names and patterns, creates recognizers based on the extracted information, and adds them to Presidio's recognizer registry.

Please let me know what I'm doing wrong.

Thank you!

@omri374
Contributor

omri374 commented Feb 22, 2024

Hi, I'm not sure what's wrong, as you seem to add the recognizers the right way. Could it be that patterns are always empty?

BTW we have a method for adding recognizers from YAML: https://microsoft.github.io/presidio/analyzer/adding_recognizers/#reading-pattern-recognizers-from-yaml
Perhaps try to see if it makes any difference.
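A possible explanation, sketched under an assumption: if the file follows the layout of the linked example_recognizers.yaml, the parsed document is a single top-level `recognizers` key holding a list, not a mapping from entity names to configs. In that case a loop over `.items()` sees only the key `recognizers`, and its value is a list, so calling `.get('patterns', [])` on it would raise `AttributeError`. The `parsed` dictionary below is a hand-written stand-in for what `yaml.safe_load` would return for that layout, and the helper names are hypothetical.

```python
# Stand-in for yaml.safe_load() output, assuming the example_recognizers.yaml layout:
# one top-level "recognizers" key whose value is a list of recognizer definitions.
parsed = {
    "recognizers": [
        {
            "name": "Zip code Recognizer",
            "supported_language": "de",
            "supported_entity": "ZIP",
            "patterns": [
                {"name": "zip code (weak)", "regex": r"(\b\d{5}(?:\-\d{4})?\b)", "score": 0.01}
            ],
        },
        {
            "name": "Titles recognizer",
            "supported_language": "en",
            "supported_entity": "TITLE",
            "deny_list": ["Mr.", "Mrs."],  # deny-list recognizer, no patterns
        },
    ]
}


def naive_entity_names(document):
    """What a loop over document.items() would treat as entity names."""
    return [key for key, _ in document.items()]


def extract_patterns(document):
    """Walk the list under 'recognizers' and collect one row per pattern."""
    rows = []
    for recognizer in document.get("recognizers", []):
        entity = recognizer["supported_entity"]
        for pattern in recognizer.get("patterns", []):
            rows.append((entity, pattern["name"], pattern["regex"], pattern.get("score", 0.8)))
    return rows


print(naive_entity_names(parsed))   # the top-level key, not an entity name
for row in extract_patterns(parsed):
    print(row)
```

The entity name comes from each recognizer's `supported_entity` field, and recognizers without a `patterns` list (such as deny-list recognizers) contribute no rows here.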

@surajsonee
Author

Thank you for the reference!
Could you please provide guidance on which files require modification to establish custom entity rules?

@omri374
Contributor

omri374 commented Feb 23, 2024

Sure. If you change the default configuration in app.py:

self.engine = AnalyzerEngine()

To something more similar to the tutorial:

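from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

# Path to the YAML file with the custom recognizers
yaml_file = "recognizers.yaml"
registry = RecognizerRegistry()
registry.load_predefined_recognizers()

# Add the YAML-defined recognizers on top of the predefined ones
registry.add_recognizers_from_yaml(yaml_file)

self.engine = AnalyzerEngine(registry=registry)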

You should be able to load the yaml based recognizers into the analyzer engine, and these would be used in each call.
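To check whether the YAML recognizers actually ended up in the registry, one option is to list the registry contents after loading. The following is a minimal sketch, not the project's own code: the recognizer definition, the entity name `AADHAAR_NUMBER`, the regex, and the helper names are all hypothetical, while `RecognizerRegistry`, `load_predefined_recognizers`, and `add_recognizers_from_yaml` are the calls mentioned above. Presidio is imported inside `build_registry` so the file-writing helper can be used on its own.

```python
from pathlib import Path

# Hypothetical recognizer definition in the same layout as example_recognizers.yaml.
# Raw string: the doubled backslashes are written to the file as-is, and the YAML
# double-quoted string turns each pair into a single backslash for the regex.
RECOGNIZERS_YAML = r"""recognizers:
  - name: "Aadhaar recognizer (illustrative)"
    supported_language: "en"
    supported_entity: "AADHAAR_NUMBER"
    patterns:
      - name: "aadhaar (weak)"
        regex: "\\b[2-9]\\d{3}\\s?\\d{4}\\s?\\d{4}\\b"
        score: 0.5
    context:
      - aadhaar
"""


def write_recognizers_file(directory):
    """Write the illustrative YAML into directory and return its path."""
    path = Path(directory) / "recognizers.yaml"
    path.write_text(RECOGNIZERS_YAML, encoding="utf-8")
    return path


def build_registry(yaml_path):
    """Load the predefined recognizers plus the ones defined in yaml_path."""
    from presidio_analyzer import RecognizerRegistry  # requires presidio-analyzer

    registry = RecognizerRegistry()
    registry.load_predefined_recognizers()
    registry.add_recognizers_from_yaml(str(yaml_path))
    return registry


def list_recognizers(registry):
    """Return (name, supported_entities) for every recognizer in the registry."""
    return [(r.name, r.supported_entities) for r in registry.recognizers]
```

With presidio-analyzer installed, `list_recognizers(build_registry(write_recognizers_file(".")))` should include an entry for the illustrative recognizer. If it does not, the YAML was not loaded, which narrows the problem to file paths or the YAML layout.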

@surajsonee
Author

Thank you! I tried this but it didn't work. I have tried another solution which is working now.

@GautierT

GautierT commented Apr 4, 2024

@surajsonee: Hey, what is the other solution that you used? Thanks.

@surajsonee
Author

@GautierT I have made some customizations to it so that it is running fine now. You can email me and I will share the code with you.
Email: [email protected]

@WithIbadKhan

[email protected]
Please I need.
