example GLiNER integration #1504

Open
wants to merge 9 commits into
base: main
9 changes: 5 additions & 4 deletions docs/samples/index.md
@@ -25,14 +25,15 @@
| Usage | Text | Python file | [Passing a lambda as a Presidio anonymizer using Faker](python/example_custom_lambda_anonymizer.py)|
| Usage | Text | Python file | [Synthetic data generation with OpenAI](python/synth_data_with_openai.ipynb)|
| Usage | Text | Python file | [Keeping some entities from being anonymized](python/keep_entities.ipynb)|
| Usage | Text | LiteLLM Proxy | [PII Masking LLM calls across Anthropic/Gemini/Bedrock/Azure, etc.](docker/litellm.md)|
| Usage | Text | Python Notebook | [YAML based no-code configuration](python/no_code_config.ipynb) |
| Usage | Text | Python file | [Using GLiNER within Presidio](python/gliner.md) |
| Usage | | REST API (postman) | [Presidio as a REST endpoint](docker/index.md)|
| Deployment | | App Service | [Presidio with App Service](deployments/app-service/index.md)|
| Deployment | | Kubernetes | [Presidio with Kubernetes](deployments/k8s/index.md)|
| Deployment | | Spark/Azure Databricks | [Presidio with Spark](deployments/spark/index.md)|
| Deployment | | Azure Data Factory with App Service | [ETL for small dataset](deployments/data-factory/presidio-data-factory.md#option-1-presidio-as-an-http-rest-endpoint) |
| Deployment | | Azure Data Factory with Databricks | [ETL for large datasets](deployments/data-factory/presidio-data-factory.md#option-2-presidio-on-azure-databricks) |
| ADF Pipeline | | Azure Data Factory | [Add Presidio as an HTTP service to your Azure Data Factory](deployments/data-factory/presidio-data-factory-template-gallery-http.md) |
| ADF Pipeline | | Azure Data Factory | [Add Presidio on Databricks to your Azure Data Factory](deployments/data-factory/presidio-data-factory-template-gallery-databricks.md) |
| Demo | | Streamlit app | [Create a simple demo app using Streamlit](python/streamlit/index.md)
79 changes: 79 additions & 0 deletions docs/samples/python/gliner.md
@@ -0,0 +1,79 @@
# Using GLiNER within Presidio

## What is GLiNER

GLiNER is a Named Entity Recognition (NER) model that can identify any entity type using a bidirectional transformer encoder (BERT-like). It offers a practical alternative both to traditional NER models, which are limited to predefined entity types, and to Large Language Models (LLMs), which are flexible but too large and costly for resource-constrained scenarios.

Paper: [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)

Because GLiNER takes both the text and the entity types as input, it can be used for zero-shot named entity recognition: it can recognize entity types that were not seen during training.

## PII Detection with GLiNER

GLiNER has a trained PII detection model: 🔍 [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) *(Apache 2.0)*

This model is capable of recognizing various types of *personally identifiable information* (PII), including but not limited to these entity types: `person`, `organization`, `phone number`, `address`, `passport number`, `email`, `credit card number`, `social security number`, `health insurance id number`, `date of birth`, `mobile phone number`, `bank account number`, `medication`, `cpf`, `driver's license number`, `tax identification number`, `medical condition`, `identity card number`, `national id number`, `ip address`, `email address`, `iban`, `credit card expiration date`, `username`, `health insurance number`, `registration number`, `student id number`, `insurance number`, `flight number`, `landline phone number`, `blood type`, `cvv`, `reservation number`, `digital signature`, `social media handle`, `license plate number`, `cnpj`, `postal code`, `passport_number`, `serial number`, `vehicle registration number`, `credit card brand`, `fax number`, `visa number`, `insurance company`, `identity document number`, `transaction number`, `national health insurance number`, `cvc`, `birth certificate number`, `train ticket number`, `passport expiration date`, and `social_security_number`.
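
The model can also be called directly, outside Presidio. The following is a minimal sketch, assuming the separate `gliner` Python package is installed and the model can be downloaded from Hugging Face. The helper names `detect_pii` and `format_entities` and the label subset `PII_LABELS` are illustrative and not part of any library.

```python
from typing import Dict, List

# A small, hand-picked subset of the labels listed above (illustrative choice).
PII_LABELS = ["person", "organization", "phone number", "email", "address"]


def detect_pii(text: str, labels: List[str], threshold: float = 0.5) -> List[Dict]:
    """Run the GLiNER PII model on `text`, looking only for `labels`."""
    # Imported lazily so the formatting helper works without `gliner` installed.
    from gliner import GLiNER

    model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")
    # Each prediction is a dict with "text", "label", "start", "end" and "score".
    return model.predict_entities(text, labels, threshold=threshold)


def format_entities(entities: List[Dict]) -> List[str]:
    """Turn raw predictions into short, human-readable strings."""
    return [f'{e["label"]}: {e["text"]} ({e["score"]:.2f})' for e in entities]
```

Calling `format_entities(detect_pii(some_text, PII_LABELS))` loads the model on first use, so the first call may be slow. Because the labels are passed at inference time, the list can be changed without retraining.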

## Using GLiNER with Presidio

Presidio has a built-in `EntityRecognizer` for GLiNER: `GLiNERRecognizer`. This recognizer can be used to detect PII entities in text using the GLiNER model.
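
Conceptually, a recognizer like this has to translate GLiNER's free-text labels into Presidio entity types, which is what the `entity_mapping` argument in the example further below is for. The sketch below illustrates that idea only. It is an assumption about the general approach, not the actual `GLiNERRecognizer` code, and `map_predictions` with its `drop_unmapped` flag is a hypothetical helper.

```python
from typing import Dict, List


def map_predictions(
    predictions: List[Dict],
    entity_mapping: Dict[str, str],
    drop_unmapped: bool = True,
) -> List[Dict]:
    """Rename GLiNER labels to Presidio-style entity types.

    `predictions` are dicts shaped like GLiNER output
    ("label", "start", "end", "score").
    """
    mapped = []
    for pred in predictions:
        entity_type = entity_mapping.get(pred["label"].lower())
        if entity_type is None:
            if drop_unmapped:
                # Hypothetical policy: ignore labels with no mapping.
                continue
            # Hypothetical fallback: derive a name from the raw label.
            entity_type = pred["label"].upper().replace(" ", "_")
        mapped.append(
            {
                "entity_type": entity_type,
                "start": pred["start"],
                "end": pred["end"],
                "score": pred["score"],
            }
        )
    return mapped
```

Mapping several GLiNER labels (for example `person` and `name`) to one Presidio type keeps downstream anonymization rules simple, since they only need to handle `PERSON`.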

### Installation

To use GLiNER with Presidio, install `presidio-analyzer` with the `gliner` extra:

```bash
pip install 'presidio-analyzer[gliner]'
```

!!! note
    GLiNER only supports Python 3.10 and above, while Presidio supports Python 3.9 and above.
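
Code that must also run on Python 3.9 could guard the import. This is a minimal sketch based only on the version requirement stated in the note. The helper names `gliner_supported` and `load_gliner_recognizer_class` are hypothetical.

```python
import sys


def gliner_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the stated minimum (3.10)."""
    return tuple(version_info[:2]) >= (3, 10)


def load_gliner_recognizer_class():
    """Return the GLiNERRecognizer class, or None on older interpreters."""
    if not gliner_supported():
        return None
    # Imported lazily so that Python 3.9 environments never touch GLiNER.
    from presidio_analyzer.predefined_recognizers import GLiNERRecognizer

    return GLiNERRecognizer
```

A caller can then fall back to the default recognizers when `load_gliner_recognizer_class()` returns `None`.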

### Example

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import GLiNERRecognizer


# Load a small spaCy model as we don't need spaCy's NER
nlp_engine = NlpEngineProvider(
    nlp_configuration={
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
    }
).create_engine()

# Create an analyzer engine that uses the small spaCy model
analyzer_engine = AnalyzerEngine(nlp_engine=nlp_engine)

# Define and create the GLiNER recognizer.
# Keys are GLiNER labels; values are the Presidio entity types they map to.
entity_mapping = {
    "person": "PERSON",
    "name": "PERSON",
    "organization": "ORGANIZATION",
    "location": "LOCATION",
}

gliner_recognizer = GLiNERRecognizer(
    model_name="urchade/gliner_multi_pii-v1",
    entity_mapping=entity_mapping,
    flat_ner=False,
    multi_label=True,
    map_location="cpu",
)

# Add the GLiNER recognizer to the registry
analyzer_engine.registry.add_recognizer(gliner_recognizer)

# Remove the spaCy recognizer to avoid NER coming from spaCy
analyzer_engine.registry.remove_recognizer("SpacyRecognizer")

# Analyze text
results = analyzer_engine.analyze(
    text="Hello, my name is Rafi Mor, I'm from Binyamina and I work at Microsoft.",
    language="en",
)

print(results)
```
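
The printed results are a list of objects carrying `entity_type`, `start`, `end`, and `score`. The sketch below is one way to turn them into readable tuples. `summarize` is a hypothetical helper, not part of Presidio, and it works on any objects that expose those four attributes.

```python
from typing import Iterable, List, Tuple


def summarize(results: Iterable, text: str) -> List[Tuple[str, str, float]]:
    """Return (entity_type, matched_text, score) tuples ordered by position."""
    ordered = sorted(results, key=lambda r: r.start)
    return [
        (r.entity_type, text[r.start:r.end], round(r.score, 2))
        for r in ordered
    ]
```

Passing the `results` list and the original text to `summarize` makes it easier to spot which span each detection refers to.
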
85 changes: 43 additions & 42 deletions mkdocs.yml
@@ -74,48 +74,49 @@ nav:
- Presidio Structured Python API: api/structured_python.md
- REST API reference: https://microsoft.github.io/presidio/api-docs/api-docs.html" target="_blank
- Samples:
- Usage:
- Home: samples/index.md
- Text:
- Presidio Basic Usage Notebook: samples/python/presidio_notebook.ipynb
- Customizing Presidio Analyzer: samples/python/customizing_presidio_analyzer.ipynb
- Configuring The NLP engine: samples/python/ner_model_configuration.ipynb
- Encrypting and Decrypting identified entities: samples/python/encrypt_decrypt.ipynb
- Getting the identified entity value using a custom Operator: samples/python/getting_entity_values.ipynb
- Anonymizing known values: samples/python/Anonymizing known values.ipynb
- Keeping some entities from being anonymized: samples/python/keep_entities.ipynb
- Integrating with external services: samples/python/integrating_with_external_services.ipynb
- Remote Recognizer: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py
- Azure AI Language as a Remote Recognizer: samples/python/text_analytics/index.md
- Using Flair as an external PII model: https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py
- Using Span Marker as an external PII model: https://github.com/microsoft/presidio/blob/main/docs/samples/python/span_marker_recognizer.py
- Using Transformers as an external PII model: samples/python/transformers_recognizer/index.md
- Pseudonymization (replace PII values using mappings): samples/python/pseudonymization.ipynb
- Passing a lambda as a Presidio anonymizer using Faker: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_custom_lambda_anonymizer.py
- Synthetic data generation with OpenAI: samples/python/synth_data_with_openai.ipynb
- YAML based no-code configuration: samples/python/no_code_config.ipynb
- Data:
- Analyzing structured / semi-structured data in batch: samples/python/batch_processing.ipynb
- Presidio Structured Basic Usage Notebook: samples/python/example_structured.ipynb
- Analyze and Anonymize CSV file: https://github.com/microsoft/presidio/blob/main/docs/samples/python/process_csv_file.py
- Images:
- Redacting Text PII from DICOM images: samples/python/example_dicom_image_redactor.ipynb
- Using an allow list with image redaction: samples/python/image_redaction_allow_list_approach.ipynb
- Plot custom bounding boxes: samples/python/plot_custom_bboxes.ipynb
- Example DICOM redaction evaluation: samples/python/example_dicom_redactor_evaluation.ipynb
- PDF:
- Annotating PII in a PDF: samples/python/example_pdf_annotation.ipynb
- Deployment:
- Presidio with App Service: samples/deployments/app-service/index.md
- Presidio with Kubernetes: samples/deployments/k8s/index.md
- Presidio with Spark: samples/deployments/spark/index.md
- Azure Data Factory:
- ETL using AppService/Databricks: samples/deployments/data-factory/presidio-data-factory.md
- Add Presidio as an HTTP service to your Azure Data Factory: samples/deployments/data-factory/presidio-data-factory-template-gallery-http.md
- Add Presidio on Databricks to your Azure Data Factory: samples/deployments/data-factory/presidio-data-factory-template-gallery-databricks.md
- PII Masking LLM calls using LiteLLM proxy: samples/docker/litellm.md
- Demo:
- Create a simple demo app using Streamlit: samples/python/streamlit/index.md

- Home: samples/index.md
- Text:
- Presidio Basic Usage Notebook: samples/python/presidio_notebook.ipynb
- Customizing Presidio Analyzer: samples/python/customizing_presidio_analyzer.ipynb
- Configuring The NLP engine: samples/python/ner_model_configuration.ipynb
- Encrypting and Decrypting identified entities: samples/python/encrypt_decrypt.ipynb
- Getting the identified entity value using a custom Operator: samples/python/getting_entity_values.ipynb
- Anonymizing known values: samples/python/Anonymizing known values.ipynb
- Keeping some entities from being anonymized: samples/python/keep_entities.ipynb
- Integrating with external services: samples/python/integrating_with_external_services.ipynb
- Remote Recognizer: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py
- Azure AI Language as a Remote Recognizer: samples/python/text_analytics/index.md
- Using Flair as an external PII model: https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py
- Using Span Marker as an external PII model: https://github.com/microsoft/presidio/blob/main/docs/samples/python/span_marker_recognizer.py
- Using Transformers as an external PII model: samples/python/transformers_recognizer/index.md
- Using GLiNER as an external PII model: samples/python/gliner.md
- Pseudonymization (replace PII values using mappings): samples/python/pseudonymization.ipynb
- Passing a lambda as a Presidio anonymizer using Faker: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_custom_lambda_anonymizer.py
- Synthetic data generation with OpenAI: samples/python/synth_data_with_openai.ipynb
- YAML based no-code configuration: samples/python/no_code_config.ipynb
- Data:
- Analyzing structured / semi-structured data in batch: samples/python/batch_processing.ipynb
- Presidio Structured Basic Usage Notebook: samples/python/example_structured.ipynb
- Analyze and Anonymize CSV file: https://github.com/microsoft/presidio/blob/main/docs/samples/python/process_csv_file.py
- Images:
- Redacting Text PII from DICOM images: samples/python/example_dicom_image_redactor.ipynb
- Using an allow list with image redaction: samples/python/image_redaction_allow_list_approach.ipynb
- Plot custom bounding boxes: samples/python/plot_custom_bboxes.ipynb
- Example DICOM redaction evaluation: samples/python/example_dicom_redactor_evaluation.ipynb
- PDF:
- Annotating PII in a PDF: samples/python/example_pdf_annotation.ipynb
- Deployment:
- Presidio with App Service: samples/deployments/app-service/index.md
- Presidio with Kubernetes: samples/deployments/k8s/index.md
- Presidio with Spark: samples/deployments/spark/index.md
- Azure Data Factory:
- ETL using AppService/Databricks: samples/deployments/data-factory/presidio-data-factory.md
- Add Presidio as an HTTP service to your Azure Data Factory: samples/deployments/data-factory/presidio-data-factory-template-gallery-http.md
- Add Presidio on Databricks to your Azure Data Factory: samples/deployments/data-factory/presidio-data-factory-template-gallery-databricks.md
- PII Masking LLM calls using LiteLLM proxy: samples/docker/litellm.md
- Demo app:
- Create a simple demo app using Streamlit: samples/python/streamlit/index.md
not_in_nav : |
design.md
samples/deployments/index.md
@@ -17,6 +17,7 @@
from .es_nie_recognizer import EsNieRecognizer
from .es_nif_recognizer import EsNifRecognizer
from .fi_personal_identity_code_recognizer import FiPersonalIdentityCodeRecognizer
from .gliner_recognizer import GLiNERRecognizer
from .iban_recognizer import IbanRecognizer
from .in_aadhaar_recognizer import InAadhaarRecognizer
from .in_pan_recognizer import InPanRecognizer
@@ -96,6 +97,7 @@
"ItIdentityCardRecognizer",
"ItPassportRecognizer",
"InPanRecognizer",
"GLiNERRecognizer",
"PlPeselRecognizer",
"AzureAILanguageRecognizer",
"InAadhaarRecognizer",