A Docker-powered service for named entity extraction from text or PDF files.
This repository provides a Docker-powered service for Named Entity Recognition (NER), enabling the extraction of specific entities from text or PDF files. The service enables extraction of various entities with the help of pdf-document-layout-analysis, a service that segments documents with high accuracy.
- GitHub: NER-in-docker
- HuggingFace: NER-in-docker
Clone the service:
git clone https://github.com/huridocs/NER-in-docker
cd NER-in-docker
Run the service:
-
With GPU support:
make start
-
Without GPU support:
make start_no_gpu
Get the entities from a text:
curl -X POST -d "text=Some example text" http://localhost:5070
Get the entities from a PDF:
curl -X POST -F "file=@/PATH/TO/PDF/pdf_name.pdf" localhost:5070
To stop the server:
make stop
- Docker Desktop 4.25.0 install link
- For GPU support install link
- 6 GB RAM memory
- 6 GB GPU memory (if not, it will run on CPU)
For entity extraction, the service use two different models.
One of them is GLiNER Multi v2.1. GLiNER is a named entity recognition model
capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to
traditional NER models, which are limited to predefined entities, and Large Language Models (LLMs) that, despite their flexibility,
are costly and large for resource-constrained scenarios. In this project, we use GLiNER specifically for extracting DATE
entities.
The other model is Flair NER English Ontonotes Large. Flair is a
powerful NLP library and provides various kinds of models. This model we use in our project is one of the largest models Flair provides,
capable of detecting up to 18 entity classes. However, we use this model to extract only the following entities: PERSON
, ORGANIZATION
, LOCATION
and LAW
.
In addition to these NER models, you can check this link for details on the models that segment documents: PDF Document Layout Analysis Models
As we mentioned int the Quick Start, you can use the service like this:
curl -X POST -d "text=Some example text" http://localhost:5070
or
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5070
When text is sent to the service, it is passed directly to the NER models.
When a PDF is sent, the PDF is first segmented using the pdf-document-layout-analysis service. Then, for each segment, entities are extracted by the NER models.
When you send a PDF to the service, you should be prepared that the service may use lots of resources. The pdf-document-layout-analysis service segments the PDFs using a visual model. So, please note that if you do not have GPU in your system, or enough free GPU memory, this visual model will run on CPU, and it might cause longer response time.
After entity extraction, the service tries to merge entities with different or incorrect spellings into a single representative entity (e.g., "Jane Doe" and "J. Doe" will be merged into "Jane Doe").
When the process is done, the output will be a NamedEntitiesResponse
in the following format:
{
"entities": [
{
"group_name": Representative name for all the variations of the same entity (e.g: "Jane Doe")
"type": Type of the entity (e.g: "PERSON")
"text": Text of the entity (e.g: "J. Doe")
"character_start": Starting index of the entity within the context
"character_end": Ending index of the entity within the context
}
],
"groups": [
{
"group_name": Representative name for all the variations of the same entity (e.g: "Jane Doe")
"type": Type of the entity group (e.g: "PERSON")
"entities_ids": [
IDs of the entities in the group (e.g: [0, 5])
],
"entities_text": [
Texts of the entities in the group (e.g: ["Jane Doe", "J. Doe"])
]
}
]
}
To make it more clear, let's explain.
The response will return two lists, one is "entities" and the other one is "groups".
The "entities" list contains all the entities found in the given input, doesn't matter their variations. Every entity in "entities" list will have:
-
"group_name", this attribute will include the representative name of the given entity. For example, if there are two entities like "J. Doe" and "Jane Doe", these entities will be merged into the same group, and the group_name will be the most representative version of these entities, which is "Jane Doe".
-
The other attributes like "type", "text", "character_start" and "character_end" are self-explanatory.
The "groups" list will contain the groups of entities. The "group_name" and "type" attributes here are the same with the "entities" list. "entities_ids" and "entities_text" attributes will hold the IDs of each entity and the variations of all the entities in the same group.
After you are done with the service, you can stop it with this command:
make stop
For GLiNER benchmark details, you can refer to this link.
Also you can refer to this paper.
For Flair benchmark details, you can refer to this link.
Also you can refer to this paper.