One way of understanding and speeding up the consumption of unstructured text by end-users is to automatically extract so-called “key concepts” and order them in a meaningful way. Performing analytics over such extractions becomes a key task to facilitate knowledge acquisition and curation.
You will operate on an in-house collection of patent documents. Your task is to parse these patents and semantically enrich them using key concept/phrase extraction, so that users can later quickly review a document by looking only at its key phrases instead of reading the full text. To support that, store the key phrases and document identifiers in a data structure or database that supports fast retrieval for analytics. As an additional use case, generate a meaningful ordering of the top 30 key phrases for every document to help convey its main theme.
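A minimal sketch of such a store, assuming an SQLite backend (the keyphrases.db file name and the table schema below are illustrative, not part of this repository):

```python
import sqlite3

# Hypothetical schema: one row per (document, phrase) pair.
conn = sqlite3.connect("keyphrases.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS keyphrases (
           doc_id TEXT NOT NULL,
           phrase TEXT NOT NULL,
           score  REAL NOT NULL
       )"""
)
# An index on phrase enables fast "which documents mention X?" analytics.
conn.execute("CREATE INDEX IF NOT EXISTS idx_phrase ON keyphrases(phrase)")

def store(doc_id, scored_phrases):
    """scored_phrases: iterable of (phrase, score) tuples."""
    conn.executemany(
        "INSERT INTO keyphrases VALUES (?, ?, ?)",
        [(doc_id, phrase, score) for phrase, score in scored_phrases],
    )
    conn.commit()

def top_phrases(doc_id, n=30):
    """Top-n key phrases of a document, ordered by relevance score."""
    cur = conn.execute(
        "SELECT phrase FROM keyphrases WHERE doc_id = ? "
        "ORDER BY score DESC LIMIT ?",
        (doc_id, n),
    )
    return [row[0] for row in cur]
```

Ordering by the extraction score is the simplest meaningful ordering of the top 30 phrases per document; any other ranking criterion can be swapped in at query time.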
Parsing the XML documents requires the following steps:
- Download the unstructured XML file
- Unzip the patents file into the directory Phrase-Extractor-using-KeyBERT/data, then run:
```bash
cd Phrase-Extractor-using-KeyBERT/src
pip install bs4 absl-py
python parser.py
```
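For reference, the actual parsing logic lives in parser.py; the following is only a sketch of the idea using BeautifulSoup, with assumed tag names such as `<doc-number>` and `<abstract>` that depend on the real patent XML schema:

```python
from bs4 import BeautifulSoup

# Tag names below are assumptions; the real schema depends on the
# patent collection and on what parser.py actually extracts.
with open("data/sample_patent.xml", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "xml")  # the "xml" parser requires lxml

doc_id = soup.find("doc-number").get_text(strip=True)
abstract = soup.find("abstract").get_text(" ", strip=True)
print(doc_id, abstract[:80])
```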
Note: the download and parsing should be done before building the Docker image (~1 hour, depending on system configuration).
- First, clone the repository, then run the following commands:
```bash
cd Phrase-Extractor-using-KeyBERT
docker build -f Dockerfile -t docker_key_extractor .
```
- Once the Docker image has been built and the Python libraries have been installed successfully, start a container:
```bash
docker run -ti docker_key_extractor
```
- Activate the virtual environment:
```bash
source /venv/bin/activate
```
- If parsing is already done, or Phrase-Extractor-using-KeyBERT/data/raw is already available, run the following:
```bash
cd KPE/src
python3 keyBERT.py
```
Optionally, pass flags as described in the flag descriptions in keyBERT.py:
```bash
python3 keyBERT.py --model_name [model name]
```
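Since absl-py is among the installed dependencies, the flag is presumably defined along these lines. This is only a sketch under that assumption; the actual flag definitions (and default model) live in keyBERT.py:

```python
from absl import app, flags

FLAGS = flags.FLAGS
# Assumed default; the repository's keyBERT.py may use a different model.
flags.DEFINE_string("model_name", "all-MiniLM-L6-v2",
                    "sentence-transformers model used for embeddings")

def main(argv):
    del argv  # unused
    print(f"Extracting key phrases with {FLAGS.model_name}")

if __name__ == "__main__":
    app.run(main)
```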
Keyword Extraction: the process of extracting the most relevant words/phrases from an input text. Existing approaches like YAKE and RAKE rely on statistical features, which fail to capture the semantic structure of natural language. BERT, a bidirectional transformer model, converts text into embedding vectors that capture the context of a document. A detailed tutorial on how BERT embeddings are used for keyword extraction models can be found here
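To make the idea concrete, here is a minimal from-scratch sketch of embedding-based ranking, assuming the sentence-transformers model all-MiniLM-L6-v2 (any BERT-based encoder would do): embed the document and its candidate n-grams, then rank candidates by cosine similarity to the document.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

doc = "A method for extracting key phrases from patent documents using BERT."

# Candidate uni- and bigrams drawn from the document itself.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit([doc])
candidates = vectorizer.get_feature_names_out()

# Embed the document and all candidates into the same vector space.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode([doc])
cand_embs = model.encode(list(candidates))

# Rank candidates by cosine similarity to the document embedding.
sims = cosine_similarity(doc_emb, cand_embs)[0]
top5 = [candidates[i] for i in sims.argsort()[::-1][:5]]
print(top5)
```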
In this solution, the core idea can be split into the following steps (put together in the sketch after this list):
- Candidate Keywords/Keyphrases --> controlled by n_gram_range
- BERT Embeddings --> good performance on both similarity and paraphrase tasks. Available backends: spaCy, Hugging Face transformers, Flair, and sentence-transformers.
- Cosine Similarity --> compare the document and candidate embeddings
- Diversification --> applied if the keywords need to be diversified (two options available: Max Sum Similarity and Maximal Marginal Relevance)
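A minimal end-to-end sketch using KeyBERT's public API; the model name and parameter values are illustrative, and setting use_maxsum=True instead of use_mmr=True selects Max Sum Similarity:

```python
from keybert import KeyBERT

doc = "A method for extracting key phrases from patent documents using BERT."

kw_model = KeyBERT(model="all-MiniLM-L6-v2")  # sentence-transformers backend
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 3),  # candidate keyphrase length
    stop_words="english",
    use_mmr=True,                  # Maximal Marginal Relevance diversification
    diversity=0.5,                 # trade-off between relevance and diversity
    top_n=30,                      # top 30 phrases per document, as above
)
# keywords is a list of (phrase, score) tuples, ready to store and order.
```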