This project was completed as part of the Minor Project submission at Heritage Institute of Technology, under the mentorship of Prof. Sandipan Ganguly (HIT-K).
BNLP is a library with pre-trained models for POS tagging, word embedding, named entity recognition, FastText, Bengali stop words, Bengali corpus class recognition, and more.
-
PyPI package installation (tested with Python 3.6, 3.7, and 3.8):
pip install bnlp_toolkit
Or, to upgrade an existing installation:
pip install -U bnlp_toolkit
Raw Text -> Tokenization -> POS Tagging
-
We first used the Natural Language Toolkit (NLTK) library to define and apply basic POS tagging on an English corpus.
-
Next, we took a small Bengali corpus and tokenized each sentence into individual Bengali words using the rule-based BasicTokenizer from BNLP. We then applied the same procedure to two larger Bengali corpora.
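To make the idea of rule-based word tokenization concrete, here is a minimal sketch. It assumes that tokens are separated by whitespace and that punctuation, including the Bengali danda (।), forms standalone tokens. The function name `rule_based_tokenize` is hypothetical, and this is an illustration only, not BNLP's BasicTokenizer implementation.

```python
import string

# Assumption: ASCII punctuation plus the danda (।) and double danda (॥)
# are emitted as standalone tokens.
PUNCTUATION = set(string.punctuation) | {"।", "॥"}


def rule_based_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    tokens, current = [], []
    for ch in text:
        if ch.isspace() or ch in PUNCTUATION:
            # A separator ends the word collected so far.
            if current:
                tokens.append("".join(current))
                current = []
            # Punctuation is kept as its own token; whitespace is dropped.
            if ch in PUNCTUATION:
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


sample = "আমি বাংলায় গান গাই।"
tokens = rule_based_tokenize(sample)
print(tokens)
```

Iterating character by character, instead of using a `\w+` regular expression, keeps Bengali vowel signs attached to their words, since Python's `\w` does not match combining marks.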
-
Next, we used NLTKTokenizer from BNLP, also rule-based, to tokenize the small Bengali corpus in two phases: word tokenization and sentence tokenization. The word tokenizer splits the text into individual Bengali words, while the sentence tokenizer splits it into separate sentences. We then applied the same procedure to the two larger Bengali corpora.
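Sentence tokenization under a rule-based approach can be sketched as follows. This is a minimal illustration, assuming a sentence ends at a danda (।), question mark, or exclamation mark followed by whitespace. The function name `split_sentences` is hypothetical and this is not how BNLP's NLTKTokenizer is implemented.

```python
import re

# Assumption: a sentence boundary is whitespace that directly follows
# one of the terminators ।, ?, or !.
SENTENCE_BOUNDARY = re.compile(r"(?<=[।?!])\s+")


def split_sentences(text):
    """Split text into sentences, keeping the terminator on each sentence."""
    stripped = text.strip()
    if not stripped:
        return []
    return SENTENCE_BOUNDARY.split(stripped)


text = "আমি ভাত খাই। তুমি কি খাও? চলো যাই!"
sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```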
-
Next, we used SentencePieceTokenizer, which learns its tokenization in an unsupervised manner, on two Bengali corpora.
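To show what learning a tokenization without labels can look like, here is a small sketch of byte-pair encoding (BPE), one of the subword algorithms that SentencePiece supports. It is a simplified, pure-Python illustration over a toy word list, not SentencePiece itself and not the model used in this project. The function name `learn_bpe_merges` and the sample words are assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Each word starts as a tuple of single characters (code points).
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


# Toy input, chosen only for illustration.
corpus_words = ["বাংলা", "বাংলায়", "বাংলার", "বাংলাদেশ"]
merges, vocab = learn_bpe_merges(corpus_words, num_merges=5)
print(merges)
```

The frequent shared prefix is merged into larger units first, which is how subword vocabularies emerge from raw text alone.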
-
Next, we used the POS function from BNLP with its pre-trained model, which follows a Conditional Random Field (CRF) based approach, to tag the words of a small Bengali corpus and categorize them into parts of speech.
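A CRF tagger works on per-token feature dictionaries rather than on raw words. The sketch below shows what such features might look like for a tokenized sentence. The feature set and the function name `word_features` are assumptions chosen for illustration and are not taken from BNLP's pre-trained model.

```python
def word_features(sentence, i):
    """Build a feature dictionary for the token at position i."""
    word = sentence[i]
    features = {
        "bias": 1.0,
        "word": word,
        "prefix2": word[:2],    # first two characters
        "suffix2": word[-2:],   # last two characters
        "suffix3": word[-3:],   # last three characters
        "is_digit": word.isdigit(),
        "is_first": i == 0,
        "is_last": i == len(sentence) - 1,
    }
    # Neighbouring words give the CRF local context.
    if i > 0:
        features["prev_word"] = sentence[i - 1]
    if i < len(sentence) - 1:
        features["next_word"] = sentence[i + 1]
    return features


sentence = ["আমি", "ভাত", "খাই", "।"]
sentence_features = [word_features(sentence, i) for i in range(len(sentence))]
print(sentence_features[1])
```

A list of such dictionaries per sentence, paired with a list of gold tags, is the usual training input for CRF sequence labellers.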
-
Next, we embedded the Bengali words of a corpus using BengaliWord2Vector with the pre-trained model from BNLP, a deep learning approach, to obtain each word's vector shape and values.
We also found false positives in the results and computed confusion matrices to obtain precision, recall, and F1 values.
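The precision, recall, and F1 calculation can be sketched as below. This is a minimal illustration, assuming gold and predicted tags are available as two equal-length lists. The function name `per_tag_scores`, the tag names, and the sample data are made up for the example and are not results from this project.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f1)} from aligned tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong: false positive for p
            fn[g] += 1  # gold tag g was missed: false negative for g
    scores = {}
    for tag in sorted(set(gold) | set(predicted)):
        predicted_total = tp[tag] + fp[tag]
        gold_total = tp[tag] + fn[tag]
        precision = tp[tag] / predicted_total if predicted_total else 0.0
        recall = tp[tag] / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Hypothetical tags, for illustration only.
gold = ["NN", "VB", "NN", "JJ", "NN"]
predicted = ["NN", "VB", "JJ", "JJ", "NN"]
scores = per_tag_scores(gold, predicted)
for tag, (precision, recall, f1) in scores.items():
    print(f"{tag}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy data, one `NN` token is mislabelled as `JJ`, so `JJ` gains a false positive (lower precision) and `NN` gains a false negative (lower recall).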
- Jupyter Notebook/Google Colab
- BNLP Library taken from: Prof. Sagor Sarker (Bangladesh) on GitHub.
- Research papers on Bengali POS tagging, taken as references.
- Rajdeep Das (LinkedIn)
- Arghyadeep Banerjee
- Soham Chakraborty
- Tanmay Guchhait
- Debabrata Maity
- Alik Sarkar
- Sanju Manna
Or access it via DOI: http://dx.doi.org/10.13140/RG.2.2.35358.41287/1
Subject: Project Technical Report (Publication no. 359257508)
- https://bnlp.readthedocs.io/en/latest/
- https://github.com/sagorbrur/bnlp
- https://www.researchgate.net/publication/348957805_BNLP_Natural_language_processing_toolkit_for_Bengali_language
- https://medium.com/analytics-vidhya/bengali-pos-part-of-speech-tagging-using-indian-corpus-e85f47d3ad65
- https://nltr.itewb.gov.in/
BNLP Developer Credit: Prof. Sagor Sarker (https://github.com/sagorbrur)
Thank you for visiting.
© Rajdeep Das