An n-gram generator for Indic languages.
An n-gram model is a probabilistic model that predicts the next item in a sequence from the items that precede it. N-grams are used in many areas of statistical natural language processing and in genetic sequence analysis.
An n-gram is a subsequence of n items from a given sequence. The items in question can be phonemes, syllables, letters, words or base pairs according to the application.
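The definition above can be illustrated with a sliding window over any sequence. This is a minimal sketch, not the library's implementation: `sliding_ngrams` is a hypothetical helper name, and it treats each Python character (code point) or list element as one item.

```python
from typing import List, Sequence


def sliding_ngrams(items: Sequence, n: int = 2) -> List[tuple]:
    """Return every contiguous run of n items, in order (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length k yields k - n + 1 windows (none if k < n).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Letters as items: join each window back into a string.
letter_trigrams = ["".join(g) for g in sliding_ngrams("Languages", 3)]
# Words as items: split the text first, then slide over the word list.
word_bigrams = sliding_ngrams("the quick brown fox".split(), 2)

print(letter_trigrams)
print(word_bigrams)
```

The same function covers letters, words, or base pairs because only the choice of items changes, not the windowing.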
An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "n-gram".
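To show how n-grams support next-item prediction, here is a rough sketch of a bigram model built from raw counts. The function names and the toy sentence are assumptions made for illustration; this library generates n-grams and is not stated to provide a predictor like this.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each token is followed by each other token."""
    followers = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(followers, token):
    """Return the most frequent follower of token, or None if it was never seen with one."""
    if token not in followers:
        return None
    return followers[token].most_common(1)[0][0]


# Toy training data (illustrative only).
tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram_counts(tokens)

prediction = predict_next(model, "the")
print(prediction)
```

Here "the" is followed by "cat" twice and "mat" once, so the sketch picks "cat". A practical model would normalize the counts into probabilities and smooth them for unseen pairs.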
- Clone the repository
git clone https://github.com/libindic/indicngram.git
- Change to the cloned directory
cd indicngram
- Run setup.py to build an installable source distribution
python setup.py sdist
- Install using pip
pip install dist/libindic-ngram*.tar.gz
Input parameters: the text and the n-gram size N (default: 2)
Output: List of grams
>>> from libindic.ngram import Ngram
>>> ngram_generator = Ngram()
>>> ngram_generator.letterNgram(<text>, <window size>)
>>> from libindic.ngram import Ngram
>>> ngram_generator = Ngram()
>>> text = "Languages"
>>> grams = ngram_generator.letterNgram(text, 3)
>>> print(grams)
['Lan', 'ang', 'ngu', 'gua', 'uag', 'age', 'ges']
>>> for gram in grams:
...     print("".join(gram))
Lan
ang
ngu
gua
uag
age
ges
Run the tests with: python setup.py test
Read the docs for more.