A probabalistic approach to solving the problem of agglutination which exists in indic languages. Implementation here applies for Malayalam, although codes used are mostly language agnostic.
- First clone the repository
git clone https://github.com/libindic/sandhi-splitter.git
- Create a installable source and then install using pip
python setup.py sdist
pip install dist/sandhisplitter*.tar.gz
Note: We suggest you work on virtualenv instead of installing
system-wide using sudo
, since module is still under development.
After installation, with necessary arguments, use
sandhisplitter_train [--help] [args]
sandhisplitter_benchmark_model [--help] [args]
For more details, refer to docs/index.rst
Sandhisplitter
class provides two main functions, split
and join
.
>>> from sandhisplitter import Sandhisplitter
>>> s = Sandhisplitter()
>>> s.split('ആദ്യമെത്തി')
(['ആദ്യം', 'എത്തി'], [4])
>>> s.split('വയ്യാതെയായി')
(['വയ്യാതെ', 'ആയി'], [7])
>>> s.split('എന്നെക്കൊണ്ടുവയ്യ')
(['എന്നെക്കൊണ്ടുവയ്യ'], [])
>>> s.split('ഇന്നത്തെക്കാലത്ത്')
(['ഇന്നത്തെക്കാലത്ത്'], [])
>>> s.split('എന്തൊക്കെയോ')
(['എന്ത്', 'ഒക്കെയോ'], [3])
>>> s.join(['ആദ്യം', 'ആയി'])
'ആദ്യമായി'