Sastrawi is a simple PHP library that stems words in the Indonesian language (Bahasa Indonesia). Despite its simplicity, the library is designed to be high quality and well documented. For more information in English, see README.
Indonesia is the fourth most populous country in the world. Indonesian internet users use Indonesian as their primary language, so developers need tools to improve text search in Indonesian. One of the most important of these tools is a stemmer.
Stemming is the process of reducing morphological variants of a word to a common stem form. Research has shown that stemming is language-dependent.
Suppose a blog post contains the following sentence:
Rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya.
The query below returns no rows.
SELECT * FROM posts WHERE content LIKE '%suara%'
Even a fuzzy full-text search tool needs a stemmer to improve its results. A possible improvement would be to index the reduced version of the sentence, as follows:
rakyat penuh halaman gedung suara isi hati
We would also reduce the search keyword:
Bersuara => suara
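The idea above (stem the content at index time, stem the keyword at query time, then compare stems) can be sketched in a few lines of PHP. This is only an illustration under stated assumptions: `$toyStem` is a hypothetical stand-in backed by a hard-coded lookup table for the words in this example, not Sastrawi's algorithm, and `matches()` is a helper name invented here. In a real project the callable would wrap a Sastrawi stemmer instead.

```php
<?php
// Hypothetical stand-in stemmer: a hard-coded lookup table that only
// knows the words from the example sentence. It is NOT how Sastrawi works.
$toyStem = function (string $text): string {
    $map = [
        'memenuhi'    => 'penuh',
        'menyuarakan' => 'suara',
        'hatinya'     => 'hati',
        'bersuara'    => 'suara',
    ];
    // Lowercase, replace anything that is not a-z or whitespace, split into words.
    $clean = preg_replace('/[^a-z\s]/', ' ', strtolower($text));
    $words = preg_split('/\s+/', trim($clean));
    // Replace each known word with its stem; leave unknown words as they are.
    $stems = array_map(fn($w) => $map[$w] ?? $w, $words);
    return implode(' ', $stems);
};

// Hypothetical helper: true if the stemmed keyword appears among the
// stemmed words of the content.
function matches(callable $stem, string $content, string $keyword): bool
{
    $indexed = explode(' ', $stem($content));
    return in_array($stem($keyword), $indexed, true);
}

$content = 'Rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya.';

// The raw substring search fails, just like the LIKE query above.
var_dump(strpos(strtolower($content), 'suara') !== false); // bool(false)

// Comparing stems finds the post.
var_dump(matches($toyStem, $content, 'Bersuara')); // bool(true)
```

Passing the stemmer in as a callable keeps the matching logic independent of the stemmer implementation, so the toy table can be swapped for a real stemmer without touching `matches()`.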
- Have a high quality PHP library to ease stemming Indonesian words.
- Integrate well with other packages / frameworks.
- Have a simple and easy-to-use API.
Sastrawi can be installed with Composer. Add sastrawi to your composer.json:
{
"require": {
"sastrawi/sastrawi": "*"
}
}
Then run composer install or composer update from the command line.
Copy the following code into a PHP file in your project directory, then run it from the command line.
<?php
// demo.php
// include composer autoloader
require_once __DIR__ . '/vendor/autoload.php';
// create stemmer
$stemmerFactory = new \Sastrawi\Stemmer\StemmerFactory();
$stemmer = $stemmerFactory->createStemmer();
// stem
$sentence = 'Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan';
$output = $stemmer->stem($sentence);
echo $output . "\n";
// will print:
// ekonomi indonesia sedang dalam tumbuh yang bangga
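When several texts need stemming, it may be convenient to create the stemmer once and reuse it. The sketch below assumes only the `StemmerFactory`, `createStemmer()` and `stem()` calls shown in the demo above; `stemAll()` is a hypothetical helper name introduced here, and the Sastrawi part runs only if Composer's autoloader is present next to the script.

```php
<?php
// Hypothetical helper: stem every text in an array with one stemmer
// instance, preserving the array keys. It accepts any object that
// exposes a stem(string) method.
function stemAll(object $stemmer, array $texts): array
{
    $stemmed = [];
    foreach ($texts as $key => $text) {
        $stemmed[$key] = $stemmer->stem($text);
    }
    return $stemmed;
}

// Use Sastrawi only when it has been installed with Composer.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;

    // Create the stemmer once, then reuse it for every text.
    $stemmerFactory = new \Sastrawi\Stemmer\StemmerFactory();
    $stemmer = $stemmerFactory->createStemmer();

    $posts = [
        'post-1' => 'Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan',
        'post-2' => 'Rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya.',
    ];

    print_r(stemAll($stemmer, $posts));
}
```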
The MIT License (MIT). Please see License File for more information.
Algorithms and trademarks used in this library are the property of their respective owners.
- Nazief and Adriani algorithm
- Asian J. 2007. Effective Techniques for Indonesian Text Retrieval. PhD thesis School of Computer Science and Information Technology RMIT University Australia
- Arifin, A.Z., I.P.A.K. Mahendra and H.T. Ciptaningtyas. 2009. Enhanced Confix Stripping Stemmer and Ants Algorithm for Classifying News Document in Indonesian Language, Proceeding of International Conference on Information & Communication Technology and Systems (ICTS)
- A. D. Tahitoe, D. Purwitasari. 2010. Implementasi Modifikasi Enhanced Confix Stripping Stemmer Untuk Bahasa Indonesia dengan Metode Corpus Based Stemming, Institut Teknologi Sepuluh Nopember (ITS) – Surabaya, 60111, Indonesia
Sastrawi relies heavily on a root word dictionary. It is based on kateglo.com with some modifications.
Sastrawi is released under The MIT License (MIT) while Kateglo's root word dictionary is under CC-BY-NC-SA 3.0. For more information please see Sastrawi License File and Kateglo's content license.