add rule json string interface

dotSlashLu committed Feb 3, 2015
1 parent 5fc6d75 commit e07f8ad

Showing 8 changed files with 346 additions and 95 deletions.
200 changes: 200 additions & 0 deletions README.html
@@ -0,0 +1,200 @@
<h1 id="nodescws">nodescws</h1>



<h3 id="scws">scws</h3>



<h5 id="about">About</h5>

<p>scws (Simple Chinese Word Segmentation) is a mechanical Chinese word segmentation engine written in C, based on a word-frequency dictionary. The author of the library is <a href="http://www.hightman.cn">hightman</a>, and scws is published under the BSD license. As the author of nodescws, I only added some features to libscws and wrapped this great library as a Node.js addon; I hold no copyright over any of the library’s code except my own work.</p>

<p>scws home page: <a href="http://www.xunsearch.com/scws/">http://www.xunsearch.com/scws</a>, GitHub: <a href="https://github.com/hightman/scws">https://github.com/hightman/scws</a></p>

<h5 id="performance">Performance</h5>

<p>On a FreeBSD 6.2 server with a single-core 3.0 GHz Xeon CPU, segmenting a text of length 80,535 with the bundled command-line tool took about 0.17 seconds, with a segmentation precision of 95.60% and a recall of 90.51% (F-1: 0.93).</p>

<hr>

<h2 id="nodescws-1">nodescws</h2>

<p>Current release: v0.2.1 (versions lower than v0.2.0 are no longer maintained. See Changelog)</p>

<ul>
<li>Project home: <a href="https://github.com/dotSlashLu/nodescws">https://github.com/dotSlashLu/nodescws</a></li>
<li>Questions and bug reports: <a href="https://github.com/dotSlashLu/nodescws/issues">https://github.com/dotSlashLu/nodescws/issues</a></li>
</ul>



<h3 id="install">Install</h3>

<p><code>npm install scws</code></p>



<h3 id="usage">Usage</h3>

<pre><code>var Scws = require("scws");
var scws = new Scws.init(settings);
var results = scws.segment(text);
scws.destroy(); // DO NOT forget this, or the memory scws allocated will leak
</code></pre>



<h4 id="new-scwsinitsettings">new Scws.init(settings)</h4>

<ul>
<li><p>settings: <code>Object</code>, segmentation settings; supported entries: charset, dicts, rule, ignorePunct, multi, debug and applyStopWord</p>

<ul><li><p>charset: <code>String</code>, <em>Optional</em></p>

<pre><code>The encoding to use. Supported values: "utf8", "gbk". Defaults to "utf8".
</code></pre></li>
<li><p>dicts: <code>String</code>, <em>Required</em></p>

<pre><code>Filename(s) of the dictionary files to use; separate multiple files with ':'.
Both the xdb and txt formats are supported; name custom dictionaries with a ".txt" suffix.
Example: "./dicts/dict.utf8.xdb:./dicts/dict_cht.utf8.xdb:./dicts/dict.test.txt"
The xdb dictionaries bundled with scws sit in ./dicts/ under this extension's
directory (usually node_modules/scws/), in both Simplified and Traditional Chinese.
If this entry is missing, the bundled utf8 Simplified Chinese dictionary is used by default.
</code></pre></li>
<li><p>rule: <code>String</code>, <em>Optional</em></p>

<pre><code>The rule file to use. It configures place names, person names, stop words, etc.
for the corresponding encoding; see rules/rules.utf8.ini under this extension's
directory (usually node_modules/scws/) for details.
If this entry is missing, the bundled utf8 rule file is used by default.

v0.2.3 added JSON support to avoid the verbose ini syntax.
If the value ends in .json, the corresponding JSON rule file is parsed;
a JSON string can also be passed directly (see the sketch after this list).
</code></pre></li>
<li><p>ignorePunct: <code>Bool</code>, <em>Optional</em></p>

<pre><code>Whether to ignore punctuation.
</code></pre></li>
<li><p>multi: <code>String</code>, <em>Optional</em></p>

<pre><code>Whether to compound-split long words; for example, the word 中国人 yields
the multiple results 中国人, 中国 and 人. Possible values: "short", "duality", "zmain", "zall":
short: short words
duality: combine adjacent single characters into pairs
zmain: important single characters
zall: all single characters
</code></pre></li>
<li><p>debug: <code>Bool</code>, <em>Optional</em></p>

<pre><code>Whether to run in debug mode. If true, scws writes its logs, warnings and errors to stdout. Defaults to false.
</code></pre></li>
<li><p>applyStopWord: <code>Bool</code>, <em>Optional</em></p>

<pre><code>Whether to apply the stop words defined in the [nostats] section of the rule file. Defaults to true.
</code></pre></li></ul></li>
</ul>
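
<p>The exact JSON rule schema is not documented here, so the sketch below is an assumption: the <code>nostats</code> key mirrors the <code>[nostats]</code> section of rules/rules.utf8.ini, and the stop words in it are made-up placeholders. It only illustrates the two ways v0.2.3 accepts a JSON rule: a path ending in ".json", or a JSON string passed directly.</p>

<pre><code>var Scws = require("scws");

// hypothetical schema: a "nostats" key standing in for the ini [nostats] section
var rule = JSON.stringify({
    nostats: ["的", "了"] // placeholder stop words
});

var scws = new Scws.init({
    dicts: "./dicts/dict.utf8.xdb",
    rule: rule // or a path such as "./rules/rules.utf8.json"
});
var res = scws.segment("大家好我来自德国");
console.log(res);
scws.destroy();
</code></pre>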

<h4 id="scwssegmenttext">scws.segment(text)</h4>

<ul>
<li>text: <code>String</code>, the string to segment</li>
</ul>

<p>Returns <code>Array</code>:</p>

<pre><code>[
    {
        word: '可读性',
        offset: 183, // byte offset of the word in the document
        length: 9,   // length in bytes
        attr: 'n',   // part-of-speech tag, following the standard 《现代汉语语料库加工规范——词语切分与词性标注》; see http://blog.csdn.net/dbigbear/article/details/1488087 for the tag meanings
        idf: 7.800000190734863
    },
    ...
]
</code></pre>
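
<p>As a short illustration of working with the returned array (the part-of-speech filter and the sample sentence below are my own, not from the library docs), here is a sketch that keeps only the entries whose <code>attr</code> is 'n', i.e. the nouns:</p>

<pre><code>var Scws = require("scws");
var scws = new Scws.init({ dicts: "./dicts/dict.utf8.xdb" });

var results = scws.segment("大家好我来自德国,我是德国人");

// keep the words tagged as nouns ('n' in the tag set referenced above)
var nouns = results
    .filter(function (r) { return r.attr === "n"; })
    .map(function (r) { return r.word; });

console.log(nouns);
scws.destroy();
</code></pre>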



<h3 id="example-用例">Example 用例</h3>

<pre><code>var fs = require("fs")
Scws = require("scws");

fs.readFile("./test_doc.txt", {
encoding: "utf8"
}, function(err, data){
if (err)
return console.error(err);

// initialize scws with config entries
var scws = new Scws.init({
charset: "utf8",
//dicts: "./dicts/dict.utf8.xdb:./dicts/dict_cht.utf8.xdb:./dicts/dict.test.txt",
dicts: "./dicts/dict.utf8.xdb",
rule: "./rules/rules.utf8.ini",
ignorePunct: true,
multi: "duality",
debug: true
});

// segment text
res = scws.segment(data);
res1 = scws.segment("大家好我来自德国,我是德国人");

console.log(res);
console.log("test reuse of scws: ", res1);

// destroy scws, recollect memory
scws.destroy();
})
</code></pre>



<h3 id="changelog">Changelog</h3>



<h4 id="v023">v0.2.3</h4>

<ul>
<li>Changed project structure</li>
<li>Refactored node bindings</li>
<li>Added rule configuration via a JSON file or a JSON string, making it easier to add stop words from Node</li>
</ul>

<h4 id="v022">v0.2.2</h4>

<ul>
<li>Some small bug fixes, including issue #5 (thanks to @Frully)</li>
</ul>



<h4 id="v021">v0.2.1</h4>

<ul>
<li>Added stop word support</li>
<li>Removed line endings when <code>ignorePunct</code> is set to true</li>
</ul>

<p>You can add your own stop words under the <code>[nostats]</code> entry in the rule file. Turn the stop word feature off by setting <code>applyStopWord</code> to false.</p>
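
<p>A minimal sketch of turning the feature off (the dictionary and rule paths are the bundled defaults mentioned above):</p>

<pre><code>var Scws = require("scws");

var scws = new Scws.init({
    dicts: "./dicts/dict.utf8.xdb",
    rule: "./rules/rules.utf8.ini",
    applyStopWord: false // words listed under [nostats] are no longer dropped
});
var res = scws.segment("大家好我来自德国,我是德国人");
scws.destroy();
</code></pre>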



<h4 id="v020">v0.2.0</h4>

<p>New syntax to initialize scws: <code>scws = new Scws(config); result = scws.segment(text); scws.destroy()</code>. This makes the scws instance reusable, giving a large performance improvement (approximately 1/4 faster) when it is used repeatedly.</p>

<p>Added a new setting entry <code>debug</code>. Setting <code>config.debug = true</code> makes scws write its logs, errors and warnings to stdout.</p>



<h4 id="v013">v0.1.3</h4>

<p>Published to the npm registry. Usage: <code>scws(text, settings);</code>. Available setting entries: charset, dicts, rule, ignorePunct, multi.</p>