Support languages that need a tokenizer (Chinese, Japanese, etc.) #123

eromoe · 2017-02-06T04:37:21Z

I think iepy needs a common interface for plugging in a tokenizer, so that it can support languages such as Chinese and Japanese.

There is an older IE project with a GUI, named GATE. It includes a pre-trained model and a dataset, which may be helpful:
https://gate.ac.uk/sale/tao/splitch15.html#sec:misc-creole:language-plugins:chinese

francolq · 2017-02-07T22:17:03Z

Hello. The preprocessing pipeline can be customized to introduce a different tokenizer. See for instance:

https://github.com/awolfmann/PLN-2015/blob/practico4/information_extraction/resoluciones-unc/bin/preprocess.py
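
For readers who want a concrete picture of what a replacement tokenizer for an unsegmented language might do, here is a minimal sketch of dictionary-based forward maximum matching. It is not taken from iepy or from the linked script: the `VOCAB` set, the `tokenize` function, and the `(offset, token)` output shape are all assumptions made for illustration, and a real setup would more likely call an existing segmenter.

```python
# Hypothetical sketch: nothing here is part of iepy.
# VOCAB is a toy dictionary; a real one would hold many thousands of words.
VOCAB = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}
MAX_WORD_LEN = max(len(word) for word in VOCAB)


def tokenize(text, vocab=VOCAB, max_len=MAX_WORD_LEN):
    """Forward maximum matching: return a list of (offset, token) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1  # whitespace never becomes a token
            continue
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                tokens.append((i, candidate))
                i += size
                break
    return tokens


print(tokenize("我们喜欢自然语言处理"))
```

Keeping the character offset next to each token is a design choice in this sketch: it lets later steps map tokens back to the original text even though no spaces separate them.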

eromoe · 2017-02-08T02:29:00Z

Hello @francolq,
I have seen in the docs how to customise the pipeline:

    pipeline = PreProcessPipeline([
        CustomTokenizer(),
        CustomSentencer(),
        CustomLemmatizer(),
        CustomPOSTagger(),
        CustomNER(),
        CustomSegmenter(),
    ], docs)
    pipeline.process_everything()

Then I looked into the code, and preprocess.tokenizer.TokenizeSentencerRunner does not seem to be used anywhere. I also found that:

a pipeline may have multiple runners
a runner may or may not have a step

As I see it, this is not as simple as just adding a tokenizer, since some runners are related to each other. It is a little hard to customise without knowing the input and output format of each runner and step, and the design principles of the runner API. (Currently I have to read the code and try to understand what it does, but because of my limited knowledge and English, I may get stuck in some places.) I would like to help make iepy compatible with CJK languages if anyone could explain the API principles for writing runners. @machinalis @jmansilla
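
The two observations above (a pipeline holds several runners, and a runner may or may not be tied to a step) can be pictured with a generic sketch. This is only a toy model of the idea, not iepy's code: `Document`, the three runner classes, `run_pipeline`, and the step names are all hypothetical, and iepy's real runners may store and exchange data differently.

```python
# Generic sketch; every name here is hypothetical and none comes from iepy.
SENTENCE_END = "。！？"


class Document:
    def __init__(self, text):
        self.text = text
        self.results = {}  # step name -> output stored by a runner


class CharTokenizerRunner:
    """Runner tied to a single step: one token per non-space character."""
    step = "tokenization"

    def __call__(self, doc):
        doc.results[self.step] = [
            (i, ch) for i, ch in enumerate(doc.text) if not ch.isspace()
        ]


class SentencerRunner:
    """Runner for a later step; it reads what the tokenizer stored."""
    step = "sentencer"

    def __call__(self, doc):
        tokens = doc.results["tokenization"]
        starts = [0]  # token index at which each sentence begins
        for index, (_, token) in enumerate(tokens):
            if token in SENTENCE_END and index + 1 < len(tokens):
                starts.append(index + 1)
        doc.results[self.step] = starts


class CountingRunner:
    """Runner with no step: it stores nothing on the document."""
    step = None

    def __init__(self):
        self.seen = 0

    def __call__(self, doc):
        self.seen += 1


def run_pipeline(runners, docs):
    # Order matters: the sentencer depends on the tokenizer's output.
    for doc in docs:
        for runner in runners:
            runner(doc)


doc = Document("我来了。你好！")
run_pipeline([CharTokenizerRunner(), SentencerRunner(), CountingRunner()], [doc])
print(doc.results["sentencer"])
```

In this toy model the dependency between runners is explicit: swapping in a different tokenizer is safe only if it keeps writing the same kind of output that the sentencer reads, which is the sort of contract the comment above asks to have documented.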

jmansilla · 2017-03-20T15:51:50Z

Sorry for the delay in replying to this thread. Can I still help here, @eromoe?

YanWenqiang · 2017-09-25T08:53:34Z

@eromoe Right now I want to adapt iepy to Chinese. Could you give me a hand?

eromoe · 2017-09-25T09:08:07Z

@YanWenqiang Sorry, I only needed iepy's annotator and object binding. Since it was not easy to integrate Chinese, I have already built my own.

YanWenqiang · 2017-09-25T09:16:40Z

@eromoe All right, thanks a lot. I have run into the same trouble now, and I really need someone who could help me.

hwaking · 2017-12-09T15:15:25Z

@eromoe I am doing information extraction on Chinese EMR data. Can I use iepy to do entity relationship extraction?
