Making JWordSplitter fit for Dutch #22

ghost · 2018-04-17T07:25:33Z

Some rules of Dutch currently make JWordSplitter less suitable for the language. The most difficult part is filtering the detected compounds.
autoonderdeel is not acceptable, even though auto and onderdeel are both valid parts: a split that falls inside a two-letter vowel combination is unacceptable (e.g. a-a, a-e, a-i, a-u, ij, Aa and more). A regexp-like filter could prevent this, and also prevent other boundary mistakes.
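
A minimal sketch of such a regexp-like boundary filter, written in Python for brevity and independent of JWordSplitter's actual API. The name `is_acceptable`, the `|` boundary marker and the pattern itself are assumptions for illustration; the pattern only covers the pairs mentioned above plus the o-o case from autoonderdeel, and would need to be extended.

```python
import re

# Hypothetical pattern: "|" marks a proposed split point. A match means the
# letters on both sides of the split would read as one two-letter vowel.
# Only the pairs named in this issue are covered (a-a, a-e, a-i, a-u, i-j)
# plus o-o from "autoonderdeel"; the list is incomplete.
BOUNDARY_CLASH = re.compile(r"a\|[aeiu]|o\|o|i\|j", re.IGNORECASE)


def is_acceptable(parts):
    """Return True when no split point falls inside a clashing vowel pair."""
    marked = "|".join(parts)
    return BOUNDARY_CLASH.search(marked) is None


def filter_candidates(candidates):
    """Keep only the candidate decompositions that pass the boundary check."""
    return [parts for parts in candidates if is_acceptable(parts)]
```

With this pattern, `is_acceptable(["auto", "onderdeel"])` is rejected because of the o-o boundary, while `is_acceptable(["fiets", "onderdeel"])` passes. Other boundary mistakes could be handled by adding alternatives to the same pattern.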

The second issue is that the joining elements can be s, - and s-, but not every word allows all of them. This could be solved by adding the parts together with their joining variants (~s, ~s-, ~-), but not all of those are allowed at the end of the compound. Some are allowed at the start, some in the middle, some at the end, some everywhere.
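
One way to picture this is a part list in which every form, with or without its joining element, carries the positions where it may appear. The sketch below is in Python and is not based on JWordSplitter's dictionary format; `Pos`, `LEXICON` and `positions_ok` are hypothetical names, and the entries are illustrative rather than verified linguistic data.

```python
from enum import Flag, auto


class Pos(Flag):
    START = auto()
    MIDDLE = auto()
    END = auto()
    ANY = START | MIDDLE | END


# Illustrative entries only: each surface form (bare, or with a joining
# element such as "s") lists the positions in which it is allowed.
LEXICON = {
    "stad": Pos.ANY,
    "stads": Pos.START | Pos.MIDDLE,  # form with joining s: never at the end
    "bus": Pos.ANY,
    "station": Pos.ANY,
}


def position_of(index, count):
    """Map a part's index to its position; the last part counts as END."""
    if index == count - 1:
        return Pos.END
    if index == 0:
        return Pos.START
    return Pos.MIDDLE


def positions_ok(parts):
    """Return True when every part is known and allowed at its position."""
    for i, part in enumerate(parts):
        allowed = LEXICON.get(part.lower())
        if allowed is None or not (allowed & position_of(i, len(parts))):
            return False
    return True
```

Here `positions_ok(["stads", "bus"])` is accepted, while `positions_ok(["bus", "stads"])` is rejected because the form with the joining s is not allowed at the end.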

Even then, checking could be made stricter by giving the parts flags (POS tags?) and filtering on valid orders of tags. Since there are exceptions, an exception list could be added as well.
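
A rough sketch of how tag-order filtering and an exception list could work together, again in Python and not tied to JWordSplitter's API. The tag names, the allowed orders and all entries in `PART_TAGS`, `ALWAYS_ACCEPT` and `ALWAYS_REJECT` are placeholders chosen only to exercise the code.

```python
# Hypothetical tags per part and hypothetical set of valid tag orders.
PART_TAGS = {"snel": "ADJ", "weg": "N", "fiets": "N", "onderdeel": "N"}
ALLOWED_TAG_ORDERS = {("N", "N"), ("ADJ", "N")}

# Exception lists override the tag check; these entries are placeholders.
ALWAYS_ACCEPT = {"wegsnel"}
ALWAYS_REJECT = {"fietsweg"}


def accept(parts):
    """Check exceptions first, then fall back to the tag-order filter."""
    word = "".join(parts).lower()
    if word in ALWAYS_REJECT:
        return False
    if word in ALWAYS_ACCEPT:
        return True
    # Unknown parts get the tag None, which never matches an allowed order.
    tags = tuple(PART_TAGS.get(part.lower()) for part in parts)
    return tags in ALLOWED_TAG_ORDERS
```

Checking the exception lists before the tags keeps the rules simple: the tag orders describe the general case, and individual words that break them are listed explicitly.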
