Making JWordSplitter fit for Dutch #22

Open
ghost opened this issue Apr 17, 2018 · 0 comments

Some rules of Dutch currently make JWordSplitter less suitable for Dutch. The most difficult part is filtering the detected compounds.
autoonderdeel is not acceptable, even though auto and onderdeel are both valid parts: a split is unacceptable when it falls inside a letter pair that reads as a single two-letter vowel (e.g. a-a, a-e, a-i, a-u, ij, Aa and more). A regexp-like filter could prevent this, and could also prevent other boundary mistakes.
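
To make the boundary filter idea concrete, here is a minimal sketch in Java. It is not part of JWordSplitter: the class name `DutchBoundaryFilter`, the method `isAcceptableBoundary`, and the list of letter pairs are all assumptions for illustration, and the list is certainly incomplete.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class DutchBoundaryFilter {
    // Assumed, incomplete list of letter pairs that read as one vowel in Dutch.
    private static final Pattern VOWEL_CLASH = Pattern.compile(
            "aa|ae|ai|au|ee|ei|eu|ie|ij|oe|oi|oo|ou|ui|uu");

    /** Returns false when joining left and right directly would create a clashing pair. */
    static boolean isAcceptableBoundary(String left, String right) {
        if (left.isEmpty() || right.isEmpty()) {
            return false; // an empty part is never a valid split
        }
        // Last letter of the left part plus first letter of the right part.
        String pair = (left.substring(left.length() - 1) + right.substring(0, 1))
                .toLowerCase(Locale.ROOT);
        return !VOWEL_CLASH.matcher(pair).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptableBoundary("auto", "onderdeel")); // o + o clashes
        System.out.println(isAcceptableBoundary("fiets", "bel"));      // s + b does not
    }
}
```

Such a check would run on every candidate split point of an unhyphenated input before the split is accepted. Other boundary mistakes could be covered by extending the pattern.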

The second issue is that the joining elements can be s, - and s-, but not every word allows all of them. This could be solved by adding each part together with its variants (~s, ~s-, ~-), but not all of those are allowed at the end of a compound. Some are allowed at the start, some in the middle, some at the end, and some everywhere.
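
One possible way to model this, sketched in Java under stated assumptions: each lexicon entry records which joining elements may follow the part and at which positions in a compound the part may stand. The names `LinkingRules`, `Entry`, `Position`, `isValid` and `demoLexicon` are hypothetical and do not exist in JWordSplitter, and the demo data is a toy, not a linguistic claim.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LinkingRules {
    enum Position { START, MIDDLE, END }

    /** Hypothetical lexicon entry; not part of JWordSplitter. */
    static final class Entry {
        final Set<String> links;            // joining elements allowed after this part
        final EnumSet<Position> positions;  // where in a compound the part may stand

        Entry(Set<String> links, EnumSet<Position> positions) {
            this.links = links;
            this.positions = positions;
        }
    }

    /** parts has one more element than links; links.get(i) joins parts i and i + 1. */
    static boolean isValid(Map<String, Entry> lexicon, List<String> parts, List<String> links) {
        if (parts.size() < 2 || links.size() != parts.size() - 1) {
            return false;
        }
        int last = parts.size() - 1;
        for (int i = 0; i <= last; i++) {
            Entry entry = lexicon.get(parts.get(i));
            if (entry == null) {
                return false; // unknown part
            }
            Position pos = i == 0 ? Position.START
                    : i == last ? Position.END : Position.MIDDLE;
            if (!entry.positions.contains(pos)) {
                return false; // part not allowed at this position
            }
            if (i < last && !entry.links.contains(links.get(i))) {
                return false; // joining element not allowed after this part
            }
        }
        return true;
    }

    /** Toy data; the position restriction on "bus" is invented for the demo. */
    static Map<String, Entry> demoLexicon() {
        Map<String, Entry> lexicon = new HashMap<>();
        lexicon.put("stad", new Entry(Set.of("s"), EnumSet.allOf(Position.class)));
        lexicon.put("fiets", new Entry(Set.of(""), EnumSet.allOf(Position.class)));
        lexicon.put("bus", new Entry(Set.of(""), EnumSet.of(Position.MIDDLE, Position.END)));
        return lexicon;
    }
}
```

With the toy lexicon, `isValid(demoLexicon(), List.of("stad", "bus"), List.of("s"))` accepts stad + s + bus, while the same parts joined with an empty element are rejected. Storing the allowed elements per part avoids listing every ~s, ~s- and ~- variant as a separate dictionary word.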

Even then, checking could be stricter: the parts could carry flags (POS tags?), and a filter could accept only valid orders of tags. Since there are exceptions, one could also add an exception list.
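
A rough Java sketch of how tag-order filtering and an exception list might fit together. Everything here is an assumption for illustration: the `TagOrderFilter` class, the `Tag` values, and the choice to check adjacent tag pairs rather than whole sequences. None of it is taken from JWordSplitter.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class TagOrderFilter {
    // Hypothetical tag set; a real one would come from the dictionary.
    enum Tag { NOUN, VERB, ADJ }

    /**
     * Accepts a split when every adjacent tag pair is allowed, unless the
     * exception list forces a decision for the whole word.
     */
    static boolean accept(String word, List<Tag> tags,
                          Set<List<Tag>> allowedPairs, Map<String, Boolean> exceptions) {
        Boolean forced = exceptions.get(word);
        if (forced != null) {
            return forced; // the exception list wins over the tag rules
        }
        for (int i = 0; i + 1 < tags.size(); i++) {
            if (!allowedPairs.contains(List.of(tags.get(i), tags.get(i + 1)))) {
                return false; // this order of tags is not allowed
            }
        }
        return tags.size() >= 2; // a single part is not a compound
    }
}
```

The exception map can force a word to be accepted or rejected regardless of its tags, which keeps the tag rules themselves small.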
