German Stemmer - improvements when possible #200

Open
rfool opened this issue Sep 29, 2024 · 5 comments

@rfool

rfool commented Sep 29, 2024

The following words should (ideally, if possible) produce the same "stemmed word" within each group:

schließen -> schliess
schließt -> schliesst
schließend -> schliessend

holen -> hol
holt -> holt

vorbereiten -> vorbereit
vorbereitet -> vorbereitet
vorbereitend -> vorbereit

schenken -> schenk
schenkt -> schenkt
schenkte -> schenkt
schenkten -> schenkt
schenkend -> schenkend
schenkender -> schenkend
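
To make the mismatch easier to see, here is a minimal sketch that counts the distinct stems per group. The word/stem pairs are copied from the list above and are assumed to be the stemmer's current output; nothing is recomputed here, and the names `GROUPS`, `distinct_stems` and `unconflated` are only illustrative.

```python
# Word -> stem pairs copied from the report above (assumed to be the
# current stemmer output; this sketch does not run a stemmer).
GROUPS = {
    "schließen": {
        "schließen": "schliess",
        "schließt": "schliesst",
        "schließend": "schliessend",
    },
    "holen": {"holen": "hol", "holt": "holt"},
    "vorbereiten": {
        "vorbereiten": "vorbereit",
        "vorbereitet": "vorbereitet",
        "vorbereitend": "vorbereit",
    },
    "schenken": {
        "schenken": "schenk",
        "schenkt": "schenkt",
        "schenkte": "schenkt",
        "schenkten": "schenkt",
        "schenkend": "schenkend",
        "schenkender": "schenkend",
    },
}


def distinct_stems(pairs):
    """Return the sorted list of different stems within one group."""
    return sorted(set(pairs.values()))


def unconflated(groups):
    """Return only the groups whose words do not all share one stem."""
    return {
        verb: distinct_stems(pairs)
        for verb, pairs in groups.items()
        if len(set(pairs.values())) > 1
    }


if __name__ == "__main__":
    for verb, stems in unconflated(GROUPS).items():
        print(f"{verb}: {len(stems)} stems: {', '.join(stems)}")
```

Ideally each group would collapse to a single stem; with the pairs as listed, every group ends up with two or three.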

@rfool
Author

rfool commented Sep 29, 2024

Related to #161, but this is about different rules

@ojwb
Member

ojwb commented Oct 2, 2024

Removing -t and -d was also raised in #139.

Most of the cases here seem to boil down to not removing -t and -d (if we could remove -d at the right point, then e.g. schenkender -> schenkend -> schenken -> schenk) but there's also removing -et from vorbereitet (I think that's actually a special case of a -t suffix being -et when the verb stem ends in a t, but for the algorithm it's perhaps best handled as something like -tet -> -t).
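
As a rough illustration of the chain described above, here is a toy sketch. It is not the Snowball German algorithm: it ignores the R1/R2 regions and every other existing rule, and the function names `strip_chain` and `tet_to_t` are made up for this example.

```python
def strip_chain(word):
    """Toy only: remove -er, then the -d of -end, then -en, in that order.

    Returns every intermediate form so the chain is visible.
    No guard conditions are applied (an assumption made for brevity).
    """
    steps = [word]
    if word.endswith("er"):
        word = word[:-2]
        steps.append(word)
    if word.endswith("end"):
        word = word[:-1]  # drop only the final d
        steps.append(word)
    if word.endswith("en"):
        word = word[:-2]
        steps.append(word)
    return steps


def tet_to_t(word):
    """Toy only: rewrite a final -tet as -t (e.g. for vorbereitet)."""
    if word.endswith("tet"):
        return word[:-2]
    return word


if __name__ == "__main__":
    print(" -> ".join(strip_chain("schenkender")))
    print(tet_to_t("vorbereitet"))
```

With these toy rules `schenkender` goes through `schenkend` and `schenken` to `schenk`, and `vorbereitet` becomes `vorbereit`. As written, nothing guards the rules, so they would fire on any word with these endings.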

I suspect it wasn't added to the algorithm originally because it's hard to avoid removing them in cases where it's harmful (because some words end in t or d but it's not one of these suffixes). Sometimes a condition on what comes before can avoid this (possibly deliberately not handling the suffix in every case to get most of the benefit without the downsides).

It'd be good to investigate - if you have any useful insights into how to distinguish -t and -d suffixes from words that just happen to end in t or d, please share.

If it really isn't practical to remove these then it'd be good to document that. The older algorithm descriptions tend to just describe the mechanics of the algorithm without giving much (if any) background as to why choices were made.

@rfool
Author

rfool commented Oct 6, 2024

Oh well, it's hard.

My best idea was to consider character pairs like kt, lt, ßt instead of single t or d.

But that doesn't lead to safe general rules either. The German language probably needs some more reforms before stemming like in English could work. Probably never.

@ojwb
Member

ojwb commented Oct 6, 2024

> My best idea was to consider character pairs like kt, lt, ßt instead of single t or d.
>
> But that doesn't lead to safe general rules either.

Restricting removal based on what's before the suffix is quite a common solution (and removing a suffix in a subset of cases can still be worthwhile). I'll take a deeper look, and document if it doesn't seem solvable.
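
To make that concrete, here is a minimal sketch of a conditioned -t removal. The set of allowed preceding letters is taken from the kt/lt/ßt pairs suggested in this thread; the minimum stem length and the names `ALLOWED_BEFORE_T`, `MIN_STEM_LENGTH` and `strip_final_t` are assumptions for illustration, not anything the current algorithm does.

```python
# Letters that must precede a final -t for it to be removed.
# Taken from the kt / lt / ßt pairs suggested in this thread.
ALLOWED_BEFORE_T = {"k", "l", "ß"}

# Assumed guard so that very short words are left alone.
MIN_STEM_LENGTH = 3


def strip_final_t(word):
    """Remove a final -t only when the preceding letter is allowed.

    Expects a lowercased word. The real stemmer rewrites ß as ss before
    suffix removal; this toy works on the raw spelling instead.
    """
    if (
        len(word) - 1 >= MIN_STEM_LENGTH
        and word.endswith("t")
        and word[-2] in ALLOWED_BEFORE_T
    ):
        return word[:-1]
    return word


if __name__ == "__main__":
    for w in ["schenkt", "holt", "schließt", "gut", "alt", "welt", "markt"]:
        print(w, "->", strip_final_t(w))
```

This handles `schenkt`, `holt` and `schließt`, and leaves `gut` and `alt` alone, but it still turns `welt` into `wel` and `markt` into `mark`, which collides with the unrelated word Mark. So a condition like this narrows the damage without removing it; whether the trade-off is worthwhile would need checking against a real vocabulary.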

> The German language probably needs some more reforms before stemming like in English could work. Probably never.

I wouldn't say English is particularly easy to stem - it doesn't have as many inflected forms as some other languages, but it has a large vocabulary much of which has been taken from multiple other languages, so there's rather a lot of irregularity to deal with. There are definitely endings in English we don't try to deal with either (e.g. see #172).

A stemmer can be useful without handling every possible word perfectly though, and overstemming tends to be more problematic because it can result in a search term matching an unrelated word.

@rfool
Author

rfool commented Oct 8, 2024

> I wouldn't say English is particularly easy to stem ... There are definitely endings in English we don't try to deal with either (e.g. see #172).

Oh my gosh, you are right.

In my naive mind, English was still the English in which, 20 years ago or so, I could pluralize and singularize words with just a bunch of simple rules (and yes, categories became categorie at that time for me 🙄). Hey, but it did the job, especially in software development - for skeletons, code generation, mapping, ORM, data modeling, all that stuff - simple English is just perfect for such usage.
