German Stemmer - improvements when possible #200

rfool · 2024-09-29T11:23:38Z

Within each group, the following words should (ideally, if possible) produce the same „stemmed word“. The current output is shown after each arrow:

schließen -> schliess
schließt -> schliesst
schließend -> schliessend

holen -> hol
holt -> holt

vorbereiten -> vorbereit
vorbereitet -> vorbereitet
vorbereitend -> vorbereit

schenken -> schenk
schenkt -> schenkt
schenkte -> schenkt
schenkten -> schenkt
schenkend -> schenkend
schenkender -> schenkend
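A small harness can make the request checkable. The sketch below hard-codes the outputs listed above and reports every group that does not collapse to a single stem. The names `REPORTED`, `GROUPS` and `unconflated_groups` are made up for this illustration and are not part of Snowball.

```python
# Outputs exactly as listed in this issue (input word -> reported stem).
REPORTED = {
    "schließen": "schliess",
    "schließt": "schliesst",
    "schließend": "schliessend",
    "holen": "hol",
    "holt": "holt",
    "vorbereiten": "vorbereit",
    "vorbereitet": "vorbereitet",
    "vorbereitend": "vorbereit",
    "schenken": "schenk",
    "schenkt": "schenkt",
    "schenkte": "schenkt",
    "schenkten": "schenkt",
    "schenkend": "schenkend",
    "schenkender": "schenkend",
}

# Words that should ideally share one stem, grouped as in the issue.
GROUPS = [
    ["schließen", "schließt", "schließend"],
    ["holen", "holt"],
    ["vorbereiten", "vorbereitet", "vorbereitend"],
    ["schenken", "schenkt", "schenkte", "schenkten", "schenkend", "schenkender"],
]


def unconflated_groups(stem, groups):
    """Return (group, distinct stems) for each group that yields more than one stem."""
    result = []
    for group in groups:
        stems = sorted({stem(word) for word in group})
        if len(stems) > 1:
            result.append((group, stems))
    return result


# Use the reported outputs as a stand-in for the stemmer.
for group, stems in unconflated_groups(REPORTED.__getitem__, GROUPS):
    print(group[0], "->", stems)
```

Passing a real stemming function in place of `REPORTED.__getitem__` would show which groups a candidate rule change actually fixes.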

rfool · 2024-09-29T11:26:13Z

Related to #161, but this is about different rules.

ojwb · 2024-10-02T22:16:19Z

Removing -t and -d was also raised in #139.

Most of the cases here seem to boil down to not removing -t and -d (if we could remove -d at the right point, then e.g. schenkender -> schenkend -> schenken -> schenk) but there's also removing -et from vorbereitet (I think that's actually a special case of a -t suffix being -et when the verb stem ends in a t, but for the algorithm it's perhaps best handled as something like -tet -> -t).
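To make that chain concrete, here is a toy sketch. It is not the Snowball German algorithm: `strip_suffixes` is a hypothetical function, it assumes lowercased input, and it deliberately ignores every condition a real stemmer would put on where a suffix may be removed.

```python
def strip_suffixes(word: str) -> str:
    """Toy illustration only; not the Snowball German stemmer."""
    # Adjectival -er after a participle: schenkender -> schenkend
    if word.endswith("ender"):
        word = word[:-2]
    # Participle -d after "en": schenkend -> schenken
    if word.endswith("end"):
        word = word[:-1]
    # Ending -en: schenken -> schenk
    if word.endswith("en"):
        word = word[:-2]
    # Treat -tet as -t: vorbereitet -> vorbereit
    elif word.endswith("tet"):
        word = word[:-2]
    return word


for w in ["schenkender", "schenkend", "schenken",
          "vorbereiten", "vorbereitet", "vorbereitend", "abend"]:
    print(w, "->", strip_suffixes(w))
```

With these unconditional rules the schenk- and vorbereit- forms collapse as wanted, but "abend" (Abend, evening) is cut down to "ab", which is exactly the kind of harmful removal that makes the -d rule hard to add safely. Plain -t (schenkt, holt) is left alone here.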

I suspect it wasn't added to the algorithm originally because it's hard to avoid removing them in cases where it's harmful (because some words end in t or d but it's not one of these suffixes). Sometimes a condition on what comes before can avoid this (possibly deliberately not handling the suffix in every case to get most of the benefit without the downsides).

It'd be good to investigate - if you have any useful insights how to distinguish -t and -d suffixes from words that just happen to end in t or d please share.

If it really isn't practical to remove these then it'd be good to document that. The older algorithm descriptions tend to just describe the mechanics of the algorithm without giving much (if any) background as to why choices were made.

rfool · 2024-10-06T09:06:53Z

Oh well, it's hard.

My best idea was to consider character pairs like kt, lt, ßt instead of a single t or d.

But that doesn't lead to safe general rules either. The German language probably needs some more reforms before stemming like in English could work. Probably never.

ojwb · 2024-10-06T21:58:25Z

> My best idea was to consider character pairs like kt, lt, ßt instead of a single t or d.
>
> But that doesn't lead to safe general rules either.

Restricting removal based on what's before the suffix is quite a common solution (and removing a suffix in a subset of cases can still be worthwhile). I'll take a deeper look, and document if it doesn't seem solvable.
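As a rough sketch of what such a restriction could look like, the snippet below removes a final -t only when the remaining stem ends in one of a few whitelisted letters. The whitelist, the minimum stem length and the name `strip_t` are all assumptions made up for illustration; ß is assumed to have been mapped to ss already, as in the outputs listed at the top of this issue.

```python
# Hypothetical whitelist of endings that may precede a removable -t.
ALLOWED_BEFORE_T = ("k", "l", "ss")


def strip_t(word: str, min_stem: int = 3) -> str:
    """Remove a final -t only after a whitelisted ending (toy rule, not Snowball)."""
    if word.endswith("t"):
        stem = word[:-1]
        if len(stem) >= min_stem and stem.endswith(ALLOWED_BEFORE_T):
            return stem
    return word


for w in ["schenkt", "holt", "schliesst", "stadt", "brot", "welt", "markt"]:
    print(w, "->", strip_t(w))
```

The verb forms lose their -t, and "stadt" and "brot" are left alone, but "welt" (Welt, world) becomes "wel" and "markt" (Markt, market) becomes "mark". So a pair-based condition narrows the damage without eliminating it, and whether the trade-off is worthwhile would need checking against a real vocabulary.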

> The German language probably needs some more reforms before stemming like in English could work. Probably never.

I wouldn't say English is particularly easy to stem - it doesn't have as many inflected forms as some other languages, but it has a large vocabulary much of which has been taken from multiple other languages, so there's rather a lot of irregularity to deal with. There are definitely endings in English we don't try to deal with either (e.g. see #172).

A stemmer can be useful without handling every possible word perfectly though, and overstemming tends to be more problematic because it can result in a search term matching an unrelated word.

rfool · 2024-10-08T22:31:04Z

> I wouldn't say English is particularly easy to stem ... There are definitely endings in English we don't try to deal with either (e.g. see #172).

Oh my gosh, you are right.

In my naive mind, English was still the English in which, 20 years ago or so, I could pluralize and singularize words with just a bunch of simple rules (and yes, "categories" became "categorie" for me at that time 🙄). Hey, but it did the job, especially in software development - for skeletons, code generation, mapping, ORM, data modeling, all that stuff - simple English is just perfect for such usage.
