Norwegian stemming of words that end with "ers" #175
When you say "a number of words", roughly how many cases are there? The rule to remove For the others, adding suffixes to reduce them suitably seems to work:
That fixes all cases for The alternative approach would be to somehow prevent I think I've spotted a case that The words ending
Oh, this is probably a genitive -s suffix, so would translate to English as "employer's", so it ideally would stem to the same thing as I'm struggling to see a rule (or set of rules) for handling words ending

I had a look for a more comprehensive Norwegian word list but haven't found one yet. If you know of a source that'd be helpful. Otherwise maybe I should try generating one from Wikipedia data (we have a script to automate that).
Thank you for your feedback!

The case of "revers" is actually harder, as it can mean "reverse of a coin or reverse gear", as you say, but it is also a plural possessive form of "rev" (fox). The English translation is "foxes'", as in "Several foxes' fur was matted".

"Andelhavers" and "arbeidgivers" should probably not be changed, since they are both singular possessive forms of nouns that end with "er" in singular form. "tryllevers", which is indeed a magic verse, has the same problem.

I have compiled a list of the words I think are relevant, according to naob.no (Norwegian online dictionary). It's not a lot, in other words, but some of these words can be combined with others, as Norwegian has the concept of "combined words" (sammensatte ord), which gives us for instance "tryllevers", "sangvers" (song verse), "salmevers" (psalm verse), "barnevers" (child verse or child poem), "bibelvers" (bible verse), "bordvers" (saying grace before eating food), "matvers" (the same as previous) and probably others, too.

Here is the list of nouns:
There are more words, but all of these are twins of other words. For instance "kammers", which can be a singular form of "small room", but also a plural possessive form of "kam" (comb).

I don't think there can be a common rule for all of these words, at least not one that I can think of. That's why I asked if adding exceptions to the stemmer might be the cleanest solution here.
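Since compounds such as "tryllevers" and "bibelvers" all share their final element, one conceivable way for an exception list to cover them is to match on the final noun instead of listing every compound. The following is only a rough Python sketch of that idea, not how the Snowball stemmer works; `PROTECTED_NOUNS` and `protected_base` are hypothetical names, and the list holds just "vers" from the examples above.

```python
from typing import Optional

# Hypothetical list of final elements whose "-ers" ending belongs to the noun
# itself. Only "vers" (from the compounds mentioned above) is included.
PROTECTED_NOUNS = ("vers",)


def protected_base(word: str) -> Optional[str]:
    """Return the protected noun that `word` ends with, or None if there is none."""
    key = word.lower()
    for noun in PROTECTED_NOUNS:
        if key.endswith(noun):
            return noun
    return None
```

Note that `protected_base("revers")` also returns `"vers"`, even though "revers" can be the plural possessive of "rev". Suffix matching alone cannot resolve the twin words described above, which is consistent with the point that no single rule covers every case.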
OK, so this one is a genuinely ambiguous case. Probably the first meaning is going to occur more commonly than the second, but neither is a particularly common word and which is more likely will depend somewhat on the nature of the data. Interpreting it as the second as we currently do isn't unreasonable.

Thanks for the list - that's really helpful. I'll study it and see if I can come up with a plan.
These seem to fall into two sets. One is the short words where we don't remove
( This fixes It changes two words in the existing The other case is where we're removing |
That looks like a good change!
In Norwegian, we have a number of words where the noun in its singular, indefinite form ends with "ers", for instance "kontrovers" ([a] controversy), "univers" ([a] universe) and "ters" ([a] third - in musical terminology). Other forms of these words are "kontroversen/kontroverser/kontroversene/kontroversens", "universet/universer/universene/universets" and "tersen/terser/tersene/tersens". Right now, these are not being stemmed correctly. The singular, indefinite forms turn into "kontrov", "univ" and "ter", respectively, while the other forms turn into "kontrovers", "univers" and "ters", so forms of the same word end up with different stems. This is not correct.
Would the best way to solve this be to add exceptions to the stemmer?
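To make the question concrete, here is a minimal Python sketch of what an exception table in front of a stemmer could look like. It is written in Python for illustration and is not Snowball code; `EXCEPTIONS`, `stem_with_exceptions` and the `fallback` parameter are hypothetical names, and the table covers only the three example nouns above, mapping each listed form to the singular, indefinite form.

```python
from typing import Callable, Dict, List

# Hypothetical table: desired stem -> the forms listed in this issue.
_FORMS: Dict[str, List[str]] = {
    "kontrovers": ["kontrovers", "kontroversen", "kontroverser",
                   "kontroversene", "kontroversens"],
    "univers": ["univers", "universet", "universer",
                "universene", "universets"],
    "ters": ["ters", "tersen", "terser", "tersene", "tersens"],
}

# Invert the table so each surface form maps directly to its stem.
EXCEPTIONS: Dict[str, str] = {
    form: stem for stem, forms in _FORMS.items() for form in forms
}


def stem_with_exceptions(word: str, fallback: Callable[[str], str]) -> str:
    """Return the listed stem if `word` is an exception, else call `fallback`."""
    key = word.lower()
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    # `fallback` stands in for the regular stemmer (an assumption here).
    return fallback(key)
```

With this table, every listed form of a word ends up with the same stem, while any other word goes to whatever stemmer is passed as `fallback`. The obvious cost is that the table must be maintained by hand and does not cover compounds unless they are listed too.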