German "ein" ("one") used as a numeral #1061

Open
verenablaschke opened this issue Oct 26, 2024 · 4 comments

@verenablaschke
Member

In German, the numeral "one" can have the same form as the indefinite article (including its inflected forms). The German UD guidelines say about this:

The word ein can be either translated as the indefinite article “a” or as the numeral “one”. It is always tagged DET and not NUM, i.e., we do not attempt to distinguish contexts in which the emphasis is on quantity and not on indefiniteness. (The quantity is present in any case, as the indefinite article is never used in plural.) [x]

This causes several inconsistencies and a validator complaint:

  1. The new leaf-det-clf validation rule (New validator rule: leaf-det-clf (and det vs. nmod) #1059) complains about structures where “ein” is modified. For instance, the German HDT treebank contains sentences like “Dieses Vergehen könne mit bis zu einem Jahr Haft oder einer Geldstrafe geahndet werden.” (“This offence can be punished with up to one year in prison or a fine.”) where “bis zu”/“up to” modifies “einem”/“one”, and “einem” modifies “Jahr”/“year”. Treating “einem” purely as a determiner means that a determiner ends up heading a dependent of its own.
    (HDT also contains extremely similar structures that are clearly marked as numerals, e.g. “Ihm droht nun eine Gefängnisstrafe von bis zu fünf Jahren [...]” “He is now facing a prison sentence of up to five years” -- annotated with the same tree structure, but “fünf”/“five” is a NUM/nummod.)
    [Image: tree for “bis zu einem Jahr Haft”]
  2. We also find sentences where “ein” is directly contrasted with other numbers, e.g. “Beide Auftritte bleiben laut Koch noch ein bis zwei Wochen im Netz .” (“According to Koch, both performances will remain online for another one to two weeks.”), which are currently treated rather unintuitively. “ein” is the determiner of “Wochen”/“weeks”, “bis”/“to” is treated as a modifier of “Wochen” and “zwei”/“two” is analyzed as a numeral that exists independently of any “one to two” structure. It would be more intuitive to treat “ein” as a numeral and “ein bis zwei” as a phrase.
    [Image: tree for “ein bis zwei Wochen”]
  3. It’s even possible to think of sentences where a DET vs NUM analysis makes a difference in meaning: “Es dauert nicht nur eine_NUM Minute (sondern zwei Minuten) / Es dauert nicht nur eine_DET Minute (sondern eine Stunde).” (“It doesn’t take only one minute (but two minutes). / It doesn’t take only a minute (but an hour).”)

  4. As a side note, both Dutch treebanks have plenty of entries where “een” is tagged as NUM, and all three Swedish treebanks have instances of “en” or “ett” as NUM.

Can we relax the strong requirement of “ein(e)” needing to be a determiner in German UD analyses?
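
To make the validator complaint from point 1 concrete, here is a minimal Python sketch that looks for tokens attached as `det` that have dependents of their own. It is only a rough approximation of the leaf-det idea: the real validator rule has details and exceptions that are not modeled here. The CoNLL-U fragment is hand-written for illustration; its heads and relations are assumptions and are not copied from HDT, and the function names are hypothetical.

```python
from collections import defaultdict

# Hand-written fragment for illustration only. The heads and relations below
# are assumptions, NOT the actual HDT annotation.
ROWS = [
    ("1", "bis", "bis", "ADP", "_", "_", "3", "advmod", "_", "_"),
    ("2", "zu", "zu", "ADP", "_", "_", "1", "fixed", "_", "_"),
    ("3", "einem", "ein", "DET", "_", "_", "4", "det", "_", "_"),
    ("4", "Jahr", "Jahr", "NOUN", "_", "_", "0", "root", "_", "_"),
    ("5", "Haft", "Haft", "NOUN", "_", "_", "4", "nmod", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"


def parse_conllu(text):
    """Yield sentences as lists of token dicts.

    Comment lines, multiword-token ranges and empty nodes are skipped.
    """
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def det_with_dependents(sentence):
    """Return (token, children) pairs for det-attached tokens that have children."""
    children = defaultdict(list)
    for tok in sentence:
        children[tok["head"]].append(tok)
    return [
        (tok, children[tok["id"]])
        for tok in sentence
        if tok["deprel"].split(":")[0] == "det" and children[tok["id"]]
    ]


for sent in parse_conllu(SAMPLE):
    for tok, kids in det_with_dependents(sent):
        print(tok["form"], "has dependents:", [k["form"] for k in kids])
```

Under the assumed annotation, the only token reported is “einem”, because “bis” is attached to it.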

@nschneid
Contributor

It seems to me that there will be some cases where one tag or the other is more intuitive, but there may be a lot of gray area in between. Do other German treebanks make a distinction, and if so, what tests do they give?

(I don't know if an analogy to English "one" is helpful because it cannot be an indefinite article, but there are 3 different tags that can apply.)

@LeonieWeissweiler
Contributor

GSD has one occurrence of "ein" tagged as NUM (in an unambiguous context as described above) but also several validation errors because of numeral "ein" tagged as DET. The other two have no "ein" as NUM.
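
A count like this can be reproduced with a few lines of Python over a CoNLL-U file. The sketch below is only an illustration: the function name `upos_counts` is hypothetical, the toy data is hand-written rather than taken from any treebank, and it assumes that all inflected forms share the lemma "ein", which may not hold for every treebank.

```python
from collections import Counter


def upos_counts(conllu_text, lemma="ein"):
    """Count UPOS tags over all word lines whose LEMMA column equals `lemma`."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Word lines have 10 tab-separated columns and an integer ID.
        if len(cols) == 10 and cols[0].isdigit() and cols[2].lower() == lemma:
            counts[cols[3]] += 1
    return counts


# Hand-written toy data (NOT from a treebank); the empty tuple separates sentences.
ROWS = [
    ("1", "ein", "ein", "DET", "_", "_", "4", "det", "_", "_"),
    ("2", "bis", "bis", "ADP", "_", "_", "4", "case", "_", "_"),
    ("3", "zwei", "zwei", "NUM", "_", "_", "4", "nummod", "_", "_"),
    ("4", "Wochen", "Woche", "NOUN", "_", "_", "0", "root", "_", "_"),
    (),
    ("1", "eine", "ein", "NUM", "_", "_", "2", "nummod", "_", "_"),
    ("2", "Minute", "Minute", "NOUN", "_", "_", "0", "root", "_", "_"),
]
TOY = "\n".join("\t".join(row) for row in ROWS)

print(upos_counts(TOY))
```

For a real treebank, the text would be read from the file first, e.g. `open(path, encoding="utf-8").read()` with `path` pointing at a local `.conllu` file.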

@verenablaschke
Member Author

verenablaschke commented Oct 26, 2024

The other German treebanks follow the language-specific guidelines as well, with the one exception Leonie pointed out: GSD sentence train-s4486 "Die Behaarung besteht aus ein - oder vielzelligen und nichtdrüsigen oder aber mit einem ein - oder mehrzelligen Drüsenkopf versehenen Trichomen." ("The coat of hair consists of uni- or multicellular and non-glandular trichomes or trichomes with a uni- or multicellular glandular head."). Curiously enough, the first "ein" is treated as a NUM and the second one as a DET, although the two contexts look basically identical. (I don't think there is a difference between "mehrzellig" and "vielzellig", both "multicellular" (literally "multiple/several-celled" and "many-celled"), but I can't say for sure.)

@amir-zeldes
Contributor

+1 for distinguishing NUM from DET in unambiguous environments, if it's possible to implement... I guess when it's modified like that it's a clear indication.
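
As a rough idea of what "unambiguous environments" could look like in practice, here is a small Python sketch that flags DET-tagged "ein" tokens as NUM candidates using two cues mentioned in this thread: the token has a dependent of its own, or it opens an "ein bis <NUM>" range. This is a hypothetical heuristic, not an agreed guideline or an existing tool; the function name, the minimal token schema and the example annotations are all assumptions.

```python
def num_candidates(tokens):
    """Flag DET-tagged "ein" tokens that look like numerals.

    `tokens` is a list of dicts with the keys id, lemma, upos and head
    (a hypothetical minimal schema, in sentence order).
    """
    heads = {t["head"] for t in tokens}
    flagged = []
    for i, tok in enumerate(tokens):
        if tok["lemma"] != "ein" or tok["upos"] != "DET":
            continue
        # Cue 1: something is attached to "ein", e.g. "bis zu" in "bis zu einem Jahr".
        modified = tok["id"] in heads
        # Cue 2: "ein" opens a range such as "ein bis zwei".
        in_range = (
            i + 2 < len(tokens)
            and tokens[i + 1]["lemma"] == "bis"
            and tokens[i + 2]["upos"] == "NUM"
        )
        if modified or in_range:
            flagged.append((tok["id"], "modified" if modified else "range"))
    return flagged


# Hand-written examples; heads are assumptions, not treebank annotations.
ein_bis_zwei = [
    {"id": 1, "lemma": "ein", "upos": "DET", "head": 4},
    {"id": 2, "lemma": "bis", "upos": "ADP", "head": 4},
    {"id": 3, "lemma": "zwei", "upos": "NUM", "head": 4},
    {"id": 4, "lemma": "Woche", "upos": "NOUN", "head": 0},
]
bis_zu_einem = [
    {"id": 1, "lemma": "bis", "upos": "ADP", "head": 3},
    {"id": 2, "lemma": "zu", "upos": "ADP", "head": 1},
    {"id": 3, "lemma": "ein", "upos": "DET", "head": 4},
    {"id": 4, "lemma": "Jahr", "upos": "NOUN", "head": 0},
]

print(num_candidates(ein_bis_zwei))
print(num_candidates(bis_zu_einem))
```

Anything flagged this way would still need manual review; the contrastive cases from point 3 of the issue cannot be detected from structure alone.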
