New validator rule: leaf-det-clf (and det vs. nmod) #1059
The errors in Hebrew are due to things like
# x- so the RTL text doesn't make this unreadable
32 x-ה x-ה DET art PronType=Art 33 det _ Gloss=the|Ref=GEN_19.8
33 x-אֲנָשִׁ֤ים x-אישׁ NOUN subs Gender=Masc|Number=Plur 38 obl _ Gloss=man|Ref=GEN_19.8
34-35 x-הָאֵל֙ x-_ _ _ _ _ _ _ _
34 x-הָ x-ה DET art PronType=Art 35 det _ Gloss=the|Ref=GEN_19.8
35 x-אֵל֙ x-אל PRON prde Number=Plur|PronType=Dem 33 det _ Gloss=these|Ref=GEN_19.8 where demonstrative pronouns have their own determiners. (I'm open to other means of annotating this.) |
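To make the pattern in the snippet above concrete, here is a minimal sketch of a check that lists det (or clf) nodes that have dependents of their own. It is not the validator's actual code: the function names are hypothetical, it flags every child without any of the exceptions the real rule may allow, and the sample uses placeholder forms (w32 ... w35) while copying only the IDs, heads and deprels from the Hebrew example.

```python
from collections import defaultdict


def parse_conllu(block):
    """Return basic-tree rows from one CoNLL-U sentence.

    Comment lines, multiword-token ranges (e.g. 34-35) and empty nodes
    (e.g. 5.1) are skipped because they are not part of the basic tree.
    """
    rows = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue
        rows.append({"id": int(cols[0]), "form": cols[1],
                     "head": int(cols[6]), "deprel": cols[7]})
    return rows


def det_with_children(rows, relations=("det", "clf")):
    """List (node id, [(child id, child deprel), ...]) for non-leaf det/clf nodes."""
    children = defaultdict(list)
    for r in rows:
        children[r["head"]].append(r)
    hits = []
    for r in rows:
        base = r["deprel"].split(":")[0]  # ignore subtypes such as det:poss
        if base in relations and children[r["id"]]:
            hits.append((r["id"], [(c["id"], c["deprel"]) for c in children[r["id"]]]))
    return hits


def row(i, form, head, deprel):
    """Build a 10-column CoNLL-U line with only the columns used here filled in."""
    return "\t".join([str(i), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


# Placeholder forms; IDs, heads and deprels follow the Hebrew snippet above.
SAMPLE = "\n".join([
    "# illustrative fragment, not a complete sentence",
    row(32, "w32", 33, "det"),
    row(33, "w33", 38, "obl"),
    "34-35\tw34w35\t_\t_\t_\t_\t_\t_\t_\t_",
    row(34, "w34", 35, "det"),
    row(35, "w35", 33, "det"),
])

print(det_with_children(parse_conllu(SAMPLE)))  # [(35, [(34, 'det')])]
```

Only token 35 is reported: the demonstrative is itself a det and carries its own article (token 34) as a det child, which is exactly the configuration described above.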
@mr-martian this is also the analysis used in the modern Hebrew TBs, so I would be inclined to accept and keep it (it's also parallel to how adjectival modification works in Hebrew) |
If I were doing Hebrew from scratch, the one alternative I'd consider is treating ה as an inflectional prefix rather than a syntactic word. |
I would vote against that TBH, it's not how other languages with repeating articles do it either (e.g. Greek) and it complicates lemmatization, type counts, and a bunch of other things. |
I have one remaining error:
The offending tree has someone emphasising 'every' by saying a h-uile h-uile. Is there maybe a better way I should be doing this or could it be an exception? |
Repetition for emphasis: would The validator currently allows |
This invalidated both HDT and GSD for German, mostly because of vor allem (mainly) and unter anderem (among others). For both, the first word is an How should we handle this better? |
No, for the German case it's not "among other teachers", notice "other" is dative but "teacher" is not - it's "among others, teachers". I think the mistake is the deprel det - this is not a determiner but an oblique modifier, just like English "among others". |
What about Here two examples in one sentence from Roman tragedies in UD Latin-CIRCSE: |
For spoken data, we need three relations to be added to the validator:
|
In Latvian, we have several expressions considered as compound pronouns in Latvian traditional grammar which consist of one particle and one pronoun. For example, kaut kāds where kaut is a particle and kāds is a pronoun (this expression roughly means 'some kind of'). Currently, we annotate the particle as The particles in these expressions usually are kaut, diez, diezin, nez, nezin, and they all have very fuzzy, hard to pin down semantics so we feel uncomfortable annotating them as adverbs. We would like to annotate these expressions as Would you please consider allowing |
@dan-zeman What about relaxing the error to a warning while we figure out the contours of the rule? |
I think that this new rule is fine, even if, while correcting, I and colleagues have encountered a couple of cases which really do not look reducible to a trivial correction as all the others.
To summarise the above discussion, my two proposals are to deactivate this validation rule if:
|
We have something similar to the case in 1. in Coptic where a word is repeated for distributive meaning:
Etc. 1-2 also work fine in modern Hebrew BTW, and 3. would work in the plural. What we did in UD Coptic was interpret them as nominal modifiers without a preposition (i.e. "one one" is the same as "one by one" with the word "by" suppressed). We then used the |
This new rule invalidates an analysis in my Low Saxon dataset that I just presented last spring in my LREC-COLING paper and discussed with other UD people at the conference, even with @dan-zeman himself, if I remember correctly. It is explained in Section 5.1 here: https://aclanthology.org/2024.lrec-main.1388.pdf The gloss and translation of the sentence can be found in Section 4.3. Attaching the possessor in dative case to the possessee instead of the determiner does not represent the way this construction works because 1) the dative possessor cannot be attached to the possessee without the determiner and 2) the possessee can be dropped while the determiner cannot. E.g., in the example in my paper, "In der Gemoene iarem." (literally "in the parish hers") is a valid answer to a specification question in whose service the person stands. (A note to German speakers: Masculine and neuter nouns show that this is indeed a dative, not a genitive.) |
@ftyers @jonorthwash Is there a way to get around Pronoun det with appos in (). This is something that might show up in a text «his (John's) text is strange.» I would have: det(text, his) appos(his, John's)
Also, in Latvian we struggle with constructions similar to "such a high price that nobody could afford it" from the original post as well. |
Yes, @nschneid, I think the problem encountered in UD_Erzya-JR should be made explicit, here.
`some like him (Stepan Ivanich) had gotten older...' obl(syrelgadstʹ, ladso) This could also be dealt with as a postposition, where the noun ‹lad› `way' in the Inessive case would contribute to the same ‹obl› dependency obl(syrelgadstʹ, sonze) Departing from a ‹det› dependency, however, we could approach English(, but this is not what EWT does). His friends come from all over. In linguistics, such a sentence might be quoted with an inserted identifier for contextual clarity, e.g., His (Fred's) friends come from all over. Authors themselves [their very selves], might do the same thing with commas: Since the validator does not allow words with a ‹det› dependency to take children, one might opt to follow a Swedish lead and change all instances of genitive-case personal pronoun ‹det› to ‹nmod:poss/nmod:det›, but wouldn't that go against the established norm? Here is an example of Swedish
In Swedish, the first and second person pronouns are associated with distinct determiners that are called pronouns in UD: vår, min, er, din. These words inflect according to their possessa, and therefore they might be seen as analogous to the Czech possessive determiners: `possessive determiners (which modify a nominal) (note that some languages use PRON for similar words): [cs] můj, tvůj, jeho, její, náš, váš, jejich' https://universaldependencies.org/ru/dep/nmod.html https://universaldependencies.org/en/dep/nmod.html So it looks like there might be a Swedish–English consensus for using nmod:poss with possessive pronouns and genitive personal pronouns. There is disparity within the Russian corpora, alongside a consistent Czech. |
211 treebanks are invalidated by this new rule, and we need guidance on what to do before the freeze!!! Please provide brief and clear instructions, as aligning the treebanks with this rule requires a lot of work. |
In Classical Chinese 彼此兵 (those and these soldiers) is invalidated by this new rule. How do we solve it?
|
@nschneid, hi! the UD_Finnish-FTB has an interesting construction
ilmeisen harvoja valtiomiehiä sininen ‹blue› |
@jpiitula sent_id = j7hnk-6227 is problematic. See UniversalDependencies/docs#1059
The rule in validator script is something like that :
if I understand correctly we are allowed to use only |
Thank you @johnnymoretti but I think that |
@KoichiYasuoka For sure, I'm not going into detail about the language, I've just reported what the rule says. At the moment |
Why not |
In Latvian we have an occasional subordinate clause problem as well: tās somas, ko atrada vakar 'those bags which were found yesterday', because in this situation we might as well be talking about various kinds of bags, some of which were found yesterday and some not. We struggle with applying the concept of determiners to Latvian in general, but this seems to be a determiner situation, right? |
But eng. my is not a pronoun... actually, I do not understand how my office can use The case you report
looks very similar to the latin one I discussed: you have one element referring to the But we might be up to something regarding elements adding |
I know that the page describes Russian. Does that make the description less valid? |
Thanks! But that is taken from the language-specific guidelines for Russian (note the "ru" in the URL), not from the universal guidelines. And further down the page there are examples of nmod cases that do show agreement, so it doesn't seem to be intended as a real criterion. But the fact that people writing language-specific guidelines have drawn this conclusion clearly shows that the universal guidelines are in need of clarification. :) |
Well, it means at least that it only applies to that language. And if it is inconsistent with the universal guidelines, I would say it's problematic. However, as I already said, the basic problem seems to be that the universal guidelines are not clear enough to begin with. |
Talking about the "universal" guidelines, they are indeed not of much help here in their current state since they only mention that in some languages the |
I personally think that this might be preferable for cross-linguistic consistency, but as you correctly point out the guidelines do allow both options and there has been no amendment to the guidelines on this point. In particular, the new validator rule that is the original focus of this issue was not (as far as I know) introduced with this in mind. So the main point I take from this discussion is that we need to improve our guidelines for the nmod relation. Whether this will lead to an amendment or only a clarification is too early to say at this point. |
@jnivre, @sylvainkahane and @jasiewert, I am suggesting that the Swedish "ditt hus" 'your house' could be annotated as det(hus, ditt), but "hans hus" 'his house' nmod:poss(hus, hans). This describes a distinction between structures, such as parallel between 3Sg genitive form in and nouns in the genitive, on the one hand, and "ditt hus" possessive determiners that agree with their head words, i.e., "ta maison", "твой дом", on the other. This, of course, does not answer the English dilemma with my, thy, your, our, their, her. |
Actually "their" and "her" are also historically genitives, like "its" and "his", though some of the pronoun forms are Anglo-Saxon and others are borrowed from Scandinavian (e.g. their, which comes from the dem. stem, not the proper personal stem, Old English "hiera"). But the fact that it's basically impossible to tell which is which now shows that this probably doesn't play a role in how we should analyze the syntax for English - synchronically, none of the forms show any kind of agreement. |
I agree with @amir-zeldes. Although we must always use language-specific criteria when interpreting the guidelines for a specific language, I don't think the presence of agreement should be the basis of the distinction between det and nmod. |
I still don't see an answer in the thread for such cases where an unambiguous determiner is explained by another phrase. How to annotate these sentences to pass the new validation rule? Some examples from Latvian: |
I am surprised that semantic criteria should override morpho-syntactic criteria, given that the dependency relations are called "syntactic relations" in the documentation, not "semantic relations". Isn't agreement a rather decisive criterion when distinguishing, e.g., |
Um, thanks but I don't think I said the last part 😅 I do think agreement is an important syntactic phenomenon to consider, and my understanding was that the individual language guidelines do differ in how they treat possessives. Whether that's a good idea or not is debatable. Some aspects of UD in practice ignore morpho-syntactic facts such as agreement, for example in treating copulas as auxiliaries, and I do see the logic of that, especially for languages with split copula systems (so the subject of a Russian nominal sentence depends on the lexical predicate, whether or not a copula is present). The same argument could be made for split possessive systems, like English was historically and the Czech one synchronically as well, but we could also make the opposite argument, as I think @jasiewert is doing. My only point was about English, where I think going for a split system (interlocutive my, your, our as det but delocutive her, his, its as nmod) is particularly uncompelling, because synchronically there is no evidence one way or the other. Between the two, I prefer nmod for English, because it means that all personal pronouns can have content-y deprels, rather than the function-y det, and it unifies pronominal and nominal possession (genitive 's) in a way that seems systematic and satisfying. In other languages, things could play out very differently, and I don't believe in giving English special importance in the discussion of universal guidelines. |
@amir-zeldes Sorry about misrepresenting your position. I still don't think agreement is decisive, but that requires a longer argument. And I completely agree that English should not be given priority. |
@jasiewert At the universal level, we always have to rely on functional criteria, because they are the only ones that are universally applicable. But please note that functional is not purely semantic, it is semantic + information packaging. When we develop guidelines for a specific language, we therefore have to start by using the functional criteria to identify the prototypical cases (such as primary transitives for core arguments), observe what morphosyntactic criteria are characteristic of those, and then extend them to other cases. I think agreeing possessives are a typical example where two such characteristics clash in many languages: the referentiality of nominal modifiers and the agreement patterns of adjectival modifiers. Which one should be given preference may ultimately depend on other factors, which is why the current guidelines allow both options. I may be biased towards referentiality myself, but I fully respect that the facts are different in different languages. But, for the record, my original comment was with respect to Chinese, where agreement clearly is not a criterion. I hope this at least clarifies my position. |
@
I may be mistaken, but in the second example, it looks like the second modifier could be an adjectival modifier that attaches to the nominal head. Is there anything that rules out such an analysis. |
Well... I think we need different criteria to validate
|
This might not be related to the main topic that has been discussed here, but I was wondering if the prepositional phrase ของเธอ (of her) could modify the noun เล่ม (used as a classifier in this structure), instead of the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book). In Thai, the head noun modified by the classifier phrase can be omitted when the noun is previously mentioned and known. Then the noun used as a classifier becomes the head noun of the NP, as in these trees: From the full phrase:
it becomes this:
So Apart from this reason, the noun เล่ม /lêm/ is the head noun of the modifier phrase modifying the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book). A noun used as a classifier in Thai must not be used alone with a head noun to be modified. It must be modified to be able to modify the head noun. The structure (noun classifier) is ungrammatical in Thai. I am not sure if there is a difference. Thai uses postmodifiers, except that numbers and quantifiers expressing quantities are placed before a noun used as a classifier. And I guess my annotation with P.S. I just realised that I English-glossed the word เธอ /thɤɤ/ (PRON) incorrectly. It should be "she", not "her". In Thai, we have only one word to express personal and possessive pronouns. |
Thank you @leky40 but the latter example เล่มนี้ของเธอ
is slightly far from this issue "leaf-det-clf" ... ah well, OK, we'll try to investigate หนังสือสองเล่มนี้ของเธอ now. What do you think about this? I think we can omit หนังสือ from หนังสือสองเล่มนี้ของเธอ, then what structure is suitable for สองเล่มนี้ของเธอ?
|
My analysis is different from the tree above. A number is quite tricky. In Thai, when a number is placed before a noun to be modified, it expresses quantities. When it is placed after a noun, it expresses sequences (order). From the structure you show above, the number "two" expresses quantities and it modifies the noun เล่ม /lêm/, which is used as a classifier for the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book) in this entire noun phrase. I was asked how I knew that the number two modified the noun เล่ม /lêm/, which is used as a classifier. I would say its position, and when a question is made. It would be "กี่เล่ม (how many + the noun เล่ม /lêm/)", not "กี่หนังสือ (how many + book หนังสือ /nǎŋ-sɯ̌ɯ/)". A noun used as a classifier is the head noun of the modifier phrase. And if the noun /lêm/, which is used as a classifier for the book, is omitted from the structure presented above, it is ungrammatical. That's why the number cannot modify the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book). So the trees I annotated would be:
I know that these annotations of mine might not validate. But this is how a Thai classifier works with other modifiers. Apart from adding specificity / emphasis to a head noun, a noun used as a classifier is also used to replace the head noun which it modifies. It is a noun, functioning as a classifier. |
Šāds is not an adjective in Latvian, as it does not form comparative degrees and it does not have definite/indefinite endings. It is a pronoun that tends to take the place of an adjective in the sentence. For us it is totally fine to annotate such pronouns as adjectival modifiers; in fact, we would love to. But if we do so, then there is nothing left to be annotated with the det role. Is it okay to forgo the det role altogether just because we don't have anything like actual articles in the language? (Sorry if the question is kinda dumb, we honestly struggle with understanding and applying the determiner concept correctly :) ) |
I meant that "kategoriskam" was an amod, but I may have misunderstood what the problem is. Which word is it that you want to have as a dependent of "sadam"? (Please excuse the lack of diacritics.) |
Our current analysis is:
Thus, I assumed that you suggested that we should mark šādām as |
@lauma Thanks for clarifying. I did not suggest that you should mark anything as nmod here. That discussion was about another example. From your description of the example, it definitely sounds like "kategoriskām" should be attached as an amod to the head noun (that is, as a sister to the det node). That is normally the right analysis for a DET-ADJ-NOUN sequence, and I don't think the punctuation changes this fact. Similarly, I think you should consider attaching the two determiners as sisters in the first example. The principle of prioritising content words in UD has as a consequence that you often get a flat analysis where other frameworks would have a hierarchical analysis. This is true, for example, of multiple auxiliaries attaching to the same word, and it is the prescribed analysis for multiple determiners as well. I am aware that the first example is a bit special, but I think this could still be the least bad analysis (given the general principles of UD). |
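For anyone who wants to see what the flat "sisters" analysis suggested above would do to a tree, here is a minimal sketch that reattaches every dependent of a det node to that node's own head. The function name and the toy three-token tree are hypothetical; the deprels are left untouched, and whether a mechanical reattachment is linguistically right has to be decided case by case, so this is an aid for inspection rather than a recommended conversion.

```python
def flatten_det_children(heads, deprels, relations=("det",)):
    """Return a new head map where dependents of det nodes become their sisters.

    heads and deprels are dicts keyed by token id (0 is the artificial root).
    """
    new_heads = dict(heads)
    changed = True
    while changed:  # repeat so that chains such as det <- det <- det are fully flattened
        changed = False
        for tok, head in new_heads.items():
            # Only values of existing keys are reassigned, so iterating is safe.
            if head != 0 and deprels.get(head, "").split(":")[0] in relations:
                new_heads[tok] = new_heads[head]
                changed = True
    return new_heads


# Toy tree (hypothetical): token 1 is a det of token 2, which is a det of noun 3.
heads = {1: 2, 2: 3, 3: 0}
deprels = {1: "det", 2: "det", 3: "root"}

print(flatten_det_children(heads, deprels))  # {1: 3, 2: 3, 3: 0}
```

After the call both determiners attach to the noun, which is the flat configuration described for multiple determiners.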
I understand @jnivre's remark about the functionalist approach of UD and I see some advantages to have a common annotation of possessives. Note that we already have a feature |
As Croft and others have shown, any given language-particular construction (formalized via a feature, UPOS, or deprel) can be named based on its prototypical function, but its actual application will tend to extend beyond that prototype, so line-drawing becomes tricky in many cases (do we prioritize the general function viewed crosslingually or the language-internal tests?). E.g. there are English-specific morphosyntactic arguments that the English Determiner Relation/Slot should encompass both core determiners and prenominal possessives, but for purposes of meaning and crosslinguistic comparison, possessive dependents are quite different from core determiners. I do think UD's prioritization of content relations is a principle that resolves this, but as @sylvainkahane notes it leaves something out in terms of grouping together language-internal constructions defined by morphosyntactic distribution. Maybe we should move toward annotating the additional categorizations in a separate layer (e.g. the UCxn approach) to expose the broader category of determiner relations. Whether of-PPs ought to be grouped together with Saxon genitives under any approach to syntax seems dubious to me, though they certainly have semantic parallels. |
I notice that the leaf-det-clf rule introduced in UniversalDependencies/tools@1e4debd and then revised in UniversalDependencies/tools@759c5ae has invalidated quite a lot (a majority?) of treebanks.
Is further revision necessary? For example, EWT is still experiencing some errors that look like they should be valid:
det + nmod, e.g. "at least some reports" (det(reports, some), nmod(some, least)). "at least" is admittedly ADV-like, so another option is to make it ExtPos=ADV and advmod.
det licensing an advcl, as in these results. The guidelines on sufficiency and excess for "so" and similar say the advcl should attach to the adjective or adverb, not the noun in a case like sufficient flour. In such a high price that nobody could afford it, I suppose "such" should have an advcl dependent?
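One way to judge whether further revision is needed would be to tally which relations actually occur under det in a treebank. The sketch below does that over sentences given as (id, head, deprel) triples; the function name is hypothetical, and in the sample only det(reports, some) and nmod(some, least) come from the example above, while the attachment of "at" and the root relation are assumptions made just to complete the toy tree.

```python
from collections import Counter


def child_relations_under(sentences, parent_rel="det"):
    """Count the deprels of all nodes whose head bears parent_rel (subtypes ignored)."""
    counts = Counter()
    for sent in sentences:
        deprel_of = {tok: rel for tok, _head, rel in sent}
        for _tok, head, rel in sent:
            if deprel_of.get(head, "").split(":")[0] == parent_rel:
                counts[rel] += 1
    return counts


# "at least some reports": 1=at, 2=least, 3=some, 4=reports
sample = [[
    (1, 2, "case"),  # assumption: how "at" attaches is not stated in the thread
    (2, 3, "nmod"),  # nmod(some, least), as in the example
    (3, 4, "det"),   # det(reports, some), as in the example
    (4, 0, "root"),  # assumption: treated as the root of this fragment
]]

print(child_relations_under(sample))  # Counter({'nmod': 1})
```

Run over a whole treebank, a table like this would show at a glance which child relations (nmod, advcl, and so on) account for most of the new errors.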