New validator rule: leaf-det-clf (and det vs. nmod) #1059

nschneid · 2024-10-08T19:31:36Z

I notice that the leaf-det-clf rule introduced in UniversalDependencies/tools@1e4debd and then revised in UniversalDependencies/tools@759c5ae has invalidated quite a lot (a majority?) of treebanks.

Is further revision necessary? For example, EWT is still experiencing some errors that look like they should be valid:

det + nmod e.g. "at least some reports" (det(reports, some), nmod(some, least)). "at least" is admittedly ADV-like, so another option is to make it ExtPos=ADV and advmod.
"such"/det licensing an advcl, as in these results. The guidelines on sufficiency and excess for "so" and similar say the advcl should attach to the adjective or adverb, not the noun in a case like sufficient flour. In such a high price that nobody could afford it, I suppose "such" should have an advcl dependent?

mr-martian · 2024-10-08T19:48:41Z

The errors in Hebrew are due to things like

# x- so the RTL text doesn't make this unreadable
32	x-ה	x-ה	DET	art	PronType=Art	33	det	_	Gloss=the|Ref=GEN_19.8
33	x-אֲנָשִׁ֤ים	x-אישׁ	NOUN	subs	Gender=Masc|Number=Plur	38	obl	_	Gloss=man|Ref=GEN_19.8
34-35	x-הָאֵל֙	x-_	_	_	_	_	_	_	_
34	x-הָ	x-ה	DET	art	PronType=Art	35	det	_	Gloss=the|Ref=GEN_19.8
35	x-אֵל֙	x-אל	PRON	prde	Number=Plur|PronType=Dem	33	det	_	Gloss=these|Ref=GEN_19.8

where demonstrative pronouns have their own determiners. (I'm open to other means of annotating this.)

amir-zeldes · 2024-10-08T20:04:53Z

@mr-martian this is also the analysis used in the modern Hebrew TBs, so I would be inclined to accept and keep it (it's also parallel to how adjectival modification works in Hebrew)

mr-martian · 2024-10-08T20:13:05Z

If I were doing Hebrew from scratch, the one alternative I'd consider is treating ה as an inflectional prefix rather than a syntactic word.

amir-zeldes · 2024-10-08T21:53:13Z

I would vote against that TBH, it's not how other languages with repeating articles do it either (e.g. Greek) and it complicates lemmatization, type counts, and a bunch of other things.

colinbatchelor · 2024-10-10T13:52:45Z

I have one remaining error:
[(in gd_arcosg-ud-train.conllu) Line 55940 Sent p01_033h Node 79]: [L3 Syntax leaf-det-clf] 'det' not expected to have children (79:a:det --> 81:h-uile:compound)

a h-uile 'every' is treated in the source material as a determiner, which seems reasonable.
I've also been following other treebanks like Turkish in using compound for reduplication: https://universaldependencies.org/gd/dep/compound.html

The offending tree has someone emphasising 'every' by saying a h-uile h-uile. Is there maybe a better way I should be doing this or could it be an exception?

nschneid · 2024-10-10T18:52:04Z

Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).

The validator currently allows fixed, but not flat, it seems.

LeonieWeissweiler · 2024-10-10T18:57:30Z

This invalidated both HDT and GSD for German, mostly because of vor allem (mainly) and unter anderem (among others). For both, the first word is an `ADP` and the second is a `DET` that depends on it with the `case` relation.

How should we handle this better?

nschneid · 2024-10-10T19:03:50Z

unter anderem is sometimes treated as a fixed expression. Here is a case triggering the error:

I assume this means "among other teachers"—is there a reason not to analyze it as "among [other teachers]", with unter attaching to Lehrer?

amir-zeldes · 2024-10-10T19:25:16Z

No, for the German case it's not "among other teachers", notice "other" is dative but "teacher" is not - it's "among others, teachers". I think the mistake is the deprel det - this is not a determiner but an oblique modifier, just like English "among others".

FedeIure · 2024-10-11T08:07:44Z

Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).

The validator currently allows fixed, but not flat, it seems.

What about flat:redup to mark repetition for emphasis?

Here are two examples in one sentence from Roman tragedies in UD Latin-CIRCSE:

sylvainkahane · 2024-10-11T09:41:40Z

For spoken data, we need three relations to be added to the validator:

discourse, which is very common between two determiners in false starts: "a, uh, a gap", "my, uh, our friend"
parataxis for cases such as "a, I don't how to call that, a kiosk, …": here we have a reparandum link between the two "a"s and we would like to attach the parenthesis to the first "a". More exactly we use parataxis:parenth in our spoken French treebanks.
dep for false starts such as "the last, the last day": here "the last" forms a phrase the head of which is missing and we decided to have dep(the, last). I am not against another solution, as long as "the last" is still a phrase.

lrituma · 2024-10-15T10:17:56Z

In Latvian, we have several expressions considered as compound pronouns in Latvian traditional grammar which consist of one particle and one pronoun. For example, kaut kāds where kaut is a particle and kāds is a pronoun (this expression roughly means 'some kind of'). Currently, we annotate the particle as discourse, dependent on the pronoun, and the pronoun occasionally becomes det if the expression describes a noun. This leads to a validation error.

The particles in these expressions usually are kaut, diez, diezin, nez, nezin, and they all have very fuzzy, hard to pin down semantics so we feel uncomfortable annotating them as adverbs.

We would like to annotate these expressions as compound (instead of fixed) because the pronoun is the second element in the phrase and we feel that it is the head of the phrase: the pronoun inflects together with the noun and bears most of the semantic meaning of the expression.

Would you please consider allowing compound in this construction or is there any other option appropriate here?

nschneid · 2024-10-15T17:29:24Z

@dan-zeman What about relaxing the error to a warning while we figure out the contours of the rule?

Stormur · 2024-10-17T15:45:50Z

I think that this new rule is fine, even if, while correcting, my colleagues and I have encountered a couple of cases which really do not look reducible to a trivial correction like all the others.

1. The already mentioned reduplication, which is treated through flat:redup in Latin treebanks. One example is quot quot from quot: while the latter means 'as many as', the reduplication has a distributive sense as in 'for each possible one...' (this expression is sometimes even univerbated). I think annotating them separately, each depending on the head, is not the right way to deal with them: here we do not have two or more different terms, but really the same one "cloning" itself. On the other hand, flat is really the closest relation we have to fixed, which would cause no problem, but is not a correct choice (well, in my opinion it is never the correct choice).
- Problem: horizontal relation
2. The phrase nostra qui remansissemus caede 'the murder of us who are left (behind)', but more literally 'our who are left murder', since nostra is the inflected possessive determiner for the 1st person plural. What happens here is that the possessive adds a nominal person, as it were, and this person is another referent beyond the noun caede 'murder' in this phrase; as such, the relative can target it (or at least, Cicero takes the liberty of doing so). We could not really justify an analysis where we shift the relative under the head noun, since the murder is not one of its arguments.
- Problem: the relative clause dependent of the determiner cannot be traced back to the referent of its head

To summarise the above discussion, my two proposals are to deactivate this validation rule if:

the child of det is a flat relation
the head element has the feature Person, at least for acl:relcl
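
The two proposed exceptions could be expressed as a predicate that is consulted before the rule reports an error. This is only a sketch: the function name, the dictionary representation of FEATS, and the reading of "the head element" as the det node governing the relative clause are assumptions made for illustration, not the validator's actual code.

```python
def exempt_from_leaf_det(det_feats, child_deprel):
    """Return True if a child of a det node falls under one of the two proposed exceptions.

    det_feats: FEATS of the det node as a dict, e.g. {"Person": "1", "Poss": "Yes"}.
    child_deprel: relation of the child, possibly subtyped, e.g. "flat:redup".
    """
    universal = child_deprel.split(":", 1)[0]
    # Proposal 1: the child is attached via flat (this also covers flat:redup).
    if universal == "flat":
        return True
    # Proposal 2: the det node carries Person and the child is a relative clause.
    if child_deprel == "acl:relcl" and "Person" in det_feats:
        return True
    return False


print(exempt_from_leaf_det({}, "flat:redup"))                  # True
print(exempt_from_leaf_det({"Person": "1"}, "acl:relcl"))      # True
print(exempt_from_leaf_det({"PronType": "Art"}, "acl:relcl"))  # False
```

Under this reading, nostra with a Person feature would be allowed its relative clause, while an article with a relative clause attached would still be reported.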

amir-zeldes · 2024-10-17T15:58:34Z

We have something similar to the case in 1. in Coptic where a word is repeated for distributive meaning:

one one = "one by one"
two two = "two by two, in pairs"
color color = "color for color, every color"

Etc. 1-2 also work fine in modern Hebrew BTW, and 3. would work in the plural. What we did in UD Coptic was interpret them as nominal modifiers without a preposition (i.e. "one one" is the same as "one by one" with the word "by" suppressed). We then used the nmod:unmarked relation, which is a subtype of nmod used without a case marker.

jasiewert · 2024-10-20T11:43:22Z

This new rule invalidates an analysis in my Low Saxon dataset that I just presented last spring in my LREC-COLING paper and discussed with other UD people at the conference, even with @dan-zeman himself, if I remember correctly. It is explained in Section 5.1 here: https://aclanthology.org/2024.lrec-main.1388.pdf The gloss and translation of the sentence can be found in Section 4.3.

Attaching the possessor in dative case to the possessee instead of the determiner does not represent the way this construction works because 1) the dative possessor cannot be attached to the possessee without the determiner and 2) the possessee can be dropped while the determiner cannot. E.g., in the example in my paper, "In der Gemoene iarem." (literally "in the parish hers") is a valid answer to a specification question in whose service the person stands. (A note to German speakers: Masculine and neuter nouns show that this is indeed a dative, not a genitive.)
The alternative to change the determiners' tags to PRON in Low Saxon would go against UD's own definition of determiners. I would therefore join @nschneid in asking you to relax the error to a warning or ask for language-specific exceptions to the rule.

@ftyers

@ftyers @jonorthwash Is there a way to get around a pronoun attached as det that takes an appos in parentheses? This is something that might show up in a text such as «his (John's) text is strange.» I would have: det(text, his), appos(his, John's)

lauma · 2024-10-21T11:25:48Z

Also, in Latvian we struggle with constructions similar to "such a high price that nobody could afford it" from the original post as well.

rueter · 2024-10-21T15:15:06Z

Yes, @nschneid, I think the problem encountered in UD_Erzya-JR should be made explicit, here.
In Erzya (myv), Moksha (mdf) and Skolt Saami (sms), genitive forms of personal pronouns are regularly connected to their possessa with a ‹det› dependency.

sent_id = EKS:2011:39:15:ČesnokovF
Конат-конат сонзэ (Степан Иваныч) ладсо сырелгадсть...
Konat-konat    sonze    (Stepan Ivanych)  ladso    syrelgadstʹ...
such-such.Pl  his/her  (St. I.)                    in.way   become.older.3Pl

`some like him (Stepan Ivanich) had gotten older...'

obl(syrelgadstʹ, ladso)
det(ladso, sonze)
appos(sonze, Stepan)

This could also be dealt with as a postposition, where the noun ‹lad› `way' in the Inessive case would contribute to the same ‹obl› dependency

obl(syrelgadstʹ, sonze)
case(sonze, ladso)
appos(sonze, Stepan)

Starting from a ‹det› dependency, however, we could approach English in the same way (but this is not what EWT does).

His friends come from all over.
det(friends, his)

In linguistics, such a sentence might be quoted with an inserted identifier for contextual clarity, e.g.,

His (Fred's) friends come from all over.
det(friends, his)
appos(His, Fred's)

Authors themselves [their very selves], might do the same thing with commas:
His, Fred's, friends come from all over.
det(friends, his)
appos(His, Fred's)

Since the validator does not allow words with a ‹det› dependency to take children, one might opt to follow a Swedish lead and change all instances of genitive-case personal pronoun ‹det› to ‹nmod:poss/nmod:det›, but wouldn't that go against the established norm?

Here is an example of Swedish hennes ‹her› given with ‹nmod:poss› dependency
The genitive form of a third person singular personal pronoun 'her'

# sent_id = sv-ud-dev-78
# text = Börjar hennes jobb att delas av den moderne mannen?
1	Börjar	börja	VERB	VB|PRS|AKT	Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	0:root	_
2	hennes	hon	PRON	PS|UTR/NEU|SIN/PLU|DEF	Definite=Def|Poss=Yes|PronType=Prs	3	nmod:poss	3:nmod:poss	_
3	jobb	jobb	NOUN	NN|NEU|SIN|IND|NOM	Case=Nom|Definite=Ind|Gender=Neut|Number=Sing	1	nsubj	1:nsubj|5:nsubj	_
4	att	att	PART	IE	_	5	mark	5:mark	_
5	delas	dela	VERB	VB|INF|SFO	VerbForm=Inf|Voice=Pass	1	xcomp	1:xcomp	_
6	av	av	ADP	PP	_	9	case	9:case	_
7	den	den	DET	DT|UTR|SIN|DEF	Definite=Def|Gender=Com|Number=Sing|PronType=Art	9	det	9:det	_
8	moderne	modern	ADJ	JJ|POS|MAS|SIN|DEF|NOM	Case=Nom|Definite=Def|Degree=Pos|Gender=Com|Number=Sing	9	amod	9:amod	_
9	mannen	man	NOUN	NN|UTR|SIN|DEF|NOM	Case=Nom|Definite=Def|Gender=Com|Number=Sing	5	obl:agent	5:obl:agent	SpaceAfter=No
10	?	?	PUNCT	MAD	_	1	punct	1:punct	_

In Swedish, the first and second person pronouns are associated with distinct determiners that are called pronouns in UD: vår, min, er, din. These words inflect according to their possessa, and therefore they might be seen as analogous to the Czech possessive determiners.

`possessive determiners (which modify a nominal) (note that some languages use PRON for similar words): [cs] můj, tvůj, jeho, její, náš, váš, jejich'
See also
https://universaldependencies.org/cs/dep/nmod.html
The Czech is consistent.

https://universaldependencies.org/ru/dep/nmod.html
I note that Russian also ‹его карта› amod(карта, его)
translated as English ‹his card› amod(card, his)
Syntag appears to contradict this in ‹его мнению› `his opinion' det(мнению, его) but also `в его (и не только его, но и нашем) случае' ‹в его случае› `in his case' nmod(случае, его)

https://universaldependencies.org/en/dep/nmod.html
I note that the English provides ‹my office› nmod:poss(office, my)
which is the same coding as in EWT.

So it looks like there might be a Swedish–English consensus for nmod:poss use with possessive pronouns, and genitive personal pronouns.

There is disparity within the Russian corpora alongside a consistent Czech.

johnnymoretti · 2024-10-22T09:52:04Z

211 treebanks are invalidated by this new rule, and we need guidance on what to do before the freeze!!! Please provide brief and clear instructions, as aligning the treebanks with this rule requires a lot of work.

KoichiYasuoka · 2024-10-23T23:54:03Z

In Classical Chinese 彼此兵 (those and these soldiers) is invalidated by this new rule. How do we solve it?

# sent_id = KR2b0041_018_par8_1550-1557
# text = 訂彼此兵不得過關
1	訂	訂	VERB	v,動詞,行為,動作	_	0	root	_	Gloss=settle|SpaceAfter=No
2	彼	彼	PRON	n,代名詞,指示,*	PronType=Dem	4	det	_	Gloss=that|SpaceAfter=No
3	此	此	PRON	n,代名詞,指示,*	PronType=Dem	2	flat	_	Gloss=this|SpaceAfter=No
4	兵	兵	NOUN	n,名詞,人,役割	_	7	nsubj	_	Gloss=soldier|SpaceAfter=No
5	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	6	advmod	_	Gloss=not|SpaceAfter=No
6	得	得	AUX	v,助動詞,可能,*	Mood=Pot	7	aux	_	Gloss=must|SpaceAfter=No
7	過	過	VERB	v,動詞,行為,移動	_	1	ccomp	_	Gloss=pass|SpaceAfter=No
8	關	關	NOUN	n,名詞,固定物,建造物	Case=Loc	7	obj	_	Gloss=bar|SpaceAfter=No
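
To make the failure concrete, the check can be reproduced on this sentence. The sketch below uses the two patterns quoted from the validator elsewhere in this thread; the tuple representation of the tree, the function name, and the choice to compare only the universal part of each relation are assumptions for illustration.

```python
import re

# Patterns as quoted from the validator rule elsewhere in this thread.
PARENT = re.compile(r"^(det|clf)$")
ALLOWED_CHILD = re.compile(r"^(advmod|obl|goeswith|fixed|reparandum|conj|cc|punct)$")

# (id, form, head, deprel) transcribed from the sentence above.
TREE = [
    (1, "訂", 0, "root"),
    (2, "彼", 4, "det"),
    (3, "此", 2, "flat"),
    (4, "兵", 7, "nsubj"),
    (5, "不", 6, "advmod"),
    (6, "得", 7, "aux"),
    (7, "過", 1, "ccomp"),
    (8, "關", 7, "obj"),
]


def leaf_violations(tree):
    """List (parent, child) pairs where a det/clf node has a disallowed child."""
    by_id = {node[0]: node for node in tree}
    found = []
    for child in tree:
        parent = by_id.get(child[2])
        if parent is None:  # head 0 is the artificial root
            continue
        # Compare only the universal part of each relation (an assumption).
        pdeprel = parent[3].split(":")[0]
        cdeprel = child[3].split(":")[0]
        if PARENT.match(pdeprel) and not ALLOWED_CHILD.match(cdeprel):
            found.append((parent, child))
    return found


for parent, child in leaf_violations(TREE):
    print(f"{parent[0]}:{parent[1]}:{parent[3]} --> {child[0]}:{child[1]}:{child[3]}")
# 2:彼:det --> 3:此:flat
```

With `conj` in place of `flat` on token 3, the list comes back empty, since `conj` is in the allow-list.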

rueter · 2024-10-24T11:58:38Z

@nschneid, hi! the UD_Finnish-FTB has an interesting construction

# sent_id = j7hnk-6227
«Viron presidentti Lennart Meri on yksi niitä [ilmeisen harvoja valtiomiehiä], jotka laativat puheensa itse.»
«The President of Estonia, Lennart Meri, is one of the [apparently few statesmen] who write their own speeches.»

8       ilmeisen        ilmeinen        ADJ     A,Sg,Gen        Case=Gen|Number=Sing    9       amod    _       _
9       harvoja harva   DET     Pron,Qnt,Pl,Par Case=Par|Number=Plur|PronType=Ind       10      det     _       _
10      valtiomiehiä    valtiomies      NOUN    N,Pl,Par        Case=Par|Number=Plur    0       root    _       _

ilmeisen harvoja valtiomiehiä
The genitive-case adjective ‹apparent› modifies the determiner ‹few›.
This same construction with a genitive-case adjective is observed with expressions of color, e.g.

sininen ‹blue›
vaalean + sininen ‹light + blue›
tumman + punainen ‹dark + red›
NB! in Finnish these are written as one word, i.e., vaaleansininen, tummanpunainen, except, perhaps, when saying ‹especially dark red› erittäin tumman punainen
amod(red, dark)
advmod(dark, especially)
The Finnish grammar might make reference to an instructive case in -n, but this instance ilmeisen harvoja valtiomiehiä does not seem to fall into that category: https://kaino.kotus.fi/visk/sisallys.php?p=389.
What do you think @fginter, @flammie, @jpiitula?

@jpiitula

@jpiitula sent_id = j7hnk-6227 is problematic. See UniversalDependencies/docs#1059

johnnymoretti · 2024-10-24T12:36:08Z

The rule in the validator script is something like this:

if re.match(r"^(det|clf)$", pdeprel) and not re.match(r"^(advmod|obl|goeswith|fixed|reparandum|conj|cc|punct)$",cdeprel) :

If I understand correctly, we are allowed to use only obl and not obl:cmp, right? If so, why? Shouldn't the main dependency relation also cover its subtypes?
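
Whether `obl:cmp` passes depends on what `cdeprel` holds when the check runs. As a minimal sketch (the helper `universal_part` is hypothetical, and whether the validator strips subtypes before this test is an assumption not confirmed in this thread), the anchored pattern only matches a subtyped relation once the part after the colon has been removed:

```python
import re

# Allow-list as quoted from the validator rule above.
ALLOWED = r"^(advmod|obl|goeswith|fixed|reparandum|conj|cc|punct)$"


def universal_part(deprel):
    """Hypothetical helper: drop a language-specific subtype, e.g. 'obl:cmp' -> 'obl'."""
    return deprel.split(":", 1)[0]


def child_allowed(cdeprel, strip_subtype):
    """Return True if a child with relation cdeprel passes the allow-list."""
    if strip_subtype:
        cdeprel = universal_part(cdeprel)
    return re.match(ALLOWED, cdeprel) is not None


print(child_allowed("obl", strip_subtype=False))      # True
print(child_allowed("obl:cmp", strip_subtype=False))  # False
print(child_allowed("obl:cmp", strip_subtype=True))   # True
```

So the answer hinges on whether the relation is reduced to its universal part before the comparison; the regular expression alone does not cover subtypes.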

KoichiYasuoka · 2024-10-24T12:47:27Z

Thank you @johnnymoretti but I think that det and clf cannot be treated in the same way. In Thai clf can be modified by ADJ or PRON (whose). On the other hand det can be linked by flat or conj...

johnnymoretti · 2024-10-24T12:59:58Z

@KoichiYasuoka For sure, I'm not going into detail about the language, I've just reported what the rule says. At the moment det and clf are in the same rule.

Stormur · 2024-10-24T15:36:24Z

In Classical Chinese 彼此兵 (those and these soldiers) is invalidated by this new rule. How do we solve it?

Why not conj here?

lauma · 2024-10-24T15:46:32Z

In Latvian we have an occasional subordinate clause problem as well: tās somas, ko atrada vakar 'those bags which were found yesterday', because in this situation we might as well be talking about various kinds of bags, some of which were found yesterday and some not. We struggle with applying the concept of determiners to Latvian in general, but this seems to be a determiner situation, right?

Stormur · 2024-10-24T15:55:41Z

https://universaldependencies.org/en/dep/nmod.html I note that the English provides ‹my office› nmod:poss(office, my) which is the same coding as in EWT.

So it looks like there might be a Swedish–English consensus for nmod:poss use with possessive pronouns, and genitive personal pronouns.

But eng. my is not a pronoun... actually, I do not understand how my office can use nmod in English in the current standard.

The case you report

sent_id = EKS:2011:39:15:ČesnokovF
Конат-конат сонзэ (Степан Иваныч) ладсо сырелгадсть...
Konat-konat    sonze    (Stepan Ivanych)  ladso    syrelgadstʹ...
such-such.Pl  his/her  (St. I.)                    in.way   become.older.3Pl

looks very similar to the Latin one I discussed: you have one element referring to the Person which is in сонзэ and which cannot go with ладсо. A kind of "double referent" phrase.

But we might be on to something regarding elements adding Persons. The case you report makes me wonder if indeed any element like this warrants nmod even when they look like other DETs.

jasiewert · 2024-10-25T14:12:27Z

I know that the page describes Russian. Does that make the description less valid?

jnivre · 2024-10-25T14:12:46Z

Thanks! But that is taken from the language-specific guidelines for Russian (note the "ru" in the URL), not from the universal guidelines. And further down the page there are examples of nmod cases that do show agreement, so it doesn't seem to be intended as a real criterion. But the fact that people writing language-specific guidelines have drawn this conclusion clearly shows that the universal guidelines are in need of clarification. :)

jnivre · 2024-10-25T14:15:50Z

I know that the page describes Russian. Does that make the description less valid?

Well, it means at least that it only applies to that language. And if it is inconsistent with the universal guidelines, I would say it's problematic. However, as I already said, the basic problem seems to be that the universal guidelines are not clear enough to begin with.

jasiewert · 2024-10-25T14:22:40Z

Talking about the "universal" guidelines, they are indeed not of much help here in their current state since they only mention that in some languages the nmod relation is out of question for possessive determiners, but do not give any guidelines on how to distinguish det from nmod: "This is not yet completely parallel across languages; in some languages, it is much more clear than in English how possessive determiners relate to adjectives, and the nmod relation is out of question." https://universaldependencies.org/u/dep/det.html
In any case, this guideline does show that in some languages, possessive determiners might have to be annotated as something else than nmod. However, from this conversation, I get the impression that we are now being pushed towards annotating any possessive determiners as nmod no matter how they behave morpho-syntactically in the various languages. Is this a misunderstanding?

jnivre · 2024-10-25T14:41:06Z

I personally think that this might be preferable for cross-linguistic consistency, but as you correctly point out the guidelines do allow both options and there has been no amendment to the guidelines on this point. In particular, the new validator rule that is the original focus of this issue was not (as far as I know) introduced with this in mind. So the main point I take from this discussion is that we need to improve our guidelines for the nmod relation. Whether this will lead to an amendment or only a clarification is too early to say at this point.

rueter · 2024-10-25T15:34:02Z

@jnivre, @sylvainkahane and @jasiewert, I am suggesting that the Swedish "ditt hus" 'your house' could be annotated as det(hus, ditt), but "hans hus" 'his house' nmod:poss(hus, hans).

This describes a distinction between two kinds of structure: on the one hand, those where a 3Sg genitive form parallels nouns in the genitive,
nmod:poss(дом, его) 'his home', nmod:poss(kotini, minun) 'my home' fin.

and "ditt hus" possessive determiners that agree with their head words, i.e., "ta maison", "твой дом", on the other.
det:poss(maison, ta), det:poss(дом, твой), det:poss(Haus, sein)

This, of course, does not answer the English dilemma with my, thy, your, our, their, her.
I am leaving the words "his" and "its" out.

amir-zeldes · 2024-10-25T16:41:04Z

This, of course, does not answer the English dilemma with my, thy, your, our, their, her. I am leaving the words "his" and "its" out.

Actually "their" and "her" are also historically genitives, like "its" and "his", though some of the pronoun forms are Anglo-Saxon and others are borrowed from Scandinavian (e.g. their, which comes from the dem. stem, not the proper personal stem, Old English "hiera"). But the fact that it's basically impossible to tell which is which now shows that this probably doesn't play a role in how we should analyze the syntax for English - synchronically, none of the forms show any kind of agreement.

jnivre · 2024-10-25T16:53:46Z

I agree with @amir-zeldes. Although we must always use language-specific criteria when interpreting the guidelines for a specific language, I don't think the presence of agreement should be the basis of the distinction between det and nmod.

lrituma · 2024-10-25T17:54:22Z

For spoken data, we need three relations to be added to the validator:

discourse, which is very common between two determiners in false starts: "a, uh, a gap", "my, uh, our friend"

parataxis for cases such as "a, I don't how to call that, a kiosk, …": here we have a reparandum link between the two "a"s and we would like to attach the parenthesis to the first "a". More exactly we use parataxis:parenth in our spoken French treebanks.

dep for false starts such as "the last, the last day": here "the last" forms a phrase the head of which is missing and we decided to have dep(the, last). I am not against another solution, as long as "the last" is still a phrase.

I still don't see an answer in the thread for such cases where an unambiguous determiner is explained by another phrase. How to annotate these sentences to pass the new validation rule?

Some examples from Latvian:
_.. tādā godīgā iestādē ieperinājušies daži (tikai daži!) zagļi .. _ - "a few (only a few!) thieves have nested in such an honest institution"
ar šādām, reizēm ļoti kategoriskām, pozīcijām - "with such, sometimes very categorical, positions" (šādām is a determiner and it has an agreement with the noun after the insertion)

jasiewert · 2024-10-25T18:17:08Z

I am surprised that semantic criteria should override morpho-syntactic criteria, given that the dependency relations are called "syntactic relations" in the documentation, not "semantic relations".

Isn't agreement a rather decisive criterion when distinguishing, e.g., amod from nmod, at least in languages that exhibit adjective agreement? (Btw., precise criteria to distinguish amod from both det and nmod would probably be valuable as well.)
Treating various types of possessive constructions as nmod would certainly make things more uniform across corpora, but to me this sounds a bit like treating uniformity as an end in itself at the expense of representing linguistic distinctions.

amir-zeldes · 2024-10-25T19:11:01Z

I agree with @amir-zeldes. Although we must always use language-specific criteria when interpreting the guidelines for a specific language, I don't think the presence of agreement should be the basis of the distinction between det and nmod.

Um, thanks but I don't think I said the last part 😅 I do think agreement is an important syntactic phenomenon to consider, and my understanding was that the individual language guidelines do differ in how they treat possessives. Whether that's a good idea or not is debatable.

Some aspects of UD in practice ignore morpho-syntactic facts such as agreement, for example in treating copulas as auxiliaries, and I do see the logic of that, especially for languages with split copula systems (so the subject of a Russian nominal sentence depends on the lexical predicate, whether or not a copula is present). The same argument could be made for split possessive systems, like English was historically and the Czech one synchronically as well, but we could also make the opposite argument, as I think @jasiewert is doing. My only point was about English, where I think going for a split system (interlocutive my, your, our as det but delocutive her, his, its as nmod) is particularly uncompelling, because synchronically there is no evidence one way or the other. Between the two, I prefer nmod for English, because it means that all personal pronouns can have content-y deprels, rather than the function-y det, and it unifies pronominal and nominal possession (genitive 's) in a way that seems systematic and satisfying. In other languages, things could play out very differently, and I don't believe in giving English special importance in the discussion of universal guidelines.

jnivre · 2024-10-25T20:20:03Z

@amir-zeldes Sorry about misrepresenting your position. I still don't think agreement is decisive, but that requires a longer argument. And I completely agree that English should not be given priority.

jnivre · 2024-10-25T20:26:09Z

@jasiewert At the universal level, we always have to rely on functional criteria, because they are the only ones that are universally applicable. But please note that functional is not purely semantic, it is semantic + information packaging. When we develop guidelines for a specific language, we therefore have to start by using the functional criteria to identify the prototypical cases (such as primary transitives for core arguments), observe what morphosyntactic criteria are characteristic of those, and then extend them to other cases. I think agreeing possessives is a typical example where two such characteristics clash in many languages, the referentiality of nominal modifiers and the agreement patterns of adjectival modifiers. Which one should be given preference may ultimately depend on other factors, which is why the current guidelines allow both options. I may be biased towards referentiality myself, but I fully respect that the facts are different in different languages. But, for the record, my original comment was with respect to Chinese, where agreement clearly is not a criterion. I hope this at least clarifies my position.

jnivre · 2024-10-25T20:28:20Z

@

For spoken data, we need three relations to be added to the validator:

discourse, which is very common between two determiners in false starts: "a, uh, a gap", "my, uh, our friend"

parataxis for cases such as "a, I don't how to call that, a kiosk, …": here we have a reparandum link between the two "a"s and we would like to attach the parenthesis to the first "a". More exactly we use parataxis:parenth in our spoken French treebanks.

dep for false starts such as "the last, the last day": here "the last" forms a phrase the head of which is missing and we decided to have dep(the, last). I am not against another solution, as long as "the last" is still a phrase.

I still don't see an answer in the thread for such cases where an unambiguous determiner is explained by another phrase. How to annotate these sentences to pass the new validation rule?

Some examples from Latvian: _.. tādā godīgā iestādē ieperinājušies daži (tikai daži!) zagļi .. _ - "a few (only a few!) thieves have nested in such an honest institution" ar šādām, reizēm ļoti kategoriskām, pozīcijām - "with such, sometimes very categorical, positions" (šādām is a determiner and it has an agreement with the noun after the insertion)

I may be mistaken, but in the second example, it looks like the second modifier could be an adjectival modifier that attaches to the nominal head. Is there anything that rules out such an analysis?

KoichiYasuoka · 2024-10-26T00:16:03Z

At the moment det and clf are in the same rule.

Well... I think we need different criteria to validate det and clf, especially when they are used together. But can we make the validator for clf really "universal"?

# text = หนังสือเล่มนี้ของเธอ
# text_zh = 她的這本書
1	หนังสือ	_	NOUN	NN	_	0	root	_	SpaceAfter=No|Translit=書
2	เล่ม	_	NOUN	CL	_	1	clf	_	SpaceAfter=No|Translit=本
3	นี้	_	DET	DT	_	2	det	_	SpaceAfter=No|Translit=這
4	ของ	_	ADP	IN	_	5	case	_	Gloss=of|SpaceAfter=No
5	เธอ	_	PRON	PRN	_	1	nmod:poss	_	SpaceAfter=No|Translit=她

leky40 · 2024-10-26T06:27:06Z

At the moment det and clf are in the same rule.

Well... I think we need different criteria to validate det and clf, especially when they are used together. But can we make the validator for clf really "universal"?
# text = หนังสือเล่มนี้ของเธอ
# text_zh = 她的這本書
1	หนังสือ	_	NOUN	NN	_	0	root	_	SpaceAfter=No|Translit=書
2	เล่ม	_	NOUN	CL	_	1	clf	_	SpaceAfter=No|Translit=本
3	นี้	_	DET	DT	_	2	det	_	SpaceAfter=No|Translit=這
4	ของ	_	ADP	IN	_	5	case	_	Gloss=of|SpaceAfter=No
5	เธอ	_	PRON	PRN	_	1	nmod:poss	_	SpaceAfter=No|Translit=她

This might not be related to the main topic that has been discussed here, but I was wondering if the prepositional phrase ของเธอ (of her) could modify the noun เล่ม (used as a classifier in this structure), instead of the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book). In Thai, the head noun modified by the classifier phrase can be omitted when the noun is previously mentioned and known. Then the noun used as a classifier becomes the head noun of the NP, as in these trees:

From the full phrase:

# text = หนังสือเล่มนี้ของเธอ
# text_en = this book of hers
# gloss = /nǎŋ-sɯ̌ɯ/ /lêm/ /níi/ /khɔ̌ɔŋ/ /thɤɤ/
1	หนังสือ	_	NOUN	_	_	0	root	_	_
2	เล่ม	_	NOUN	_	_	1	clf	_	_
3	นี้	_	DET	_	_	2	det	_	_
4	ของ	_	ADP	_	_	5	case	_	_
5	เธอ	_	PRON	_	_	2	nmod:poss	_	_

it becomes this:

# text = เล่มนี้ของเธอ
# text_en = this book of hers (if the book is previously mentioned and known)
# gloss = /lêm/ /níi/ /khɔ̌ɔŋ/ /thɤɤ/
1	เล่ม	_	NOUN	_	_	0	root	_	_
2	นี้	_	DET	_	_	1	det	_	_
3	ของ	_	ADP	_	_	4	case	_	_
4	เธอ	_	PRON	_	_	1	nmod:poss	_	_

So clf is no longer used in this structure, but it's still known that it is used as a classifier for the book previously mentioned.

Apart from this reason, the noun เล่ม /lêm/ is the head noun of the modifier phrase modifying the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book). A noun used as a classifier in Thai must not be used alone with a head noun to be modified. It must be modified to be able to modify the head noun. The structure (noun classifier) is ungrammatical in Thai.

I am not sure if there is a difference. Thai uses postmodifiers, except that numbers and quantifiers expressing quantities are placed before a noun used as a classifier.

And I guess my annotation with clf might not be validated.

P.S. I just realised that I glossed the word เธอ /thɤɤ/ (PRON) incorrectly in English. It should be "she", not "her". In Thai, we have only one word to express personal and possessive pronouns.

KoichiYasuoka · 2024-10-26T09:16:15Z

Thank you @leky40 but the latter example เล่มนี้ของเธอ

So clf is no longer used in this structure, but it's still known that it is used as a classifier for the book previously mentioned.

is slightly far from this issue "leaf-det-clf" ... ah well, OK, we'll try to investigate หนังสือสองเล่มนี้ของเธอ now. What do you think about this? I think we can omit หนังสือ from หนังสือสองเล่มนี้ของเธอ, then what structure is suitable for สองเล่มนี้ของเธอ?

# text = หนังสือสองเล่มนี้ของเธอ
1	หนังสือ	_	NOUN	_	_	0	root	_	Gloss=book|SpaceAfter=No
2	สอง	_	NUM	_	_	1	nummod	_	Gloss=two|SpaceAfter=No
3	เล่ม	_	NOUN	_	_	2	clf	_	Gloss=[classifier]|SpaceAfter=No
4	นี้	_	DET	_	_	3	det	_	Gloss=this|SpaceAfter=No
5	ของ	_	ADP	_	_	6	case	_	Gloss=of|SpaceAfter=No
6	เธอ	_	PRON	_	_	1	nmod:poss	_	Gloss=she|SpaceAfter=No

leky40 · 2024-10-26T11:03:25Z

My analysis is different from the tree above.

A number is quite tricky. In Thai, when a number is placed before a noun to be modified, it expresses quantities. When it is placed after a noun, it expresses sequences (order).

From the structure you show above, the number "two" expresses quantities and it modifies the noun เล่ม /lêm/, which is used as a classifier for the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book) in this entire noun phrase.

I was asked how I knew that the number two modified the noun เล่ม /lêm/, which is used as a classifier. I would say from its position, and from how a question is formed: it would be "กี่เล่ม (how many + the noun เล่ม /lêm/)", not "กี่หนังสือ (how many + book หนังสือ /nǎŋ-sɯ̌ɯ/)".

A noun used as a classifier is the head noun of the modifier phrase. And if the noun /lêm/, which is used as a classifier for the book, is omitted from the structure presented above, it is ungrammatical. That's why the number cannot modify the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book).

So the trees I annotated would be:

# text = หนังสือสองเล่มนี้ของเธอ
# text_en = these two books of hers
# gloss = /nǎŋ-sɯ̌ɯ/ /sɔ̌ɔŋ/ /lêm/ /níi/ /khɔ̌ɔŋ/ /thɤɤ/
1	หนังสือ	_	NOUN	_	_	0	root	_	_
2	สอง	_	NUM	_	_	3	nummod	_	_
3	เล่ม	_	NOUN	_	_	1	clf	_	_
4	นี้	_	DET	_	_	3	det	_	_
5	ของ	_	ADP	_	_	6	case	_	_
6	เธอ	_	PRON	_	_	3	nmod:poss	_	_

# text = สองเล่มนี้ของเธอ
# text_en = these two books of hers (when the books are previously mentioned and known)
# gloss = /sɔ̌ɔŋ/ /lêm/ /níi/ /khɔ̌ɔŋ/ /thɤɤ/
1	สอง	_	NUM	_	_	2	nummod	_	_
2	เล่ม	_	NOUN	_	_	0	root	_	_
3	นี้	_	DET	_	_	2	det	_	_
4	ของ	_	ADP	_	_	5	case	_	_
5	เธอ	_	PRON	_	_	2	nmod:poss	_	_

I know that these annotations of mine might not be validated. But this is how a Thai classifier works with other modifiers. Apart from showing specificity / emphasis on a head noun, a noun used as a classifier is also used to replace the head noun which it modifies. It is a noun, functioning as a classifier.
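
The relationship between the two trees above can be made explicit with a small Python sketch. Under the analysis where the numeral, the determiner and the possessor all attach to the classifier, dropping the head noun and promoting the classifier to root yields the second tree directly. The helper name `drop_head_noun` and the tuple layout (id, form, head, deprel) are hypothetical, chosen only for illustration.

```python
def drop_head_noun(rows):
    """Delete the root noun, promote its clf dependent to root, renumber IDs.

    rows: list of (id, form, head, deprel) tuples for one sentence.
    """
    root_id = next(tid for tid, _, head, _ in rows if head == 0)
    clf_id = next(
        tid for tid, _, head, dep in rows if head == root_id and dep == "clf"
    )
    kept = [r for r in rows if r[0] != root_id]
    # Map old token IDs to new consecutive IDs; 0 stays the artificial root.
    new_id = {old: i for i, (old, _, _, _) in enumerate(kept, start=1)}
    new_id[0] = 0
    out = []
    for tid, form, head, dep in kept:
        if tid == clf_id:
            out.append((new_id[tid], form, 0, "root"))
        else:
            # Any remaining dependent of the dropped noun moves to the classifier.
            new_head = clf_id if head == root_id else head
            out.append((new_id[tid], form, new_id[new_head], dep))
    return out


# Full phrase, with the attachments proposed in the first tree above.
FULL = [
    (1, "หนังสือ", 0, "root"),
    (2, "สอง", 3, "nummod"),
    (3, "เล่ม", 1, "clf"),
    (4, "นี้", 3, "det"),
    (5, "ของ", 6, "case"),
    (6, "เธอ", 3, "nmod:poss"),
]

for row in drop_head_noun(FULL):
    print(*row, sep="\t")
```

The rows printed for `FULL` have the same heads and relations as the tree for สองเล่มนี้ของเธอ, with เล่ม as root and สอง, นี้ and เธอ attached to it.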

lauma · 2024-10-26T11:41:26Z

I still don't see an answer in the thread for such cases where an unambiguous determiner is explained by another phrase. How to annotate these sentences to pass the new validation rule?
Some examples from Latvian:
.. tādā godīgā iestādē ieperinājušies daži (tikai daži!) zagļi .. - "a few (only a few!) thieves have nested in such an honest institution"
ar šādām, reizēm ļoti kategoriskām, pozīcijām - "with such, sometimes very categorical, positions" (šādām is a determiner and it agrees with the noun after the insertion)

I may be mistaken, but in the second example, it looks like the second modifier could be an adjectival modifier that attaches to the nominal head. Is there anything that rules out such an analysis?

Šāds is not an adjective in Latvian, as it does not form comparative degrees and it does not have definite/indefinite endings. It is a pronoun that tends to take the place of an adjective in the sentence. For us it would be totally fine to annotate such pronouns as adjectival modifiers; in fact, we would love to. But if we do so, then there is nothing left to be annotated with the det role. Is it okay to forgo the det role altogether just because we don't have anything like actual articles in the language?

(Sorry if the question is kinda dumb, we honestly struggle with understanding and applying the determiner concept correctly :) )

jnivre · 2024-10-26T15:08:10Z

I meant that "kategoriskam" was an amod, but I may have misunderstood what the problem is. Which word is it that you want to have as a dependent of "sadam"? (Please excuse the lack of diacritics.)

lauma · 2024-10-27T11:34:09Z

Our current analysis is:

root of the fragment: pozīcijām ('positions', noun in plural dative).
det: šādām (inflecting pronoun, meaning something like 'such as' or 'like these') attached to pozīcijām, as it agrees in number and case with pozīcijām.
????: kategoriskām ('categorical', adjective in plural dative) attached to šādām, because it is located between commas after šādām, agrees with it and further specifies what this šādām entails. (It is also reasonably common in Latvian to attach a whole acl in these situations, and this is considered to be notably different from a noun just having two separate attributes, coordinated or not, like 'little green house'.)
advmod: ļoti ('very', adverb, not inflecting) attached to kategoriskām.

Thus, I assumed that you suggested that we should mark šādām as nmod, not det, and got confused. Would it be appropriate to assume that kategoriskām is obl, if šādām stays det?

jnivre · 2024-10-27T13:26:28Z

@lauma Thanks for clarifying. I did not suggest that you should mark anything as nmod here. That discussion was about another example.

From your description of the example, it definitely sounds like "kategoriskām" should be attached as an amod to the head noun (that is, as a sister to the det node). That is normally the right analysis for a DET-ADJ-NOUN sequence, and I don't think the punctuation changes this fact.

Similarly, I think you should consider attaching the two determiners as sisters in the first example. The principle of prioritising content words in UD has as a consequence that you often get a flat analysis where other frameworks would have a hierarchical analysis. This is true, for example, of multiple auxiliaries attaching to the same word, and it is the prescribed analysis for multiple determiners as well. I am aware that the first example is a bit special, but I think this could still be the least bad analysis (given the general principles of UD).
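
The flat (sister) attachment described above can be sketched as a small tree rewrite in Python. This is only an illustration under stated assumptions: the helper `flatten_det_children` is hypothetical, only the four tokens whose attachment was described in the thread are included (keeping their positions in the fragment), and the unlabelled "????" relation is represented by the placeholder `dep`. The new label is passed in by the caller, since the right relation depends on the dependent; `amod` here follows the suggestion for kategoriskām.

```python
def flatten_det_children(rows, relabel="amod"):
    """Reattach every child of a det node to the det's own head.

    rows: list of (id, form, head, deprel) tuples for one sentence.
    """
    # For each det node, remember the head it attaches to.
    det_head = {tid: head for tid, _, head, dep in rows if dep == "det"}
    out = []
    for tid, form, head, dep in rows:
        if head in det_head:
            # The child becomes a sister of the det node.
            out.append((tid, form, det_head[head], relabel))
        else:
            out.append((tid, form, head, dep))
    return out


# Hierarchical analysis as described in the thread ("dep" stands in for "????").
HIERARCHICAL = [
    (2, "šādām", 8, "det"),
    (5, "ļoti", 6, "advmod"),
    (6, "kategoriskām", 2, "dep"),
    (8, "pozīcijām", 0, "root"),
]

for row in flatten_det_children(HIERARCHICAL):
    print(*row, sep="\t")
```

After the rewrite, kategoriskām attaches to pozīcijām as amod, šādām stays a det leaf, and ļoti keeps its attachment to kategoriskām.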

sylvainkahane · 2024-10-27T18:30:16Z

I understand @jnivre's remark about the functionalist approach of UD and I see some advantages to having a common annotation of possessives. Note that we already have a feature Poss=Yes for that.
There are also some advantages to indicating the differences between languages. For instance, English treebanks use nmod:poss both for possessive pronouns and for the Saxon genitive "NP's", but not for the genitive "of NP". I suppose it is a way to indicate that "NP's" occupies the same syntactic position as possessive pronouns (which is also the position of determiners). In other languages, possessives and genitive NPs occupy very different syntactic positions and it could be strange to annotate them similarly. Or at least, if we do that, we lose something concerning the grammar of the language.
Moreover, as noted by @jasiewert, in some languages, possessives agree with their nominal governor (like adjectives and unlike genitive NPs) and the nominalness of possessives can be discussed. It could also be nice to be able to indicate whether or not possessives occupy the same position as determiners, when there is such a position.

nschneid · 2024-10-27T19:05:41Z

As Croft and others have shown, any given language-particular construction (formalized via a feature, UPOS, or deprel) can be named based on its prototypical function, but its actual application will tend to extend beyond that prototype, so line-drawing becomes tricky in many cases (do we prioritize the general function viewed crosslingually or the language-internal tests?). E.g. there are English-specific morphosyntactic arguments that the English Determiner Relation/Slot should encompass both core determiners and prenominal possessives, but for purposes of meaning and crosslinguistic comparison, possessive dependents are quite different from core determiners.

I do think UD's prioritization of content relations is a principle that resolves this, but as @sylvainkahane notes it leaves something out in terms of grouping together language-internal constructions defined by morphosyntactic distribution. Maybe we should move toward annotating the additional categorizations in a separate layer (e.g. the UCxn approach) to expose the broader category of determiner relations.

Whether of-PPs ought to be grouped together with Saxon genitives under any approach to syntax seems dubious to me, though they certainly have semantic parallels.

verenablaschke mentioned this issue Oct 17, 2024

"unter anderem" and "vor allem" in German #1060

Open

rueter added a commit to UniversalDependencies/UD_Finnish-FTB that referenced this issue Oct 24, 2024

Correct det dependencies

47a4922

@jpiitula sent_id = j7hnk-6227 is problematic. See UniversalDependencies/docs#1059

nschneid changed the title ~~New validator rule: leaf-det-clf~~ New validator rule: leaf-det-clf (and det vs. nmod) Oct 25, 2024

dan-zeman added this to the v2.15 milestone Oct 25, 2024

dan-zeman added standard needed dependencies universal labels Oct 25, 2024

verenablaschke mentioned this issue Oct 26, 2024

German "ein" ("one") used as a numeral #1061

Open
