New validator rule: leaf-det-clf (and det vs. nmod) #1059

Open
nschneid opened this issue Oct 8, 2024 · 72 comments

Comments

@nschneid
Contributor

nschneid commented Oct 8, 2024

I notice that the leaf-det-clf rule introduced in UniversalDependencies/tools@1e4debd and then revised in UniversalDependencies/tools@759c5ae has invalidated quite a lot (a majority?) of treebanks.

Is further revision necessary? For example, EWT is still experiencing some errors that look like they should be valid:

  • det + nmod e.g. "at least some reports" (det(reports, some), nmod(some, least)). "at least" is admittedly ADV-like, so another option is to make it ExtPos=ADV and advmod.
  • "such"/det licensing an advcl, as in these results. The guidelines on sufficiency and excess for "so" and similar say the advcl should attach to the adjective or adverb, not the noun in a case like sufficient flour. In such a high price that nobody could afford it, I suppose "such" should have an advcl dependent?
@mr-martian
Contributor

The errors in Hebrew are due to things like

# x- so the RTL text doesn't make this unreadable
32	x-ה	x-ה	DET	art	PronType=Art	33	det	_	Gloss=the|Ref=GEN_19.8
33	x-אֲנָשִׁ֤ים	x-אישׁ	NOUN	subs	Gender=Masc|Number=Plur	38	obl	_	Gloss=man|Ref=GEN_19.8
34-35	x-הָאֵל֙	x-_	_	_	_	_	_	_	_
34	x-הָ	x-ה	DET	art	PronType=Art	35	det	_	Gloss=the|Ref=GEN_19.8
35	x-אֵל֙	x-אל	PRON	prde	Number=Plur|PronType=Dem	33	det	_	Gloss=these|Ref=GEN_19.8

where demonstrative pronouns have their own determiners. (I'm open to other means of annotating this.)

@amir-zeldes
Contributor

@mr-martian this is also the analysis used in the modern Hebrew TBs, so I would be inclined to accept and keep it (it's also parallel to how adjectival modification works in Hebrew)

@mr-martian
Contributor

If I were doing Hebrew from scratch, the one alternative I'd consider is treating ה as an inflectional prefix rather than a syntactic word.

@amir-zeldes
Contributor

I would vote against that TBH, it's not how other languages with repeating articles do it either (e.g. Greek) and it complicates lemmatization, type counts, and a bunch of other things.

@colinbatchelor
Contributor

I have one remaining error:
[(in gd_arcosg-ud-train.conllu) Line 55940 Sent p01_033h Node 79]: [L3 Syntax leaf-det-clf] 'det' not expected to have children (79:a:det --> 81:h-uile:compound)

The offending tree has someone emphasising 'every' by saying a h-uile h-uile. Is there maybe a better way I should be doing this or could it be an exception?

@nschneid
Contributor Author

Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).

The validator currently allows fixed, but not flat, it seems.

@LeonieWeissweiler
Contributor

LeonieWeissweiler commented Oct 10, 2024

This invalidated both HDT and GSD for German, mostly because of vor allem (mainly) and unter anderem (among others). For both, the first word is an ADP and the second is a DET that depends on it with the case relation.

How should we handle this better?

@nschneid
Contributor Author

unter anderem is sometimes treated as a fixed expression. Here is a case triggering the error:

image

I assume this means "among other teachers"—is there a reason not to analyze it as "among [other teachers]", with unter attaching to Lehrer?

@amir-zeldes
Contributor

No, for the German case it's not "among other teachers", notice "other" is dative but "teacher" is not - it's "among others, teachers". I think the mistake is the deprel det - this is not a determiner but an oblique modifier, just like English "among others".

@FedeIure

Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).

The validator currently allows fixed, but not flat, it seems.

What about flat:redup to mark repetition for emphasis?

Here two examples in one sentence from Roman tragedies in UD Latin-CIRCSE:

flat_redup_Latin_CIRCSE

@sylvainkahane
Contributor

For spoken data, we need three relations to be added to the validator:

  • discourse, which is very common between two determiners in false starts: "a, uh, a gap", "my, uh, our friend"
  • parataxis for cases such as "a, I don't know how to call that, a kiosk, …": here we have a reparandum link between the two "a"s and we would like to attach the parenthesis to the first "a". More exactly we use parataxis:parenth in our spoken French treebanks.
  • dep for false starts such as "the last, the last day": here "the last" forms a phrase the head of which is missing and we decided to have dep(the, last). I am not against another solution, as long as "the last" is still a phrase.

@lrituma

lrituma commented Oct 15, 2024

In Latvian, we have several expressions, considered compound pronouns in traditional Latvian grammar, which consist of one particle and one pronoun. For example, kaut kāds, where kaut is a particle and kāds is a pronoun (this expression roughly means 'some kind of'). Currently, we annotate the particle as discourse, dependent on the pronoun, and the pronoun occasionally becomes det if the expression describes a noun. This leads to a validation error.

The particles in these expressions usually are kaut, diez, diezin, nez, nezin, and they all have very fuzzy, hard to pin down semantics so we feel uncomfortable annotating them as adverbs.

We would like to annotate these expressions as compound (instead of fixed) because the pronoun is the second element in the phrase and we feel that it is the head of the phrase: the pronoun inflects together with the noun and bears most of the semantic meaning of the expression.

Would you please consider allowing compound in this construction or is there any other option appropriate here?

@nschneid
Contributor Author

@dan-zeman What about relaxing the error to a warning while we figure out the contours of the rule?

@Stormur
Contributor

Stormur commented Oct 17, 2024

I think that this new rule is fine, even if, while correcting, I and colleagues have encountered a couple of cases which really do not look reducible to a trivial correction as all the others.

  1. The already mentioned reduplication, which is treated through flat:redup in Latin treebanks. One example is quot quot from quot: while the latter means 'as many as', the reduplication has a distributive sense as in 'for each possible one...' (this expression is sometimes even univerbated). I think annotating them separately, each depending on the head, is not the right way to deal with them: here we do not have two or more different terms, but really the same one "cloning" itself. On the other hand, flat is really the closest relation we have to fixed, which would cause no problem, but is not a correct choice (well, in my opinion it is never the correct choice)
    • Problem: horizontal relation
  2. The phrase nostra qui remansissemus caede 'the murder of us who are left (behind)', but more literally 'our who are left murder', since nostra is the inflected possessive determiner for the 1st person plural. What happens here is that the possessive adds a nominal person, as it were, and this person is another referent beyond the noun caede 'murder' in this phrase; as such, the relative can target it (or at least, Cicero pleases himself in doing so). We could not really justify an analysis where we shift the relative under the head noun, since the murder is not one of its arguments.
    • Problem: the relative clause dependent of the determiner cannot be traced back to the referent of its head

To summarise the above discussion, my two proposals are to deactivate this validation rule if:

  1. the child of det is a flat relation
  2. the head element has the feature Person, at least for acl:relcl
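
The two proposed exemptions can be read as a small predicate. This is only a sketch of the proposal as worded above, not validator code: the names `parse_feats` and `exempt_under_proposal` are hypothetical, and it assumes that "the head element" means the det word itself and that the feature is spelled exactly `Person` in FEATS (layered variants such as `Person[psor]` would need a looser test).

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Case=Abl|Person=1' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def exempt_under_proposal(det_feats, child_deprel):
    """Return True if a child of a det word would be exempt from leaf-det-clf.

    det_feats:    FEATS column of the word attached as det (assumption: this
                  is the "head element" meant in the proposal).
    child_deprel: relation of the child attached to that det word.
    """
    # Proposal 1: any flat child (including subtypes such as flat:redup).
    if child_deprel.split(":", 1)[0] == "flat":
        return True
    # Proposal 2: a relative clause on a det that carries a Person feature.
    return child_deprel == "acl:relcl" and "Person" in parse_feats(det_feats)
```

Under these assumptions a possessive determiner with a Person feature may keep its acl:relcl child, while an article with the same child is still flagged.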

@amir-zeldes
Contributor

We have something similar to the case in 1. in Coptic where a word is repeated for distributive meaning:

  1. one one = "one by one"
  2. two two = "two by two, in pairs"
  3. color color = "color for color, every color"

Etc. 1-2 also work fine in modern Hebrew BTW, and 3. would work in the plural. What we did in UD Coptic was interpret them as nominal modifiers without a preposition (i.e. "one one" is the same as "one by one" with the word "by" suppressed). We then used the nmod:unmarked relation, which is a subtype of nmod used without a case marker.

@jasiewert
Contributor

This new rule invalidates an analysis in my Low Saxon dataset that I just presented last spring in my LREC-COLING paper and discussed with other UD people at the conference, even with @dan-zeman himself, if I remember correctly. It is explained in Section 5.1 here: https://aclanthology.org/2024.lrec-main.1388.pdf The gloss and translation of the sentence can be found in Section 4.3.

Attaching the possessor in dative case to the possessee instead of the determiner does not represent the way this construction works because 1) the dative possessor cannot be attached to the possessee without the determiner and 2) the possessee can be dropped while the determiner cannot. E.g., in the example in my paper, "In der Gemoene iarem." (literally "in the parish hers") is a valid answer to a specification question in whose service the person stands. (A note to German speakers: Masculine and neuter nouns show that this is indeed a dative, not a genitive.)
The alternative of changing the determiners' tags to PRON in Low Saxon would go against UD's own definition of determiners. I would therefore join @nschneid in asking you to relax the error to a warning, or else ask for language-specific exceptions to the rule.

nschneid referenced this issue in UniversalDependencies/UD_Erzya-JR Oct 21, 2024
@ftyers @jonorthwash Is there a way to get around Pronoun det with appos in (). This is something that might show up in a text «his (John's) text is strange.» I would have: det(text, his) appos(his, John's)
@lauma
Contributor

lauma commented Oct 21, 2024

Also, in Latvian we struggle with constructions similar to "such a high price that nobody could afford it" from the original post as well.

@rueter
Contributor

rueter commented Oct 21, 2024

Yes, @nschneid, I think the problem encountered in UD_Erzya-JR should be made explicit, here.
In Erzya (myv), Moksha (mdf) and Skolt Saami (sms), genitive forms of personal pronouns are regularly connected to their possessa with a ‹det› dependency.

sent_id = EKS:2011:39:15:ČesnokovF
Конат-конат сонзэ (Степан Иваныч) ладсо сырелгадсть...
Konat-konat    sonze    (Stepan Ivanych)  ladso    syrelgadstʹ...
such-such.Pl  his/her  (St. I.)                    in.way   become.older.3Pl

`some like him (Stepan Ivanich) had gotten older...'

obl(syrelgadstʹ, ladso)
det(ladso, sonze)
appos(sonze, Stepan)

This could also be dealt with as a postposition, where the noun ‹lad› `way' in the Inessive case would contribute to the same ‹obl› dependency

obl(syrelgadstʹ, sonze)
case(sonze, ladso)
appos(sonze, Stepan)

Departing from a ‹det› dependency, however, we could approach English (but this is not what EWT does).

His friends come from all over.
det(friends, his)

In linguistics, such a sentence might be quoted with an inserted identifier for contextual clarity, e.g.,

His (Fred's) friends come from all over.
det(friends, his)
appos(His, Fred's)

Authors themselves [their very selves], might do the same thing with commas:
His, Fred's, friends come from all over.
det(friends, his)
appos(His, Fred's)

Since the validator does not allow words with a ‹det› dependency to take children, one might opt to follow a Swedish lead and change all instances of genitive-case personal pronoun ‹det› to ‹nmod:poss/nmod:det›, but wouldn't that go against the established norm?

Here is an example of Swedish hennes ‹her› given with ‹nmod:poss› dependency
The genitive form of a third person singular personal pronoun 'her'

# sent_id = sv-ud-dev-78
# text = Börjar hennes jobb att delas av den moderne mannen?
1	Börjar	börja	VERB	VB|PRS|AKT	Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	0:root	_
2	hennes	hon	PRON	PS|UTR/NEU|SIN/PLU|DEF	Definite=Def|Poss=Yes|PronType=Prs	3	nmod:poss	3:nmod:poss	_
3	jobb	jobb	NOUN	NN|NEU|SIN|IND|NOM	Case=Nom|Definite=Ind|Gender=Neut|Number=Sing	1	nsubj	1:nsubj|5:nsubj	_
4	att	att	PART	IE	_	5	mark	5:mark	_
5	delas	dela	VERB	VB|INF|SFO	VerbForm=Inf|Voice=Pass	1	xcomp	1:xcomp	_
6	av	av	ADP	PP	_	9	case	9:case	_
7	den	den	DET	DT|UTR|SIN|DEF	Definite=Def|Gender=Com|Number=Sing|PronType=Art	9	det	9:det	_
8	moderne	modern	ADJ	JJ|POS|MAS|SIN|DEF|NOM	Case=Nom|Definite=Def|Degree=Pos|Gender=Com|Number=Sing	9	amod	9:amod	_
9	mannen	man	NOUN	NN|UTR|SIN|DEF|NOM	Case=Nom|Definite=Def|Gender=Com|Number=Sing	5	obl:agent	5:obl:agent	SpaceAfter=No
10	?	?	PUNCT	MAD	_	1	punct	1:punct	_

In Swedish, the first and second person pronouns are associated with distinct determiners that are called pronouns in UD: vår, min, er, din. These words inflect according to their possessa, and therefore they might be seen as analogous to the Czech possessive determiners.

`possessive determiners (which modify a nominal) (note that some languages use PRON for similar words): [cs] můj, tvůj, jeho, její, náš, váš, jejich'
See also
https://universaldependencies.org/cs/dep/nmod.html
The Czech is consistent.

https://universaldependencies.org/ru/dep/nmod.html
I note that Russian also ‹его карта› amod(карта, его)
translated as English ‹his card› amod(card, his)
Syntag appears to contradict this in ‹его мнению› `his opinion' det(мнению, его) but also `в его (и не только его, но и нашем) случае' ‹в его случае› `in his case' nmod(случае, его)

https://universaldependencies.org/en/dep/nmod.html
I note that the English provides ‹my office› nmod:poss(office, my)
which is the same coding as in EWT.

So it looks like there might be a Swedish–English consensus for nmod:poss use with possessive pronouns, and genitive personal pronouns.

There is disparity within the Russian corpora alongside a consistent Czech.

@johnnymoretti

211 treebanks are invalidated by this new rule, and we need guidance on what to do before the freeze!!! Please provide brief and clear instructions, as aligning the treebanks with this rule requires a lot of work.

@KoichiYasuoka
Contributor

In Classical Chinese 彼此兵 (those and these soldiers) is invalidated by this new rule. How do we solve it?

# sent_id = KR2b0041_018_par8_1550-1557
# text = 訂彼此兵不得過關
1	訂	訂	VERB	v,動詞,行為,動作	_	0	root	_	Gloss=settle|SpaceAfter=No
2	彼	彼	PRON	n,代名詞,指示,*	PronType=Dem	4	det	_	Gloss=that|SpaceAfter=No
3	此	此	PRON	n,代名詞,指示,*	PronType=Dem	2	flat	_	Gloss=this|SpaceAfter=No
4	兵	兵	NOUN	n,名詞,人,役割	_	7	nsubj	_	Gloss=soldier|SpaceAfter=No
5	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	6	advmod	_	Gloss=not|SpaceAfter=No
6	得	得	AUX	v,助動詞,可能,*	Mood=Pot	7	aux	_	Gloss=must|SpaceAfter=No
7	過	過	VERB	v,動詞,行為,移動	_	1	ccomp	_	Gloss=pass|SpaceAfter=No
8	關	關	NOUN	n,名詞,固定物,建造物	Case=Loc	7	obj	_	Gloss=bar|SpaceAfter=No

those_and_these_soldiers
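
A quick way to see which attachments the rule will look at is to list every child of a det or clf word in a CoNLL-U file. The following is a minimal sketch, not the validator's own code: `read_tokens` and `children_of_det_clf` are hypothetical helper names, and the sample reproduces only the first four rows of the tree above, with the XPOS and MISC columns blanked to `_`.

```python
def read_tokens(conllu_text):
    """Collect ID, FORM, HEAD and DEPREL for the basic words of one sentence."""
    tokens = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. 34-35) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        tokens.append({"id": int(cols[0]), "form": cols[1],
                       "head": int(cols[6]), "deprel": cols[7]})
    return tokens


def children_of_det_clf(tokens):
    """Return (parent id, form, deprel, child id, form, deprel) tuples."""
    by_id = {t["id"]: t for t in tokens}
    hits = []
    for t in tokens:
        parent = by_id.get(t["head"])  # None for the root or a missing head
        if parent and parent["deprel"].split(":", 1)[0] in ("det", "clf"):
            hits.append((parent["id"], parent["form"], parent["deprel"],
                         t["id"], t["form"], t["deprel"]))
    return hits


# First four rows of the sentence above; XPOS and MISC are blanked here.
ROWS = [
    ["1", "訂", "訂", "VERB", "_", "_", "0", "root", "_", "_"],
    ["2", "彼", "彼", "PRON", "_", "PronType=Dem", "4", "det", "_", "_"],
    ["3", "此", "此", "PRON", "_", "PronType=Dem", "2", "flat", "_", "_"],
    ["4", "兵", "兵", "NOUN", "_", "_", "7", "nsubj", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

for hit in children_of_det_clf(read_tokens(SAMPLE)):
    print(hit)
```

On this sample the only pair reported is 此 (flat) attached to 彼 (det). The sketch lists all children, allowed or not, so that the question of which relations to permit stays separate from finding the cases.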

@rueter
Contributor

rueter commented Oct 24, 2024

@nschneid, hi! the UD_Finnish-FTB has an interesting construction

# sent_id = j7hnk-6227
«Viron presidentti Lennart Meri on yksi niitä [ilmeisen harvoja valtiomiehiä], jotka laativat puheensa itse.»
«The President of Estonia, Lennart Meri, is one of the [apparently few statesmen] who write their own speeches.»

8       ilmeisen        ilmeinen        ADJ     A,Sg,Gen        Case=Gen|Number=Sing    9       amod    _       _
9       harvoja harva   DET     Pron,Qnt,Pl,Par Case=Par|Number=Plur|PronType=Ind       10      det     _       _
10      valtiomiehiä    valtiomies      NOUN    N,Pl,Par        Case=Par|Number=Plur    0       root    _       _

ilmeisen harvoja valtiomiehiä
The genitive-case adjective ‹apparent› modifies the determiner ‹few›.
This same construction with a genitive-case adjective is observed with expressions of color, e.g.

sininen ‹blue›
vaalean + sininen ‹light + blue›
tumman + punainen ‹dark + red›
NB! in Finnish these are written as one word, i.e., vaaleansininen, tummanpunainen, except, perhaps, when saying ‹especially dark red› erittäin tumman punainen
amod(red, dark)
advmod(dark, especially)
The Finnish grammar might make reference to an instructive case in -n, but this instance ilmeisen harvoja valtiomiehiä does not seem to fall into that category: https://kaino.kotus.fi/visk/sisallys.php?p=389.
What do you think @fginter, @flammie, @jpiitula?

rueter added a commit to UniversalDependencies/UD_Finnish-FTB that referenced this issue Oct 24, 2024
@jpiitula
sent_id = j7hnk-6227
is problematic.
See UniversalDependencies/docs#1059
@johnnymoretti

The rule in the validator script is something like this:

if re.match(r"^(det|clf)$", pdeprel) and not re.match(r"^(advmod|obl|goeswith|fixed|reparandum|conj|cc|punct)$",cdeprel) :

If I understand correctly, we are allowed to use only obl and not obl:cmp, right? If so, why? Shouldn't the main dependency relation also cover its subtypes?
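
To make the subtype question concrete, here is a minimal sketch built around the two patterns quoted above. The function names and the `strip_subtypes` switch are hypothetical, and the thread does not show whether the real validator strips subtypes before this test; the sketch only shows that, with `^...$` anchoring, a subtyped label matches only if the subtype is removed first.

```python
import re

# Patterns as quoted in the comment above.
PARENT_RE = re.compile(r"^(det|clf)$")
ALLOWED_CHILD_RE = re.compile(
    r"^(advmod|obl|goeswith|fixed|reparandum|conj|cc|punct)$")


def universal(deprel):
    """Drop a language-specific subtype: 'obl:cmp' -> 'obl'."""
    return deprel.split(":", 1)[0]


def violates_leaf_det_clf(pdeprel, cdeprel, strip_subtypes=True):
    """True if a child with relation cdeprel under a parent with relation
    pdeprel would be reported by the rule as sketched here."""
    if strip_subtypes:
        pdeprel, cdeprel = universal(pdeprel), universal(cdeprel)
    return bool(PARENT_RE.match(pdeprel)) and not ALLOWED_CHILD_RE.match(cdeprel)


# With the full label, obl:cmp is not in the anchored list and is reported.
print(violates_leaf_det_clf("det", "obl:cmp", strip_subtypes=False))
# With the subtype stripped, obl:cmp is treated like obl and is accepted.
print(violates_leaf_det_clf("det", "obl:cmp", strip_subtypes=True))
```

The same anchoring applies to the parent: without stripping, a subtyped parent label such as det:poss would not match `^(det|clf)$` and its children would not be checked at all.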

@KoichiYasuoka
Contributor

Thank you @johnnymoretti but I think that det and clf cannot be treated in the same way. In Thai clf can be modified by ADJ or PRON (whose). On the other hand det can be linked by flat or conj...

@johnnymoretti

@KoichiYasuoka For sure, I'm not going into detail about the language, I've just reported what the rule says. At the moment det and clf are in the same rule.

@Stormur
Contributor

Stormur commented Oct 24, 2024

In Classical Chinese 彼此兵 (those and these soldiers) is invalidated by this new rule. How do we solve it?

Why not conj here?

@lauma
Contributor

lauma commented Oct 24, 2024

In Latvian we occasionally have the subordinate clause problem as well - tās somas, ko atrada vakar 'those bags which were found yesterday', because in this situation we might as well be talking about various kinds of bags, some of which were found yesterday and some not. We struggle to apply the concept of determiners to Latvian in general, but this seems to be a determiner situation, right?

@Stormur
Contributor

Stormur commented Oct 24, 2024

https://universaldependencies.org/en/dep/nmod.html I note that the English provides ‹my office› nmod:poss(office, my) which is the same coding as in EWT.

So it looks like there might be a Swedish–English consensus for nmod:poss use with possessive pronouns, and genitive personal pronouns.

But eng. my is not a pronoun... actually, I do not understand how my office can use nmod in English in the current standard.


The case you report

sent_id = EKS:2011:39:15:ČesnokovF
Конат-конат сонзэ (Степан Иваныч) ладсо сырелгадсть...
Konat-konat    sonze    (Stepan Ivanych)  ladso    syrelgadstʹ...
such-such.Pl  his/her  (St. I.)                    in.way   become.older.3Pl

looks very similar to the Latin one I discussed: you have one element referring to the Person which is in сонзэ and which cannot go with ладсо. A kind of "double referent" phrase.


But we might be onto something regarding elements adding Persons. The case you report makes me wonder if indeed any element like this warrants nmod even when it looks like other DETs.

@jasiewert
Contributor

jasiewert commented Oct 25, 2024

I know that the page describes Russian. Does that make the description less valid?

@jnivre
Contributor

jnivre commented Oct 25, 2024

Thanks! But that is taken from the language-specific guidelines for Russian (note the "ru" in the URL), not from the universal guidelines. And further down the page there are examples of nmod cases that do show agreement, so it doesn't seem to be intended as a real criterion. But the fact that people writing language-specific guidelines have drawn this conclusion clearly shows that the universal guidelines are in need of clarification. :)

@jnivre
Contributor

jnivre commented Oct 25, 2024

I know that the page describes Russian. Does that make the description less valid?

Well, it means at least that it only applies to that language. And if it is inconsistent with the universal guidelines, I would say it's problematic. However, as I already said, the basic problem seems to be that the universal guidelines are not clear enough to begin with.

@jasiewert
Contributor

Talking about the "universal" guidelines, they are indeed not of much help here in their current state since they only mention that in some languages the nmod relation is out of question for possessive determiners, but do not give any guidelines on how to distinguish det from nmod: "This is not yet completely parallel across languages; in some languages, it is much more clear than in English how possessive determiners relate to adjectives, and the nmod relation is out of question." https://universaldependencies.org/u/dep/det.html
In any case, this guideline does show that in some languages, possessive determiners might have to be annotated as something else than nmod. However, from this conversation, I get the impression that we are now being pushed towards annotating any possessive determiners as nmod no matter how they behave morpho-syntactically in the various languages. Is this a misunderstanding?

nschneid changed the title from "New validator rule: leaf-det-clf" to "New validator rule: leaf-det-clf (and det vs. nmod)" on Oct 25, 2024
@jnivre
Contributor

jnivre commented Oct 25, 2024

I personally think that this might be preferable for cross-linguistic consistency, but as you correctly point out the guidelines do allow both options and there has been no amendment to the guidelines on this point. In particular, the new validator rule that is the original focus of this issue was not (as far as I know) introduced with this in mind. So the main point I take from this discussion is that we need to improve our guidelines for the nmod relation. Whether this will lead to an amendment or only a clarification is too early to say at this point.

@rueter
Contributor

rueter commented Oct 25, 2024

@jnivre, @sylvainkahane and @jasiewert, I am suggesting that the Swedish "ditt hus" 'your house' could be annotated as det(hus, ditt), but "hans hus" 'his house' nmod:poss(hus, hans).

This describes a distinction between structures, such as parallel between 3Sg genitive form in and nouns in the genitive, on the one hand,
nmod:poss(дом, его) 'his home', nmod:poss(kotini, minun) 'my home' fin.

and "ditt hus" possessive determiners that agree with their head words, i.e., "ta maison", "твой дом", on the other.
det:poss(maison, ta), det:poss(дом, твой), det:poss(Haus, sein)

This, of course, does not answer the English dilemma with my, thy, your, our, their, her.
I am leaving the words "his" and "its" out.

@amir-zeldes
Contributor

This, of course, does not answer the English dilemma with my, thy, your, our, their, her. I am leaving the words "his" and "its" out.

Actually "their" and "her" are also historically genitives, like "its" and "his", though some of the pronoun forms are Anglo-Saxon and others are borrowed from Scandinavian (e.g. their, which comes from the dem. stem, not the proper personal stem, Old English "hiera"). But the fact that it's basically impossible to tell which is which now shows that this probably doesn't play a role in how we should analyze the syntax for English - synchronically, none of the forms show any kind of agreement.

@jnivre
Contributor

jnivre commented Oct 25, 2024

I agree with @amir-zeldes. Although we must always use language-specific criteria when interpreting the guidelines for a specific language, I don't think the presence of agreement should be the basis of the distinction between det and nmod.

@lrituma

lrituma commented Oct 25, 2024

For spoken data, we need three relations to be added to the validator:

  • discourse, which is very common between two determiners in false starts: "a, uh, a gap", "my, uh, our friend"
  • parataxis for cases such as "a, I don't know how to call that, a kiosk, …": here we have a reparandum link between the two "a"s and we would like to attach the parenthesis to the first "a". More exactly we use parataxis:parenth in our spoken French treebanks.
  • dep for false starts such as "the last, the last day": here "the last" forms a phrase the head of which is missing and we decided to have dep(the, last). I am not against another solution, as long as "the last" is still a phrase.

I still don't see an answer in the thread for such cases where an unambiguous determiner is explained by another phrase. How to annotate these sentences to pass the new validation rule?

Some examples from Latvian:
.. tādā godīgā iestādē ieperinājušies daži (tikai daži!) zagļi .. - "a few (only a few!) thieves have nested in such an honest institution"
ar šādām, reizēm ļoti kategoriskām, pozīcijām - "with such, sometimes very categorical, positions" (šādām is a determiner and it has an agreement with the noun after the insertion)

@jasiewert
Contributor

I am surprised that semantic criteria should override morpho-syntactic criteria, given that the dependency relations are called "syntactic relations" in the documentation, not "semantic relations".

Isn't agreement a rather decisive criterion when distinguishing, e.g., amod from nmod, at least in languages that exhibit adjective agreement? (Btw., precise criteria to distinguish amod from both det and nmod would probably be valuable as well.)
Treating various types of possessive constructions as nmod would certainly make things more uniform across corpora, but to me this sounds a bit like treating uniformity as an end in itself at the expense of representing linguistic distinctions.

@amir-zeldes
Contributor

I agree with @amir-zeldes. Although we must always use language-specific criteria when interpreting the guidelines for a specific language, I don't think the presence of agreement should be the basis of the distinction between det and nmod.

Um, thanks but I don't think I said the last part 😅 I do think agreement is an important syntactic phenomenon to consider, and my understanding was that the individual language guidelines do differ in how they treat possessives. Whether that's a good idea or not is debatable.

Some aspects of UD in practice ignore morpho-syntactic facts such as agreement, for example in treating copulas as auxiliaries, and I do see the logic of that, especially for languages with split copula systems (so the subject of a Russian nominal sentence depends on the lexical predicate, whether or not a copula is present). The same argument could be made for split possessive systems, like English was historically and the Czech one synchronically as well, but we could also make the opposite argument, as I think @jasiewert is doing. My only point was about English, where I think going for a split system (interlocutive my, your, our as det but delocutive her, his, its as nmod) is particularly uncompelling, because synchronically there is no evidence one way or the other. Between the two, I prefer nmod for English, because it means that all personal pronouns can have content-y deprels, rather than the function-y det, and it unifies pronominal and nominal possession (genitive 's) in a way that seems systematic and satisfying. In other languages, things could play out very differently, and I don't believe in giving English special importance in the discussion of universal guidelines.

@jnivre
Contributor

jnivre commented Oct 25, 2024

@amir-zeldes Sorry about misrepresenting your position. I still don't think agreement is decisive, but that requires a longer argument. And I completely agree that English should not be given priority.

@jnivre
Contributor

jnivre commented Oct 25, 2024

@jasiewert At the universal level, we always have to rely on functional criteria, because they are the only ones that are universally applicable. But please note that functional is not purely semantic, it is semantic + information packaging. When we develop guidelines for a specific language, we therefore have to start by using the functional criteria to identify the prototypical cases (such as primary transitives for core arguments), observe what morphosyntactic criteria are characteristic of those, and then extend them to other cases. I think agreeing possessives are a typical example where two such characteristics clash in many languages: the referentiality of nominal modifiers and the agreement patterns of adjectival modifiers. Which one should be given preference may ultimately depend on other factors, which is why the current guidelines allow both options. I may be biased towards referentiality myself, but I fully respect that the facts are different in different languages. But, for the record, my original comment was with respect to Chinese, where agreement clearly is not a criterion. I hope this at least clarifies my position.

@jnivre
Contributor

jnivre commented Oct 25, 2024


For spoken data, we need three relations to be added to the validator:

  • discourse, which is very common between two determiners in false starts: "a, uh, a gap", "my, uh, our friend"
  • parataxis for cases such as "a, I don't know how to call that, a kiosk, …": here we have a reparandum link between the two "a"s and we would like to attach the parenthesis to the first "a". More exactly we use parataxis:parenth in our spoken French treebanks.
  • dep for false starts such as "the last, the last day": here "the last" forms a phrase the head of which is missing and we decided to have dep(the, last). I am not against another solution, as long as "the last" is still a phrase.

I still don't see an answer in the thread for such cases where an unambiguous determiner is explained by another phrase. How to annotate these sentences to pass the new validation rule?

Some examples from Latvian:
.. tādā godīgā iestādē ieperinājušies daži (tikai daži!) zagļi .. - "a few (only a few!) thieves have nested in such an honest institution"
ar šādām, reizēm ļoti kategoriskām, pozīcijām - "with such, sometimes very categorical, positions" (šādām is a determiner and it has an agreement with the noun after the insertion)

I may be mistaken, but in the second example, it looks like the second modifier could be an adjectival modifier that attaches to the nominal head. Is there anything that rules out such an analysis?

@KoichiYasuoka
Contributor

At the moment det and clf are in the same rule.

Well... I think we need different criteria to validate det and clf, especially when they are used together. But can we make the validator for clf really "universal"?

# text = หนังสือเล่มนี้ของเธอ
# text_zh = 她的這本書
1	หนังสือ	_	NOUN	NN	_	0	root	_	SpaceAfter=No|Translit=書
2	เล่ม	_	NOUN	CL	_	1	clf	_	SpaceAfter=No|Translit=本
3	นี้	_	DET	DT	_	2	det	_	SpaceAfter=No|Translit=這
4	ของ	_	ADP	IN	_	5	case	_	Gloss=of|SpaceAfter=No
5	เธอ	_	PRON	PRN	_	1	nmod:poss	_	SpaceAfter=No|Translit=她

clf-det

@leky40

leky40 commented Oct 26, 2024

At the moment det and clf are in the same rule.

Well... I think we need different criteria to validate det and clf, especially when they are used together. But can we make the validator for clf really "universal"?

# text = หนังสือเล่มนี้ของเธอ
# text_zh = 她的這本書
1	หนังสือ	_	NOUN	NN	_	0	root	_	SpaceAfter=No|Translit=書
2	เล่ม	_	NOUN	CL	_	1	clf	_	SpaceAfter=No|Translit=本
3	นี้	_	DET	DT	_	2	det	_	SpaceAfter=No|Translit=這
4	ของ	_	ADP	IN	_	5	case	_	Gloss=of|SpaceAfter=No
5	เธอ	_	PRON	PRN	_	1	nmod:poss	_	SpaceAfter=No|Translit=她

clf-det

This might not be related to the main topic discussed here, but I was wondering if the prepositional phrase ของเธอ (of her) could modify the noun เล่ม (used as a classifier in this structure) instead of the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book). In Thai, the head noun modified by the classifier phrase can be omitted when it has been previously mentioned and is known. The noun used as a classifier then becomes the head noun of the NP, as in these trees:

From the full phrase:

# text = หนังสือเล่มนี้ของเธอ
# text_en = this book of hers
# gloss = /nǎŋ-sɯ̌ɯ/ /lêm/ /níi/ /khɔ̌ɔŋ/ /thɤɤ/
1	หนังสือ	_	NOUN	_	_	0	root	_	_
2	เล่ม	_	NOUN	_	_	1	clf	_	_
3	นี้	_	DET	_	_	2	det	_	_
4	ของ	_	ADP	_	_	5	case	_	_
5	เธอ	_	PRON	_	_	2	nmod:poss	_	_

it becomes this:

# text = เล่มนี้ของเธอ
# text_en = this book of hers (if the book is previously mentioned and known)
# gloss = /lêm/ /níi/ /khɔ̌ɔŋ/ /thɤɤ/
1	เล่ม	_	NOUN	_	_	0	root	_	_
2	นี้	_	DET	_	_	1	det	_	_
3	ของ	_	ADP	_	_	4	case	_	_
4	เธอ	_	PRON	_	_	1	nmod:poss	_	_

So clf is no longer used in this structure, but it's still known that it is used as a classifier for the book previously mentioned.

Apart from this reason, the noun เล่ม /lêm/ is the head noun of the modifier phrase modifying the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book). A noun used as a classifier in Thai cannot combine with the head noun on its own; it must itself be modified before it can modify the head noun. The bare structure (noun + classifier) is ungrammatical in Thai.

I am not sure if there is a difference. Thai uses postmodifiers, except that numbers and quantifiers expressing quantity are placed before a noun used as a classifier.

And I guess my annotation with clf might not pass validation.

P.S. I just realised that I glossed the word เธอ /thɤɤ/ (PRON) incorrectly in English. It should be "she", not "her". In Thai, a single word serves as both the personal and the possessive pronoun.

@KoichiYasuoka

Thank you @leky40 but the latter example เล่มนี้ของเธอ

So clf is no longer used in this structure, but it's still known that it is used as a classifier for the book previously mentioned.

is somewhat removed from this issue, "leaf-det-clf" ... ah well, OK, let's investigate หนังสือสองเล่มนี้ของเธอ now. What do you think about this? I think we can omit หนังสือ from หนังสือสองเล่มนี้ของเธอ; then what structure is suitable for สองเล่มนี้ของเธอ?

# text = หนังสือสองเล่มนี้ของเธอ
1	หนังสือ	_	NOUN	_	_	0	root	_	Gloss=book|SpaceAfter=No
2	สอง	_	NUM	_	_	1	nummod	_	Gloss=two|SpaceAfter=No
3	เล่ม	_	NOUN	_	_	2	clf	_	Gloss=[classifier]|SpaceAfter=No
4	นี้	_	DET	_	_	3	det	_	Gloss=this|SpaceAfter=No
5	ของ	_	ADP	_	_	6	case	_	Gloss=of|SpaceAfter=No
6	เธอ	_	PRON	_	_	1	nmod:poss	_	Gloss=she|SpaceAfter=No

@leky40

leky40 commented Oct 26, 2024

My analysis is different from the tree above.

Numbers are quite tricky. In Thai, a number placed before the noun it modifies expresses quantity; placed after the noun, it expresses sequence (order).

In the structure you show above, the number "two" expresses quantity and modifies the noun เล่ม /lêm/, which is used as a classifier for the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book) in this noun phrase.

I was asked how I knew that the number two modifies the noun เล่ม /lêm/, which is used as a classifier. I would say by its position, and by how the corresponding question is formed: it would be "กี่เล่ม (how many + the noun เล่ม /lêm/)", not "กี่หนังสือ (how many + book หนังสือ /nǎŋ-sɯ̌ɯ/)".

A noun used as a classifier is the head noun of the modifier phrase. If the noun /lêm/, which is used as a classifier for the book, is omitted from the structure above, the result is ungrammatical. That's why the number cannot modify the head noun หนังสือ /nǎŋ-sɯ̌ɯ/ (book).

So the trees I annotated would be:

# text = หนังสือสองเล่มนี้ของเธอ
# text_en = these two books of hers
# gloss = /nǎŋ-sɯ̌ɯ/ /sɔ̌ɔŋ/ /lêm/ /níi/ /khɔ̌ɔŋ/ /thɤɤ/
1	หนังสือ	_	NOUN	_	_	0	root	_	_
2	สอง	_	NUM	_	_	3	nummod	_	_
3	เล่ม	_	NOUN	_	_	1	clf	_	_
4	นี้	_	DET	_	_	3	det	_	_
5	ของ	_	ADP	_	_	6	case	_	_
6	เธอ	_	PRON	_	_	3	nmod:poss	_	_

# text = สองเล่มนี้ของเธอ
# text_en = these two books of hers (when the books are previously mentioned and known)
# gloss = /sɔ̌ɔŋ/ /lêm/ /níi/ /khɔ̌ɔŋ/ /thɤɤ/
1	สอง	_	NUM	_	_	2	nummod	_	_
2	เล่ม	_	NOUN	_	_	0	root	_	_
3	นี้	_	DET	_	_	2	det	_	_
4	ของ	_	ADP	_	_	5	case	_	_
5	เธอ	_	PRON	_	_	2	nmod:poss	_	_

I know that these annotations of mine might not pass validation. But this is how a Thai classifier works with other modifiers. Apart from adding specificity / emphasis to a head noun, a noun used as a classifier can also replace the head noun it modifies. It is a noun, functioning as a classifier.
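
The relationship between the two trees above can be sketched in Python: if every other word already attaches to the classifier, omitting the head noun only requires promoting the classifier to root and renumbering. `drop_head_noun` is a hypothetical helper written for this illustration, not part of any UD tool, and it only handles the case where the classifier is the sole dependent of the head noun.

```python
def drop_head_noun(tokens):
    """Remove the root noun and promote its clf dependent to root.

    tokens: list of dicts with keys id, form, head, deprel.
    Hypothetical helper; assumes clf is the only dependent of the root.
    """
    root = next(t for t in tokens if t["head"] == 0)
    clf = next((t for t in tokens
                if t["head"] == root["id"] and t["deprel"] == "clf"), None)
    if clf is None:
        raise ValueError("root has no clf dependent")
    if any(t["head"] == root["id"] and t is not clf for t in tokens):
        raise ValueError("other dependents of the head noun need re-attachment")

    kept = [t for t in tokens if t is not root]
    new_id = {t["id"]: i for i, t in enumerate(kept, start=1)}  # renumber
    result = []
    for t in kept:
        if t is clf:
            head, deprel = 0, "root"  # the classifier becomes the new root
        else:
            head, deprel = new_id[t["head"]], t["deprel"]
        result.append({"id": new_id[t["id"]], "form": t["form"],
                       "head": head, "deprel": deprel})
    return result

# The full phrase, with the attachments proposed in the tree above.
FULL = [dict(id=i, form=f, head=h, deprel=d) for i, f, h, d in [
    (1, "หนังสือ", 0, "root"),
    (2, "สอง", 3, "nummod"),
    (3, "เล่ม", 1, "clf"),
    (4, "นี้", 3, "det"),
    (5, "ของ", 6, "case"),
    (6, "เธอ", 3, "nmod:poss"),
]]

for t in drop_head_noun(FULL):
    print(t["id"], t["form"], t["head"], t["deprel"])
```

The result has the same heads and relations as the second tree (สองเล่มนี้ของเธอ), with เล่ม as root.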

@lauma

lauma commented Oct 26, 2024

I still don't see an answer in the thread for cases where an unambiguous determiner is explained by another phrase. How should these sentences be annotated to pass the new validation rule?

Some examples from Latvian:

  • .. tādā godīgā iestādē ieperinājušies daži (tikai daži!) zagļi .. - "a few (only a few!) thieves have nested in such an honest institution"
  • ar šādām, reizēm ļoti kategoriskām, pozīcijām - "with such, sometimes very categorical, positions" (šādām is a determiner and it agrees with the noun after the insertion)

I may be mistaken, but in the second example, it looks like the second modifier could be an adjectival modifier that attaches to the nominal head. Is there anything that rules out such an analysis?

Šāds is not an adjective in Latvian, as it does not form comparative degrees and it does not have definite/indefinite endings. It is a pronoun that tends to take the place of an adjective in the sentence. For us it is totally fine to annotate such pronouns as adjectival modifiers; in fact, we would love to. But if we do so, then there is nothing left to be annotated with the det role. Is it okay to forgo the det role altogether just because the language has nothing like actual articles?

(Sorry if the question is kinda dumb, we honestly struggle with understanding and applying the determiner concept correctly :) )

@jnivre

jnivre commented Oct 26, 2024

I meant that "kategoriskam" was an amod, but I may have misunderstood what the problem is. Which word is it that you want to have as a dependent of "sadam"? (Please excuse the lack of diacritics.)

@lauma

lauma commented Oct 27, 2024

Our current analysis is:

  • root of the fragment: pozīcijām ('positions', noun in plural dative).
  • det: šādām (inflecting pronoun, meaning something like 'such as' or 'like these') attached to pozīcijām, as it agrees in number and case with pozīcijām.
  • ????: kategoriskām ('categorical', adjective in plural dative) attached to šādām, because it is located between commas after šādām, agrees with it and further specifies what this šādām entails. (It is also reasonably common in Latvian to attach a whole acl in these situations, and this is considered to be notably different from a noun just having two separate attributes coordinated or noncoordinated attributes like 'little green house'.)
  • advmod: ļoti ('very', adverb, not inflecting) attached to kategoriskām.

Thus, I assumed that you suggested that we should mark šādām as nmod, not det, and got confused. Would it be appropriate to treat kategoriskām as obl if šādām stays det?

@jnivre

jnivre commented Oct 27, 2024

@lauma Thanks for clarifying. I did not suggest that you should mark anything as nmod here. That discussion was about another example.

From your description of the example, it definitely sounds like "kategoriskām" should be attached as an amod to the head noun (that is, as a sister to the det node). That is normally the right analysis for a DET-ADJ-NOUN sequence, and I don't think the punctuation changes this fact.

Similarly, I think you should consider attaching the two determiners as sisters in the first example. The principle of prioritising content words in UD has as a consequence that you often get a flat analysis where other frameworks would have a hierarchical analysis. This is true, for example, of multiple auxiliaries attaching to the same word, and it is the prescribed analysis for multiple determiners as well. I am aware that the first example is a bit special, but I think this could still be the least bad analysis (given the general principles of UD).
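
The sister analysis suggested above can be sketched as a small tree transformation. This is only an illustration: `flatten_det_children` is a hypothetical helper, the token list is reduced to the four words listed in the analysis earlier in the thread (the preposition, reizēm, and the commas are left out), and `dep` stands in for the relation marked ???? there.

```python
def flatten_det_children(tokens, new_deprel=None):
    """Re-attach every child of a det node to the det's own head.

    tokens: dict mapping id -> {"form", "head", "deprel"}.
    new_deprel: optional dict mapping a child id to its relation after the move.
    Hypothetical helper; handles one level only (no det-under-det chains).
    """
    new_deprel = new_deprel or {}
    result = {tid: dict(tok) for tid, tok in tokens.items()}
    for tid, tok in tokens.items():
        parent = tokens.get(tok["head"])
        if parent is not None and parent["deprel"] == "det":
            result[tid]["head"] = parent["head"]  # become a sister of det
            result[tid]["deprel"] = new_deprel.get(tid, tok["deprel"])
    return result

# Reduced fragment: kategoriskām hangs under the determiner šādām.
hierarchical = {
    1: {"form": "šādām", "head": 4, "deprel": "det"},
    2: {"form": "ļoti", "head": 3, "deprel": "advmod"},
    3: {"form": "kategoriskām", "head": 1, "deprel": "dep"},  # placeholder relation
    4: {"form": "pozīcijām", "head": 0, "deprel": "root"},
}

flat = flatten_det_children(hierarchical, new_deprel={3: "amod"})
for tid, tok in flat.items():
    print(tid, tok["form"], tok["head"], tok["deprel"])
```

After the transformation kategoriskām attaches to pozīcijām as amod, šādām remains a leaf det, and ļoti keeps its advmod attachment to kategoriskām.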

@sylvainkahane

I understand @jnivre's remark about the functionalist approach of UD and I see some advantages in having a common annotation of possessives. Note that we already have a feature Poss=Yes for that.
There are also some advantages to indicating the differences between languages. For instance, English treebanks use nmod:poss both for possessive pronouns and for the Saxon genitive "NP's", but not for the genitive "of NP". I suppose it is a way to indicate that "NP's" occupies the same syntactic position as possessive pronouns (which is also the position of determiners). In other languages, possessives and genitive NPs occupy very different syntactic positions and it could be strange to annotate them similarly, or at least, if we do that, we lose something concerning the grammar of the language.
Moreover, as noted by @jasiewert, in some languages possessives agree with their nominal governor (like adjectives and unlike genitive NPs), and the nominal status of possessives is debatable. It could also be nice to be able to indicate whether or not possessives occupy the same position as determiners, when there is such a position.

@nschneid

As Croft and others have shown, any given language-particular construction (formalized via a feature, UPOS, or deprel) can be named based on its prototypical function, but its actual application will tend to extend beyond that prototype, so line-drawing becomes tricky in many cases (do we prioritize the general function viewed crosslingually or the language-internal tests?). E.g. there are English-specific morphosyntactic arguments that the English Determiner Relation/Slot should encompass both core determiners and prenominal possessives, but for purposes of meaning and crosslinguistic comparison, possessive dependents are quite different from core determiners.

I do think UD's prioritization of content relations is a principle that resolves this, but as @sylvainkahane notes it leaves something out in terms of grouping together language-internal constructions defined by morphosyntactic distribution. Maybe we should move toward annotating the additional categorizations in a separate layer (e.g. the UCxn approach) to expose the broader category of determiner relations.

Whether of-PPs ought to be grouped together with Saxon genitives under any approach to syntax seems dubious to me, though they certainly have semantic parallels.
