Toys R Us = Toy Be We? #1058

Open
amir-zeldes opened this issue Oct 7, 2024 · 17 comments

Comments

@amir-zeldes
Contributor

I'm running into an issue lemmatizing "Toys R Us" in English. Here are the possibly conflicting guidelines:

  • Toys R Us is a compositionally interpretable name, so based on the guidelines, it should be given a normal morphosyntactic analysis - it is a copula clause, with R/cop
  • Based on English xpos guidelines it is tagged NNP, therefore the lemma should be capitalized, so presumably "Be"
  • The validator limits the forms of copulas and only accepts the lemma "be" as an auxiliary/copula

What is the right thing to do here?

  1. Add an alternative English copula lemma spelling "Be" - in a way, if we are serious about the capital lemma guideline, this will be necessary for all sorts of names containing "Be" which have a transparent syntactic analysis.
  2. Use the lowercase lemma "be" whenever we have a transparent copula, even if it is part of a capitalized name
  3. Not treat "Toys R Us" (or any name containing a capitalized copula) as transparent, and go with flat - keep in mind that this would also affect very transparent cases, like the novel "I Am a Cat" by Natsume Souseki.

Thoughts?
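For readers who want the conflict spelled out, here is a minimal sketch of the kind of check described above. It is not the actual UD validator code: the `ALLOWED_COPULA_LEMMAS` table, the `copula_lemma_errors` function and the token dictionaries are all hypothetical, and the capitalized lemmas "Toy", "Be", "We" are simply what the capital-lemma guideline would produce.

```python
# Hypothetical sketch only; the real UD validator is organized differently.
# Assumed per-language whitelist of copula lemmas.
ALLOWED_COPULA_LEMMAS = {"en": {"be"}}


def copula_lemma_errors(tokens, lang="en"):
    """Return one message per token attached as cop whose lemma is not whitelisted."""
    allowed = ALLOWED_COPULA_LEMMAS.get(lang, set())
    errors = []
    for tok in tokens:
        if tok["deprel"] == "cop" and tok["lemma"] not in allowed:
            errors.append(f"'{tok['lemma']}' is not an allowed copula lemma for {lang}")
    return errors


# "Toys R Us" with a transparent analysis but capitalized name lemmas.
toys_r_us = [
    {"form": "Toys", "lemma": "Toy", "deprel": "nsubj"},
    {"form": "R", "lemma": "Be", "deprel": "cop"},
    {"form": "Us", "lemma": "We", "deprel": "root"},
]

errors = copula_lemma_errors(toys_r_us)
```

Under these assumptions the capitalized "Be" is rejected, while the same tree with lemma "be" passes, which is exactly the choice between options 1 and 2.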

@dan-zeman
Member

The problem seems to be that you want to have and not to have a transparent analysis at the same time. I think that one must select one of the following approaches and stick to it:

  • Transparent. The words are tagged NOUN AUX PRON. Their lemmas are lowercased except for the personal pronoun "I": toy, be, I (or we; I don't remember how you lemmatize personal pronouns in English). The relations are nsubj(Us, Toys); cop(Us, R).
  • Not transparent. Tags are PROPN, lemmas are capitalized and perhaps equal to forms? Relations are probably flat.

I think my favorite would be the transparent option, but definitely with lowercase "be" as the lemma of the copula. But I could accept the non-transparent approach, provided it is not mixed with the transparent one.
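To make the two options concrete, the sketch below renders each as CoNLL-U rows. Several things here are assumptions rather than anything stated in this comment: the helper `to_conllu` is hypothetical, XPOS and FEATS are left as `_`, the name is treated as a standalone sentence (so one token is the root), the lemma of "Us" is taken to be "we", and the non-transparent lemmas are set equal to the forms with the first word as head of the flat structure.

```python
def to_conllu(rows):
    """Render (form, lemma, upos, head, deprel) tuples as 10-column CoNLL-U lines."""
    lines = []
    for i, (form, lemma, upos, head, deprel) in enumerate(rows, start=1):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(i), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines)


# Option 1: transparent analysis, ordinary tags and lowercase lemmas.
transparent = [
    ("Toys", "toy", "NOUN", 3, "nsubj"),
    ("R", "be", "AUX", 3, "cop"),
    ("Us", "we", "PRON", 0, "root"),
]

# Option 2: non-transparent analysis, PROPN throughout and flat relations.
non_transparent = [
    ("Toys", "Toys", "PROPN", 0, "root"),
    ("R", "R", "PROPN", 1, "flat"),
    ("Us", "Us", "PROPN", 1, "flat"),
]

print(to_conllu(transparent))
print()
print(to_conllu(non_transparent))
```

The point of keeping the two lists separate is that no row mixes columns from both: a `cop` relation never appears together with a capitalized name lemma.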

@amir-zeldes
Contributor Author

The problem is that this has already been discussed extensively for English, and the final decision was what I wrote above:

  • If at all possible, syntax inside names is analyzed transparently
  • Nouns that are names are tagged PROPN, even if they are identical to common nouns (so the "State Department" is PROPN PROPN, even though those are nouns)
  • Uppercase name components receive uppercase lemmas (so "State" - this is consistent with "America" as a clear name lemma, and I guess it makes sense since there will be borderline cases for which we cannot be certain if a capitalized noun is still "normal")
  • Verbs in names get lemmatized as usual to the dictionary form, but remain capitalized to indicate they are part of a name (but they are NOT tagged PROPN in UPOS - they are VERB/AUX etc.)

Again, this is not my preference or a proposal, this is what we settled on after the extensive discussion. So my question is only, given this framework, what's the right thing to do here? I think NOUN AUX PRON is not allowed because NOUN is ruled out by the above. But AUX PRON is still possible under the 'function words' exception, same as PTB xpos. However, lemma is meant to be "Be" based on those guidelines, so we need either a clear exception why it should be "be", or a clear exception why this shouldn't be cop, or an alternative lemma "Be" for the validator (not sure if there are other options I'm missing?)

@dan-zeman
Member

> So my question is only, given this framework, what's the right thing to do here?

OK, then I'll leave it for the other maintainers of English to weigh in. Because I think this framework is wrong and therefore none of the things is right to do :-)

@jnivre
Contributor

jnivre commented Oct 7, 2024 via email

@AngledLuffa

It does sound like a lot of the people instrumental in UD just said they don't like this particular scheme.

Also I wanted to say hello to myself in the future, when someone posts on Stanza's GitHub asking why "R" is being lemmatized to "Be".

@amir-zeldes
Contributor Author

Well, I think the discussion was spread over a bunch of issues in different repos, but this is a good starting point:

#777

And see some issues here and cross-references:

UniversalDependencies/UD_English-PUD#3
UniversalDependencies/UD_English-EWT#91

I also notice some posts about this from @dan-zeman (and one from @jnivre ), so I don't think this policy in English should be too surprising. I think the transparent syntax part is what @dan-zeman wanted, whereas the PROPN/lemma part goes more towards parity with the LDC corpora notion of "namedness", i.e. the one used in the context of NER.

@nschneid
Contributor

nschneid commented Oct 7, 2024

Based on notes in UniversalDependencies/UD_English-EWT#131 (comment) I don't think we're 100% settled on lemma capitalization rules. For truly closed-class UPOS tags like AUX and PART we probably want to require lowercasing.

("Be" or "R" is a particularly thorny case because of multiple divergences between PTB and UD: the PTB rule is that all non-modal auxiliaries are verbs, and all verbs are content words, and all content words in a proper name are tagged NNP. We do not want to mess with PTB policies in XPOS. But the lemma capitalization policy in UD can take the UPOS into account.)
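As a rough sketch of what a UPOS-sensitive casing rule could look like: the function name `lemma_casing` and the constant `CLOSED_CLASS_UPOS` are made up for illustration, and the set contains only the two tags named above. Whether other closed-class tags should be added is not decided here.

```python
# Hypothetical sketch of the proposed rule; not existing GUM/EWT tooling.
# Only AUX and PART are listed because those are the tags named in the comment.
CLOSED_CLASS_UPOS = {"AUX", "PART"}


def lemma_casing(lemma, upos):
    """Lowercase the lemma of closed-class tokens; leave all other lemmas untouched."""
    if upos in CLOSED_CLASS_UPOS:
        return lemma.lower()
    return lemma


# The copula in "Toys R Us" gets a lowercase lemma even inside a name,
# while capitalized name components keep their casing.
examples = [
    lemma_casing("Be", "AUX"),
    lemma_casing("State", "PROPN"),
    lemma_casing("Be", "VERB"),
]
```

With this rule the XPOS column can stay NNP as PTB requires, since the function only looks at UPOS.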

Also: Technically the CorrectForm should be "ᴙ", right? :D

@dan-zeman
Member

> the transparent syntax part is what @dan-zeman wanted, whereas the PROPN/lemma part goes more towards parity with the LDC corpora

Yep, without trying to verify what exactly I wrote in those threads, I believe this is accurate. I think I've also been consistently opposed to the LDC-related part (I hear the arguments speaking for it, I'm just not willing to give them priority).

@jnivre
Contributor

jnivre commented Oct 8, 2024 via email

@nschneid
Contributor

nschneid commented Oct 8, 2024

TBC, the original question in this issue was about a lemmatization issue that I think can be resolved narrowly, but the general question of the definition of PROPN has come up.

@jnivre and @dan-zeman's perspective is actually reflected in the universal PROPN docs page, which specifies "Cat/NOUN on a Hot Tin Roof". In principle, in the universal guidelines, it seems fine to say that some nouns are inherently proper and thus should be labeled PROPN, while others are common nouns that happen to be leveraged in a proper name, and should remain NOUN.

The problem is that English tagsets/corpora have no tradition of making this distinction. This is both a theoretical problem in that we would need guidelines for the borderline cases (e.g. a single-word named entity derived from a common noun, like "Creed"), and a practical problem of implementation (30K NNP|NNPS tokens in GUM+EWT alone, and the presence of an article is an insufficient test: e.g. "Georgetown University/NOUN", "a Toyota/PROPN"). If somebody wanted to tackle this for English, I think it would entail developing detailed guidelines and a lexicon, and ensuring the presence of entity type annotations for disambiguation ("Cat" the name vs. the animal) (only GUM has these entity types at present).

> If they are annotated as regular phrases, then they should not only have ordinary syntactic relations (as opposed to “flat”) but also ordinary (universal) postags, features and lemmas.

This cannot be strictly true (that a PROPN never has dependents other than flat) because there are plenty of phrasal names that contain nested proper names, e.g. "Anne/PROPN of Green Gables".

@jnivre
Contributor

jnivre commented Oct 8, 2024

I did not mean to imply anything about what relations PROPN words can have. Of course many proper names are part of larger phrases, even phrases that are names (like the one you quote). All I said was that, in a transparent analysis, all words should have their ordinary postags, features and lemmas. And for "Anne", the ordinary postag is PROPN.

@jnivre
Contributor

jnivre commented Oct 8, 2024

It is an interesting question, however, whether a flat analysis implies that all component words should be tagged PROPN. I can imagine cases where some words are juxtaposed to form a name without being a syntactic phrase, and where some of the words are not proper names. I am not sure I can come up with a convincing example, though. :)

@gossebouma
Contributor

The Dutch treebanks use flat for analyzing multiword proper names, and normally label all parts as PROPN. So no attempt is made to annotate van (of) in Van Alebeek as an ADP (the same goes for determiners).
There are interesting exceptions, though. In het Goede Vrijdag-akkoord (the Good Friday agreement), Vrijdag-akkoord is a flat dependent of Goede, yet it has UPOS=NOUN (as akkoord is a noun). Dutch spelling conventions are quite tricky here: compounds are normally written as a single word, but when the first part of the compound is a multiword proper name the space is preserved.
Another case is names with punctuation symbols, like Stop Aids Now! Here the ! is seen as a separate token with UPOS=SYM, yet it forms part of the name and thus has the dependency label flat.

@amir-zeldes
Contributor Author

> it didn’t look like any consensus was reached

It may have been in part in meetings, but it was definitely reached - I wouldn't have undertaken the project to consolidate lemma casing in GUM if it hadn't been. I am also not trying to reopen these questions - just to interpret the English guidelines with respect to the conflict above.

I think Nathan's proposal of lowercasing based on UPOS AUX/PART should work fine; I would just like that to be normative then.

> the tag PROPN is reserved for words that are mainly (or only) used as names, which in English in turn implies not taking articles (except in meta-linguistic uses)

I'm not sure this is so straightforward for English, and I don't want to reopen the English discussion anyway, but if someone is thinking of applying this to other languages as a universal guideline I'd like to point out:

  1. Many languages don't have articles, and they are as diverse as Slavic and Japanese. Coming up with guidelines to explain what it means to mainly be used as a name there seems hard and likely to be inconsistent (I think the State Department is a name, and even if the article makes you say NOUN in English, I don't know how to argue one way or another for Japanese 外務省 "(the?) Foreign Ministry/Gaimusho").
  2. Some languages allow articles on stereotypical proper names in non-metalinguistic contexts (e.g. German "der Hans"), and many nouns habitually appear without them (esp. but not only mass nouns)
  3. In many languages, including English, some contexts neutralize article usage. For example in English compound modifiers, it's impossible to tell if something is article-compatible or not. Is "Wow Air" PROPN PROPN? Or PROPN NOUN because "Air" is a noun (but notice the whole phrase can be used without articles)? Or is it INTJ PROPN, because "wow" is an interjection? And what about names that are bare plurals?

I am also not saying the current situation is trivial in English, but I think cross-linguistically using something like article usage is a murky criterion, and many UD users probably expect PROPN to reflect something semantic like NER (and you can also check definiteness or articles using the FEATS and tree).

I'll go ahead and implement Nathan's solution - I'm leaving this open for a bit just because I don't want to shut discussion down of course.

@jnivre
Contributor

jnivre commented Oct 8, 2024

I was definitely not suggesting using article usage as a universal criterion. Every language has to be judged on its own internal criteria, and if a language does not have a grammaticalised distinction between common and proper nouns, it can simply use the NOUN tag for all nouns. In fact, the non-obligatoriness of the NOUN-PROPN distinction is my standard example when explaining that, while you cannot invent language-specific upos tags, you don't have to use all tags in all languages.

@dan-zeman
Member

> many UD users probably expect PROPN to reflect something semantic like NER (and you can also check definiteness or articles using the FEATS and tree)

PROPN is definitely related to NER but it classifies one word, so it is not the same as NER when it comes to multiword entities. Czech is one of the languages where articles cannot be used as a criterion because they do not exist. We have a category called proper name in the grammar but it is semantic, it is used in rules for capitalization and it is not a part of speech because it can consist of multiple words. In fact, we were trying to convince people that UD should not have the PROPN category when UD v1 was discussed :-); but since the category is part of UD, we don't want to pretend it does not exist in Czech, because the users would expect it. The distinction is tricky and it is further complicated by the fact that our treebanks are conversions from non-UD annotation, so even if we can come up with acceptable annotation guidelines, we may not be able to enforce them in the data we have. I think the guidelines would be roughly as follows:

  • First/middle/last name of a person is PROPN even if it is string-wise identical with a common noun, adjective or another word.
  • Single-word names of cities, mountains, rivers etc. are normally PROPN. In multiword location names, we often see ADJ + NOUN (as in Černé jezero "Black Lake") but it is also common to see a PROPN modified by an ADJ (Mokrá Lhota "Wet Lhota", where the second word is etymologically derived from a common noun but has no such interpretation synchronically).
  • A similar approach can be taken to names of organizations, products, movie/book titles etc. Use non-PROPN categories wherever possible, but if a word exists only as a name / in a name, it will be PROPN. This includes various acronyms even if they are derived from phrases that would not contain PROPN.
  • Foreign names typically end up as PROPN. Even if they contain multiple words and the words are common nouns, adjectives or function words in the source language, they do not have these categories in Czech. Thus Grand Canyon would be PROPN + PROPN, but if someone translated it to Czech as Velký kaňon, it will be ADJ + NOUN.

Of course there will be numerous cases where it is debatable which of the rules above applies. So far it was convenient to rely on the pre-UD tagging and avoid formulating more precise guidelines but with new treebanks being annotated natively in UD, we won't be able to escape it forever.

@amir-zeldes
Contributor Author

> Every language has to be judged on its own internal criteria

Agreed - and I think for English what we have is pretty reasonable, and in any case as Nathan pointed out, it's not really feasible to revise it too much (huge manual effort, not clear that something different is actually better)


6 participants