Checking the source dictionary #5

Open
jaumeortola opened this issue Feb 5, 2024 · 3 comments

jaumeortola commented Feb 5, 2024

The first version of the source dictionary is here: https://github.com/languagetool-org/english-pos-dict/tree/main/src-dict

I will be adding some comments and ideas here. We can open new issues for some parts of the work.

  • We can proceed to separate the entries by groups: the ones that don't need review, the ones that need some manual review, and so on. For example:
    • words in all spelling dicts and tagged -> no need to review
    • words in all spelling dicts but not tagged -> maybe they can be tagged easily
    • words in US and GB and tagged -> maybe they can be accepted by all variants?
    • ...
  • Check that variant labels are coherent with en-US-GB.txt (use scripting).
  • Some sets of entries look suspicious: untagged words in GB with certain prefixes (mis-, out-, over-, re-, under-) seem to be nonsense words. The same applies to some suffixes (see: survivorshipably... survivorshipry).
  • Words with the tag us-large come from a Hunspell US dictionary that we didn't use until now. It is mentioned in Explore differences between en-US and en-US-large #2
  • We are using a simplified format for regular verbs: recharge=verb=all. (We use a few rules to cover more cases of regular verbs. See here). It would be useful to have something similar for nouns: a simple and quick way to tag a noun. We would need to define the format, and ways to write exceptions.
  • Which sources do we consider authoritative for deciding whether a word is GB or US? And what about AU, CA, ZA, NZ? Are there dictionaries for those variants?
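
To make the simplified verb format concrete, here is a minimal sketch of how an entry such as recharge=verb=all could be expanded into tagged forms. The project's actual rules are not shown in this thread, so the function name `expand_regular_verb`, the `form/TAG` output shape (borrowed from the formulaize example further down) and the spelling rules below are assumptions for illustration only. It does not handle consonant doubling, -y to -ied, or irregular verbs.

```python
def expand_regular_verb(entry: str) -> list[str]:
    """Expand 'lemma=verb=variant' into form/TAG pairs (naive, hypothetical rules)."""
    lemma, kind, _variant = entry.strip().split("=")
    if kind != "verb":
        raise ValueError(f"not a simplified verb entry: {entry!r}")

    # Third person singular: add -es after sibilant endings, otherwise -s.
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        third = lemma + "es"
    else:
        third = lemma + "s"

    # Past tense and past participle: add -d after a final e, otherwise -ed.
    past = lemma + ("d" if lemma.endswith("e") else "ed")

    # Gerund: drop a single final e (recharge -> recharging), keep -ee (agree -> agreeing).
    if lemma.endswith("e") and not lemma.endswith("ee"):
        gerund = lemma[:-1] + "ing"
    else:
        gerund = lemma + "ing"

    return [
        f"{lemma}/VB",
        f"{lemma}/VBP",
        f"{third}/VBZ",
        f"{past}/VBD",
        f"{past}/VBN",
        f"{gerund}/VBG",
    ]
```

A noun format could follow the same idea (a lemma plus a kind, with exceptions listed separately), but that format still has to be defined.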

jaumeortola commented Feb 7, 2024

Separated the dict into src-clean.txt (accepted entries), src-pending.txt and src-discarded.txt. 551de86

Done:

  • tagged words in all spelling dicts -> moved to src-clean
  • tagged words in GB and US -> moved to src-clean, variant modified to "all"
  • untagged words in GB with some prefixes (mis-, out-, over-, re-, under-, in- + uppercase) -> moved to src-discarded
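
The three rules above can be sketched as a small routing function. This is not the script used for the commit. The `Entry` record, the set of dictionary labels in `ALL_DICTS`, and the reading of "in GB" as "found only in the GB dictionary" are assumptions made for illustration. The "in- + uppercase" case is left out because its exact meaning is not spelled out here.

```python
from dataclasses import dataclass

# Assumed set of spelling dictionaries; the real list may differ.
ALL_DICTS = frozenset({"US", "GB", "AU", "CA", "ZA", "NZ"})
SUSPICIOUS_PREFIXES = ("mis", "out", "over", "re", "under")


@dataclass(frozen=True)
class Entry:
    """Hypothetical in-memory view of one source dictionary entry."""
    word: str
    tagged: bool
    dicts: frozenset  # spelling dictionaries that contain the word
    variant: str      # current variant label


def triage(entry: Entry) -> tuple[str, str]:
    """Return (target file, variant label) for an entry."""
    if entry.tagged and entry.dicts == ALL_DICTS:
        return "src-clean", entry.variant
    if entry.tagged and {"US", "GB"} <= entry.dicts:
        # Tagged and present in both GB and US: accept for all variants.
        return "src-clean", "all"
    if (not entry.tagged
            and entry.dicts == {"GB"}
            and entry.word.startswith(SUSPICIOUS_PREFIXES)):
        return "src-discarded", entry.variant
    # Everything else still needs manual review.
    return "src-pending", entry.variant
```

For example, `triage(Entry("overfoo", False, frozenset({"GB"}), "GB"))` routes the made-up word to src-discarded, while an untagged word found in both US and GB stays in src-pending.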

milekpl commented Dec 31, 2024

@jaumeortola There are entries in the src dictionary that seem controversial to me:

I'm looking at the english-pos-dict.txt, line 115470:

formulaize=formulaize/VB,formulaize/VBP=none

This comes from src-pending.txt. But this entry will create only two verb forms for this word. That is not a good idea: if the verb exists, we should tag all of its forms; otherwise we should drop it from the dictionary. In this case it is likely a typo, since no clean corpus or dictionary has it (I checked the Common Corpus from PleIAs: https://huggingface.co/datasets/PleIAs/common_corpus/viewer?sql_console=true&sql=SELECT+*%0AFROM+train%0AWHERE+text+LIKE+%27%25formulaize%25%27%0ALIMIT+10%3B&views%5B%5D=train )

Such lines should not be admissible, as they will create incorrect entries. There are plenty of such verbs in this file.

I already have some checks for myself (in another project) in a Python unittest to highlight inconsistencies and gaps.
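
As an illustration of the kind of check meant here (this is a fresh sketch, not the code from that other project), a unittest could flag entries that carry some verb tags but not the full set. The line format `lemma=form/TAG,...=variant` is inferred from the single example above, and the tag set in `VERB_TAGS` and the helper name `missing_verb_tags` are assumptions.

```python
import unittest

# Assumed full paradigm for a regular verb (Penn Treebank style tags).
VERB_TAGS = {"VB", "VBP", "VBZ", "VBD", "VBN", "VBG"}


def missing_verb_tags(line: str) -> set[str]:
    """Return the verb tags an entry lacks, or an empty set if it has no verb tags."""
    _lemma, forms, _variant = line.strip().split("=")
    tags = {form.rsplit("/", 1)[1] for form in forms.split(",") if "/" in form}
    present = tags & VERB_TAGS
    return VERB_TAGS - present if present else set()


class VerbParadigmTest(unittest.TestCase):
    def test_partial_paradigm_is_flagged(self):
        line = "formulaize=formulaize/VB,formulaize/VBP=none"
        self.assertEqual(missing_verb_tags(line), {"VBZ", "VBD", "VBN", "VBG"})
```

Run over src-pending.txt, a check like this would list every entry with a partial verb paradigm. Verbs that legitimately lack some forms (modals, for instance) would need an exception list.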
