Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pt] Migrate spelling dictionaries to Morfologik format #9424

Merged
merged 52 commits into from
Dec 6, 2023
Merged

Conversation

jaumeortola
Copy link
Member

@jaumeortola jaumeortola commented Oct 3, 2023

Migrate PT dictionary to Morfologik

This PR:

  • removes the old Hunspell data;
  • uses Morfologik-format binary dictionary data hosted on Sonatype;
  • adds a bunch of tests (mostly speller and tokeniser);
  • moves a lot (not all) of the data from hunspell/spelling.txt to spelling.txt;
  • reworks the tokeniser to use the same break characters as the other locales.

The work with curation of the dictionary source data was done in the portuguese-pos-dict repo.

To do before merging

  • compare this branch vs. master with a large corpus;
  • resolve problems (caused by tokenization, and others);
  • release the new dictionaries;
  • remove old Hunspell files and rules;
  • add more tests;
  • add words to resource/pt/do_not_suggest.txt.

@jaumeortola , I've taken all the steps to release it. As far as I can see, it's live. But the tests still fail when trying to read the contents of the live repository. This part of it I have no idea about.

@danielnaber , you can already review the Java, I don't reckon it'll change.

@susanaboatto ... lol yeah there's a lot of stuff. Do have a look at the changes made to the disambiguator and the new files (spelling.txt, do_not_suggest.txt, and dialect_alternations.txt). All the other PT-specific changes are really made directly to the dictionary source data in the PT dictionary repo.


Before merging, please let me squash this into something acceptable.

@jaumeortola jaumeortola self-assigned this Oct 3, 2023
@jaumeortola jaumeortola marked this pull request as draft October 4, 2023 12:32
@jaumeortola jaumeortola force-pushed the pt-dicts branch 2 times, most recently from e869757 to 46d0684 Compare October 7, 2023 06:54
@p-goulart p-goulart changed the title [pt] spelling dicts with Morfologik [pt] Migrate spelling dictionaries to Morfologik format Nov 16, 2023
@p-goulart p-goulart self-assigned this Nov 21, 2023
@danielnaber
Copy link
Member

But the tests still fail when trying to read the contents of the live repository. This part of it I have no idea about.

I restarted the CI process and this time it has worked, it seems.

@p-goulart
Copy link
Collaborator

@SteVio89 this is the PR that @maphjo mentioned yesterday with the partial test deploy for PT dictionaries.

@p-goulart p-goulart marked this pull request as ready for review November 22, 2023 09:14
@p-goulart
Copy link
Collaborator

@SteVio89 the rule ID we had was not being underlined in red – I thought it was enough for it to contain MORFOLOGIK but it had to contain MORFOLOGIK_RULE. So I've pushed new code with the updated rule ID and, once the tests pass, can we deploy that on the test server? Sorry about that.

@p-goulart p-goulart marked this pull request as draft November 23, 2023 13:49
@p-goulart
Copy link
Collaborator

Cannot merge until we work out how to get usage data from partial rollout.

@p-goulart
Copy link
Collaborator

After yesterday's test deployment:

  • I've checked the results on Matomo to make sure we're getting proper data;
  • the apply rate of the new rule hovers around 60%; the Hunspell version had ~55%:
    Screenshot 2023-12-06 at 9 59 06 AM

I think it's safe to say we can merge this! Big success 🥳

Thanks @jaumeortola @maphjo @SteVio89 for the support with this endeavour!

@p-goulart p-goulart marked this pull request as ready for review December 6, 2023 09:01
@p-goulart p-goulart merged commit 612dc21 into master Dec 6, 2023
2 checks passed
@p-goulart p-goulart deleted the pt-dicts branch December 6, 2023 09:16
@danielnaber
Copy link
Member

5 more percentage points is great! Thanks to everyone who has been working on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants