[en] new dicts #10699

Open
wants to merge 4 commits into
base: master

Conversation

jaumeortola
Member

@jaumeortola jaumeortola commented Jul 2, 2024

Summary by CodeRabbit

Release Notes

  • Dictionary Updates

    • Removed various entries from part-of-speech and spelling dictionaries across different English variants, affecting recognized vocabulary.
    • Added new entries for multi-word terms, enhancing term recognition.
    • Introduced new entries in the do-not-synthesize list and updated spelling corrections for specific terms.
  • Spelling and Grammar

    • Improved spell-checking capabilities for different English regional variants by removing outdated terms.
    • Enhanced disambiguation rules and grammar suggestions for better accuracy.
  • Testing

    • Modified various spelling and test-related assertions to reflect changes in expected behavior.
    • Updated test cases for language processing to ensure accuracy.
  • Dependency

    • Updated english-pos-dict dependency to version 0.4.

@jaumeortola jaumeortola requested review from evan-defran-lt and AzadehSafakish and removed request for evan-defran-lt July 2, 2024 11:42
@AzadehSafakish
Collaborator

I didn't build the new dictionaries locally, so these antipatterns (APs) aren't guaranteed to work, although I think most of them should, as they don't rely much on POS-tag information.
If they don't work, they should at least provide an idea of what needs to be fixed and how to fix it.

  • COMMA_THANKS[4]
<antipattern>
    <token postag="SENT_START|PCT" postag_regexp="yes" />
    <token>no</token>
    <token>thank</token>
    <token>you</token>
    <token>,</token>
    <example>No thank you, I'm full.</example>
</antipattern>
  • POSSESSIVE_APOSTROPHE[1]
<antipattern>
    <token>missing</token>
    <token>persons</token>
    <token chunk_re="[IE]-NP.*" />
    <example>Fadil requested the help of missing persons investigator Rami Hasan.</example>
</antipattern>
  • ADVISE_VBG[3]
<antipattern>   <!-- this isn't an FP, but the suggestions are incorrect in this context and interrupting the parallel structure makes the sentence worse -->
    <token skip="5" postag="VBG">
        <exception scope="next" postag="V.*" postag_regexp="yes" />
    </token>
    <token regexp="yes">and|or</token>
    <token postag="VBG" chunk="B-VP" />
    <example>Meditation helps downshifting more easily after work and sleeping better at night.</example>
</antipattern>
  • MISSING_HYPHEN[5]
<antipattern>
    <token regexp="yes" case_sensitive="yes" postag="CD">[A-Z].*</token>
    <token min="2" regexp="yes" case_sensitive="yes">[A-Z].*</token>
    <example>Mussels contributed to a valve problem in the 1990s at the Nine Mile Nuclear Power plant in Lake Ontario.</example>
    <example>The Eighty Minute Hour (1974) — A weird and ambitious "space opera" whose characters actually sing.</example>
    <example>With the Nine Inch Nails album Year Zero, the concept of the albums songs which "[take] place about 15 years in the future" when "Things are not good." and incorporated sites from the Web.</example>
    <example>Chris, As per clause 7.2 (a) (i) of the LNG sales contract we are providing you EcoElctrica's Ninety Day Schedule.</example>
</antipattern>
  • UH_UH_COMMA[1] (pattern, not antipattern)
<pattern>
    <token postag="UH">
        <exception postag="IN" />
        <exception regexp="yes">ha|yo|why|health|check|hip|meow|break|really|never|contact|blah|there|yum</exception>
    </token>
    <token>
        <match no="0" />
    </token>
</pattern>
<message>Consider adding a comma between these interjections.</message>
<suggestion>\1, \2</suggestion>
<suggestion>\1</suggestion>
<example correction="Oh, oh|Oh"><marker>Oh oh</marker>, he is coming.</example>
<example>Yum yum!</example>
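
To make the intent of the UH_UH_COMMA pattern above easier to follow, here is a minimal Python sketch of its logic. It is not LanguageTool code: the `INTERJECTIONS` set is a hypothetical stand-in for the `UH` POS tag, the `postag="IN"` exception is left out because no tagger is available here, and case-insensitive comparison of the two tokens is an assumption.

```python
import re

# Hypothetical stand-in for tokens tagged UH (interjection) by the tagger.
INTERJECTIONS = {"oh", "yum", "ha", "wow", "hey"}

# Exception regex copied from the <exception> element of the pattern.
EXCEPTIONS = re.compile(
    r"ha|yo|why|health|check|hip|meow|break|really|never|contact|blah|there|yum",
    re.IGNORECASE,
)


def suggest(first, second):
    """Return comma suggestions for a repeated interjection, or [] if none apply."""
    is_interjection = first.lower() in INTERJECTIONS
    is_exception = EXCEPTIONS.fullmatch(first) is not None
    is_repeated = first.lower() == second.lower()  # mimics <match no="0"/>
    if is_interjection and not is_exception and is_repeated:
        # Mirrors the two <suggestion> elements: "\1, \2" and "\1".
        return [f"{first}, {second}", first]
    return []
```

Under these assumptions, `suggest("Oh", "oh")` yields the two corrections listed in the example (`Oh, oh` and `Oh`), while `suggest("Yum", "yum")` yields nothing because "yum" is in the exception list.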

Contributor

coderabbitai bot commented Jan 11, 2025

Walkthrough

This pull request updates the LanguageTool English language modules across dictionaries, spelling lists, disambiguation rules, grammar and style rules, and tests. The changes add, remove, and modify vocabulary entries and spelling rules for several English variants (US, GB, AU, CA), with corresponding adjustments to test cases.

Changes

File Path | Change Summary
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt | Substantial removal of entries from part-of-speech dictionary
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml | Added new disambiguation rule for "instal/install" and updated example input forms
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt | Added new entry for "saith"
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling*.txt | Removed words across different English variants (US, GB, CA, AU)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt | Added numerous multi-word terms and variations
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt | Removed specific word entries
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml | Updated rule message for movie theater/cinema suggestion
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml | Updated Oxford spelling examples
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml | Added new antipatterns for various linguistic structures
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt | Updated entries for "Burkina-Faso" and "zettagrammes"
pom.xml | Updated english-pos-dict dependency version from 0.3 to 0.4
languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java | Modified expected output for language annotations in tests
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/*.java | Commented out specific assertions in various spelling rule tests

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (7)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java (1)

46-48: Consider adding replacement test cases for South African English.

While removing this specific test case, consider adding new test cases that:

  1. Cover the new dictionary entries being introduced
  2. Verify region-specific spelling variations
  3. Test the interaction between new grammar patterns and spelling suggestions

Would you like me to help draft additional test cases that cover these scenarios?

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (3)

54409-54410: Consider refining the correction suggestions for "due".

While the examples correctly identify the misuse of "due", some suggested corrections like "dude" or "duet" are unlikely to be appropriate in most contexts. Consider limiting suggestions to more probable alternatives like "do".


109335-109335: Consider documenting the rationale for exceptions.

While the exceptions list is comprehensive, it would be helpful to document why these specific words are excluded from interjection detection, possibly grouping them by category (e.g., greetings, expressions, commands).


140626-140626: Consider adding more descriptive documentation.

While the antipattern is well-implemented, the comment referencing issue #4823 should include a brief description of the issue being addressed, making the code more self-documenting.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)

Line range hint 1-8098: Consider expanding term coverage in key categories.

The file has good coverage but could be enhanced with:

  1. Technical/Scientific terms
  2. More idiomatic expressions
  3. Additional foreign phrases commonly used in English
  4. Modern brand names and trademarks

This would improve the disambiguation capabilities for contemporary text.

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (1)

Line range hint 3118-3337: Address TODO comments for false negatives.

Multiple examples are marked with "TODO. False negative" comments:

  • "I like both chips and eggs and pie and peas."
  • "Art, literature and film."

These false negatives could affect the accuracy of the grammar checker.

Would you like me to help:

  1. Analyze the patterns causing these false negatives?
  2. Create a GitHub issue to track these cases?
  3. Propose rule adjustments to handle these cases correctly?

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)

10665-10665: Consider using XML entity for Unicode character

While the example correctly demonstrates the disambiguation of "Most", consider using XML entity reference for the Unicode character 'μ' to ensure compatibility across different XML parsers.

-          <example type="ambiguous" inputform="Most[most/NN:U,much/JJS,much/RBS]" outputform="Most[much/JJS]"><marker>Most</marker> are around 40-50 μm in diameter.</example>
+          <example type="ambiguous" inputform="Most[most/NN:U,much/JJS,much/RBS]" outputform="Most[much/JJS]"><marker>Most</marker> are around 40-50 &#956;m in diameter.</example>
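
As a quick check of what the suggested change does, the following Python sketch parses both forms with the standard library's XML parser and compares the resulting text. The shortened example sentences and the helper name `example_text` are illustrative only; the sketch says nothing about how LanguageTool itself loads `disambiguation.xml`.

```python
import xml.etree.ElementTree as ET


def example_text(xml_snippet):
    """Parse a single XML element and return its text content."""
    return ET.fromstring(xml_snippet).text


# Literal Greek small letter mu (U+03BC) versus the numeric character reference.
literal = example_text("<example>around 40-50 \u03bcm in diameter.</example>")
escaped = example_text("<example>around 40-50 &#956;m in diameter.</example>")
```

Both calls produce the same string, so for a conforming parser the two spellings are equivalent. The numeric reference mainly guards against the file being read or saved with the wrong character encoding.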
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a33729 and 842fa0f.

📒 Files selected for processing (23)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (7 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (0 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt (0 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt (0 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt (0 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt (0 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml (2 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (12 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (2 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (3 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java (1 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java (4 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java (1 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java (1 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java (1 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java (1 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java (1 hunks)
  • pom.xml (1 hunks)
💤 Files with no reviewable changes (5)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt
✅ Files skipped from review due to trivial changes (2)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java
🧰 Additional context used
🪛 LanguageTool
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt

[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...

(EN_SIMPLE_REPLACE_MALLARD_REACTION)


[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...

(EN_WORD_COHERENCY)


[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.

(EN_WORD_COHERENCY)

🔇 Additional comments (31)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java (1)

47-47: Verify the impact of removing this test assertion.

The removal of this test case reduces coverage for South African English specific spelling suggestions. While this aligns with similar changes in other regional spelling test files, we should ensure that:

  1. The removal doesn't leave critical functionality untested
  2. There's still adequate coverage for South African English specific words

Let's verify the test coverage and related dictionary files:

languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java (1)

47-47: Verify the removal of spelling suggestion test

The test for "AUTestWordToBeIgnore" has been commented out. While this appears to be part of a broader pattern of changes across regional variants, we should ensure we're not inadvertently losing important test coverage.

Please clarify:

  1. Is this test case no longer valid with the new dictionaries?
  2. Will this word be handled differently in the new implementation?
  3. Should we add alternative test cases to validate the new dictionary behavior?

Run this script to check for similar patterns across other regional variants:

✅ Verification successful

Test fixtures cleanup is valid

The commented-out test case is part of a systematic cleanup across all regional variants (AU, GB, CA, US, ZA). These were artificial test words prefixed with region codes, and their removal appears intentional as similar changes were made consistently across all regional test files.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for similar commented-out tests across regional variants
# and verify if the test word exists in the new dictionaries

# Check for similar patterns in other test files
echo "Checking for similar patterns in other test files:"
rg -A 1 "assertSuggestion.*TestWordToBeIgnore" "languagetool-language-modules/en/src/test"

# Check if the test word exists in the new dictionaries
echo -e "\nChecking for test word in dictionaries:"
rg -i "AUTestWordToBeIgnore" "languagetool-language-modules/en/src/main/resources"

Length of output: 2549


Script:

#!/bin/bash
# Check for potential new test patterns and dictionary changes

echo "Checking for new test patterns in spell checker tests:"
rg "assertSuggestion.*rule.*lt" "languagetool-language-modules/en/src/test"

echo -e "\nChecking for recent changes in dictionary files:"
fd . "languagetool-language-modules/en/src/main/resources" -t f -e txt -e dict | while read -r file; do
    echo "=== $file ==="
    head -n 5 "$file" 2>/dev/null
done

Length of output: 35264

languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java (1)

48-48: Avoid commenting out tests without explanation or replacement.

Commenting out test assertions reduces test coverage and could mask potential issues with Canadian-specific spelling suggestions. If the test case is no longer valid due to dictionary changes:

  1. Either remove it completely with a clear explanation in the commit message
  2. Or update it with new test cases that validate the new dictionary entries

Let's check if this is part of a broader pattern of disabled tests:

languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java (2)

76-76: Review similar word patterns for consistency.

Since we're adding an agent noun form ("campaigner") to the suggestions, we should review other similar test cases to ensure consistent handling of agent nouns.

Let's search for similar patterns:

✅ Verification successful

The agent noun handling is consistent across test cases.

The addition of "campaigner" follows the same pattern as the existing "spreader" test case, where agent nouns are included in suggestions for verb-based misspellings.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Find other test cases that might need similar updates
rg -B 2 -A 2 "assertAllMatches.*er\"" "languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/"

# Check for other agent noun patterns in the tests
ast-grep --pattern 'assertAllMatches($_, $_, $_, $$$, "er")'

Length of output: 1318


Script:

#!/bin/bash
# Find test cases with past tense verbs (-ed) to check if they're missing agent nouns
rg -B 2 -A 2 'assertAllMatches.*ed"' "languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java"

# Also check for present participle forms (-ing) to get a complete picture
rg -B 2 -A 2 'assertAllMatches.*ing"' "languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java"

Length of output: 838


76-76: Verify dictionary consistency for "campaigner" suggestion.

The addition of "campaigner" to the expected suggestions appears reasonable, but we should verify that this change aligns with the dictionary modifications mentioned in the PR.

Let's check the dictionary entries:

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (9)

13975-13980: LGTM! Well-structured antipattern for "missing persons" compound modifier.

The antipattern correctly handles the "missing persons" phrase when followed by a noun phrase, with a clear example demonstrating its usage.


14353-14353: LGTM! Clear example for possessive apostrophe usage.

The example effectively demonstrates the distinction between plural and possessive forms.


18804-18804: LGTM! Natural example for "you" vs "your" correction.

The example effectively illustrates a common grammatical error in expressing gratitude.


22583-22583: LGTM! Appropriate handling of Danish text examples.

The examples correctly demonstrate Danish text detection scenarios.


81228-81235: LGTM! Excellent antipattern for preserving parallel structure.

The antipattern correctly prevents incorrect suggestions that would break parallel structure in sentences with coordinated gerunds. The example effectively demonstrates this.


93011-93011: LGTM! Clear example for proper noun handling.

The example effectively demonstrates the correct handling of proper nouns in historical context.


109345-109345: LGTM! Comprehensive examples for repeated word patterns.

The examples effectively cover various scenarios of acceptable word repetition.


109981-109988: LGTM! Well-structured antipattern for polite refusal.

The antipattern correctly handles the "no thank you" phrase with appropriate comma placement. The example is natural and clear.


114193-114200: LGTM! Robust antipattern for proper nouns with numbers.

The antipattern effectively handles various cases of proper nouns containing numbers, with comprehensive examples demonstrating different scenarios.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (1)

5-5: Verify the impact of removing all dictionary entries.

The file has been emptied of all entries. This could significantly impact the language tool's ability to recognize and process various terms.

Let's verify what entries were removed and their potential impact:

✅ Verification successful

Removal of supplementary dictionary entries is acceptable

The removed entries were supplementary to the main binary dictionary (*.dict) and not critical for core functionality:

  • The file explicitly states it's "used additionally to the binary dictionary"
  • Spell checking is handled separately in hunspell/spelling.txt
  • Custom additions can still be made via added_custom.txt
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the git history to see what entries were removed
# and analyze their importance for language processing.

# Get the previous version of the file
git show HEAD^:languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt > /tmp/added_old.txt

# Count and categorize the removed entries
echo "Total entries removed:"
wc -l /tmp/added_old.txt

echo -e "\nSample of removed entries by part of speech:"
grep -P "\tNNP\t" /tmp/added_old.txt | head -n 5
grep -P "\tNN\t" /tmp/added_old.txt | head -n 5
grep -P "\tVB\t" /tmp/added_old.txt | head -n 5
grep -P "\tJJ\t" /tmp/added_old.txt | head -n 5

Length of output: 521


Script:

#!/bin/bash
# Check the format and content of the previous version
echo "First 10 lines of the previous version:"
head -n 10 /tmp/added_old.txt

echo -e "\nChecking for related dictionary files:"
fd ".*\.txt" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/

echo -e "\nChecking for documentation:"
rg -l "added.txt" --type md

Length of output: 4911

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml (1)

230-230: LGTM! The regex pattern update improves noun form handling.

The change to postag="(NN|NNS)(:UN)?" with $1 replacement makes the pattern more flexible by:

  1. Properly grouping the noun forms
  2. Handling both singular and plural forms with optional uncountable noun tag

Run the following script to verify the impact on other rules:

✅ Verification successful

LGTM! The pattern update is consistent with codebase conventions.

The change to postag="(NN|NNS)(:UN)?" with $1 replacement follows established patterns for handling noun forms throughout the codebase, particularly in handling both singular and plural forms with optional uncountable noun tags.
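
The effect of the regex quoted above can be illustrated with a small Python sketch. It assumes the POS tag is matched as a whole string and that the `$1` replacement keeps only the first capture group; the helper name `replace_postag` is hypothetical and this is not the LanguageTool implementation.

```python
import re

# Regex quoted in the review comment above.
POSTAG_PATTERN = re.compile(r"(NN|NNS)(:UN)?")


def replace_postag(tag):
    """Return capture group 1 if the pattern matches the whole tag, else None."""
    match = POSTAG_PATTERN.fullmatch(tag)
    return match.group(1) if match else None
```

With whole-string matching, `NN:UN` maps to `NN` and `NNS:UN` maps to `NNS`, so the optional `:UN` suffix is dropped while the singular or plural form is preserved. An unanchored substitution would behave differently for `NNS:UN`, because the alternation tries `NN` first.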

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for similar patterns in other rules that might benefit from this improvement

# Search for similar postag patterns in other rules
rg -A 2 'postag="NN\|NN\(S\)"' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/

Length of output: 108


Script:

#!/bin/bash
# Search for various forms of postag patterns with NN or NNS in English rules
echo "=== Searching for postag patterns with NN/NNS ==="
rg -A 2 'postag="[^"]*(?:NN|NNS)[^"]*"' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/

echo -e "\n=== Searching for specific postag replace patterns ==="
rg -A 2 'postag_replace' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/

echo -e "\n=== Searching for similar message patterns with suggestions ==="
rg -A 2 '<message>[^<]*<suggestion>[^<]*<match[^>]*postag=' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/

Length of output: 67564

pom.xml (1)

227-227: Verify compatibility with the new english-pos-dict version.

The update from version 0.3 to 0.4 of the english-pos-dict dependency could introduce breaking changes. Since this PR adds new dictionaries, please ensure:

  1. All existing functionality remains compatible
  2. The new version supports the new dictionary features
  3. No migration steps are required

Run the following script to check for potential impacts:

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2)

Line range hint 1-4: Well-documented file header with clear usage rules.

The header comment clearly explains:

  • Purpose: Multi-word terms for EnglishHybridDisambiguator
  • Usage rules: No disambiguation needed, single POS only
  • Case handling: Case-sensitive with POS added for variants
  • Integration: Terms excluded from spelling checks

Line range hint 1-8098: Verify accuracy of specialized terms.

Please verify:

  1. Historical names and titles (e.g., "Genghis Khan", "Kublai Khan")
  2. Foreign language terms and their translations
  3. Brand names and trademarks for current validity
  4. Correct pluralization forms (e.g., "chefs d'oeuvre", "chefs-d'oeuvre")

Consider adding comments for terms with special usage rules or historical context.

✅ Verification successful

Multiwords file follows stated guidelines correctly.

The file maintains consistency in POS tagging and follows its own guidelines:

  1. Terms have appropriate part-of-speech tags (e.g., Latin/French phrases as RB/JJ/NN, proper nouns as NNP)
  2. Multiple tags are only used when grammatically valid (e.g., "bona fide" as both JJ and RB)
  3. Special cases are marked with multiword tag

The commented-out entries appear to have been commented out correctly, as they carried incorrect POS tags (mostly NNP where RB, JJ, or NN would be appropriate).
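
A check like the one described above could be sketched in Python as follows. It assumes each entry is a term and a POS tag separated by a tab, with `#` marking comments, which is what the search patterns in the scripts suggest. The sample lines are hypothetical and not copied from `multiwords.txt`.

```python
from collections import defaultdict


def parse_multiwords(lines):
    """Group POS tags by term, skipping blank lines and '#' comments."""
    tags_by_term = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        term, sep, tag = line.partition("\t")
        if sep and tag:
            tags_by_term[term].add(tag.strip())
    return dict(tags_by_term)


# Hypothetical sample data in the assumed "term<TAB>tag" format.
SAMPLE = [
    "# sample lines for illustration only",
    "bona fide\tJJ",
    "bona fide\tRB",
    "Genghis Khan\tNNP",
    "#ad hoc\tNNP",
]

# Terms that carry more than one tag and may deserve a manual look.
multi_tag_terms = sorted(
    term for term, tags in parse_multiwords(SAMPLE).items() if len(tags) > 1
)
```

On the sample data only "bona fide" is reported, and the commented-out "ad hoc" line is ignored.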

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any existing comments explaining special cases
rg -A 1 "^#.*" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"

Length of output: 1990


Script:

#!/bin/bash
# Search for Latin/French legal terms and their POS tags
rg "^(in|ex|ad|de|prima|sub|per|inter|ultra|vice|sui|pro)\s.*\t(NN|JJ|RB)" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"

# Search for terms with multiple POS tags to verify disambiguation claim
rg ".*\t.*,.*" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"

Length of output: 874


Script:

#!/bin/bash
# Compare commented terms with their uncommented counterparts
rg "^#.*\t" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt" | cut -f1 | sed 's/^#//' | while read -r term; do
    rg "^$term\t" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"
done

Length of output: 5305

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (1)

3118-3118: Verify the correction suggestion for consistency with style guides.

The example suggests adding an Oxford comma in "Legislative power is vested in the President, the government and parliament." While this is a stylistic choice, ensure it aligns with the intended style guide for the project.

Run this script to check for consistency in Oxford comma rules:

✅ Verification successful

Oxford comma usage is consistent with project style guide

The suggestion to add an Oxford comma in "Legislative power is vested in the President, the government, and parliament" aligns with the project's consistent style pattern, as evidenced by numerous similar corrections in the codebase's style rules.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other Oxford comma examples and corrections
rg -A 2 'correction="[^"]*,' "languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/"

Length of output: 90978

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (3)

88-89: LGTM! Clean removal of outdated comment

The removal of the "gif/NNS" comment and the example clarification improves the documentation.


1570-1570: LGTM! Good example addition

The new example correctly demonstrates the disambiguation of "No" as a determiner in the "No one" context, complementing the existing "one" disambiguation example.


3219-3219: LGTM! Clear disambiguation examples for "Let"

The examples effectively demonstrate the disambiguation between:

  1. "Let's" as a contraction (VBP)
  2. "Let" as an imperative verb (VB/VBD/VBN/VBP)

Also applies to: 3255-3255

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt (2)

11-11: LGTM: Mapping of "a'ight" to "alright"

The informal contraction "a'ight" is correctly mapped to "alright" with the RG (degree adverb) tag.


12-12: Verify archaic form "saith"

The archaic form "saith" is mapped to "say" with VBZ tag. Consider moving this to filter-archaic.txt during the next dictionary rebuild as suggested in the file header.

languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java (1)

54-55: Restore or remove commented test assertion

The assertion testing spelling suggestions for "anotherArtificialTestWordForLanguageTol" is commented out. Either:

  1. Restore the assertion if it's still valid
  2. Remove it completely if it's no longer needed
  3. Add a comment explaining why it's commented out

This helps maintain test coverage and code clarity.

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (2)

170-170: LGTM: Standardized country name format

The correction of "Burkina-Faso" to "Burkina Faso" aligns with the official country name format.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...

(EN_SIMPLE_REPLACE_MALLARD_REACTION)


220-220: LGTM: Consistent metric unit spelling

The standardization of "zettagrammes" to "zettagrams" maintains consistency with other metric unit spellings in the file.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...

(EN_WORD_COHERENCY)


[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.

(EN_WORD_COHERENCY)

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml (3)

200-201: LGTM: Updated examples for Oxford spelling rules

The examples for "Americanisation/Americanization" correctly demonstrate the Oxford spelling variants with appropriate part-of-speech tags (NNP/NNPS).


203-203: LGTM: Added examples for naturalization and de-energization

The examples effectively demonstrate the Oxford spelling rules for:

  • "naturalisations/naturalizations" in the context of official documents
  • "de-energisations/de-energizations" with appropriate postag information

Also applies to: 209-209


257-257: LGTM: Added example for recentralization

The example correctly demonstrates the Oxford spelling rule for "recentralise/recentralize" with proper POS tag verification.



💡 Codebase verification

Removal of terms will impact US English spell checking functionality

The verification confirms that the removed technical and compound terms (e.g., nanofiber, re-energize, sublicensable) are not present in any of the following dictionaries:

  • Common English spelling dictionary (spelling.txt)
  • Merged dictionary (spelling_merged.txt)
  • Custom additions (spelling_custom.txt)

This removal will negatively impact spell checking functionality for US English users working with technical and business documents.

🔗 Analysis chain

Verify the removal of US-specific technical and compound terms.

The removal of all entries from this file could impact spell checking functionality for US English users, particularly for technical terms (nanofiber, quantizers), compound words (re-energize), and business terminology (sublicensable).

Let's verify if these terms were moved to the common spelling.txt file:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if removed words exist in the common spelling file
# and verify potential impact on spell checking functionality

# Define the list of removed words to check
words=(
  "calisthenics"
  "megadonor"
  "megadonors"
  "remodeler"
  "remodelers"
  "nanofiber"
  "nanofibers"
  "quantizers"
  "re-energize"
  "re-energized"
  "re-energizes"
  "re-energizing"
  "nonsublicensable"
  "sublicensable"
  "non-sublicensable"
)

echo "Checking for removed words in spelling.txt..."
for word in "${words[@]}"; do
  echo "Checking: $word"
  rg -i "^${word}$" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/spelling.txt"
done

# Check if these words are used in example sentences or test cases
echo -e "\nChecking for usage in tests and examples..."
rg -i -l "($(IFS="|"; echo "${words[*]}"))" "languagetool-language-modules/en/src/test/"

Length of output: 5905


Script:

#!/bin/bash
# Find all spelling-related dictionary files
echo "Locating spelling dictionary files..."
fd -t f -e txt spelling

# Find specific files we're interested in
echo -e "\nLooking for main spelling files..."
fd -t f "spelling\.txt$|spelling_en-US\.txt$"

# Show directory structure for language resources
echo -e "\nChecking resource directory structure..."
fd . "languagetool-language-modules/en/src/main/resources" -t d

Length of output: 10113


Script:

#!/bin/bash
# Define the list of removed words
words=(
  "calisthenics"
  "megadonor"
  "megadonors"
  "remodeler"
  "remodelers"
  "nanofiber"
  "nanofibers"
  "quantizers"
  "re-energize"
  "re-energized"
  "re-energizes"
  "re-energizing"
  "nonsublicensable"
  "sublicensable"
  "non-sublicensable"
)

# Check each word in the main spelling files
for file in "spelling.txt" "spelling_merged.txt" "spelling_custom.txt"; do
  echo "=== Checking $file ==="
  for word in "${words[@]}"; do
    echo "Looking for: $word"
    rg -i "^${word}$" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/${file}"
  done
  echo
done

# Look for any migration-related comments
echo "=== Checking for migration comments ==="
rg -i "moved|migration|merged|consolidat" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/"

Length of output: 10955
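
The lookup performed by the scripts above can be sketched in Python. This is a minimal sketch, assuming each spelling file is a plain one-word-per-line list with optional "#" comments (an assumption; the real files may carry extra annotations). The function name `missing_words` and the sample data are hypothetical.

```python
from typing import Iterable, List


def missing_words(removed: Iterable[str], dictionary_lines: Iterable[str]) -> List[str]:
    """Return the removed words that no dictionary line covers (case-insensitive)."""
    known = set()
    for line in dictionary_lines:
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if entry:
            known.add(entry.lower())
    return [word for word in removed if word.lower() not in known]


# Hypothetical sample data, not taken from the real spelling files.
sample_dictionary = ["calisthenics", "# a comment line", "Nanofiber  # technical term", ""]
removed_words = ["calisthenics", "nanofiber", "sublicensable"]
print(missing_words(removed_words, sample_dictionary))
```

Any word this reports would no longer be accepted by the variant's speller unless the rebuilt binary dictionary already contains it, which this sketch does not check.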

Comment on lines 8061 to 8098
tam-o'-shanter NN
tam-o'-shanters NNS
tam o'-shanter NN
tam o'-shanters NNS
Tam o' Shanter NNP
Ten Thousands Lakes NNP
status quo NN
Status Quo NN
status quos NNS
Status Quos NNS
status quo ante NN
Status Quo Ante NN
status quo ante bellum NN
Status Quo Ante Bellum NN
in situ RB
in situ JJ
hic et nunc RB
ac etiam RB
ceteris paribus RB
ib. RB
ibid. RB
a.k.a. RB
chef d'oeuvre NN
chefs d'oeuvre NNS
chef-d'oeuvre NN
chefs-d'oeuvre NNS
chef-d'œuvre NN
chefs-d'œuvre NNS
de facto RB
Genghis Khan NNP
'Ndrangheta NNP
’Ndrangheta NNP
Viet Cong NNP
Kublai Khan NNP
double Dutch NN:U
hacky sack NN
hacky sacks NN

🛠️ Refactor suggestion

Review and standardize POS tagging.

Several inconsistencies in part-of-speech tagging need attention:

  1. Foreign phrases have inconsistent tags:

    • "vice versa" tagged as RB
    • "in situ" tagged as both RB and JJ
    • "de facto" tagged as RB
      Consider standardizing the tagging approach for foreign phrases.
  2. Compound terms need review:

    • "double Dutch" tagged as NN:U (uncountable)
    • "hacky sacks" tagged as NN, although it is a plural form (NNS would be expected)
  3. Some terms are missing alternative forms:

    • "chef d'oeuvre"/"chef-d'oeuvre" - consider adding adjectival forms
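
A consistency check along these lines could be automated. The following is a rough sketch, assuming multiwords.txt uses "term<TAB>tag" lines with "#" comments (as the rg patterns in this review imply). The helper names and the sample entries are hypothetical, and the "in-situ" line is invented purely for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def parse_entries(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'term<TAB>tag' lines, skipping blank lines and '#' comments."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        term, _, tag = line.partition("\t")
        if tag:
            entries.append((term.strip(), tag.strip()))
    return entries


def variant_key(term: str) -> str:
    """Treat hyphenated and spaced spellings as the same expression (case is kept)."""
    return re.sub(r"[-\s]+", " ", term)


def inconsistent_variants(entries: List[Tuple[str, str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Group spelling variants and keep the groups whose tag sets differ."""
    tags_by_term: Dict[str, Set[str]] = defaultdict(set)
    for term, tag in entries:
        tags_by_term[term].add(tag)
    groups: Dict[str, Dict[str, Set[str]]] = defaultdict(dict)
    for term, tags in tags_by_term.items():
        groups[variant_key(term)][term] = tags
    return {
        key: variants
        for key, variants in groups.items()
        if len({frozenset(tags) for tags in variants.values()}) > 1
    }


sample = [
    "# hypothetical sample",
    "in situ\tRB",
    "in situ\tJJ",
    "in-situ\tJJ",
    "chef d'oeuvre\tNN",
    "chef-d'oeuvre\tNN",
]
print(sorted(inconsistent_variants(parse_entries(sample))))
```

The check only compares spelling variants of one expression. It does not decide which tag is right, so flagged groups still need manual review.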

@@ -3227,7 +3227,7 @@ For more detailed information, including tag definitions, please check out our d
<example>I like both chips and <marker>eggs and</marker> pie and peas.</example><!-- TODO. False negative -->
<example>Art, <marker>literature and</marker> film.</example><!-- TODO. False negative -->
<example>Art is used in rituals, <marker>performances and</marker> dances as a decoration or symbol.</example>
<example>Legislative power is vested in the President, the government and parliament.</example><!-- TODO. False negative -->
<!--<example correction="government, and">Legislative power is vested in the President, the <marker>government and</marker> parliament.</example>-->

🛠️ Refactor suggestion

Resolve inconsistent example usage across rule sections.

The "Legislative power" example appears in multiple sections with different treatments (active vs. commented). This inconsistency could lead to confusion in rule maintenance.

Consider:

  1. Documenting why the example is commented out in some sections but active in others
  2. Using a consistent approach across all sections
  3. If the example is problematic, replace it with a more suitable one in all sections

Also applies to: 3337-3337
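
To find examples that are active in one place and commented out in another, a small scan could help. This is a sketch under stated assumptions: it uses a regular expression rather than an XML parser, and it only recognizes a comment that opens directly before the `<example>` tag. The function names are hypothetical.

```python
import re
from collections import defaultdict

# An optional "<!--" directly before the tag marks the example as commented out.
EXAMPLE_RE = re.compile(r"(<!--\s*)?<example\b[^>]*>(.*?)</example>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def example_states(xml_text):
    """Map each example sentence to the states ('active', 'commented') it occurs in."""
    states = defaultdict(set)
    for match in EXAMPLE_RE.finditer(xml_text):
        commented = match.group(1) is not None
        # Strip inline markup such as <marker> so both forms compare equal.
        sentence = TAG_RE.sub("", match.group(2)).strip()
        states[sentence].add("commented" if commented else "active")
    return states


def mixed_examples(xml_text):
    """Return sentences that appear both as active and as commented-out examples."""
    return sorted(s for s, st in example_states(xml_text).items() if len(st) > 1)


sample = """
<example>Legislative power is vested in the President, the government and parliament.</example><!-- TODO. False negative -->
<!--<example correction="government, and">Legislative power is vested in the President, the <marker>government and</marker> parliament.</example>-->
<example>Art, <marker>literature and</marker> film.</example>
"""
print(mixed_examples(sample))
```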

@@ -10606,7 +10617,7 @@ Incorrectly touched. <example type="untouched">Incorrect grammar. Those present
<example type="untouched">I thought I'd already told you oil and water don't mix.</example>
<example type="untouched">Who told you oil and water actually mixes well?</example>
<example type="untouched">... as he said to Jane and I milk and honey is the food of the gods.</example>
-<example type="ambiguous" inputform="email[email/NN:U,email/VB,email/VBP]" outputform="email[email/VBP]">Currently we <marker>email</marker> and share the stuff.</example><!-- #2539-->
+<!--<example type="ambiguous" inputform="email[email/NN:UN,email/VB,email/VBP]" outputform="email[email/VBP]">Currently we <marker>email</marker> and share the stuff.</example><!-\- #2539-\->-->

💡 Codebase verification

Uncomment the email disambiguation example

The example should be restored as it provides correct disambiguation rules for "email" as a verb in contexts like "Currently we email", which aligns with the resolution needed for issue #2539.

🔗 Analysis chain

Verify if the email example should remain commented out

The example is commented out with reference to issue #2539. Please verify if this is still relevant or if the example should be restored.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if issue #2539 is still open and relevant
gh issue view 2539 2>/dev/null || echo "Issue not found or not accessible"

Length of output: 247

Comment on lines +32 to +40
<rule id="INSTAL_INSTALL" name="instal/install">
<pattern>
<and>
<token inflected="yes">install</token>
<token inflected="yes">instal</token>
</and>
</pattern>
<disambig action="remove"><wd lemma="instal"></wd></disambig>
</rule>

💡 Codebase verification

British English variant handling needs improvement

The review comment is valid. The en-US-GB.txt file confirms "instal" as a legitimate British English variant, similar to other single-'l' British spellings like "instil". The current disambiguation rule that unconditionally removes "instal" would incorrectly handle British English texts.

  • Consider modifying the rule to respect the text's language variant (US/GB) before removing "instal"
  • The rule should align with other British spelling patterns in the codebase, like "instil/instill"
🔗 Analysis chain

Consider British English variant handling

The rule removes "instal" when both forms are present, which might not be appropriate for British English texts where "instal" is a valid variant spelling.

Consider:

  1. Adding a language variant check before removing "instal"
  2. Including test examples for both American and British English cases
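
The intended behavior can be illustrated outside the XML rule format. The sketch below is not LanguageTool's API: `drop_instal` and the `(lemma, tag)` reading tuples are hypothetical, and it only models "remove the instal reading unless the text is British English".

```python
from typing import List, Tuple

Reading = Tuple[str, str]  # (lemma, POS tag); a hypothetical stand-in for a token reading


def drop_instal(readings: List[Reading], variant: str) -> List[Reading]:
    """Remove 'instal' readings when 'install' is also present, except for en-GB."""
    lemmas = {lemma for lemma, _ in readings}
    if {"install", "instal"} <= lemmas and variant != "en-GB":
        return [reading for reading in readings if reading[0] != "instal"]
    return list(readings)


# Readings a tagger might produce for the token "installed" (illustrative only).
readings = [("install", "VBD"), ("instal", "VBD")]
print(drop_instal(readings, "en-US"))
print(drop_instal(readings, "en-GB"))
```

Test examples for both variants would then cover the two branches: the American case loses the "instal" reading, the British case keeps both.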
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if "instal" is used in other British English rules or dictionaries
rg -l 'instal' . | grep -i 'british\|gb'

Length of output: 136


Script:

#!/bin/bash
# Check the context of "instal" in the US-GB comparison file
echo "=== Content of en-US-GB.txt around 'instal' ==="
rg -C 3 'instal' ./languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/en-US-GB.txt

echo -e "\n=== Check for British/GB variant handling in disambiguation rules ==="
rg -C 3 'GB|British|variant' ./languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml

Length of output: 5555

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 842fa0f and 94bfef0.

📒 Files selected for processing (1)
  • languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (java-kotlin)

@@ -37,7 +37,7 @@ public class LanguageAnnotatorTest {
public void testGetTokensWithPotentialLanguages() {
LanguageAnnotator annotator = new LanguageAnnotator();
List<LanguageAnnotator.TokenWithLanguages> tokens = annotator.getTokensWithPotentialLanguages("Der große Haus.", en, deList);
-assertThat(tokens.toString(), is("[Der/de-DE, , große, , Haus/de-DE, .]")); // TODO: why no lang for 'große'?
+assertThat(tokens.toString(), is("[Der/en-US/de-DE, , große, , Haus/de-DE, .]")); // TODO: why no lang for 'große'?

💡 Codebase verification

Test expectation for standalone "Der" should not include en-US language tag

The test expects "Der" to be detected as both English (en-US) and German (de-DE), but this is incorrect. While "Der" exists in the English dictionaries, it appears only as part of the proper noun "Der Spiegel" (German news magazine). As a standalone article, "Der" should be tagged only as German (de-DE).

  • The test at line 40 should expect [Der/de-DE, ...] instead of [Der/en-US/de-DE, ...]
🔗 Analysis chain

Verify if "Der" should be detected as an English word

The change suggests that the German article "Der" is now being detected as both English (en-US) and German (de-DE) word. This might indicate unintended side effects from the dictionary changes in this PR.

Please verify:

  1. Is this change intentional?
  2. Could the new English dictionaries be incorrectly classifying German words?

Run this script to check if "Der" appears in the English dictionaries:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if "Der" appears in English dictionaries
# Note: Checking case-sensitive to avoid false positives with "der"

echo "Checking English dictionaries for 'Der':"
fd -t f . "languagetool-language-modules/en/src/main/resources" -x rg -l "^Der\b" {}

Length of output: 378


Script:

#!/bin/bash
# Check the actual entries containing "Der" in the English dictionary files
echo "=== Entries in specific_case.txt ==="
rg "^Der\b" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/specific_case.txt"

echo -e "\n=== Entries in multiwords.txt ==="
rg "^Der\b" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"

Length of output: 432
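
Why a dictionary change can alter this assertion is easier to see with a toy model. The sketch below is a simplified assumption, not the actual LanguageAnnotator logic: each token simply receives the code of every word list that contains it. The `annotate` function and the word lists are hypothetical.

```python
def annotate(tokens, word_lists):
    """Append the code of every word list containing the token, e.g. 'Der/en-US/de-DE'."""
    annotated = []
    for token in tokens:
        languages = [code for code, words in word_lists.items() if token in words]
        annotated.append("/".join([token] + languages))
    return annotated


tokens = ["Der", "große", "Haus"]

# Hypothetical word lists before and after "Der" becomes known to the English dictionary.
before = {"en-US": set(), "de-DE": {"Der", "Haus"}}
after = {"en-US": {"Der"}, "de-DE": {"Der", "Haus"}}

print(annotate(tokens, before))
print(annotate(tokens, after))
```

Under this model, a standalone "Der" entry in the English word list is enough to produce the extra en-US tag, which is the effect the changed assertion records.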

[en] fix tests

[en] multiwords

[en] disambiguation: Even

[en] multiwords

[en] multiwords

[en] multiwords

[en] multiwords

[en] multiwords

[en] fix some FPs, by @AzadehSafakish

[en] multiwords

[en] do not synthesize: saith

[en] multiwords

[en] multiwords

[en] multiwords, empty files: spelling.txt, added.txt

[en] dummy commit
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (1)

170-170: Add an explanatory message for the Burkina Faso correction.

Consider adding a message to explain that this is the country's official name, similar to other geographical name corrections in the file.

-Burkina-Faso=Burkina Faso
+Burkina-Faso=Burkina Faso	The official name of the country is "Burkina Faso" (without a hyphen).
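
For reference, the entry shape used in this suggestion can be parsed as follows. This is a sketch that assumes the "wrong=right<TAB>message" layout seen in the suggested line, with the message optional. `Replacement` and `parse_replace_line` are hypothetical names, not part of LanguageTool.

```python
from typing import NamedTuple, Optional


class Replacement(NamedTuple):
    wrong: str
    right: str
    message: Optional[str]


def parse_replace_line(line: str) -> Optional[Replacement]:
    """Parse 'wrong=right<TAB>message'; return None for blanks, comments, or malformed lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    pair, _, message = line.partition("\t")
    wrong, separator, right = pair.partition("=")
    if not separator:
        return None
    return Replacement(wrong.strip(), right.strip(), message.strip() or None)


print(parse_replace_line("Burkina-Faso=Burkina Faso"))
print(parse_replace_line('Burkina-Faso=Burkina Faso\tThe official name of the country is "Burkina Faso" (without a hyphen).'))
```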
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 94bfef0 and 56a79d0.

📒 Files selected for processing (24)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (7 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (0 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt (0 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt (0 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt (0 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt (0 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml (2 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (12 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (2 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (3 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java (1 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java (4 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java (1 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java (1 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java (1 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java (1 hunks)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java (1 hunks)
  • languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java (1 hunks)
  • pom.xml (1 hunks)
💤 Files with no reviewable changes (5)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt
🚧 Files skipped from review as they are similar to previous changes (15)
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml
  • languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java
  • languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml
  • pom.xml
👮 Files not reviewed due to content moderation or server errors (2)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (8)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (1)

220-220: LGTM! Consistent with other metric unit entries.

The entry follows the established pattern for metric unit spelling standardization and includes an appropriate explanatory message.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (7)

32-40: Consider language variants before removing "instal"

The rule unconditionally removes "instal" when both forms are present, which could incorrectly handle British English texts where "instal" is a valid variant.

Consider modifying the rule to:

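  1. Check the text's language variant (US/GB) before removing "instal"
  2. Align with other British spelling patterns like "instil/instill"

A possible shape for the change (a sketch only: the "language" attribute on <exception> is unverified and may not exist in the disambiguation schema):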
     <rule id="INSTAL_INSTALL" name="instal/install">
         <pattern>
             <and>
+                <token><exception language="en-GB"/></token>
                 <token inflected="yes">install</token>
                 <token inflected="yes">instal</token>
             </and>
         </pattern>
         <disambig action="remove"><wd lemma="instal"></wd></disambig>
     </rule>

88-89: LGTM! Input form simplified correctly

Removing the incorrect "gif/NNS" tag from the input form improves accuracy while maintaining the correct disambiguation between noun and verb forms.


1570-1570: LGTM! Accurate disambiguation example

The example correctly demonstrates the disambiguation of "No" as a determiner in the context "No one trusts him any more."


3219-3219: LGTM! Correct disambiguation for contraction

The example accurately demonstrates the disambiguation of "Let" as a present tense verb (VBP) in the "Let's" construction.


3255-3255: LGTM! Appropriate verb form ambiguity retained

The example correctly maintains ambiguity between verb forms (VB/VBD/VBN/VBP) for "Let" in the imperative context.


10620-10620: Restore the email disambiguation example

The commented-out example provides correct disambiguation rules for "email" as a verb in contexts like "Currently we email".

Please uncomment the example as it aligns with the resolution needed for issue #2539.


10665-10665: LGTM! Clear disambiguation example

The example accurately demonstrates the disambiguation of "Most" as a superlative adjective (JJS) in the given context.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (1)

Line range hint 2068-2070: Inconsistent tag usage for similar terms

The tags for similar terms are inconsistent:

  • "Red Hot Chili Peppers" uses NPMP000
  • "Red Hot" uses NPMNS00
    They should use consistent tag patterns since they refer to the same entity type.
-Red Hot	NPMNS00_
+Red Hot	NPMP000_
🧹 Nitpick comments (2)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (2)

Line range hint 1-24: File header documentation could be improved

The file header provides basic copyright information but could be enhanced with:

  • Purpose and scope of the multiwords list
  • Description of the tagging scheme used
  • Guidelines for maintaining/updating the file
# Portuguese — Grammar and Typo Rules for LanguageTool.
+# Purpose: Multiword expressions dictionary for Portuguese language processing
+# Tag format: NCMS000 = Noun Common Masculine Singular
+# Tag reference: See https://freeling.org/
+#
# © 2016–2022 Tiago F. Santos ([email protected])

Line range hint 1-2100: Consider adding pronunciation information

For proper nouns, especially foreign ones, consider adding pronunciation guidance to help users.

Example format:

+# Format: term<tab>tag<tab>pronunciation
 Gipsy Kings	NPMPO00_	/ˈdʒɪpsi kɪŋz/
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 56a79d0 and 1c995e1.

📒 Files selected for processing (1)
  • languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (2)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (2)

Line range hint 28-35: Section headers improve file organization

Good use of section headers with clear separation using # marks. This makes the file well-organized and easy to navigate.


Line range hint 1-2100: Verify tag accuracy for proper nouns

Many proper noun tags need verification to ensure they correctly reflect gender, number and other grammatical features. Consider running automated validation.

✅ Verification successful

Proper noun tags are correctly assigned

The grammatical tags for proper nouns in the multiwords.txt file are accurate and consistent with Portuguese language rules. The tags properly reflect:

  • Gender (masculine/feminine) based on the grammatical gender of the main noun
  • Number (singular/plural) matching the form of the expression
  • Special cases like organizations and geographic locations follow standard Portuguese conventions
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for inconsistent tags on similar terms
grep -P "NP[MF][SP]" multiwords.txt | sort > tags.txt
# Manual review needed to identify patterns

Length of output: 92


Script:

#!/bin/bash
# Get proper noun entries with context for analysis
echo "=== Proper noun tag patterns ==="
rg "^[^#].*\tNP[MF][SP]" languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt -A 1 -B 1

echo -e "\n=== Tag distribution statistics ==="
rg "^[^#].*\tNP[MF][SP][^0-9]*$" languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr

Length of output: 73197
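
The tag-distribution step of the script above (second column, sorted by frequency) can be sketched in Python. It assumes "term<TAB>tag" lines with "#" comments. `tag_distribution` is a hypothetical helper, and the third sample entry is invented for illustration.

```python
from collections import Counter


def tag_distribution(lines, prefix="NP"):
    """Count tags in the second tab-separated column that start with the given prefix."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or "\t" not in line:
            continue
        tag = line.rstrip("\n").split("\t")[1]
        if tag.startswith(prefix):
            counts[tag] += 1
    # most_common() sorts by descending count, like `sort | uniq -c | sort -nr`.
    return counts.most_common()


sample = [
    "# hypothetical sample",
    "Red Hot Chili Peppers\tNPMP000",
    "Red Hot\tNPMNS00",
    "Some Band\tNPMP000",
]
print(tag_distribution(sample))
```

Rare tags at the bottom of such a list are natural candidates for the manual review mentioned above.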
