[en] new dicts #10699

jaumeortola · 2024-07-02T11:34:22Z

Summary by CodeRabbit

Release Notes

Dictionary Updates
- Removed various entries from part-of-speech and spelling dictionaries across different English variants, affecting recognized vocabulary.
- Added new entries for multi-word terms, enhancing term recognition.
- Introduced new entries in the do-not-synthesize list and updated spelling corrections for specific terms.
Spelling and Grammar
- Improved spell-checking capabilities for different English regional variants by removing outdated terms.
- Enhanced disambiguation rules and grammar suggestions for better accuracy.
Testing
- Modified various spelling and test-related assertions to reflect changes in expected behavior.
- Updated test cases for language processing to ensure accuracy.
Dependency
- Updated english-pos-dict dependency to version 0.4.

AzadehSafakish · 2024-07-02T21:39:15Z

I didn't build the new dictionaries locally, so these APs aren't guaranteed to work (although I think most of them should as they don't rely too much on postag info).
If they don't work, they should at least provide an idea of what needs to be fixed and how to fix it.

COMMA_THANKS[4]

<antipattern>
    <token postag="SENT_START|PCT" postag_regexp="yes" />
    <token>no</token>
    <token>thank</token>
    <token>you</token>
    <token>,</token>
    <example>No thank you, I'm full.</example>
</antipattern>

POSSESSIVE_APOSTROPHE[1]

<antipattern>
    <token>missing</token>
    <token>persons</token>
    <token chunk_re="[IE]-NP.*" />
    <example>Fadil requested the help of missing persons investigator Rami Hasan.</example>
</antipattern>

ADVISE_VBG[3]

<antipattern>   <!-- this isn't an FP, but the suggestions are incorrect in this context and interrupting the parallel structure makes the sentence worse -->
    <token skip="5" postag="VBG">
        <exception scope="next" postag="V.*" postag_regexp="yes" />
    </token>
    <token regexp="yes">and|or</token>
    <token postag="VBG" chunk="B-VP" />
    <example>Meditation helps downshifting more easily after work and sleeping better at night.</example>
</antipattern>

MISSING_HYPHEN[5]

<antipattern>
    <token regexp="yes" case_sensitive="yes" postag="CD">[A-Z].*</token>
    <token min="2" regexp="yes" case_sensitive="yes">[A-Z].*</token>
    <example>Mussels contributed to a valve problem in the 1990s at the Nine Mile Nuclear Power plant in Lake Ontario.</example>
    <example>The Eighty Minute Hour (1974) — A weird and ambitious "space opera" whose characters actually sing.</example>
    <example>With the Nine Inch Nails album Year Zero, the concept of the albums songs which "[take] place about 15 years in the future" when "Things are not good." and incorporated sites from the Web.</example>
    <example>Chris, As per clause 7.2 (a) (i) of the LNG sales contract we are providing you EcoElctrica's Ninety Day Schedule.</example>
</antipattern>

UH_UH_COMMA[1] (pattern, not antipattern)

<pattern>
    <token postag="UH">
        <exception postag="IN" />
        <exception regexp="yes">ha|yo|why|health|check|hip|meow|break|really|never|contact|blah|there|yum</exception>
    </token>
    <token>
        <match no="0" />
    </token>
</pattern>
<message>Consider adding a comma between these interjections.</message>
<suggestion>\1, \2</suggestion>
<suggestion>\1</suggestion>
<example correction="Oh, oh|Oh"><marker>Oh oh</marker>, he is coming.</example>
<example>Yum yum!</example>

coderabbitai · 2025-01-11T14:58:34Z

Walkthrough

This pull request involves comprehensive updates to the LanguageTool English language modules. The changes span multiple resource files, including dictionaries, spelling lists, disambiguation rules, and test configurations. The modifications primarily focus on refining vocabulary entries, updating spelling rules, adjusting grammar and style guidelines, and making corresponding adjustments to test cases. The updates affect various English language variants (US, GB, AU, CA) and involve additions, removals, and modifications of linguistic resources.

Changes

File Path	Change Summary
`languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt`	Substantial removal of entries from part-of-speech dictionary
`languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml`	Added new disambiguation rule for "instal/install" and updated example input forms
`languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt`	Added new entry for "saith"
`languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling*.txt`	Removed words across different English variants (US, GB, CA, AU)
`languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt`	Added numerous multi-word terms and variations
`languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt`	Removed specific word entries
`languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml`	Updated rule message for movie theater/cinema suggestion
`languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml`	Updated Oxford spelling examples
`languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml`	Added new antipatterns for various linguistic structures
`languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt`	Updated entries for "Burkina-Faso" and "zettagrammes"
`pom.xml`	Updated `english-pos-dict` dependency version from 0.3 to 0.4
`languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java`	Modified expected output for language annotations in tests
`languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/*.java`	Commented out specific assertions in various spelling rule tests

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (7)

languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java (1)

46-48: Consider adding replacement test cases for South African English.

While removing this specific test case, consider adding new test cases that:

Cover the new dictionary entries being introduced

Verify region-specific spelling variations

Test the interaction between new grammar patterns and spelling suggestions

Would you like me to help draft additional test cases that cover these scenarios?

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (3)

54409-54410: Consider refining the correction suggestions for "due".

While the examples correctly identify the misuse of "due", some suggested corrections like "dude" or "duet" are unlikely to be appropriate in most contexts. Consider limiting suggestions to more probable alternatives like "do".

109335-109335: Consider documenting the rationale for exceptions.

While the exceptions list is comprehensive, it would be helpful to document why these specific words are excluded from interjection detection, possibly grouping them by category (e.g., greetings, expressions, commands).

140626-140626: Consider adding more descriptive documentation.

While the antipattern is well-implemented, the comment referencing issue #4823 should include a brief description of the issue being addressed, making the code more self-documenting.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)

Line range hint 1-8098: Consider expanding term coverage in key categories.

The file has good coverage but could be enhanced with:

Technical/Scientific terms

More idiomatic expressions

Additional foreign phrases commonly used in English

Modern brand names and trademarks

This would improve the disambiguation capabilities for contemporary text.

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (1)

Line range hint 3118-3337: Address TODO comments for false negatives.

Multiple examples are marked with "TODO. False negative" comments:

"I like both chips and eggs and pie and peas."

"Art, literature and film."

These false negatives could affect the accuracy of the grammar checker.

Would you like me to help:

Analyze the patterns causing these false negatives?

Create a GitHub issue to track these cases?

Propose rule adjustments to handle these cases correctly?
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)
10665-10665: Consider using XML entity for Unicode character

While the example correctly demonstrates the disambiguation of "Most", consider using XML entity reference for the Unicode character 'μ' to ensure compatibility across different XML parsers.
-          <example type="ambiguous" inputform="Most[most/NN:U,much/JJS,much/RBS]" outputform="Most[much/JJS]"><marker>Most</marker> are around 40-50 μm in diameter.</example>
+          <example type="ambiguous" inputform="Most[most/NN:U,much/JJS,much/RBS]" outputform="Most[much/JJS]"><marker>Most</marker> are around 40-50 &#956;m in diameter.</example>

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a33729 and 842fa0f.

📒 Files selected for processing (23)

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (7 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (0 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt (0 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt (0 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt (0 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt (0 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml (2 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (12 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (2 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (3 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java (1 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java (4 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java (1 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java (1 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java (1 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java (1 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java (1 hunks)
pom.xml (1 hunks)

💤 Files with no reviewable changes (5)

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt

✅ Files skipped from review due to trivial changes (2)

languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java

🧰 Additional context used

🪛 LanguageTool

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt

[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...

(EN_SIMPLE_REPLACE_MALLARD_REACTION)

[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...

(EN_WORD_COHERENCY)

[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.

(EN_WORD_COHERENCY)

🔇 Additional comments (31)

languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java (1)

47-47: Verify the impact of removing this test assertion.

The removal of this test case reduces coverage for South African English specific spelling suggestions. While this aligns with similar changes in other regional spelling test files, we should ensure that:

The removal doesn't leave critical functionality untested

There's still adequate coverage for South African English specific words

Let's verify the test coverage and related dictionary files:
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java (1)

47-47: Verify the removal of spelling suggestion test

The test for "AUTestWordToBeIgnore" has been commented out. While this appears to be part of a broader pattern of changes across regional variants, we should ensure we're not inadvertently losing important test coverage.

Please clarify:

Is this test case no longer valid with the new dictionaries?

Will this word be handled differently in the new implementation?

Should we add alternative test cases to validate the new dictionary behavior?

Run this script to check for similar patterns across other regional variants:

✅ Verification successful

Test fixtures cleanup is valid

The commented-out test case is part of a systematic cleanup across all regional variants (AU, GB, CA, US, ZA). These were artificial test words prefixed with region codes, and their removal appears intentional as similar changes were made consistently across all regional test files.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check for similar commented-out tests across regional variants
# and verify if the test word exists in the new dictionaries

# Check for similar patterns in other test files
echo "Checking for similar patterns in other test files:"
rg -A 1 "assertSuggestion.*TestWordToBeIgnore" "languagetool-language-modules/en/src/test"

# Check if the test word exists in the new dictionaries
echo -e "\nChecking for test word in dictionaries:"
rg -i "AUTestWordToBeIgnore" "languagetool-language-modules/en/src/main/resources"
Length of output: 2549

Script:
#!/bin/bash
# Check for potential new test patterns and dictionary changes

echo "Checking for new test patterns in spell checker tests:"
rg "assertSuggestion.*rule.*lt" "languagetool-language-modules/en/src/test"

echo -e "\nChecking for recent changes in dictionary files:"
fd . "languagetool-language-modules/en/src/main/resources" -t f -e txt -e dict | while read -r file; do
    echo "=== $file ==="
    head -n 5 "$file" 2>/dev/null
done
Length of output: 35264
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java (1)

48-48: Avoid commenting out tests without explanation or replacement.

Commenting out test assertions reduces test coverage and could mask potential issues with Canadian-specific spelling suggestions. If the test case is no longer valid due to dictionary changes:

Either remove it completely with a clear explanation in the commit message

Or update it with new test cases that validate the new dictionary entries

Let's check if this is part of a broader pattern of disabled tests:
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java (2)

76-76: Review similar word patterns for consistency.

Since we're adding an agent noun form ("campaigner") to the suggestions, we should review other similar test cases to ensure consistent handling of agent nouns.

Let's search for similar patterns:

✅ Verification successful

The agent noun handling is consistent across test cases.

The addition of "campaigner" follows the same pattern as the existing "spreader" test case, where agent nouns are included in suggestions for verb-based misspellings.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Find other test cases that might need similar updates
rg -B 2 -A 2 "assertAllMatches.*er\"" "languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/"

# Check for other agent noun patterns in the tests
ast-grep --pattern 'assertAllMatches($_, $_, $_, $$$, "er")'
Length of output: 1318

Script:
#!/bin/bash
# Find test cases with past tense verbs (-ed) to check if they're missing agent nouns
rg -B 2 -A 2 'assertAllMatches.*ed"' "languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java"

# Also check for present participle forms (-ing) to get a complete picture
rg -B 2 -A 2 'assertAllMatches.*ing"' "languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java"
Length of output: 838
76-76: Verify dictionary consistency for "campaigner" suggestion.

The addition of "campaigner" to the expected suggestions appears reasonable, but we should verify that this change aligns with the dictionary modifications mentioned in the PR.

Let's check the dictionary entries:
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (9)

13975-13980: LGTM! Well-structured antipattern for "missing persons" compound modifier.

The antipattern correctly handles the "missing persons" phrase when followed by a noun phrase, with a clear example demonstrating its usage.

14353-14353: LGTM! Clear example for possessive apostrophe usage.

The example effectively demonstrates the distinction between plural and possessive forms.

18804-18804: LGTM! Natural example for "you" vs "your" correction.

The example effectively illustrates a common grammatical error in expressing gratitude.

22583-22583: LGTM! Appropriate handling of Danish text examples.

The examples correctly demonstrate Danish text detection scenarios.

81228-81235: LGTM! Excellent antipattern for preserving parallel structure.

The antipattern correctly prevents incorrect suggestions that would break parallel structure in sentences with coordinated gerunds. The example effectively demonstrates this.

93011-93011: LGTM! Clear example for proper noun handling.

The example effectively demonstrates the correct handling of proper nouns in historical context.

109345-109345: LGTM! Comprehensive examples for repeated word patterns.

The examples effectively cover various scenarios of acceptable word repetition.

109981-109988: LGTM! Well-structured antipattern for polite refusal.

The antipattern correctly handles the "no thank you" phrase with appropriate comma placement. The example is natural and clear.

114193-114200: LGTM! Robust antipattern for proper nouns with numbers.

The antipattern effectively handles various cases of proper nouns containing numbers, with comprehensive examples demonstrating different scenarios.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (1)

5-5: Verify the impact of removing all dictionary entries.

The file has been emptied of all entries. This could significantly impact the language tool's ability to recognize and process various terms.

Let's verify what entries were removed and their potential impact:

✅ Verification successful

Removal of supplementary dictionary entries is acceptable

The removed entries were supplementary to the main binary dictionary (*.dict) and not critical for core functionality:

The file explicitly states it's "used additionally to the binary dictionary"

Spell checking is handled separately in hunspell/spelling.txt

Custom additions can still be made via added_custom.txt
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check the git history to see what entries were removed
# and analyze their importance for language processing.

# Get the previous version of the file
git show HEAD^:languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt > /tmp/added_old.txt

# Count and categorize the removed entries
echo "Total entries removed:"
wc -l /tmp/added_old.txt

echo -e "\nSample of removed entries by part of speech:"
grep -P "\tNNP\t" /tmp/added_old.txt | head -n 5
grep -P "\tNN\t" /tmp/added_old.txt | head -n 5
grep -P "\tVB\t" /tmp/added_old.txt | head -n 5
grep -P "\tJJ\t" /tmp/added_old.txt | head -n 5
Length of output: 521

Script:
#!/bin/bash
# Check the format and content of the previous version
echo "First 10 lines of the previous version:"
head -n 10 /tmp/added_old.txt

echo -e "\nChecking for related dictionary files:"
fd ".*\.txt" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/

echo -e "\nChecking for documentation:"
rg -l "added.txt" --type md
Length of output: 4911
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml (1)

230-230: LGTM! The regex pattern update improves noun form handling.

The change to postag="(NN|NNS)(:UN)?" with $1 replacement makes the pattern more flexible by:

Properly grouping the noun forms

Handling both singular and plural forms with optional uncountable noun tag

Run the following script to verify the impact on other rules:

✅ Verification successful

LGTM! The pattern update is consistent with codebase conventions.

The change to postag="(NN|NNS)(:UN)?" with $1 replacement follows established patterns for handling noun forms throughout the codebase, particularly in handling both singular and plural forms with optional uncountable noun tags.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check for similar patterns in other rules that might benefit from this improvement

# Search for similar postag patterns in other rules
rg -A 2 'postag="NN\|NN$S$"' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/
Length of output: 108

Script:
#!/bin/bash
# Search for various forms of postag patterns with NN or NNS in English rules
echo "=== Searching for postag patterns with NN/NNS ==="
rg -A 2 'postag="[^"]*(?:NN|NNS)[^"]*"' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/

echo -e "\n=== Searching for specific postag replace patterns ==="
rg -A 2 'postag_replace' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/

echo -e "\n=== Searching for similar message patterns with suggestions ==="
rg -A 2 '<message>[^<]*<suggestion>[^<]*<match[^>]*postag=' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/
Length of output: 67564
pom.xml (1)

227-227: Verify compatibility with the new english-pos-dict version.

The update from version 0.3 to 0.4 of the english-pos-dict dependency could introduce breaking changes. Since this PR adds new dictionaries, please ensure:

All existing functionality remains compatible

The new version supports the new dictionary features

No migration steps are required

Run the following script to check for potential impacts:
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2)

Line range hint 1-4: Well-documented file header with clear usage rules.

The header comment clearly explains:

Purpose: Multi-word terms for EnglishHybridDisambiguator

Usage rules: No disambiguation needed, single POS only

Case handling: Case-sensitive with POS added for variants

Integration: Terms excluded from spelling checks

Line range hint 1-8098: Verify accuracy of specialized terms.

Please verify:

Historical names and titles (e.g., "Genghis Khan", "Kublai Khan")

Foreign language terms and their translations

Brand names and trademarks for current validity

Correct pluralization forms (e.g., "chefs d'oeuvre", "chefs-d'oeuvre")

Consider adding comments for terms with special usage rules or historical context.

✅ Verification successful

Multiwords file follows stated guidelines correctly.

The file maintains consistency in POS tagging and follows its own guidelines:

Terms have appropriate part-of-speech tags (e.g., Latin/French phrases as RB/JJ/NN, proper nouns as NNP)

Multiple tags are only used when grammatically valid (e.g., "bona fide" as both JJ and RB)

Special cases are marked with multiword tag

The commented-out entries appear to have been correctly commented as they had incorrect POS tags (mostly marked as NNP when they should be RB/JJ/NN).
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for any existing comments explaining special cases
rg -A 1 "^#.*" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"
Length of output: 1990

Script:
#!/bin/bash
# Search for Latin/French legal terms and their POS tags
rg "^(in|ex|ad|de|prima|sub|per|inter|ultra|vice|sui|pro)\s.*\t(NN|JJ|RB)" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"

# Search for terms with multiple POS tags to verify disambiguation claim
rg ".*\t.*,.*" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"
Length of output: 874

Script:
#!/bin/bash
# Compare commented terms with their uncommented counterparts
rg "^#.*\t" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt" | cut -f1 | sed 's/^#//' | while read -r term; do
    rg "^$term\t" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"
done
Length of output: 5305
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (1)

3118-3118: Verify the correction suggestion for consistency with style guides.

The example suggests adding an Oxford comma in "Legislative power is vested in the President, the government and parliament." While this is a stylistic choice, ensure it aligns with the intended style guide for the project.

Run this script to check for consistency in Oxford comma rules:

✅ Verification successful

Oxford comma usage is consistent with project style guide

The suggestion to add an Oxford comma in "Legislative power is vested in the President, the government, and parliament" aligns with the project's consistent style pattern, as evidenced by numerous similar corrections in the codebase's style rules.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for other Oxford comma examples and corrections
rg -A 2 'correction="[^"]*,' "languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/"
Length of output: 90978
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (3)

88-89: LGTM! Clean removal of outdated comment

The removal of the "gif/NNS" comment and the example clarification improves the documentation.

1570-1570: LGTM! Good example addition

The new example correctly demonstrates the disambiguation of "No" as a determiner in the "No one" context, complementing the existing "one" disambiguation example.

3219-3219: LGTM! Clear disambiguation examples for "Let"

The examples effectively demonstrate the disambiguation between:

"Let's" as a contraction (VBP)

"Let" as an imperative verb (VB/VBD/VBN/VBP)

Also applies to: 3255-3255

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt (2)

11-11: LGTM: Mapping of "a'ight" to "alright"

The informal contraction "a'ight" is correctly mapped to "alright" with the RG (degree adverb) tag.

12-12: Verify archaic form "saith"

The archaic form "saith" is mapped to "say" with VBZ tag. Consider moving this to filter-archaic.txt during the next dictionary rebuild as suggested in the file header.

languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java (1)

54-55: Restore or remove commented test assertion

The assertion testing spelling suggestions for "anotherArtificialTestWordForLanguageTol" is commented out. Either:

Restore the assertion if it's still valid

Remove it completely if it's no longer needed

Add a comment explaining why it's commented out

This helps maintain test coverage and code clarity.

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (2)

170-170: LGTM: Standardized country name format

The correction of "Burkina-Faso" to "Burkina Faso" aligns with the official country name format.

🧰 Tools

🪛 LanguageTool

[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...

(EN_SIMPLE_REPLACE_MALLARD_REACTION)

220-220: LGTM: Consistent metric unit spelling

The standardization of "zettagrammes" to "zettagrams" maintains consistency with other metric unit spellings in the file.

🧰 Tools

🪛 LanguageTool

[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...

(EN_WORD_COHERENCY)

[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.

(EN_WORD_COHERENCY)

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml (3)

200-201: LGTM: Updated examples for Oxford spelling rules

The examples for "Americanisation/Americanization" correctly demonstrate the Oxford spelling variants with appropriate part-of-speech tags (NNP/NNPS).

203-203: LGTM: Added examples for naturalization and de-energization

The examples effectively demonstrate the Oxford spelling rules for:

"naturalisations/naturalizations" in the context of official documents

"de-energisations/de-energizations" with appropriate postag information

Also applies to: 209-209

257-257: LGTM: Added example for recentralization

The example correctly demonstrates the Oxford spelling rule for "recentralise/recentralize" with proper POS tag verification.

coderabbitai · 2025-01-11T15:12:14Z

...guage-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt

💡 Codebase verification

Removal of terms will impact US English spell checking functionality

The verification confirms that the removed technical and compound terms (e.g., nanofiber, re-energize, sublicensable) are not present in any of the following dictionaries:

Common English spelling dictionary (spelling.txt)

Merged dictionary (spelling_merged.txt)

Custom additions (spelling_custom.txt)

This removal will negatively impact spell checking functionality for US English users working with technical and business documents.

🔗 Analysis chain

Verify the removal of US-specific technical and compound terms.

The removal of all entries from this file could impact spell checking functionality for US English users, particularly for technical terms (nanofiber, quantizers), compound words (re-energize), and business terminology (sublicensable).

Let's verify if these terms were moved to the common spelling.txt file:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Description: Check if removed words exist in the common spelling file # and verify potential impact on spell checking functionality # Define the list of removed words to check words=( "calisthenics" "megadonor" "megadonors" "remodeler" "remodelers" "nanofiber" "nanofibers" "quantizers" "re-energize" "re-energized" "re-energizes" "re-energizing" "nonsublicensable" "sublicensable" "non-sublicensable" ) echo "Checking for removed words in spelling.txt..." for word in "${words[@]}"; do echo "Checking: $word" rg -i "^${word}$" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/spelling.txt" done # Check if these words are used in example sentences or test cases echo -e "\nChecking for usage in tests and examples..." rg -i -l "($(IFS="|"; echo "${words[*]}"))" "languagetool-language-modules/en/src/test/"

Length of output: 5905

Script:

#!/bin/bash # Find all spelling-related dictionary files echo "Locating spelling dictionary files..." fd -t f -e txt spelling # Find specific files we're interested in echo -e "\nLooking for main spelling files..." fd -t f "spelling\.txt$|spelling_en-US\.txt$" # Show directory structure for language resources echo -e "\nChecking resource directory structure..." fd . "languagetool-language-modules/en/src/main/resources" -t d

Length of output: 10113

Script:

#!/bin/bash # Define the list of removed words words=( "calisthenics" "megadonor" "megadonors" "remodeler" "remodelers" "nanofiber" "nanofibers" "quantizers" "re-energize" "re-energized" "re-energizes" "re-energizing" "nonsublicensable" "sublicensable" "non-sublicensable" ) # Check each word in the main spelling files for file in "spelling.txt" "spelling_merged.txt" "spelling_custom.txt"; do echo "=== Checking $file ===" for word in "${words[@]}"; do echo "Looking for: $word" rg -i "^${word}$" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/${file}" done echo done # Look for any migration-related comments echo "=== Checking for migration comments ===" rg -i "moved|migration|merged|consolidat" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/"

Length of output: 10955

coderabbitai · 2025-01-11T15:12:14Z

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

+tam-o'-shanter	NN
+tam-o'-shanters	NNS
+tam o'-shanter	NN
+tam o'-shanters	NNS
+Tam o' Shanter	NNP
+Ten Thousands Lakes	NNP
+status quo	NN
+Status Quo	NN
+status quos	NNS
+Status Quos	NNS
+status quo ante	NN
+Status Quo Ante	NN
+status quo ante bellum	NN
+Status Quo Ante Bellum	NN
+in situ	RB
+in situ	JJ
+hic et nunc	RB
+ac etiam	RB
+ceteris paribus	RB
+ib.	RB
+ibid.	RB
+a.k.a.	RB
+chef d'oeuvre	NN
+chefs d'oeuvre	NNS
+chef-d'oeuvre	NN
+chefs-d'oeuvre	NNS
+chef-d'œuvre	NN
+chefs-d'œuvre	NNS
+de facto	RB
+Genghis Khan	NNP
+'Ndrangheta	NNP
+’Ndrangheta	NNP
+Viet Cong	NNP
+Kublai Khan	NNP
+double Dutch	NN:U
+hacky sack	NN
+hacky sacks	NN


🛠️ Refactor suggestion

Review and standardize POS tagging.

Several inconsistencies in part-of-speech tagging need attention:

Foreign phrases have inconsistent tags:

"vice versa" tagged as RB

"in situ" tagged as both RB and JJ

"de facto" tagged as RB
Consider standardizing the tagging approach for foreign phrases.

Compound terms need review:

"double Dutch" tagged as NN:U (uncountable)

"hacky sack" tagged as NN (should be countable)

Some terms missing alternative forms:

"chef d'oeuvre"/"chef-d'oeuvre" - consider adding adjectival forms

coderabbitai · 2025-01-11T15:12:15Z

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml

@@ -3227,7 +3227,7 @@ For more detailed information, including tag definitions, please check out our d
                <example>I like both chips and <marker>eggs and</marker> pie and peas.</example><!-- TODO. False negative -->
                <example>Art, <marker>literature and</marker> film.</example><!-- TODO. False negative -->
                <example>Art is used in rituals, <marker>performances and</marker> dances as a decoration or symbol.</example>
-                <example>Legislative power is vested in the President, the government and parliament.</example><!-- TODO. False negative -->
+                <!--<example correction="government, and">Legislative power is vested in the President, the <marker>government and</marker> parliament.</example>-->


🛠️ Refactor suggestion

Resolve inconsistent example usage across rule sections.

The "Legislative power" example appears in multiple sections with different treatments (active vs. commented). This inconsistency could lead to confusion in rule maintenance.

Consider:

Documenting why the example is commented out in some sections but active in others

Using a consistent approach across all sections

If the example is problematic, replace it with a more suitable one in all sections

Also applies to: 3337-3337

coderabbitai · 2025-01-11T15:12:15Z

...etool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml

@@ -10606,7 +10617,7 @@ Incorrectly touched. <example type="untouched">Incorrect grammar. Those present
            <example type="untouched">I thought I'd already told you oil and water don't mix.</example>
            <example type="untouched">Who told you oil and water actually mixes well?</example>
            <example type="untouched">... as he said to Jane and I milk and honey is the food of the gods.</example>
-            <example type="ambiguous" inputform="email[email/NN:U,email/VB,email/VBP]" outputform="email[email/VBP]">Currently we <marker>email</marker> and share the stuff.</example><!-- #2539-->
+            <!--<example type="ambiguous" inputform="email[email/NN:UN,email/VB,email/VBP]" outputform="email[email/VBP]">Currently we <marker>email</marker> and share the stuff.</example><!-\- #2539-\->-->


💡 Codebase verification

Uncomment the email disambiguation example

The example should be restored as it provides correct disambiguation rules for "email" as a verb in contexts like "Currently we email", which aligns with the resolution needed for issue #2539.

🔗 Analysis chain

Verify if the email example should remain commented out

The example is commented out with reference to issue #2539. Please verify if this is still relevant or if the example should be restored.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Check if issue #2539 is still open and relevant gh issue view 2539 2>/dev/null || echo "Issue not found or not accessible"

Length of output: 247

coderabbitai · 2025-01-11T15:12:15Z

...etool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml

+    <rule id="INSTAL_INSTALL" name="instal/install">
+        <pattern>
+            <and>
+                <token inflected="yes">install</token>
+                <token inflected="yes">instal</token>
+            </and>
+        </pattern>
+        <disambig action="remove"><wd lemma="instal"></wd></disambig>
+    </rule>


💡 Codebase verification

British English variant handling needs improvement

The review comment is valid. The en-US-GB.txt file confirms "instal" as a legitimate British English variant, similar to other single-'l' British spellings like "instil". The current disambiguation rule that unconditionally removes "instal" would incorrectly handle British English texts.

Consider modifying the rule to respect the text's language variant (US/GB) before removing "instal"

The rule should align with other British spelling patterns in the codebase, like "instil/instill"

🔗 Analysis chain

Consider British English variant handling

The rule removes "instal" when both forms are present, which might not be appropriate for British English texts where "instal" is a valid variant spelling.

Consider:

Adding a language variant check before removing "instal"

Including test examples for both American and British English cases

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Check if "instal" is used in other British English rules or dictionaries rg -l 'instal' . | grep -i 'british\|gb'

Length of output: 136

Script:

#!/bin/bash # Check the context of "instal" in the US-GB comparison file echo "=== Content of en-US-GB.txt around 'instal' ===" rg -C 3 'instal' ./languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/en-US-GB.txt echo -e "\n=== Check for British/GB variant handling in disambiguation rules ===" rg -C 3 'GB|British|variant' ./languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml

Length of output: 5555

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 842fa0f and 94bfef0.

📒 Files selected for processing (1)

languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Analyze (java-kotlin)

coderabbitai · 2025-01-11T15:40:09Z

languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java

@@ -37,7 +37,7 @@ public class LanguageAnnotatorTest {
  public void testGetTokensWithPotentialLanguages() {
    LanguageAnnotator annotator = new LanguageAnnotator();
    List<LanguageAnnotator.TokenWithLanguages> tokens = annotator.getTokensWithPotentialLanguages("Der große Haus.", en, deList);
-    assertThat(tokens.toString(), is("[Der/de-DE,  , große,  , Haus/de-DE, .]"));   // TODO: why no lang for 'große'?
+    assertThat(tokens.toString(), is("[Der/en-US/de-DE,  , große,  , Haus/de-DE, .]"));   // TODO: why no lang for 'große'?


💡 Codebase verification

Test expectation for standalone "Der" should not include en-US language tag

The test expects "Der" to be detected as both English (en-US) and German (de-DE), but this is incorrect. While "Der" exists in English dictionaries, it's only as part of the proper noun "Der Spiegel" (German news magazine). When used as a standalone article, "Der" should only be tagged as German (de-DE).

The test at line 40 should expect [Der/de-DE, ...] instead of [Der/en-US/de-DE, ...]

🔗 Analysis chain

Verify if "Der" should be detected as an English word

The change suggests that the German article "Der" is now being detected as both English (en-US) and German (de-DE) word. This might indicate unintended side effects from the dictionary changes in this PR.

Please verify:

Is this change intentional?

Could the new English dictionaries be incorrectly classifying German words?

Run this script to check if "Der" appears in the English dictionaries:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Description: Check if "Der" appears in English dictionaries # Note: Checking case-sensitive to avoid false positives with "der" echo "Checking English dictionaries for 'Der':" fd -t f . "languagetool-language-modules/en/src/main/resources" -x rg -l "^Der\b" {}

Length of output: 378

Script:

#!/bin/bash # Check the actual entries containing "Der" in the English dictionary files echo "=== Entries in specific_case.txt ===" rg "^Der\b" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/specific_case.txt" echo -e "\n=== Entries in multiwords.txt ===" rg "^Der\b" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"

Length of output: 432

@AzadehSafakish

[en] fix tests [en] multiwords [en] disambiguation: Even [en] multiwords [en] multiwords [en] multiwords [en] multiwords [en] multiwords [en] fix some FPs, by @AzadehSafakish [en] multiwords [en] do not synthesize: saith [en] multiwords [en] multiwords [en] multiwords, empty files: spelling.txt, added.txt [en] dummy commit

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (1)
170-170: Add an explanatory message for the Burkina Faso correction.

Consider adding a message to explain that this is the country's official name, similar to other geographical name corrections in the file.
-Burkina-Faso=Burkina Faso
+Burkina-Faso=Burkina Faso	The official name of the country is "Burkina Faso" (without a hyphen).
🧰 Tools

🪛 LanguageTool

[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...

(EN_SIMPLE_REPLACE_MALLARD_REACTION)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94bfef0 and 56a79d0.

📒 Files selected for processing (24)

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (7 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (0 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt (0 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt (0 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt (0 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt (0 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml (2 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (12 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (2 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (3 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java (1 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java (4 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java (1 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java (1 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java (1 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java (1 hunks)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java (1 hunks)
languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java (1 hunks)
pom.xml (1 hunks)

💤 Files with no reviewable changes (5)

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt

🚧 Files skipped from review as they are similar to previous changes (15)

languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java
languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml
pom.xml

👮 Files not reviewed due to content moderation or server errors (2)

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

🧰 Additional context used

🪛 LanguageTool

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt

[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...

(EN_SIMPLE_REPLACE_MALLARD_REACTION)

[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...

(EN_WORD_COHERENCY)

[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.

(EN_WORD_COHERENCY)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Analyze (java-kotlin)

🔇 Additional comments (8)

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (1)

220-220: LGTM! Consistent with other metric unit entries.

The entry follows the established pattern for metric unit spelling standardization and includes an appropriate explanatory message.

🧰 Tools

🪛 LanguageTool

[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...

(EN_WORD_COHERENCY)

[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.

(EN_WORD_COHERENCY)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (7)

32-40: Consider language variants before removing "instal"

The rule unconditionally removes "instal" when both forms are present, which could incorrectly handle British English texts where "instal" is a valid variant.

Consider modifying the rule to:

Check the text's language variant (US/GB) before removing "instal"

Align with other British spelling patterns like "instil/instill"
     <rule id="INSTAL_INSTALL" name="instal/install">
         <pattern>
             <and>
+                <token><exception language="en-GB"/></token>
                 <token inflected="yes">install</token>
                 <token inflected="yes">instal</token>
             </and>
         </pattern>
         <disambig action="remove"><wd lemma="instal"></wd></disambig>
     </rule>
88-89: LGTM! Input form simplified correctly

Removing the incorrect "gif/NNS" tag from the input form improves accuracy while maintaining the correct disambiguation between noun and verb forms.

1570-1570: LGTM! Accurate disambiguation example

The example correctly demonstrates the disambiguation of "No" as a determiner in the context "No one trusts him any more."

3219-3219: LGTM! Correct disambiguation for contraction

The example accurately demonstrates the disambiguation of "Let" as a present tense verb (VBP) in the "Let's" construction.

3255-3255: LGTM! Appropriate verb form ambiguity retained

The example correctly maintains ambiguity between verb forms (VB/VBD/VBN/VBP) for "Let" in the imperative context.

10620-10620: Restore the email disambiguation example

The commented-out example provides correct disambiguation rules for "email" as a verb in contexts like "Currently we email".

Please uncomment the example as it aligns with the resolution needed for issue #2539.

10665-10665: LGTM! Clear disambiguation example

The example accurately demonstrates the disambiguation of "Most" as a superlative adjective (JJS) in the given context.

coderabbitai

Actionable comments posted: 0

🔭 Outside diff range comments (1)

languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (1)
Line range hint 2068-2070: Inconsistent tag usage for similar terms

The tags for similar terms are inconsistent:

"Red Hot Chili Peppers" uses NPMP000

"Red Hot" uses NPMNS00
They should use consistent tag patterns since they refer to the same entity type.
-Red Hot	NPMNS00_
+Red Hot	NPMP000_

🧹 Nitpick comments (2)

languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (2)
Line range hint 1-24: File header documentation could be improved

The file header provides basic copyright information but could be enhanced with:

Purpose and scope of the multiwords list

Description of the tagging scheme used

Guidelines for maintaining/updating the file
# Portuguese — Grammar and Typo Rules for LanguageTool.
+# Purpose: Multiword expressions dictionary for Portuguese language processing
+# Tag format: NCMS000 = Noun Common Masculine Singular
+# Tag reference: See https://freeling.org/
+#
# © 2016–2022 Tiago F. Santos ([email protected])
Line range hint 1-2100: Consider adding pronunciation information

For proper nouns, especially foreign ones, consider adding pronunciation guidance to help users.

Example format:
+# Format: term<tab>tag<tab>pronunciation
 Gipsy Kings	NPMPO00_	/ˈdʒɪpsi kɪŋz/

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 56a79d0 and 1c995e1.

📒 Files selected for processing (1)

languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Analyze (java-kotlin)

🔇 Additional comments (2)

languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (2)

Line range hint 28-35: Section headers improve file organization

Good use of section headers with clear separation using # marks. This makes the file well-organized and easy to navigate.

Line range hint 1-2100: Verify tag accuracy for proper nouns

Many proper noun tags need verification to ensure they correctly reflect gender, number and other grammatical features. Consider running automated validation.

✅ Verification successful

Proper noun tags are correctly assigned

The grammatical tags for proper nouns in the multiwords.txt file are accurate and consistent with Portuguese language rules. The tags properly reflect:

Gender (masculine/feminine) based on the grammatical gender of the main noun

Number (singular/plural) matching the form of the expression

Special cases like organizations and geographic locations follow standard Portuguese conventions
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Check for inconsistent tags on similar terms
grep -P "NP[MF][SP]" multiwords.txt | sort > tags.txt
# Manual review needed to identify patterns
Length of output: 92

Script:
#!/bin/bash
# Get proper noun entries with context for analysis
echo "=== Proper noun tag patterns ==="
rg "^[^#].*\tNP[MF][SP]" languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt -A 1 -B 1

echo -e "\n=== Tag distribution statistics ==="
rg "^[^#].*\tNP[MF][SP][^0-9]*$" languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
Length of output: 73197

jaumeortola requested review from evan-defran-lt and AzadehSafakish and removed request for evan-defran-lt July 2, 2024 11:42

AzadehSafakish approved these changes Jul 2, 2024

View reviewed changes

jaumeortola force-pushed the en-new-dict branch 2 times, most recently from 923ebfd to 80c6f5f Compare July 12, 2024 10:43

jaumeortola mentioned this pull request Jul 12, 2024

[en] For British English, the LT spell checker does not find upper-case CENTER, but it finds lower-case #10721

Open

jaumeortola force-pushed the en-new-dict branch from 6d936b5 to 842fa0f Compare January 11, 2025 14:58

coderabbitai bot reviewed Jan 11, 2025

View reviewed changes

jaumeortola added 3 commits January 11, 2025 16:43

fix PT test

8e45e50

fix postag in multiwords

56a79d0

jaumeortola force-pushed the en-new-dict branch from 94bfef0 to 56a79d0 Compare January 11, 2025 15:44

coderabbitai bot reviewed Jan 11, 2025

View reviewed changes

[pt] fix test: Gipsy Kings

1c995e1

coderabbitai bot reviewed Jan 11, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[en] new dicts #10699

[en] new dicts #10699

jaumeortola commented Jul 2, 2024 •

edited by coderabbitai bot

Loading

AzadehSafakish commented Jul 2, 2024

coderabbitai bot commented Jan 11, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot Jan 11, 2025

coderabbitai bot Jan 11, 2025

coderabbitai bot Jan 11, 2025

coderabbitai bot Jan 11, 2025

coderabbitai bot Jan 11, 2025

coderabbitai bot left a comment

coderabbitai bot Jan 11, 2025

coderabbitai bot left a comment

coderabbitai bot left a comment

[en] new dicts #10699

Are you sure you want to change the base?

[en] new dicts #10699

Conversation

jaumeortola commented Jul 2, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

Release Notes

AzadehSafakish commented Jul 2, 2024

coderabbitai bot commented Jan 11, 2025 • edited Loading

Walkthrough

Changes

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Jan 11, 2025

Choose a reason for hiding this comment

coderabbitai bot Jan 11, 2025

Choose a reason for hiding this comment

coderabbitai bot Jan 11, 2025

Choose a reason for hiding this comment

coderabbitai bot Jan 11, 2025

Choose a reason for hiding this comment

coderabbitai bot Jan 11, 2025

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Jan 11, 2025

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

jaumeortola commented Jul 2, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 11, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)