Cb changes 20241213 #11107

tiff · 2024-12-13T13:59:31Z

Summary by CodeRabbit

New Features
- Expanded vocabulary with numerous scientific names for species and various proper names.
- Added new entries for German and English compound words, enhancing language processing capabilities.
- Introduced new rules and patterns for improved grammar checking in German.
- Enhanced multi-word term recognition in English with contemporary references.
- Added new terms to common vocabulary lists for Dutch.
Bug Fixes
- Improved handling of word repetition for specific terms in both English and German.
Documentation
- Updated regex patterns and exception lists for capitalization rules in German.
Chores
- Added various terms to ignore lists for both English and German spell checkers.

coderabbitai · 2024-12-13T13:59:40Z

Walkthrough

The changes in this pull request involve extensive updates across multiple language resource files within the LanguageTool project. These updates primarily focus on expanding the vocabulary and knowledge base by adding numerous scientific names of species, proper nouns, and specific terms across various contexts. Additionally, some existing entries have been modified for better accuracy and comprehensiveness, particularly in the areas of grammar rules and compound words.

Changes

File	Change Summary
`languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt`	Added a large number of scientific names for species, proper names (people, places, organizations), and updated existing entries with middle initials or full names.
`languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanWordRepeatRule.java`	Added a new anti-pattern for the repetition of the token "gout" to the `ANTI_PATTERNS` list.
`languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/case_rule_exceptions.txt`	Updated regex pattern to include "Bachelors?|Bachelorette".
`languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt`	Added new compound words and updated existing entries for pluralization and grammatical forms.
`languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt`	Added new terms to the ignore list, including `XYZ`, `abc`, and `al-Bashir/S #name`.
`languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt`	Introduced various new terms related to academic and medical contexts, including several variations of "Coxsackievirus" and terms related to Ivy League institutions.
`languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt`	Added new entries for "Low Performer," "High Performer," and various names and terms.
`languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml`	Expanded regex patterns in rules to include the term "phas" for matching German numerical adjectives.
`languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml`	Added new rules and adjusted existing ones to enhance grammar checking capabilities.
`languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java`	Added a condition to ignore the repetition of the word "gout" in the `ignore` method.
`languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt`	Added numerous compound words and modified existing entries for suggestion behavior.
`languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt`	Added a variety of new terms to the ignore list, including proper nouns and acronyms.
`languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt`	Introduced new vocabulary related to biology, technology, and culture, expanding the spell checker’s lexicon.
`languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt`	Updated with new multi-word terms and ensured consistency in formatting, including the removal of outdated terms.
`languagetool-language-modules/nl/src/main/resources/org/languagetool/resource/nl/common_words.txt`	Added numerous new common Dutch words to expand the vocabulary.

Possibly related PRs

Cb changes 20240830 #10856: Adds numerous proper nouns to spelling_global.txt, similar to the current PR's focus on expanding the knowledge base.
Cb changes 20240905 #10865: Enhances vocabulary in spelling_global.txt with new entries, aligning with the current PR's objectives.
Cb changes 20241105 #10987: Introduces a large number of scientific names and proper nouns to spelling_global.txt, closely related to the current PR.
Cb changes 20241111 #11010: Continues the addition of scientific names and proper nouns to spelling_global.txt, directly relating to the current changes.
Cb changes 20241113 #11021: Adds significant proper nouns and scientific names to spelling_global.txt, reinforcing the current PR's vocabulary expansion.
Cb changes 20241203 #11081: Introduces scientific names and proper nouns to spelling_global.txt, consistent with the current PR's focus.
(FR) Adding new words #11083: Adds new words and proper nouns to the French language module, sharing the common theme of vocabulary expansion.


coderabbitai

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (4)

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (3)
Line range hint 1-8019: Consider standardizing the format.

Minor format inconsistencies found:
- Inconsistent spacing around tabs
- Inconsistent comment formatting (some with spaces after #, some without)

Consider using a script to standardize:
- Exactly one tab between term and POS tag
- One space after # for comments
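
A minimal sketch of such a script, assuming entries have the form term, tab, tag and that comment lines start with #. The function name `normalize_multiwords` is hypothetical and not part of LanguageTool; trailing comments on entry lines are left untouched.

```bash
#!/bin/bash
# Hypothetical helper (not part of LanguageTool): reads multiwords-style
# lines on stdin and writes normalized lines to stdout.
normalize_multiwords() {
  awk '
    /^#/ {
      sub(/^#[ \t]*/, "# ")   # exactly one space after the leading "#"
      sub(/[ \t]+$/, "")      # drop trailing whitespace (a bare "#" stays "#")
      print
      next
    }
    {
      gsub(/[ \t]*\t[ \t]*/, "\t")  # whitespace around a tab becomes one tab
      print
    }
  '
}

# Example usage (review the resulting diff before committing):
#   normalize_multiwords < multiwords.txt > multiwords.normalized.txt
```

Writing to a separate file rather than editing in place keeps the change easy to inspect with a normal diff.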

Line range hint 1-8019: Consider improving content organization.

The current organization could be enhanced by:
- Maintaining strict alphabetical ordering
- Grouping related terms together
- Consolidating duplicate variations

Consider:
- Using a script to maintain alphabetical order
- Adding section headers for different categories
- Adding cross-references between related terms
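
As a rough illustration of the ordering check, the sketch below tests whether the terms in the first tab-separated column are in case-insensitive order, skipping comment and blank lines. The function name `terms_sorted` is hypothetical, and the byte-wise `LC_ALL=C` collation is an assumption that may not match the ordering the maintainers want.

```bash
#!/bin/bash
# Hypothetical helper (not part of LanguageTool): exits 0 when the terms
# (first tab-separated column) read from stdin are in case-insensitive
# order, and non-zero otherwise. Comment and blank lines are ignored.
terms_sorted() {
  grep -v -e '^#' -e '^$' | cut -f1 | LC_ALL=C sort -c -f 2>/dev/null
}

# Example usage:
#   terms_sorted < multiwords.txt || echo "multiwords.txt is not sorted"
```

A check like this only reports whether the file is ordered; it would not work together with category section headers unless each section were checked separately.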

POS tag consistency needs improvement in multiwords.txt

Several issues were identified:
- Inconsistent usage of UN/U suffix for uncountable nouns (e.g., NN:UN vs NN:U)
- Non-standard _multiword_ tag used for 60+ entries without clear criteria
- Some entries have comments mixed with tags
- Inconsistent handling of similar terms (e.g., some Latin phrases tagged as _multiword_, others as NN or RB)

Recommendations:
- Standardize uncountable noun tags to either NN:U or NN:UN
- Replace _multiword_ with standard POS tags based on usage
- Move comments to separate lines or dedicated fields
- Apply consistent tagging for similar phrase types (Latin expressions, foreign phrases, etc.)
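
The first recommendation could be sketched as follows, assuming the tag is the second tab-separated column and matches the old tag exactly. The function name `retag` is hypothetical, and which of the two tags to keep is left to the caller; entries whose tag column also carries a trailing comment are not changed.

```bash
#!/bin/bash
# Hypothetical helper (not part of LanguageTool): replaces one POS tag with
# another in the second tab-separated column. Comment lines pass through.
#   $1 = tag to replace, $2 = replacement tag
retag() {
  awk -F'\t' -v OFS='\t' -v from="$1" -v to="$2" '
    /^#/ { print; next }        # leave comment lines untouched
    $2 == from { $2 = to }      # exact match on the tag column only
    { print }
  '
}

# Example usage:
#   retag 'NN:U' 'NN:UN' < multiwords.txt > multiwords.retagged.txt
```

Matching the whole column rather than a substring avoids accidentally rewriting NN:UN when the target is NN:U.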

🔗 Analysis chain

Line range hint 1-8019: Verify POS tag consistency and accuracy.

Some observations:

Non-standard tag _multiword_ is used without documentation

Some entries could benefit from more specific POS tags

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Check for unique POS tags and their frequencies
echo "Analyzing POS tag distribution..."
awk -F'\t' '{print $2}' multiwords.txt | cut -d' ' -f1 | sort | uniq -c | sort -nr

# Check for potentially inconsistent tagging
echo "Checking for similar terms with different tags..."
awk -F'\t' 'length($1)>0 {print length($1),$1,$2}' multiwords.txt | sort -k2 | awk '{if(last==$2 && lastpos!=$3) print last,lastpos,$2,$3; last=$2; lastpos=$3}'
Length of output: 603

Script:
#!/bin/bash
# Search for the file with a different pattern
fd -t f "multiwords"

# Try searching for specific POS tag patterns to analyze consistency
rg "^.*\t(NNP|NNPS|NN|NNS|JJ|RB|UH|FW|SYM|CC|_multiword_)" -l

# Look for the actual content
rg "^.*\t.*$" --no-filename | head -n 5
Length of output: 3627

Script:
#!/bin/bash
# Look at the English multiwords file content
echo "Examining English multiwords file..."
rg -A 5 "^[^#].*\t" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

# Count occurrences of each POS tag
echo "POS tag distribution..."
rg "^[^#].*\t" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt | awk -F'\t' '{print $2}' | sort | uniq -c

# Check for potentially inconsistent tags
echo "Checking for potential inconsistencies..."
rg "^[^#].*(_multiword_|:)" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
Length of output: 73804

Script:
#!/bin/bash
# Check for specific patterns of inconsistent POS tagging
echo "Checking for inconsistent POS tag patterns..."
rg "^[^#].*\t.*:.*$" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

# Check for _multiword_ tag usage
echo "Checking _multiword_ tag usage..."
rg "^[^#].*_multiword_" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

# Check for entries with multiple tags
echo "Checking entries with multiple tags..."
rg "^[^#].*\t.*\s.*\s" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
Length of output: 5599
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1)
Line range hint 26878-26942: Consider categorizing entries for better maintenance.

The entries span various categories (people, places, brands, etc.). Consider organizing them with comments or separating them into category-specific files for easier maintenance.

Example structure:
+ # Places
  Ko Samui
  Koh Samui
+ # Media
  Fox & Friends
  Fox and Friends
+ # Airlines
  Royal Dutch Airlines
  Qeshm Air
+ # Musicians/Performers
  Cecilia Bartoli
  Joyce DiDonato
+ # Brands
  Victorinox
  Dassault Falcon

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f698253 and 61872b0.

📒 Files selected for processing (16)

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (3 hunks)
languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanWordRepeatRule.java (1 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/case_rule_exceptions.txt (1 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt (1 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml (1 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt (1 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (2 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (3 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (2 hunks)
languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1 hunks)
languagetool-language-modules/nl/src/main/resources/org/languagetool/resource/nl/common_words.txt (0 hunks)

🔥 Files not summarized due to errors (1)

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml: Error: Server error: no LLM provider could handle the message

💤 Files with no reviewable changes (1)

languagetool-language-modules/nl/src/main/resources/org/languagetool/resource/nl/common_words.txt

✅ Files skipped from review due to trivial changes (1)

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt

🔇 Additional comments (20)

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml (1)

9-9: LGTM! Consider verifying pattern coverage.

The zal_adj_ig entity pattern maintains proper German orthography and follows the established format for number-adjective combinations.

Let's verify the pattern coverage by checking for test cases:

✅ Verification successful

Pattern is well-integrated and verified in the codebase

The zal_adj_ig entity is properly integrated into the disambiguation system:

It's used in conjunction with zal_adj entity for number-adjective combinations
The pattern is referenced in disambiguation rules (e.g., "-köpfig" rule)
Found test examples with number-adjective combinations (e.g., "250-jährige", "14-jähriges", "40-jähriger")
The pattern follows the established format and is used consistently throughout the German language module

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for corresponding test cases that validate the number-adjective combinations

# Search for test cases in XML files that might test these patterns
rg -l --type xml "zal_adj_ig|jähr|bänd|seit|monat|täg" languagetool-language-modules/de/src/test/

# Look for example sentences in test resources
rg -A 2 -B 2 --type xml "<marker>" languagetool-language-modules/de/src/test/ | grep -E "[0-9]+.*(jährig|bändig|seitig|monatig|tägig)"

Length of output: 236

Script:

#!/bin/bash
# Let's try a different approach to find test cases and usage patterns

# Search for test files that might contain number-adjective combinations
fd ".*\.xml$" languagetool-language-modules/de/src/test/

# Search for disambiguation rules using this entity
rg -A 2 -B 2 "zal_adj_ig" languagetool-language-modules/de/src/main/resources/

# Check if there are any unit tests for disambiguation
fd ".*Test.*\.java$" languagetool-language-modules/de/src/test/

# Look for example sentences in any XML files under the German module
rg --type xml "(ein|zwei|drei|vier|fünf).*(jährig|bändig|seitig|monatig|tägig)" languagetool-language-modules/de/

Length of output: 14161

languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (2)

23968-23968: Pattern consistency verified across rules

The term "phas" has been consistently added to all three related rules:

Rule for separated numbers (e.g., "3 phasig")
Rule for connected numbers (e.g., "3phasig")
Rule for written numbers (e.g., "dreiphasig")

This maintains the pattern consistency required for comprehensive grammar checking.

Also applies to: 23980-23980, 24001-24001

23968-23968: Verify German compound word formation with "phasig"

The addition of "phas" to the patterns appears to handle compounds like "3-phasig" (3-phase), which follows German compound word formation rules. Let's verify this is a common usage pattern.

Also applies to: 23980-23980, 24001-24001

✅ Verification successful

Addition of "phas" pattern is consistent with German compound formation

The search results confirm that compound words with "phasig" are well-established in German:

Found in all three major German dictionaries (DE, CH, AT)
Consistent pattern with written numbers (ein-, zwei-, drei-, vier-, fünf-, sechs-)
All entries marked as adjectives (/A)

The addition of "phas" to the grammar rules aligns with these dictionary entries and will help detect incorrect compound formations with numbers.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for existing usage patterns of "phasig" compounds in German text examples
# This helps verify if the grammar rule addition aligns with real-world usage

# Search for examples of numeric compounds with "phasig"
rg -i "[0-9][-]?phasig" "languagetool-language-modules/de/src/main/resources/org/languagetool/resource/"

# Search for written number compounds with "phasig"
rg -i "(ein|zwei|drei|vier|fünf|sechs|sieben|acht|neun|zehn)[-]?phasig" "languagetool-language-modules/de/src/main/resources/org/languagetool/resource/"

Length of output: 2291

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2)

65440-65441: LGTM! New terminology additions are well-formed.

The new adjectives and technical terms follow proper German word formation rules and include appropriate grammatical markers:

syrischstämmig/A (adjective)
privatinsolvent/A (adjective)
Techanalyse/N (noun with plural form)
Gender-specific forms for Techanalyst/in

Also applies to: 65443-65444, 65446-65451

44198-44202: LGTM! Verify real-world usage patterns.

The Ivy League related terms follow proper German compound word formation rules and include appropriate gender variations. The grammatical markers are consistent.

Let's verify the usage patterns of these terms in German:

✅ Verification successful

The Ivy League terms are correctly formatted and consistent with existing patterns

The added Ivy League terms follow the established patterns in the German spelling dictionary:

They use proper hyphenation similar to other compound foreign terms (e.g., "Design-Thinking-Konzept/S", "Coffee-to-go-Shop/S")
They include appropriate affix flags (/S adds the -s ending, /N adds the -n ending)
They provide comprehensive variations (university, student, gender-specific forms)
The compound structure matches other established foreign term compounds in the dictionary

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check usage patterns of Ivy League terms in German
# This helps validate if the chosen forms match common usage

# Search for existing usage patterns in German text files
rg -i "ivy.?league" | grep -i "\.de\|\.at\|\.ch" || echo "No matches in German domain content"

Length of output: 131

Script:

#!/bin/bash
# Let's check the actual entries in the spelling dictionary for Ivy League terms
rg -A 5 -B 5 "Ivy" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt

# Also check if there are any existing German language files or documentation
# that might contain guidance about these terms
fd -e md -e txt . languagetool-language-modules/de | xargs rg -l -i "ivy|lehnwort|fremdwort|anglizismus"

Length of output: 1123

Script:

#!/bin/bash
# Let's check the content of the other files that matched our search
# to understand how similar foreign terms are handled
rg -A 2 -B 2 "fremdwort|anglizismus|lehnwort" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-ignore.txt

# Also check if there are any existing compound words with similar patterns
rg ".*-.*-.*/(S|N)" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt | head -n 10

Length of output: 621

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (2)

583-584: Verify hyphenation consistency

The dictionary includes both hyphenated and non-hyphenated variants of the same term: "nonenveloped" and "non-enveloped". While both forms might be valid, consider standardizing on one form for consistency.

✅ Verification successful

The English language module has a coherency.txt file that explicitly defines pairs of acceptable variants for "non-" words, including both hyphenated and non-hyphenated forms. The file shows that both forms are valid for many similar medical/scientific terms.

Both forms are valid and standardized in the dictionary

The presence of both "nonenveloped" and "non-enveloped" in the spelling dictionary is consistent with LanguageTool's coherency rules, which accept both hyphenated and non-hyphenated variants for similar scientific terms (e.g., "non-vascular/nonvascular", "non-binary/nonbinary", etc.).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for other hyphenation inconsistencies in the dictionary
rg -A 1 "^non-\w+" 
rg -A 1 "^non\w+"

Length of output: 68451

573-587: Verify medical term accuracy

The added virus-related terms form a coherent set (echovirus, enterovirus, coxsackievirus). These are technically accurate as they represent a family of viruses, with Coxsackie being a specific type of enterovirus.

✅ Verification successful

Medical terms verified as accurate and complete

The virus-related terms added to the spelling dictionary are technically accurate and form a comprehensive set that includes the main virus types and their variants:

echovirus/echoviruses
enterovirus/enteroviruses/enteroviral
coxsackievirus/coxsackieviruses/Coxsackie
Related terms: pathogenic, pathogenicity, serotype/serotypes, nonpolio, nonenveloped/non-enveloped

These terms are consistent with medical terminology used across multiple language modules in the codebase and accurately represent this family of viruses, with proper singular/plural forms and variant spellings included.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any missing related virus terms in the dictionary
rg -i "virus"

Length of output: 65593

languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanWordRepeatRule.java (1)

125-128: LGTM! Addition follows established pattern

The new anti-pattern for "gout gout" is correctly implemented using the same format as other similar entries in the ANTI_PATTERNS list.

languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java (1)

172-173: LGTM! Addition follows established pattern

The new condition for "gout gout" is correctly implemented using the repetitionOf() helper method, consistent with other similar entries in the ignore() method.

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/case_rule_exceptions.txt (1)

365-365: LGTM: Valid addition to Golden pattern

The addition of "Bachelors?|Bachelorette" to the Golden pattern follows the established format and correctly handles capitalization exceptions for these terms.

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)

Line range hint 3133-3159: LGTM: Valid additions to multitoken suggestions

The new entries follow the established format with appropriate suffix indicators (/S, /N) and maintain consistency with existing patterns.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1)

10973-10981: LGTM: Valid additions to spellchecker ignore list

The new entries are appropriate additions to the ignore list and follow the established format.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt (2)

9055-9056: LGTM! The new compound word entries are properly formatted.

The additions follow the file's conventions:

Includes both singular and plural forms
Uses the "?" marker consistently for suggesting lower-cased joined variants
Properly hyphenated medical terms

9055-9056: Verify consistency with other medical terminology in the codebase.

Let's ensure these medical terms are consistently handled across the language resources.

✅ Verification successful

Medical terminology follows consistent hyphenation patterns in compounds.txt

The verification shows that:

The hyphenated forms "coxsackie-virus" and "coxsackie-viruses" follow the same pattern as other virus compounds like "rota-virus", "polio-virus", "noro-virus", etc.
The spelling dictionary includes both hyphenated and non-hyphenated variants (coxsackievirus, Coxsackie)
The optional question mark suffix is consistently used across all virus compounds to make the hyphenation optional

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistency of medical terminology across language resources

# Check for related terms in spelling dictionary
echo "Checking spelling dictionary for related terms..."
rg -i "coxsackie|virus" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt"

# Check for any existing virus-related compound patterns
echo "Checking for virus-related compound patterns..."
rg -i "[a-z]+-virus" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt"

Length of output: 714

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)

Line range hint 1-4: Documentation looks good!

The file header clearly explains the purpose, usage criteria, and case-sensitivity rules for multi-word terms.

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt (3)

Line range hint 1-19: LGTM! Well-documented format specification.

The file header provides clear documentation about:

Purpose: German compound words that are usually not written as separate words
Encoding: UTF-8
Special markers ($, ?, *, +) and their effects on suggestions
Pattern support (\d+)
File maintenance tips

Line range hint 11622-11626: LGTM! Business terminology follows established patterns.

The new entries for "Inhouse-Seminar" and its variations follow the established pattern for business terminology in the file, using the $ marker to suggest lower-cased non-hyphenated variants.

11617-11621: Verify medical terminology accuracy.

The new entries for Coxsackie virus variants follow the correct medical terminology and German compound word formation rules. The hyphenation is consistent with similar medical terms in the file.

✅ Verification successful

Medical terminology verified and consistent across German language files

The verification shows that:

The Coxsackie virus entries are properly documented in both compounds.txt and spelling.txt
Both hyphenated (Coxsackie-Virus) and non-hyphenated (Coxsackievirus) forms are included in spelling.txt
The hyphenation pattern is consistent with other virus entries (e.g., RS-Virus, Noro-Virus, Mpox-Virus)
Both singular (-Virus) and plural (-Viren) forms are consistently maintained
Neo-Angin is correctly documented as a medical brand name with appropriate markings

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify consistency of medical terminology across language files
rg -i "coxsackie|neo-angin" "languagetool-language-modules/*/src/main/resources/org/languagetool/resource/*/hunspell/spelling.txt"

Length of output: 273

Script:

#!/bin/bash
# Let's verify the medical terminology in German dictionaries and spelling files
rg -i "coxsackie|neo-angin" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/

# Check for similar virus entries to verify consistency
rg -i "virus|viren" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt

Length of output: 2313

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (2)

26878-26879: Verify intentional inclusion of both variants.

The file includes both "Fox & Friends" and "Fox and Friends". While both variants are commonly used, verify if this is intentional as it might affect consistency in text processing.

✅ Verification successful

Both variants of "Fox & Friends" are correctly included in the global spelling dictionary

The inclusion of both "Fox & Friends" and "Fox and Friends" in the spelling dictionary is intentional and correct. This is a common practice for proper nouns that have multiple accepted written forms, ensuring that both variants are recognized as valid spellings. The search results confirm these entries only exist in the global spelling dictionary, which is the appropriate location for such variants.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if these variants are consistently handled across other language files
rg -i "Fox.*(and|&).*Friends" --type txt

Length of output: 238

26940-26942: Ensure consistent handling of name prefixes.

The entries "al-Bashir" and "Al Bashir" represent variants of the same name. This matches the related change mentioned in multiwords.txt ("Al Bashir NNP"), but we should verify consistent handling across files.

✅ Verification successful

Consistent handling of "al-" prefix variants is already implemented across language modules.

The search results show that both variants "al-Bashir" and "Al Bashir" are consistently handled across different language modules:

German module has "al-Bashir/S" in ignore.txt
English module has "Al Bashir NNP" in multiwords.txt
Dutch module has "al-Bashir" in spelling.txt
Global spelling has both "al-Bashir" and "Al Bashir"

This pattern of supporting both hyphenated and space-separated variants is also consistently applied to other Arabic names with the "al-" prefix across the codebase (e.g., al-Assad/Al Assad, al-Qaeda/Al Qaeda), indicating this is an intentional design to support multiple valid representations of these names.
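
One way to see how the variants line up across files is to reduce every entry to a separator- and case-insensitive key and group by it. The sketch below is illustrative only: `variant_key` and `group_variants` are hypothetical helpers, the shortened file labels are placeholders, and the sample entries are the ones quoted in the search results above.

```python
import re
from collections import defaultdict


def variant_key(entry):
    """Reduce a dictionary line to a case- and separator-insensitive key."""
    term = entry.split("\t")[0]        # drop a POS tag such as "NNP"
    term = term.split("/")[0].strip()  # drop a hunspell flag such as "/S"
    return re.sub(r"[-\s]+", " ", term).lower()


def group_variants(entries_by_file):
    """Map each normalized key to the (file, raw entry) pairs that produce it."""
    groups = defaultdict(list)
    for filename, entries in entries_by_file.items():
        for entry in entries:
            groups[variant_key(entry)].append((filename, entry))
    return dict(groups)


# File labels are shortened placeholders, not real repository paths.
sample = {
    "de/ignore.txt": ["al-Bashir/S"],
    "en/multiwords.txt": ["Al Bashir\tNNP"],
    "nl/spelling.txt": ["al-Bashir"],
    "spelling_global.txt": ["al-Bashir", "Al Bashir"],
}
groups = group_variants(sample)
for key, hits in groups.items():
    print(key, hits)
```

All five entries collapse to the single key `al bashir`, which makes it easy to list which module carries which spelling.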

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for consistency in handling of "al-" prefix across files
rg -i "al-?\s?bashir" --type txt

# Check if there are other similar prefix patterns that might need similar treatment
rg -i "^al-" --type txt

Length of output: 22181

coderabbitai · 2024-12-13T14:03:18Z

...ol-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt

+Neo-Angin/S #name
+privatinsolvent/A
+oberhell/A
+blitzkriegartig/A


⚠️ Potential issue

Consider alternative terminology for "blitzkriegartig".

The term "blitzkriegartig" might be sensitive due to its historical connotations. Consider using alternatives like "blitzschnell" or "schlagartig" instead.

coderabbitai · 2024-12-13T14:03:18Z

...ol-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt

+Coxsackievirus
+Coxsackieviren
+Coxsackie-Virus
+Coxsackie-Viren
+Coxsackie-A-Virus
+Coxsackie-A-Viren


🛠️ Refactor suggestion

Ensure consistent hyphenation in virus terminology.

The Coxsackievirus entries show inconsistent hyphenation:

"Coxsackievirus" (without hyphen)

"Coxsackie-Virus" (with hyphen)

Medical terminology should follow a consistent pattern.

Consider standardizing all entries to use hyphens:

-Coxsackievirus
-Coxsackieviren
+Coxsackie-Virus
+Coxsackie-Viren

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai · 2024-12-13T14:03:18Z

...ol-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt

@@ -65432,6 +65437,24 @@
 Durchfahrtssperre/N
 Bøgseth/S
 AnnenMayKantereit/S
+syrischstämmig/A
+Dunkelflaute/N
+Neo-Angin/S #name


💡 Codebase verification

Review of medication trademark handling in German spelling dictionary

Based on the search results, I found several medication-related trademarks that should be consistently marked with #name:

Chlorhexamed (antiseptic medication)

Medistar (medical software)

Mundipharma/S (pharmaceutical company)

These entries appear to be trademarks similar to Neo-Angin/S but currently lack the #name marker.
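
A check for the missing marker could look roughly like the following. It is a sketch under stated assumptions: `entries_without_name_marker` is a hypothetical helper, the candidate brand list has to be supplied by hand, and the sample lines are modeled on the entries named above rather than copied from the file.

```python
def entries_without_name_marker(lines, candidates):
    """Return candidate brand names whose dictionary line lacks a '#name' comment."""
    unmarked = []
    for line in lines:
        # Assumption: everything after the first '#' is the comment.
        body, _, comment = line.partition("#")
        word = body.split("/")[0].strip()
        if word in candidates and comment.strip() != "name":
            unmarked.append(word)
    return unmarked


# Illustrative lines; the exact flags in spelling.txt may differ.
sample_lines = [
    "Neo-Angin/S #name",
    "Chlorhexamed",
    "Medistar",
    "Mundipharma/S",
    "privatinsolvent/A",
]
brands = {"Neo-Angin", "Chlorhexamed", "Medistar", "Mundipharma"}
print(entries_without_name_marker(sample_lines, brands))
```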

🔗 Analysis chain

Verify trademark handling for Neo-Angin.

The entry "Neo-Angin/S" is correctly marked with #name as it's a trademark. Consider if other branded medications in the dictionary should follow the same pattern.

Let me generate new shell scripts to specifically search for medication names in the German spelling dictionary:

This will help us identify if there are other medication names that should be marked with #name like Neo-Angin/S.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other medication names that might need #name marking

# Search for potential medication names without #name marking
rg -l '[A-Z][a-z]+(?:-[A-Z][a-z]+)?(?:/[A-Z])?' | xargs rg -v '#name' | grep -i 'med\|pharma\|drug'

Length of output: 69674

Script:

#!/bin/bash
# Search for medication-related entries in the German spelling dictionary
rg -i "(?:med|pharm|drug|angin)" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt

Length of output: 4010

coderabbitai · 2024-12-13T14:03:18Z

...ool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml

+            <rule>
+                <pattern>
+                    <token>sorry</token>
+                    <token>für</token>
+                </pattern>
+                <example correction=""><marker>Sorr für</marker> die Unannehmlichkeit.</example>
+            </rule>


🛠️ Refactor suggestion

Enhance the rule implementation for 'sorry für' pattern

The rule implementation has several issues that need to be addressed:

Missing suggestion for the correct German alternative

Pattern doesn't match the example case ("Sorr für" vs "sorry für")

Pattern might be too broad without additional context checks

Consider enhancing the rule with:

            <rule>
                <pattern>
-                    <token>sorry</token>
-                    <token>für</token>
+                    <token regexp="yes">sorry|sorr</token>
+                    <token>für</token>
                </pattern>
+                <suggestion>Entschuldigung für</suggestion>
                <example correction="">Sorr für die Unannehmlichkeit.</example>
+                <example>Es tut mir leid für die Unannehmlichkeit.</example>
            </rule>

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-            <rule>
-                <pattern>
-                    <token>sorry</token>
-                    <token>für</token>
-                </pattern>
-                <example correction=""><marker>Sorr für</marker> die Unannehmlichkeit.</example>
-            </rule>
+            <rule>
+                <pattern>
+                    <token regexp="yes">sorry|sorr</token>
+                    <token>für</token>
+                </pattern>
+                <suggestion>Entschuldigung für</suggestion>
+                <example correction="">Sorr für die Unannehmlichkeit.</example>
+                <example>Es tut mir leid für die Unannehmlichkeit.</example>
+            </rule>
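
The mismatch between the pattern and its example can be reproduced outside LanguageTool with a rough simulation. The sketch below is not LanguageTool code: it assumes that each pattern token must match one whole input token and that matching ignores case, and it uses a deliberately naive tokenizer. `tokenize` and `pattern_matches` are hypothetical names.

```python
import re


def tokenize(sentence):
    """Very rough word tokenizer; LanguageTool's own tokenizer is more involved."""
    return re.findall(r"\w+", sentence)


def pattern_matches(sentence, token_patterns):
    """True if consecutive tokens fully match the given regexes, ignoring case."""
    tokens = tokenize(sentence)
    n = len(token_patterns)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # Assumption: whole-token, case-insensitive matching per pattern token.
        if all(re.fullmatch(p, t, re.IGNORECASE)
               for p, t in zip(token_patterns, window)):
            return True
    return False


original = ["sorry", "für"]        # tokens from the rule as committed
suggested = ["sorry|sorr", "für"]  # tokens from the suggested change
example = "Sorr für die Unannehmlichkeit."

print(pattern_matches(example, original))
print(pattern_matches(example, suggested))
```

Under these assumptions the committed pattern does not cover the example sentence, while the suggested one does. The alternative fix is to correct the example text to "Sorry für" and leave the pattern unchanged.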

coderabbitai · 2024-12-13T14:03:19Z

...ool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml

@@ -1831,7 +1831,7 @@
            <rule>
                <pattern>
                    <marker>
-                        <token regexp="yes">Barbarazweige?|obengenannt</token>
+                        <token regexp="yes">Barbarazweige?|obengenannt.*|privatinsolvent.*</token>


💡 Codebase verification

Based on the findings, I can now generate the final response:

Add examples and documentation for new orthography space rule patterns

The rule additions need proper documentation and examples:

obengenannt.* and privatinsolvent.* lack example cases in the rule definition

All terms are listed in the spelling dictionaries, suggesting they are valid words

The rule group AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE appears to be for catching missing spaces, but the specific cases for these terms are not documented

Please:

Add example cases for obengenannt and privatinsolvent patterns

Document why these terms specifically need orthography space checks

Consider adding test cases to verify the correct behavior

🔗 Analysis chain

Verify and document the new pattern additions

The addition of "Barbarazweige?|obengenannt.*|privatinsolvent.*" to the pattern needs verification and documentation.

Let's verify the usage of these terms in the codebase and documentation:

Please:

Add examples for each new term

Document why these terms require orthography space checks

Consider adding test cases to verify the rules

Let me analyze the findings:

The terms are listed in spelling dictionaries:

"Barbarazweig/E" and "Barbarazweigen" in spelling.txt

"obengenannt/nA" in all German dictionaries (DE, CH, AT)

"privatinsolvent/A" in spelling.txt

There's only one example for "Barbarazweige" but none for the other terms.

Let me check if there are any message/description files that might explain these rules:

Let me check for any rule descriptions or messages that might explain the purpose of these additions:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for occurrences and documentation of the new terms

# Search for existing examples or documentation
rg -A 5 -B 5 "Barbarazweige|obengenannt|privatinsolvent"

# Search for any related test cases
fd -e "test.xml" -e "test.txt" -x rg "Barbarazweige|obengenannt|privatinsolvent" {}

Length of output: 8391

Script:

#!/bin/bash
# Check for rule descriptions and messages
rg -A 5 -B 5 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/

# Check if these terms appear in other rule files
fd -e xml -x rg "Barbarazweige?|obengenannt|privatinsolvent" {} languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/

Length of output: 66429

Script:

#!/bin/bash
# Search for rule descriptions or messages containing these terms or rule ID
rg -B 5 -A 5 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" languagetool-language-modules/de/src/main/resources/org/languagetool/messages/

# Check for any XML comments near the rule definition
rg -B 10 -A 2 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" --type xml

Length of output: 2365

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61872b0 and f29879a.

📒 Files selected for processing (2)

languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (2 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1 hunks)

🔇 Additional comments (3)

languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (2)

1834-1834: 🛠️ Refactor suggestion

Add examples and documentation for new orthography space rule patterns

The new pattern additions Barbarazweige?|obengenannt.*|privatinsolvent.* lack proper documentation and examples to demonstrate their usage.

Add example cases for each term to validate the rule behavior:

            <rule>
                <pattern>
                    <marker>
                        <token regexp="yes">Barbarazweige?|obengenannt.*|privatinsolvent.*</token>
                    </marker>
                </pattern>
                <example correction="">Barbarazweige sind Zweige von Obstbäumen.</example>
+               <example correction="">Der obengenannte Vertrag ist gültig.</example>
+               <example correction="">Die privatinsolvente Person muss...</example>
            </rule>
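
To see which word forms the widened token regex covers, it can be tried against a few candidates. This is a sketch under two assumptions that are not confirmed by the PR: the regex has to match the whole token, and matching ignores case. `token_matches` is a hypothetical helper.

```python
import re

# Assumption: whole-token, case-insensitive matching.
PATTERN = re.compile(
    r"Barbarazweige?|obengenannt.*|privatinsolvent.*", re.IGNORECASE
)


def token_matches(token):
    """True if the entire token is covered by the rule's regex."""
    return PATTERN.fullmatch(token) is not None


for word in ["Barbarazweig", "Barbarazweige", "Barbarazweigen",
             "obengenannte", "privatinsolvente", "privat"]:
    print(word, token_matches(word))
```

Under these assumptions the trailing `.*` picks up inflected forms such as "obengenannte" and "privatinsolvente", whereas `Barbarazweige?` still stops short of "Barbarazweigen".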

2003-2009: 🛠️ Refactor suggestion

Enhance the rule implementation for 'sorry für' pattern

The rule implementation has several issues that need to be addressed:

Missing suggestion for the correct German alternative
Pattern doesn't match the example case ("Sorr für" vs "sorry für")
Pattern might be too broad without additional context checks

Consider enhancing the rule with:

            <rule>
                <pattern>
-                    <token>sorry</token>
-                    <token>für</token>
+                    <token regexp="yes">sorry|sorr</token>
+                    <token>für</token>
                </pattern>
+                <suggestion>Entschuldigung für</suggestion>
                <example correction="">Sorr für die Unannehmlichkeit.</example>
+                <example>Es tut mir leid für die Unannehmlichkeit.</example>
            </rule>

Likely invalid or redundant comment.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)

8019-8019: LGTM! The new entry follows the file guidelines.

The entry "Al Bashir NNP" is properly formatted and meets the requirements:

Single part of speech (NNP)
No disambiguation needed
Follows the term<TAB>POS format (term and tag separated by a tab)
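
The format requirements listed above can be expressed as a small validator. This is a sketch only: `parse_multiword` is a hypothetical function, and it assumes each line holds exactly one term and one tag separated by a single tab, as in the quoted multiwords.txt lines.

```python
def parse_multiword(line):
    """Split a 'term<TAB>POS' line; raise ValueError if the shape is off."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one tab: {line!r}")
    term, pos = parts
    if not term.strip() or not pos.strip():
        raise ValueError(f"empty term or tag: {line!r}")
    if " " in pos.strip():
        raise ValueError(f"more than one tag: {line!r}")
    return term, pos


print(parse_multiword("Al Bashir\tNNP"))
```

A line that uses a space instead of a tab between term and tag is rejected, which is the kind of slip that is hard to spot in a diff.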

coderabbitai · 2024-12-13T14:36:31Z

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

@@ -8016,6 +8016,7 @@ Dickson Fjord	NNP
 Iron Boulder	NNP
 prima facie	UH
 Masoud Pezeshkian	NNP
+Al Bashir	NNP


💡 Codebase verification

Inconsistent formatting found for "Al Bashir" across language modules

The name appears with different formats across language resources:

English (en): "Al Bashir" (space-separated)

Dutch (nl): "al-Bashir" (hyphenated)

German (de): Both "Al Bashir/S" and "al-Bashir/S" (inconsistent within same language)

🔗 Analysis chain

Verify consistent usage across the codebase.

Let's check if this term is used consistently in other language resources.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any variations of "Al Bashir" in other language resources
# to ensure consistent usage.

# Search for variations in language resource files
rg -i "al.?bashir" "languagetool-language-modules/*/src/main/resources/"

Length of output: 166

Script:

#!/bin/bash
# Let's try a more targeted search approach

# First, let's find all resource directories
fd -t d "resources" languagetool-language-modules

# Then search for variations of the name in text files
rg -i "al.?bashir" $(fd -t d "resources" languagetool-language-modules)

# Also search specifically in multiwords.txt files across language modules
fd -g "*multiwords.txt" languagetool-language-modules

Length of output: 6374

tiff added 5 commits December 13, 2024 14:58

[core] add global spelling

43d31d1

[de] improve rules

ef8fc4e

[de] add spelling

d1e4ba3

[nl] update common_words.txt

d85e839

[en] improve rules

61872b0

coderabbitai bot reviewed Dec 13, 2024

tiff added 2 commits December 13, 2024 15:34

[de] fix build

91e4e67

[en] fix build

f29879a

coderabbitai bot reviewed Dec 13, 2024

tiff merged commit c8ca4f9 into master Dec 13, 2024
5 checks passed

This was referenced Dec 15, 2024

Optimizations, part 2 #11064

Merged

Cb changes 20241217 #11120

Merged

coderabbitai bot mentioned this pull request Jan 10, 2025

Cb changes 20250110 #11181

Merged
