
Cb changes 20241213 #11107

Merged 7 commits on Dec 13, 2024

Member

@tiff tiff commented Dec 13, 2024

Summary by CodeRabbit

  • New Features

    • Expanded vocabulary with numerous scientific names for species and various proper names.
    • Added new entries for German and English compound words, enhancing language processing capabilities.
    • Introduced new rules and patterns for improved grammar checking in German.
    • Enhanced multi-word term recognition in English with contemporary references.
    • Added new terms to common vocabulary lists for Dutch.
  • Bug Fixes

    • Improved handling of word repetition for specific terms in both English and German.
  • Documentation

    • Updated regex patterns and exception lists for capitalization rules in German.
  • Chores

    • Added various terms to ignore lists for both English and German spell checkers.

Contributor

coderabbitai bot commented Dec 13, 2024

Walkthrough

The changes in this pull request involve extensive updates across multiple language resource files within the LanguageTool project. These updates primarily focus on expanding the vocabulary and knowledge base by adding numerous scientific names of species, proper nouns, and specific terms across various contexts. Additionally, some existing entries have been modified for better accuracy and comprehensiveness, particularly in the areas of grammar rules and compound words.

Changes

File: Change Summary

  • languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt: Added a large number of scientific names for species, proper names (people, places, organizations), and updated existing entries with middle initials or full names.
  • languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanWordRepeatRule.java: Added a new anti-pattern for the repetition of the token "gout" to the ANTI_PATTERNS list.
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/case_rule_exceptions.txt: Updated regex pattern to include "Bachelors?|Bachelorette".
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt: Added new compound words and updated existing entries for pluralization and grammatical forms.
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt: Added new terms to the ignore list, including XYZ, abc, and al-Bashir/S #name.
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt: Introduced various new terms related to academic and medical contexts, including several variations of "Coxsackievirus" and terms related to Ivy League institutions.
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt: Added new entries for "Low Performer," "High Performer," and various names and terms.
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml: Expanded regex patterns in rules to include the term "phas" for matching German numerical adjectives.
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml: Added new rules and adjusted existing ones to enhance grammar checking capabilities.
  • languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java: Added a condition to ignore the repetition of the word "gout" in the ignore method.
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt: Added numerous compound words and modified existing entries for suggestion behavior.
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt: Added a variety of new terms to the ignore list, including proper nouns and acronyms.
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt: Introduced new vocabulary related to biology, technology, and culture, expanding the spell checker’s lexicon.
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt: Updated with new multi-word terms and ensured consistency in formatting, including the removal of outdated terms.
  • languagetool-language-modules/nl/src/main/resources/org/languagetool/resource/nl/common_words.txt: Added numerous new common Dutch words to expand the vocabulary.

Possibly related PRs

  • Cb changes 20240830 #10856: Adds numerous proper nouns to spelling_global.txt, similar to the current PR's focus on expanding the knowledge base.
  • Cb changes 20240905 #10865: Enhances vocabulary in spelling_global.txt with new entries, aligning with the current PR's objectives.
  • Cb changes 20241105 #10987: Introduces a large number of scientific names and proper nouns to spelling_global.txt, closely related to the current PR.
  • Cb changes 20241111 #11010: Continues the addition of scientific names and proper nouns to spelling_global.txt, directly relating to the current changes.
  • Cb changes 20241113 #11021: Adds significant proper nouns and scientific names to spelling_global.txt, reinforcing the current PR's vocabulary expansion.
  • Cb changes 20241203 #11081: Introduces scientific names and proper nouns to spelling_global.txt, consistent with the current PR's focus.
  • (FR) Adding new words #11083: Adds new words and proper nouns to the French language module, sharing the common theme of vocabulary expansion.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (4)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (3)

Line range hint 1-8019: Consider standardizing the format.

Minor format inconsistencies found:

  1. Inconsistent spacing around tabs
  2. Inconsistent comment formatting (some with spaces after #, some without)

Consider using a script to standardize:

  • Exactly one tab between term and POS tag
  • One space after # for comments
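The standardization suggested above could be sketched roughly as follows. This is a minimal illustration, not an existing LanguageTool script: the function name `normalize_line` is hypothetical, and it assumes each entry line holds a term, one or more tabs, and then the tag.

```python
import re


def normalize_line(line: str) -> str:
    """Normalize one line of a tab-separated term/POS-tag file (hypothetical helper)."""
    stripped = line.rstrip("\n")
    # Comment lines: enforce exactly one space after a single leading '#'.
    if stripped.startswith("#"):
        body = stripped[1:]
        if body.startswith("#"):
            # Leave decorative '##...' lines untouched.
            return stripped
        body = body.strip()
        return "# " + body if body else "#"
    # Blank lines pass through as empty strings.
    if not stripped.strip():
        return ""
    # Entry lines: split on the first run of tabs and trim both fields.
    parts = re.split(r"\t+", stripped, maxsplit=1)
    if len(parts) != 2:
        return stripped.strip()
    term, tag = parts
    return term.strip() + "\t" + tag.strip()


if __name__ == "__main__":
    for raw in ["ad hoc \t\t RB\n", "#comment", "#   spaced comment"]:
        print(repr(normalize_line(raw)))
```

Running it over the whole file and diffing the result would show how many lines actually deviate before anything is committed.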

Line range hint 1-8019: Consider improving content organization.

The current organization could be enhanced by:

  1. Maintaining strict alphabetical ordering
  2. Grouping related terms together
  3. Consolidating duplicate variations

Consider:

  1. Using a script to maintain alphabetical order
  2. Adding section headers for different categories
  3. Adding cross-references between related terms
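A check for the alphabetical-ordering point could look something like the sketch below. It assumes a case-insensitive ordering on the term before the tab, which may not match the ordering the maintainers actually want; `out_of_order` is a hypothetical name, and the sample tags other than "Al Bashir NNP" are made up.

```python
def out_of_order(entries):
    """Return (previous_term, term) pairs where term sorts before previous_term."""
    problems = []
    previous = None
    for entry in entries:
        # Skip comments and blank lines.
        if entry.startswith("#") or not entry.strip():
            continue
        # Compare only the term, i.e. the text before the first tab.
        term = entry.split("\t", 1)[0].strip()
        if previous is not None and term.casefold() < previous.casefold():
            problems.append((previous, term))
        previous = term
    return problems


if __name__ == "__main__":
    sample = ["# made-up sample", "ad hoc\tRB", "Al Bashir\tNNP", "a priori\tRB"]
    print(out_of_order(sample))
```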

POS tag consistency needs improvement in multiwords.txt

Several issues were identified:

  • Inconsistent usage of UN/U suffix for uncountable nouns (e.g., NN:UN vs NN:U)
  • Non-standard _multiword_ tag used for 60+ entries without clear criteria
  • Some entries have comments mixed with tags
  • Inconsistent handling of similar terms (e.g., some Latin phrases tagged as _multiword_, others as NN or RB)

Recommendations:

  • Standardize uncountable noun tags to either NN:U or NN:UN
  • Replace _multiword_ with standard POS tags based on usage
  • Move comments to separate lines or dedicated fields
  • Apply consistent tagging for similar phrase types (Latin expressions, foreign phrases, etc.)
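To see how widespread each tag variant is before standardizing, a tag count along these lines may help. It is a sketch under the assumption that entries are `term<TAB>tag`, optionally followed by an inline `#` comment; the sample lines are invented for illustration, apart from "Al Bashir NNP", which is mentioned elsewhere in this review.

```python
from collections import Counter


def tag_counts(lines):
    """Count POS tags in term<TAB>tag lines, ignoring comments and blank lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        _term, sep, tag = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab, so no tag field
        # Drop a trailing inline comment, if any.
        tag = tag.split("#", 1)[0].strip()
        if tag:
            counts[tag] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "# invented sample entries",
        "ad hoc\tRB",
        "status quo\tNN:U",
        "small talk\tNN:UN",
        "et cetera\t_multiword_",
        "Al Bashir\tNNP # name",
    ]
    for tag, n in tag_counts(sample).most_common():
        print(n, tag)
```

Seeing NN:U and NN:UN side by side in the output makes it easy to decide which spelling to keep.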
🔗 Analysis chain

Line range hint 1-8019: Verify POS tag consistency and accuracy.

Some observations:

  1. Non-standard tag _multiword_ is used without documentation
  2. Some entries could benefit from more specific POS tags

Verification took several script passes, each narrowing the search to the English multiwords file.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for unique POS tags and their frequencies
echo "Analyzing POS tag distribution..."
awk -F'\t' '{print $2}' multiwords.txt | cut -d' ' -f1 | sort | uniq -c | sort -nr

# Check for potentially inconsistent tagging
echo "Checking for similar terms with different tags..."
awk -F'\t' 'length($1)>0 {print length($1),$1,$2}' multiwords.txt | sort -k2 | awk '{if(last==$2 && lastpos!=$3) print last,lastpos,$2,$3; last=$2; lastpos=$3}'

Length of output: 603


Script:

#!/bin/bash
# Search for the file with a different pattern
fd -t f "multiwords"

# Try searching for specific POS tag patterns to analyze consistency
rg "^.*\t(NNP|NNPS|NN|NNS|JJ|RB|UH|FW|SYM|CC|_multiword_)" -l

# Look for the actual content
rg "^.*\t.*$" --no-filename | head -n 5

Length of output: 3627


Script:

#!/bin/bash
# Look at the English multiwords file content
echo "Examining English multiwords file..."
rg -A 5 "^[^#].*\t" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

# Count occurrences of each POS tag
echo "POS tag distribution..."
rg "^[^#].*\t" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt | awk -F'\t' '{print $2}' | sort | uniq -c

# Check for potentially inconsistent tags
echo "Checking for potential inconsistencies..."
rg "^[^#].*(_multiword_|:)" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

Length of output: 73804


Script:

#!/bin/bash
# Check for specific patterns of inconsistent POS tagging
echo "Checking for inconsistent POS tag patterns..."
rg "^[^#].*\t.*:.*$" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

# Check for _multiword_ tag usage
echo "Checking _multiword_ tag usage..."
rg "^[^#].*_multiword_" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

# Check for entries with multiple tags
echo "Checking entries with multiple tags..."
rg "^[^#].*\t.*\s.*\s" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt

Length of output: 5599

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1)

Line range hint 26878-26942: Consider categorizing entries for better maintenance.

The entries span various categories (people, places, brands, etc.). Consider organizing them with comments or separating them into category-specific files for easier maintenance.

Example structure:

+ # Places
  Ko Samui
  Koh Samui
+ # Media
  Fox & Friends
  Fox and Friends
+ # Airlines
  Royal Dutch Airlines
  Qeshm Air
+ # Musicians/Performers
  Cecilia Bartoli
  Joyce DiDonato
+ # Brands
  Victorinox
  Dassault Falcon
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f698253 and 61872b0.

📒 Files selected for processing (16)
  • languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (3 hunks)
  • languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanWordRepeatRule.java (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/case_rule_exceptions.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (2 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (3 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (2 hunks)
  • languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1 hunks)
  • languagetool-language-modules/nl/src/main/resources/org/languagetool/resource/nl/common_words.txt (0 hunks)
🔥 Files not summarized due to errors (1)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml: Error: Server error: no LLM provider could handle the message
💤 Files with no reviewable changes (1)
  • languagetool-language-modules/nl/src/main/resources/org/languagetool/resource/nl/common_words.txt
✅ Files skipped from review due to trivial changes (1)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
🔇 Additional comments (20)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml (1)

9-9: LGTM! Consider verifying pattern coverage.

The zal_adj_ig entity pattern maintains proper German orthography and follows the established format for number-adjective combinations.

Let's verify the pattern coverage by checking for test cases:

✅ Verification successful

Pattern is well-integrated and verified in the codebase

The zal_adj_ig entity is properly integrated into the disambiguation system:

  • It's used in conjunction with zal_adj entity for number-adjective combinations
  • The pattern is referenced in disambiguation rules (e.g., "-köpfig" rule)
  • Found test examples with number-adjective combinations (e.g., "250-jährige", "14-jähriges", "40-jähriger")
  • The pattern follows the established format and is used consistently throughout the German language module
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for corresponding test cases that validate the number-adjective combinations

# Search for test cases in XML files that might test these patterns
rg -l --type xml "zal_adj_ig|jähr|bänd|seit|monat|täg" languagetool-language-modules/de/src/test/

# Look for example sentences in test resources
rg -A 2 -B 2 --type xml "<marker>" languagetool-language-modules/de/src/test/ | grep -E "[0-9]+.*(jährig|bändig|seitig|monatig|tägig)"

Length of output: 236


Script:

#!/bin/bash
# Let's try a different approach to find test cases and usage patterns

# Search for test files that might contain number-adjective combinations
fd ".*\.xml$" languagetool-language-modules/de/src/test/

# Search for disambiguation rules using this entity
rg -A 2 -B 2 "zal_adj_ig" languagetool-language-modules/de/src/main/resources/

# Check if there are any unit tests for disambiguation
fd ".*Test.*\.java$" languagetool-language-modules/de/src/test/

# Look for example sentences in any XML files under the German module
rg --type xml "(ein|zwei|drei|vier|fünf).*(jährig|bändig|seitig|monatig|tägig)" languagetool-language-modules/de/

Length of output: 14161

languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (2)

23968-23968: Pattern consistency verified across rules

The term "phas" has been consistently added to all three related rules:

  1. Rule for separated numbers (e.g., "3 phasig")
  2. Rule for connected numbers (e.g., "3phasig")
  3. Rule for written numbers (e.g., "dreiphasig")

This maintains the pattern consistency required for comprehensive grammar checking.

Also applies to: 23980-23980, 24001-24001
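The three spellings listed above can be told apart with patterns like the ones sketched here. These are not the patterns from grammar.xml: the stem list, the number words, and the `classify` helper are simplified assumptions, and the sketch only labels the form without reproducing what the real rules do with a match.

```python
import re

# Hypothetical, shortened lists; the real rules cover many more stems and numbers.
STEMS = r"(?:jähr|phas)"
NUMBER_WORDS = r"(?:ein|zwei|drei|vier|fünf|sechs)"

FORMS = {
    "separated": re.compile(rf"\d+ {STEMS}ig\w*"),   # e.g. "3 phasig"
    "connected": re.compile(rf"\d+{STEMS}ig\w*"),    # e.g. "3phasig"
    "written": re.compile(rf"{NUMBER_WORDS}{STEMS}ig\w*"),  # e.g. "dreiphasig"
}


def classify(text):
    """Return the name of the form that matches the whole text, or None."""
    for name, pattern in FORMS.items():
        if pattern.fullmatch(text):
            return name
    return None


if __name__ == "__main__":
    for candidate in ["3 phasig", "3phasig", "dreiphasig", "3-phasig"]:
        print(candidate, "->", classify(candidate))
```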


23968-23968: Verify German compound word formation with "phasig"

The addition of "phas" to the patterns appears to handle compounds like "3-phasig" (3-phase), which follows German compound word formation rules. Let's verify this is a common usage pattern.

Also applies to: 23980-23980, 24001-24001

✅ Verification successful

Addition of "phas" pattern is consistent with German compound formation

The search results confirm that compound words with "phasig" are well-established in German:

  • Found in all three major German dictionaries (DE, CH, AT)
  • Consistent pattern with written numbers (ein-, zwei-, drei-, vier-, fünf-, sechs-)
  • All entries marked as adjectives (/A)

The addition of "phas" to the grammar rules aligns with these dictionary entries and will help detect incorrect compound formations with numbers.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for existing usage patterns of "phasig" compounds in German text examples
# This helps verify if the grammar rule addition aligns with real-world usage

# Search for examples of numeric compounds with "phasig"
rg -i "[0-9][-]?phasig" "languagetool-language-modules/de/src/main/resources/org/languagetool/resource/"

# Search for written number compounds with "phasig"
rg -i "(ein|zwei|drei|vier|fünf|sechs|sieben|acht|neun|zehn)[-]?phasig" "languagetool-language-modules/de/src/main/resources/org/languagetool/resource/"

Length of output: 2291

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2)

65440-65441: LGTM! New terminology additions are well-formed.

The new adjectives and technical terms follow proper German word formation rules and include appropriate grammatical markers:

  • syrischstämmig/A (adjective)
  • privatinsolvent/A (adjective)
  • Techanalyse/N (noun with plural form)
  • Gender-specific forms for Techanalyst/in

Also applies to: 65443-65444, 65446-65451


44198-44202: LGTM! Verify real-world usage patterns.

The Ivy League related terms follow proper German compound word formation rules and include appropriate gender variations. The grammatical markers are consistent.

Let's verify the usage patterns of these terms in German:

✅ Verification successful

The Ivy League terms are correctly formatted and consistent with existing patterns

The added Ivy League terms follow the established patterns in the German spelling dictionary:

  • They use proper hyphenation similar to other compound foreign terms (e.g., "Design-Thinking-Konzept/S", "Coffee-to-go-Shop/S")
  • They include appropriate grammatical markers (/S for singular, /N for plural)
  • They provide comprehensive variations (university, student, gender-specific forms)
  • The compound structure matches other established foreign term compounds in the dictionary
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check usage patterns of Ivy League terms in German
# This helps validate if the chosen forms match common usage

# Search for existing usage patterns in German text files
rg -i "ivy.?league" | grep -i "\.de\|\.at\|\.ch" || echo "No matches in German domain content"

Length of output: 131


Script:

#!/bin/bash
# Let's check the actual entries in the spelling dictionary for Ivy League terms
rg -A 5 -B 5 "Ivy" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt

# Also check if there are any existing German language files or documentation
# that might contain guidance about these terms
fd -e md -e txt . languagetool-language-modules/de | xargs rg -l -i "ivy|lehnwort|fremdwort|anglizismus"

Length of output: 1123


Script:

#!/bin/bash
# Let's check the content of the other files that matched our search
# to understand how similar foreign terms are handled
rg -A 2 -B 2 "fremdwort|anglizismus|lehnwort" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-ignore.txt

# Also check if there are any existing compound words with similar patterns
rg ".*-.*-.*/(S|N)" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt | head -n 10

Length of output: 621

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (2)

583-584: Verify hyphenation consistency

The dictionary includes both hyphenated and non-hyphenated variants of the same term: "nonenveloped" and "non-enveloped". While both forms might be valid, consider standardizing on one form for consistency.

✅ Verification successful

The English language module has a coherency.txt file that explicitly defines pairs of acceptable variants for "non-" words, covering both hyphenated and non-hyphenated forms. It accepts both forms for many similar medical and scientific terms.

Both forms are valid and standardized in the dictionary

The presence of both "nonenveloped" and "non-enveloped" in the spelling dictionary is consistent with LanguageTool's coherency rules, which accept both hyphenated and non-hyphenated variants for similar scientific terms (e.g., "non-vascular/nonvascular", "non-binary/nonbinary", etc.).
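Pairs like this one can be listed mechanically when auditing a word list. The sketch below assumes a plain list of words with no affix flags; `hyphen_variant_pairs` is a hypothetical helper, and the sample words are taken from this review.

```python
def hyphen_variant_pairs(words):
    """Return sorted (hyphenated, joined) pairs where both spellings are present."""
    present = set(words)
    pairs = []
    for word in present:
        if "-" in word:
            joined = word.replace("-", "")
            if joined in present:
                pairs.append((word, joined))
    return sorted(pairs)


if __name__ == "__main__":
    sample = ["nonenveloped", "non-enveloped", "nonpolio", "serotype"]
    print(hyphen_variant_pairs(sample))
```

Each reported pair can then be compared against coherency.txt to decide whether both spellings are intended.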

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for other hyphenation inconsistencies in the dictionary
rg -A 1 "^non-\w+" 
rg -A 1 "^non\w+"

Length of output: 68451


573-587: Verify medical term accuracy

The added virus-related terms form a coherent set (echovirus, enterovirus, coxsackievirus). The grouping is technically sound: coxsackieviruses and echoviruses are both members of the genus Enterovirus.

✅ Verification successful

Medical terms verified as accurate and complete

The virus-related terms added to the spelling dictionary are technically accurate and form a comprehensive set that includes the main virus types and their variants:

  • echovirus/echoviruses
  • enterovirus/enteroviruses/enteroviral
  • coxsackievirus/coxsackieviruses/Coxsackie
  • Related terms: pathogenic, pathogenicity, serotype/serotypes, nonpolio, nonenveloped/non-enveloped

These terms are consistent with medical terminology used across multiple language modules in the codebase and accurately represent this family of viruses, with proper singular/plural forms and variant spellings included.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any missing related virus terms in the dictionary
rg -i "virus" 

Length of output: 65593

languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanWordRepeatRule.java (1)

125-128: LGTM! Addition follows established pattern

The new anti-pattern for "gout gout" is correctly implemented using the same format as other similar entries in the ANTI_PATTERNS list.

languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java (1)

172-173: LGTM! Addition follows established pattern

The new condition for "gout gout" is correctly implemented using the repetitionOf() helper method, consistent with other similar entries in the ignore() method.
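The idea behind both word-repeat changes can be illustrated with a small sketch. This is Python rather than the project's Java, and it is not the actual rule logic: `repeated_words` and its `ignore` parameter are hypothetical stand-ins for the rule and its exception list.

```python
def repeated_words(tokens, ignore=frozenset()):
    """Return indexes of tokens that repeat the previous token, minus ignored words."""
    hits = []
    for i in range(1, len(tokens)):
        word = tokens[i].lower()
        if word == tokens[i - 1].lower() and word not in ignore:
            hits.append(i)
    return hits


if __name__ == "__main__":
    tokens = "the the runner Gout Gout won".split()
    print(repeated_words(tokens))                   # flags both repetitions
    print(repeated_words(tokens, ignore={"gout"}))  # keeps only "the the"
```

The exception is scoped to one token, so ordinary accidental repetitions are still reported.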

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/case_rule_exceptions.txt (1)

365-365: LGTM: Valid addition to Golden pattern

The addition of "Bachelors?|Bachelorette" to the Golden pattern follows the established format and correctly handles capitalization exceptions for these terms.
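For reference, the alternation behaves as sketched below. The surrounding `Golden (...)` shape is an assumption, since the full exception line is not quoted in this review; only the `Bachelors?|Bachelorette` fragment comes from the change itself, and `is_exception` is a hypothetical helper.

```python
import re

# Assumed shape of the exception; only the alternation is taken from the change.
GOLDEN = re.compile(r"Golden (?:Bachelors?|Bachelorette)")


def is_exception(phrase):
    """Return True when the whole phrase matches the assumed exception pattern."""
    return GOLDEN.fullmatch(phrase) is not None


if __name__ == "__main__":
    for phrase in ["Golden Bachelor", "Golden Bachelors", "Golden Bachelorette", "Golden Bachelorettes"]:
        print(phrase, "->", is_exception(phrase))
```

Note that `Bachelors?` makes only the final "s" optional, so a plural such as "Bachelorettes" would need its own alternative if it should be covered.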

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)

Line range hint 3133-3159: LGTM: Valid additions to multitoken suggestions

The new entries follow the established format with appropriate suffix indicators (/S, /N) and maintain consistency with existing patterns.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1)

10973-10981: LGTM: Valid additions to spellchecker ignore list

The new entries are appropriate additions to the ignore list and follow the established format.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt (2)

9055-9056: LGTM! The new compound word entries are properly formatted.

The additions follow the file's conventions:

  • Includes both singular and plural forms
  • Uses the "?" marker consistently for suggesting lower-cased joined variants
  • Properly hyphenated medical terms

9055-9056: Verify consistency with other medical terminology in the codebase.

Let's ensure these medical terms are consistently handled across the language resources.

✅ Verification successful

Medical terminology follows consistent hyphenation patterns in compounds.txt

The verification shows that:

  • The hyphenated forms "coxsackie-virus" and "coxsackie-viruses" follow the same pattern as other virus compounds like "rota-virus", "polio-virus", "noro-virus", etc.
  • The spelling dictionary includes both hyphenated and non-hyphenated variants (coxsackievirus, Coxsackie)
  • The optional question mark suffix is consistently used across all virus compounds to make the hyphenation optional
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistency of medical terminology across language resources

# Check for related terms in spelling dictionary
echo "Checking spelling dictionary for related terms..."
rg -i "coxsackie|virus" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt"

# Check for any existing virus-related compound patterns
echo "Checking for virus-related compound patterns..."
rg -i "[a-z]+-virus" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt"

Length of output: 714

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)

Line range hint 1-4: Documentation looks good!

The file header clearly explains the purpose, usage criteria, and case-sensitivity rules for multi-word terms.

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt (3)

Line range hint 1-19: LGTM! Well-documented format specification.

The file header provides clear documentation about:

  • Purpose: German compound words that are usually not written as separate words
  • Encoding: UTF-8
  • Special markers ($, ?, *, +) and their effects on suggestions
  • Pattern support (\d+)
  • File maintenance tips

Line range hint 11622-11626: LGTM! Business terminology follows established patterns.

The new entries for "Inhouse-Seminar" and its variations follow the established pattern for business terminology in the file, using the $ marker to suggest lower-cased non-hyphenated variants.


11617-11621: Verify medical terminology accuracy.

The new entries for Coxsackie virus variants follow the correct medical terminology and German compound word formation rules. The hyphenation is consistent with similar medical terms in the file.

✅ Verification successful

Medical terminology verified and consistent across German language files

The verification shows that:

  • The Coxsackie virus entries are properly documented in both compounds.txt and spelling.txt
  • Both hyphenated (Coxsackie-Virus) and non-hyphenated (Coxsackievirus) forms are included in spelling.txt
  • The hyphenation pattern is consistent with other virus entries (e.g., RS-Virus, Noro-Virus, Mpox-Virus)
  • Both singular (-Virus) and plural (-Viren) forms are consistently maintained
  • Neo-Angin is correctly documented as a medical brand name with appropriate markings
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify consistency of medical terminology across language files
rg -i "coxsackie|neo-angin" "languagetool-language-modules/*/src/main/resources/org/languagetool/resource/*/hunspell/spelling.txt"

Length of output: 273


Script:

#!/bin/bash
# Let's verify the medical terminology in German dictionaries and spelling files
rg -i "coxsackie|neo-angin" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/

# Check for similar virus entries to verify consistency
rg -i "virus|viren" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt

Length of output: 2313

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (2)

26878-26879: Verify intentional inclusion of both variants.

The file includes both "Fox & Friends" and "Fox and Friends". While both variants are commonly used, verify if this is intentional as it might affect consistency in text processing.

✅ Verification successful

Both variants of "Fox & Friends" are correctly included in the global spelling dictionary

The inclusion of both "Fox & Friends" and "Fox and Friends" in the spelling dictionary is intentional and correct. This is a common practice for proper nouns that have multiple accepted written forms, ensuring that both variants are recognized as valid spellings. The search results confirm these entries only exist in the global spelling dictionary, which is the appropriate location for such variants.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if these variants are consistently handled across other language files
rg -i "Fox.*(and|&).*Friends" --type txt

Length of output: 238


26940-26942: Ensure consistent handling of name prefixes.

The entries "al-Bashir" and "Al Bashir" represent variants of the same name. This matches the related change mentioned in multiwords.txt ("Al Bashir NNP"), but we should verify consistent handling across files.

✅ Verification successful

Consistent handling of "al-" prefix variants is already implemented across language modules.

The search results show that both variants "al-Bashir" and "Al Bashir" are consistently handled across different language modules:

  • German module has "al-Bashir/S" in ignore.txt
  • English module has "Al Bashir NNP" in multiwords.txt
  • Dutch module has "al-Bashir" in spelling.txt
  • Global spelling has both "al-Bashir" and "Al Bashir"

This pattern of supporting both hyphenated and space-separated variants is also consistently applied to other Arabic names with the "al-" prefix across the codebase (e.g., al-Assad/Al Assad, al-Qaeda/Al Qaeda), indicating this is an intentional design to support multiple valid representations of these names.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for consistency in handling of "al-" prefix across files
rg -i "al-?\s?bashir" --type txt

# Check if there are other similar prefix patterns that might need similar treatment
rg -i "^al-" --type txt

Length of output: 22181
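
As a rough illustration of what the first search pattern above does and does not catch, the following sketch applies the same expression with Python's `re` module. The sample strings and the looser `al.?bashir` variant are illustrative assumptions, and `rg` uses a different regex engine, although these simple patterns should behave the same way in both.

```python
import re

# Pattern taken from the script above: optional hyphen, then optional whitespace.
STRICT = re.compile(r"al-?\s?bashir", re.IGNORECASE)
# Looser illustrative variant: any single character (or none) as separator.
LOOSE = re.compile(r"al.?bashir", re.IGNORECASE)

# Hypothetical spellings, not taken from the repository.
samples = ["al-Bashir", "Al Bashir", "AlBashir", "al_Bashir", "al - Bashir"]

results = {s: (bool(STRICT.search(s)), bool(LOOSE.search(s))) for s in samples}

for spelling, (strict_hit, loose_hit) in results.items():
    print(f"{spelling!r}: strict={strict_hit} loose={loose_hit}")
```

Both patterns find the hyphenated and the space-separated form, which is what the consistency check relies on. Neither finds a spelling with more than one separator character.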

Neo-Angin/S #name
privatinsolvent/A
oberhell/A
blitzkriegartig/A

⚠️ Potential issue

Consider alternative terminology for "blitzkriegartig".

The term "blitzkriegartig" might be sensitive due to its historical connotations. Consider using alternatives like "blitzschnell" or "schlagartig" instead.

Comment on lines +65452 to +65457
Coxsackievirus
Coxsackieviren
Coxsackie-Virus
Coxsackie-Viren
Coxsackie-A-Virus
Coxsackie-A-Viren

🛠️ Refactor suggestion

Ensure consistent hyphenation in virus terminology.

The Coxsackievirus entries show inconsistent hyphenation:

  • "Coxsackievirus" (without hyphen)
  • "Coxsackie-Virus" (with hyphen)

Medical terminology should follow a consistent pattern.

Consider standardizing all entries to use hyphens:

-Coxsackievirus
-Coxsackieviren
+Coxsackie-Virus
+Coxsackie-Viren

Committable suggestion skipped: line range outside the PR's diff.
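
One way to see which entries differ only in hyphenation is to group them under a normalized key. The sketch below is a minimal, hypothetical helper and not part of LanguageTool. It assumes dictionary lines have the shape `word/FLAGS #comment`, as the excerpt suggests.

```python
from collections import defaultdict


def strip_annotations(line: str) -> str:
    """Drop a trailing '#comment' and a '/FLAGS' suffix from a dictionary line."""
    word = line.split("#", 1)[0].strip()
    return word.split("/", 1)[0]


def hyphenation_variants(lines):
    """Group entries that are identical once hyphens and case are ignored."""
    groups = defaultdict(set)
    for line in lines:
        word = strip_annotations(line)
        if word:
            groups[word.replace("-", "").lower()].add(word)
    # Keep only keys that occur in more than one written form.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


entries = [
    "Coxsackievirus", "Coxsackieviren",
    "Coxsackie-Virus", "Coxsackie-Viren",
    "Coxsackie-A-Virus", "Coxsackie-A-Viren",
    "Neo-Angin/S #name",
]
variants = hyphenation_variants(entries)
for key, forms in variants.items():
    print(key, "->", forms)
```

Whether both forms should stay in a spelling dictionary is a separate editorial decision. The grouping only makes the variants visible.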

@@ -65432,6 +65437,24 @@
Durchfahrtssperre/N
Bøgseth/S
AnnenMayKantereit/S
syrischstämmig/A
Dunkelflaute/N
Neo-Angin/S #name

💡 Codebase verification

Review of medication trademark handling in German spelling dictionary

Based on the search results, I found several medication-related trademarks that should be consistently marked with #name:

  • Chlorhexamed (antiseptic medication)
  • Medistar (medical software)
  • Mundipharma/S (pharmaceutical company)

These entries appear to be trademarks similar to Neo-Angin/S but currently lack the #name marker.

🔗 Analysis chain

Verify trademark handling for Neo-Angin.

The entry "Neo-Angin/S" is correctly marked with #name as it's a trademark. Consider if other branded medications in the dictionary should follow the same pattern.


Let me generate new shell scripts to specifically search for medication names in the German spelling dictionary:

This will help us identify if there are other medication names that should be marked with #name like Neo-Angin/S.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other medication names that might need #name marking

# Search for potential medication names without #name marking
rg -l '[A-Z][a-z]+(?:-[A-Z][a-z]+)?(?:/[A-Z])?' | xargs rg -v '#name' | grep -i 'med\|pharma\|drug'

Length of output: 69674


Script:

#!/bin/bash
# Search for medication-related entries in the German spelling dictionary
rg -i "(?:med|pharm|drug|angin)" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt

Length of output: 4010
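
The check described in this comment could also be expressed as a small script rather than a broad `rg` search. This is a sketch under assumptions: the `word/FLAGS #comment` line shape is inferred from the `Neo-Angin/S #name` entry, and the candidate set is a hand-picked, hypothetical list based on the names mentioned above.

```python
def parse_entry(line):
    """Split 'word/FLAGS #comment' into (word, flags, comment)."""
    body, _, comment = line.partition("#")
    word, _, flags = body.strip().partition("/")
    return word, flags, comment.strip()


def missing_name_marker(lines, candidates):
    """Return candidate words whose dictionary line lacks a '#name' comment."""
    missing = []
    for line in lines:
        word, _flags, comment = parse_entry(line)
        if word in candidates and comment != "name":
            missing.append(word)
    return missing


# Sample lines modelled on the entries quoted in this review.
lines = [
    "Neo-Angin/S #name",
    "Chlorhexamed",
    "Medistar",
    "Mundipharma/S",
    "Dunkelflaute/N",
]
candidates = {"Neo-Angin", "Chlorhexamed", "Medistar", "Mundipharma"}

unmarked = missing_name_marker(lines, candidates)
print(unmarked)
```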

Comment on lines 2003 to 2009
<rule>
<pattern>
<token>sorry</token>
<token>für</token>
</pattern>
<example correction=""><marker>Sorr für</marker> die Unannehmlichkeit.</example>
</rule>

🛠️ Refactor suggestion

Enhance the rule implementation for 'sorry für' pattern

The rule implementation has several issues that need to be addressed:

  1. Missing suggestion for the correct German alternative
  2. Pattern doesn't match the example case ("Sorr für" vs "sorry für")
  3. Pattern might be too broad without additional context checks

Consider enhancing the rule with:

            <rule>
                <pattern>
-                    <token>sorry</token>
-                    <token>für</token>
+                    <token regexp="yes">sorry|sorr</token>
+                    <token>für</token>
                </pattern>
+                <suggestion>Entschuldigung für</suggestion>
                <example correction="">Sorr für die Unannehmlichkeit.</example>
+                <example>Es tut mir leid für die Unannehmlichkeit.</example>
            </rule>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
<rule>
<pattern>
<token>sorry</token>
<token>für</token>
</pattern>
<example correction=""><marker>Sorr für</marker> die Unannehmlichkeit.</example>
</rule>
<rule>
<pattern>
<token regexp="yes">sorry|sorr</token>
<token>für</token>
</pattern>
<suggestion>Entschuldigung für</suggestion>
<example correction="">Sorr für die Unannehmlichkeit.</example>
<example>Es tut mir leid für die Unannehmlichkeit.</example>
</rule>
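
The mismatch between the pattern and its example can be reproduced outside LanguageTool with a rough consistency check. The sketch below is not LanguageTool's matcher. It assumes whitespace tokenization and case-insensitive token comparison, and `marker_matches_pattern` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# The rule as it appears in the PR, copied from the comment above.
RULE = """<rule>
  <pattern>
    <token>sorry</token>
    <token>für</token>
  </pattern>
  <example correction=""><marker>Sorr für</marker> die Unannehmlichkeit.</example>
</rule>"""


def token_matches(token_el, word):
    """Compare one <token> with one word (assumed case-insensitive)."""
    text = token_el.text or ""
    if token_el.get("regexp") == "yes":
        return re.fullmatch(text, word, re.IGNORECASE) is not None
    return text.lower() == word.lower()


def marker_matches_pattern(rule_xml):
    """Check that the example's <marker> text lines up with the pattern tokens."""
    rule = ET.fromstring(rule_xml)
    tokens = rule.findall("./pattern/token")
    marker = rule.find("./example/marker")
    words = (marker.text or "").split()
    return len(words) == len(tokens) and all(
        token_matches(t, w) for t, w in zip(tokens, words)
    )


as_submitted = marker_matches_pattern(RULE)
with_fixed_example = marker_matches_pattern(RULE.replace("Sorr für", "Sorry für"))
print(as_submitted, with_fixed_example)
```

Under these assumptions, correcting the typo in the example is enough to make it match. Widening the token to `sorry|sorr` as suggested would make it match as well.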

@@ -1831,7 +1831,7 @@
<rule>
<pattern>
<marker>
-<token regexp="yes">Barbarazweige?|obengenannt</token>
+<token regexp="yes">Barbarazweige?|obengenannt.*|privatinsolvent.*</token>

💡 Codebase verification

Based on the findings, I can now generate the final response:

Add examples and documentation for new orthography space rule patterns

The rule additions need proper documentation and examples:

  • obengenannt.* and privatinsolvent.* lack example cases in the rule definition
  • All terms are listed in the spelling dictionaries, suggesting they are valid words
  • The rule group AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE appears to be for catching missing spaces, but the specific cases for these terms are not documented

Please:

  1. Add example cases for obengenannt and privatinsolvent patterns
  2. Document why these terms specifically need orthography space checks
  3. Consider adding test cases to verify the correct behavior
🔗 Analysis chain

Verify and document the new pattern additions

The change to the token pattern Barbarazweige?|obengenannt.*|privatinsolvent.* (widening obengenannt to obengenannt.* and adding privatinsolvent.*) needs verification and documentation.

Let's verify the usage of these terms in the codebase and documentation:

Please:

  1. Add examples for each new term
  2. Document why these terms require orthography space checks
  3. Consider adding test cases to verify the rules

Let me analyze the findings:

  1. The terms are listed in spelling dictionaries:

    • "Barbarazweig/E" and "Barbarazweigen" in spelling.txt
    • "obengenannt/nA" in all German dictionaries (DE, CH, AT)
    • "privatinsolvent/A" in spelling.txt
  2. There's only one example for "Barbarazweige" but none for the other terms.

Let me check if there are any message/description files that might explain these rules:


Let me check for any rule descriptions or messages that might explain the purpose of these additions:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for occurrences and documentation of the new terms

# Search for existing examples or documentation
rg -A 5 -B 5 "Barbarazweige|obengenannt|privatinsolvent" 

# Search for any related test cases
fd -e "test.xml" -e "test.txt" -x rg "Barbarazweige|obengenannt|privatinsolvent" {}

Length of output: 8391


Script:

#!/bin/bash
# Check for rule descriptions and messages
rg -A 5 -B 5 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/

# Check if these terms appear in other rule files
fd -e xml -x rg "Barbarazweige?|obengenannt|privatinsolvent" {} languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/

Length of output: 66429


Script:

#!/bin/bash
# Search for rule descriptions or messages containing these terms or rule ID
rg -B 5 -A 5 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" languagetool-language-modules/de/src/main/resources/org/languagetool/messages/

# Check for any XML comments near the rule definition
rg -B 10 -A 2 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" --type xml

Length of output: 2365


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61872b0 and f29879a.

📒 Files selected for processing (2)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (2 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1 hunks)
🔇 Additional comments (3)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (2)

1834-1834: 🛠️ Refactor suggestion

Add examples and documentation for new orthography space rule patterns

The new pattern additions Barbarazweige?|obengenannt.*|privatinsolvent.* lack proper documentation and examples to demonstrate their usage.

Add example cases for each term to validate the rule behavior:

            <rule>
                <pattern>
                    <marker>
                        <token regexp="yes">Barbarazweige?|obengenannt.*|privatinsolvent.*</token>
                    </marker>
                </pattern>
                <example correction="">Barbarazweige sind Zweige von Obstbäumen.</example>
+               <example correction="">Der obengenannte Vertrag ist gültig.</example>
+               <example correction="">Die privatinsolvente Person muss...</example>
            </rule>
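
To see which word forms the widened token expression covers, it can be tried against a few inflected forms. This is only a sketch: it assumes the expression must match the whole token and is applied case-insensitively, and the word list is illustrative.

```python
import re

# Token expression from the changed line in remote-rule-filters.xml.
TOKEN = re.compile(
    r"Barbarazweige?|obengenannt.*|privatinsolvent.*", re.IGNORECASE
)


def matches(word: str) -> bool:
    """True if the expression covers the entire word."""
    return TOKEN.fullmatch(word) is not None


words = [
    "Barbarazweig", "Barbarazweige", "Barbarazweigen",
    "obengenannt", "obengenannte", "privatinsolvente", "oben",
]
coverage = {w: matches(w) for w in words}
for word, hit in coverage.items():
    print(f"{word}: {hit}")
```

The trailing `.*` is what lets inflected forms such as "obengenannte" match. "Barbarazweigen" is not covered because `e?` permits only an optional final "e".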

2003-2009: 🛠️ Refactor suggestion

Enhance the rule implementation for 'sorry für' pattern

The rule implementation has several issues that need to be addressed:

  1. Missing suggestion for the correct German alternative
  2. Pattern doesn't match the example case ("Sorr für" vs "sorry für")
  3. Pattern might be too broad without additional context checks

Consider enhancing the rule with:

            <rule>
                <pattern>
-                    <token>sorry</token>
-                    <token>für</token>
+                    <token regexp="yes">sorry|sorr</token>
+                    <token>für</token>
                </pattern>
+                <suggestion>Entschuldigung für</suggestion>
                <example correction="">Sorr für die Unannehmlichkeit.</example>
+                <example>Es tut mir leid für die Unannehmlichkeit.</example>
            </rule>

Likely invalid or redundant comment.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)

8019-8019: LGTM! The new entry follows the file guidelines.

The entry "Al Bashir NNP" is properly formatted and meets the requirements:

  • Single part of speech (NNP)
  • No disambiguation needed
  • Follows the termPOS format

@@ -8016,6 +8016,7 @@ Dickson Fjord NNP
Iron Boulder NNP
prima facie UH
Masoud Pezeshkian NNP
Al Bashir NNP

💡 Codebase verification

Inconsistent formatting found for "Al Bashir" across language modules

The name appears with different formats across language resources:

  • English (en): "Al Bashir" (space-separated)
  • Dutch (nl): "al-Bashir" (hyphenated)
  • German (de): Both "Al Bashir/S" and "al-Bashir/S" (inconsistent within same language)
🔗 Analysis chain

Verify consistent usage across the codebase.

Let's check if this term is used consistently in other language resources.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any variations of "Al Bashir" in other language resources
# to ensure consistent usage.

# Search for variations in language resource files
rg -i "al.?bashir" "languagetool-language-modules/*/src/main/resources/"

Length of output: 166


Script:

#!/bin/bash
# Let's try a more targeted search approach
# First, let's find all resource directories
fd -t d "resources" languagetool-language-modules

# Then search for variations of the name in text files
rg -i "al.?bashir" $(fd -t d "resources" languagetool-language-modules)

# Also search specifically in multiwords.txt files across language modules
fd -g "*multiwords.txt" languagetool-language-modules

Length of output: 6374

@tiff tiff merged commit c8ca4f9 into master Dec 13, 2024
5 checks passed
This was referenced Dec 15, 2024
@coderabbitai coderabbitai bot mentioned this pull request Jan 10, 2025