Cb changes 20241213 #11107
Walkthrough
The changes in this pull request involve extensive updates across multiple language resource files within the LanguageTool project. These updates primarily focus on expanding the vocabulary and knowledge base by adding numerous scientific names of species, proper nouns, and specific terms across various contexts. Additionally, some existing entries have been modified for better accuracy and comprehensiveness, particularly in the areas of grammar rules and compound words.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (4)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (3)
Line range hint 1-8019: Consider standardizing the format.
Minor format inconsistencies found:
- Inconsistent spacing around tabs
- Inconsistent comment formatting (some with spaces after #, some without)
Consider using a script to standardize:
- Exactly one tab between term and POS tag
- One space after # for comments
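The two formatting rules above could be applied with a small normalizer. This is a minimal sketch in Python, not a script from the LanguageTool repository: the function name `normalize_line` is hypothetical, and it assumes each entry is a term, a tab, then the POS tag.

```python
import re


def normalize_line(line: str) -> str:
    """Normalize one line of a tab-separated term/POS-tag file (assumed layout)."""
    stripped = line.strip()
    if not stripped:
        return ""
    if stripped.startswith("#"):
        # Comment line: keep the leading hashes, then exactly one space.
        hashes, rest = re.match(r"^(#+)\s*(.*)$", stripped).groups()
        return f"{hashes} {rest}" if rest else hashes
    # Entry line: collapse any run of whitespace containing a tab
    # between the term and its tag into exactly one tab.
    parts = re.split(r"\s*\t\s*", stripped, maxsplit=1)
    return "\t".join(part.strip() for part in parts)
```

A trailing comment on an entry line (for example `NN #note`) is left inside the tag field; moving such comments is a separate decision.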
Line range hint 1-8019: Consider improving content organization.
The current organization could be enhanced by:
- Maintaining strict alphabetical ordering
- Grouping related terms together
- Consolidating duplicate variations
Consider:
- Using a script to maintain alphabetical order
- Adding section headers for different categories
- Adding cross-references between related terms
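A check for ordering and duplicates might look like the following. It is a sketch under stated assumptions: `find_order_issues` is a hypothetical helper, ordering is compared case-insensitively, and duplicates are compared exactly because the file header describes the entries as case-sensitive.

```python
def find_order_issues(lines):
    """Return (out_of_order_pairs, duplicate_terms) for a term<TAB>tag file."""
    # Keep only the term column of non-empty, non-comment lines.
    terms = [
        line.split("\t", 1)[0].strip()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    # Adjacent pairs where the first term sorts after the second.
    out_of_order = [
        (a, b) for a, b in zip(terms, terms[1:]) if a.casefold() > b.casefold()
    ]
    # Terms that occur more than once, reported once each.
    seen, duplicates = set(), []
    for term in terms:
        if term in seen and term not in duplicates:
            duplicates.append(term)
        seen.add(term)
    return out_of_order, duplicates
```

Reporting adjacent pairs rather than re-sorting the file keeps the check non-destructive, which matters if the existing order encodes grouping that a plain sort would lose.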
Based on the verification results, I can now provide a final response about the POS tagging:
POS tag consistency needs improvement in multiwords.txt
Several issues were identified:
- Inconsistent usage of UN/U suffix for uncountable nouns (e.g., NN:UN vs NN:U)
- Non-standard _multiword_ tag used for 60+ entries without clear criteria
- Some entries have comments mixed with tags
- Inconsistent handling of similar terms (e.g., some Latin phrases tagged as _multiword_, others as NN or RB)
Recommendations:
- Standardize uncountable noun tags to either NN:U or NN:UN
- Replace _multiword_ with standard POS tags based on usage
- Move comments to separate lines or dedicated fields
- Apply consistent tagging for similar phrase types (Latin expressions, foreign phrases, etc.)
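The tag distribution behind these findings could be reproduced without shell tools. The sketch below is an assumption-laden illustration: `tag_counts` is a hypothetical helper, and it treats the second tab-separated column as the tag and anything after a `#` in that column as a comment.

```python
from collections import Counter


def tag_counts(lines):
    """Count POS tags in the second tab-separated column of each entry."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and full-line comments
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # no tag column on this line
        # Drop a trailing inline comment, then keep the first token as the tag.
        tag_field = parts[1].split("#", 1)[0].strip()
        if tag_field:
            counts[tag_field.split()[0]] += 1
    return counts
```

With the counts in hand, checking for both `NN:U` and `NN:UN` being present is a simple membership test on the returned counter.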
🔗 Analysis chain
Line range hint 1-8019: Verify POS tag consistency and accuracy.
Some observations:
- Non-standard tag _multiword_ is used without documentation
- Some entries could benefit from more specific POS tags
Let me try a different approach to verify the POS tags.
Let me try to examine the specific file we're interested in.
Let me try one more focused verification approach.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for unique POS tags and their frequencies
echo "Analyzing POS tag distribution..."
awk -F'\t' '{print $2}' multiwords.txt | cut -d' ' -f1 | sort | uniq -c | sort -nr
# Check for potentially inconsistent tagging
echo "Checking for similar terms with different tags..."
awk -F'\t' 'length($1)>0 {print length($1),$1,$2}' multiwords.txt | sort -k2 | awk '{if(last==$2 && lastpos!=$3) print last,lastpos,$2,$3; last=$2; lastpos=$3}'
Length of output: 603
Script:
#!/bin/bash
# Search for the file with a different pattern
fd -t f "multiwords"
# Try searching for specific POS tag patterns to analyze consistency
rg "^.*\t(NNP|NNPS|NN|NNS|JJ|RB|UH|FW|SYM|CC|_multiword_)" -l
# Look for the actual content
rg "^.*\t.*$" --no-filename | head -n 5
Length of output: 3627
Script:
#!/bin/bash
# Look at the English multiwords file content
echo "Examining English multiwords file..."
rg -A 5 "^[^#].*\t" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
# Count occurrences of each POS tag
echo "POS tag distribution..."
rg "^[^#].*\t" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt | awk -F'\t' '{print $2}' | sort | uniq -c
# Check for potentially inconsistent tags
echo "Checking for potential inconsistencies..."
rg "^[^#].*(_multiword_|:)" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
Length of output: 73804
Script:
#!/bin/bash
# Check for specific patterns of inconsistent POS tagging
echo "Checking for inconsistent POS tag patterns..."
rg "^[^#].*\t.*:.*$" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
# Check for _multiword_ tag usage
echo "Checking _multiword_ tag usage..."
rg "^[^#].*_multiword_" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
# Check for entries with multiple tags
echo "Checking entries with multiple tags..."
rg "^[^#].*\t.*\s.*\s" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
Length of output: 5599
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1)
Line range hint 26878-26942: Consider categorizing entries for better maintenance.
The entries span various categories (people, places, brands, etc.). Consider organizing them with comments or separating them into category-specific files for easier maintenance.
Example structure:
+ # Places
Ko Samui
Koh Samui
+ # Media
Fox & Friends
Fox and Friends
+ # Airlines
Royal Dutch Airlines
Qeshm Air
+ # Musicians/Performers
Cecilia Bartoli
Joyce DiDonato
+ # Brands
Victorinox
Dassault Falcon
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (3 hunks)
- languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanWordRepeatRule.java (1 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/case_rule_exceptions.txt (1 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt (1 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml (1 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt (1 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (2 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (3 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (2 hunks)
- languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java (1 hunks)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt (1 hunks)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1 hunks)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1 hunks)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1 hunks)
- languagetool-language-modules/nl/src/main/resources/org/languagetool/resource/nl/common_words.txt (0 hunks)
🔥 Files not summarized due to errors (1)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml: Error: Server error: no LLM provider could handle the message
💤 Files with no reviewable changes (1)
- languagetool-language-modules/nl/src/main/resources/org/languagetool/resource/nl/common_words.txt
✅ Files skipped from review due to trivial changes (1)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
🔇 Additional comments (20)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml (1)
9-9: LGTM! Consider verifying pattern coverage.
The zal_adj_ig entity pattern maintains proper German orthography and follows the established format for number-adjective combinations.
Let's verify the pattern coverage by checking for test cases:
✅ Verification successful
Pattern is well-integrated and verified in the codebase
The zal_adj_ig entity is properly integrated into the disambiguation system:
- It's used in conjunction with the zal_adj entity for number-adjective combinations
- The pattern is referenced in disambiguation rules (e.g., "-köpfig" rule)
- Found test examples with number-adjective combinations (e.g., "250-jährige", "14-jähriges", "40-jähriger")
- The pattern follows the established format and is used consistently throughout the German language module
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for corresponding test cases that validate the number-adjective combinations
# Search for test cases in XML files that might test these patterns
rg -l --type xml "zal_adj_ig|jähr|bänd|seit|monat|täg" languagetool-language-modules/de/src/test/
# Look for example sentences in test resources
rg -A 2 -B 2 --type xml "<marker>" languagetool-language-modules/de/src/test/ | grep -E "[0-9]+.*(jährig|bändig|seitig|monatig|tägig)"
Length of output: 236
Script:
#!/bin/bash
# Let's try a different approach to find test cases and usage patterns
# Search for test files that might contain number-adjective combinations
fd ".*\.xml$" languagetool-language-modules/de/src/test/
# Search for disambiguation rules using this entity
rg -A 2 -B 2 "zal_adj_ig" languagetool-language-modules/de/src/main/resources/
# Check if there are any unit tests for disambiguation
fd ".*Test.*\.java$" languagetool-language-modules/de/src/test/
# Look for example sentences in any XML files under the German module
rg --type xml "(ein|zwei|drei|vier|fünf).*(jährig|bändig|seitig|monatig|tägig)" languagetool-language-modules/de/
Length of output: 14161
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (2)
23968-23968: Pattern consistency verified across rules
The term "phas" has been consistently added to all three related rules:
- Rule for separated numbers (e.g., "3 phasig")
- Rule for connected numbers (e.g., "3phasig")
- Rule for written numbers (e.g., "dreiphasig")
This maintains the pattern consistency required for comprehensive grammar checking.
Also applies to: 23980-23980, 24001-24001
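The three spellings listed above can be told apart with plain regular expressions. This is only an illustrative sketch in Python: the actual patterns live in grammar.xml and are not shown here, the names `PATTERNS` and `classify` are hypothetical, and the written-number prefixes are limited to the ones mentioned in the dictionary check in this review.

```python
import re

# Assumption: "phas" is the only stem considered; the real rules list many more.
STEM = "phas"

PATTERNS = {
    # Digit, space, then the adjective: "3 phasig"
    "separated": re.compile(rf"\d+ {STEM}ig\w*"),
    # Digit directly attached: "3phasig"
    "connected": re.compile(rf"\d+{STEM}ig\w*"),
    # Number written out: "dreiphasig"
    "written": re.compile(rf"(?:ein|zwei|drei|vier|fünf|sechs){STEM}ig\w*"),
}


def classify(text):
    """Return which of the three spellings `text` uses, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(text):
            return name
    return None
```

The trailing `\w*` lets inflected forms such as "dreiphasige" through; the hyphenated form "3-phasig" matches none of the three patterns.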
23968-23968: Verify German compound word formation with "phasig"
The addition of "phas" to the patterns appears to handle compounds like "3-phasig" (3-phase), which follows German compound word formation rules. Let's verify this is a common usage pattern.
Also applies to: 23980-23980, 24001-24001
✅ Verification successful
Addition of "phas" pattern is consistent with German compound formation
The search results confirm that compound words with "phasig" are well-established in German:
- Found in all three major German dictionaries (DE, CH, AT)
- Consistent pattern with written numbers (ein-, zwei-, drei-, vier-, fünf-, sechs-)
- All entries marked as adjectives (/A)
The addition of "phas" to the grammar rules aligns with these dictionary entries and will help detect incorrect compound formations with numbers.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for existing usage patterns of "phasig" compounds in German text examples
# This helps verify if the grammar rule addition aligns with real-world usage
# Search for examples of numeric compounds with "phasig"
rg -i "[0-9][-]?phasig" "languagetool-language-modules/de/src/main/resources/org/languagetool/resource/"
# Search for written number compounds with "phasig"
rg -i "(ein|zwei|drei|vier|fünf|sechs|sieben|acht|neun|zehn)[-]?phasig" "languagetool-language-modules/de/src/main/resources/org/languagetool/resource/"
Length of output: 2291
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2)
65440-65441: LGTM! New terminology additions are well-formed.
The new adjectives and technical terms follow proper German word formation rules and include appropriate grammatical markers:
- syrischstämmig/A (adjective)
- privatinsolvent/A (adjective)
- Techanalyse/N (noun with plural form)
- Gender-specific forms for Techanalyst/in
Also applies to: 65443-65444, 65446-65451
44198-44202: LGTM! Verify real-world usage patterns.
The Ivy League related terms follow proper German compound word formation rules and include appropriate gender variations. The grammatical markers are consistent.
Let's verify the usage patterns of these terms in German:
✅ Verification successful
The Ivy League terms are correctly formatted and consistent with existing patterns
The added Ivy League terms follow the established patterns in the German spelling dictionary:
- They use proper hyphenation similar to other compound foreign terms (e.g., "Design-Thinking-Konzept/S", "Coffee-to-go-Shop/S")
- They include appropriate grammatical markers (/S for singular, /N for plural)
- They provide comprehensive variations (university, student, gender-specific forms)
- The compound structure matches other established foreign term compounds in the dictionary
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check usage patterns of Ivy League terms in German
# This helps validate if the chosen forms match common usage
# Search for existing usage patterns in German text files
rg -i "ivy.?league" | grep -i "\.de\|\.at\|\.ch" || echo "No matches in German domain content"
Length of output: 131
Script:
#!/bin/bash
# Let's check the actual entries in the spelling dictionary for Ivy League terms
rg -A 5 -B 5 "Ivy" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt
# Also check if there are any existing German language files or documentation
# that might contain guidance about these terms
fd -e md -e txt . languagetool-language-modules/de | xargs rg -l -i "ivy|lehnwort|fremdwort|anglizismus"
Length of output: 1123
Script:
#!/bin/bash
# Let's check the content of the other files that matched our search
# to understand how similar foreign terms are handled
rg -A 2 -B 2 "fremdwort|anglizismus|lehnwort" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-ignore.txt
# Also check if there are any existing compound words with similar patterns
rg ".*-.*-.*/(S|N)" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt | head -n 10
Length of output: 621
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (2)
583-584: Verify hyphenation consistency
The dictionary includes both hyphenated and non-hyphenated variants of the same term: "nonenveloped" and "non-enveloped". While both forms might be valid, consider standardizing on one form for consistency.
✅ Verification successful
Based on the search results, I can see that the English language module has a coherency.txt file that explicitly defines pairs of acceptable variants for "non-" words, including both hyphenated and non-hyphenated forms. The file shows that both forms are valid for many similar medical/scientific terms.
Both forms are valid and standardized in the dictionary
The presence of both "nonenveloped" and "non-enveloped" in the spelling dictionary is consistent with LanguageTool's coherency rules, which accept both hyphenated and non-hyphenated variants for similar scientific terms (e.g., "non-vascular/nonvascular", "non-binary/nonbinary", etc.).
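Pairs like this can be found mechanically in a word list. The following is a minimal sketch, assuming one word per entry; `hyphen_variant_pairs` is a hypothetical helper and does not reflect how LanguageTool's coherency rules are implemented.

```python
def hyphen_variant_pairs(words, prefix="non"):
    """Return (hyphenated, joined) pairs where both spellings are present."""
    wordset = set(words)
    pairs = []
    for word in sorted(wordset):
        if word.startswith(prefix + "-"):
            # "non-enveloped" -> "nonenveloped"
            joined = prefix + word[len(prefix) + 1:]
            if joined in wordset:
                pairs.append((word, joined))
    return pairs
```

Entries with only one of the two spellings are ignored, so the output lists exactly the terms where a consistency decision would be needed.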
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for other hyphenation inconsistencies in the dictionary
rg -A 1 "^non-\w+"
rg -A 1 "^non\w+"
Length of output: 68451
573-587: Verify medical term accuracy
The added virus-related terms form a coherent set (echovirus, enterovirus, coxsackievirus). These are technically accurate as they represent a family of viruses, with Coxsackie being a specific type of enterovirus.
✅ Verification successful
Based on the search results, I can now generate the final response since I have sufficient information to verify the medical terms:
Medical terms verified as accurate and complete
The virus-related terms added to the spelling dictionary are technically accurate and form a comprehensive set that includes the main virus types and their variants:
- echovirus/echoviruses
- enterovirus/enteroviruses/enteroviral
- coxsackievirus/coxsackieviruses/Coxsackie
- Related terms: pathogenic, pathogenicity, serotype/serotypes, nonpolio, nonenveloped/non-enveloped
These terms are consistent with medical terminology used across multiple language modules in the codebase and accurately represent this family of viruses, with proper singular/plural forms and variant spellings included.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any missing related virus terms in the dictionary
rg -i "virus"
Length of output: 65593
languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanWordRepeatRule.java (1)
125-128: LGTM! Addition follows established pattern
The new anti-pattern for "gout gout" is correctly implemented using the same format as other similar entries in the ANTI_PATTERNS list.
languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java (1)
172-173: LGTM! Addition follows established pattern
The new condition for "gout gout" is correctly implemented using the repetitionOf() helper method, consistent with other similar entries in the ignore() method.
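The idea behind such an exception can be sketched independently of the Java code, which is not reproduced in this review. The Python below is an illustration only: `ALLOWED_REPEATS` and `repeated_words` are hypothetical names, and it does not mirror the signature of `repetitionOf()` or `ignore()`.

```python
# Assumption: lower-cased words whose immediate repetition is legitimate.
ALLOWED_REPEATS = {"gout"}


def repeated_words(tokens, allowed=ALLOWED_REPEATS):
    """Return indices of tokens that repeat the previous token and are not allowed."""
    hits = []
    for i in range(1, len(tokens)):
        current = tokens[i].lower()
        if current == tokens[i - 1].lower() and current not in allowed:
            hits.append(i)
    return hits
```

For the token list `["Gout", "Gout", "ran", "the", "the", "race"]`, only the second "the" is reported.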
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/case_rule_exceptions.txt (1)
365-365: LGTM: Valid addition to Golden pattern
The addition of "Bachelors?|Bachelorette" to the Golden pattern follows the established format and correctly handles capitalization exceptions for these terms.
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)
Line range hint 3133-3159: LGTM: Valid additions to multitoken suggestions
The new entries follow the established format with appropriate suffix indicators (/S, /N) and maintain consistency with existing patterns.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1)
10973-10981: LGTM: Valid additions to spellchecker ignore list
The new entries are appropriate additions to the ignore list and follow the established format.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt (2)
9055-9056: LGTM! The new compound word entries are properly formatted.
The additions follow the file's conventions:
- Includes both singular and plural forms
- Uses the "?" marker consistently for suggesting lower-cased joined variants
- Properly hyphenated medical terms
9055-9056: Verify consistency with other medical terminology in the codebase.
Let's ensure these medical terms are consistently handled across the language resources.
✅ Verification successful
Medical terminology follows consistent hyphenation patterns in compounds.txt
The verification shows that:
- The hyphenated forms "coxsackie-virus" and "coxsackie-viruses" follow the same pattern as other virus compounds like "rota-virus", "polio-virus", "noro-virus", etc.
- The spelling dictionary includes both hyphenated and non-hyphenated variants (coxsackievirus, Coxsackie)
- The optional question mark suffix is consistently used across all virus compounds to make the hyphenation optional
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for consistency of medical terminology across language resources
# Check for related terms in spelling dictionary
echo "Checking spelling dictionary for related terms..."
rg -i "coxsackie|virus" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt"
# Check for any existing virus-related compound patterns
echo "Checking for virus-related compound patterns..."
rg -i "[a-z]+-virus" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt"
Length of output: 714
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)
Line range hint 1-4: Documentation looks good!
The file header clearly explains the purpose, usage criteria, and case-sensitivity rules for multi-word terms.
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt (3)
Line range hint 1-19: LGTM! Well-documented format specification.
The file header provides clear documentation about:
- Purpose: German compound words that are usually not written as separate words
- Encoding: UTF-8
- Special markers ($, ?, *, +) and their effects on suggestions
- Pattern support (\d+)
- File maintenance tips
Line range hint 11622-11626: LGTM! Business terminology follows established patterns.
The new entries for "Inhouse-Seminar" and its variations follow the established pattern for business terminology in the file, using the $ marker to suggest lower-cased non-hyphenated variants.
11617-11621: Verify medical terminology accuracy.
The new entries for Coxsackie virus variants follow the correct medical terminology and German compound word formation rules. The hyphenation is consistent with similar medical terms in the file.
✅ Verification successful
Medical terminology verified and consistent across German language files
The verification shows that:
- The Coxsackie virus entries are properly documented in both compounds.txt and spelling.txt
- Both hyphenated (Coxsackie-Virus) and non-hyphenated (Coxsackievirus) forms are included in spelling.txt
- The hyphenation pattern is consistent with other virus entries (e.g., RS-Virus, Noro-Virus, Mpox-Virus)
- Both singular (-Virus) and plural (-Viren) forms are consistently maintained
- Neo-Angin is correctly documented as a medical brand name with appropriate markings
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify consistency of medical terminology across language files
rg -i "coxsackie|neo-angin" "languagetool-language-modules/*/src/main/resources/org/languagetool/resource/*/hunspell/spelling.txt"
Length of output: 273
Script:
#!/bin/bash
# Let's verify the medical terminology in German dictionaries and spelling files
rg -i "coxsackie|neo-angin" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/
# Check for similar virus entries to verify consistency
rg -i "virus|viren" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt
Length of output: 2313
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (2)
26878-26879: Verify intentional inclusion of both variants.
The file includes both "Fox & Friends" and "Fox and Friends". While both variants are commonly used, verify if this is intentional as it might affect consistency in text processing.
✅ Verification successful
Both variants of "Fox & Friends" are correctly included in the global spelling dictionary
The inclusion of both "Fox & Friends" and "Fox and Friends" in the spelling dictionary is intentional and correct. This is a common practice for proper nouns that have multiple accepted written forms, ensuring that both variants are recognized as valid spellings. The search results confirm these entries only exist in the global spelling dictionary, which is the appropriate location for such variants.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if these variants are consistently handled across other language files
rg -i "Fox.*(and|&).*Friends" --type txt
Length of output: 238
26940-26942: Ensure consistent handling of name prefixes.
The entries "al-Bashir" and "Al Bashir" represent variants of the same name. This matches the related change mentioned in multiwords.txt ("Al Bashir NNP"), but we should verify consistent handling across files.
✅ Verification successful
Consistent handling of "al-" prefix variants is already implemented across language modules.
The search results show that both variants "al-Bashir" and "Al Bashir" are consistently handled across different language modules:
- German module has "al-Bashir/S" in ignore.txt
- English module has "Al Bashir NNP" in multiwords.txt
- Dutch module has "al-Bashir" in spelling.txt
- Global spelling has both "al-Bashir" and "Al Bashir"
This pattern of supporting both hyphenated and space-separated variants is also consistently applied to other Arabic names with the "al-" prefix across the codebase (e.g., al-Assad/Al Assad, al-Qaeda/Al Qaeda), indicating this is an intentional design to support multiple valid representations of these names.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for consistency in handling of "al-" prefix across files
rg -i "al-?\s?bashir" --type txt
# Check if there are other similar prefix patterns that might need similar treatment
rg -i "^al-" --type txt
Length of output: 22181
Neo-Angin/S #name
privatinsolvent/A
oberhell/A
blitzkriegartig/A
Consider alternative terminology for "blitzkriegartig".
The term "blitzkriegartig" might be sensitive due to its historical connotations. Consider using alternatives like "blitzschnell" or "schlagartig" instead.
Coxsackievirus
Coxsackieviren
Coxsackie-Virus
Coxsackie-Viren
Coxsackie-A-Virus
Coxsackie-A-Viren
🛠️ Refactor suggestion
Ensure consistent hyphenation in virus terminology.
The Coxsackievirus entries show inconsistent hyphenation:
- "Coxsackievirus" (without hyphen)
- "Coxsackie-Virus" (with hyphen)
Medical terminology should follow a consistent pattern.
Consider standardizing all entries to use hyphens:
-Coxsackievirus
-Coxsackieviren
+Coxsackie-Virus
+Coxsackie-Viren
Committable suggestion skipped: line range outside the PR's diff.
@@ -65432,6 +65437,24 @@
Durchfahrtssperre/N
Bøgseth/S
AnnenMayKantereit/S
syrischstämmig/A
Dunkelflaute/N
Neo-Angin/S #name
💡 Codebase verification
Review of medication trademark handling in German spelling dictionary
Based on the search results, I found several medication-related trademarks that should be consistently marked with #name:
- Chlorhexamed (antiseptic medication)
- Medistar (medical software)
- Mundipharma/S (pharmaceutical company)
These entries appear to be trademarks similar to Neo-Angin/S but currently lack the #name marker.
🔗 Analysis chain
Verify trademark handling for Neo-Angin.
The entry "Neo-Angin/S" is correctly marked with #name as it's a trademark. Consider if other branded medications in the dictionary should follow the same pattern.
Let me generate new shell scripts to specifically search for medication names in the German spelling dictionary:
This will help us identify if there are other medication names that should be marked with #name like Neo-Angin/S.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for other medication names that might need #name marking
# Search for potential medication names without #name marking
rg -l '[A-Z][a-z]+(?:-[A-Z][a-z]+)?(?:/[A-Z])?' | xargs rg -v '#name' | grep -i 'med\|pharma\|drug'
Length of output: 69674
Script:
#!/bin/bash
# Search for medication-related entries in the German spelling dictionary
rg -i "(?:med|pharm|drug|angin)" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt
Length of output: 4010
<rule>
<pattern>
<token>sorry</token>
<token>für</token>
</pattern>
<example correction=""><marker>Sorr für</marker> die Unannehmlichkeit.</example>
</rule>
🛠️ Refactor suggestion
Enhance the rule implementation for 'sorry für' pattern
The rule implementation has several issues that need to be addressed:
- Missing suggestion for the correct German alternative
- Pattern doesn't match the example case ("Sorr für" vs "sorry für")
- Pattern might be too broad without additional context checks
Consider enhancing the rule with:
<rule>
<pattern>
- <token>sorry</token>
- <token>für</token>
+ <token regexp="yes">sorry|sorr</token>
+ <token>für</token>
</pattern>
+ <suggestion>Entschuldigung für</suggestion>
<example correction="">Sorr für die Unannehmlichkeit.</example>
+ <example>Es tut mir leid für die Unannehmlichkeit.</example>
</rule>
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
<rule>
<pattern>
<token>sorry</token>
<token>für</token>
</pattern>
<example correction=""><marker>Sorr für</marker> die Unannehmlichkeit.</example>
</rule>
<rule>
<pattern>
<token regexp="yes">sorry|sorr</token>
<token>für</token>
</pattern>
<suggestion>Entschuldigung für</suggestion>
<example correction="">Sorr für die Unannehmlichkeit.</example>
<example>Es tut mir leid für die Unannehmlichkeit.</example>
</rule>
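The mismatch between the pattern and its example can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `marker_matches_pattern` is hypothetical, it handles only plain and `regexp="yes"` tokens, and it assumes token matching is case-insensitive. It is not how LanguageTool validates its rules.

```python
import re
import xml.etree.ElementTree as ET

RULE_XML = """<rule>
  <pattern>
    <token>sorry</token>
    <token>für</token>
  </pattern>
  <example correction=""><marker>Sorr für</marker> die Unannehmlichkeit.</example>
</rule>"""


def marker_matches_pattern(rule_xml):
    """Check whether the example's marked words match the pattern tokens one by one."""
    rule = ET.fromstring(rule_xml)
    tokens = rule.findall("./pattern/token")
    marker = rule.find("./example/marker")
    if marker is None or marker.text is None:
        return False
    words = marker.text.split()
    if len(words) != len(tokens):
        return False
    for word, token in zip(words, tokens):
        expected = token.text or ""
        if token.get("regexp") == "yes":
            # Regex tokens must match the whole word.
            if not re.fullmatch(expected, word, flags=re.IGNORECASE):
                return False
        elif word.lower() != expected.lower():
            return False
    return True
```

Applied to the rule as submitted, the check fails because the marked text is "Sorr für" while the first token is "sorry"; either correcting the example or widening the token as suggested makes it pass.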
@@ -1831,7 +1831,7 @@
<rule>
<pattern>
<marker>
<token regexp="yes">Barbarazweige?|obengenannt</token>
<token regexp="yes">Barbarazweige?|obengenannt.*|privatinsolvent.*</token>
💡 Codebase verification
Based on the findings, I can now generate the final response:
Add examples and documentation for new orthography space rule patterns
The rule additions need proper documentation and examples:
- obengenannt.* and privatinsolvent.* lack example cases in the rule definition
- All terms are listed in the spelling dictionaries, suggesting they are valid words
- The rule group AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE appears to be for catching missing spaces, but the specific cases for these terms are not documented
Please:
- Add example cases for the obengenannt and privatinsolvent patterns
- Document why these terms specifically need orthography space checks
- Consider adding test cases to verify the correct behavior
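One quick way to check which word forms the extended pattern covers is to run it against a handful of candidates. The sketch below makes the same assumptions as any offline emulation: whole-token, case-insensitive matching, which may differ from how the filter is evaluated in practice. The candidate list and the helper name `covered` are illustrative only.

```python
import re

# Token regex taken from the changed line in remote-rule-filters.xml.
PATTERN = r"Barbarazweige?|obengenannt.*|privatinsolvent.*"

def covered(token):
    """True if the whole token matches the pattern (assumed semantics)."""
    return re.fullmatch(PATTERN, token, re.IGNORECASE) is not None

candidates = [
    "Barbarazweig", "Barbarazweige", "Barbarazweigen",
    "obengenannt", "obengenannte", "obengenannten",
    "privatinsolvent", "privatinsolventen", "oben",
]
for word in candidates:
    print(f"{word}: {covered(word)}")
```

With these assumptions, the trailing `.*` picks up inflected forms such as "obengenannten" and "privatinsolventen", whereas `Barbarazweige?` stops at "Barbarazweige" and leaves "Barbarazweigen" uncovered. Example sentences for each form would make that behavior explicit in the rule file.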
🔗 Analysis chain
Verify and document the new pattern additions
The addition of "Barbarazweige?|obengenannt.*|privatinsolvent.*" to the pattern needs verification and documentation.
Let's verify the usage of these terms in the codebase and documentation:
Please:
- Add examples for each new term
- Document why these terms require orthography space checks
- Consider adding test cases to verify the rules
Let me analyze the findings:
- The terms are listed in spelling dictionaries:
  - "Barbarazweig/E" and "Barbarazweigen" in spelling.txt
  - "obengenannt/nA" in all German dictionaries (DE, CH, AT)
  - "privatinsolvent/A" in spelling.txt
- There's only one example for "Barbarazweige" but none for the other terms.
Let me check if there are any message/description files that might explain these rules:
Let me check for any rule descriptions or messages that might explain the purpose of these additions:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for occurrences and documentation of the new terms
# Search for existing examples or documentation
rg -A 5 -B 5 "Barbarazweige|obengenannt|privatinsolvent"
# Search for any related test cases
fd -e "test.xml" -e "test.txt" -x rg "Barbarazweige|obengenannt|privatinsolvent" {}
Length of output: 8391
Script:
#!/bin/bash
# Check for rule descriptions and messages
rg -A 5 -B 5 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/
# Check if these terms appear in other rule files
fd -e xml -x rg "Barbarazweige?|obengenannt|privatinsolvent" {} languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/
Length of output: 66429
Script:
#!/bin/bash
# Search for rule descriptions or messages containing these terms or rule ID
rg -B 5 -A 5 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" languagetool-language-modules/de/src/main/resources/org/languagetool/messages/
# Check for any XML comments near the rule definition
rg -B 10 -A 2 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" --type xml
Length of output: 2365
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (2 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1 hunk)
🔇 Additional comments (3)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (2)
1834-1834: 🛠️ Refactor suggestion
Add examples and documentation for new orthography space rule patterns
The new pattern additions Barbarazweige?|obengenannt.*|privatinsolvent.* lack proper documentation and examples to demonstrate their usage.
Add example cases for each term to validate the rule behavior:
<rule>
<pattern>
<marker>
<token regexp="yes">Barbarazweige?|obengenannt.*|privatinsolvent.*</token>
</marker>
</pattern>
<example correction="">Barbarazweige sind Zweige von Obstbäumen.</example>
+ <example correction="">Der obengenannte Vertrag ist gültig.</example>
+ <example correction="">Die privatinsolvente Person muss...</example>
</rule>
2003-2009: 🛠️ Refactor suggestion
Enhance the rule implementation for 'sorry für' pattern
Likely invalid or redundant comment.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)
8019-8019: LGTM! The new entry follows the file guidelines.
The entry "Al Bashir NNP" is properly formatted and meets the requirements:
- Single part of speech (NNP)
- No disambiguation needed
- Follows the term<TAB>POS format
@@ -8016,6 +8016,7 @@ Dickson Fjord NNP
Iron Boulder NNP
prima facie UH
Masoud Pezeshkian NNP
+ Al Bashir NNP
💡 Codebase verification
Inconsistent formatting found for "Al Bashir" across language modules
The name appears with different formats across language resources:
- English (en): "Al Bashir" (space-separated)
- Dutch (nl): "al-Bashir" (hyphenated)
- German (de): Both "Al Bashir/S" and "al-Bashir/S" (inconsistent within same language)
🔗 Analysis chain
Verify consistent usage across the codebase.
Let's check if this term is used consistently in other language resources.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any variations of "Al Bashir" in other language resources
# to ensure consistent usage.
# Search for variations in language resource files
rg -i "al.?bashir" "languagetool-language-modules/*/src/main/resources/"
Length of output: 166
Script:
#!/bin/bash
# Let's try a more targeted search approach
# First, let's find all resource directories
fd -t d "resources" languagetool-language-modules
# Then search for variations of the name in text files
rg -i "al.?bashir" $(fd -t d "resources" languagetool-language-modules)
# Also search specifically in multiwords.txt files across language modules
fd -g "*multiwords.txt" languagetool-language-modules
Length of output: 6374
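The same check can be reproduced without relying on shell glob expansion. The first script passes the wildcard path in quotes, so the shell would not expand it, which may explain its short output. The following is a rough sketch only: it assumes the resources live under `<module>/src/main/resources` and restricts itself to `.txt` files, and the function name `find_name_variants` is hypothetical.

```python
import re
from pathlib import Path

# Same pattern as the rg calls above: "al", at most one arbitrary character, "bashir".
NAME_VARIANTS = re.compile(r"al.?bashir", re.IGNORECASE)

def find_name_variants(root):
    """Return (relative path, line) pairs for matching lines in
    <root>/*/src/main/resources/**/*.txt (assumed layout)."""
    hits = []
    # Path.glob expands the wildcard itself, so no shell quoting issue can arise.
    for resources in sorted(Path(root).glob("*/src/main/resources")):
        for path in sorted(resources.rglob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            for line in text.splitlines():
                if NAME_VARIANTS.search(line):
                    hits.append((path.relative_to(root).as_posix(), line))
    return hits
```

Called as `find_name_variants("languagetool-language-modules")`, this would list space-separated, hyphenated, and unseparated spellings side by side, which is the comparison the verification note relies on.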
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores