-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cb changes 20241203 #11081
Cb changes 20241203 #11081
Conversation
WalkthroughThis pull request introduces significant updates to various language processing files within the LanguageTool project. Key changes include the addition of numerous scientific names and proper nouns to Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (1)
65383-65383
: Add declension suffix for medical termThe medical term "Fasziitis" should include a declension suffix. Consider adding
/S
for singular noun declension, as it follows the same pattern as other medical terms in German.-Fasziitis +Fasziitis/Slanguagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (2)
26871-26872
: Consider standardizing case variantsThe entries "Vinci Autoroutes" and "VINCI Autoroutes" represent the same entity. Consider implementing case-insensitive matching in the spelling checker to avoid maintaining multiple variants.
26877-26878
: Consider handling spelling variants systematicallyThe entries "Ko Samui" and "Koh Samui" represent alternative spellings of the same Thai location. Consider:
- Documenting which spelling is preferred
- Implementing a suggestion mechanism to recommend the preferred spelling
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (9)
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt
(1 hunks)languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/CompoundInfinitivRule.java
(3 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt
(3 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt
(1 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml
(2 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
(1 hunks)
🔥 Files not summarized due to errors (1)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml: Error: Server error: no LLM provider could handle the message
✅ Files skipped from review due to trivial changes (1)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt
🔇 Additional comments (14)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)
17155-17172
: 🛠️ Refactor suggestion
Consider a more maintainable approach for handling censored words
While the implementation is functional, there are a few concerns to address:
- The current approach explicitly lists profanity in the source code, which might not be ideal for code maintenance and readability.
- The pattern matching might miss variations or be too permissive.
Consider these improvements:
-
Move the censored word patterns to a separate configuration file that can be:
- Easily maintained without touching the main rules
- Protected with appropriate access controls
- Updated without redeploying the application
-
Use a more robust pattern matching approach:
<rulegroup id="CENSORED_WORDS" name="ignore spelling of censored words">
<rule>
<pattern>
<token>
<exception regexp="yes"><!-- include safe words that might match --></exception>
<reference path="censored-patterns.xml"/>
</token>
</pattern>
<disambig action="ignore_spelling"/>
</rule>
</rulegroup>
Let's check if similar patterns exist elsewhere in the codebase:
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1)
72202-72202
: LGTM! Good enhancement to the academic terms pattern.
Adding inflected="yes"
is a good improvement as it allows the antipattern to match plural forms and other inflections of academic terms (e.g., "theses", "students", "graduates"), making the rule more comprehensive.
Let's verify that this antipattern doesn't conflict with other rules:
✅ Verification successful
Let me analyze the search results:
-
The search found several academic-related terms in various rules:
- A US-specific rule for "master's dissertation/thesis" terminology
- Examples using academic terms in various contexts
- No conflicting antipatterns using these terms were found by ast-grep
-
Let me search for any other potential conflicts with inflected forms:
Based on the search results, I can now make a definitive assessment:
No conflicts found with inflected academic terms pattern
The search results show that:
- The academic terms are used in various rules but in different contexts (e.g., subject-verb agreement, capitalization)
- No conflicting antipatterns were found that would be affected by adding
inflected="yes"
- The existing rules that use these terms (like "master's thesis" vs "master's dissertation") work independently and won't be impacted by the inflection change
The change to allow inflected forms in the antipattern is safe and will improve pattern matching by correctly handling plural forms of academic terms.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential conflicts with other rules using the same terms
# Look for other rules that might be affected by this change
# Search for other instances of these academic terms in grammar rules
rg -A 5 -B 5 '\b(thesis|major|student|graduate|class|degree|program|diploma|course)\b' \
--type xml \
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/
# Check for any existing antipatterns using these terms
ast-grep --pattern '<antipattern>
$$$
<token $$$>thesis|major|student|graduate|class|degree|program|diploma|course</token>
$$$
</antipattern>'
Length of output: 71113
Script:
#!/bin/bash
# Search for other rules that might use inflected forms of these terms
rg -A 5 -B 5 'inflected="yes".*>(thesis|major|student|graduate|class|degree|program|diploma|course)<' \
--type xml \
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/
# Also check for any rules that might be affected by plural forms
rg -A 5 -B 5 '\b(theses|majors|students|graduates|classes|degrees|programs|diplomas|courses)\b' \
--type xml \
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/
Length of output: 85539
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (3)
17576-17576
: LGTM: Proper morphological annotation added
The addition of the /S
suffix to "Granola" correctly identifies it as a noun with standard singular declension while maintaining its proper noun categorization.
65360-65373
: LGTM: Complete verb conjugation sets
The additions for "rumkippeln" and "herumkippeln" include all necessary conjugation forms (infinitive, present tense, past tense, and participle) for both colloquial variants.
65434-65434
: LGTM: Proper band name entry
The addition of "AnnenMayKantereit" with the /S
suffix correctly follows the conventions for proper nouns in the dictionary.
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (2)
27276-27326
: Well-structured antipatterns covering common German constructions!
The new antipatterns effectively handle various German linguistic constructions:
- Verb-preposition combinations (e.g., "reichen aus zu", "kommen komisch vor zu")
- Intensifier patterns with adjectives
- Common phrasal expressions (e.g., "mit zu tun", "Angst vor zu")
The patterns make good use of XML attributes (inflected, skip, regexp) to ensure flexible matching while maintaining accuracy.
27372-27374
: LGTM! Valid preposition exceptions added.
The new exceptions "kraft", "oberhalb", and "unterhalb" are valid prepositions that can precede "zu" in correct German grammar.
Let's verify if we have comprehensive coverage of similar prepositions:
✅ Verification successful
LGTM! The preposition exceptions are well covered
The search results confirm that similar directional prepositions ending in "-halb" (ausserhalb, oberhalb, unterhalb) are consistently handled as exceptions. The new additions align with the existing pattern and maintain consistency in the grammar rules.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for other prepositions that might need similar exception handling
rg -A 2 '<exception>[a-z]+(halb|kraft)</exception>' languagetool-language-modules/de/src/main/resources/
Length of output: 1117
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1)
534-535
: LGTM! The new dictionary entries follow the established format.
The additions include well-formed technical terms, compound words, and general vocabulary. The entries maintain consistency with the file's format and existing patterns.
Let's verify that these new entries don't conflict with existing ones:
✅ Verification successful
The new dictionary entries are correctly formatted and do not have any duplicates
The verification confirms that both "controllability" and "controlment" are present in the spelling dictionary exactly once, following the established format of one word per line without any variations or duplicates.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential duplicates of the new entries
# Note: Using case-insensitive search to catch any variations
echo "Checking for potential duplicates..."
for word in controllability controlment organocatalyst organocatalysts organocatalysis bioorganometallic organometallic organometallics victimhood culturewise unintrusive unintrusively unobtrusive unobtrusively questline questlines overindepted underindepted "UV-degradable" "oxy-degradable" "bio-degradable" "bio-degradables" biodegradable biodegradables Braudelian outflux outfluxes torr torrs; do
echo "Checking: $word"
rg -i "^$word(/[A-Z])?$" "languagetool-language-modules/*/src/main/resources/org/languagetool/resource/*/hunspell/spelling.txt"
done
Length of output: 21278
Script:
#!/bin/bash
# Let's try with the correct path pattern for the spelling files
for word in controllability controlment; do
echo "Checking: $word"
rg -i "^$word$" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
done
Length of output: 483
languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/CompoundInfinitivRule.java (4)
83-118
: LGTM! The new anti-patterns handle common German expressions correctly.
The added patterns properly handle various German expressions that should not trigger the compound infinitive rule, including:
- Expressions with "kurz davor zu"
- Time-related expressions with "Jahr/Monat/Zeit über zu"
- Expressions with "endlich wieder zu"
- Expressions with "bis hin zu"
- Expressions with "von ... her zu"
- Expressions with adjectives and intensifiers
141-144
: LGTM! The time-related anti-patterns are comprehensive.
The pattern correctly handles temporal expressions with "über zu" construction, covering a wide range of time-related nouns:
Spiel|Tag|Nacht|Morgen|Nachmittag|Abend|Zeit|.+zeit|Jahr(zehnt)?|Monat|.+tag|Mittwoch|Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember
171-175
: LGTM! The anti-pattern for repeated "zu" is correctly implemented.
The pattern properly handles cases where "zu" appears twice in sequence with an adjective in between.
222-223
: LGTM! The example pair update is appropriate.
The example has been updated from "sicher zu gehen" to "sicherzugehen", which better illustrates the compound infinitive rule.
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)
3120-3121
: LGTM! The new multi-token entries are well-formed.
The additions include:
- Alternative spellings for Thai locations (Ko/Koh variants)
- English phrases with proper language tags
Let's verify the consistency of the new entries:
✅ Verification successful
The new entries follow the established format and are correctly placed in the file
The verification confirms that:
- The new entries (Ko Samui/S, Koh Samui/S) follow the same format pattern as other entries in the file (/S suffix for plural forms)
- They are properly grouped with other similar Thai location variants
- The format is consistent with the file's documented suffix rules (# /S will add s)
- The entries are properly sorted and integrated with surrounding entries
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the new entries follow the established format
# and check for potential duplicates
echo "Checking format consistency and duplicates..."
entries=(
"Ko Samui/S"
"Koh Samui/S"
"Ko Lanta/S"
"Koh Lanta/S"
"Ko Phangan/S"
"Koh Phangan/S"
"Koh Pha Ngan/S"
"Ko Pha-ngan/S"
"Sexiest Man Alive/S #eng"
"Sexiest Woman Alive/S #eng"
)
for entry in "${entries[@]}"; do
# Strip comments and suffixes for base form
base=$(echo "$entry" | sed 's|/[A-Z]||g' | sed 's| #.*||')
echo "Checking: $base"
rg -i "^$base(/[A-Z])?([ ]*#.*)?$" "languagetool-language-modules/*/src/main/resources/org/languagetool/resource/*/multitoken-suggest.txt"
done
Length of output: 4886
Script:
#!/bin/bash
# Let's first check if the file exists and its content
fd -t f "multitoken-suggest.txt" languagetool-language-modules/
# Then let's examine the content of the specific file we're reviewing
cat languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt
# Let's also check the format of other entries in this file for consistency
head -n 20 languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt
Length of output: 54633
Script:
#!/bin/bash
# Let's verify the format consistency by checking a few key aspects:
# 1. Check if the file exists and is readable
# 2. Check the format of entries around the new additions
# 3. Verify the specific lines in question
# First, let's get some context around the specific lines
echo "Checking context around the new entries..."
grep -C 5 "Ko Samui" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt
# Let's also check the general format pattern in the file
echo -e "\nChecking format patterns in the file..."
grep -E "^[A-Za-z].*(/[A-Z]|#eng)" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt | head -n 10
Length of output: 827
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1)
26868-26876
: Verify proper names accuracy
The added proper names appear to be correctly capitalized and formatted. However, let's verify their accuracy against authoritative sources.
Also applies to: 26879-26883
✅ Verification successful
Proper names verified and confirmed accurate
All the proper names in the spelling dictionary have been verified against Wikipedia and confirmed to exist as valid entities. Each entry returned a valid Wikipedia page ID, confirming their authenticity and correct spelling:
- Standard Oil (28931)
- Oxford Circus (229191)
- Oxford Street (229195)
- Jardine Matheson (77873779)
- Dagens Nyheter (207326)
- Opel Grandland (53035578)
- Vauxhall Grandland (53849423)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify proper names against Wikipedia API for existence and spelling
# Note: This helps confirm these are real entities and spelled correctly
for name in "Standard Oil" "Oxford Circus" "Oxford Street" "Jardine Matheson" "Dagens Nyheter" "Opel Grandland" "Vauxhall Grandland" "Qeshm Air" "Santiago Posteguillo" "Portia de Rossi"; do
echo "Checking: $name"
curl -s "https://en.wikipedia.org/w/api.php?action=query&format=json&titles=${name// /%20}&prop=info" | jq '.query.pages | map_values(.pageid)'
done
Length of output: 4209
Summary by CodeRabbit
New Features
Bug Fixes
Documentation