Cb changes 20241203 #11081

tiff · 2024-12-04T09:02:58Z

Summary by CodeRabbit

New Features
- Expanded vocabulary with numerous scientific names, proper nouns, and new entries for various verbs and adjectives in the language processing tool.
- Enhanced grammar rules with new antipatterns and exceptions for improved detection of complex constructions in German.
- Added entries for popular names and terms in the English ignore list to improve spell checker accuracy.
Bug Fixes
- Improved detection of compound infinitives by refining existing patterns and examples.
Documentation
- Updated spelling and grammar resources to reflect new entries and changes for better user guidance.

coderabbitai · 2024-12-04T09:03:07Z

Walkthrough

This pull request introduces significant updates to various language processing files within the LanguageTool project. Key changes include the addition of numerous scientific names and proper nouns to spelling_global.txt, enhancements to the CompoundInfinitivRule in the German language module, and modifications to the spelling.txt and multitoken-suggest.txt files to expand recognized vocabulary. Additionally, new antipatterns and exceptions are added to the grammar.xml file, improving grammar checking capabilities. The ignore.txt and spelling.txt files in the English language module also see substantial additions to their respective lists.

Changes

File	Change Summary
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt	Added a large number of scientific names and proper nouns; modified formatting and structure.
languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/CompoundInfinitivRule.java	Updated `ANTI_PATTERNS` list with new patterns and modified example pairs to enhance detection of compound infinitives.
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt	Updated entry for "Granola" to include part of speech; added various new entries for verbs, adjectives, and nouns.
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt	Added new entries for various locations and names, including "Ko Samui," "Koh Samui," and "Sexiest Man/Woman Alive." Minor formatting adjustments made for consistency.
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml	Introduced multiple new antipatterns and exceptions to enhance grammar checking for specific constructions in German.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt	Added numerous entries to the ignore list, including proper nouns and technical terms.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt	Introduced new entries covering scientific terms and modern vernacular, enhancing spell checker functionality.
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml	Updated existing `<token>` element in `<antipattern>` to include `inflected="yes"` for terms related to academic contexts, improving grammar rule capabilities.

Possibly related PRs

Cb changes 20240830 #10856: Adds proper nouns to spelling_global.txt, similar to the main PR's addition of scientific names and proper nouns.
Cb changes 20240905 #10865: Introduces new entries to spelling_global.txt, enhancing vocabulary consistent with the main PR's focus.
[pt] Removed "temp_off" from rule ID:OS_DOIS_AS_DUAS_AMBOS_AMBAS #10887: Adds scientific names and proper nouns to spelling_global.txt, directly aligning with the main PR's changes.
Cb changes 20241030 #10974: Adds a large number of proper nouns and scientific names to spelling_global.txt, closely related to the main PR's objectives.
Cb changes 20241105 #10987: Adds scientific names and proper nouns to spelling_global.txt, directly related to the main PR's focus on expanding vocabulary.
Cb changes 20241111 #11010: Continues adding scientific names and proper nouns to spelling_global.txt, aligning with the main PR's objectives.
Cb changes 20241113 #11021: Adds a large number of scientific names and proper nouns to spelling_global.txt, consistent with the main PR's changes.
Cb changes 20241121 #11051: Adds scientific names and proper nouns to spelling_global.txt, directly related to the main PR's focus on expanding the vocabulary.
[nl] add spelling, multiwords #11079: Adds various proper nouns and phrases to multiwords.txt, complementing the main PR's focus on expanding vocabulary.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (1)
65383-65383: Add declension suffix for medical term

The medical term "Fasziitis" should include a declension suffix. Consider adding /S for singular noun declension, as it follows the same pattern as other medical terms in German.
-Fasziitis
+Fasziitis/S
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (2)

26871-26872: Consider standardizing case variants

The entries "Vinci Autoroutes" and "VINCI Autoroutes" represent the same entity. Consider implementing case-insensitive matching in the spelling checker to avoid maintaining multiple variants.

26877-26878: Consider handling spelling variants systematically

The entries "Ko Samui" and "Koh Samui" represent alternative spellings of the same Thai location. Consider:

Documenting which spelling is preferred

Implementing a suggestion mechanism to recommend the preferred spelling

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ecfa1ca and 1bcae25.

📒 Files selected for processing (9)

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1 hunks)
languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/CompoundInfinitivRule.java (3 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (3 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1 hunks)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (2 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1 hunks)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1 hunks)

🔥 Files not summarized due to errors (1)

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml: Error: Server error: no LLM provider could handle the message

✅ Files skipped from review due to trivial changes (1)

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt

🔇 Additional comments (14)

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)

17155-17172: 🛠️ Refactor suggestion

Consider a more maintainable approach for handling censored words

While the implementation is functional, there are a few concerns to address:

The current approach explicitly lists profanity in the source code, which might not be ideal for code maintenance and readability.
The pattern matching might miss variations or be too permissive.

Consider these improvements:

Move the censored word patterns to a separate configuration file that can be:
- Easily maintained without touching the main rules
- Protected with appropriate access controls
- Updated without redeploying the application
Use a more robust pattern matching approach:

<rulegroup id="CENSORED_WORDS" name="ignore spelling of censored words">
    <rule>
        <pattern>
            <token>
                <exception regexp="yes"><!-- include safe words that might match --></exception>
                <reference path="censored-patterns.xml"/>
            </token>
        </pattern>
        <disambig action="ignore_spelling"/>
    </rule>
</rulegroup>

Let's check if similar patterns exist elsewhere in the codebase:

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1)

72202-72202: LGTM! Good enhancement to the academic terms pattern.

Adding inflected="yes" is a good improvement as it allows the antipattern to match plural forms and other inflections of academic terms (e.g., "theses", "students", "graduates"), making the rule more comprehensive.

Let's verify that this antipattern doesn't conflict with other rules:

✅ Verification successful

Let me analyze the search results:

The search found several academic-related terms in various rules:
- A US-specific rule for "master's dissertation/thesis" terminology
- Examples using academic terms in various contexts
- No conflicting antipatterns using these terms were found by ast-grep
Let me search for any other potential conflicts with inflected forms:

Based on the search results, I can now make a definitive assessment:

No conflicts found with inflected academic terms pattern

The search results show that:

The academic terms are used in various rules but in different contexts (e.g., subject-verb agreement, capitalization)
No conflicting antipatterns were found that would be affected by adding inflected="yes"
The existing rules that use these terms (like "master's thesis" vs "master's dissertation") work independently and won't be impacted by the inflection change

The change to allow inflected forms in the antipattern is safe and will improve pattern matching by correctly handling plural forms of academic terms.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential conflicts with other rules using the same terms
# Look for other rules that might be affected by this change

# Search for other instances of these academic terms in grammar rules
rg -A 5 -B 5 '\b(thesis|major|student|graduate|class|degree|program|diploma|course)\b' \
    --type xml \
    languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/

# Check for any existing antipatterns using these terms
ast-grep --pattern '<antipattern>
  $$$
  <token $$$>thesis|major|student|graduate|class|degree|program|diploma|course</token>
  $$$
</antipattern>'

Length of output: 71113

Script:

#!/bin/bash
# Search for other rules that might use inflected forms of these terms
rg -A 5 -B 5 'inflected="yes".*>(thesis|major|student|graduate|class|degree|program|diploma|course)<' \
    --type xml \
    languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/

# Also check for any rules that might be affected by plural forms
rg -A 5 -B 5 '\b(theses|majors|students|graduates|classes|degrees|programs|diplomas|courses)\b' \
    --type xml \
    languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/

Length of output: 85539

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (3)

17576-17576: LGTM: Proper morphological annotation added

The addition of the /S suffix to "Granola" correctly identifies it as a noun with standard singular declension while maintaining its proper noun categorization.

65360-65373: LGTM: Complete verb conjugation sets

The additions for "rumkippeln" and "herumkippeln" include all necessary conjugation forms (infinitive, present tense, past tense, and participle) for both colloquial variants.

65434-65434: LGTM: Proper band name entry

The addition of "AnnenMayKantereit" with the /S suffix correctly follows the conventions for proper nouns in the dictionary.

languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (2)

27276-27326: Well-structured antipatterns covering common German constructions!

The new antipatterns effectively handle various German linguistic constructions:

Verb-preposition combinations (e.g., "reichen aus zu", "kommen komisch vor zu")
Intensifier patterns with adjectives
Common phrasal expressions (e.g., "mit zu tun", "Angst vor zu")

The patterns make good use of XML attributes (inflected, skip, regexp) to ensure flexible matching while maintaining accuracy.

27372-27374: LGTM! Valid preposition exceptions added.

The new exceptions "kraft", "oberhalb", and "unterhalb" are valid prepositions that can precede "zu" in correct German grammar.

Let's verify if we have comprehensive coverage of similar prepositions:

✅ Verification successful

LGTM! The preposition exceptions are well covered

The search results confirm that similar directional prepositions ending in "-halb" (ausserhalb, oberhalb, unterhalb) are consistently handled as exceptions. The new additions align with the existing pattern and maintain consistency in the grammar rules.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other prepositions that might need similar exception handling
rg -A 2 '<exception>[a-z]+(halb|kraft)</exception>' languagetool-language-modules/de/src/main/resources/

Length of output: 1117

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1)

534-535: LGTM! The new dictionary entries follow the established format.

The additions include well-formed technical terms, compound words, and general vocabulary. The entries maintain consistency with the file's format and existing patterns.

Let's verify that these new entries don't conflict with existing ones:

✅ Verification successful

The new dictionary entries are correctly formatted and do not have any duplicates

The verification confirms that both "controllability" and "controlment" are present in the spelling dictionary exactly once, following the established format of one word per line without any variations or duplicates.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential duplicates of the new entries
# Note: Using case-insensitive search to catch any variations

echo "Checking for potential duplicates..."
for word in controllability controlment organocatalyst organocatalysts organocatalysis bioorganometallic organometallic organometallics victimhood culturewise unintrusive unintrusively unobtrusive unobtrusively questline questlines overindepted underindepted "UV-degradable" "oxy-degradable" "bio-degradable" "bio-degradables" biodegradable biodegradables Braudelian outflux outfluxes torr torrs; do
  echo "Checking: $word"
  rg -i "^$word(/[A-Z])?$" "languagetool-language-modules/*/src/main/resources/org/languagetool/resource/*/hunspell/spelling.txt"
done

Length of output: 21278

Script:

#!/bin/bash
# Let's try with the correct path pattern for the spelling files
for word in controllability controlment; do
  echo "Checking: $word"
  rg -i "^$word$" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
done

Length of output: 483

languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/CompoundInfinitivRule.java (4)

83-118: LGTM! The new anti-patterns handle common German expressions correctly.

The added patterns properly handle various German expressions that should not trigger the compound infinitive rule, including:

Expressions with "kurz davor zu"
Time-related expressions with "Jahr/Monat/Zeit über zu"
Expressions with "endlich wieder zu"
Expressions with "bis hin zu"
Expressions with "von ... her zu"
Expressions with adjectives and intensifiers

141-144: LGTM! The time-related anti-patterns are comprehensive.

The pattern correctly handles temporal expressions with "über zu" construction, covering a wide range of time-related nouns:

Spiel|Tag|Nacht|Morgen|Nachmittag|Abend|Zeit|.+zeit|Jahr(zehnt)?|Monat|.+tag|Mittwoch|Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember

171-175: LGTM! The anti-pattern for repeated "zu" is correctly implemented.

The pattern properly handles cases where "zu" appears twice in sequence with an adjective in between.

222-223: LGTM! The example pair update is appropriate.

The example has been updated from "sicher zu gehen" to "sicherzugehen", which better illustrates the compound infinitive rule.

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)

3120-3121: LGTM! The new multi-token entries are well-formed.

The additions include:

Alternative spellings for Thai locations (Ko/Koh variants)
English phrases with proper language tags

Let's verify the consistency of the new entries:

✅ Verification successful

The new entries follow the established format and are correctly placed in the file

The verification confirms that:

The new entries (Ko Samui/S, Koh Samui/S) follow the same format pattern as other entries in the file (/S suffix for plural forms)
They are properly grouped with other similar Thai location variants
The format is consistent with the file's documented suffix rules (# /S will add s)
The entries are properly sorted and integrated with surrounding entries

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the new entries follow the established format
# and check for potential duplicates

echo "Checking format consistency and duplicates..."
entries=(
  "Ko Samui/S"
  "Koh Samui/S"
  "Ko Lanta/S"
  "Koh Lanta/S"
  "Ko Phangan/S"
  "Koh Phangan/S"
  "Koh Pha Ngan/S"
  "Ko Pha-ngan/S"
  "Sexiest Man Alive/S #eng"
  "Sexiest Woman Alive/S #eng"
)

for entry in "${entries[@]}"; do
  # Strip comments and suffixes for base form
  base=$(echo "$entry" | sed 's|/[A-Z]||g' | sed 's| #.*||')
  echo "Checking: $base"
  rg -i "^$base(/[A-Z])?([ ]*#.*)?$" "languagetool-language-modules/*/src/main/resources/org/languagetool/resource/*/multitoken-suggest.txt"
done

Length of output: 4886

Script:

#!/bin/bash
# Let's first check if the file exists and its content
fd -t f "multitoken-suggest.txt" languagetool-language-modules/

# Then let's examine the content of the specific file we're reviewing
cat languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt

# Let's also check the format of other entries in this file for consistency
head -n 20 languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt

Length of output: 54633

Script:

#!/bin/bash
# Let's verify the format consistency by checking a few key aspects:
# 1. Check if the file exists and is readable
# 2. Check the format of entries around the new additions
# 3. Verify the specific lines in question

# First, let's get some context around the specific lines
echo "Checking context around the new entries..."
grep -C 5 "Ko Samui" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt

# Let's also check the general format pattern in the file
echo -e "\nChecking format patterns in the file..."
grep -E "^[A-Za-z].*(/[A-Z]|#eng)" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt | head -n 10

Length of output: 827

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1)

26868-26876: Verify proper names accuracy

The added proper names appear to be correctly capitalized and formatted. However, let's verify their accuracy against authoritative sources.

Also applies to: 26879-26883

✅ Verification successful

Proper names verified and confirmed accurate

All the proper names in the spelling dictionary have been verified against Wikipedia and confirmed to exist as valid entities. Each entry returned a valid Wikipedia page ID, confirming their authenticity and correct spelling:

Standard Oil (28931)
Oxford Circus (229191)
Oxford Street (229195)
Jardine Matheson (77873779)
Dagens Nyheter (207326)
Opel Grandland (53035578)
Vauxhall Grandland (53849423)

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify proper names against Wikipedia API for existence and spelling
# Note: This helps confirm these are real entities and spelled correctly

for name in "Standard Oil" "Oxford Circus" "Oxford Street" "Jardine Matheson" "Dagens Nyheter" "Opel Grandland" "Vauxhall Grandland" "Qeshm Air" "Santiago Posteguillo" "Portia de Rossi"; do
  echo "Checking: $name"
  curl -s "https://en.wikipedia.org/w/api.php?action=query&format=json&titles=${name// /%20}&prop=info" | jq '.query.pages | map_values(.pageid)'
done

Length of output: 4209

tiff added 4 commits December 3, 2024 20:23

[core] add global spelling

d0cbdcb

[de] add spelling

7fc1981

[en] improve rules

b13ac0b

[de] improve rules

1bcae25

coderabbitai bot reviewed Dec 4, 2024

View reviewed changes

tiff merged commit 1f3b803 into master Dec 4, 2024
5 checks passed

This was referenced Dec 13, 2024

Cb changes 20241213 #11107

Merged

Optimizations, part 2 #11064

Merged

Cb changes 20241217 #11120

Merged

coderabbitai bot mentioned this pull request Jan 6, 2025

Cb changes 20240106 #11164

Merged

coderabbitai bot mentioned this pull request Jan 15, 2025

[en] improve rules #11191

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cb changes 20241203 #11081

Cb changes 20241203 #11081

tiff commented Dec 4, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 4, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

Cb changes 20241203 #11081

Cb changes 20241203 #11081

Conversation

tiff commented Dec 4, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Dec 4, 2024 • edited Loading

Walkthrough

Changes

Possibly related PRs

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

tiff commented Dec 4, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 4, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)