Cb changes 20241213 #11107

Merged
merged 7 commits into from
Dec 13, 2024
Changes from 5 commits
Original file line number Diff line number Diff line change
@@ -26875,6 +26875,10 @@ Opel Grandland
Vauxhall Grandland
Ko Samui
Koh Samui
+Fox & Friends
+Fox and Friends
+Royal Dutch Airlines
+Paul Atkins
Qeshm Air
Santiago Posteguillo
Portia de Rossi
@@ -26900,6 +26904,12 @@ Stade Brestois
Cecilia Bartoli
Joyce DiDonato
Murray Perahia
+Luigi Mangione
+Gukesh Dommaraju
+Spotify Wrapped
+Albert Park
+Victorinox
+Gout Gout
Lang Lang
Jascha Heifetz
Yehudi Menuhin
@@ -26927,6 +26937,9 @@ Tallis Scholars
James Longstreet
Ernesto Halffter
Lars Vogt
+al-Bashir
+Al Bashir
+Dassault Falcon
Thomas Kinkade
Kash Patel
Paul Bocuse
Original file line number Diff line number Diff line change
@@ -122,6 +122,10 @@ public class GermanWordRepeatRule extends WordRepeatRule {
token("möp"),
token("möp")
),
+Arrays.asList(
+token("gout"),
+token("gout")
+),
Arrays.asList(
token("piep"),
token("piep")
Original file line number Diff line number Diff line change
@@ -362,7 +362,7 @@ Global Goals?
Global Site Tag
Global Business|Teams?|Cokes?|Services?|Pensions?|Marketings?|Talents?|Rockstars?|Geoparks?|News|Hubs?|Managements?|Junior|Senior|Head|Vice|President|Player[ns]?|Locations?|Targets?|Pass|Funds?|Sales?|Channels?|City|Times|Navigation|Offensive|Investors?|Real|Values?|Grade|Minimum|Revolution|Media|Accounting|Governance|Copyright|Automotives?|Balances?|Leaders?|Leaderships?|Reportings?
Global(er|en|em)? Süden|Norden
-Golden (Globes?|Delicious|Retrievers?|Goals?|Gates?|Tulip|Boys?|Age|Lager|Masters?|Eagles?|State|Toasts?|Tigers?|Knights?|Fish|Touch(es)?|Tans?|Suns?|Circles?)
+Golden (Globes?|Delicious|Retrievers?|Goals?|Gates?|Tulip|Boys?|Age|Lager|Masters?|Eagles?|State|Toasts?|Tigers?|Knights?|Fish|Touch(es)?|Tans?|Suns?|Circles?|Bachelors?|Bachelorette)
Goldene[nrsm]? Schnitte?s?|Bullen?|Palme|Bär(en)?|Spatz(en)?|Hirsch|Löwen?|Himbeeren?|Laus|Saals?|Windbeuteln?|Chlorellas?|Zeitalters?|(Ehren|Verdienst)zeichens?|Zwanzigern?|Schallplatten?|Buchs?|Stadt|Vlies(es)?|Kamera|Horden?|Tore?s?|Ordnung|Schwans?|Reiters?|Zirkels?|Dreiecks?|Horns?
Googles? Fits?
Gordische[nrm]? Knotens?
Original file line number Diff line number Diff line change
@@ -11614,6 +11614,11 @@ Hand-Fuß-Mund-Krankheit*
Reenactor-Messe$
Reenactor-Messen$
Bundesliga-Start$
+Neo-Angin*
+Coxsackie-A-Virus*
+Coxsackie-A-Viren*
+Coxsackie-Virus*
+Coxsackie-Viren*
Inhouse-Seminar$
Inhouse-Seminars$
Inhouse-Seminare$
Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@ German Disambiguation Rules for LanguageTool
Copyright © 2013 Markus Brenneis, Daniel Naber, Jan Schreiber
-->
<!DOCTYPE rules [
-<!ENTITY zal_adj_ig "jähr|bänd|seit|monat|täg|köpf|tür|spur|geschoss|stöck|mal|teil|lag|stünd|minüt|sekünd|zeil|wöch|stell|räum|schicht|zähn|eck|arm|karät|buchstab">
+<!ENTITY zal_adj_ig "jähr|bänd|seit|monat|täg|köpf|tür|spur|geschoss|stöck|mal|teil|lag|stünd|minüt|sekünd|zeil|wöch|stell|räum|schicht|zähn|eck|arm|karät|buchstab|phas">
<!ENTITY zal_adj "(?-i)(\d+-|(ein|zw(ei|an)|dreiß?|vier|fünf|s(echs?|ieb(en)?)|acht|neun|zehn|elf|zwölf)(zehn|zig)?)(&zal_adj_ig;)">
<!ENTITY apostrophe "['’`´‘]">
<!ENTITY filename_extensions "ai|asp|aspx|avi|bak|bat|bmp|cab|cfg|cgi|com|css|csv|dat|db|dbf|dll|doc|docx|eps|exe|flv|gif|htm|html|ibooks|ico|idml|ini|indd|jar|jpeg|jpg|js|jsp|lnk|md|mdb|mid|mov|mp3|mp4|mpa|mpg|ods|pdf|php|pl|png|pps|ppt|pptx|ps|psd|py|rar|rss|scss|sh|shtml|sql|stl|svg|sys|tar|tif|tiff|tmp|ts|txt|wav|wma|xhtml|xls|xlsx|xml|xsl|yaml|yml|zip|zipx|7z|web[mp]">
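To see what adding `phas` to `zal_adj_ig` changes, the two entities can be expanded by hand into an ordinary regex. This is only a sketch in Python: the `(?-i)` prefix is dropped because Python's `re` is case-sensitive by default, and the trailing `ig(e[mnrs]?)?` ending is an assumption about how the disambiguation rules use `&zal_adj;`, since that part is not shown in this hunk. The names `OLD_STEMS`, `NEW_STEMS`, `NUMBER` and `build` are illustrative only.

```python
import re

# Stem lists copied from the old and new ENTITY lines in the hunk above.
OLD_STEMS = ("jähr|bänd|seit|monat|täg|köpf|tür|spur|geschoss|stöck|mal|teil|lag|"
             "stünd|minüt|sekünd|zeil|wöch|stell|räum|schicht|zähn|eck|arm|karät|buchstab")
NEW_STEMS = OLD_STEMS + "|phas"

# Number prefix from zal_adj: digits plus hyphen, or a spelled-out number.
NUMBER = (r"(\d+-|(ein|zw(ei|an)|dreiß?|vier|fünf|s(echs?|ieb(en)?)|acht|neun|"
          r"zehn|elf|zwölf)(zehn|zig)?)")


def build(stems: str) -> re.Pattern:
    # ASSUMPTION: the rules append an "-ig" adjective ending after the stem.
    return re.compile(NUMBER + "(" + stems + ")" + r"ig(e[mnrs]?)?")


old, new = build(OLD_STEMS), build(NEW_STEMS)

# Print, per word, whether the old and the new stem list match it.
for word in ["dreiphasig", "3-phasige", "zweijährig"]:
    print(word, bool(old.fullmatch(word)), bool(new.fullmatch(word)))
```

Under these assumptions only the new stem list accepts forms such as "dreiphasig" and "3-phasige", while existing forms such as "zweijährig" match either way.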
Original file line number Diff line number Diff line change
@@ -243823,6 +243823,11 @@ Ekitiké/S #name
WPA #abk
Embedded #eng
VAT #abk
+XYZ
+xyz
+ABC
+abc
+al-Bashir/S #name
Madsack/S #name
Hegseth/S #name
Lugeder/S
Original file line number Diff line number Diff line change
@@ -44195,6 +44195,11 @@ Stilempfehlungen
Preisetabelle/N
3D-Secure-Verfahren/S
Ivy-League-Universität
+Ivy-League-Uni/S
+Ivy-League-Student
+Ivy-League-Studentin
+Ivy-League-Studentinnen
+Ivy-League-Studenten
Ivy-League-Universitäten
Selbstabtönung
Selbstabtönungen
@@ -65432,6 +65437,24 @@ straßenbündig/A
Durchfahrtssperre/N
Bøgseth/S
AnnenMayKantereit/S
+syrischstämmig/A
+Dunkelflaute/N
+Neo-Angin/S #name
Contributor

💡 Codebase verification

Review of medication trademark handling in German spelling dictionary

Based on the search results, I found several medication-related trademarks that should be consistently marked with #name:

  • Chlorhexamed (antiseptic medication)
  • Medistar (medical software)
  • Mundipharma/S (pharmaceutical company)

These entries appear to be trademarks similar to Neo-Angin/S but currently lack the #name marker.

🔗 Analysis chain

Verify trademark handling for Neo-Angin.

The entry "Neo-Angin/S" is correctly marked with #name as it's a trademark. Consider if other branded medications in the dictionary should follow the same pattern.


Let me generate new shell scripts to specifically search for medication names in the German spelling dictionary:

This will help us identify if there are other medication names that should be marked with #name like Neo-Angin/S.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other medication names that might need #name marking

# Search for potential medication names without #name marking
rg -l '[A-Z][a-z]+(?:-[A-Z][a-z]+)?(?:/[A-Z])?' | xargs rg -v '#name' | grep -i 'med\|pharma\|drug'

Length of output: 69674


Script:

#!/bin/bash
# Search for medication-related entries in the German spelling dictionary
rg -i "(?:med|pharm|drug|angin)" languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt

Length of output: 4010
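
The same check can be sketched in a few lines of Python, which avoids piping `rg` through `xargs` and `grep`. The sample entries are the ones named in this comment plus one unrelated word from the diff; the keyword list mirrors the second script's pattern. `unmarked_candidates` is a hypothetical helper, not part of the repository.

```python
import re

# Same keywords as the second rg call above, matched case-insensitively.
KEYWORDS = re.compile(r"med|pharm|drug|angin", re.IGNORECASE)


def unmarked_candidates(lines):
    """Return entries that match a keyword but carry no '#name' marker."""
    hits = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        # Only look at the word itself, not at a trailing '#...' comment.
        word = entry.split("#", 1)[0].strip()
        if KEYWORDS.search(word) and "#name" not in entry:
            hits.append(entry)
    return hits


# Illustrative sample, not the real spelling.txt.
sample = ["Neo-Angin/S #name", "Chlorhexamed", "Medistar", "Mundipharma/S", "Dunkelflaute/N"]
print(unmarked_candidates(sample))
```

A keyword hit only marks a candidate: ordinary words that happen to contain "med" would be listed too, so each result still needs a manual look.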

+privatinsolvent/A
+oberhell/A
+blitzkriegartig/A
Contributor

⚠️ Potential issue

Consider alternative terminology for "blitzkriegartig".

The term "blitzkriegartig" might be sensitive due to its historical connotations. Consider using alternatives like "blitzschnell" or "schlagartig" instead.

+umzugsfreudig/A
+Techanalyse/N
+Techanalyst
+Techanalystin
+Techanalystinnen
+Techanalysten
+Coxsackievirus
+Coxsackieviren
+Coxsackie-Virus
+Coxsackie-Viren
+Coxsackie-A-Virus
+Coxsackie-A-Viren
Comment on lines +65452 to +65457
Contributor

🛠️ Refactor suggestion

Ensure consistent hyphenation in virus terminology.

The Coxsackievirus entries show inconsistent hyphenation:

  • "Coxsackievirus" (without hyphen)
  • "Coxsackie-Virus" (with hyphen)

Medical terminology should follow a consistent pattern.

Consider standardizing all entries to use hyphens:

-Coxsackievirus
-Coxsackieviren
+Coxsackie-Virus
+Coxsackie-Viren

Committable suggestion skipped: line range outside the PR's diff.

ein_trainieren
eintrainiert/P
Mittelblockerposition
Original file line number Diff line number Diff line change
@@ -3130,6 +3130,12 @@ Sexiest Woman Alive/S #eng
Hugo Ekitiké/S #name
Embedded Systems #eng
Duo Infernale
+Low Performer/SN
+Low Performerin
+Low Performerinnen
+High Performer/SN
+High Performerin
+High Performerinnen
Ingvar Kamprad/S #name
Masel tov
masel tov
@@ -3147,6 +3153,10 @@ Artocarpus heterophyllus
Hua Luogeng/S
Pascal Comelade/S
Carl Boese/S
+Mixed Zone/S
+Al Bashir/S #name
+Golden Bachelor/S #name
+Dommaraju Gukesh/S #name
Darryl Jones
Tom Herman/S
Hallstein Bøgseth/S
Original file line number Diff line number Diff line change
@@ -23965,7 +23965,7 @@ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA
<rule>
<pattern>
<token regexp="yes">[0-9]+</token>
-<token regexp="yes">(wöch|[zt]eil|bändig|räum|köpf|täg|monat|jähr|stünd|sekünd|minüt|mal|lag|prozent|seit|stell|spur|eck|tür|geschoss|stöck|schicht|zähn|arm)ig(e[mnrs]?)?</token>
+<token regexp="yes">(wöch|[zt]eil|bändig|räum|köpf|täg|monat|jähr|stünd|sekünd|minüt|mal|lag|prozent|seit|stell|spur|eck|tür|geschoss|stöck|schicht|zähn|arm|phas)ig(e[mnrs]?)?</token>
</pattern>
<message><suggestion>\1-<match no="2" case_conversion="alllower" /></suggestion> wird mit Bindestrich geschrieben.</message>
<example correction='7-teiliges'>Ich biete ein <marker>7 teiliges</marker> Set zum Verkauf an.</example>
@@ -23977,7 +23977,7 @@ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA
</rule>
<rule>
<pattern>
-<token regexp="yes">\d+(wöch|[zt]eil|bändig|räum|köpf|täg|monat|jähr|stünd|sekünd|minüt|mal|lag|prozent|seit|stell|spur|eck|tür|geschoss|stöck|schicht|zähn|arm)ig(e[mnrs]?)?</token>
+<token regexp="yes">\d+(wöch|[zt]eil|bändig|räum|köpf|täg|monat|jähr|stünd|sekünd|minüt|mal|lag|prozent|seit|stell|spur|eck|tür|geschoss|stöck|schicht|zähn|arm|phas)ig(e[mnrs]?)?</token>
</pattern>
<message><suggestion><match no="1" regexp_match="^(\d+).+" regexp_replace="$1" />-<match no="1" regexp_match="^\d+(.+)$" regexp_replace="$1" case_conversion="alllower" /></suggestion> wird mit Bindestrich geschrieben.</message>
<example correction='7-teiliges'>Ich biete ein <marker>7teiliges</marker> Set zum Verkauf an.</example>
@@ -23998,7 +23998,7 @@ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA
<pattern>
<token regexp="yes">zwei|drei|vier|fünf|sechs|sieben|acht|neun|zehn|elf|zwölf</token>
<!-- ohne 'lich' wg. Fällen wie "1869 verkehrten vier tägliche Zugpaare": -->
-<token regexp="yes">(wöch|[zt]eil|bändig|räum|köpf|täg|monat|jähr|stünd|sekünd|minüt|mal|lag|prozent|seit|stell|spur|eck|tür|geschoss|stöck|schicht|zähn|arm)ig(e[mnrs]?)?</token>
+<token regexp="yes">(wöch|[zt]eil|bändig|räum|köpf|täg|monat|jähr|stünd|sekünd|minüt|mal|lag|prozent|seit|stell|spur|eck|tür|geschoss|stöck|schicht|zähn|arm|phas)ig(e[mnrs]?)?</token>
</pattern>
<message><suggestion>\1<match no="2" case_conversion="alllower" /></suggestion> wird zusammengeschrieben.</message>
<example correction='fünfsekündige'>Es gab eine <marker>fünf sekündige</marker> Verzögerung.</example>
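The three rules touched in this file can be approximated outside LanguageTool to check what the added `phas` stem does. The sketch below is Python, not the rule engine: it assumes that a `<token>` regex must match a whole word and ignores case, and it replaces the `<suggestion>` markup with plain `re.sub` calls. `suggest` and the constant names are hypothetical.

```python
import re

# Stem alternation copied from the updated <token> lines, including "phas".
STEMS = ("wöch|[zt]eil|bändig|räum|köpf|täg|monat|jähr|stünd|sekünd|minüt|mal|lag|"
         "prozent|seit|stell|spur|eck|tür|geschoss|stöck|schicht|zähn|arm|phas")
ADJ = rf"(?:{STEMS})ig(?:e[mnrs]?)?"
NUM_WORDS = "zwei|drei|vier|fünf|sechs|sieben|acht|neun|zehn|elf|zwölf"


def suggest(text: str) -> str:
    """Apply rough equivalents of the three rules and return the corrected text."""
    # Rule 1: digits, space, adjective -> digits-adjective
    text = re.sub(rf"\b(\d+) ({ADJ})\b",
                  lambda m: f"{m.group(1)}-{m.group(2).lower()}",
                  text, flags=re.IGNORECASE)
    # Rule 2: digits written together with the adjective -> digits-adjective
    text = re.sub(rf"\b(\d+)({ADJ})\b",
                  lambda m: f"{m.group(1)}-{m.group(2).lower()}",
                  text, flags=re.IGNORECASE)
    # Rule 3: number word, space, adjective -> written as one word
    text = re.sub(rf"\b({NUM_WORDS}) ({ADJ})\b",
                  lambda m: m.group(1) + m.group(2).lower(),
                  text, flags=re.IGNORECASE)
    return text


print(suggest("Ich biete ein 7 teiliges Set zum Verkauf an."))
print(suggest("ein 3 phasiger Anschluss"))
```

The first call reproduces the `7-teiliges` correction from the rule's own example; the second shows the new stem being picked up in the same way.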
Original file line number Diff line number Diff line change
@@ -1831,7 +1831,7 @@ To ignore a remote rule match, set the <marker> so that it exactly covers the te
<rule>
<pattern>
<marker>
-<token regexp="yes">Barbarazweige?|obengenannt</token>
+<token regexp="yes">Barbarazweige?|obengenannt.*|privatinsolvent.*</token>
Contributor

💡 Codebase verification

Based on the findings, I can now generate the final response:

Add examples and documentation for new orthography space rule patterns

The rule additions need proper documentation and examples:

  • obengenannt.* and privatinsolvent.* lack example cases in the rule definition
  • All terms are listed in the spelling dictionaries, suggesting they are valid words
  • The rule group AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE appears to be for catching missing spaces, but the specific cases for these terms are not documented

Please:

  1. Add example cases for obengenannt and privatinsolvent patterns
  2. Document why these terms specifically need orthography space checks
  3. Consider adding test cases to verify the correct behavior
🔗 Analysis chain

Verify and document the new pattern additions

The addition of "Barbarazweige?|obengenannt.*|privatinsolvent.*" to the pattern needs verification and documentation.

Let's verify the usage of these terms in the codebase and documentation:

Please:

  1. Add examples for each new term
  2. Document why these terms require orthography space checks
  3. Consider adding test cases to verify the rules

Let me analyze the findings:

  1. The terms are listed in spelling dictionaries:

    • "Barbarazweig/E" and "Barbarazweigen" in spelling.txt
    • "obengenannt/nA" in all German dictionaries (DE, CH, AT)
    • "privatinsolvent/A" in spelling.txt
  2. There's only one example for "Barbarazweige" but none for the other terms.

Let me check if there are any message/description files that might explain these rules:


Let me check for any rule descriptions or messages that might explain the purpose of these additions:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for occurrences and documentation of the new terms

# Search for existing examples or documentation
rg -A 5 -B 5 "Barbarazweige|obengenannt|privatinsolvent" 

# Search for any related test cases
fd -e "test.xml" -e "test.txt" -x rg "Barbarazweige|obengenannt|privatinsolvent" {}

Length of output: 8391


Script:

#!/bin/bash
# Check for rule descriptions and messages
rg -A 5 -B 5 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/

# Check if these terms appear in other rule files
fd -e xml -x rg "Barbarazweige?|obengenannt|privatinsolvent" {} languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/

Length of output: 66429


Script:

#!/bin/bash
# Search for rule descriptions or messages containing these terms or rule ID
rg -B 5 -A 5 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" languagetool-language-modules/de/src/main/resources/org/languagetool/messages/

# Check for any XML comments near the rule definition
rg -B 10 -A 2 "AI_DE_GGEC_MISSING_ORTHOGRAPHY_SPACE" --type xml

Length of output: 2365

</marker>
</pattern>
<example correction=""><marker>Barbarazweige</marker> sind Zweige von Obstbäumen.</example>
@@ -2000,6 +2000,13 @@ To ignore a remote rule match, set the <marker> so that it exactly covers the te
</pattern>
<example correction="">Wir <marker>motzen</marker> heute.</example>
</rule>
+<rule>
+<pattern>
+<token>sorry</token>
+<token>für</token>
+</pattern>
+<example correction=""><marker>Sorr für</marker> die Unannehmlichkeit.</example>
+</rule>
Contributor

🛠️ Refactor suggestion

Enhance the rule implementation for 'sorry für' pattern

The rule implementation has several issues that need to be addressed:

  1. Missing suggestion for the correct German alternative
  2. Pattern doesn't match the example case ("Sorr für" vs "sorry für")
  3. Pattern might be too broad without additional context checks

Consider enhancing the rule with:

            <rule>
                <pattern>
-                    <token>sorry</token>
-                    <token>für</token>
+                    <token regexp="yes">sorry|sorr</token>
+                    <token>für</token>
                </pattern>
+                <suggestion>Entschuldigung für</suggestion>
                <example correction="">Sorr für die Unannehmlichkeit.</example>
+                <example>Es tut mir leid für die Unannehmlichkeit.</example>
            </rule>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-<rule>
-<pattern>
-<token>sorry</token>
-<token>für</token>
-</pattern>
-<example correction=""><marker>Sorr für</marker> die Unannehmlichkeit.</example>
-</rule>
+<rule>
+<pattern>
+<token regexp="yes">sorry|sorr</token>
+<token>für</token>
+</pattern>
+<suggestion>Entschuldigung für</suggestion>
+<example correction="">Sorr für die Unannehmlichkeit.</example>
+<example>Es tut mir leid für die Unannehmlichkeit.</example>
+</rule>
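
The mismatch between pattern and example (point 2 above) can be reproduced with a small stand-in matcher. This Python sketch is not LanguageTool's matcher: it assumes each pattern token has to match one whole word, ignoring case, and it treats every token as a regex for simplicity. `matches`, `committed` and `proposed` are illustrative names.

```python
import re


def matches(pattern_tokens, text):
    """True if some run of consecutive words in text matches pattern_tokens.

    ASSUMPTION: each pattern token must match a whole word, ignoring case.
    """
    words = re.findall(r"\w+", text)
    n = len(pattern_tokens)
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if all(re.fullmatch(p, w, flags=re.IGNORECASE)
               for p, w in zip(pattern_tokens, window)):
            return True
    return False


committed = ["sorry", "für"]        # tokens as committed in this PR
proposed = ["sorry|sorr", "für"]    # tokens from the suggested change

example = "Sorr für die Unannehmlichkeit."
print(matches(committed, example), matches(proposed, example))
```

Under these assumptions the committed tokens do not match the committed example sentence, while they do match "Sorry für die Unannehmlichkeit."; fixing the typo in the example would resolve the mismatch without widening the pattern.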

</rulegroup>

<rulegroup name="" id="AI_DE_GGEC_REPLACEMENT_ORTHOGRAPHY_SPELL.*">
Original file line number Diff line number Diff line change
@@ -169,6 +169,8 @@ public boolean ignore(AnalyzedTokenReadings[] tokens, int position) {
return true;
} else if (repetitionOf("yeah", tokens, position)) {
return true;
+} else if (repetitionOf("gout", tokens, position)) {
+return true;
} else if (repetitionOf("wait", tokens, position) && position == 2) {
return true;
} else if (repetitionOf("quack", tokens, position)) {
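The new branch keeps the repeated-word check from firing on the name "Gout Gout", which this PR also adds to the multiword list. A rough Python sketch of that logic follows. `repetition_of` is a hypothetical stand-in, and its behavior (the same word at the previous and the current position, ignoring case) is an assumption about what `repetitionOf` does, not something shown in this diff.

```python
def repetition_of(word, tokens, position):
    """Hypothetical stand-in: is `word` at both position - 1 and position?"""
    return (position > 0
            and tokens[position - 1].lower() == word
            and tokens[position].lower() == word)


def is_flagged(tokens, position, ignored=("yeah", "gout", "quack")):
    """True if the token at `position` repeats its predecessor and is not ignored."""
    if position == 0 or tokens[position - 1].lower() != tokens[position].lower():
        return False
    # Any ignored word short-circuits the warning, like the else-if chain above.
    return not any(repetition_of(w, tokens, position) for w in ignored)


tokens = ["Gout", "Gout", "won", "the", "the", "race"]
print([i for i in range(len(tokens)) if is_flagged(tokens, i)])
```

In this sketch the second "Gout" is skipped because of the ignore entry, while the accidental "the the" is still reported.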
Original file line number Diff line number Diff line change
@@ -9052,5 +9052,7 @@ nail-biter*
nail-biters*
royalty-free*
Non-Hodgkin*
+coxsackie-virus?
+coxsackie-viruses?
get-togethers*
star-spangled*
Original file line number Diff line number Diff line change
@@ -10970,6 +10970,15 @@ hardcode
hardcoded
DPA
DPAs
+SOW
+SOWs
+XYZ
+xyz
+ABC
+abc
+PPHR
+DoP
+DoPs
intl
decontrol
decontrols
Original file line number Diff line number Diff line change
@@ -557,6 +557,34 @@ biodegradable
biodegradables
Braudelian
outflux
+Transjordan
+overenforcement
+overenforce
+overenforced
+overenforces
+overenforcing
+redecorate
+redecorated
+redecorates
+redecorating
+redecoration
+spätzle
+spaetzle
+echovirus
+echoviruses
+pathogenic
+pathogenicity
+serotype
+serotypes
+enterovirus
+enteroviruses
+enteroviral
+nonpolio
+nonenveloped
+non-enveloped
+coxsackievirus
+coxsackieviruses
+Coxsackie
outfluxes
torr
torrs
Original file line number Diff line number Diff line change
@@ -8016,6 +8016,7 @@ Dickson Fjord NNP
Iron Boulder NNP
prima facie UH
Masoud Pezeshkian NNP
+Al Bashir NNP
o tempora, o mores UH
O tempora! O mores! UH
Daffy Duck NNP
Original file line number Diff line number Diff line change
@@ -403,7 +403,6 @@ zoveel
bovendien
kans
'ik
-website
gezicht
werkt
probleem
@@ -1829,7 +1828,6 @@ bevatten
ingang
wisten
wetenschappelijke
-blog
risico's
alsnog
lengte
@@ -2085,7 +2083,6 @@ wijst
feite
hierop
ongetwijfeld
-app
verdeeld
tientallen
kat
@@ -2484,7 +2481,6 @@ verschijnt
christelijke
ingediend
hoofdstad
-websites
dienstverlening
voorheen
teams
@@ -3091,7 +3087,6 @@ professor
gecontroleerd
bruin
integratie
-apple
jongste
evenementen
raar
@@ -5339,7 +5334,6 @@ voorlichting
kapel
afvragen
zinvol
-apps
peper
voltooid
record