experimental multilingual idea #171
base: main
Some initial comments. It's OK if we don't have other languages' models plugged in, but we should stub out how to swap them, or at least validate that the configured models match the configured language.
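One possible shape for such a stub, as a minimal sketch only: a per-language registry with a lookup that fails loudly when no model is registered. The names `_MODELS_BY_LANGUAGE` and `resolve_model`, and the placeholder model name, are hypothetical and not part of LangKit.

```python
from typing import Dict, Optional

# Hypothetical registry: language code -> encoder name.
# The entry below is a placeholder, not a real model identifier.
_MODELS_BY_LANGUAGE: Dict[str, str] = {"en": "placeholder-english-encoder"}


def resolve_model(language: str, override: Optional[str] = None) -> str:
    """Return the encoder name to use for `language`, or raise if none is known."""
    if override is not None:
        # The caller swaps in their own model and is responsible for the match.
        return override
    if language not in _MODELS_BY_LANGUAGE:
        raise ValueError(
            f"No model registered for language '{language}'; "
            f"supported: {sorted(_MODELS_BY_LANGUAGE)}"
        )
    return _MODELS_BY_LANGUAGE[language]
```

With something like this, `resolve_model("fr")` would raise until a French model is registered or passed explicitly as `override`.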
_transformer_model = Encoder(transformer_name, custom_encoder)
register_dataset_udf(
[_prompt, _response],
f"{language}.{_response}.relevance_to_{_prompt}",
This renaming, which prefixes the language to the metric name, will create a discontinuity with existing integrations and break back-compat.
We shouldn't prefix the localization in the metric name, at least not for the original English-only launch of LangKit. It would be better to put this in metadata, or on the platform side in something like the column entity schema.
Do you want, for example, to track English and French toxicity in the same column?
Maybe we could keep the original name for English, and add the language prefix only for other languages?
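A minimal sketch of that naming rule, assuming the name format from the diff above; the helper `metric_name` and its `default_language` parameter are hypothetical:

```python
def metric_name(
    language: str, response: str, prompt: str, default_language: str = "en"
) -> str:
    """Build the metric name, prefixing the language only for non-default languages."""
    base = f"{response}.relevance_to_{prompt}"
    if language == default_language:
        # Keep the existing name so current integrations are unaffected.
        return base
    return f"{language}.{base}"
```

Under this rule English keeps `response.relevance_to_prompt`, while French would become `fr.response.relevance_to_prompt`.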
@@ -41,6 +39,16 @@ def init(lexicon: Optional[str] = None, config: Optional[LangKitConfig] = None):
_nltk_downloaded = True
I believe the downloaded lexicon is language-specific. We can't just rename the metric and still download the English-based corpus from nltk, right? At the very least we should perform a check and raise an error or log a warning in the many metrics where the existing models don't target languages other than en.
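Such a check could look roughly like the following sketch. The set `_SUPPORTED_LANGUAGES`, the helper `check_language`, and the `strict` flag are hypothetical, and the assumption that only English is supported reflects the concern above rather than a confirmed list.

```python
import logging
from typing import FrozenSet

logger = logging.getLogger(__name__)

# Assumption: the bundled lexicon and models currently target English only.
_SUPPORTED_LANGUAGES: FrozenSet[str] = frozenset({"en"})


def check_language(language: str, metric: str, strict: bool = False) -> bool:
    """Return True if `metric` supports `language`; otherwise warn, or raise if strict."""
    if language in _SUPPORTED_LANGUAGES:
        return True
    message = (
        f"{metric}: underlying model targets {sorted(_SUPPORTED_LANGUAGES)}, "
        f"not '{language}'"
    )
    if strict:
        raise ValueError(message)
    logger.warning(message)
    return False
```

Each module's `init` could call this before downloading a corpus or registering a UDF, and skip registration when it returns False.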
input_output.init(config=config)
text_schema = udf_schema()
def init(languages: List[str] = ["en"], config: Optional[LangKitConfig] = None) -> DeclarativeSchema:
for language in langauges:
typo here? "langauges"
textstat.init(config=config)
def init(languages: List[str] = ["en"], config: Optional[LangKitConfig] = None) -> DeclarativeSchema:
for language in languages:
regexes.init(language, config=config)
Looks like the indentation is wrong here.
Considering that the modules are imported before init is called with the desired languages, does that mean that English will always be applied, and the others will be additional language-specific metrics?
Uses proposed schema chaining 1380 to support a schema per language for each metric module. Multiple languages can be selected when initializing a metric collection. Metrics are prefixed with the language code.