
experimental multilingual idea #171

Draft · wants to merge 1 commit into main

Conversation


@richard-rogers (Contributor) commented Oct 26, 2023

Uses the proposed schema chaining (1380) to support one schema per language for each metric module. Multiple languages can be selected when initializing a metric collection, and metrics are prefixed with the language code.
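A rough sketch of how this might look from the caller's side, based on the init signatures in this diff (the module name and the exact metric names below are illustrative, not part of the PR):

# Proposed: select the languages when initializing the metric collection.
# Each module registers one schema per language and the schemas are chained.
from langkit import llm_metrics  # any metric collection module; name is illustrative

schema = llm_metrics.init(languages=["en", "fr"])

# With the naming in this draft, metric columns carry a language-code prefix, e.g.:
#   en.response.relevance_to_prompt
#   fr.response.relevance_to_prompt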

@jamie256 (Collaborator) left a comment


Some initial comments. It's ok if we don't have models for other languages plugged in yet, but we should stub out how to swap them, or at least validate that they match the configured language.
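A minimal sketch of the kind of stub being suggested, assuming a per-language model registry (the registry, helper name, and model names are illustrative, not part of this diff):

# Map each supported language to its transformer model.
_transformer_names = {
    "en": "all-MiniLM-L6-v2",
    # other languages would plug in their own sentence-transformer models here
}

def _encoder_for(language: str, custom_encoder=None):
    # Fail loudly instead of silently scoring non-English text with an
    # English-only encoder; a custom encoder bypasses the check.
    if language not in _transformer_names and custom_encoder is None:
        raise ValueError(f"no transformer model configured for language '{language}'")
    return Encoder(_transformer_names.get(language), custom_encoder)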

_transformer_model = Encoder(transformer_name, custom_encoder)
register_dataset_udf(
[_prompt, _response],
f"{language}.{_response}.relevance_to_{_prompt}",
Collaborator

This renaming, prefixing the language onto the metric name, will create a discontinuity with existing integrations and break backwards compatibility.

We shouldn't put the localization prefix in the metric name, at least not for the original English-only LangKit launch. It would be better to put this in metadata, or in the platform, something like the column entity schema?

Contributor Author

Do you want, for example, to track English and French toxicity in the same column?

Contributor

Maybe we could keep the original name for English and add the language prefix only for other languages?
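A tiny sketch of that naming rule (the helper name is hypothetical):

def _metric_name(language: str, base_name: str) -> str:
    # English keeps the original, back-compatible name;
    # other languages get the language-code prefix.
    return base_name if language == "en" else f"{language}.{base_name}"

# _metric_name("en", "response.relevance_to_prompt") -> "response.relevance_to_prompt"
# _metric_name("fr", "response.relevance_to_prompt") -> "fr.response.relevance_to_prompt"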

@@ -41,6 +39,16 @@ def init(lexicon: Optional[str] = None, config: Optional[LangKitConfig] = None):
_nltk_downloaded = True
Collaborator

The lexicon being downloaded is, I believe, language specific; we can't just rename the metric but still download the English-based corpus from nltk, right? At the least, we should perform a check and raise an error or log a warning in many of these metrics where the existing models don't target languages other than English.
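One possible shape for that check, assuming this module relies on NLTK's English-only VADER sentiment lexicon (the supported-language set and helper name are assumptions, not part of this diff):

import logging

_logger = logging.getLogger(__name__)

# Languages the downloaded NLTK lexicon actually targets;
# the VADER sentiment lexicon is English-only.
_supported_languages = {"en"}

def _check_language(language: str) -> None:
    if language not in _supported_languages:
        _logger.warning(
            "sentiment lexicon only targets %s; results for '%s' may be unreliable",
            ", ".join(sorted(_supported_languages)),
            language,
        )
        # or raise ValueError(...) if failing fast is preferable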

input_output.init(config=config)
text_schema = udf_schema()
def init(languages: List[str] = ["en"], config: Optional[LangKitConfig] = None) -> DeclarativeSchema:
for language in langauges:
Contributor

typo here? "langauges"

textstat.init(config=config)
def init(languages: List[str] = ["en"], config: Optional[LangKitConfig] = None) -> DeclarativeSchema:
for language in languages:
regexes.init(language, config=config)
Contributor

Looks like the indentation is wrong here.

Contributor

Considering that the modules are imported before calling init with the desired languages, does that mean that English will always be applied, and the others will be additional language-specific metrics?
