dl-translate v0.2.0

@xhluca xhluca released this 08 Apr 15:29
6279c4c

Add m2m100 as the new default model to support 100 languages

Added

  • dlt.lang.m2m100 module: Provides variables for over 100 languages and works with auto-complete. Example: dlt.lang.m2m100.ENGLISH.
  • dlt.utils.available_languages, dlt.utils.available_codes: Now supports argument "m2m100"
  • Available languages for each model family
  • Script and template to generate available languages
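
As a minimal sketch of the additions above, the helper below lists the languages and codes of the m2m100 family. The function name m2m100_languages_and_codes is hypothetical, and the sketch assumes dl-translate is installed and that both utility functions return sequences.

```python
def m2m100_languages_and_codes():
    """Return (languages, codes) for the m2m100 model family."""
    # Imported lazily so this sketch can be loaded without dl-translate installed.
    import dl_translate as dlt

    # Both utilities accept the "m2m100" argument as of this release.
    languages = dlt.utils.available_languages("m2m100")
    codes = dlt.utils.available_codes("m2m100")
    return list(languages), list(codes)


if __name__ == "__main__":
    import dl_translate as dlt

    languages, codes = m2m100_languages_and_codes()
    print(len(languages), "languages available")
    # Language variables such as dlt.lang.m2m100.ENGLISH are auto-complete friendly.
    print(dlt.lang.m2m100.ENGLISH)
```

Neither call loads model weights, so this should be cheap to run once the package is installed.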

Changed

  • [BREAKING] dlt.TranslationModel: The initialization function takes a new parameter, model_family, which is either "mbart50" or "m2m100". By default, it is inferred from model_or_path. It must be set explicitly if model_or_path is a path.
  • [BREAKING] Default model changed to m2m100
  • Docs and readme about mbart50 were reframed to take into account the new model
  • dlt.TranslationModel.translate: Improved docstring to be more general.
  • Tests pertaining to m2m100
  • scripts/generate_langs.py: Renamed; languages are now loaded from JSON files
  • docs/index.md: Expand the "Usage" and "Advanced" sections
  • README.md: Add acknowledgement about m2m100, significantly trim "Advanced" section, make "Usage" more concise
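
The model_family change can be illustrated with a small sketch. The helpers translation_model_kwargs and load_model are hypothetical and not part of dl-translate; the rule that only the names "mbart50" and "m2m100" allow the family to be inferred is an assumption made for this example, and the library's own inference may be more permissive.

```python
KNOWN_FAMILIES = ("mbart50", "m2m100")


def translation_model_kwargs(model_or_path="m2m100", model_family=None):
    """Build keyword arguments for dlt.TranslationModel (illustrative only)."""
    if model_family is not None and model_family not in KNOWN_FAMILIES:
        raise ValueError(f"model_family must be one of {KNOWN_FAMILIES}")

    kwargs = {"model_or_path": model_or_path}
    if model_family is not None:
        kwargs["model_family"] = model_family
    elif model_or_path not in KNOWN_FAMILIES:
        # Assumption: anything else is treated as a path, so the family
        # cannot be inferred and has to be given explicitly.
        raise ValueError("model_family must be set when model_or_path is a path")
    return kwargs


def load_model(model_or_path="m2m100", model_family=None):
    """Instantiate the model; this downloads or loads weights, so it is slow."""
    import dl_translate as dlt  # lazy import: requires dl-translate installed

    return dlt.TranslationModel(
        **translation_model_kwargs(model_or_path, model_family)
    )
```

For example, load_model() would use the new m2m100 default, while load_model("./my_mbart50_dir", model_family="mbart50") passes the family explicitly for a local path (the directory name here is made up).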

Fixed

  • dlt.TranslationModel.available_codes() was returning the languages instead of the codes. It now correctly returns the codes.

Removed

  • Output type hints for TranslationModel.get_transformers_model and TranslationModel.get_tokenizer
  • [BREAKING] dlt.TranslationModel.bart_model and dlt.TranslationModel.tokenizer are no longer available to be used directly. Please use dlt.TranslationModel.get_transformers_model and dlt.TranslationModel.get_tokenizer instead.
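
For code that previously read the bart_model and tokenizer attributes, a migration could look like the sketch below. The helper get_components is hypothetical; it only assumes that the object passed in exposes the two getter methods named above.

```python
def get_components(mt):
    """Return (transformers_model, tokenizer) from a dlt.TranslationModel.

    Replaces direct access to mt.bart_model and mt.tokenizer, which are
    no longer available as of this release.
    """
    transformers_model = mt.get_transformers_model()
    tokenizer = mt.get_tokenizer()
    return transformers_model, tokenizer
```

Given an existing instance mt, calling get_components(mt) yields the underlying model and tokenizer for direct use.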