- Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
- `mtdata cache` added. Improves concurrency by supporting multiple recipes
- Added WMT general test 2022 and 2023
- `mtdata-bcp47`: `-p/--pipe` to map codes from stdin -> stdout
- `mtdata-bcp47`: `--script {suppress-default,suppress-all,express}`
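  A minimal sketch of the new pipe mode; the input codes below are only illustrative and not taken from the tool's documentation:

  ```bash
  # Map language codes from stdin to mtdata's normalized form on stdout.
  printf 'en-US\nde\n' | mtdata-bcp47 --pipe --script suppress-default
  ```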
- Uses `pigz` to read and write gzip files by default when pigz is in PATH. `export USE_PIGZ=0` to disable
- Fix: `allenai_nllb.json` is now included in `MANIFEST.in` #137. Also fixed CI: Travis -> GitHub Actions
- Update ELRC datasets #138. Thanks @AlexUmnov
- Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
- Add Flores200 dev and devtests #145. Thanks @ZenBel
- Add support for `mtdata echo <ID>`
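  A sketch of the new command; the dataset ID is borrowed from the stats example later in this changelog, and the exact output format is not shown here:

  ```bash
  # Print the dataset's segments to stdout and peek at the first few lines.
  mtdata echo Statmt-commoncrawl-wmt19-fra-deu | head
  ```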
- Dataset entries only store BibTeX keys and not full citation text
- Creates index cache as a JSON Lines file (WIP towards dataset statistics)
- Simplified index loading
- Simplified compression format handlers. Added support for opening `.bz2` files without creating temp files.
- All resources are moved to the `mtdata/resource` dir, and any new additions to that dir are automatically included in the Python package (failproof for future issues like #137)
New and exciting features:
- Support for adding new datasets at runtime (`mtdata*.py` from run dir). Note: you have to reindex by calling `mtdata -ri list`
- Monolingual datasets support in progress (currently testing)
- Dataset IDs are now `Group-name-version-lang1-lang2` for bitext and `Group-name-version-lang` for monolingual
- `mtdata list` is updated: `mtdata list -l eng-deu` for bitext and `mtdata list -l eng` for monolingual
- Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
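  A short sketch of the listing commands under the new ID scheme (both commands are taken from the entries above):

  ```bash
  mtdata list -l eng-deu   # bitext: IDs of the form Group-name-version-eng-deu
  mtdata list -l eng       # monolingual: IDs of the form Group-name-version-eng
  ```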
- Skipped 0.3.9 because the changes are significant
- CLI arg `--log-level` with default set to `WARNING`
- Progressbar can be disabled from CLI with `--no-pbar`; default is enabled (`--pbar`)
- `mtdata stats --quick` does HTTP HEAD and shows content length; e.g. `mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu`
- `python -m mtdata.scripts.recipe_stats` to read stats from output directory
- Security fix with tar extract | Thanks @TrellixVulnTeam
- Added NLLB datasets prepared by AllenAI | Thanks @AlexUmnov
- Opus and ELRC datasets update | Thanks @ZenBel
- Update ELRC data including EU acts which is used for wmt22 (thanks @kpu)
- fixes and additions for wmt22
- Fixed KECL-JParaCrawl
- added Paracrawl bonus for ukr-eng
- added Yandex rus-eng corpus
- added Yakut sah-eng
- update recipe for wmt22 constrained eval
- Parallel download support: `-j/--n-jobs` argument (with default `4`)
- Add histogram to web search interface (Thanks, @sgowdaks)
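  A sketch of the parallel-download flag; the language pair and dataset ID are reused from elsewhere in this changelog, and the `-o` output flag is an assumption for illustration:

  ```bash
  # Download with 8 parallel jobs instead of the default 4.
  mtdata get -l fra-deu --train Statmt-commoncrawl-wmt19-fra-deu -j 8 -o data
  ```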
- Update OPUS index. Use OPUS API to download all datasets
- A lot of new datasets are added.
- WARNING: Some OPUS IDs are not backward compatible (version number mismatch)
- Fix: JESC dataset language IDs were wrong
- New datasets:
- jpn-eng: add paracrawl v3, and wmt19 TED
- backtranslation datasets for en2ru and ru2en
- Option to set `MTDATA_RECIPES` dir (default is `$PWD`). All files matching the glob `${MTDATA_RECIPES}/mtdata.recipes*.yml` are loaded
- WMT22 recipes added
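  A sketch of pointing mtdata at a custom recipes directory; the path is a placeholder, and `list-recipe` is the subcommand mentioned further down in this changelog:

  ```bash
  export MTDATA_RECIPES=$HOME/mt-recipes    # files must match mtdata.recipes*.yml
  mtdata list-recipe                        # show recipes discovered in that dir
  ```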
- JW300 is disabled #77
- Automatically create references.bib file based on datasets selected
- ELRC datasets updated
- Added docs, separate copy for each version (github pages)
- Dataset search via web interface. Support for regex match
- Added two new datasets Masakane fon-fra
- Improved BCP47 lang ID matching for TMX files: compatibility instead of exact match
- Bug fix: XML reading inside tar: ElementTree complains about TarPath
- `mtdata list` has `-g/--groups` and `-ng/--not-groups` as include/exclude filters on group name (#91)
- `mtdata list` has `-id/--id` flag to print only dataset IDs (#91)
- add WMT21 tests (#90)
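  A sketch combining the new list filters; the group names here are illustrative:

  ```bash
  mtdata list -l eng-deu -g Statmt -id    # only Statmt dataset IDs
  mtdata list -l eng-deu -ng OPUS         # everything except the OPUS group
  ```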
- add ccaligned datasets wmt21 (#89)
- add ParIce datasets (#88)
- add wmt21 en-ha (#87)
- add wmt21 wikititles v3 (#86)
- Add train and test sets from StanfordNLP NMT page (large: en-cs, medium: en-de, small: en-vi) (#84)
- Add support for two URLs for a single dataset (i.e. without zip/tar files)
- Fix: buggy matching of languages `y1==y1`
- Fix: `get` command: ensure train/dev/test datasets are indeed compatible with languages specified in `--langs` args
- Fix: recipes.yml is missing in the pip installed package
- Add Project Anuvaad: 196 datasets belonging to Indian languages
- CLI: `mtdata get` has `--fail / --no-fail` arguments to tell whether or not to crash upon errors
- Add support for recipes; `list-recipe` and `get-recipe` subcommands added
- add support for viewing stats of dataset; words, chars, segs
- FIX url for UN dev and test sets (source was updated so we updated too)
- Multilingual experiment support; ISO 639-3 code `mul` implies multilingual; e.g. mul-eng or eng-mul
- `--dev` accepts multiple datasets and merges them (useful for multilingual experiments)
- tar files are extracted before read (performance improvements)
- setup.py: version and descriptions accessed via regex
Big Changes: BCP-47, data compression
- BCP47: (Language, Script, Region)
  - Our implementation is not strictly BCP-47. We differ on the following:
    - We use ISO 639-3 codes (i.e. three letters) for all languages, whereas BCP-47 uses two letters for some (e.g. `en`) and three letters for many.
    - We use `_` (underscore) to join language, script, region whereas BCP-47 uses `-` (hyphen)
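  An assumed illustration of the two differences listed above; the exact rendering of script and region parts is a guess, only the ISO 639-3 and underscore points come from this changelog:

  ```bash
  # BCP-47 tag      mtdata-style tag
  # en          ->  eng       (ISO 639-3, three letters)
  # en-US       ->  eng_US    (underscore joins the parts)
  ```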
- Dataset IDs (aka `did` in short) are standardized: `<group>-<name>-<version>-<lang1>-<lang2>`. `<group>` can have mixed case, `<name>` has to be lowercase
- CLI interface now accepts `did`s
- `mtdata get --dev <did>` now accepts a single dataset ID; creates `dev.{xxx,yyy}` links at the root of out dir
- `mtdata get --test <did1> ... <did3>` creates `test{1..4}.{xxx,yyy}` links at the root of out dir
- `--compress` option to store compressed datasets under output dir
- `zip` and `tar` files are no longer extracted. We read directly from compressed files without extracting them
- `._lock` files are removed after download job is done
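  A sketch of the standardized-ID workflow described above; the `<did>` placeholders stand for real dataset IDs, and the `-o` output flag is an assumption for illustration:

  ```bash
  # dev.* and test{1..N}.* links are created at the root of the out dir;
  # --compress keeps the prepared data compressed under that dir.
  mtdata get -l fra-deu --dev <did> --test <did1> <did2> --compress -o out-dir
  ```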
- Add JESC, jpn paracrawl, news commentary 15 and 16
- Force unicode encoding; make it work on windows (Issue #71)
- JW300 -> JW300_v1 (tokenized); Added JW300_v1c (raw) (Issue #70)
- Add all Wikititle datasets from lingual tool (Issue #63)
- Progressbar: `enlighten` is used
- `wget` is replaced with `requests`. User-Agent header along with mtdata version is sent in HTTP request headers
- Paracrawl v9 added
- OPUS index updated (crawled on 20210522)
  - new:
    - CCAlignedV1
    - EiTBParCC_v1
    - EuroPat_v2
    - MultiCCAligned_v1.1
    - NewsCommentary_v14
    - WikiMatrix_v1
    - tico19_v20201028
  - updates (replaces old with new):
- Multilingual TMX parsing, add ECDC and EAC -- #39 -- by @kpu
- Removed Global Voices -- now available via OPUS -- #41
- Move all BibTeX to a separate file -- #42
- Add ELRC-Share datasets #43 -- by @kpu
- Fix line count mismatch in some XML formats #45
- Parse BCP47 codes by removing everything after first hyphen #48 -- by @kpu
- Add Khresmoi datasets #53 -- by @kpu
- Optimize index loading by using cache
  - Added `-re | --reindex` CLI flag to force update index cache #54
  - Removed `--cache` CLI argument. Use `export MTDATA=/path/to/cache-dir` instead (which was already supported)
- Add: `DCEP` corpus, 253 language pairs #58 -- by @kpu
- Add: WMT 21 dev sets: eng-hau eng-isl isl-eng hau-eng #36
- New datasets
- New features
- 'mtdata -b' for short outputs and crash on error input
- Fixes and improvements:
- Paracrawl v7 and v7.1 -- 29 new datasets
- Fix swapping issue with TMX format (TILDE corpus); add a testcase for TMX entry
- Add mtdata-iso shell command
- Add "mtdata report" sub command to summarize datasets by language and names
- Add OPUS 100 corpus
- Add all pairs of neulab_tedtalksv1 - train,test,dev -- 4,455 of them
- Add support for cleaning noise: `Entry.is_noise(seg1, seg2)`
  - some basic noise is removed by default from training
- add `__slots__` to Entry class (takes less memory and faster attrib lookup)
- Add all pairs of Wikimatrix -- 1,617 of them
- Add support for specifying `cols` of `.tsv` file
- Add all Europarlv7 sets
- Remove hin-eng `dict` from JoshuaIndianParallelCorpus
- Remove Wikimatrix1 from statmt -- they are moved to separate file
- File locking using portalocker to deal with race conditions when multiple `mtdata get` are invoked in parallel
- Remove language name from local file name -- the same tar file can be used by many languages, and they don't need a copy
- CLI to print version name
- Added KFTT Japanese-English set
- IITB hin-eng datasets
- Fix issue with dataset counting
- Pypi release bug fix: select all nested packages
- add UnitedNations test set
- Add JW300 Corpus
- All languages are internally mapped to 3-letter ISO codes
- 53,000 entries from OPUS are indexed