
Release v0.2: Dataset versioning

@zhiltsov-max zhiltsov-max released this 14 Oct 15:42
7e8615c

This release adds dataset versioning capabilities and significantly changes the command line.
It also improves CLI and API documentation, and extends the transformations library.

A Datumaro project can now contain and manage multiple datasets instead of a single one.
CLI operations can be applied to the whole project or to individual datasets.
Datasets are now modified in place by default. The project layout has changed. To update
an old project to the new version, use datum project migrate.

Added

  • A new installation target: pip install datumaro[default], which should be
    used in most cases. The plain datumaro target is intended for library users (#238)
  • Dataset and project versioning capabilities (Git-like) (#238)
  • [CLI] "dataset revpath" concept in CLI, which allows passing a dataset path together
    with the dataset format in the diff, merge, explain and info CLI commands (#238)
  • [CLI] import, remove, commit, checkout, log, status, info CLI commands (#238)
  • [CLI] patch CLI command to patch one dataset from another (#401)
  • [CLI, API] ProjectLabels transform to change dataset labels for merging etc. (#401, #478)
  • [API] Type annotations and docs for Annotation classes (#493)
  • [formats] Support for custom labels in the KITTI detection format (#481)
  • [formats] Coco*Extractor classes now have an option to preserve label IDs from the
    original annotation file (#453)
  • [formats] Options to control label loading behavior in imagenet_txt import (#434, #489)
  • Telemetry data collection. See the telemetry notice for details (#495)
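
To illustrate the "dataset revpath" entry above, here is a minimal Python sketch of splitting a dataset path from its format. It assumes the two are written as `<path>:<format>`; these notes do not spell out the syntax, so treat that form as an assumption. `DatasetRef` and `split_dataset_ref` are hypothetical names for this illustration and are not part of Datumaro.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    """A dataset path plus an optional format name (hypothetical type)."""
    path: str
    format: Optional[str]


def split_dataset_ref(ref: str) -> DatasetRef:
    """Split '<path>:<format>' into its parts.

    Assumption: the format, when present, follows the last colon.
    This is an illustration, not Datumaro's own parser.
    """
    path, sep, fmt = ref.rpartition(":")
    # No colon, an empty side, or a "format" that looks like a path
    # (for example the tail of a Windows drive path): treat the whole
    # string as a plain path without a format.
    if not sep or not path or not fmt or "/" in fmt or "\\" in fmt:
        return DatasetRef(ref, None)
    return DatasetRef(path, fmt)
```

For example, `split_dataset_ref("data/voc:voc")` yields the path `data/voc` and the format `voc`, while a bare `data/voc` is returned with no format.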

Changed

  • A project can contain and manage multiple datasets instead of a single one.
    CLI operations can be applied to the whole project or to individual datasets.
    Datasets are modified in place by default (#328)
  • [CLI] The import command copies datasets by default. Use add to add datasets without copying (#508)
  • [CLI] Projects use a new file layout that is incompatible with old projects.
    An old project can be updated with datum project migrate (#238)
  • [CLI] diff and ediff are joined into a single diff CLI command (#238)
  • [CLI] CLI help for builtin plugins no longer requires a project (#328)
  • [API] The Project class from datumaro.components is changed completely (#238)
  • [API] Inheriting CliPlugin is not required in plugin classes (#238)
  • [API] Importers do not create Projects anymore and just return a list of
    extractor configurations (#238)
  • [API] Annotation-related classes were moved into a new module,
    datumaro.components.annotation (#439)
  • [API] Rollback utilities were replaced with Scope utilities (#444)
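
Library code that has to run against both older releases and v0.2 may need to cope with the move of annotation classes to datumaro.components.annotation. One possible approach is a small import fallback, sketched below in Python. `import_first` and `ANNOTATION_MODULES` are hypothetical names for this illustration. The second module path, datumaro.components.extractor, is an assumption about where the classes lived before the move, because these notes name only the new location.

```python
import importlib


def import_first(attr, module_names):
    """Return `attr` from the first importable module that defines it."""
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Module (or its parent package) is unavailable: try the next one.
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(
        f"{attr!r} not found in any of: {', '.join(module_names)}"
    )


# New location first (v0.2), then the assumed pre-0.2 location.
ANNOTATION_MODULES = (
    "datumaro.components.annotation",
    "datumaro.components.extractor",  # assumption: previous home of the classes
)
```

With Datumaro installed, `Bbox = import_first("Bbox", ANNOTATION_MODULES)` would resolve the class from whichever module provides it. Code that targets v0.2 and later only can import from datumaro.components.annotation directly.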

Removed

  • [CLI] project merge CLI command (#238)
  • Support for project hierarchies. A project cannot be a source anymore (#238)
  • A project cannot have an independent internal dataset anymore. All project
    data must be stored in the project data sources (#238)
  • datumaro_project format (#238)
  • [API] Unused path field of DatasetItem (#455)

Fixed

  • Deprecation warning in open_images_format.py (#440)
  • lazy_image returning unrelated data sometimes (#409)
  • Invalid call to pycocotools.mask.iou (#450)
  • Importing of Open Images datasets without image data (#463)
  • Return value type in Dataset.is_modified (#401)
  • Incorrect remapping of secondary categories in RemapLabels (#401)
  • VOC dataset patching for classification and segmentation tasks (#478)
  • Exported mask label ids in KITTI segmentation (#481)
  • Missing label for Points read in the LFW format (#494)