Prioritize using PIL to get image size (#1259)
<!-- Contributing guide:
https://github.com/openvinotoolkit/datumaro/blob/develop/CONTRIBUTING.md
-->

### Summary

Accelerate loading of image file-based datasets.

Printing the information of a YOLO dataset for the first time was slow.
After some digging, I found that `datumaro` was reading every image in
the dataset in full just to get its size.

```python
ds = Dataset.import_from("/yolo-ultralytics", "yolo")
print(ds)  # <-- wait a long time
```

```python
    # from class Image
    @property
    def size(self) -> Optional[Tuple[int, int]]:
        """Returns (H, W)"""

        if self._size is None:
            try:
                data = self.data  # <-- load the whole media into memory
            except _image_loading_errors:
                return None
            if data is not None:
                self._size = tuple(map(int, data.shape[:2]))
        return self._size
```

Working interactively with datasets stored on an HDD is slow because of
this. So I added an overriding `size` property to the `ImageFromFile`
class, which first tries to get the image size using `PIL`. The `PIL`
library is about 8 times faster than `OpenCV` at getting the image size.
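
A likely reason for the gap is that a header-only query reads a few
bytes from the file, while `self.data` decodes every pixel. The sketch
below illustrates the idea for the PNG case using only the standard
library. `png_size_from_header` is a hypothetical helper written for
this illustration; it is not part of `PIL`, `imagesize`, or `datumaro`,
and it does not handle any other image format.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size_from_header(path: str) -> Optional[Tuple[int, int]]:
    """Return (H, W) of a PNG by reading only its first 24 bytes.

    Hypothetical helper for illustration. Returns None if the file does
    not start with a PNG signature followed by an IHDR chunk.
    """
    with open(path, "rb") as f:
        head = f.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR",
    # then 4-byte big-endian width and 4-byte big-endian height.
    if len(head) < 24 or head[:8] != PNG_SIGNATURE or head[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", head[16:24])
    return (height, width)
```

The return value follows the `(H, W)` convention of the `size` property
shown above, so the width and height read from the header are swapped
before returning.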

All dataset classes that use the `size` property of `ImageFromFile` can
benefit from this modification.
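
The override is a "cheap path first, full decode as fallback" pattern
with caching. A minimal sketch of that pattern follows; it is not the
`datumaro` implementation. `SizedImage`, `probe`, `decode`, and
`decode_calls` are hypothetical names introduced here, and the two
callables stand in for the header-only query and the full image load.

```python
from typing import Callable, Optional, Tuple

Size = Tuple[int, int]  # (H, W), matching the convention in this PR


class SizedImage:
    """Hypothetical stand-in for an image with a lazily computed size."""

    def __init__(
        self,
        probe: Callable[[], Size],
        decode: Callable[[], Optional[Size]],
    ) -> None:
        self._probe = probe    # cheap header-only query; may raise
        self._decode = decode  # expensive full decode; None on failure
        self._size: Optional[Size] = None
        self.decode_calls = 0  # bookkeeping for this demonstration only

    @property
    def size(self) -> Optional[Size]:
        if self._size is None:
            try:
                self._size = self._probe()
            except Exception:
                # Fast path failed: fall back to the slow path.
                self.decode_calls += 1
                self._size = self._decode()
        return self._size
```

When the probe succeeds, the decode callable is never invoked. When it
raises, the decode result is cached, so repeated reads of `size` do not
decode again unless decoding itself returned `None`.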

### How to test
<!-- Describe the testing procedure for reviewers, if changes are
not fully covered by unit tests or manual testing can be complicated.
-->

### Checklist
<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.
- [ ] I have added integration tests to cover my changes.
- [ ] I have added the description of my changes into
[CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).
- [ ] I have updated the
[documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs)
accordingly

### License

- [x] I submit _my code changes_ under the same [MIT
License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE)
that covers the project.
  Feel free to contact the maintainers if that's a concern.
- [x] I have updated the license header for each file (see an example
below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```

---------

Co-authored-by: Vinnam Kim <[email protected]>
imyhxy and vinnamkim authored Feb 6, 2024
1 parent 784e039 commit 76fc941
Showing 2 changed files with 15 additions and 0 deletions.
1 change: 1 addition & 0 deletions requirements-core.txt
@@ -1,6 +1,7 @@
attrs>=21.3.0
defusedxml>=0.7.0
h5py>=2.10.0
imagesize>=1.4.1
lxml>=4.4.1
matplotlib>=3.3.1
networkx>=2.6
14 changes: 14 additions & 0 deletions src/datumaro/components/media.py
@@ -29,6 +29,7 @@
)

import cv2
import imagesize
import numpy as np

from datumaro.components.crypter import NULL_CRYPTER, Crypter
@@ -330,6 +331,19 @@ def data(self) -> Optional[np.ndarray]:
            self._size = tuple(map(int, data.shape[:2]))
        return data

    @property
    def size(self) -> Optional[Tuple[int, int]]:
        """Returns (H, W)"""

        if self._size is None:
            try:
                width, height = imagesize.get(self.path)
                assert width != -1 and height != -1
                self._size = (height, width)
            except Exception:
                _ = super().size
        return self._size

    def save(
        self,
        fp: Union[str, io.IOBase],