Implement a pytorch dataloader that filters and downloads at run time #39
Related: rom1504/img2dataset#56. I'm thinking of implementing the download+resize part inside img2dataset, since these features are already there. img2dataset would not need to depend on pytorch, since implementing an iterable dataset only requires having a class with an `__iter__` method.
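A minimal sketch of that idea, assuming hypothetical names (`UrlStreamingDataset` and a list of `(url, caption)` pairs as input; this is not the actual img2dataset API):

```python
# Sketch only: the class name and input format are assumptions from this
# thread, not img2dataset's real API. Plain __iter__, no torch import needed.
import io
import urllib.request

from PIL import Image


class UrlStreamingDataset:
    """Downloads and resizes images at run time."""

    def __init__(self, url_caption_pairs, image_size=256):
        self.url_caption_pairs = url_caption_pairs
        self.image_size = image_size

    def __iter__(self):
        for url, caption in self.url_caption_pairs:
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    data = response.read()
                image = Image.open(io.BytesIO(data)).convert("RGB")
                image = image.resize((self.image_size, self.image_size))
                yield image, caption
            except Exception:
                continue  # skip urls that fail to download or decode
```

On the training side, which already depends on torch, this plain iterable can be wrapped in a `torch.utils.data.IterableDataset` subclass whose `__iter__` just delegates to it, so `DataLoader` consumes it in iterable mode.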
The filtering / retrieving-from-an-index part would however make more sense to live here, so clip-retrieval could depend on img2dataset and use its UrlStreamingDataset to provide a FilteredUrlStreamingDataset. Let's hope this can be made to work at the same speed as img2dataset (1300 samples/s).
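A hedged sketch of that composition, reusing the hypothetical `UrlStreamingDataset` above; `index_predicate` stands in for whatever index lookup clip-retrieval would actually do:

```python
# Sketch only: both class names are proposals from this thread, not existing
# APIs; index_predicate is a stand-in for a real index lookup.
class FilteredUrlStreamingDataset:
    def __init__(self, url_caption_pairs, index_predicate, image_size=256):
        # filter first, then delegate the download+resize to img2dataset
        kept = [(u, c) for u, c in url_caption_pairs if index_predicate(u, c)]
        self.inner = UrlStreamingDataset(kept, image_size=image_size)

    def __iter__(self):
        return iter(self.inner)
```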
Could be interesting to investigate this path.
The img2dataset service could also expose a shard endpoint that takes as input some url and caption files and turns them into shard files. Then all that is needed is an orchestrator with a metadata database that makes sure all the shards are properly done; a hedged sketch of such an endpoint follows below. Benefits:

To check:
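A hedged sketch of what such a shard endpoint could look like; the route, payload shape, output layout, and choice of Flask are all assumptions, not img2dataset's actual service API:

```python
# Sketch only: everything here (route, payload, paths) is hypothetical.
import io
import tarfile
import urllib.request

from flask import Flask, jsonify, request

app = Flask(__name__)


def make_shard(entries, output_path):
    """Download each url and pack image bytes + caption into one tar shard."""
    with tarfile.open(output_path, "w") as tar:
        for i, entry in enumerate(entries):
            try:
                with urllib.request.urlopen(entry["url"], timeout=10) as r:
                    data = r.read()
            except Exception:
                continue  # failed downloads are skipped, not retried here
            image_info = tarfile.TarInfo(name=f"{i:06d}.jpg")
            image_info.size = len(data)
            tar.addfile(image_info, io.BytesIO(data))
            caption = entry["caption"].encode()
            caption_info = tarfile.TarInfo(name=f"{i:06d}.txt")
            caption_info.size = len(caption)
            tar.addfile(caption_info, io.BytesIO(caption))


@app.route("/shard", methods=["POST"])
def shard():
    payload = request.get_json()
    output_path = f"/data/shards/{payload['shard_id']}.tar"
    make_shard(payload["entries"], output_path)
    # the orchestrator would record this completion in its metadata database
    return jsonify({"shard_id": payload["shard_id"], "path": output_path})
```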
New idea: rethink all these tools as dataflow/stream transformers, taking as input a collection and producing an output collection, with optional caching and backpressure:
- reader: produces a collection
- writer: consumes a collection
- transformer: consumes a collection and produces a new one

These bricks could then be naturally composed to form downloaders, inferences and indexers (a sketch follows below). That means defining good interfaces for each subtool, then making each tool a separate package, well tested and with good examples. Check if https://docarray.jina.ai/fundamentals/documentarray/ could be helpful to build this. This new structure should make it possible to make all these tools both more powerful and more reusable.
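A hedged sketch of these bricks as plain Python generators; all names and file paths are illustrative, and laziness is what provides the backpressure (a slow consumer simply pulls items more slowly):

```python
from typing import Iterable, Iterator


def reader(paths: Iterable[str]) -> Iterator[str]:
    # reader: produces a collection (here, lines from text files)
    for path in paths:
        with open(path) as f:
            yield from f


def transformer(items: Iterable[str]) -> Iterator[str]:
    # transformer: consumes a collection and produces a new one
    for item in items:
        yield item.strip().lower()


def writer(items: Iterable[str], output_path: str) -> None:
    # writer: consumes a collection and materializes it
    with open(output_path, "w") as f:
        for item in items:
            f.write(item + "\n")


# composition: a downloader / inference / indexer is just a chain of bricks
writer(transformer(reader(["a.txt", "b.txt"])), "out.txt")
```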
Let's first try and check how to read a large file in parallel with fsspec.
Reading a large file with fsspec works by seeking and then reading up to a length; it's much faster.
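An example of that range-read pattern: each worker opens the remote file, seeks to its offset, and reads a fixed length. The url and range sizes below are placeholders:

```python
# fsspec file objects support seek/read; the path here is a placeholder and
# reading s3:// urls additionally requires s3fs to be installed.
from concurrent.futures import ThreadPoolExecutor

import fsspec

URL = "s3://some-bucket/embeddings.npy"  # placeholder path


def read_range(offset, length):
    with fsspec.open(URL, "rb") as f:
        f.seek(offset)
        return f.read(length)


# read four 100 MB pieces of the file concurrently
ranges = [(i * 100_000_000, 100_000_000) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    pieces = list(pool.map(lambda r: read_range(*r), ranges))
```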
Next step will be implementing a clean embedding-reader package.
Independently, I think that https://towardsdatascience.com/data-pipelines-with-apache-beam-86cd8eb55fd8 looks good.
This is an online version of #31.
Combine the whole pipeline not as a big batch job, but instead as a data loader that filters and downloads at run time.

It makes sense in particular when the model training speed is low; dalle, for example, is such a model. For clip it could make less sense.

If it works, it could be a lot more convenient than downloading TB of webdataset.