Change pull_paths_from_storage to use gitattributes style filters #3518

Open
Panaetius opened this issue Jun 9, 2023 · 0 comments

If a workflow creates thousands of files in an output directory, the gitattributes filter might look like data/raw-data/** filter=lfs diff=lfs merge=lfs -text.
When a subsequent run uses this directory as input, we call pull_paths_from_storage with ALL the files in the directory, which in turn calls git lfs pull on them in batches of 100. This can take a long time (1 hour for 100k files on my machine). But git lfs pull accepts include filters, so calling git lfs pull -I data/raw-data/** instead would take only 0.17 seconds.

For directory inputs, we should call pull with just the directory.
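
A rough sketch of what this could look like, not the actual renku code: the helper names build_include_patterns and build_lfs_pull_command are made up for illustration, and it assumes we already know which inputs are directories. It collapses every file under a directory input into a single dir/** pattern and passes the patterns to git lfs pull via --include (the long form of -I), which takes a comma-separated list.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def build_include_patterns(paths: Iterable[str], directories: Iterable[str]) -> List[str]:
    """Collapse paths under known directory inputs into 'dir/**' patterns.

    Hypothetical helper: paths outside any listed directory are kept as-is.
    """
    dirs = [PurePosixPath(d) for d in directories]
    patterns: List[str] = []
    seen = set()
    for raw in paths:
        path = PurePosixPath(raw)
        # Find a directory input that is the path itself or one of its ancestors.
        parent = next((d for d in dirs if d == path or d in path.parents), None)
        pattern = f"{parent}/**" if parent is not None else str(path)
        if pattern not in seen:
            seen.add(pattern)
            patterns.append(pattern)
    return patterns


def build_lfs_pull_command(patterns: List[str]) -> List[str]:
    """Build one 'git lfs pull' invocation covering all patterns.

    Assumes no pattern contains a comma, since --include is comma-separated.
    """
    return ["git", "lfs", "pull", "--include", ",".join(patterns)]


if __name__ == "__main__":
    files = ["data/raw-data/a.txt", "data/raw-data/sub/b.txt", "other/c.txt"]
    print(build_lfs_pull_command(build_include_patterns(files, ["data/raw-data"])))
```

Passing the command as an argument list (e.g. to subprocess.run) also avoids the shell expanding ** before git lfs sees it.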

For testing, you can use the dataset at https://ai.stanford.edu/~amaas/data/sentiment/. Use renku run to extract the archive, then use something like cat extracted/aclImdb/train/* > output in a subsequent workflow.
