Change `pull_paths_from_storage` to use gitattributes style filters #3518
If we have a workflow that creates thousands of files in an output directory, the gitattributes filter might look like `data/raw-data/** filter=lfs diff=lfs merge=lfs -text`.

When a subsequent run uses this directory as input, we call `pull_paths_from_storage` with ALL the files in the directory, which in turn calls `git lfs pull` on them in batches of 100. That can take a pretty long time (1 hour for 100k files on my machine, which is about 1,000 separate batches). But `git lfs pull` does accept filters, so just calling `git lfs pull -I data/raw-data/**` instead would only take 0.17 seconds.

For directory inputs, we should call pull with just the directory.
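
A minimal sketch of the idea, not the actual renku implementation: each directory input is collapsed into a single `dir/**` include pattern, and all patterns go into one `git lfs pull` call. The helper names `lfs_include_patterns` and `lfs_pull_command` are hypothetical, and the sketch assumes the paths are absolute, located under the repository root, and free of commas.

```python
from pathlib import Path
from typing import Iterable, List


def lfs_include_patterns(paths: Iterable[Path], repo_root: Path) -> List[str]:
    """Turn input paths into include patterns for `git lfs pull`.

    Hypothetical helper: a directory becomes one `dir/**` pattern instead of
    being expanded into its files; a plain file is passed through unchanged.
    """
    patterns = []
    for path in paths:
        relative = path.relative_to(repo_root).as_posix()
        patterns.append(f"{relative}/**" if path.is_dir() else relative)
    return patterns


def lfs_pull_command(patterns: List[str]) -> List[str]:
    """Build a single `git lfs pull` invocation covering all patterns."""
    # `--include` (short form `-I`) takes a comma-separated list of patterns,
    # so this sketch does not handle paths that themselves contain commas.
    return ["git", "lfs", "pull", "--include", ",".join(patterns)]
```

The resulting list could then be passed to something like `subprocess.run(command, cwd=repo_root, check=True)`, so a directory with 100k files costs one `git lfs pull` call rather than one call per batch of 100 files.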

For testing, you can use https://ai.stanford.edu/~amaas/data/sentiment/. Use `renku run` to extract the archive and then use something like `cat extracted/aclImdb/train/* > output` in a subsequent workflow.