
Potentially shifting to large files (git-lfs) and adding processed datasets to .gitignore #27

Closed
Niklewa opened this issue Oct 20, 2023 · 2 comments


Niklewa commented Oct 20, 2023

Some possibilities on these data files that might help us scale to much larger datasets while keeping things centralized in git (at least for this more "researchy" repo):

1. Store raw data files with git-lfs; this enables some configuration/control around downloading potentially large files. The current raw files aren't that big, but I imagine we'll be stepping into some larger ones soon.

2. Potentially add the processed directory to .gitignore, and provide instructions in the readme for running the cleaning pipeline and generating the processed data files. My reasoning here is that this will a) reduce repo size and b) ensure that new clones always have up-to-date cleaned files (i.e. forgetting to run the cleaning pipeline after changing it won't be an issue for new users).
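The two suggestions above could be sketched as repo configuration. This is a minimal sketch assuming raw data lives under `data/raw/` and the cleaning pipeline writes to `data/processed/`; both paths are assumptions, not taken from this repo:

```
# .gitattributes: route raw data files through git-lfs
data/raw/** filter=lfs diff=lfs merge=lfs -text

# .gitignore: keep generated pipeline outputs out of the repo
data/processed/
```

Running `git lfs track "data/raw/**"` writes the `.gitattributes` line automatically; new clones would then run the cleaning pipeline (per the readme instructions) to regenerate `data/processed/` locally.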

Anyway, just ideas; these are my preferences but we definitely don't have to use them (though GitHub-hosted repos do have a maximum size, I don't recall what it is).

Originally posted by @azane in #14 (comment)


emackev commented Nov 2, 2023

Related to #59

rfl-urbaniak commented

no longer relevant
