
Potentially shifting to large files (git-lfs) and adding processed datasets to .gitignore #27

Closed
Niklewa opened this issue Oct 20, 2023 · 2 comments


Niklewa commented Oct 20, 2023

Some possibilities on these data files that might help us scale to much larger datasets while keeping things centralized in git (at least for this more "researchy" repo):

1. Store raw data files with git-lfs; this enables some configuration/control around downloading potentially large files. The current raw files aren't that big, but I imagine we'll be stepping into some larger ones soon.

2. Potentially add the processed directory to .gitignore, and provide instructions in the readme for running the cleaning pipeline and generating the processed data files. My reasoning here is that this will a) reduce repo size and b) ensure that new clones always have up-to-date cleaned files (i.e. forgetting to run the cleaning pipeline after changing it won't be an issue for new users).
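The two suggestions above could be sketched as repo configuration. This is a minimal sketch assuming raw data lives under `data/raw/` and the cleaning pipeline writes to `data/processed/`; both paths are assumptions, not taken from this repo:

```
# .gitattributes: route raw data files through git-lfs
data/raw/** filter=lfs diff=lfs merge=lfs -text

# .gitignore: keep generated pipeline outputs out of the repo
data/processed/
```

Running `git lfs track "data/raw/**"` writes the `.gitattributes` line automatically; new clones would then run the cleaning pipeline (per the readme instructions) to regenerate `data/processed/` locally.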

Anyway, just ideas; these are my preferences but we definitely don't have to use them (though GitHub-hosted repos do have a maximum size, I don't recall what it is).

Originally posted by @azane in #14 (comment)


emackev commented Nov 2, 2023

Related to #59

rfl-urbaniak commented

no longer relevant
