Implementation of LakeFS #14
Draft: jameshod5 wants to merge 34 commits into `main` from `james/lakefs`
Conversation
EDIT: I've since taken a different approach to the versioning, one that I think is more in line with the existing codebase and more efficient.
What I had originally done was run the LakeFS versioning after either the local or S3 ingestion workflow completed. The problem was that this left the shots in local storage so that LakeFS could version all of them at once. Obviously, that quickly becomes a storage problem once we have hundreds of shots.
As LakeFS already uploads new/changed data to the S3 storage you point it to, I have now turned the S3 ingestion workflow into a LakeFS workflow. This means we no longer have two S3 storage endpoints (why would we!); instead we just use the LakeFS S3 endpoint as our S3 storage.
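For concreteness, here is a minimal sketch (not code from this PR) of what pointing an S3 client at the LakeFS S3 gateway could look like. The endpoint, repository name, paths and keys below are all placeholders; the gateway treats the repository as the bucket and prefixes the object key with the branch name.

```python
import boto3  # assumes boto3 is available

# Placeholder endpoint and credentials, not the project's real configuration.
s3 = boto3.client(
    "s3",
    endpoint_url="http://lakefs-host:8000",   # LakeFS S3 gateway
    aws_access_key_id="LAKEFS_ACCESS_KEY",
    aws_secret_access_key="LAKEFS_SECRET_KEY",
)

# The repository acts as the bucket; the key is "<branch>/<path>".
s3.upload_file("data/local/shot_30421.nc", "shots-repo", "main/shot_30421.nc")
```

Because the gateway speaks plain S3, the existing upload code only needs a different endpoint and key layout rather than a new client library.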
Moving to this approach makes the overall code much easier to run, as the LakeFS workflow replaces the "--upload" argument that was previously used for S3 ingestion. When you now use "--upload" and provide the name of an existing repository, the workflow first creates a unique branch, then creates each shot file locally. Once a shot file is created, it is committed to the branch and deleted from local storage. Once all shots are done, the branch is merged into main. Local ingestion works exactly the same (but with no versioning or deleting).
This change means that if we do a huge ingestion of thousands of shots, we are not storing thousands of shots locally before uploading them to LakeFS and then removing them all at once. Instead, we remove each shot just after it is created and uploaded, freeing up local space for the next one.
TODO: This still needs to work when submitting the ingestion as a job. The current problem is that each MPI process cannot reach the LakeFS server through SSH tunnels, as we need a public IP. It does work, however, for a simple run from the command line.
At the moment, I have a LakeFS server and a Postgres server running on the STFC machine. Then, on the machine where you want to run your ingestion/versioning code, run:
lakectl config
and input the access key and secret key that the LakeFS UI setup gave us earlier. This allows that machine to access the LakeFS server.
The main changes to the actual code base are a new LakeFS workflow (src/lake_fs.py) and moving the cleanup process that was previously part of the ingestion workflow into this LakeFS workflow, to take advantage of the fact that the ingestion files are written to local disk during the whole process. The idea is to run the ingestion workflow as normal, either locally or to S3, and then run the LakeFS versioning code when we want to version. This uploads all of the data written to 'data/local' (for example; it can be anywhere we want) and then removes those files, similar to the cleanup() process that used to be part of the ingestion workflow itself.
The reasoning for separating LakeFS from the ingestion workflow is the MPI processes we use to speed up ingestion. I tried to work around this by letting only one MPI rank do the versioning task, but found it cumbersome. Keeping it separate also allows us to pick and choose when we want to version our data. But perhaps there is a better way around this.
Output:
<------->
LakeFS: