
Implementation of LakeFS #14

Draft · jameshod5 wants to merge 34 commits into main
Conversation

@jameshod5 (Collaborator) commented on Sep 10, 2024

EDIT: I've since taken a different approach to the versioning, which I think is more in line with the existing codebase and more efficient.

Originally, the LakeFS versioning ran after either the local or the S3 ingestion workflow had completed. The problem was that all of the shots had to stay in local storage so that LakeFS could version them in one go, which quickly becomes a storage problem once we have hundreds of shots.

Since LakeFS already uploads new/changed data to the S3 storage it is pointed at, I have now turned the S3 ingestion workflow into a LakeFS workflow. This means we no longer have two S3 storage endpoints (why would we!); instead we simply use the LakeFS S3 endpoint as our S3 storage.
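To illustrate that point, anything that already speaks S3 can simply be pointed at the LakeFS S3 gateway: the repository plays the role of the bucket and the branch is the first component of the object key. A rough sketch with boto3, where the endpoint, keys, repository, branch and file names are placeholders rather than the real ones:

```python
import boto3

# Placeholders: endpoint, keys, repository, branch and file names are examples only.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",        # LakeFS endpoint (e.g. via the SSH tunnel)
    aws_access_key_id="LAKEFS_ACCESS_KEY",
    aws_secret_access_key="LAKEFS_SECRET_KEY",
)

# Through the LakeFS S3 gateway the repository acts as the bucket and the
# branch is the first part of the key: s3://<repo>/<branch>/<path>.
s3.upload_file("data/local/example_shot_file", "ingestion-repo", "main/example_shot_file")

# Listing works the same way, scoped to a branch.
resp = s3.list_objects_v2(Bucket="ingestion-repo", Prefix="main/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```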

Moving to this approach makes the overall code much easier to run, as the LakeFS workflow replaces the "--upload" argument that was previously used for S3 ingestion. When you now pass "--upload" with the name of an existing repository, the workflow first creates a unique branch and then creates each shot file locally. Once a shot file is created, it is committed to the branch and deleted from local storage. Once all shots are done, the branch is merged into main. Local ingestion works exactly the same, but with no versioning or deleting.

This change means that if we do a huge ingestion of thousands of shots, we are not storing thousands of shots locally BEFORE uploading them to LakeFS and then removing them all at once. Instead, we remove the shot just after it is created and uploaded, freeing up local space for the next shot.
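To make the per-shot flow concrete, here is a rough sketch of that loop. It assumes the high-level lakefs Python SDK ("pip install lakefs") and uses placeholder names throughout (the create_shot_file() helper, repository name, endpoint and credentials are all hypothetical), so it shows the shape of the workflow rather than the actual src/lake_fs.py implementation:

```python
import os
import uuid

import lakefs
from lakefs.client import Client


def create_shot_file(shot_id, output_dir="data/local"):
    """Hypothetical stand-in for the existing ingestion step: writes one
    shot file to local disk and returns its path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"shot_{shot_id}.dat")
    with open(path, "wb") as f:
        f.write(b"...shot data...")
    return path


def ingest_and_version(shot_ids, repo_name="ingestion-repo", client=None):
    repo = lakefs.Repository(repo_name, client=client)

    # One unique branch for the whole ingestion run.
    branch = repo.branch(f"ingest-{uuid.uuid4().hex[:8]}").create(source_reference="main")

    for shot_id in shot_ids:
        local_path = create_shot_file(shot_id)

        # Upload the shot to the branch, commit it, then free the local space.
        with open(local_path, "rb") as f:
            branch.object(os.path.basename(local_path)).upload(data=f.read(), mode="wb")
        branch.commit(message=f"Ingest shot {shot_id}")
        os.remove(local_path)

    # All shots done: merge the ingestion branch back into main.
    branch.merge_into(repo.branch("main"))


# Placeholder endpoint and credentials (e.g. the SSH tunnel and the keys
# shown by the LakeFS setup UI).
client = Client(host="http://localhost:8000",
                username="LAKEFS_ACCESS_KEY",
                password="LAKEFS_SECRET_KEY")
ingest_and_version([30001, 30002, 30003], client=client)
```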

TODO: this still needs to work when submitting the ingestion as a job. The current problem is that each MPI process cannot reach the LakeFS server through the SSH tunnel, as we would need a public IP. It does, however, work when running a simple ingestion from the command line.

At the moment, I have a LakeFS server and Postgres server running on the STFC machine. Then, on the machine you want to run your ingestion/versioning code:

  • Set up an SSH tunnel into the STFC machine. With this, we can access the LakeFS setup UI when we first run the LakeFS server, which gives us the access key and secret key for the server; keep these somewhere safe. Create a repository through the UI and point it at the S3 storage we want to use, then import all of the existing data from that S3 storage through the big green "Import" button in the UI.
  • (Optional) If you want to create a new LakeFS repo that points to an S3 object store that already has a repo associated with it, you will first need to remove the "/data" and "_lakefs" directories from that S3 store, e.g. using s5cmd.
  • Install lakectl (download the .tar with wget and un-tar it) so that we can run lakectl config and enter the access key and secret key that the LakeFS setup UI gave us earlier. This will allow our machine to access the LakeFS server; a quick connectivity check is sketched below.
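Once the tunnel is up (e.g. "ssh -L 8000:localhost:8000 user@stfc-machine") and lakectl is configured, a quick way to confirm that the machine can actually reach the LakeFS server is to list the repositories over the REST API. A minimal sketch, assuming the tunnel forwards LakeFS to localhost:8000 and using placeholder keys:

```python
import requests
from requests.auth import HTTPBasicAuth

# Placeholders: the tunnel is assumed to forward the LakeFS server to
# localhost:8000, and the keys are the ones from the LakeFS setup UI.
LAKEFS_ENDPOINT = "http://localhost:8000"
ACCESS_KEY = "LAKEFS_ACCESS_KEY"
SECRET_KEY = "LAKEFS_SECRET_KEY"

# The LakeFS REST API accepts basic auth with the access key / secret key pair.
resp = requests.get(
    f"{LAKEFS_ENDPOINT}/api/v1/repositories",
    auth=HTTPBasicAuth(ACCESS_KEY, SECRET_KEY),
    timeout=10,
)
resp.raise_for_status()

for repo in resp.json().get("results", []):
    print(repo["id"], "->", repo["storage_namespace"])
```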

The main changes to the actual code base are a new LakeFS workflow (src/lake_fs.py) and moving the cleanup process, previously part of the ingestion workflow, into this LakeFS workflow, taking advantage of the fact that the ingestion files are written to local disk throughout the process.

The idea is to run the ingestion workflow as normal, either locally or to S3, and then run the LakeFS versioning code whenever we want to version. This uploads all of the data written to 'data/local' (for example; it can be anywhere we want) and then removes those files, similar to the cleanup() process that used to be part of the ingestion workflow itself.
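A standalone versioning pass of that kind might look roughly like the sketch below, again assuming the high-level lakefs Python SDK and placeholder names; the real src/lake_fs.py may differ:

```python
import os

import lakefs
from lakefs.client import Client


def version_and_cleanup(local_root="data/local",
                        repo_name="ingestion-repo",
                        branch_name="main",
                        client=None):
    """Upload everything under local_root to a LakeFS branch, commit it,
    then delete the local copies (mirroring the old cleanup() step)."""
    repo = lakefs.Repository(repo_name, client=client)
    branch = repo.branch(branch_name)

    uploaded = []
    for dirpath, _, filenames in os.walk(local_root):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            key = os.path.relpath(local_path, local_root).replace(os.sep, "/")
            with open(local_path, "rb") as f:
                branch.object(key).upload(data=f.read(), mode="wb")
            uploaded.append(local_path)

    if uploaded:
        branch.commit(message=f"Version {len(uploaded)} files from {local_root}")
        for path in uploaded:
            os.remove(path)


# Placeholder endpoint and credentials, as in the earlier sketches.
client = Client(host="http://localhost:8000",
                username="LAKEFS_ACCESS_KEY",
                password="LAKEFS_SECRET_KEY")
version_and_cleanup(client=client)
```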

The reason for separating LakeFS from the ingestion workflow is the MPI processes we use to speed up ingestion. I tried to work around this by letting only one MPI rank do the versioning task, but found it cumbersome. Keeping it separate also lets us pick and choose when we want to version our data, but perhaps there is a better way around this.
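For context, gating the versioning on a single rank would look roughly like the following mpi4py sketch (an illustration of the workaround discussed above, not code from this PR):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# ... every rank ingests its share of the shots here ...

comm.Barrier()          # wait until all ranks have written their shots

if rank == 0:
    # Only rank 0 talks to the LakeFS server (e.g. running the standalone
    # version_and_cleanup() pass sketched above).
    print("rank 0 would run the LakeFS versioning step here")

comm.Barrier()          # keep the other ranks in step until versioning is done
```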

Output:

(screenshots omitted)

LakeFS:

(screenshot omitted)

@jameshod5 added the enhancement (New feature or request) label on Sep 10, 2024