
Implementation of LakeFS #14

Draft · jameshod5 wants to merge 34 commits into main
Conversation

@jameshod5 (Collaborator) commented on Sep 10, 2024

EDIT: I've since taken a different approach to the versioning, which I think is more in line with the existing codebase and more efficient.

Originally, the LakeFS versioning ran after either the local or the S3 ingestion workflow had completed. The problem was that all of the shots had to stay in local storage so that LakeFS could version them in one go, which quickly becomes a storage problem once we have hundreds of shots.

Since LakeFS already uploads new/changed data to the S3 storage it is pointed at, I have now turned the S3 ingestion workflow into a LakeFS workflow. This means we no longer have two S3 storage endpoints (why would we!); instead we simply use the LakeFS S3 endpoint as our S3 storage.
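To illustrate that point, anything that already speaks S3 can simply be pointed at the LakeFS S3 gateway: the repository plays the role of the bucket and the branch is the first component of the object key. A rough sketch with boto3, where the endpoint, keys, repository, branch and file names are placeholders rather than the real ones:

```python
import boto3

# Placeholders: endpoint, keys, repository, branch and file names are examples only.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",        # LakeFS endpoint (e.g. via the SSH tunnel)
    aws_access_key_id="LAKEFS_ACCESS_KEY",
    aws_secret_access_key="LAKEFS_SECRET_KEY",
)

# Through the LakeFS S3 gateway the repository acts as the bucket and the
# branch is the first part of the key: s3://<repo>/<branch>/<path>.
s3.upload_file("data/local/example_shot_file", "ingestion-repo", "main/example_shot_file")

# Listing works the same way, scoped to a branch.
resp = s3.list_objects_v2(Bucket="ingestion-repo", Prefix="main/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```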

Moving to this approach makes the overall code much easier to run, as the LakeFS workflow replaces the "--upload" argument that was previously used for S3 ingestion. When you now pass "--upload" with the name of an existing repository, the workflow first creates a unique branch and then creates each shot file locally. Once a shot file is created, it is committed to the branch and deleted from local storage. Once all shots are done, the branch is merged into main. Local ingestion works exactly the same, but with no versioning or deleting.

This change means that if we do a huge ingestion of thousands of shots, we are not storing thousands of shots locally BEFORE uploading them to LakeFS and then removing them all at once. Instead, we remove the shot just after it is created and uploaded, freeing up local space for the next shot.
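To make the per-shot flow concrete, here is a rough sketch of that loop. It assumes the high-level lakefs Python SDK ("pip install lakefs") and uses placeholder names throughout (the create_shot_file() helper, repository name, endpoint and credentials are all hypothetical), so it shows the shape of the workflow rather than the actual src/lake_fs.py implementation:

```python
import os
import uuid

import lakefs
from lakefs.client import Client


def create_shot_file(shot_id, output_dir="data/local"):
    """Hypothetical stand-in for the existing ingestion step: writes one
    shot file to local disk and returns its path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"shot_{shot_id}.dat")
    with open(path, "wb") as f:
        f.write(b"...shot data...")
    return path


def ingest_and_version(shot_ids, repo_name="ingestion-repo", client=None):
    repo = lakefs.Repository(repo_name, client=client)

    # One unique branch for the whole ingestion run.
    branch = repo.branch(f"ingest-{uuid.uuid4().hex[:8]}").create(source_reference="main")

    for shot_id in shot_ids:
        local_path = create_shot_file(shot_id)

        # Upload the shot to the branch, commit it, then free the local space.
        with open(local_path, "rb") as f:
            branch.object(os.path.basename(local_path)).upload(data=f.read(), mode="wb")
        branch.commit(message=f"Ingest shot {shot_id}")
        os.remove(local_path)

    # All shots done: merge the ingestion branch back into main.
    branch.merge_into(repo.branch("main"))


# Placeholder endpoint and credentials (e.g. the SSH tunnel and the keys
# shown by the LakeFS setup UI).
client = Client(host="http://localhost:8000",
                username="LAKEFS_ACCESS_KEY",
                password="LAKEFS_SECRET_KEY")
ingest_and_version([30001, 30002, 30003], client=client)
```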

TODO: this still needs to work when submitting the ingestion as a job. The current problem is that each MPI process cannot reach the LakeFS server through the SSH tunnel, as we would need a public IP. It does, however, work when running a simple ingestion from the command line.

At the moment, I have a LakeFS server and Postgres server running on the STFC machine. Then, on the machine you want to run your ingestion/versioning code:

  • Set up an SSH tunnel into the STFC machine. With this, we can access the LakeFS setup UI when we first run the LakeFS server, which gives us the access key and secret key for the server; keep these somewhere safe. Create a repository through the UI and point it at the S3 storage we want to use, then import all of the existing data from that S3 storage through the big green "Import" button in the UI.
  • (Optional) If you want to create a new LakeFS repo that points to an S3 object store that already has a repo associated with it, you will first need to remove the "/data" and "_lakefs" directories from that S3 store, e.g. using s5cmd.
  • Install lakectl (download the .tar with wget and un-tar it) so that we can run lakectl config and enter the access key and secret key that the LakeFS setup UI gave us earlier. This will allow our machine to access the LakeFS server; a quick connectivity check is sketched below.
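Once the tunnel is up (e.g. "ssh -L 8000:localhost:8000 user@stfc-machine") and lakectl is configured, a quick way to confirm that the machine can actually reach the LakeFS server is to list the repositories over the REST API. A minimal sketch, assuming the tunnel forwards LakeFS to localhost:8000 and using placeholder keys:

```python
import requests
from requests.auth import HTTPBasicAuth

# Placeholders: the tunnel is assumed to forward the LakeFS server to
# localhost:8000, and the keys are the ones from the LakeFS setup UI.
LAKEFS_ENDPOINT = "http://localhost:8000"
ACCESS_KEY = "LAKEFS_ACCESS_KEY"
SECRET_KEY = "LAKEFS_SECRET_KEY"

# The LakeFS REST API accepts basic auth with the access key / secret key pair.
resp = requests.get(
    f"{LAKEFS_ENDPOINT}/api/v1/repositories",
    auth=HTTPBasicAuth(ACCESS_KEY, SECRET_KEY),
    timeout=10,
)
resp.raise_for_status()

for repo in resp.json().get("results", []):
    print(repo["id"], "->", repo["storage_namespace"])
```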

The main changes to the actual code base are a new LakeFS workflow (src/lake_fs.py) and moving the cleanup process, previously part of the ingestion workflow, into this LakeFS workflow, taking advantage of the fact that the ingestion files are written to local disk throughout the process.

The idea is to run the ingestion workflow as normal, either locally or to S3, and then run the LakeFS versioning code whenever we want to version. This uploads all of the data written to 'data/local' (for example; it can be anywhere we want) and then removes those files, similar to the cleanup() process that used to be part of the ingestion workflow itself.
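A standalone versioning pass of that kind might look roughly like the sketch below, again assuming the high-level lakefs Python SDK and placeholder names; the real src/lake_fs.py may differ:

```python
import os

import lakefs
from lakefs.client import Client


def version_and_cleanup(local_root="data/local",
                        repo_name="ingestion-repo",
                        branch_name="main",
                        client=None):
    """Upload everything under local_root to a LakeFS branch, commit it,
    then delete the local copies (mirroring the old cleanup() step)."""
    repo = lakefs.Repository(repo_name, client=client)
    branch = repo.branch(branch_name)

    uploaded = []
    for dirpath, _, filenames in os.walk(local_root):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            key = os.path.relpath(local_path, local_root).replace(os.sep, "/")
            with open(local_path, "rb") as f:
                branch.object(key).upload(data=f.read(), mode="wb")
            uploaded.append(local_path)

    if uploaded:
        branch.commit(message=f"Version {len(uploaded)} files from {local_root}")
        for path in uploaded:
            os.remove(path)


# Placeholder endpoint and credentials, as in the earlier sketches.
client = Client(host="http://localhost:8000",
                username="LAKEFS_ACCESS_KEY",
                password="LAKEFS_SECRET_KEY")
version_and_cleanup(client=client)
```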

The reason for separating LakeFS from the ingestion workflow is the MPI processes we use to speed up ingestion. I tried to work around this by letting only one MPI rank do the versioning task, but found it cumbersome. Keeping it separate also lets us pick and choose when we want to version our data, but perhaps there is a better way around this.
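For context, gating the versioning on a single rank would look roughly like the following mpi4py sketch (an illustration of the workaround discussed above, not code from this PR):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# ... every rank ingests its share of the shots here ...

comm.Barrier()          # wait until all ranks have written their shots

if rank == 0:
    # Only rank 0 talks to the LakeFS server (e.g. running the standalone
    # version_and_cleanup() pass sketched above).
    print("rank 0 would run the LakeFS versioning step here")

comm.Barrier()          # keep the other ranks in step until versioning is done
```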

Output:

(screenshots omitted)

LakeFS:

(screenshot omitted)

@jameshod5 added the enhancement (New feature or request) label on Sep 10, 2024