Start by ⭐️ starring the lakeFS open source project.
Data engineers typically don't develop against production data because of concerns about PII, time, and scale. Instead, they develop ETL jobs on a subset of the data and promote code via Git. The challenge is that this code isn't tested against production data until after promotion, which can lead to issues because the subset used in development differs from the production environment.
- Git Action Integration: lakeFS can use a Git action during pull requests to import production data to lakeFS
- Isolated Production Copy: Every promotion creates an isolated copy of the production data using a zero-copy import to lakeFS
- Safe Testing: Code runs against this isolated copy, allowing testing in a production-like environment
- Safety Net: If anything fails, the code isn't promoted
This demo shows how lakeFS uses Git actions to perform a zero-copy import of production data for ETL promotions in both Python and Scala. This allows testing against production-like data and only promotes successful code changes. In this demo, we’ll see how changing an ETL job can trigger a validation error through a Git action.
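To make the zero-copy import concrete, here is a rough sketch of the lakeFS side of that flow using the high-level lakefs Python SDK (the same lakefs package the workflow installs on the cluster later in this demo). The repository, branch, and path names below are illustrative placeholders mirroring the example values used in the setup steps, not the demo's actual notebook code:

```python
import lakefs
from lakefs.client import Client

# Illustrative endpoint and credentials -- in the demo these come from GitHub
# variables and Databricks secrets, not hard-coded values.
clt = Client(
    host="https://company.region.lakefscloud.io",
    username="AKIAIOSFOLKFSSAMPLES",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

repo = lakefs.repository("databricks-ci-cd-repo", client=clt)

# Create an isolated branch for the pull request -- branching is zero-copy.
pr_branch = repo.branch("pr-42-etl-test").create("main")

# Zero-copy import: only metadata for the production objects is committed to
# the branch; the objects themselves are never copied.
importer = pr_branch.import_data(commit_message="Import production data for PR 42") \
    .prefix("s3://data-source/delta-tables/", destination="delta-tables/")
importer.run()  # waits for the import to finish and commit

# The ETL job under test then reads and writes against the isolated branch;
# if validation fails, production data and the main branch remain untouched.
```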
- lakeFS installed and running on a server or in the cloud. If you don't already have lakeFS running, either use lakeFS Cloud, which provides a free lakeFS server on demand with a single click, or refer to the lakeFS Quickstart doc.
- A Databricks workspace with the ability to run compute clusters.
- Your Databricks cluster configured to use the lakeFS Hadoop file system. Read the blog Databricks and lakeFS Integration: Step-by-Step Configuration Tutorial or the lakeFS documentation for the configuration; a rough sketch of the required settings follows this list.
- Permissions to manage the cluster configuration, including adding libraries.
- A GitHub account.
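The demo applies these settings in the cluster's Spark config (the exact keys, prefixed with spark.hadoop., appear in the new-cluster-json example near the end of this page). As a minimal sketch, assuming a Databricks notebook where spark is predefined and using placeholder endpoint and keys, the same configuration can also be applied at runtime:

```python
# Sketch only: the demo sets these at the cluster level; applying them at
# runtime from a notebook is an alternative. Endpoint and keys are placeholders.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
hconf.set("fs.lakefs.access.mode", "presigned")
hconf.set("fs.lakefs.endpoint", "https://company.region.lakefscloud.io/api/v1")
hconf.set("fs.lakefs.access.key", "<lakeFS access key>")
hconf.set("fs.lakefs.secret.key", "<lakeFS secret key>")
```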
1. Create a Databricks personal access token.

2. Create a Databricks secret scope, e.g. demos, or use an existing secret scope. Add the following secrets in that secret scope by following the Secret management docs:

   lakefs_access_key_id e.g. 'AKIAIOSFOLKFSSAMPLES'
   lakefs_secret_access_key e.g. 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

   You can use the following Databricks CLI commands to create the secrets:

   ```shell
   databricks secrets put-secret --json '{ "scope": "demos", "key": "lakefs_access_key_id", "string_value": "AKIAIOSFOLKFSSAMPLES" }'
   databricks secrets put-secret --json '{ "scope": "demos", "key": "lakefs_secret_access_key", "string_value": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" }'
   ```
3. Create a Git repository. It can be named lakeFS-samples-ci-cd.

4. Clone this repository:

   ```shell
   git clone https://github.com/treeverse/lakeFS-samples && cd lakeFS-samples/01_standalone_examples/databricks-ci-cd
   ```

5. Create the folders .github/workflows and databricks-notebooks in your Git repo.

6. Upload the pr_commit_run_databricks_etl_job.yml file from the lakeFS-samples/01_standalone_examples/databricks-ci-cd/.github/workflows folder to the .github/workflows folder in your Git repo.

7. Upload all files from the lakeFS-samples/01_standalone_examples/databricks-ci-cd/databricks-notebooks folder to the databricks-notebooks folder in your Git repo.
8. Add the following secret to your Git repo by following the Creating secrets for a repository docs. This is the Databricks personal access token created in step 1 above. If you copy & paste the secret name, verify that there are no spaces before or after it.

   DATABRICKS_TOKEN
9. Add the following variables to your Git repo by following the Creating configuration variables for a repository docs:

   - Variable to store your Databricks host name or URL, e.g. https://cust-success.cloud.databricks.com:

     DATABRICKS_HOST

   - Variable to store your Databricks cluster ID, e.g. 1115-164516-often242:

     DATABRICKS_CLUSTER_ID

   - Variable to store your Databricks workspace folder path, e.g. /Shared/lakefs_demos/ci_cd_demo or /Users/[email protected]/MyFolder/lakefs_demos/ci_cd_demo:

     DATABRICKS_WORKSPACE_NOTEBOOK_PATH

   - Variable to store the Databricks secret scope created in step 2, e.g. demos:

     DATABRICKS_SECRET_SCOPE

   - Variable to store your lakeFS endpoint, e.g. https://company.region.lakefscloud.io:

     LAKEFS_END_POINT

   - Variable to store your lakeFS repository name (which will be created by this demo; see the sketch after these setup steps), e.g. databricks-ci-cd-repo:

     LAKFES_REPO_NAME

   - Variable to store the storage namespace for the lakeFS repo, i.e. the location in the underlying storage where data for the lakeFS repository will be stored, e.g. s3://example:

     LAKEFS_REPO_STORAGE_NAMESPACE

   - Variable to store the storage namespace where Delta tables created by this demo will be stored, e.g. s3://data-source/delta-tables. Do NOT use the same storage namespace as above. If one doesn't exist already, create a Databricks External Location for the s3://data-source URL; you need READ FILES and WRITE FILES permissions on that External Location.

     DATA_SOURCE_STORAGE_NAMESPACE
10. Create a new branch in your Git repository and select the newly created branch.
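As a rough sketch of how the lakeFS repository named above can be created from these values, assuming a Databricks notebook where dbutils is available, the secret scope demos from step 2, and the high-level lakefs Python SDK (the demo's actual logic lives in the notebooks under databricks-notebooks):

```python
import lakefs
from lakefs.client import Client

# Example values matching the variables configured above; in the demo they are
# passed to the notebook as parameters by the GitHub Action.
lakefs_end_point = "https://company.region.lakefscloud.io"  # LAKEFS_END_POINT
repo_name = "databricks-ci-cd-repo"                         # LAKFES_REPO_NAME
storage_namespace = "s3://example"                          # LAKEFS_REPO_STORAGE_NAMESPACE

# Credentials are read from the Databricks secret scope created in step 2.
clt = Client(
    host=lakefs_end_point,
    username=dbutils.secrets.get("demos", "lakefs_access_key_id"),
    password=dbutils.secrets.get("demos", "lakefs_secret_access_key"),
)

# Create the repository if it doesn't exist yet; its data lives under the repo
# storage namespace, separate from DATA_SOURCE_STORAGE_NAMESPACE.
repo = lakefs.repository(repo_name, client=clt).create(
    storage_namespace=storage_namespace, exist_ok=True
)
```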
You just safely promoted ETL jobs using lakeFS and Git actions: by creating a branch, modifying code, and running validation checks, you ensured that changes are tested in an isolated, production-like environment before promotion.
For a Scala ETL job demo, go to the "scala-demo" section. Below are useful references for this demo and for Git actions.
- Databricks Continuous integration and delivery using GitHub Actions.
- Information on how to run Databricks notebooks from a GitHub Action.
- See action.yml for the latest interface and docs for databricks/run-notebook.
- Databricks REST API reference.
- GitHub Events that trigger workflows.
- GitHub Webhook events and payloads.
- GitHub Payloads for Pull Request.
- Documentation on the GitHub Action that uploads a file to Amazon S3.
- Documentation on the GitHub Action that uploads a file to Databricks DBFS.
- Code to run the Action workflow only if a file changes in a specific folder, e.g. databricks-notebooks, so changing the README file, which is outside the databricks-notebooks folder, will not run the workflow:

  ```yaml
  name: Run Databricks ETL jobs in an isolated environment by using lakeFS
  on:
    pull_request:
      paths:
        - 'databricks-notebooks/**'
  ```
- Upload a file to S3, e.g. upload a Scala JAR file:

  ```yaml
  - name: Upload JAR file to S3
    uses: hkusu/s3-upload-action@v2
    id: upload_file_to_s3
    with:
      aws-access-key-id: ${{secrets.AWS_ACCESS_KEY}}
      aws-secret-access-key: ${{secrets.AWS_SECRET_KEY}}
      aws-region: ${{ vars.AWS_REGION }}
      aws-bucket: ${{ vars.AWS_BUCKET_FOR_JARS }}
      bucket-root: ${{ vars.AWS_BUCKET_ROOT_FOLDER_FOR_JARS }}
      destination-dir: "jars/pr-${{ github.event.number }}"
      file-path: "${{ env.LOCAL_NOTEBOOK_PATH }}/scala_etl_jobs/target/scala-2.12/etl_jobs-assembly-0.1.0-SNAPSHOT.jar"
      output-file-url: 'true'

  - name: Print JAR file location on S3
    run: |
      echo "JAR location on S3: ${{ steps.upload_file_to_s3.outputs.file-url }}"
  ```
- When creating a new Databricks cluster, install the JAR file from S3:

  ```yaml
  libraries-json: >
    [
      { "jar": "s3://${{ vars.AWS_BUCKET_FOR_JARS }}/${{ vars.AWS_BUCKET_ROOT_FOLDER_FOR_JARS }}/jars/pr-${{ github.event.number }}/etl_jobs-assembly-0.1.0-SNAPSHOT.jar" }
    ]
  ```
- Upload a file to Databricks DBFS, e.g. upload a Scala JAR file:

  ```yaml
  - name: Upload JAR file to DBFS
    uses: databricks/upload-dbfs-temp@v0
    with:
      local-path: ${{ env.LOCAL_NOTEBOOK_PATH }}/scala_etl_jobs/target/scala-2.12/etl_jobs-assembly-0.1.0-SNAPSHOT.jar
    id: upload_file_to_dbfs

  - name: Print JAR file location on DBFS
    run: |
      echo "JAR location on DBFS: ${{ steps.upload_file_to_dbfs.outputs.dbfs-file-path }}"
  ```
- When creating a new Databricks cluster, install the JAR file from Databricks DBFS:

  ```yaml
  libraries-json: >
    [
      { "jar": "${{ steps.upload_file_to_dbfs.outputs.dbfs-file-path }}" }
    ]
  ```
- Code to create a new Databricks cluster while triggering a notebook, and to install libraries on the new cluster:

  ```yaml
  - name: Trigger Databricks Scala ETL Job
    uses: databricks/run-notebook@v0
    id: trigger_databricks_notebook_scala_etl_job
    with:
      run-name: "GitHub Action - PR ${{ github.event.number }} - Scala ETL Job"
      local-notebook-path: "./scala_etl_jobs/Run Scala ETL Job.py"
      notebook-params-json: >
        {
          "environment": "dev",
          "data_source_storage_namespace": "${{ vars.DATA_SOURCE_STORAGE_NAMESPACE }}",
          "lakefs_end_point": "${{ vars.LAKEFS_END_POINT }}",
          "lakefs_repo": "${{ vars.LAKFES_REPO_NAME }}",
          "lakefs_branch": "${{ env.LAKFES_BRANCH_NAME }}"
        }
      new-cluster-json: >
        {
          "num_workers": 1,
          "spark_version": "14.3.x-scala2.12",
          "node_type_id": "m5d.large",
          "spark_conf": {
            "spark.hadoop.fs.lakefs.access.mode": "presigned",
            "spark.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
            "spark.hadoop.fs.lakefs.endpoint": "${{ vars.LAKEFS_END_POINT }}/api/v1",
            "spark.hadoop.fs.lakefs.access.key": "${{secrets.LAKEFS_ACCESS_KEY}}",
            "spark.hadoop.fs.lakefs.secret.key": "${{secrets.LAKEFS_SECRET_KEY}}",
            "spark.hadoop.fs.s3a.access.key": "${{secrets.AWS_ACCESS_KEY}}",
            "spark.hadoop.fs.s3a.secret.key": "${{secrets.AWS_SECRET_KEY}}"
          }
        }
      libraries-json: >
        [
          { "jar": "s3://${{ vars.AWS_BUCKET_FOR_JARS }}/${{ vars.AWS_BUCKET_ROOT_FOLDER_FOR_JARS }}/jars/pr-${{ github.event.number }}/etl_jobs-assembly-0.1.0-SNAPSHOT.jar" },
          { "maven": {"coordinates": "io.lakefs:hadoop-lakefs-assembly:0.2.4"} },
          { "pypi": {"package": "lakefs==0.6.0"} }
        ]
    outputs: >
      run-url >> "$GITHUB_OUTPUT"
  ```
- Code to check out a single folder from the repo instead of the full repo:

  ```yaml
  # Checkout project code
  # Use sparse checkout to only select files in a directory
  # Turning off cone mode ensures that files in the project root are not included during checkout
  - name: Checks out the repo
    uses: actions/checkout@v4
    with:
      sparse-checkout: 'scala_etl_jobs/src'
      sparse-checkout-cone-mode: false
  ```
- Get a list of branches in the Git repo and store it in a GitHub multi-line environment variable:

  ```yaml
  - name: Get branch list
    run: |
      {
        echo 'PR_BRANCHES<<EOF'
        git log -${{ env.PR_FETCH_DEPTH }} --pretty=format:'%H'
        echo ''
        echo 'EOF'
      } >> $GITHUB_ENV
  ```
- Get the Git branch name:

  ```yaml
  - name: Extract branch name
    shell: bash
    run: echo "branch=${GITHUB_HEAD_REF:-${GITHUB_REF#refs/heads/}}" >> $GITHUB_OUTPUT
    id: extract_branch
  ```
- Get the current date & timestamp:

  ```yaml
  - name: Get current date
    id: date
    run: echo "date=$(date +'%Y-%m-%d-%H-%M-%S')" >> "$GITHUB_OUTPUT"
  ```