Start by ⭐️ starring the lakeFS open source project.
Data engineers typically don't develop against production data because of concerns about PII, time, and scale. Instead, they develop ETL jobs on a subset of the data and promote code via Git. The challenge is that this code isn't tested against production data until after promotion, which can lead to issues because the subset used in development differs from the production environment.
- Git Action Integration: lakeFS can use a Git action during pull requests to import production data to lakeFS
- Isolated Production Copy: Every promotion creates an isolated copy of the production data using a zero-copy import to lakeFS
- Safe Testing: Code runs against this isolated copy, allowing testing in a production-like environment
- Safety Net: If anything fails, the code isn't promoted
This demo shows how lakeFS uses Git actions to perform a zero-copy import of production data for ETL promotions in both Python and Scala. This allows testing against production-like data and only promotes successful code changes. In this demo, we’ll see how changing an ETL job can trigger a validation error through a Git action.
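To make the zero-copy import concrete, here is a rough sketch of the lakeFS side of that flow using the high-level lakefs Python SDK (the same lakefs package the workflow installs on the cluster later in this demo). The repository, branch, and path names below are illustrative placeholders mirroring the example values used in the setup steps, not the demo's actual notebook code:

```python
import lakefs
from lakefs.client import Client

# Illustrative endpoint and credentials -- in the demo these come from GitHub
# variables and Databricks secrets, not hard-coded values.
clt = Client(
    host="https://company.region.lakefscloud.io",
    username="AKIAIOSFOLKFSSAMPLES",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

repo = lakefs.repository("databricks-ci-cd-repo", client=clt)

# Create an isolated branch for the pull request -- branching is zero-copy.
pr_branch = repo.branch("pr-42-etl-test").create("main")

# Zero-copy import: only metadata for the production objects is committed to
# the branch; the objects themselves are never copied.
importer = pr_branch.import_data(commit_message="Import production data for PR 42") \
    .prefix("s3://data-source/delta-tables/", destination="delta-tables/")
importer.run()  # waits for the import to finish and commit

# The ETL job under test then reads and writes against the isolated branch;
# if validation fails, production data and the main branch remain untouched.
```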
- lakeFS installed and running on a server or in the cloud. If you don't already have lakeFS running, either use lakeFS Cloud, which provides a free lakeFS server on demand with a single click, or refer to the lakeFS Quickstart doc.
- A Databricks workspace with the ability to run compute clusters.
- Your Databricks cluster configured to use the lakeFS Hadoop file system. Read the blog Databricks and lakeFS Integration: Step-by-Step Configuration Tutorial or the lakeFS documentation for the configuration; a rough sketch of the required settings follows this list.
- Permissions to manage the cluster configuration, including adding libraries.
- A GitHub account.
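The demo applies these settings in the cluster's Spark config (the exact keys, prefixed with spark.hadoop., appear in the new-cluster-json example near the end of this page). As a minimal sketch, assuming a Databricks notebook where spark is predefined and using placeholder endpoint and keys, the same configuration can also be applied at runtime:

```python
# Sketch only: the demo sets these at the cluster level; applying them at
# runtime from a notebook is an alternative. Endpoint and keys are placeholders.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
hconf.set("fs.lakefs.access.mode", "presigned")
hconf.set("fs.lakefs.endpoint", "https://company.region.lakefscloud.io/api/v1")
hconf.set("fs.lakefs.access.key", "<lakeFS access key>")
hconf.set("fs.lakefs.secret.key", "<lakeFS secret key>")
```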
1. Create a Databricks personal access token.

2. Create a Databricks secret scope, e.g. demos, or use an existing secret scope. Add the following secrets in that secret scope by following the Secret management docs:

   lakefs_access_key_id e.g. 'AKIAIOSFOLKFSSAMPLES'
   lakefs_secret_access_key e.g. 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

   You can use the following Databricks CLI commands to create the secrets:

   ```shell
   databricks secrets put-secret --json '{ "scope": "demos", "key": "lakefs_access_key_id", "string_value": "AKIAIOSFOLKFSSAMPLES" }'
   databricks secrets put-secret --json '{ "scope": "demos", "key": "lakefs_secret_access_key", "string_value": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" }'
   ```
3. Create a Git repository. It can be named lakeFS-samples-ci-cd.

4. Clone this repository:

   ```shell
   git clone https://github.com/treeverse/lakeFS-samples && cd lakeFS-samples/01_standalone_examples/databricks-ci-cd
   ```

5. Create the folders .github/workflows and databricks-notebooks in your Git repo.

6. Upload the pr_commit_run_databricks_etl_job.yml file from the lakeFS-samples/01_standalone_examples/databricks-ci-cd/.github/workflows folder to the .github/workflows folder in your Git repo.

7. Upload all files from the lakeFS-samples/01_standalone_examples/databricks-ci-cd/databricks-notebooks folder to the databricks-notebooks folder in your Git repo.
8. Add the following secret to your Git repo by following the Creating secrets for a repository docs. This is the Databricks personal access token created in step 1 above. If you copy & paste the secret name, verify that there are no spaces before or after it.

   DATABRICKS_TOKEN
9. Add the following variables to your Git repo by following the Creating configuration variables for a repository docs:

   - Variable to store your Databricks host name or URL, e.g. https://cust-success.cloud.databricks.com:

     DATABRICKS_HOST

   - Variable to store your Databricks cluster ID, e.g. 1115-164516-often242:

     DATABRICKS_CLUSTER_ID

   - Variable to store your Databricks workspace folder path, e.g. /Shared/lakefs_demos/ci_cd_demo or /Users/[email protected]/MyFolder/lakefs_demos/ci_cd_demo:

     DATABRICKS_WORKSPACE_NOTEBOOK_PATH

   - Variable to store the Databricks secret scope created in step 2, e.g. demos:

     DATABRICKS_SECRET_SCOPE

   - Variable to store your lakeFS endpoint, e.g. https://company.region.lakefscloud.io:

     LAKEFS_END_POINT

   - Variable to store your lakeFS repository name (which will be created by this demo; see the sketch after these setup steps), e.g. databricks-ci-cd-repo:

     LAKFES_REPO_NAME

   - Variable to store the storage namespace for the lakeFS repo, i.e. the location in the underlying storage where data for the lakeFS repository will be stored, e.g. s3://example:

     LAKEFS_REPO_STORAGE_NAMESPACE

   - Variable to store the storage namespace where Delta tables created by this demo will be stored, e.g. s3://data-source/delta-tables. Do NOT use the same storage namespace as above. If one doesn't exist already, create a Databricks External Location for the s3://data-source URL; you need READ FILES and WRITE FILES permissions on that External Location.

     DATA_SOURCE_STORAGE_NAMESPACE
10. Create a new branch in your Git repository and select the newly created branch.
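As a rough sketch of how the lakeFS repository named above can be created from these values, assuming a Databricks notebook where dbutils is available, the secret scope demos from step 2, and the high-level lakefs Python SDK (the demo's actual logic lives in the notebooks under databricks-notebooks):

```python
import lakefs
from lakefs.client import Client

# Example values matching the variables configured above; in the demo they are
# passed to the notebook as parameters by the GitHub Action.
lakefs_end_point = "https://company.region.lakefscloud.io"  # LAKEFS_END_POINT
repo_name = "databricks-ci-cd-repo"                         # LAKFES_REPO_NAME
storage_namespace = "s3://example"                          # LAKEFS_REPO_STORAGE_NAMESPACE

# Credentials are read from the Databricks secret scope created in step 2.
clt = Client(
    host=lakefs_end_point,
    username=dbutils.secrets.get("demos", "lakefs_access_key_id"),
    password=dbutils.secrets.get("demos", "lakefs_secret_access_key"),
)

# Create the repository if it doesn't exist yet; its data lives under the repo
# storage namespace, separate from DATA_SOURCE_STORAGE_NAMESPACE.
repo = lakefs.repository(repo_name, client=clt).create(
    storage_namespace=storage_namespace, exist_ok=True
)
```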
You just safely promoted ETL jobs using lakeFS and Git actions: by creating a branch, modifying code, and running validation checks, you ensured that changes are tested in an isolated, production-like environment before promotion.
For a Scala ETL job demo, go to the "scala-demo" section. Below are useful references for this demo and for Git actions.
- Databricks Continuous integration and delivery using GitHub Actions.
- Information on how to run Databricks notebooks from a GitHub Action.
- See action.yml for the latest interface and docs for databricks/run-notebook.
- Databricks REST API reference.
- GitHub Events that trigger workflows.
- GitHub Webhook events and payloads.
- GitHub Payloads for Pull Request.
- Documentation on the GitHub Action that uploads a file to Amazon S3.
- Documentation on the GitHub Action that uploads a file to Databricks DBFS.
- Code to run the Action workflow only if a file changes in a specific folder, e.g. databricks-notebooks, so changing the README file, which is outside the databricks-notebooks folder, will not run the workflow:

  ```yaml
  name: Run Databricks ETL jobs in an isolated environment by using lakeFS
  on:
    pull_request:
      paths:
        - 'databricks-notebooks/**'
  ```
- Upload a file to S3, e.g. upload a Scala JAR file:

  ```yaml
  - name: Upload JAR file to S3
    uses: hkusu/s3-upload-action@v2
    id: upload_file_to_s3
    with:
      aws-access-key-id: ${{secrets.AWS_ACCESS_KEY}}
      aws-secret-access-key: ${{secrets.AWS_SECRET_KEY}}
      aws-region: ${{ vars.AWS_REGION }}
      aws-bucket: ${{ vars.AWS_BUCKET_FOR_JARS }}
      bucket-root: ${{ vars.AWS_BUCKET_ROOT_FOLDER_FOR_JARS }}
      destination-dir: "jars/pr-${{ github.event.number }}"
      file-path: "${{ env.LOCAL_NOTEBOOK_PATH }}/scala_etl_jobs/target/scala-2.12/etl_jobs-assembly-0.1.0-SNAPSHOT.jar"
      output-file-url: 'true'

  - name: Print JAR file location on S3
    run: |
      echo "JAR location on S3: ${{ steps.upload_file_to_s3.outputs.file-url }}"
  ```
- When creating a new Databricks cluster, install the JAR file from S3:

  ```yaml
  libraries-json: >
    [
      { "jar": "s3://${{ vars.AWS_BUCKET_FOR_JARS }}/${{ vars.AWS_BUCKET_ROOT_FOLDER_FOR_JARS }}/jars/pr-${{ github.event.number }}/etl_jobs-assembly-0.1.0-SNAPSHOT.jar" }
    ]
  ```
- Upload a file to Databricks DBFS, e.g. upload a Scala JAR file:

  ```yaml
  - name: Upload JAR file to DBFS
    uses: databricks/upload-dbfs-temp@v0
    with:
      local-path: ${{ env.LOCAL_NOTEBOOK_PATH }}/scala_etl_jobs/target/scala-2.12/etl_jobs-assembly-0.1.0-SNAPSHOT.jar
    id: upload_file_to_dbfs

  - name: Print JAR file location on DBFS
    run: |
      echo "JAR location on DBFS: ${{ steps.upload_file_to_dbfs.outputs.dbfs-file-path }}"
  ```
- When creating a new Databricks cluster, install the JAR file from Databricks DBFS:

  ```yaml
  libraries-json: >
    [
      { "jar": "${{ steps.upload_file_to_dbfs.outputs.dbfs-file-path }}" }
    ]
  ```
- Code to create a new Databricks cluster while triggering a notebook, and to install libraries on the new cluster:

  ```yaml
  - name: Trigger Databricks Scala ETL Job
    uses: databricks/run-notebook@v0
    id: trigger_databricks_notebook_scala_etl_job
    with:
      run-name: "GitHub Action - PR ${{ github.event.number }} - Scala ETL Job"
      local-notebook-path: "./scala_etl_jobs/Run Scala ETL Job.py"
      notebook-params-json: >
        {
          "environment": "dev",
          "data_source_storage_namespace": "${{ vars.DATA_SOURCE_STORAGE_NAMESPACE }}",
          "lakefs_end_point": "${{ vars.LAKEFS_END_POINT }}",
          "lakefs_repo": "${{ vars.LAKFES_REPO_NAME }}",
          "lakefs_branch": "${{ env.LAKFES_BRANCH_NAME }}"
        }
      new-cluster-json: >
        {
          "num_workers": 1,
          "spark_version": "14.3.x-scala2.12",
          "node_type_id": "m5d.large",
          "spark_conf": {
            "spark.hadoop.fs.lakefs.access.mode": "presigned",
            "spark.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
            "spark.hadoop.fs.lakefs.endpoint": "${{ vars.LAKEFS_END_POINT }}/api/v1",
            "spark.hadoop.fs.lakefs.access.key": "${{secrets.LAKEFS_ACCESS_KEY}}",
            "spark.hadoop.fs.lakefs.secret.key": "${{secrets.LAKEFS_SECRET_KEY}}",
            "spark.hadoop.fs.s3a.access.key": "${{secrets.AWS_ACCESS_KEY}}",
            "spark.hadoop.fs.s3a.secret.key": "${{secrets.AWS_SECRET_KEY}}"
          }
        }
      libraries-json: >
        [
          { "jar": "s3://${{ vars.AWS_BUCKET_FOR_JARS }}/${{ vars.AWS_BUCKET_ROOT_FOLDER_FOR_JARS }}/jars/pr-${{ github.event.number }}/etl_jobs-assembly-0.1.0-SNAPSHOT.jar" },
          { "maven": {"coordinates": "io.lakefs:hadoop-lakefs-assembly:0.2.4"} },
          { "pypi": {"package": "lakefs==0.6.0"} }
        ]
    outputs: >
      run-url >> "$GITHUB_OUTPUT"
  ```
- Code to check out a single folder from the repo instead of the full repo:

  ```yaml
  # Checkout project code
  # Use sparse checkout to only select files in a directory
  # Turning off cone mode ensures that files in the project root are not included during checkout
  - name: Checks out the repo
    uses: actions/checkout@v4
    with:
      sparse-checkout: 'scala_etl_jobs/src'
      sparse-checkout-cone-mode: false
  ```
- Get a list of branches in the Git repo and store it in a GitHub multi-line environment variable:

  ```yaml
  - name: Get branch list
    run: |
      {
        echo 'PR_BRANCHES<<EOF'
        git log -${{ env.PR_FETCH_DEPTH }} --pretty=format:'%H'
        echo ''
        echo 'EOF'
      } >> $GITHUB_ENV
  ```
- Get the Git branch name:

  ```yaml
  - name: Extract branch name
    shell: bash
    run: echo "branch=${GITHUB_HEAD_REF:-${GITHUB_REF#refs/heads/}}" >> $GITHUB_OUTPUT
    id: extract_branch
  ```
- Get the current date & timestamp:

  ```yaml
  - name: Get current date
    id: date
    run: echo "date=$(date +'%Y-%m-%d-%H-%M-%S')" >> "$GITHUB_OUTPUT"
  ```