Ansible Databricks

An Ansible Galaxy role for managing Databricks resources and configuration, making it easy to keep mission-critical items under source control. It wraps the Databricks CLI and attempts to apply idempotency to most configurable components.

Prerequisites

  • A Databricks organization account on AWS or Azure
  • A Databricks user account within that organization
  • Ansible >= 2.6
  • A Databricks personal access token

Using in your Ansible playbook

  • Install in your Ansible repo: ansible-galaxy install colemanja91.ansible-databricks
  • Example playbook:
---
- hosts:
    - localhost
  vars_files:
    - "my/secret/file.yml"
    - "my/ansible/variables.yml"
  roles:
    - { role: colemanja91.ansible-databricks }
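
Alternatively, the role can be pinned in a requirements file and installed with ansible-galaxy install -r requirements.yml. This is a standard Ansible Galaxy convention, not something this role mandates; a minimal sketch:

# requirements.yml
- src: colemanja91.ansible-databricks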

Tasks

CLI installation and setup

  • By default, attempts to install the Databricks CLI via pip
  • Sets up the CLI configuration file
  • Expects either the Ansible variable databricks_token or the environment variable DATABRICKS_TOKEN to be defined
    • Each Ansible user should define the environment variable at the system level, so that they run under their own account with the proper permissions
    • The Ansible variable should be used only with a shared Databricks account (not recommended); see the sketch after this list
  • These setup tasks run automatically on every execution of the role
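
If you do take the shared-account route, a minimal sketch of a vars file (the path matches the example playbook above; the token value is a placeholder, not a real token):

# my/secret/file.yml -- keep this encrypted, e.g. with ansible-vault encrypt
databricks_token: "dapi0123456789abcdef"  # placeholder value only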

DBFS mounts

Run only this task with:

ansible-playbook databricks.yml -t dbfs

  • The variable databricks_dbfs is used to configure this task:
databricks_dbfs:
  - s3_path: "s3a://my-s3-bucket-name"
    dbfs_mount: "/mnt/my-dbfs-mount"

Databricks Secrets

  • The variable databricks_secrets is used to configure this task:
databricks_secrets:
  - scope: "my_secret_scope"
    key: "my_secret_name"
    value: "{{ my_secret_variable }}"

Libraries

  • NOTE: currently only libraries used by Databricks Jobs are supported
  • Support for interactive cluster libraries is TBD
  • Uploads the target file from the local file system to the given DBFS path
  • The variable databricks_libraries is used to configure this task:
databricks_libraries:
  - src: "../path/to/my/jar.jar"
    dbfs: "dbfs:/target/path/to/my/jar.jar"

Jobs

  • The variable databricks_jobs is used to configure this task:
databricks_jobs:
  - name: "my_job"
    notebook_task:
      notebook_path: "/User/Jeremy/my_notebook"
    new_cluster:
      autoscale:
        min_workers: 2
        max_workers: 4
      spark_version: "4.3.x-scala2.11"
      node_type_id: "r4.2xlarge"
      aws_attributes:
        first_on_demand: 0
        availability: ON_DEMAND
        zone_id: "{{ aws_zone }}"
        instance_profile_arn: "{{ aws_instance_profile_arn }}"
        ebs_volume_type: GENERAL_PURPOSE_SSD
        ebs_volume_count: 1
        ebs_volume_size: 100
      custom_tags:
        - key: environment
          value: "production"
      spark_env_vars:
        - key: "ENVIRONMENT"
          value: "production"
      enable_elastic_disk: true
    libraries:
      - jar: "dbfs:/target/path/to/my/jar.jar"
    email_notifications:
      on_start: []
      on_success: []
      on_failure:
        - [email protected]
    max_concurrent_runs: 1
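
These settings largely mirror the JSON payload of the Databricks Jobs API (notebook_task, new_cluster, libraries, email_notifications, max_concurrent_runs), so the Jobs API reference is a good guide to the available fields.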
