
Databricks Certified Data Engineer Associate Hands on projects

This repository showcases selected Databricks features and serves as hands-on preparation for the Databricks Data Engineer Associate certification exam.

To import these resources into your Databricks workspace, clone this repository via Databricks Repos.

0. Databricks + PySpark + Azure

  • Storing data in the Databricks FileStore, loading it into a workspace notebook, and performing data science on it (a minimal sketch appears at the end of this section).
  • Storing data in an Azure storage account (ADLS Gen2) and mounting it to Databricks. This includes the following steps:
  1. Create a Resource Group in Azure.
  2. Create a Storage Account and assign it to the Resource Group.
  3. Create an App Registration (service principal), which Databricks will use to connect to the Storage Account. 3.1 Create a client secret and copy it.
  4. Create a Key Vault (assigned to the same Resource Group). 4.1 Add the client secret here.
  5. Create a secret scope within Databricks. 5.1 Use the Key Vault DNS name (URL) and Resource ID to allow Databricks to access the Key Vault's secrets within a specific scope.
  6. Use this scope to retrieve the secret and connect to the Storage Account container where the data is stored in Azure:
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<appId>",
           # Retrieve the client secret from the Key Vault-backed secret scope created above
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}
  7. Finally, we can mount the data:
dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
    mount_point = "/mnt/flightdata",
    extra_configs = configs)
  8. Now we can load the data from the mount point into a DataFrame and perform actions on it:
flightDF = spark.read.format('csv').options(
    header='true', inferSchema='true').load("/mnt/flightdata/*.csv")

# Read the flight CSV files and write the output in Parquet format for easier querying.
flightDF.write.mode("append").parquet("/mnt/flightdata/parquet/flights")
print("Done")
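
For the FileStore workflow mentioned in the first bullet above, here is a minimal sketch; the file path and the ORIGIN column are illustrative assumptions, not taken from this repo:

# Read a CSV that was uploaded to the Databricks FileStore (path is a placeholder).
df = spark.read.format("csv").options(
    header="true", inferSchema="true").load("/FileStore/tables/flights.csv")

# Simple exploratory step: count flights per origin airport (column name is assumed).
df.groupBy("ORIGIN").count().orderBy("count", ascending=False).show(10)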

1. Delta Lake in Lakehouse

Working with Delta tables and applying optimizations such as OPTIMIZE and ZORDER.
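
As a hedged illustration of these commands (the table name flights_delta and the ZORDER column origin are placeholders, not from this repo):

# Save the flight data from section 0 as a managed Delta table (table name is a placeholder).
flightDF.write.format("delta").mode("overwrite").saveAsTable("flights_delta")

# Compact small files and co-locate rows by a frequently filtered column.
spark.sql("OPTIMIZE flights_delta ZORDER BY (origin)")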


2. ETL with PySpark in Databricks
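
The notebook in this repo has the full pipeline; the sketch below only outlines the extract-transform-load pattern (paths, column names, and transformations are illustrative assumptions):

from pyspark.sql import functions as F

# Extract: read raw CSV files from the mounted storage (path is illustrative).
raw = spark.read.options(header="true", inferSchema="true").csv("/mnt/flightdata/*.csv")

# Transform: drop duplicate rows and add an ingestion date column (assumed transformation).
cleaned = raw.dropDuplicates().withColumn("ingest_date", F.current_date())

# Load: persist the result as a Delta table for downstream queries (target path is a placeholder).
cleaned.write.format("delta").mode("overwrite").save("/mnt/flightdata/delta/flights_clean")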


3. Incremental Data Processing

Using Auto Loader and COPY INTO for incremental data processing via streaming.
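
A minimal Auto Loader sketch (the source, schema, checkpoint, and target paths are placeholder assumptions; COPY INTO is the SQL alternative for the same batch-incremental pattern):

# Incrementally discover and ingest new CSV files with Auto Loader (cloudFiles source).
stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/flightdata/_schemas/flights")
    .load("/mnt/flightdata/incoming/"))

# Write the stream to a Delta table, using a checkpoint so only new files are processed on each run.
(stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/flightdata/_checkpoints/flights")
    .trigger(availableNow=True)
    .start("/mnt/flightdata/delta/flights_bronze"))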

About

Databricks and Azure projects for experimenting, learning and demonstrating knowledge
