Databricks Certified Data Engineer Associate Hands-on Projects

This repository showcases some features of Databricks and serves as hands-on preparation for the Databricks Data Engineer Associate certification exam.

To import these resources into your Databricks workspace, clone this repository via Databricks Repos.

0. Databricks + PySpark + Azure

  • Storing data in the Databricks FileStore, loading it into a workspace notebook, and performing data science on it (see the sketch at the end of this section).
  • Storing data in Azure Blob Storage and mounting it to Databricks. This involves the following steps:
  1. Create a Resource Group in Azure.
  2. Create a Storage Account and assign it to the Resource Group.
  3. Register an app (this creates a service principal), which Databricks will use to connect to the Storage Account.
     3.1 Create a client secret and copy it.
  4. Create a Key Vault (assigned to the same Resource Group).
     4.1 Add the client secret to the Key Vault.
  5. Create a secret scope within Databricks.
     5.1 Use the Key Vault DNS name (URL) and Resource ID so that Databricks can access the Key Vault's secrets within a specific scope.
  6. Use this scope to retrieve secrets and connect to the Storage Account container where the data is stored in Azure:
```python
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<appId>",
           "fs.azure.account.oauth2.client.secret": "<clientSecret>",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}
```
  7. Finally, we can mount the data:
```python
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
    mount_point="/mnt/flightdata",
    extra_configs=configs)
```
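As a quick sanity check (not part of the original steps), the new mount point can be listed before reading any data:

```python
# List the files now visible under the mount point.
display(dbutils.fs.ls("/mnt/flightdata"))
```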
  8. Now we can load the data from the mount point into a DataFrame and perform actions on it.
```python
flightDF = spark.read.format('csv').options(
    header='true', inferSchema='true').load("/mnt/flightdata/*.csv")

# Read the airline CSV files and write the output to Parquet format for easy querying.
flightDF.write.mode("append").parquet("/mnt/flightdata/parquet/flights")
print("Done")
```

1. Delta Lake in Lakehouse

Working with Delta tables and applying commands such as OPTIMIZE and ZORDER.
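A minimal sketch of those commands, assuming the flight data from section 0 is saved as a Delta table named flights with an origin column (table and column names are hypothetical):

```python
# Save the flight data as a managed Delta table.
flightDF.write.format("delta").mode("overwrite").saveAsTable("flights")

# Compact small files and co-locate rows on the ZORDER column to speed up filters on it.
spark.sql("OPTIMIZE flights ZORDER BY (origin)")
```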


2. ETL with PySpark in Databricks

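A minimal PySpark ETL sketch, reusing the mounted flight data from section 0 (the delay and origin column names are assumptions for illustration):

```python
from pyspark.sql import functions as F

# Extract: read the raw CSV files from the mount point.
raw_df = spark.read.format('csv').options(
    header='true', inferSchema='true').load("/mnt/flightdata/*.csv")

# Transform: drop duplicate rows and compute the average delay per origin
# (column names are placeholders for illustration).
agg_df = (raw_df.dropDuplicates()
                .groupBy("origin")
                .agg(F.avg("delay").alias("avg_delay")))

# Load: write the result as a Delta table for downstream querying.
agg_df.write.format("delta").mode("overwrite").save("/mnt/flightdata/delta/avg_delay_by_origin")
```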

3. Incremental Data Processing

Using Auto Loader and COPY INTO to process data incrementally through streaming.
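A minimal Auto Loader sketch, assuming new CSV files land in a hypothetical /mnt/flightdata/landing/ directory:

```python
# Incrementally ingest new files with Auto Loader (the cloudFiles source).
stream_df = (spark.readStream
                  .format("cloudFiles")
                  .option("cloudFiles.format", "csv")
                  .option("cloudFiles.schemaLocation", "/mnt/flightdata/_schemas/flights")
                  .load("/mnt/flightdata/landing/"))

# Write the stream to a Delta table; the checkpoint tracks which files were already processed.
(stream_df.writeStream
          .format("delta")
          .option("checkpointLocation", "/mnt/flightdata/_checkpoints/flights")
          .trigger(availableNow=True)
          .toTable("flights_bronze"))
```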