Databricks Certified Data Engineer Associate Hands-on Projects

This repository showcases some features of Databricks and serves as hands-on preparation for the Databricks Data Engineer Associate certification exam.

To import these resources into your Databricks workspace, clone this repository via Databricks Repos.

0. Databricks + PySpark + Azure

  • Storing data in the Databricks FileStore, loading it into a workspace notebook, and performing data science on it (see the sketch at the end of this section).
  • Storing data in Azure Blob Storage and mounting it to Databricks. This involves the following steps:
  1. Create a Resource Group in Azure.
  2. Create a Storage Account and assign it to the Resource Group.
  3. Register an app (this creates a service principal), which Databricks will use to connect to the Storage Account.
     3.1 Create a client secret and copy it.
  4. Create a Key Vault (assigned to the same Resource Group).
     4.1 Add the client secret to the Key Vault.
  5. Create a secret scope within Databricks.
     5.1 Use the Key Vault DNS name (URL) and Resource ID so that Databricks can access the Key Vault's secrets within a specific scope.
  6. Use this scope to retrieve secrets and connect to the Storage Account container where the data is stored in Azure:
```python
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<appId>",
           "fs.azure.account.oauth2.client.secret": "<clientSecret>",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}
```
  7. Finally, we can mount the data:
```python
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
    mount_point="/mnt/flightdata",
    extra_configs=configs)
```
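As a quick sanity check (not part of the original steps), the new mount point can be listed before reading any data:

```python
# List the files now visible under the mount point.
display(dbutils.fs.ls("/mnt/flightdata"))
```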
  8. Now we can load the data from the mount point into a DataFrame and perform actions on it.
```python
flightDF = spark.read.format('csv').options(
    header='true', inferSchema='true').load("/mnt/flightdata/*.csv")

# Read the airline CSV files and write the output to Parquet format for easy querying.
flightDF.write.mode("append").parquet("/mnt/flightdata/parquet/flights")
print("Done")
```

1. Delta Lake in Lakehouse

Working with Delta tables and applying commands such as OPTIMIZE and ZORDER.
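A minimal sketch of those commands, assuming the flight data from section 0 is saved as a Delta table named flights with an origin column (table and column names are hypothetical):

```python
# Save the flight data as a managed Delta table.
flightDF.write.format("delta").mode("overwrite").saveAsTable("flights")

# Compact small files and co-locate rows on the ZORDER column to speed up filters on it.
spark.sql("OPTIMIZE flights ZORDER BY (origin)")
```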


2. ETL with PySpark in Databricks

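A minimal PySpark ETL sketch, reusing the mounted flight data from section 0 (the delay and origin column names are assumptions for illustration):

```python
from pyspark.sql import functions as F

# Extract: read the raw CSV files from the mount point.
raw_df = spark.read.format('csv').options(
    header='true', inferSchema='true').load("/mnt/flightdata/*.csv")

# Transform: drop duplicate rows and compute the average delay per origin
# (column names are placeholders for illustration).
agg_df = (raw_df.dropDuplicates()
                .groupBy("origin")
                .agg(F.avg("delay").alias("avg_delay")))

# Load: write the result as a Delta table for downstream querying.
agg_df.write.format("delta").mode("overwrite").save("/mnt/flightdata/delta/avg_delay_by_origin")
```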

3. Incremental Data Processing

Using Auto Loader and COPY INTO to process data incrementally through streaming.
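A minimal Auto Loader sketch, assuming new CSV files land in a hypothetical /mnt/flightdata/landing/ directory:

```python
# Incrementally ingest new files with Auto Loader (the cloudFiles source).
stream_df = (spark.readStream
                  .format("cloudFiles")
                  .option("cloudFiles.format", "csv")
                  .option("cloudFiles.schemaLocation", "/mnt/flightdata/_schemas/flights")
                  .load("/mnt/flightdata/landing/"))

# Write the stream to a Delta table; the checkpoint tracks which files were already processed.
(stream_df.writeStream
          .format("delta")
          .option("checkpointLocation", "/mnt/flightdata/_checkpoints/flights")
          .trigger(availableNow=True)
          .toTable("flights_bronze"))
```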