
Databricks Certified Data Engineer Associate Hands on projects

This repository showcases selected Databricks features and serves as hands-on preparation for the Databricks Data Engineer Associate certification exam.

To import these resources into your Databricks workspace, clone this repository via Databricks Repos.

0. Databricks + PySpark + Azure

  • Storing data in the Databricks FileStore, loading it into a workspace notebook, and performing data science on it (a minimal sketch appears at the end of this section).
  • Storing data in an Azure storage account (ADLS Gen2) and mounting it to Databricks. This includes the following steps:
  1. Create a Resource Group in Azure.
  2. Create a Storage Account and assign it to the Resource Group.
  3. Create an App Registration (service principal), which Databricks will use to connect to the Storage Account. 3.1 Create a client secret and copy it.
  4. Create a Key Vault (assigned to the same Resource Group). 4.1 Add the client secret here.
  5. Create a secret scope within Databricks. 5.1 Use the Key Vault DNS name (URL) and Resource ID to allow Databricks to access the Key Vault's secrets within a specific scope.
  6. Use this scope to retrieve the secret and connect to the Storage Account container where the data is stored in Azure:
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<appId>",
           # Retrieve the client secret from the Key Vault-backed secret scope created above
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}
  7. Finally, we can mount the data:
dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
    mount_point = "/mnt/flightdata",
    extra_configs = configs)
  8. Now we can load the data from the mount point into a DataFrame and perform actions on it:
flightDF = spark.read.format('csv').options(
    header='true', inferSchema='true').load("/mnt/flightdata/*.csv")

# Read the flight CSV files and write the output in Parquet format for easier querying.
flightDF.write.mode("append").parquet("/mnt/flightdata/parquet/flights")
print("Done")
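
For the FileStore workflow mentioned in the first bullet above, here is a minimal sketch; the file path and the ORIGIN column are illustrative assumptions, not taken from this repo:

# Read a CSV that was uploaded to the Databricks FileStore (path is a placeholder).
df = spark.read.format("csv").options(
    header="true", inferSchema="true").load("/FileStore/tables/flights.csv")

# Simple exploratory step: count flights per origin airport (column name is assumed).
df.groupBy("ORIGIN").count().orderBy("count", ascending=False).show(10)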

1. Delta Lake in Lakehouse

Working with Delta tables and applying optimizations such as OPTIMIZE and ZORDER.
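
As a hedged illustration of these commands (the table name flights_delta and the ZORDER column origin are placeholders, not from this repo):

# Save the flight data from section 0 as a managed Delta table (table name is a placeholder).
flightDF.write.format("delta").mode("overwrite").saveAsTable("flights_delta")

# Compact small files and co-locate rows by a frequently filtered column.
spark.sql("OPTIMIZE flights_delta ZORDER BY (origin)")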


2. ETL with PySpark in Databricks
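
The notebook in this repo has the full pipeline; the sketch below only outlines the extract-transform-load pattern (paths, column names, and transformations are illustrative assumptions):

from pyspark.sql import functions as F

# Extract: read raw CSV files from the mounted storage (path is illustrative).
raw = spark.read.options(header="true", inferSchema="true").csv("/mnt/flightdata/*.csv")

# Transform: drop duplicate rows and add an ingestion date column (assumed transformation).
cleaned = raw.dropDuplicates().withColumn("ingest_date", F.current_date())

# Load: persist the result as a Delta table for downstream queries (target path is a placeholder).
cleaned.write.format("delta").mode("overwrite").save("/mnt/flightdata/delta/flights_clean")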


3. Incremental Data Processing

Using Auto Loader and COPY INTO for incremental data processing via streaming.
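
A minimal Auto Loader sketch (the source, schema, checkpoint, and target paths are placeholder assumptions; COPY INTO is the SQL alternative for the same batch-incremental pattern):

# Incrementally discover and ingest new CSV files with Auto Loader (cloudFiles source).
stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/flightdata/_schemas/flights")
    .load("/mnt/flightdata/incoming/"))

# Write the stream to a Delta table, using a checkpoint so only new files are processed on each run.
(stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/flightdata/_checkpoints/flights")
    .trigger(availableNow=True)
    .start("/mnt/flightdata/delta/flights_bronze"))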

About

Databricks and Azure projects for experimenting, learning and demonstrating knowledge
