Developed a data lake for the analytics team at a music streaming company. After considerable growth in the user base and song database, it was time to move from a data warehouse to a data lake and enhance data processing with Spark.
Built an ETL pipeline that extracts data from S3 buckets, processes it with Spark, and transforms it into a star schema stored back in S3 as parquet files with efficient partitioning. The database and ETL pipeline were validated by running queries provided by the analytics team and comparing the output against expected results. Skills included:
- Building out an ETL pipeline using Spark, Python, and Hadoop clusters (EMR).
- Fast-tracking the data lake buildout using (serverless) AWS Lambda and cataloging tables with an AWS Glue Crawler (see the crawler sketch after this list).
- Setting up IAM roles, Hadoop clusters (EMR), config files, and security groups.
- Scaling up the data analysis process with a data lake and Spark to further optimize queries for song play analysis.
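
A minimal sketch of the Glue cataloging step mentioned above, assuming a hypothetical crawler name, IAM role ARN, and output bucket path (none of these identifiers come from the repo itself):

```python
import boto3

# Assumed names/paths for illustration only.
CRAWLER_NAME = "sparkify_songplays_crawler"
GLUE_ROLE_ARN = "arn:aws:iam::123456789012:role/glue-crawler-role"
OUTPUT_PATH = "s3://example-output-bucket/songplays/"

glue = boto3.client("glue", region_name="us-west-2")

# Create a crawler pointed at the parquet output so the tables land in the
# Glue Data Catalog and become queryable (e.g. via Athena or EMR).
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=GLUE_ROLE_ARN,
    DatabaseName="sparkify_datalake",
    Targets={"S3Targets": [{"Path": OUTPUT_PATH}]},
)

# Run the crawler; it infers the schema and partition columns from the parquet layout.
glue.start_crawler(Name=CRAWLER_NAME)
```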
The primary file in this repo is etl.py, which reads files from S3 buckets, processes them with Spark, and writes the results back to S3 as parquet files, partitioned appropriately.
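
A minimal sketch of the etl.py pattern, assuming hypothetical bucket paths and using the songs dimension table as an example (the actual paths, credentials, and full set of tables live in the repo's config and code):

```python
from pyspark.sql import SparkSession


def create_spark_session():
    # On EMR the Hadoop/S3 connectors are already available; running locally
    # would require adding the hadoop-aws package to spark.jars.packages.
    return SparkSession.builder.appName("sparkify-datalake").getOrCreate()


def process_song_data(spark, input_path, output_path):
    # Read raw JSON song files from S3.
    df = spark.read.json(f"{input_path}/song_data/*/*/*/*.json")

    # Build the songs dimension table for the star schema.
    songs_table = (
        df.select("song_id", "title", "artist_id", "year", "duration")
          .dropDuplicates(["song_id"])
    )

    # Write as parquet, partitioned by year and artist for efficient query pruning.
    (songs_table.write
        .mode("overwrite")
        .partitionBy("year", "artist_id")
        .parquet(f"{output_path}/songs/"))


if __name__ == "__main__":
    spark = create_spark_session()
    # Hypothetical bucket names used for illustration only.
    process_song_data(spark, "s3a://example-input-bucket", "s3a://example-output-bucket")
```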