Solution for Udacity's Data Engineering Nanodegree projects.

Data Modeling with Postgres
-

This project consists of creating fact and dimension tables for a star schema and writing an ETL pipeline that transfers data from files in two local directories into these tables in Postgres, using Python and SQL.
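
As a rough illustration of the kind of pipeline step involved (not the project's actual code), the sketch below inserts one record into a hypothetical `songplays` fact table with psycopg2; the connection string, table, and column names are assumptions.

```python
# Minimal sketch: insert a row into an assumed "songplays" fact table.
import psycopg2

# Placeholder local connection; credentials and database name are illustrative.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

songplay_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id,
                           artist_id, session_id, location, user_agent)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
"""
cur.execute(songplay_insert,
            ("2018-11-01 21:01:46", 8, "free", None, None,
             139, "Phoenix, AZ", "Mozilla/5.0"))

conn.commit()
conn.close()
```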

Data Modeling with Apache Cassandra
-

This project involves modeling part of an ETL pipeline that processes a set of CSV files within a directory into a single, simplified CSV file, which is then used to model and insert the data into Apache Cassandra tables.
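
A minimal sketch of the Cassandra side, assuming a hypothetical keyspace and a table modeled around a query that filters by session; keyspace, table, and column names are illustrative, not the project's actual schema.

```python
# Minimal sketch: create a query-oriented table and insert one row with the
# DataStax cassandra-driver (assumes a local Cassandra instance).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("sparkify")

# Cassandra tables are modeled per query; this one assumes lookups by
# session_id and item_in_session.
session.execute("""
    CREATE TABLE IF NOT EXISTS song_in_session (
        session_id int,
        item_in_session int,
        artist text,
        song text,
        length float,
        PRIMARY KEY (session_id, item_in_session)
    )
""")

session.execute(
    "INSERT INTO song_in_session (session_id, item_in_session, artist, song, length) "
    "VALUES (%s, %s, %s, %s, %s)",
    (338, 4, "Faithless", "Music Matters", 495.3073),
)

cluster.shutdown()
```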

Data Warehouse with Redshift
-

In this project the task was to load data from S3 into staging tables in Redshift and then execute SQL statements that build the analytics tables from those staging tables.
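
A hedged sketch of those two steps, run through psycopg2 against the cluster; the table names, S3 paths, IAM role, and connection details are placeholders rather than the project's actual configuration.

```python
# Minimal sketch: COPY raw events from S3 into a staging table, then derive an
# analytics fact table from it with INSERT ... SELECT.
import psycopg2

# Placeholder cluster endpoint and credentials.
conn = psycopg2.connect(
    "host=<cluster-endpoint> port=5439 dbname=dwh user=dwhuser password=<password>"
)
cur = conn.cursor()

staging_copy = """
    COPY staging_events
    FROM 's3://<bucket>/log_data'
    IAM_ROLE '<iam-role-arn>'
    FORMAT AS JSON 's3://<bucket>/log_json_path.json'
    REGION 'us-west-2';
"""

songplays_insert = """
    INSERT INTO songplays (start_time, user_id, level, session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + ts / 1000 * INTERVAL '1 second',
           userId, level, sessionId, location, userAgent
    FROM staging_events
    WHERE page = 'NextSong';
"""

cur.execute(staging_copy)
cur.execute(songplays_insert)
conn.commit()
conn.close()
```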

Data Lake with Apache Spark
-

In this project, Apache Spark and data lake concepts were applied to build an ETL pipeline for a data lake hosted on S3. Data was loaded from S3, processed into analytics tables with Spark, and written back to S3, and the Spark job was deployed on a cluster in AWS.
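
A small PySpark sketch of that pattern, with assumed bucket paths and column names: read JSON from S3, shape a dimension table, and write it back to S3 as partitioned parquet.

```python
# Minimal sketch: build one dimension table for a data lake on S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data_lake_etl").getOrCreate()

# Placeholder input path; the nested wildcards assume one JSON file per song.
song_data = spark.read.json("s3a://<input-bucket>/song_data/*/*/*/*.json")

songs_table = (
    song_data
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)

# Partitioning by year and artist keeps the parquet output query-friendly.
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://<output-bucket>/songs/")
```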

Data Pipeline with Airflow
-

This project applies the main concepts of Apache Airflow, such as creating custom operators that prepare the data, populate the data warehouse, and run quality checks on the data.
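
A minimal sketch of what a custom operator and the DAG wiring can look like, assuming Airflow 2.x with the Postgres provider installed; the operator, connection ID, and table names here are illustrative, not the project's actual code.

```python
# Minimal sketch: a data quality operator that fails the run if a table is empty.
from datetime import datetime

from airflow import DAG
from airflow.models import BaseOperator
# Assumes the apache-airflow-providers-postgres package is installed.
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Simplified check: the given table must contain at least one row."""

    def __init__(self, table, conn_id="redshift", **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.conn_id = conn_id

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.conn_id)
        records = hook.get_records(f"SELECT COUNT(*) FROM {self.table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {self.table} is empty")


with DAG(
    "sparkify_etl",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    check_songplays = DataQualityOperator(task_id="check_songplays", table="songplays")
```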

Capstone Project
-

This project consists of building an ETL pipeline that uses I94 immigration and temperature data to create a database optimized for analyzing immigration events. The resulting fact table is used to investigate whether the temperature of a city influences immigrants' choice of destination.
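
A hedged sketch of the kind of question that fact table supports, with assumed table locations and column names: aggregate arrivals per destination city and join them to average city temperatures.

```python
# Minimal sketch: compare arrival counts with average city temperature.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("capstone_analysis").getOrCreate()

# Placeholder locations for the fact and dimension tables.
immigration = spark.read.parquet("s3a://<output-bucket>/fact_immigration/")
temperature = spark.read.parquet("s3a://<output-bucket>/dim_city_temperature/")

arrivals_by_city = (
    immigration
    .groupBy("destination_city")
    .agg(F.count("*").alias("arrivals"))
)

(arrivals_by_city
    .join(temperature, on="destination_city")
    .select("destination_city", "avg_temperature", "arrivals")
    .orderBy(F.desc("arrivals"))
    .show(20))
```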