The idea of this project is to get files from an S3 bucket, transform them in some way and make them available in another bucket, where they can be analysed using Athena or loaded into a data warehouse (Redshift) using the COPY command.
I found this dataset on Kaggle. You can find the full documentation by following the previous link, but, in short, it contains data about trips taken in NYC with Uber, taxis, etc.
In order to use Terraform, I first went to my account via the console and created an S3 bucket to store the tfstate. This is necessary to keep sensitive data regarding the infrastructure outside the repository. I wrote all the backend configuration in an HCL file named backend.hcl.
Then I ran terraform init -backend=true -backend-config=backend.hcl and the remote backend was set up.
Then I created the S3 buckets where the sliced data was supposed to be saved, as well as the bucket where I wanted to store the ipynb file to be run in Glue.
resource "aws_s3_bucket" "source-bucket" {
bucket = "903442739132-source-bucket-trips-data/year=2021/"
}
resource "aws_s3_bucket" "type-of-license-bucket" {
bucket = "903442739132-type-of-license-bucket/"
}
resource "aws_s3_bucket" "trips-not-shared-bucket" {
bucket = "903442739132-trips-not-shared-bucket/"
}
resource "aws_s3_bucket" "trips-more-than-10-dollars-bucket" {
bucket = "903442739132-trips-more-than-10-dollars-bucket/"
}
resource "aws_s3_bucket" "notebooks-source" {
bucket = "903442739132-notebooks-source/"
}
After setting up the buckets I uploaded the code into 903442739132-notebooks-source (a boto3 sketch of that upload follows the job definition below) and, lastly, created the Glue job in the glue.tf file.
resource "aws_glue_job" "trips-etl" {
name = "trips-etl"
role_arn = "glue2"
command {
script_location = "s3://903442739132-notebooks-source/prod.ipynb"
}
default_arguments = {
"--job-language" = "python"
}
}
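For reference, the notebook upload into 903442739132-notebooks-source only takes a couple of lines of boto3. This is a minimal sketch, assuming the notebook sits in the working directory as prod.ipynb, the same key the Glue job's script_location points to.
# Minimal sketch: upload the notebook to the notebooks bucket with boto3
import boto3

s3 = boto3.client("s3")
s3.upload_file("prod.ipynb", "903442739132-notebooks-source", "prod.ipynb")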
To get the data stored on Kaggle into S3 as fast as possible, I created a t2.micro EC2 instance and downloaded the parquet files using Kaggle's API. After downloading everything I uploaded the files with the AWS CLI, which was pretty fast. I did all of this through the console because it was a transitory process.
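The download step can also be scripted with the kaggle Python package instead of the CLI. This is only a sketch of that approach; the dataset slug below is a placeholder, not the actual dataset.
# Sketch: pull the parquet files onto the EC2 instance through the Kaggle API.
# "someuser/nyc-trips-2021" is a placeholder slug, not the real dataset.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
api.dataset_download_files("someuser/nyc-trips-2021", path="data", unzip=True)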
First I wrote the code in plain PySpark to better understand the data itself, and only afterwards switched to code better suited to a Glue job, using DynamicFrames. During development I used the Glue Jobs Docker image.
The notebook can be found at this link. The idea was to group all the files into one dataframe (to get data for the entire year) and then slice it according to a few rules.
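To give an idea of that logic without opening the notebook, here is a rough sketch of how the grouping and slicing can look in a Glue job. The column names (shared_request_flag, base_passenger_fare, hvfhs_license_num) and the exact rules are assumptions based on the NYC trip data schema, not lifted from the notebook.
# Rough sketch of the grouping/slicing step; column names and rules are assumed.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Group every monthly parquet file for 2021 into a single dataframe
trips = spark.read.parquet("s3://903442739132-source-bucket-trips-data/year=2021/")

# Slice it according to the rule behind each target bucket
not_shared = trips.filter(trips.shared_request_flag == "N")
over_10 = trips.filter(trips.base_passenger_fare > 10)
by_license = trips.groupBy("hvfhs_license_num").count()

# Convert each slice to a DynamicFrame and write it to its own bucket
for df, bucket in [
    (not_shared, "903442739132-trips-not-shared-bucket"),
    (over_10, "903442739132-trips-more-than-10-dollars-bucket"),
    (by_license, "903442739132-type-of-license-bucket"),
]:
    dyf = DynamicFrame.fromDF(df, glue_context, "slice")
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": f"s3://{bucket}/"},
        format="parquet",
    )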