Skip to content

A data engineering project with Airflow, dbt, Terrafrom, GCP and much more!

License

Notifications You must be signed in to change notification settings

VicenteYago/steam-data-engineering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Steam Data Engineering

Summary

This project creates a pipeline for processing and analyze the raw data scraped from the videogame digitial distribution platform Steam. The data is procesed in an ELT pipeline with the objective of provide insight about the best selling videogames since 2006.

Dataset

The dataset is originary from Kaggle, a repository of free datasets for data science. The dataset is composed as a mix of directly scraped records from Steam itself and Steamspy, a website dedicated to collect actual and speculative data about the former platform. All the files are enconded as raw jsons and need and extensive work of transformation in order to make then usable. Additionally all the reviews for each game are also retrieved from another Kaggle dataset.

Tools & Technologies

Architecture

To add more realism the datasets are placed in a S3 bucket in AWS. The trasfer from AWS to GCP is done trough Terraform as part of the initialization tasks, a one-time operation.

Since the reviews datasets is ~ 40 GB and ~ 500k files a special processing in Spark is performed. The Airflow server is located in a vitual machine deployed in Compute Engine GCP .

Dashboard

For a more compacted visualization only games with metacritic score are displayed (3.5k out of 51k).

A note about Revenue: The revenue is calculated as a product of owners * price. This is a pretty naive approximation, see Steamspy guidelines about "Owned" and "Sold" equivalence.

Orchestration

The DAG has to main branches :

  • Store branch : selected JSON files are converted into parquet and then loaded as tables in Bigquery in parallel.
  • Reviews branch : JSON files are compacted before the cluster is created, then processed in dataproc as a pyspark job. Finally the results are injected into Bigquery, where they are merged with other tables using dbt. Cluster creation and deletion is fully automated.

Both branches start by retrieving the respective data from the GCS Buckets, and they converge in the dbt task wich builds the definitive tables in BigQuery.

Transformation

Transformations are done with Spark (20%) and dbt (80%).

Spark

The spark job consists in process all 500k JSON files of reviews in a temporal Dataproc cluster and write them in a Bigquery table.

dbt

All the transformation for the steam store data is done in dbt using SQL intensively. The main table steam_games is built following the google recommendations, i.e. is a denormalized table featuring nested an repeated structers to avoid the cost of performing joins.

In order to built the dashboard a custom unnested tables are also built from steam_games.

Tests

In the final stage of the pipeline, inside the dbt task some checks are performed:

  • Checks for uniqueness and non-null for the gameid column in all affected tables.
  • Checking for consistency of release_date column in table steam_games

Improvements

  • partition core tables, for performance
  • Add a cost analysis of a full pipeline run
  • extract more value from steam reviews --> new dashboard ?
  • USE CASE: matrix factorization for recommendations
  • github action for upload spark script to bucket
  • More tests on the pipeline
  • Fully normalize the tables as exercise

Special Mentions

Thanks to Alexey and his community from datatalks club, create this project without the course of Data Engineering would have been much more difficult.