Building an ETL pipeline out of the Spotify API and a weekly summary email. I built a program that extracts my listening history from my Spotify account with Spotipy. I then transform the data using Python to clean it up and create unique identifiers, and load it into a PostgreSQL database running on my local machine. From there I use SQL to query the data and Python to automate a weekly email that summarizes my Spotify listening for that week. I built the metrics in that email with SQL; they are similar to those seen in Spotify Wrapped.
I extracted data out of the Spotify API using this endpoint, which returns the 50 most recently played tracks. The result of calling this endpoint is a dictionary, which I clean up a bit and then split into multiple dataframes. Spotify has great documentation and great examples on their website for their API. You can find my code for this in this file.
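As a rough illustration of what the extraction call looks like with Spotipy (the exact code lives in the file linked above; credentials here are assumed to come from Spotipy's standard environment variables):

```python
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# Authenticate with the user-read-recently-played scope so the
# recently-played endpoint is accessible. SpotifyOAuth reads the
# SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET / SPOTIPY_REDIRECT_URI
# environment variables.
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope="user-read-recently-played"))

# The endpoint returns a dictionary; the plays live under the "items" key.
recent = sp.current_user_recently_played(limit=50)

for item in recent["items"]:
    track = item["track"]
    print(item["played_at"], track["name"], track["artists"][0]["name"])
```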
Since I extracted the Spotify data using Python, transforming it with Pandas was a natural progression. I also performed some basic checks on the data within the code. One essential step was creating a unique record ID for each play. I don't want to load the same play (the same track played at the same time) into the database twice, as that would be a duplicate, but I do want duplicate tracks in the database, because you can listen to a track more than once. So I created a new value for each play that combines Spotify's ID for the track with the time I played it. I converted the played-at date-time to its UNIX timestamp, which made building the ID simpler. Since you cannot play more than one song at a time on a single Spotify account, this logic holds. To prepare the data for loading I implemented two different data structures (a list of dictionaries and a dictionary of lists); I go into more detail in the code. I then transform those data structures into a dataframe, clean up the date-time datatypes, create the unique identifier for each play, and prepare to load into the PostgreSQL database. The code for this step is in this file.
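A minimal sketch of the ID construction in Pandas looks roughly like this (the column names and sample rows here are illustrative, not the exact ones in my code):

```python
import pandas as pd

# Toy example of the play records pulled from the API response
# (a list of dictionaries, one per play).
plays = [
    {"track_id": "4uLU6hMCjMI75M1A2tKUQC", "track_name": "Song A",
     "played_at": "2023-01-02T15:04:05.000Z"},
    {"track_id": "7ouMYWpwJ422jRcDASZB7P", "track_name": "Song B",
     "played_at": "2023-01-02T15:08:30.000Z"},
]

df = pd.DataFrame(plays)

# Convert the ISO played_at timestamp to a UNIX timestamp (seconds),
# then combine it with Spotify's track ID. The same track played at a
# different time gets a different ID, while re-ingesting the same play
# produces the same ID, so duplicates can be filtered out on load.
df["played_at"] = pd.to_datetime(df["played_at"], utc=True)
df["unix_played_at"] = df["played_at"].apply(lambda ts: int(ts.timestamp()))
df["unique_id"] = df["track_id"] + "-" + df["unix_played_at"].astype(str)

print(df[["unique_id", "track_name", "played_at"]])
```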
I created a database on my computer using PostgreSQL to store all of the Spotify data ingested by this ETL process. I created three tables, and the SQL code for creating them can be found in this file. To load the data into PostgreSQL I use the pandas method .to_sql to write the data to a temporary table, and then insert into the final tables from the temporary table after making sure the unique play does not already exist. The code for this can be found in this file. Here is a picture of the database schema.
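The insert pattern can be sketched like this; the connection string, table names, and columns below are placeholders rather than my actual schema (which is in the SQL file linked above):

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string for the local PostgreSQL database.
engine = create_engine("postgresql://localhost:5432/spotify")

def load_plays(df: pd.DataFrame) -> None:
    """Load new plays, skipping unique_ids that already exist."""
    with engine.begin() as conn:
        # Stage the dataframe with .to_sql (stands in for the temporary
        # table described above) ...
        df.to_sql("tmp_plays", conn, if_exists="replace", index=False)
        # ... then insert only the rows whose unique_id is not already
        # present in the final table.
        conn.execute(text("""
            INSERT INTO plays (unique_id, track_id, track_name, played_at)
            SELECT unique_id, track_id, track_name, played_at
            FROM tmp_plays t
            WHERE NOT EXISTS (
                SELECT 1 FROM plays p WHERE p.unique_id = t.unique_id
            );
        """))
```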
I utilized SQL to get the data from PostgreSQL and create the metrics I needed for my weekly summary email. The metrics in this email are similar to Spotify Wrapped's yearly roundup. The main SQL features involved were views, joins, CASE WHEN expressions, and subqueries. The SQL used to generate these metrics can be found in this file. Additionally, I put these queries into functions to simplify running them each week. I then used Python to take the resulting metrics from SQL and format them into a weekly email to myself; this code can be found in this file.
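To give a flavor of what one of these metric functions might look like, here is a hedged sketch of a "top artists this week" query wrapped in Python (the table and column names, addresses, and SMTP settings are all placeholders; the real queries and email code are in the files linked above):

```python
import smtplib
from email.message import EmailMessage

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the local PostgreSQL database.
engine = create_engine("postgresql://localhost:5432/spotify")

def top_artists_this_week() -> pd.DataFrame:
    """Example metric: most-played artists over the last 7 days."""
    query = """
        SELECT artist_name, COUNT(*) AS plays
        FROM plays
        WHERE played_at >= NOW() - INTERVAL '7 days'
        GROUP BY artist_name
        ORDER BY plays DESC
        LIMIT 5;
    """
    return pd.read_sql(query, engine)

def send_summary(body: str) -> None:
    """Send the formatted weekly summary to myself over SMTP."""
    msg = EmailMessage()
    msg["Subject"] = "Weekly Spotify Summary"
    msg["From"] = "me@example.com"       # placeholder addresses
    msg["To"] = "me@example.com"
    msg.set_content(body)
    with smtplib.SMTP_SSL("smtp.example.com") as server:
        server.login("me@example.com", "app-password")  # placeholder credentials
        server.send_message(msg)
```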
I utilized Apache Airflow to create two separate DAGs: one to schedule the data pipeline and one for the weekly email that goes out. The DAG Python code can be found in these files: ETL and Email.
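A stripped-down sketch of what one of these DAGs can look like (the module name, callable, and schedule here are placeholders; the real DAG files are linked above):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# run_spotify_etl stands in for the function that chains the extract,
# transform, and load steps described above; the email DAG looks the
# same but calls the summary-email function on a "@weekly" schedule.
from spotify_etl import run_spotify_etl

with DAG(
    dag_id="spotify_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # illustrative schedule
    catchup=False,
) as dag:
    PythonOperator(task_id="run_etl", python_callable=run_spotify_etl)
```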