Sparkify is a fictional music streaming service. To enable analytics and understand user behavior, a data infrastructure needs to be set up. Initially, the data collected by the service resides in two directories:
- `log_data` contains JSON logs on user activity
- `song_data` contains JSON metadata on the songs
The goal is to create a Postgres database with tables designed to optimize queries on song play analysis. The steps I've taken to achieve this are:

- Create the database schema in Postgres (a minimal sketch follows this list)
- Build an ETL pipeline for analysis using Python
- Test the implementation and improve data quality
- Compare query results against the expected results
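The schema-creation step follows the usual psycopg2 pattern: connect to a default database, recreate the project database, then reconnect and run the table definitions. A minimal sketch, assuming a local Postgres instance; the connection parameters and the database name `sparkifydb` are illustrative:

```python
import psycopg2

# Connect to a default database first; CREATE DATABASE cannot run inside a
# transaction, so autocommit is enabled. All connection parameters and the
# database name "sparkifydb" are assumptions for this sketch.
conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

cur.execute("DROP DATABASE IF EXISTS sparkifydb")
cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
conn.close()

# Reconnect to the new database; the table creation statements
# (kept in sql_queries.py) would be executed here.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
```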
The Million Song Dataset offers a freely available collection of audio features and metadata for music tracks. Each file is in JSON format and contains metadata about a song and the corresponding artist. The files are partitioned by the first three letters of each song's track ID. For example, here are the filepaths to two of the files:
```
data/song_data/A/B/C/TRABCEI128F424C983.json
data/song_data/A/A/B/TRAABJL12903CDCF1A.json
```
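Because the song files sit several directories deep, the ETL code needs to walk the tree rather than list a single folder. A minimal sketch; the helper name `get_files` is illustrative:

```python
import glob
import os

def get_files(filepath):
    """Recursively collect absolute paths of all .json files under filepath."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))
    return all_files

song_files = get_files("data/song_data")
```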
Here is an example of a file from `song_data`, `TRABCFL128F149BB0D.json`:
```json
{
    "num_songs": 1,
    "artist_id": "ARLTWXK1187FB5A3F8",
    "artist_latitude": 32.74863,
    "artist_longitude": -97.32925,
    "artist_location": "Fort Worth, TX",
    "artist_name": "King Curtis",
    "song_id": "SODREIN12A58A7F2E5",
    "title": "A Whiter Shade Of Pale (Live @ Fillmore West)",
    "duration": 326.00771,
    "year": 0
}
```
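Each song file holds a single JSON object like the one above, so pandas can read it directly and the song and artist fields can be split out for the two dimension tables. A minimal sketch, assuming the field names shown in the example:

```python
import pandas as pd

# A song file is one JSON object on a single line, so lines=True parses it.
df = pd.read_json("data/song_data/A/B/C/TRABCEI128F424C983.json", lines=True)

# Split the record into the songs and artists dimension tables.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
artist_data = df[["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]].values[0].tolist()
```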
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the first dataset. The log files are partitioned by year and month. For example, here are the filepaths to two of the files:
```
data/log_data/2018/11/2018-11-12-events.json
data/log_data/2018/11/2018-11-13-events.json
```
An example file from `log_data` is `2018-11-12-events.json`; one way to read it is sketched below.
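Log files are newline-delimited JSON, one event per line. A minimal sketch of reading one file, keeping only song-play events (page `NextSong`, as used by the `songplays` table below) and expanding the millisecond timestamp for the `time` table; the column name `ts` is an assumption:

```python
import pandas as pd

df = pd.read_json("data/log_data/2018/11/2018-11-12-events.json", lines=True)

# Keep only actual song plays.
df = df[df["page"] == "NextSong"]

# ts is assumed to be milliseconds since the epoch; break it into the
# units stored in the time table.
t = pd.to_datetime(df["ts"], unit="ms")
time_data = list(zip(t, t.dt.hour, t.dt.day, t.dt.isocalendar().week,
                     t.dt.month, t.dt.year, t.dt.weekday))
```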
Using the song and log datasets, we create a star schema optimized for queries on song play analysis, consisting of the following tables:
- songplays - records in log data associated with song plays, i.e. records with page `NextSong`
  - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users of the service
  - user_id, first_name, last_name, gender, level
- songs - songs in the music database
  - song_id, title, artist_id, year, duration
- artists - artists in the music database
  - artist_id, name, location, latitude, longitude
- time - timestamps of records in `songplays`, broken down into specific units
  - start_time, hour, day, week, month, year, weekday
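As a sketch of what the table definitions in `sql_queries.py` could look like, here are the fact table and the `time` dimension; the column types (and the SERIAL surrogate key) are assumptions inferred from the data shown above:

```python
# Fact table: one row per NextSong event. songplay_id is assumed to be a
# Postgres SERIAL surrogate key.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

# Dimension table: timestamps broken down into the units listed above.
time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP PRIMARY KEY,
    hour       INT,
    day        INT,
    week       INT,
    month      INT,
    year       INT,
    weekday    INT
);
"""
```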
- `sql_queries.py` contains all the SQL queries used across the project.
- `create_tables.py` creates (and drops) the tables outlined above.
- `etl.ipynb` processes a single file from `song_data` and `log_data` and loads the data into the corresponding tables.
- `etl.py` processes all files from `song_data` and `log_data` and loads them into the tables.
- `test.ipynb` displays the first few rows of each table in the project.
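The one non-obvious step in `etl.py` is matching a log event back to the song catalog: the logs identify a track only by its title, artist name, and duration, so the pipeline looks up `song_id` and `artist_id` before inserting into `songplays`. A minimal sketch that continues from the snippets above (`cur`, `conn`, and the filtered log `df`); the log column names (`song`, `artist`, `length`, `userId`, `sessionId`, `userAgent`) and both queries are assumptions:

```python
import pandas as pd

# Hypothetical lookup mirroring the one in sql_queries.py.
song_select = """
SELECT s.song_id, a.artist_id
FROM songs s JOIN artists a ON s.artist_id = a.artist_id
WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""

songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id,
                       artist_id, session_id, location, user_agent)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""

for _, row in df.iterrows():
    # Find the matching song/artist; fall back to NULLs when there is no match.
    cur.execute(song_select, (row.song, row.artist, row.length))
    result = cur.fetchone()
    song_id, artist_id = result if result else (None, None)

    cur.execute(songplay_table_insert,
                (pd.to_datetime(row.ts, unit="ms"), row.userId, row.level,
                 song_id, artist_id, row.sessionId, row.location, row.userAgent))
conn.commit()
```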
The project targets `python==3.9`; the required packages are listed in `requirements.txt`.