Sparkify is a fictional music streaming service. To enable analytics and understand user behavior, a data infrastructure needs to be set up. Initially, the data collected by the service resides in two directories:
- `log_data` contains JSON logs on user activity
- `song_data` contains JSON metadata on the songs
The goal is to create a Postgres database with tables designed to optimize queries on song play analysis. The steps I've taken to achieve this are:

- Create the database schema in Postgres (a minimal sketch follows this list)
- Build an ETL pipeline for analysis using Python
- Test the implementation and improve data quality
- Compare query results against the expected results
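The schema-creation step follows the usual psycopg2 pattern: connect to a default database, recreate the project database, then reconnect and run the table definitions. A minimal sketch, assuming a local Postgres instance; the connection parameters and the database name `sparkifydb` are illustrative:

```python
import psycopg2

# Connect to a default database first; CREATE DATABASE cannot run inside a
# transaction, so autocommit is enabled. All connection parameters and the
# database name "sparkifydb" are assumptions for this sketch.
conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

cur.execute("DROP DATABASE IF EXISTS sparkifydb")
cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
conn.close()

# Reconnect to the new database; the table creation statements
# (kept in sql_queries.py) would be executed here.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
```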
The Million Song Dataset offers a freely available collection of audio features and metadata for music tracks. Each file is in JSON format and contains metadata about a song and the corresponding artist. The files are partitioned by the first three letters of each song's track ID. For example, here are the filepaths to two of the files:
```
data/song_data/A/B/C/TRABCEI128F424C983.json
data/song_data/A/A/B/TRAABJL12903CDCF1A.json
```
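Because the song files sit several directories deep, the ETL code needs to walk the tree rather than list a single folder. A minimal sketch; the helper name `get_files` is illustrative:

```python
import glob
import os

def get_files(filepath):
    """Recursively collect absolute paths of all .json files under filepath."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))
    return all_files

song_files = get_files("data/song_data")
```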
Here is an example of a file from `song_data`, `TRABCFL128F149BB0D.json`:
```json
{
    "num_songs": 1,
    "artist_id": "ARLTWXK1187FB5A3F8",
    "artist_latitude": 32.74863,
    "artist_longitude": -97.32925,
    "artist_location": "Fort Worth, TX",
    "artist_name": "King Curtis",
    "song_id": "SODREIN12A58A7F2E5",
    "title": "A Whiter Shade Of Pale (Live @ Fillmore West)",
    "duration": 326.00771,
    "year": 0
}
```
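Each song file holds a single JSON object like the one above, so pandas can read it directly and the song and artist fields can be split out for the two dimension tables. A minimal sketch, assuming the field names shown in the example:

```python
import pandas as pd

# A song file is one JSON object on a single line, so lines=True parses it.
df = pd.read_json("data/song_data/A/B/C/TRABCEI128F424C983.json", lines=True)

# Split the record into the songs and artists dimension tables.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
artist_data = df[["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]].values[0].tolist()
```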
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the first dataset. The log files are partitioned by year and month. For example, here are the filepaths to two of the files:
```
data/log_data/2018/11/2018-11-12-events.json
data/log_data/2018/11/2018-11-13-events.json
```
An example file from `log_data` is `2018-11-12-events.json`; one way to read it is sketched below.
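Log files are newline-delimited JSON, one event per line. A minimal sketch of reading one file, keeping only song-play events (page `NextSong`, as used by the `songplays` table below) and expanding the millisecond timestamp for the `time` table; the column name `ts` is an assumption:

```python
import pandas as pd

df = pd.read_json("data/log_data/2018/11/2018-11-12-events.json", lines=True)

# Keep only actual song plays.
df = df[df["page"] == "NextSong"]

# ts is assumed to be milliseconds since the epoch; break it into the
# units stored in the time table.
t = pd.to_datetime(df["ts"], unit="ms")
time_data = list(zip(t, t.dt.hour, t.dt.day, t.dt.isocalendar().week,
                     t.dt.month, t.dt.year, t.dt.weekday))
```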
Using the song and log datasets, we create a star schema optimized for queries on song play analysis, consisting of the following tables:
- songplays - records in log data associated with song plays, i.e. records with page `NextSong`
  - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users of the service
  - user_id, first_name, last_name, gender, level
- songs - songs in the music database
  - song_id, title, artist_id, year, duration
- artists - artists in the music database
  - artist_id, name, location, latitude, longitude
- time - timestamps of records in `songplays`, broken down into specific units
  - start_time, hour, day, week, month, year, weekday
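As a sketch of what the table definitions in `sql_queries.py` could look like, here are the fact table and the `time` dimension; the column types (and the SERIAL surrogate key) are assumptions inferred from the data shown above:

```python
# Fact table: one row per NextSong event. songplay_id is assumed to be a
# Postgres SERIAL surrogate key.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

# Dimension table: timestamps broken down into the units listed above.
time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP PRIMARY KEY,
    hour       INT,
    day        INT,
    week       INT,
    month      INT,
    year       INT,
    weekday    INT
);
"""
```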
- `sql_queries.py` contains all the SQL queries used across the project.
- `create_tables.py` creates (and drops) the tables outlined above.
- `etl.ipynb` processes a single file from `song_data` and `log_data` and loads the data into the corresponding tables.
- `etl.py` processes all files from `song_data` and `log_data` and loads them into the tables.
- `test.ipynb` displays the first few rows of each table in the project.
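The one non-obvious step in `etl.py` is matching a log event back to the song catalog: the logs identify a track only by its title, artist name, and duration, so the pipeline looks up `song_id` and `artist_id` before inserting into `songplays`. A minimal sketch that continues from the snippets above (`cur`, `conn`, and the filtered log `df`); the log column names (`song`, `artist`, `length`, `userId`, `sessionId`, `userAgent`) and both queries are assumptions:

```python
import pandas as pd

# Hypothetical lookup mirroring the one in sql_queries.py.
song_select = """
SELECT s.song_id, a.artist_id
FROM songs s JOIN artists a ON s.artist_id = a.artist_id
WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""

songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id,
                       artist_id, session_id, location, user_agent)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""

for _, row in df.iterrows():
    # Find the matching song/artist; fall back to NULLs when there is no match.
    cur.execute(song_select, (row.song, row.artist, row.length))
    result = cur.fetchone()
    song_id, artist_id = result if result else (None, None)

    cur.execute(songplay_table_insert,
                (pd.to_datetime(row.ts, unit="ms"), row.userId, row.level,
                 song_id, artist_id, row.sessionId, row.location, row.userAgent))
conn.commit()
```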
The project targets `python==3.9`; the required packages are listed in `requirements.txt`.