[WIP] Lab sql notes #160

Draft
wants to merge 20 commits into
base: main
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,2 +1,4 @@
.idea/
.DS_Store
*.pyc
__pycache__/

This file was deleted.

This file was deleted.

@@ -0,0 +1,36 @@
/* Create the season_stats type and the players table.
   Have a look at the player_seasons table that is already provided. */
CREATE TYPE season_stats AS (
    season INTEGER,
    gp REAL,
    pts REAL,
    reb REAL,
    ast REAL,
    weight INTEGER
);
-- CREATE TYPE scoring_class AS ENUM ('bad', 'average', 'good', 'star');
CREATE TABLE players (
    player_name TEXT,
    height TEXT,
    college TEXT,
    country TEXT,
    draft_year TEXT,
    draft_round TEXT,
    draft_number TEXT,
    seasons season_stats[],
    -- scorer_class scoring_class,
    -- is_active BOOLEAN,
    current_season INTEGER,
    PRIMARY KEY (player_name, current_season)
);
/*
The player_seasons table has these attributes:
  - player_name
  - constant info: height, college, country, draft_year, draft_round, draft_number
  - season-dependent info: season, gp, pts, reb, ast, weight
The players table has:
  - player_name and the same constant info
  - seasons, an array of season_stats structs
  - current_season
In other words, each players row carries the player's full history up to
current_season.
player_seasons also has attributes we ignore here, such as age and other
season-dependent values like netrtg and oreb-pct.
*/
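/* A minimal sketch of how a season_stats value and a season_stats[] array
   behave, using the two Andrew Lang seasons quoted in the STEP 2 notes of this
   lab. The CTE name demo and the output aliases are illustrative only, and the
   sketch assumes the CREATE TYPE above has already run. */
WITH demo AS (
    SELECT ARRAY[
        ROW(1996, 52, 5.3, 5.3, 0.5, 275)::season_stats,
        ROW(1997, 57, 2.7, 2.7, 0.3, 270)::season_stats
    ] AS seasons
)
SELECT CARDINALITY(seasons) AS n_seasons,                   -- number of seasons stored
    (seasons[1]).pts AS first_season_pts,                   -- field access needs the parentheses
    (seasons[CARDINALITY(seasons)]).season AS latest_season -- last element of the array
FROM demo;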
@@ -0,0 +1,129 @@
/* STEP 0: The skeleton is a full outer join between yesterday's snapshot of
players (current_season = 1995, which is empty, so its columns come back as NULL)
and today's season data (1996). Neither side contains duplicate players. */
-- WITH yesterday AS (
-- SELECT *
-- FROM players
-- WHERE current_season = 1995
-- ),
-- today AS (
-- SELECT *
-- FROM player_seasons
-- WHERE season = 1996
-- )
-- SELECT *
-- FROM today t
-- FULL OUTER JOIN yesterday y ON t.player_name = y.player_name;
/*------------------------------------------------------------------------------*/
/* STEP 1: populate the players table with data of 1996 */
INSERT INTO players
WITH yesterday AS (
    SELECT *
    FROM players
    WHERE current_season = 1995
),
today AS (
    SELECT *
    FROM player_seasons
    WHERE season = 1996
)
SELECT COALESCE(t.player_name, y.player_name) AS player_name,
    COALESCE(t.height, y.height) AS height,
    COALESCE(t.college, y.college) AS college,
    COALESCE(t.country, y.country) AS country,
    COALESCE(t.draft_year, y.draft_year) AS draft_year,
    COALESCE(t.draft_round, y.draft_round) AS draft_round,
    COALESCE(t.draft_number, y.draft_number) AS draft_number,
    CASE
        /* Build the seasons array from y.seasons and today's stats. */
        -- New player: start the array with today's season.
        WHEN y.seasons IS NULL THEN ARRAY[ROW(
            t.season,
            t.gp,
            t.pts,
            t.reb,
            t.ast,
            t.weight
        )::season_stats]
        -- Returning player: append today's season to the history.
        WHEN t.season IS NOT NULL THEN y.seasons || ARRAY[ROW(
            t.season,
            t.gp,
            t.pts,
            t.reb,
            t.ast,
            t.weight
        )::season_stats]
        -- Player did not play today: carry the history forward unchanged.
        ELSE y.seasons
    END AS seasons,
    COALESCE(t.season, y.current_season + 1) AS current_season
FROM today t
FULL OUTER JOIN yesterday y ON t.player_name = y.player_name;
/* ---------------------------------------------------------------------------------*/
/* STEP 2: repeat the previous query with yesterday = 1996 and today = 1997.
Only the two season literals change between runs. */
INSERT INTO players
WITH yesterday AS (
    SELECT *
    FROM players
    WHERE current_season = 1996
),
today AS (
    SELECT *
    FROM player_seasons
    WHERE season = 1997
)
SELECT -- COALESCE merges the constant info from both sides into one column each,
    -- so it is not repeated when a player appears in both yesterday and today
    COALESCE(t.player_name, y.player_name) AS player_name,
    COALESCE(t.height, y.height) AS height,
    COALESCE(t.college, y.college) AS college,
    COALESCE(t.country, y.country) AS country,
    COALESCE(t.draft_year, y.draft_year) AS draft_year,
    COALESCE(t.draft_round, y.draft_round) AS draft_round,
    COALESCE(t.draft_number, y.draft_number) AS draft_number,
    CASE
        /* Build the seasons array from y.seasons and today's stats. */
        -- New player: start the array with today's season.
        WHEN y.seasons IS NULL THEN ARRAY[ROW(
            t.season,
            t.gp,
            t.pts,
            t.reb,
            t.ast,
            t.weight
        )::season_stats]
        -- Returning player: append today's season to the history.
        WHEN t.season IS NOT NULL THEN y.seasons || ARRAY[ROW(
            t.season,
            t.gp,
            t.pts,
            t.reb,
            t.ast,
            t.weight
        )::season_stats]
        -- Player did not play today: carry the history forward unchanged.
        ELSE y.seasons
    END AS seasons,
    COALESCE(t.season, y.current_season + 1) AS current_season
FROM today t
FULL OUTER JOIN yesterday y ON t.player_name = y.player_name;
/* Very important note: the previous query ADDS the rows produced by the outer
join to the players table. It does not replace the rows that are already there.
So after executing it, a player who played in both 1996 and 1997 has one row per
snapshot. Example: the player Andrew Lang now has two rows:
one with seasons value: {"(1996,52,5.3,5.3,0.5,275)"} and current_season 1996
another with seasons value: {"(1996,52,5.3,5.3,0.5,275)", "(1997,57,2.7,2.7,0.3,270)"}
and current_season 1997.
This is also why executing the same query a second time fails with a duplicate
key error: it would insert (player_name, current_season) pairs that already
exist, which violates the primary key.
It is also why the yesterday CTE filters on current_season: the filter picks
exactly one snapshot per player.
I saved the result of the inner SELECT into a table called players_till_1997.
That table has 527 rows. After inserting those rows, the players table has 968
rows, which matches: it held 441 rows for season 1996 before the insert
(441 + 527 = 968).
*/
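/* A quick sanity check for the note above (a sketch; the output aliases are
   illustrative). After STEP 1 and STEP 2, the first query should report one
   row count per snapshot, which by the counts quoted above would be 441 for
   1996 and 527 for 1997. */
SELECT current_season, COUNT(*) AS n_rows
FROM players
GROUP BY current_season
ORDER BY current_season;
/* One player's snapshots side by side: for someone who played in both
   seasons, the seasons array should grow by one entry per snapshot. */
SELECT player_name, current_season, CARDINALITY(seasons) AS n_seasons
FROM players
WHERE player_name = 'Andrew Lang'
ORDER BY current_season;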
/* STEP 3: execute the previous query again with:
yesterday 1997 and today 1998
yesterday 1998 and today 1999
yesterday 1999 and today 2000
yesterday 2000 and today 2001
The players table then has a highest current_season value of 2001. */
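/* A possible way to run the four STEP 3 loads in one go instead of editing the
   literals by hand. This is only a sketch: it wraps the same INSERT in a
   PL/pgSQL DO block, and the loop variable today_season is a name made up for
   this example. It assumes seasons 1998 to 2001 are not loaded yet; otherwise
   the primary key raises a duplicate key error. */
DO $$
BEGIN
    FOR today_season IN 1998..2001 LOOP
        INSERT INTO players
        WITH yesterday AS (
            SELECT *
            FROM players
            WHERE current_season = today_season - 1
        ),
        today AS (
            SELECT *
            FROM player_seasons
            WHERE season = today_season
        )
        SELECT COALESCE(t.player_name, y.player_name),
            COALESCE(t.height, y.height),
            COALESCE(t.college, y.college),
            COALESCE(t.country, y.country),
            COALESCE(t.draft_year, y.draft_year),
            COALESCE(t.draft_round, y.draft_round),
            COALESCE(t.draft_number, y.draft_number),
            CASE
                -- New player: start the array with today's season.
                WHEN y.seasons IS NULL THEN ARRAY[ROW(
                    t.season, t.gp, t.pts, t.reb, t.ast, t.weight
                )::season_stats]
                -- Returning player: append today's season.
                WHEN t.season IS NOT NULL THEN y.seasons || ARRAY[ROW(
                    t.season, t.gp, t.pts, t.reb, t.ast, t.weight
                )::season_stats]
                -- Absent today: carry the history forward.
                ELSE y.seasons
            END,
            COALESCE(t.season, y.current_season + 1)
        FROM today t
        FULL OUTER JOIN yesterday y ON t.player_name = y.player_name;
    END LOOP;
END
$$;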
/*-------------------------------------------------------------*/
@@ -0,0 +1,9 @@
SELECT player_name,
    UNNEST(seasons) -- other engines: CROSS JOIN UNNEST (Presto/Trino)
    -- or LATERAL VIEW EXPLODE (Hive/Spark SQL)
FROM players
WHERE current_season = 1998
    AND player_name = 'Michael Jordan';
/* This returns two rows: one for 1996 and one for 1997.
Recall that the row with current_season = 1998 holds the history up to and
including 1998. Only two rows come back because this player's array has no
1998 entry. */
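/* For comparison, a sketch of the same lookup with UNNEST in the FROM clause,
   which is the PostgreSQL counterpart of the CROSS JOIN UNNEST form mentioned
   above. The aliases p and s are illustrative. Because the array elements are
   season_stats values, s.* should expand into one column per field instead of
   a single composite column. */
SELECT p.player_name,
    s.*
FROM players AS p
CROSS JOIN UNNEST(p.seasons) AS s
WHERE p.current_season = 1998
    AND p.player_name = 'Michael Jordan';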
@@ -0,0 +1,34 @@
/* STEP 1: unnest the seasons array for a single player */
-- SELECT player_name,
-- UNNEST (seasons)::season_stats as season_info
-- /* ::season_stats is a type cast */
-- FROM players
-- where current_season = 2001
-- and player_name = 'Michael Jordan'
/* ------------------------------------------------------------*/
/* STEP 2: expand the unnested season_stats value into separate columns */
-- WITH unnested AS (
-- SELECT player_name,
-- UNNEST (seasons)::season_stats as seasons_info
-- /* ::season_stats is a type cast */
-- FROM players
-- where current_season = 2001
-- and player_name = 'Michael Jordan'
-- )
-- SELECT player_name,
-- (seasons_info::season_stats).*
-- from unnested
-- /* ---------------------------------------------------------*/
-- /* STEP3: same query as before but querying for all players instead */
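WITH unnested AS (
    SELECT player_name,
        UNNEST(seasons)::season_stats AS seasons_info
        /* ::season_stats is a type cast */
    FROM players
    WHERE current_season = 2001
)
SELECT player_name,
    (seasons_info::season_stats).*
FROM unnested;
/* Notice that rows for the same player come out next to each other, with the
players' names in sorted order. Long runs of identical player_name values are
what make run-length encoding compress this output well. SQL does not guarantee
row order without an ORDER BY, so add one if the order must be relied on. */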
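/* A variant of the query above with an explicit ORDER BY, as a sketch. It
   makes the player-by-player grouping deterministic rather than a side effect
   of how the rows happen to be read. The redundant casts are dropped here,
   since UNNEST(seasons) already yields season_stats values. */
WITH unnested AS (
    SELECT player_name,
        UNNEST(seasons) AS seasons_info
    FROM players
    WHERE current_season = 2001
)
SELECT player_name,
    (seasons_info).*
FROM unnested
ORDER BY player_name, (seasons_info).season;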