Now that you have learned about CSV and JSON file formats individually, it's time to bring them together with a cumulative lab! Even as a junior data scientist, you can often produce novel, interesting analyses by combining multiple datasets that haven't been combined before.
You will be able to:
- Practice reading serialized JSON and CSV data from files into Python objects
- Practice extracting information from nested data structures
- Practice cleaning data (filtering, normalizing locations, converting types)
- Combine data from multiple sources into a single data structure
- Interpret descriptive statistics and data visualizations to present your findings
Photo by Fauzan Saari on Unsplash
What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?
Intuitively, we might assume that countries with larger populations would have better performance in international sports competitions. While this has been demonstrated to be true for the Olympics, the results for the FIFA World Cup are more mixed:
In this analysis, we are going to look specifically at the sample of World Cup games in 2018 and the corresponding 2018 populations of the participating nations, to determine the relationship between population and World Cup performance for this year.
The data sources for this analysis will be pulled from two separate files.
- Source: This dataset comes from
football.db
, a "free and open public domain football database & schema for use in any (programming) language" - Contents: Data about all games in the 2018 World Cup, including date, location (city and stadium), teams, goals scored (and by whom), and tournament group
- Format: Nested JSON data (dictionary containing a list of rounds, each of which contains a list of matches, each of which contains information about the teams involved and the points scored)
- Source: This dataset comes from a curated collection by DataHub.io, originally sourced from the World Bank
- Contents: Data about populations by country for all available years from 1960 to 2018
- Format: CSV data, where each row contains a country name, a year, and a population
Create an alphabetically-sorted list of teams who competed in the 2018 FIFA World Cup.
Create a data structure that connects a team name (country name) to its performance in the 2018 FIFA World Cup. We'll use the count of games won in the entire tournament (group stage as well as knockout stage) to represent the performance.
This will help create visualizations to help the reader understand the distribution of games won and the performance of each team.
Add to the existing data structure so that it also connects each country name to its 2018 population, and create visualizations comparable to those from step 2.
Choose an appropriate statistical measure to analyze the relationship between population and performance, and create a visualization representing this relationship.
Before moving on to the next step, pause and think about the strategy for this analysis.
Remember, our business question is:
What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?
First, what is our unit of analysis, and what is the unique identifier? In other words, what will one record in our final data structure represent, and what attribute uniquely describes it?
.
.
.
Answer:
What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?
Our unit of analysis is a country and the unique identifier we'll use is the country name
Next, what features are we analyzing? In other words, what attributes of each country are we interested in?
.
.
.
Answer:
What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?
Our features are 2018 population and count of wins in the 2018 World Cup
Finally, which dataset should we start with? In this case, any record with missing data is not useful to us, so we want to start with the smaller dataset.
.
.
.
Answer: There are only 32 countries that compete in the World Cup each year, compared to hundreds of countries in the world, so we should start with the 2018 World Cup dataset. Then we can join it with the relevant records from the country population dataset.
Below we import the json
and csv
modules, which will be used for reading from world_cup_2018.json
and country_populations.csv
, respectively.
# Run this cell without changes
import json
import csv
Next, we open the relevant files.
# Run this cell without changes
world_cup_file = open("data/world_cup_2018.json", encoding='utf8')
population_file = open("data/country_populations.csv")
Hint: if your code below is not working, (e.g. ValueError: I/O operation on closed file.
, or you get an empty list or dictionary) try re-running the cell above to reopen the files, then re-run your code.
In the cell below, use the json
module to load the data from world_cup_file
into a dictionary called world_cup_data
# Replace None with appropriate code
world_cup_data = None
# Close the file now that we're done reading from it
world_cup_file.close()
Make sure the assert
passes, ensuring that world_cup_data
has the correct type.
# Run this cell without changes
# Check that the overall data structure is a dictionary
assert type(world_cup_data) == dict
# Check that the dictionary has 2 keys, 'name' and 'rounds'
assert list(world_cup_data.keys()) == ['name', 'rounds']
Now use the csv
module to load the data from population_file
into a list of dictionaries called population_data
(Recall that you can convert a csv.DictReader
object into a list of dictionaries using the built-in list()
function.)
# Replace None with appropriate code
population_data = None
# Close the file now that we're done reading from it
population_file.close()
Make sure the assert
s pass, ensuring that population_data
has the correct type.
# Run this cell without changes
# Check that the overall data structure is a list
assert type(population_data) == list
# Check that the 0th element is a dictionary
# (csv.DictReader interface differs slightly by Python version;
# either a dict or an OrderedDict is fine here)
from collections import OrderedDict
assert type(population_data[0]) == dict or type(population_data[0]) == OrderedDict
Create an alphabetically-sorted list of teams who competed in the 2018 FIFA World Cup.
This will take several steps, some of which have been completed for you.
Let's start by exploring the structure of world_cup_data
. Here is a pretty-printed preview of its contents:
{
"name": "World Cup 2018",
"rounds": [
{
"name": "Matchday 1",
"matches": [
{
"num": 1,
"date": "2018-06-14",
"time": "18:00",
"team1": { "name": "Russia", "code": "RUS" },
"team2": { "name": "Saudi Arabia", "code": "KSA" },
"score1": 5,
"score2": 0,
"score1i": 2,
"score2i": 0,
"goals1": [
{ "name": "Gazinsky", "minute": 12, "score1": 1, "score2": 0 },
{ "name": "Cheryshev", "minute": 43, "score1": 2, "score2": 0 },
{ "name": "Dzyuba", "minute": 71, "score1": 3, "score2": 0 },
{ "name": "Cheryshev", "minute": 90, "offset": 1, "score1": 4, "score2": 0 },
{ "name": "Golovin", "minute": 90, "offset": 4, "score1": 5, "score2": 0 }
],
"goals2": [],
"group": "Group A",
"stadium": { "key": "luzhniki", "name": "Luzhniki Stadium" },
"city": "Moscow",
"timezone": "UTC+3"
}
]
},
{
"name": "Matchday 2",
"matches": [
{
"num": 2,
"date": "2018-06-15",
"time": "17:00",
"team1": { "name": "Egypt", "code": "EGY" },
"team2": { "name": "Uruguay", "code": "URU" },
"score1": 0,
"score2": 1,
"score1i": 0,
"score2i": 0,
"goals1": [],
"goals2": [
{ "name": "Giménez", "minute": 89, "score1": 0, "score2": 1 }
],
"group": "Group A",
"stadium": { "key": "ekaterinburg", "name": "Ekaterinburg Arena" },
"city": "Ekaterinburg",
"timezone": "UTC+5"
},
...
],
},
],
}
As noted previously, world_cup_data
is a dictionary with two keys, 'name' and 'rounds'.
# Run this cell without changes
world_cup_data.keys()
The value associated with the 'name' key is simply identifying the dataset.
# Run this cell without changes
world_cup_data["name"]
The value associated with the 'rounds' key is a list containing all of the actual information about the rounds and the matches within those rounds.
# Run this cell without changes
rounds = world_cup_data["rounds"]
print("type(rounds):", type(rounds))
print("len(rounds):", len(rounds))
print("type(rounds[3])", type(rounds[3]))
print("rounds[3]:")
rounds[3]
Translating this output into English:
Starting with the original world_cup_data
dictionary, we used the key "rounds"
to extract a list of rounds, which we assigned to the variable rounds
.
rounds
is a list of dictionaries. Each dictionary inside of rounds
contains a name (e.g. "Matchday 4"
) as well as a list of matches.
Now we can go one level deeper and extract all of the matches in the tournament. Because the round is irrelevant for this analysis, we can loop over all rounds and combine all of their matches into a single list.
Hint: This is a good use case for using the .extend
list method rather than .append
, since we want to combine several lists of dictionaries into a single list of dictionaries, not a list of lists of dictionaries. Documentation here.
# Replace None with appropriate code
matches = []
# "round" is a built-in function in Python so we use "round_" instead
for round_ in rounds:
# Extract the list of matches for this round
round_matches = None
# Add them to the overall list of matches
None
matches[0]
Make sure the assert
s pass before moving on to the next step.
# Run this cell without changes
# There should be 64 matches. If the length is 20, that means
# you have a list of lists instead of a list of dictionaries
assert len(matches) == 64
# Each match in the list should be a dictionary
assert type(matches[0]) == dict
Each match has a team1
and a team2
.
# Run this cell without changes
print(matches[0]["team1"])
print(matches[0]["team2"])
Create a list of all unique team names by looping over every match in matches
and adding the "name"
values associated with both team1
and team2
. (Same as before when creating a list of matches, it doesn't matter right now whether a given team was "team1" or "team2", we just add everything to teams
.)
We'll use a set
data type (documentation here) to ensure unique teams, then convert it to a sorted list at the end.
# Replace None with appropriate code
teams_set = set()
for match in matches:
# Add team1 name value to teams_set
None
# Add team2 name value to teams_set
None
teams = sorted(list(teams_set))
print(teams)
Make sure the assert
s pass before moving on to the next step.
# Run this cell without changes
# teams should be a list, not a set
assert type(teams) == list
# 32 teams competed in the 2018 World Cup
assert len(teams) == 32
# Each element of teams should be a string
# (the name), not a dictionary
assert type(teams[0]) == str
Great, step 1 complete! We have unique identifiers (names) for each of our records (countries) that we will be able to use to connect 2018 World Cup performance to 2018 population.
Create a data structure that connects a team name (country name) to its performance in the 2018 FIFA World Cup. We'll use the count of games won in the entire tournament (group stage as well as knockout stage) to represent the performance.
Also, create visualizations to help the reader understand the distribution of games won and the performance of each team.
So, we are building a data structure that connects a country name to the number of wins. There is no universal correct format for a data structure with this purpose, but we are going to use a format that resembles the "dataframe" format that will be introduced later in the course.
Specifically, we'll build a dictionary where each key is the name of a country, and each value is a nested dictionary containing information about the number of wins and the 2018 population.
The final result will look something like this:
{
'Argentina': { 'wins': 1, 'population': 44494502 },
...
'Uruguay': { 'wins': 4, 'population': 3449299 }
}
For the current step (step 2), we'll build a data structure that looks something like this:
{
'Argentina': { 'wins': 1 },
...
'Uruguay': { 'wins': 4 }
}
Start by initializing a dictionary called combined_data
containing:
- Keys: the strings from
teams
- Values: each value the same, a dictionary containing the key
'wins'
with the associated value0
. However, note that each value should be a distinct dictionary object in memory, not the same dictionary linked as a value in multiple places.
Initially combined_data
will look something like this:
{
'Argentina': { 'wins': 0 },
...
'Uruguay': { 'wins': 0 }
}
# Replace None with appropriate code
# Create the variable combined_data as described above
None
Check that the assert
s pass.
# Run this cell without changes
# combined_data should be a dictionary
assert type(combined_data) == dict
# the keys should be strings
assert type(list(combined_data.keys())[0]) == str
# the values should be dictionaries
assert combined_data["Japan"] == {"wins": 0}
Now it's time to revisit the matches
list from earlier, in order to associate a team with the number of times it has won a match.
This time, let's write some functions to help organize our logic.
Write a function find_winner
that takes in a match
dictionary, and returns the name of the team that won the match. Recall that a match is structured like this:
{
'num': 1,
'date': '2018-06-14',
'time': '18:00',
'team1': { 'name': 'Russia', 'code': 'RUS' },
'team2': { 'name': 'Saudi Arabia', 'code': 'KSA' },
'score1': 5,
'score2': 0,
'score1i': 2,
'score2i': 0,
'goals1': [
{ 'name': 'Gazinsky', 'minute': 12, 'score1': 1, 'score2': 0 },
{ 'name': 'Cheryshev', 'minute': 43, 'score1': 2, 'score2': 0 },
{ 'name': 'Dzyuba', 'minute': 71, 'score1': 3, 'score2': 0 },
{ 'name': 'Cheryshev', 'minute': 90, 'offset': 1, 'score1': 4, 'score2': 0 },
{ 'name': 'Golovin', 'minute': 90, 'offset': 4, 'score1': 5, 'score2': 0 }
],
'goals2': [],
'group': 'Group A',
'stadium': { 'key': 'luzhniki', 'name': 'Luzhniki Stadium' },
'city': 'Moscow',
'timezone': 'UTC+3'
}
The winner is determined by comparing the values associated with the 'score1'
and 'score2'
keys. If score 1 is larger, then the name associated with the 'team1'
key is the winner. If score 2 is larger, then the name associated with the 'team2'
key is the winner. If the values are the same, there is no winner, so return None
. (Unlike the group round of the World Cup, we are only counting wins as our "performance" construct, not 3 points for a win and 1 point for a tie.)
# Replace None with appropriate code
def find_winner(match):
"""
Given a dictionary containing information about a match,
return the name of the winner (or None in the case of a tie)
"""
None
# Run this cell without changes
assert find_winner(matches[0]) == "Russia"
assert find_winner(matches[1]) == "Uruguay"
assert find_winner(matches[2]) == None
Now that we have this helper function, loop over every match in matches
, find the winner, and add 1 to the associated count of wins in combined_data
. If the winner is None
, skip adding it to the dictionary.
# Replace None with appropriate code
for match in matches:
# Get the name of the winner
winner = None
# Only proceed to the next step if there was
# a winner
if winner:
# Add 1 to the associated count of wins
None
# Visually inspect the output to ensure the wins are
# different for different countries
combined_data
While we could try to understand all 32 of those numbers just by scanning through them, let's use some descriptive statistics and data visualizations instead!
The code below calculates the mean, median, and standard deviation of the number of wins. If it doesn't work, that is an indication that something went wrong with the creation of the combined_data
variable, and you might want to look at the solution branch and fix your code before proceeding.
# Run this cell without changes
import numpy as np
wins = [val["wins"] for val in combined_data.values()]
print("Mean number of wins:", np.mean(wins))
print("Median number of wins:", np.median(wins))
print("Standard deviation of number of wins:", np.std(wins))
In addition to those numbers, let's make a histogram (showing the distributions of the number of wins) and a bar graph (showing the number of wins by country).
# Run this cell without changes
import matplotlib.pyplot as plt
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))
fig.set_tight_layout(True)
# Histogram of Wins and Frequencies
ax1.hist(x=wins, bins=range(8), align="left", color="green")
ax1.set_xticks(range(7))
ax1.set_xlabel("Wins in 2018 World Cup")
ax1.set_ylabel("Frequency")
ax1.set_title("Distribution of Wins")
# Horizontal Bar Graph of Wins by Country
ax2.barh(teams[::-1], wins[::-1], color="green")
ax2.set_xlabel("Wins in 2018 World Cup")
ax2.set_title("Wins by Country");
Before we move to looking at the relationship between wins and population, it's useful to understand the distribution of wins alone. A few notes of interpretation:
- The number of wins is skewed and looks like a negative binomial distribution, which makes sense conceptually
- The "typical" value here is 1 (both the median and the highest point of the histogram), meaning a typical team that qualifies for the World Cup wins once
- There are a few teams we might consider outliers: Belgium and France, with 6x the wins of the "typical" team and 1.5x the wins of the next "runner-up" (Uruguay, with 4 wins)
- This is a fairly small dataset, something that becomes more noticeable with such a "spiky" (not smooth) histogram
Add to the existing data structure so that it also connects each country name to its 2018 population, and create visualizations comparable to those from step 2.
Now we're ready to add the 2018 population to combined_data
, finally using the CSV file!
Recall that combined_data
currently looks something like this:
{
'Argentina': { 'wins': 1 },
...
'Uruguay': { 'wins': 4 }
}
And the goal is for it to look something like this:
{
'Argentina': { 'wins': 1, 'population': 44494502 },
...
'Uruguay': { 'wins': 4, 'population': 3449299 }
}
To do that, we need to extract the 2018 population information from the CSV data.
Recall that previously we loaded information from a CSV containing population data into a list of dictionaries called population_data
.
# Run this cell without changes
len(population_data)
12,695 is a very large number of rows to print out, so let's look at some samples instead.
# Run this cell without changes
np.random.seed(42)
population_record_samples = np.random.choice(population_data, size=10)
population_record_samples
There are 2 filtering tasks, 1 data normalization task, and 1 type conversion task to be completed, based on what we can see in this sample. We'll walk through each of them below.
(In a more realistic data cleaning environment, you most likely won't happen to get a sample that demonstrates all of the data cleaning steps needed, but this sample was chosen carefully for example purposes.)
We already should have suspected that this dataset would require some filtering, since there are 32 records in our current combined_data
dataset and 12,695 records in population_data
. Now that we have looked at this sample, we can identify 2 features we'll want to use in order to filter down the population_data
records to just 32. Try to identify them before looking at the answer below.
.
.
.
Answer: the two features to filter on are 'Country Name'
and 'Year'
. We can see from the sample above that there are countries in population_data
that are not present in combined_data
(e.g. Malta) and there are years present that are not 2018.
In the cell below, create a new variable population_data_filtered
that only includes relevant records from population_data
. Relevant records are records where the country name is one of the countries in the teams
list, and the year is "2018".
(It's okay to leave 2018 as a string since we are not performing any math operations on it, just make sure you check for "2018"
and not 2018
.)
# Replace None with appropriate code
population_data_filtered = []
for record in population_data:
# Add record to population_data_filtered if relevant
None
len(population_data_filtered) # 27
Hmm...what went wrong? Why do we only have 27 records, and not 32?
Did we really get a dataset with 12k records that's missing 5 of the data points we need?
Let's take a closer look at the population data samples again, specifically the third one:
# Run this cell without changes
population_record_samples[2]
And compare that with the value for Iran in teams
:
# Run this cell without changes
teams[13]
Ohhhh...we have a data normalization issue! One dataset refers to this country as 'Iran, Islamic Rep.'
, while the other refers to it as 'Iran'
. This is a common issue we face when using data about countries and regions, where there is no universally-accepted naming convention.
Sometimes data normalization can be a very, very time-consuming task where you need to find "crosswalk" data that can link the two formats together, or you need to write advanced regex formulas to line everything up.
For this task, there are only 5 missing, so we'll just go ahead and give you a function that makes the appropriate substitutions.
# Run this cell without changes
def normalize_location(country_name):
"""
Given a country name, return the name that the
country uses when playing in the FIFA World Cup
"""
name_sub_dict = {
"Russian Federation": "Russia",
"Egypt, Arab Rep.": "Egypt",
"Iran, Islamic Rep.": "Iran",
"Korea, Rep.": "South Korea",
"United Kingdom": "England"
}
# The .get method returns the corresponding value from
# the dict if present, otherwise returns country_name
return name_sub_dict.get(country_name, country_name)
# Example where normalized location is different
print(normalize_location("Russian Federation"))
# Example where normalized location is the same
print(normalize_location("Argentina"))
Now, write new code to create population_data_filtered
with normalized country names.
# Replace None with appropriate code
population_data_filtered = []
for record in population_data:
# Get normalized country name
None
# Add record to population_data_filtered if relevant
if None:
# Replace the country name in the record
None
# Append to list
None
len(population_data_filtered) # 32
Great, now we should have 32 records instead of 27!
We need to do one more thing before we'll have population data that is usable for analysis. Take a look at this record from population_data_filtered
to see if you can spot it:
# Run this cell without changes
population_data_filtered[0]
Every key has the same data type (str
), including the population value. In this example, it's '44494502'
, when it needs to be 44494502
if we want to be able to compute statistics with it.
In the cell below, loop over population_data_filtered
and convert the data type of the value associated with the "Value"
key from a string to an integer, using the built-in int()
function.
# Replace None with appropriate code
for record in population_data_filtered:
# Convert the population value from str to int
None
# Look at the last record to make sure the population
# value is an int
population_data_filtered[-1]
Check that it worked with the assert statement below:
# Run this cell without changes
assert type(population_data_filtered[-1]["Value"]) == int
Now it's time to add the population data to combined_data
! Recall that the data structure currently looks like this:
# Run this cell without changes
combined_data
The goal is for it to be structured like this:
{
'Argentina': { 'wins': 1, 'population': 44494502 },
...
'Uruguay': { 'wins': 4, 'population': 3449299 }
}
In the cell below, loop over population_data_filtered
and add information about population to each country in combined_data
:
# Replace None with appropriate code
for record in population_data_filtered:
# Extract the country name from the record
country = None
# Extract the population value from the record
population = None
# Add this information to combined_data
None
# Look combined_data
combined_data
Check that the types are correct with these assert statements:
# Run this cell without changes
assert type(combined_data["Uruguay"]) == dict
assert type(combined_data["Uruguay"]["population"]) == int
Let's perform the same analysis for population that we performed for count of wins.
# Run this cell without changes
populations = [val["population"] for val in combined_data.values()]
print("Mean population:", np.mean(populations))
print("Median population:", np.median(populations))
print("Standard deviation of population:", np.std(populations))
# Run this cell without changes
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))
fig.set_tight_layout(True)
# Histogram of Populations and Frequencies
ax1.hist(x=populations, color="blue")
ax1.set_xlabel("2018 Population")
ax1.set_ylabel("Frequency")
ax1.set_title("Distribution of Population")
# Horizontal Bar Graph of Population by Country
ax2.barh(teams[::-1], populations[::-1], color="blue")
ax2.set_xlabel("2018 Population")
ax2.set_title("Population by Country");
- Similar to the distribution of the number of wins, the distribution of population is skewed.
- It's hard to choose a single "typical" value here because there is so much variation.
- The countries with the largest populations (Brazil, Nigeria, and Russia) do not overlap with the countries with the most wins (Belgium, France, and Uruguay)
Choose an appropriate statistical measure to analyze the relationship between population and performance, and create a visualization representing this relationship.
So far we have learned about only two statistics for understanding the relationship between variables: covariance and correlation. We will use correlation here, because that provides a more standardized, interpretable metric.
# Run this cell without changes
np.corrcoef(wins, populations)[0][1]
In the cell below, interpret this number. What direction is this correlation? Is it strong or weak?
# Replace None with appropriate code
"""
None
"""
A scatter plot is he most sensible form of data visualization for showing this relationship, because we have two dimensions of data, but there is no "increasing" variable (e.g. time) that would indicate we should use a line graph.
# Run this cell without changes
# Set up figure
fig, ax = plt.subplots(figsize=(8, 5))
# Basic scatter plot
ax.scatter(
x=populations,
y=wins,
color="gray", alpha=0.5, s=100
)
ax.set_xlabel("2018 Population")
ax.set_ylabel("2018 World Cup Wins")
ax.set_title("Population vs. World Cup Wins")
# Add annotations for specific points of interest
highlighted_points = {
"Belgium": 2, # Numbers are the index of that
"Brazil": 3, # country in populations & wins
"France": 10,
"Nigeria": 17
}
for country, index in highlighted_points.items():
# Get x and y position of data point
x = populations[index]
y = wins[index]
# Move each point slightly down and to the left
# (numbers were chosen by manually tweaking)
xtext = x - (1.25e6 * len(country))
ytext = y - 0.5
# Annotate with relevant arguments
ax.annotate(
text=country,
xy=(x, y),
xytext=(xtext, ytext)
)
Interpret this plot in the cell below. Does this align with the findings from the statistical measure (correlation), as well as the map shown at the beginning of this lab (showing the best results by country)?
# Replace None with appropriate text
"""
None
"""
What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?
Overall, we found a very weakly positive relationship between the population of a country and their performance in the 2018 FIFA World Cup, as demonstrated by both the correlation between populations and wins, and the scatter plot.
In the cell below, write down your thoughts on these questions:
- What are your thoughts on why you may see this result?
- What would you research next?
# Replace None with appropriate text
"""
None
"""
Congratulations! That was a long lab, pulling together a lot of material. You read data into Python, extracted the relevant information, cleaned the data, and combined the data into a new format to be used in analysis. While we will continue to introduce new tools and techniques, these essential steps will be present for the rest of your data science projects from here on out!