UQ Library 2024-09-17
- Essential shortcuts
- Open RStudio
- Disclaimer
- What are we going to learn?
- Setting up
- Import data
- Explore data visually
- Aesthetics available
- Modifying scales
- Zooming in
- Interactive plots
- Histograms
- Faceting
- Theming
- A more refined facetted example
- Boxplots
- Summarise data and plot
- Plot from multiple summarised dataframes
- Close project
- Useful links
- function or dataset help: press F1 with your cursor anywhere in a function name.
- execute from script: Ctrl + Enter
- assignment operator (
<-
): Alt + -
On library computers:
- Log in with your UQ username and password (use your student credentials if you are both staff and student)
- Make sure you have a working internet connection
- Go to search the magnifying glass (bottom left)
- Open the ZENworks application
- Look for the letter R
- Double click on RStudio which will install both R and RStudio
If you are using your own laptop:
- Make sure you have a working internet connection
- Open RStudio
We will assume you are an R intermediate user and that you have used ggplot2 before.
During this hands-on session, you will:
- install a tool for picking colours
- customise scales and ranges
- divide a visualisation into facets
- explore new geometries
- modify statistical transformations
- adjust a geometry’s position
- further modify themes
- make a plot interactive
Install ggplot2 if you don’t already have it, with:
install.packages("ggplot2")
Create a new project to keep everything nicely contained in one directory:
- Click the “Create a project” button (top left cube icon)
- Click “New Directory”
- Click “New Project” (“Empty project” if you have an older version of RStudio)
- In “Directory name”, type the name of your project, e.g. “ggplot2_intermediate”
- Select the folder where to locate your project:
e.g.
Documents/RProjects
, which you can create if it doesn’t exist yet. You can use your H drive at UQ to make sure you can find it again. - Click the “Create Project” button
Let’s also create a “plots” folder to store exports:
dir.create("plots")
Create a new script (File > New File > R Script) and add a few comments to give context:
# Description : ggplot2 intermediate with gapminder data
# Author: <your name>
# Date: <today's date>
Finally, make sure you load ggplot2 so we can use its functions:
library(ggplot2)
Our data is located at https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv
Using the following syntax, how can you read the online CSV data into an R object?
gapminder <- ...
You have to use the read.csv()
function, which can take a URL:
gapminder <- read.csv(
file = "https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv")
If you are not familiar with the dataset, View()
and summary()
can
help you explore it.
View(gapminder) # view as a separate spreadsheet
summary(gapminder) # summary statistics for each variable
The Environment pane gives you an overview of the variables.
Let’s start with a question: how do Gross Domestic Product (GDP) and life expectancy relate?
We can make a simple plot with the basics of ggplot2:
ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp)) +
geom_point()
Remember that the 3 main elements of a ggplot2 visualisation are:
- the data
- the mapping of aesthetics to variables
- the geometry
So far we have been using the x
and y
aesthetics. There are more
available, depending on the geometry that you are using.
Here are some common examples:
- To change the shape based on a variable, use
shape = <discrete variable>
inside theaes()
call. - If you want to change the size of the geometric object, you can use
the
size = <continuous variable>
argument. - Similarly, to change the colour based on a variable, use
colour = <variable>
andfill = <variable>
inside theaes()
call.
Let’s modify our plot to colour the points according to the continent variable.
ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp)) +
geom_point(aes(colour = continent)) +
geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
How can we save our scatter plot to the ‘plots’ directory we created using a function?
ggsave(filename = "plots/gdpPercap_v_lifeExp.png", width = 7, height = 4)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Remember that the
ggsave
function saves the last plot.
This plot uses the default discrete palette.
Saving some typing: We will keep modifying this plot. To reuse the constant base of our plot (the
ggplot()
call and the point geometry), we can create an object:
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp)) +
geom_point(aes(colour = continent)) +
geom_smooth()
We can use other palettes than the default one. ggplot2 provides extra functions to modify colour scales. For example:
p +
scale_colour_viridis_d()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Viridis is a collection of palettes that are designed to be accessible
(i.e. perceptually uniform in colour or black and white, and perceivable
for various forms of colour blindness). The structure of the function
name is scale_<aesthetic>_viridis_<datatype>()
, the different data
types being discrete, continuous or binned.
Another collection of palettes is the ColorBrewer collection:
p +
scale_colour_brewer(palette = "Set1")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
You can see the palettes available by looking at the help page of
scale_colour_brewer()
, under the header “Palettes”. However, the names
alone might not be enough to picture them, so head to
http://colorbrewer2.org/ to find the one that you like. Importantly,
the website allows you to tick the options “colorblind safe” and “print
friendly”… which would rule out all the qualitative palettes for our 5
continents!
A useful package that introduces many palettes for ggplot2 is the colorspace package, which promotes the Hue-Chroma-Luminance (HCL) colour space. This colour space is perceptually-based, which means it is particularly suited for human perception of colours.
For colorspace, the function names are structured as follows:
scale_<aesthetic>_<datatype>_<colorscale>()
Let’s first use an alternative qualitative palette:
library(colorspace)
p +
scale_colour_discrete_qualitative()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
This is the default ggplot2 palette! Which means ggplot2 already use a HCL palette. But having the colorspace package loaded, we can now see all the palettes available:
hcl_palettes(plot = TRUE)
Let’s try a different one:
p +
scale_colour_discrete_sequential("Batlow")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Finally, to use a custom palette, we can use the ggplot2 function
scale_colour_manual()
and provide a list of colour names.
p +
scale_colour_manual(values = c("lightblue", "pink", "purple", "black", "red"))
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
You can list all the R colour names with the function colours()
, which
prints out a list of their names, but know that you are not limited to
these 657 values: you can also use HEX values, which is particularly
useful if you have to respect a colour scheme you were given.
You can find visual lists of all the R colours, but there is a way to pick colours more comfortably: we can use the colourpicker package, which adds a handy add-in to RStudio. Install it and use the new “Addins > Colour Picker” tool to create a vector of colours for your custom palette.
That was a lot of options about colours, but know that in some cases, the most straight-forward way to make your visualisations readable by most is to use symbols instead of colours. In ggplot2, you would use the
shape
aesthetic.
We could further modify our plot to make it more readable. For example, we can use a different x axis scale to distribute the data differently:
p +
scale_x_log10()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
We can also further customise a scale with breaks and labels:
p +
scale_x_log10(breaks = c(5e2, 1e3, 1e4, 1e5),
labels = c("500", "1 k", "10 k", "100 k"))
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
You can use the scientific notation
1e5
to mean “a 1 followed by 5 zeros”.
We might want to focus on the left hand side part of our original plot:
p
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
To zoom in, we might want to change our axis limits by using ylim()
.
p +
xlim(c(0, 6e4))
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).
Notice the warning message? ggplot2 informs us that it couldn’t represent part of the data because of the axis limits.
The method we use works for our point geometry, but is problematic for other shapes that could disappear entirely or change their appearance because they are based on different data: we are actually clipping our visualisation! Notice how the trend line now looks different?
A better way to focus on one part of the plot would be to modify the coordinate system:
p +
coord_cartesian(xlim = c(0, 6e4))
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The Cartesian coordinate system is the default one in ggplot2. You could
change the coordinate system to coord_polar()
for circular
visualisations, or to coord_map()
to visualise spatial data.
The plotly package brings the power of the Plotly javascript library to
R. Install it with install.packages(plotly)
, and you’ll then be able
to convert a ggplot2 visualisation into an interactive HTML
visualisation with one single function!
Let’s reuse our original plot object, and feed it to ggplotly()
:
library(plotly)
ggplotly(p)
You can now identify single points, zoom into your plot, and show/hide
categories. However, the pop-up does not tell us which country the point
corresponds to. That’s because we don’t mention the country
variable
in our code. Let’s reveal that information by slightly modifying the p
object:
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
label = country)) + # one extra aesthetic for an extra variable
geom_point(aes(colour = continent)) +
geom_smooth()
The interactive version now tells us which country each point correspond to:
ggplotly(p)
You can export the visualisation as a HTML page for sharing with others.
Search for the histogram geometry function, and plot the life expectancy. How can we modify the bars?
ggplot(gapminder, aes(x = lifeExp)) +
geom_histogram() # by default, bins = 30
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Saving some typing: remember we can omit the names of the arguments if we use them in order? Being explicit about the argument names is useful when learning the ins and outs of a function, but as you get more familiar with ggplot2, you can do away with the obvious ones, like
data =
andmapping =
(as long as they are used in the right order!).
Let’s change the bin width:
ggplot(gapminder, aes(x = lifeExp)) +
geom_histogram(binwidth = 1)
Here, each bar contains one year of life expectancy.
We can also change the number of bins:
ggplot(gapminder, aes(x = lifeExp)) +
geom_histogram(bins = 10)
Now, let’s colour the bins by continent. Instinctively, you could try
the colour
aesthetic:
ggplot(gapminder, aes(x = lifeExp, colour = continent)) +
geom_histogram(bins = 10)
…but it only colours the outline of the rectangles!
Some aesthetics will work better with some geometries than others. We
have to use the fill
aesthetic to colour the areas instead:
ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
geom_histogram(bins = 10)
Colouring our bins allows us to experiment with the geometry’s
position. The histogram geometry uses the “stack” position by
default. It might convey different information if we make it use ratios
instead, using the position = "fill"
argument:
ggplot(gapminder,
aes(x = lifeExp,
fill = continent)) +
geom_histogram(bins = 10,
position = "fill")
You might have noticed that the y-axis is still labeled ‘count’ when it has changed to a proportion. We can modify the y-axis labels to percent using the scales package which is loaded with the ggplot2 package.
We can also modify the y-axis labels with the labs()
function.
library(scales)
ggplot(gapminder,
aes(x = lifeExp,
fill = continent)) +
geom_histogram(bins = 10,
position = "fill") +
scale_y_continuous(labels = scales::percent) +
labs(y = "Percent")
We can also make the bars “dodge” each other:
ggplot(gapminder,
aes(x = lifeExp,
fill = continent)) +
geom_histogram(bins = 10,
position = "dodge")
An even more readable representation could use faceting:
ggplot(gapminder,
aes(x = lifeExp,
fill = continent)) +
geom_histogram(bins = 40) +
facet_wrap(vars(continent))
We have to wrap the variable(s) we want to facet by into the vars()
function.
Faceting is a great way to add yet another variable to your visualisation, instead of using another aesthetic.
The legend is probably superfluous. We want to keep the colours, but we
use the theme()
function to customise the look of our plot and remove
the legend:
ggplot(gapminder,
aes(x = lifeExp,
fill = continent)) +
geom_histogram(bins = 40) +
facet_wrap(~ continent) +
theme(legend.position = "none")
If you use a pre-built theme function, make sure you place it before customising the legend. Otherwise it will bring the legend back!
ggplot(gapminder,
aes(x = lifeExp,
fill = continent)) +
geom_histogram(bins = 40) +
facet_wrap(~ continent) +
theme_minimal() + # before customising the legend!
theme(legend.position = "none")
This extra example gives an idea of how a refined ggplot2 visualisation might be constructed. It represents 4 different variables, using the larger diamonds data set.
ggplot(diamonds,
aes(x = carat,
y = price)) +
geom_point(aes(colour = color),
alpha = 0.5,
size = 0.5) +
scale_color_brewer(palette = "Spectral") +
geom_smooth(se = FALSE,
linetype = "dashed",
colour = "black",
size = 0.5) +
facet_wrap(vars(cut)) +
theme_minimal() +
labs(y = "price (USD)")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
In this visualisation:
- 4 different variables are represented, thanks to both aesthetics and facets
- two geometries are layered on top of each other to represent a relationship
- both geometries are customised to make the plot readable (important here, since there are close to 54000 rows of data)
- the default the colour scale is replaced
- a built-in theme is used
- a label clarifies the unit of measurement
A simple boxplot can help visualise a distribution in categories:
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
geom_boxplot()
What do you think this extra line might do to our boxplots?
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90))
This is useful if the x labels get too cramped on the x axis: you can rotate them to whatever angle you want.
Try turning this plot into an interactive visualisation to see stats easily!
Let’s try summarising the average and standard deviation of life expectancy by continent from the gapminder data and piping it directly into a ggplot. We will need to install/load dplyr for this.
# install.packages("dplyr")
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
gapminder %>%
group_by(continent) %>%
summarise(aveLE = mean(lifeExp),
sdLE = sd(lifeExp)) %>%
ggplot(aes(x = continent,
y = aveLE)) +
geom_bar(stat = "identity",
fill = "tomato") +
geom_errorbar(aes(ymin = aveLE - sdLE,
ymax = aveLE + sdLE),
width = 0.1) +
labs(x = "Continent",
y = "Life Expectancy in Years")
In ggplot2, multiple data frames can be plotted on the same plot. For example, let’s say we wanted to make a bar graph of total population by year with the gapminder data set colored by continent and also include the standard deviation of the total population per year at the top of each column as an error bar? This may or may not make the most sense in terms of representing the data, but we will do it here as an exercise.
- summarise the data by year to calculate the standard deviation
- summarise the data by continent and year for the filled columns
- plot the continent data and add the error bars from the first summary
total <- gapminder %>%
group_by(year) %>% # group by year
summarise(tot = sum(pop), # sum the population for every year
SD = sd(pop))
total
# A tibble: 12 × 3
year tot SD
<int> <dbl> <dbl>
1 1952 2406957151. 58100863.
2 1957 2664404580 65504285.
3 1962 2899782974 69788650.
4 1967 3217478384 78375481.
5 1972 3576977158 88646817.
6 1977 3930045807 97481091.
7 1982 4289436840 105098650.
8 1987 4691477418 114756180.
9 1992 5110710260 124502589.
10 1997 5515204472 133417391.
11 2002 5886977579 140848283.
12 2007 6251013179 147621398.
cont_ave <- gapminder %>%
group_by(continent, year) %>%
summarise(totalpop = sum(pop)) %>%
ungroup()
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
cont_ave
# A tibble: 60 × 3
continent year totalpop
<chr> <int> <dbl>
1 Africa 1952 237640501
2 Africa 1957 264837738
3 Africa 1962 296516865
4 Africa 1967 335289489
5 Africa 1972 379879541
6 Africa 1977 433061021
7 Africa 1982 499348587
8 Africa 1987 574834110
9 Africa 1992 659081517
10 Africa 1997 743832984
# ℹ 50 more rows
When specifying a data set outside of the main ggplot()
function - the
data =
argument must be used. The other functions geom_bar()
etc.
assume the first argument is mapping = aes()
unless the data argument
is explicitly defined.
ggplot(data = cont_ave, aes(x = year,
y = totalpop)) +
geom_bar(aes(fill = continent),
stat = "identity") + # default is `stat = count` like a histogram
geom_errorbar(data = total, # must use `data = ` to specify a new data set being used
aes(y = tot, # the y from the ggplot(aes()) is inherited, update
ymin = tot - SD, # error bar arguments
ymax = tot + SD),
width = 0.8) +
scale_x_continuous(breaks = seq(from = 1952, to = 2007, by = 5)) + # use seq to get years from 1952 - 2007 every 5 yrs to label every column
scale_y_continuous(breaks = c(2e9, 4e9, 6e9), # keep the same breaks
labels = c("2", "4", "6")) + # relabel so number represent billions
labs(x = "Year",
y = "Population in Billions", # rename units in billions
fill = "Continent") + # relabel legend from the 'fill' in the geom_bar
theme_dark() + # different theme
theme(panel.grid = element_blank()) # remove the grid lines
Closing RStudio will ask you if you want to save your workspace and scripts. Saving your workspace is usually not recommended if you have all the necessary commands in your script.
- For ggplot2:
- ggplot2 cheatsheet
- Official ggplot2 documentation
- Official ggplot2 website
- Chapter on data visualisation in the book R for Data Science
- From Data to Viz, a website to explore different visualisations and the code that generates them
- Selva Prabhakaran’s r-statistics.co section on ggplot2
- Coding Club’s data visualisation tutorial
- Cookbook for R graphs
- STHDA’s ggplot2 essentials
- Our compilation of general R resources