Merge pull request #17 from apawlik/lesson_layout

Applied lesson layout
datacarpentry · Aug 14, 2015 · 7906e52 · 7906e52
2 parents 91a7df9 + 1e6a214
commit 7906e52
Showing 1 changed file with 294 additions and 0 deletions.
diff --git a/05-visualisation-ggplot2.Rmd b/05-visualisation-ggplot2.Rmd
@@ -0,0 +1,294 @@
+---
+layout: topic
+title: Data visualisation with ggplot2
+subtitle: Visualising data in R with ggplot2 package
+minutes: 60
+---
+
+<!---
+show hide magic
+
+<style> div.hidecode + pre {display: none} div.hidecode {color: #337ab7}</style><script> doclick=function(e){ e.nextSibling.nextSibling.style.display = e.nextSibling.nextSibling.style.display === "block" ? "none" : "block"; }</script>
+-->
+
+```{r}
+knitr::opts_chunk$set(fig.keep='last')
+```
+
+```{r setup, echo=FALSE, purl=FALSE}
+source("setup.R")
+```
+
+Authors: **Mateusz Kuzak**, **Diana Marek**, **Hedi Peterson**
+
+
+
+#### Disclaimer
+
+We will here using functions of ggplot2 package. There are basic ploting
+capabilities in basic R, but ggplot2 adds more powerful plotting capabilities.
+
+> ### Learning Objectives
+>
+> -	Visualise some of the
+>[mammals data](http://figshare.com/articles/Portal_Project_Teaching_Database/1314459)
+>from Figshare [surveys.csv](http://files.figshare.com/1919744/surveys.csv)
+> -	Understand how to plot these data using R ggplot2 package. For more details
+>on using ggplot2 see
+>[official documentation](http://docs.ggplot2.org/current/).
+> -	Building step by step complex plots with ggplot2 package
+
+Load required packages
+
+```{r}
+# plotting package
+library(ggplot2)
+# piping / chaining
+library(magrittr)
+# modern dataframe manipulations
+library(dplyr)
+```
+
+Load data directly from figshare.
+
+```{r}
+surveys_raw <- read.csv("http://files.figshare.com/1919744/surveys.csv")
+```
+
+`surveys.csv` data contains some measurements of the animals caught in plots.
+
+## Data cleaning and preparing for plotting
+
+Let's look at the summary
+
+```{r}
+summary(surveys_raw)
+```
+
+There are few things we need to clean in the dataset.
+
+There is missing species_id in some records. Let's remove those.
+
+```{r}
+surveys <- surveys_raw %>%
+           filter(species_id != "")
+```
+
+There are a lot of species with low counts, let's remove the ones below 10 counts
+
+```{r}
+# count records per species
+species_counts <- surveys %>%
+                  group_by(species_id) %>%
+                  summarise(n=n())
+
+# get names of those frequent species
+frequent_species <- species_counts %>%
+                    filter(n >= 10) %>%
+                    select(species_id)
+
+surveys <- surveys %>%
+           filter(species_id %in% frequent_species$species_id)
+```
+
+We saw in summary, there were NA's in weight and hindfoot_length. Let's remove
+rows with missing weights.
+
+```{r}
+surveys_weight_present <- surveys %>%
+                      filter(!is.na(weight))
+```
+
+> ### Challenge
+>
+> - Do the same to remove rows without `hindfoot_length`. Save results in the new dataframe.
+
+
+```{r}
+surveys_length_present <- surveys %>%
+                      filter(!is.na(hindfoot_length))
+```
+
+- How would you get the dataframe without missing values?
+
+```{r}
+surveys_complete <- surveys_weight_present %>%
+                    filter(!is.na(hindfoot_length))
+```
+
+> We can chain filtering together using pipe operator (`%>%`) introduced earlier.
+
+```{r}
+surveys_complete <- surveys %>%
+                    filter(!is.na(weight)) %>%
+                    filter(!is.na(hindfoot_length))
+```
+
+> Make simple scatter plot of `hindfoot_length` (in millimeters) as a function of
+> `weight` (in grams), using basic R plotting capabilities.
+
+```{r}
+plot(x=surveys_complete$weight, y=surveys_complete$hindfoot_length)
+```
+
+## Plotting with ggplot2
+
+We will make the same plot using `ggplot2` package.
+
+`ggplot2` is a plotting package that makes it sipmple to create complex plots
+from data in a dataframe. It uses default settings, which help creating
+publication quality plotts with minimal amount of settings and tweaking.
+
+With ggplot graphics are build step by step by adding new elements.
+
+To build a ggplot we need to:
+
+- bind plot to a specific data frame
+
+```{r, eval=FALSE}
+ggplot(surveys_complete)
+```
+
+- define aestetics (`aes`), that maps variables in the data to axes on the plot
+     or to plotting size, shape color, etc.,
+
+```{r}
+ggplot(surveys_complete, aes(x = weight, y = hindfoot_length))
+```
+
+- add `geoms` -- graphical representation of the data in the plot (points,
+     lines, bars). To add a geom to the plot use `+` operator:
+
+```{r}
+ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
+  geom_point()
+```
+
+## Modifying plots
+
+- adding transparency (alpha)
+
+```{r}
+ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
+  geom_point(alpha=0.1)
+```
+
+- adding colors
+
+```{r}
+ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
+  geom_point(alpha=0.1, color="blue")
+```
+
+Example of complex visualisation in which plot area is divided into hexagonal
+sections and points are counted wihin hexagons. The number of points per hexagon
+is encoded by color.
+
+```{r}
+ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) + stat_binhex(bins=50) +
+  scale_fill_gradientn(trans="log10", colours = heat.colors(10, alpha=0.5))
+```
+
+## Boxplot
+
+Visualising the distribution of weight within each species.
+
+```{r}
+ggplot(surveys_weight_present, aes(factor(species_id), weight)) +
+                   geom_boxplot()
+```
+
+By adding points to boxplot, we can see particular measurements and the
+abundance of measurements.
+
+```{r}
+ggplot(surveys_weight_present, aes(factor(species_id), weight)) +
+                   geom_jitter(alpha=0.3, color="tomato") +
+                   geom_boxplot(alpha=0)
+```
+
+> ### Challenge
+>
+> Create boxplot for `hindfoot_length`.
+
+## Plotting time series data
+
+Let's calculate number of counts per year for each species. To do that we need
+to group data first and count records within each group.
+
+```{r}
+yearly_counts <- surveys %>%
+                 group_by(year, species_id) %>%
+                 summarise(count=n())
+```
+
+Timelapse data can be visualised as a line plot with years on x axis and counts
+on y axis.
+
+```{r}
+ggplot(yearly_counts, aes(x=year, y=count)) +
+                  geom_line()
+```
+
+Unfortunately this does not work, because we plot data for all the species
+together. We need to tell ggplot to split graphed data by `species_id`
+
+```{r}
+ggplot(yearly_counts, aes(x=year, y=count, group=species_id)) +
+  geom_line()
+```
+
+We will be able to distiguish species in the plot if we add colors.
+
+```{r}
+ggplot(yearly_counts, aes(x=year, y=count, group=species_id, color=species_id)) +
+  geom_line()
+```
+
+## Faceting
+
+ggplot has a special technique called *faceting* that allows to split one plot
+into mutliple plots based on some factor. We will use it to plot one time series
+for each species separately.
+
+```{r}
+ggplot(yearly_counts, aes(x=year, y=count, color=species_id)) +
+  geom_line() + facet_wrap(~species_id)
+```
+
+Now we wuld like to split line in each plot by sex of each individual
+measured. To do that we need to make counts in dataframe grouped by sex.
+
+> ### Challenges:
+>
+> - filter the dataframe so that we only keep records with sex "F" or "M"s
+>
+
+```{r}
+sex_values = c("F", "M")
+surveys <- surveys %>%
+           filter(sex %in% sex_values)
+```
+
+> - group by year, species_id, sex
+
+```{r}
+yearly_sex_counts <- surveys %>%
+                     group_by(year, species_id, sex) %>%
+                     summarise(count=n())
+```
+
+> - make the faceted plot spliting further by sex (within single plot)
+
+```{r}
+ggplot(yearly_sex_counts, aes(x=year, y=count, color=species_id, group=sex)) +
+  geom_line() + facet_wrap(~ species_id)
+```
+
+> We can improve the plot by coloring by sex instead of species (species are
+> already in separate plots, so we don't need to distinguish them better)
+
+```{r}
+ggplot(yearly_sex_counts, aes(x=year, y=count, color=sex, group=sex)) +
+  geom_line() + facet_wrap(~ species_id)
+```