-
Notifications
You must be signed in to change notification settings - Fork 5
/
CompareClusters.Rmd
145 lines (126 loc) · 4.57 KB
/
CompareClusters.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
---
# Copyright 2017 Verily Life Sciences Inc.
#
# Use of this source code is governed by a BSD-style
# license that can be found in the LICENSE file.
title: "Compare Clusters"
output:
html_document:
toc: yes
params:
PROJECT_ID: "PROJECT_ID"
DATASET_DESCRIPTION: "Brief description of the single-cell dataset."
# List of markers for which to compare cluster-specific expression.
MARKER_GENE_LIST: "'MARKER_GENE1b','MARKER_GENE2b','MARKER_GENE3b"
# These tables must exist.
RAW_DATA_TABLE: "PROJECT_ID_THE_DATA_IS_IN.DATASET_NAME.TABLE_NAME"
CLUSTER_TABLE: "PROJECT_ID_THE_DATA_IS_IN.DATASET_NAME.TABLE_NAME"
# This RMarkdown is a parameterized report. See
# http://rmarkdown.rstudio.com/developer_parameterized_reports.html
# for more detail.
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Whether or not to cache chunk results for faster rendering when tweaking
# narrative and plots.
knitr::opts_chunk$set(cache=FALSE)
# Whether or not to emit progress messages from bigrquery.
options("bigrquery.quiet"=TRUE)
```
This report performs compares expression of marker genes across clusters
on dataset: `r params$DATASET_DESCRIPTION`
```{r}
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(bigrquery)
```
There are many ways to facilitate templated queries. Here we use Python via
[reticulate](https://github.com/rstudio/reticulate) and
[Jinja2](http://jinja.pocoo.org/docs/2.9/). As another example, see
[this R approach](https://github.com/googlegenomics/codelabs/blob/3a0a1b754e78bc159a9c6deb604a60344034cc2a/R/PlatinumGenomes-QC/rHelpers/setup.R).
```{r helper, cache=FALSE}
library(reticulate)
jinja = import("jinja2")
# If you get an error, in the shell run:
# pip install jinja2
py = import_builtins()
perform_bqquery = function(sql_path, ...) {
sql = jinja$Template(py$open(sql_path, "r")$read())$render(params)
cat(sql)
query_exec(sql, use_legacy_sql = FALSE, project=params$PROJECT_ID, ...)
}
```
## Retrieve cluster cell counts.
```{r comment=NA}
cluster_cell_counts = perform_bqquery(
sql_path = "cluster_cell_counts.sql")
dim(cluster_cell_counts)
```
```{r results='asis'}
knitr::kable(head(cluster_cell_counts))
```
## Plot the counts.
```{r cluster_cell_count_plot, fig.align="center"}
# Order the counts from least to greatest.
cluster_cell_counts$cluster = factor(cluster_cell_counts$cluster,
dplyr::arrange(cluster_cell_counts, desc(cnt))$cluster)
ggplot(cluster_cell_counts, aes(y=cnt, x=cluster)) +
geom_point() +
scale_y_continuous(labels=comma) +
xlab("cluster") +
ylab("number of cells") +
ggtitle("Cell counts per cluster")
```
## Retrieve the aggregate expression per cluster.
```{r comment=NA}
cluster_gene_expression = perform_bqquery(
sql_path = "gene_expression_by_cluster.sql")
dim(cluster_gene_expression)
```
```{r results='asis'}
knitr::kable(head(cluster_gene_expression))
```
## Cap some of the upper values.
```{r}
cluster_gene_expression$cluster = as.factor(cluster_gene_expression$cluster)
cluster_gene_expression$gene = as.factor(cluster_gene_expression$gene)
capped_cluster_expression = dplyr::mutate(
cluster_gene_expression,
capped_avg_trans_cnt = pmin(avg_trans_cnt, 3),
capped_perc_expr = pmin(perc_expr, 0.9))
```
## Set missing values to zero for a better-looking plot.
When we [loaded the data into BigQuery](../data_loading), we omitted any
zero-valued transcript counts. Here we set those missing values to an expression
rate of zero percent to improve the visualization.
```{r}
all_values =
tidyr::complete(capped_cluster_expression,
gene,
cluster,
fill=list(capped_perc_expr=0.0))
```
## Plot the data.
```{r retinal_dotplot, fig.align="center", fig.width=10, fig.height=10}
ggplot(all_values,
aes(y = cluster, x = gene)) +
geom_point(aes(colour = capped_avg_trans_cnt,
size = capped_perc_expr)) +
scale_color_gradient(low ="blue",
high = "red",
limits = c(1,
max(capped_cluster_expression$capped_avg_trans_cnt) ),
name = "Transcript\nCount\n(Capped\n3.0)") +
scale_size(range = c(0, 10),
labels=percent_format(),
name = "Percent\nCells\nExpressing\nTranscript\n(Capped\n90%)") +
ylab("Cluster") +
xlab("Gene") +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(axis.text.x=element_text(size=12, face="italic", angle=45, hjust=1)) +
theme(axis.text.y=element_text(size=12, face="italic"))
```