Skip to content

Commit

Permalink
markdown source builds
Browse files Browse the repository at this point in the history
Auto-generated via {sandpaper}
Source  : bfb75e1
Branch  : main
Author  : parkyed <[email protected]>
Time    : 2023-10-16 20:31:17 +0000
Message : Merge pull request #6 from carpentries-incubator/update/packages

Update 33 packages
  • Loading branch information
actions-user committed Oct 16, 2023
1 parent fd85ab9 commit 59dce59
Show file tree
Hide file tree
Showing 17 changed files with 116 additions and 585 deletions.
Binary file added .DS_Store
Binary file not shown.
83 changes: 0 additions & 83 deletions config.yaml

This file was deleted.

10 changes: 6 additions & 4 deletions episode2.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ Now use the sort in the top right of the screen to sort the results by "Samples"

## Illustrative Dataset: IBD Dataset

In the search results displayed above, the top ranked dataset is E-MTAB-11349: *Whole blood expression profiling of patients with inflammatory bowel diseases in the IBD-Character cohort*. We'll call it the [IBD dataset](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-11349).
In the search results displayed above, the top-ranked dataset is E-MTAB-11349: *Whole blood expression profiling of patients with inflammatory bowel diseases in the IBD-Character cohort*. We'll call it the [IBD dataset](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-11349).

The IBD dataset comprises human samples of patients with inflammatory bowel diseases and controls. The 'Protocols' section explains the main steps in the generation of the data, from RNA extraction and sample preparation, to sequencing and processing of the raw RNA-Seq data. The nucleic acid sequencing protocol gives details of the sequencing platform used to generate the raw fastq data, and the version of the human genome used in the alignment step to generate the count data. The normalization data transformation protocol gives the tools used to normalise raw counts data for sequence depth and sample composition, in this case normalisation is conducted using the R package [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html). DESEq2 contains a range of functions for the transformation and analysis of RNA-Seq data. Let's look at some of the basic information on this dataset:

Expand All @@ -90,9 +90,11 @@ The dataset contains two alternative sets of processed data, a matrix of raw cou

## Downloading and Reading into R

Let's download the SDRF file and the raw counts matrix - these are the two files that contain the information we will need to build a machine learning classification model. In the data files box to the right hand side, check the follwoing two files `E-MTAB-11349.sdrf.txt` and `ArrayExpress-raw.csv`, and save in a folder called `data` in the working directory for your R project. For consistency, rename `ArrayExpress-raw.csv` as `E-MTAB-11349.counts.matrix.csv`.
Let's download the SDRF file and the raw counts matrix - these are the two files that contain the information we will need to build a machine-learning classification model. In the data files box to the right-hand side, check the following two files `E-MTAB-11349.sdrf.txt` and `ArrayExpress-raw.csv`, and save them in a folder called `data` in the working directory for your R project. For consistency, rename `ArrayExpress-raw.csv` as `E-MTAB-11349.counts.matrix.csv`.

For convenience, a copy of the files is also stored on zenodo. You can run the following code that uses the function `download.file()` to download the files and save them directly into your `data` directory. (You need to have created the `data` directory beforehand).
For convenience, a copy of the files is also stored on zenodo. You can run the following code that uses the function `download.file()` to download the files and save them directly into your `data` directory. (You need to have created the `data` directory beforehand).

To get or set a working directory in R `getwd` function can be used to return an absolute file path representing the current working directory of the R process; if not correct `setwd` can be used to set the working directory to the desired location.

```r

Expand All @@ -106,7 +108,7 @@ download.file(url = "https://zenodo.org/record/8125141/files/E-MTAB-11349.counts

### Raw counts matrix

In R studio, open your project workbook and read in the raw counts data. Then check the dimensions of the matrix to confirm we have the expected number of samples and transcript IDs.
In R studio, open your project workbook and read the raw counts data. Then check the dimensions of the matrix to confirm we have the expected number of samples and transcript IDs.


```r
Expand Down
21 changes: 14 additions & 7 deletions episode4.md
Original file line number Diff line number Diff line change
Expand Up @@ -164,7 +164,9 @@ sprintf("The unique IDs that match the counts matrix are in column: %s", colname


```r
suppressPackageStartupMessages(library(tidyverse, quietly = TRUE))
# suppressPackageStartupMessages(library(dplyr, quietly = TRUE))
# suppressPackageStartupMessages(library(ggplot2, quietly = TRUE))
# suppressPackageStartupMessages(library(tibble, quietly = TRUE))
```


Expand Down Expand Up @@ -238,9 +240,13 @@ unique(c(samp.info.ibd.sel$sex, samp.info.ibd.sel$condition))

* Both `sex` and `condition` are consistently encoded over all samples. We should however remove the spaces in the values for `sampleID` and for the `condition` value ulcerative colitis. Note: since we are changing our unique identifier, we'll need to make the same update later in the counts matrix.

Before we do this, since we'll be using the pipe operator from the tidyverse libraries, we will to define the pipe operator from the magrittr package in the environment.



```r
`%>%` <- magrittr::`%>%`

samp.info.ibd.sel <- samp.info.ibd.sel %>%
dplyr::mutate(sampleID = gsub(" ", "_", sampleID),
condition = gsub(" ", "_", condition))
Expand All @@ -258,7 +264,7 @@ samp.info.ibd.sel <- samp.info.ibd.sel %>%

```r
samp.info.ibd.sel <- samp.info.ibd.sel %>%
dplyr::mutate(class = if_else(condition == 'normal', -1, 1))
dplyr::mutate(class = dplyr::if_else(condition == 'normal', -1, 1))
```


Expand Down Expand Up @@ -288,7 +294,7 @@ The two classes are approximately equally represented, so let's check everything


```r
glimpse(samp.info.ibd.sel)
dplyr::glimpse(samp.info.ibd.sel)
```

```{.output}
Expand Down Expand Up @@ -367,7 +373,7 @@ counts.mat.ibd <- raw.counts.ibd[-1,-1]

rownames(counts.mat.ibd) <- NULL

counts.mat.ibd <- counts.mat.ibd %>% column_to_rownames('read')
counts.mat.ibd <- counts.mat.ibd %>% tibble::column_to_rownames('read')

counts.mat.ibd[1:10,1:6]
```
Expand Down Expand Up @@ -526,10 +532,11 @@ It is crucial that you document your data reformatting so that it is reproducibl

Take a look at [The Tuburculosis (TB) Dataset - E-MTAB-6845](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-6845). You can download the sdrf file from zenodo and save it to your `data` directory by running the following code.


```r

download.file(url = "https://zenodo.org/record/8125141/files/E-MTAB-6845.sdrf.txt",
destfile = "data/E-MTAB-6845.sdrf.txt")

```

Read the sdrf file into R and take a look at the data. Make a list of the potential issues with this data based on the checklist above? What steps would need to be taken to get them ready to train a machine learning classifier?
Expand Down Expand Up @@ -588,7 +595,7 @@ dplyr::mutate(prog_status = factor(prog_status, levels = c("non_progressor","tb_
dplyr::mutate(prog_status = dplyr::recode(prog_status,
"non-progressor" = "non_progressor",
"TB progressor" = "tb_progressor"),
class = if_else(prog_status == 'non_progressor', -1, 1)
class = dplyr::if_else(prog_status == 'non_progressor', -1, 1)
) %>%

dplyr::distinct() %>%
Expand Down Expand Up @@ -621,7 +628,7 @@ samp.info.tb %>%
dplyr::mutate(prog_status = dplyr::recode(prog_status, # recode target variable
"non-progressor" = "non_progressor",
"TB progressor" = "tb_progressor"),
class = if_else(prog_status == 'non_progressor', -1, 1) # create numerical class
class = dplyr::if_else(prog_status == 'non_progressor', -1, 1) # create numerical class
) %>%

dplyr::mutate(prog_status = factor(prog_status, levels = c("non_progressor","tb_progressor")),
Expand Down
66 changes: 32 additions & 34 deletions episode5.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,15 +21,15 @@ exercises: 2

## Technical Artefacts in RNA-Seq Data

Machine learning classification algorithms are highly sensitive to any feature data characteristic, regardless of scale, that may differ between experimental groups, and will exploit these data characteristic differences when training a model. Given this, it is important to make sure that data input into a machine learning model reflects true biological signal, and not technical artefacts or noise that stems from the experimental process used to generate the data. In simple terms, we need to remove things are aren't "real biological information".
Machine learning classification algorithms are highly sensitive to any feature data characteristic that differs between experimental groups, and will exploit these differences when training a model. Given this, it is important to make sure that data input into a machine learning model reflects true biological signal, and not technical artefacts or noise that stems from the experimental process used to generate the data. In simple terms, we need to remove things are aren't "real biological information".

There are two important sources of noise inherent in RNA-Seq data that may negatively impact machine learning modelling performance, namely low read counts, and influential outlier read counts.

<br>

## Low read counts

Genes with consistently low read count values across all samples in a dataset may be technical or biological stochastic artefacts such as the detection of a transcript from a gene that is not uniformly active in a heterogeneous cell population or as the result of a transcriptional error. Below some count threshold, genes are unlikely to be representative of true biological differences related to the condition of interest. Filtering out low count genes has been show to increase the classification performance of machine learning classifiers, and to increase the stability of the set of genes selected by a machine learning algorithm in the context of selecting relevant genes.
Genes with consistently low read count values across all samples in a dataset may be technical or biological stochastic artefacts such as the detection of a transcript from a gene that is not uniformly active in a heterogeneous cell population or as the result of a transcriptional error. Below some threshold, differences in counts between the conditions of interest for a given gene are unlikely to be representative of true biological differences between the groups. Filtering out low count genes has been show to increase the classification performance of machine learning classifiers, and to increase the stability of the set of genes selected by a machine learning algorithm in the context of selecting relevant genes.

### Investigating Low Counts

Expand All @@ -51,31 +51,32 @@ And now read the files into R...


```r
suppressPackageStartupMessages(library(tidyverse, quietly = TRUE))
# suppressPackageStartupMessages(library(dplyr, quietly = TRUE))
# suppressPackageStartupMessages(library(ggplot2, quietly = TRUE))
# suppressPackageStartupMessages(library(tibble, quietly = TRUE))
```


```r
samp.info.ibd.sel <- read.table(file="./data/ibd.sample.info.txt", sep="\t", header=T, fill=T, check.names=F)

counts.mat.ibd <- read.table(file="./data/counts.mat.ibd.txt", sep='\t', header=T, fill=T, check.names=F)

# head(counts.mat.ibd)
# any(is.na(samp.info.ibd.sel))
```

Run the following code to view the histogram giving the frequency of the maximum count for each gene in the sample (plotted on a log10 scale). You'll see in the resulting plot that over 800 genes have no counts for any gene (i.e., maximum = 0), and that there are hundreds of genes where the maximum count for the gene across all samples is below 10. This compares to a median maximum count value of around 250 for the dataset. These low count genes are likely to represent technical noise.
Run the following code to view the histogram giving the frequency of the maximum count for each gene in the sample (plotted on a log10 scale). You'll see in the resulting plot that over 800 genes have no counts for any gene (i.e., maximum = 0), and that there are hundreds of genes where the maximum count for the gene across all samples is below 10. For comparison, the median of these maximum counts across all genes is around 250. These low count genes are likely to represent technical noise.

A simple filtering approach is to remove all genes where the maximum read count for that gene over all samples is below a given threshold. The next step is to determine what this threshold should be.


```r
`%>%` <- magrittr::`%>%`

data.frame(max_count = apply(counts.mat.ibd, 1, max, na.rm=TRUE)) %>%
ggplot(aes(x = max_count)) +
geom_histogram(bins = 200) +
xlab("Max Counts (log10 scale)") +
ylab("Frequency") +
scale_x_log10(n.breaks = 6, labels = scales::comma)
ggplot2::ggplot(ggplot2::aes(x = max_count)) +
ggplot2::geom_histogram(bins = 200) +
ggplot2::xlab("Max Counts (log10 scale)") +
ggplot2::ylab("Frequency") +
ggplot2::scale_x_log10(n.breaks = 6, labels = scales::comma)
```

```{.warning}
Expand Down Expand Up @@ -110,16 +111,6 @@ There is no right answer. However, looking at the histogram, you can see an appr
In order to be able to compare read counts between samples, we must first adjust ('normalise') the counts to control for differences in sequence depth and sample composition between samples. To achieve this, we run the following code that will normalise the counts matrix using the median-of-ratios method implemented in the R package `DESeq2`. For more information on the rationale for scaling RNA-Seq counts and a comparison of the different metrics used see [RDMBites | RNAseq expression data](https://www.youtube.com/watch?v=tO2H3zuBouw).


```r
# load DESeq2 library
suppressPackageStartupMessages(library(DESeq2))
```

```{.warning}
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
```

```r
# convert the condition variable to a factor as required by DESeq2
samp.info.ibd.sel[c('condition')] <- lapply(samp.info.ibd.sel[c('condition')], factor)
Expand All @@ -129,7 +120,14 @@ dds.ibd <- DESeq2::DESeqDataSetFromMatrix(
countData = as.matrix(counts.mat.ibd),
colData = data.frame(samp.info.ibd.sel, row.names = 'sampleID'),
design = ~ condition)

```

```{.warning}
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
```

```r
# calculate the normalised count values using the median-of-ratios method
dds.ibd <- dds.ibd %>% DESeq2::estimateSizeFactors()

Expand Down Expand Up @@ -178,12 +176,12 @@ ms.jac = sapply(t.seq, function(t){

# plot the threshold value against the value of the Multiset Jaccard index to visualise

ggplot(data=data.frame(t = t.seq, jacc = ms.jac)) +
geom_line(aes(x=t, y=jacc)) +
geom_hline(yintercept = max(ms.jac), lty=2,col='gray') +
geom_point(aes(x=which.max(ms.jac), y=max(ms.jac)), col="red", cex=6, pch=1) +
xlab("Low Count Threshold") +
ylab("Multiset Jaccard Index")
ggplot2::ggplot(data=data.frame(t = t.seq, jacc = ms.jac)) +
ggplot2::geom_line(ggplot2::aes(x=t, y=jacc)) +
ggplot2::geom_hline(yintercept = max(ms.jac), lty=2,col='gray') +
ggplot2::geom_point(ggplot2::aes(x=which.max(ms.jac), y=max(ms.jac)), col="red", cex=6, pch=1) +
ggplot2::xlab("Low Count Threshold") +
ggplot2::ylab("Multiset Jaccard Index")
```

<img src="fig/episode5-rendered-unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
Expand Down Expand Up @@ -347,11 +345,11 @@ Run the same plot as above, and we can see that the genes with very low maximum

```r
data.frame(max_count = apply(counts.mat.ibd.ol.filtered, 1, max, na.rm=TRUE)) %>%
ggplot(aes(x = max_count)) +
geom_histogram(bins = 200) +
xlab("Max Counts (log10 scale)") +
ylab("Frequency") +
scale_x_log10(n.breaks = 6, labels = scales::comma)
ggplot2::ggplot(ggplot2::aes(x = max_count)) +
ggplot2::geom_histogram(bins = 200) +
ggplot2::xlab("Max Counts (log10 scale)") +
ggplot2::ylab("Frequency") +
ggplot2::scale_x_log10(n.breaks = 6, labels = scales::comma)
```

<img src="fig/episode5-rendered-unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
Expand Down
Loading

0 comments on commit 59dce59

Please sign in to comment.