From 410fa55d37aa93ac41afc7a8c0dd6ea96c5afd3c Mon Sep 17 00:00:00 2001 From: GitHub Actions Date: Tue, 3 Dec 2024 01:13:13 +0000 Subject: [PATCH] differences for PR #8 --- .DS_Store | Bin 6148 -> 0 bytes data/.DS_Store | Bin 6148 -> 0 bytes episode2.md | 20 ++++++------ episode3.md | 37 ++++++++++++---------- episode4.md | 82 ++++++++++++++++++++++++------------------------- episode5.md | 78 +++++++++++++++++++++++++--------------------- episode6.md | 20 ++++++------ fig/.DS_Store | Bin 6148 -> 0 bytes md5sum.txt | 32 +++++++++---------- 9 files changed, 141 insertions(+), 128 deletions(-) delete mode 100644 .DS_Store delete mode 100644 data/.DS_Store delete mode 100644 fig/.DS_Store diff --git a/.DS_Store b/.DS_Store deleted file mode 100644 index 0c67f3563145cbcd045f260377b1bf83d0b3097b..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 6148 zcmeHKJ5Iwu5S>j-6a66*lG%x0;a&OD1e#G);|!m)f6xVOo6om{C!B^jHzNR=sz7;{1E^cA?ywNm^Z*g zF;%Puk%2jB1xl;aBZiZ9#1pMc6>C9hC#Q!Gr_4@ID9*Fv{)sOqmkQcy3YY?+0$sU` zdH$dOeEtuc?8+1{1^$%+F3QH)2#=(*we@g3YZLf2oQ>;h!F3BZeiWmYNAVHd8{!FX VfT?0FhzQL72zVK6F$MlqfiDj2SkM3f diff --git a/data/.DS_Store b/data/.DS_Store deleted file mode 100644 index b31f438d29a4ef8df65481a6ef16b9b502eecd07..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 6148 zcmeHKu};H447E##TmZfg;+@1@sYA;X$0AL^HB$&$&p~M8k%rGiq1;QE%)KIn-gEbuXV1Ail zRMc={Yd+ZS%+{fBx;yfR "Sample 1", "Sample 2", "Sam… @@ -142,7 +142,7 @@ All of the checklist items apply in this case. Let's go through them... 1. Let's verify that there is a unique identifier that matches between the sample information and counts matrix, and identify the name of the column in the sample information. We manually find the names of the samples in `raw.counts.ibd` (the counts matrix), and set to a new variable `ibd.samp.names`. We then write a `for loop` that evaluates which columns in the sample information match the sample names. There is one match, the column named `Source Name`. Note that there are other columns such as `Assay Name` in this dataset that contain identifiers for some but not all samples. This is a good illustration of why it is important to check carefully to ensure you have a complete set of unique identifiers. -```r +``` r ibd.samp.names <- colnames(raw.counts.ibd)[3:ncol(raw.counts.ibd)] lst.colnames <- c() @@ -153,7 +153,7 @@ for(i in seq_along(1:ncol(samp.info.ibd))){ sprintf("The unique IDs that match the counts matrix are in column: %s", colnames(samp.info.ibd)[which(lst.colnames)]) ``` -```{.output} +``` output [1] "The unique IDs that match the counts matrix are in column: Source Name" ``` @@ -164,7 +164,7 @@ sprintf("The unique IDs that match the counts matrix are in column: %s", colname -```r +``` r samp.info.ibd.sel <- dplyr::select(samp.info.ibd, 'Source Name', 'Characteristics[age]', @@ -178,7 +178,7 @@ samp.info.ibd.sel <- dplyr::select(samp.info.ibd, 3. Let's then rename the variables to something more easy to interpret for humans, avoiding spaces and special characters. We'll save these selected columns to a new variable name `samp.info.ibd.sel`. -```r +``` r samp.info.ibd.sel <- dplyr::rename(samp.info.ibd.sel, 'sampleID' = 'Source Name', 'age' = 'Characteristics[age]', @@ -192,11 +192,11 @@ samp.info.ibd.sel <- dplyr::rename(samp.info.ibd.sel, 4. Now check how each of the variables are encoded in our reduced sample information data to identify any errors, data gaps and inconsistent coding of categorical variables. Firstly let's take a look at the data. -```r +``` r dplyr::glimpse(samp.info.ibd.sel) ``` -```{.output} +``` output Rows: 590 Columns: 4 $ sampleID "Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5", … @@ -219,13 +219,13 @@ What steps do we need to take to ensure that these four data columns are consist * There is clearly an issue with the coding of the `condition` Crohn's disease. We'll fix that first, and then check the consistency of the coding of categorical variables `sex` and `condition`. -```r +``` r samp.info.ibd.sel$condition[agrep("Crohns", samp.info.ibd.sel$condition)] <- "crohns_disease" unique(c(samp.info.ibd.sel$sex, samp.info.ibd.sel$condition)) ``` -```{.output} +``` output [1] "male" "female" "crohns_disease" [4] "normal" "ulcerative colitis" ``` @@ -237,7 +237,7 @@ Before we do this, since we'll be using the pipe operator from the tidyverse lib -```r +``` r `%>%` <- magrittr::`%>%` samp.info.ibd.sel <- samp.info.ibd.sel %>% @@ -255,14 +255,14 @@ samp.info.ibd.sel <- samp.info.ibd.sel %>% -```r +``` r samp.info.ibd.sel <- samp.info.ibd.sel %>% dplyr::mutate(class = dplyr::if_else(condition == 'normal', -1, 1)) ``` -```r +``` r samp.info.ibd.sel[c('sex', 'condition', 'class')] <- lapply(samp.info.ibd.sel[c('sex', 'condition', 'class')], factor) ``` @@ -271,11 +271,11 @@ samp.info.ibd.sel[c('sex', 'condition', 'class')] <- lapply(samp.info.ibd.sel[c( 6. Finally, let's check the distribution of the classes by creating a `table` from the `class` column. -```r +``` r table(samp.info.ibd.sel$class) ``` -```{.output} +``` output -1 1 267 323 @@ -286,11 +286,11 @@ table(samp.info.ibd.sel$class) The two classes are approximately equally represented, so let's check everything one last time. -```r +``` r dplyr::glimpse(samp.info.ibd.sel) ``` -```{.output} +``` output Rows: 590 Columns: 5 $ sampleID "Sample_1", "Sample_2", "Sample_3", "Sample_4", "Sample_5", … @@ -330,11 +330,11 @@ Item | Check For... | Rationale 1. Take a look at a sample of columns and rows to see what the downloaded file looks like -```r +``` r raw.counts.ibd[1:10,1:8] ``` -```{.output} +``` output read Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 1 1 * 13961 16595 20722 17696 25703 20848 2 2 ERCC-00002 0 0 0 0 0 0 @@ -361,7 +361,7 @@ Can you see the irrelevant information that we need to remove from the counts ma * Move the column named `read` that contains to the transcript IDs to the row names -```r +``` r counts.mat.ibd <- raw.counts.ibd[-1,-1] rownames(counts.mat.ibd) <- NULL @@ -371,7 +371,7 @@ counts.mat.ibd <- counts.mat.ibd %>% tibble::column_to_rownames('read') counts.mat.ibd[1:10,1:6] ``` -```{.output} +``` output Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 ERCC-00002 0 0 0 0 0 0 ERCC-00003 0 0 0 0 0 0 @@ -394,19 +394,19 @@ ERCC-00019 0 0 0 0 0 0 2. Let's check for duplicate sampleIDs and transcript IDs. Provided there are no duplicate row or column names, the following should return `interger(0)`. -```r +``` r which(duplicated(rownames(counts.mat.ibd))) ``` -```{.output} +``` output integer(0) ``` -```r +``` r which(duplicated(colnames(counts.mat.ibd))) ``` -```{.output} +``` output integer(0) ``` @@ -415,12 +415,12 @@ integer(0) 3. We'll double check the sampleIDs match the sample information file. -```r +``` r if(!identical(colnames(counts.mat.ibd), samp.info.ibd.sel$sampleID)){stop()} ``` -```{.error} -Error in eval(expr, envir, enclos): +``` error +Error: ``` ::::::::::::::::::::::::::::::::::::: challenge @@ -434,7 +434,7 @@ Why did we get an error here? We renamed the samples in the sample information to remove spaces, so we need to do the same here. -```r +``` r colnames(counts.mat.ibd) <- gsub(x = colnames(counts.mat.ibd), pattern = "\ ", replacement = "_") ``` @@ -447,13 +447,13 @@ colnames(counts.mat.ibd) <- gsub(x = colnames(counts.mat.ibd), pattern = "\ ", r -```r +``` r allMissValues <- function(x){all(is.na(x) | x == "")} allMissValues(counts.mat.ibd) ``` -```{.output} +``` output [1] FALSE ``` @@ -462,11 +462,11 @@ allMissValues(counts.mat.ibd) Take a final look at the cleaned up matrix. -```r +``` r counts.mat.ibd[1:10,1:6] ``` -```{.output} +``` output Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 ERCC-00002 0 0 0 0 0 0 ERCC-00003 0 0 0 0 0 0 @@ -480,19 +480,19 @@ ERCC-00017 2 0 0 0 1 0 ERCC-00019 0 0 0 0 0 0 ``` -```r +``` r sprintf("There are %i rows, corresponding to the transcript IDs", dim(counts.mat.ibd)[1]) ``` -```{.output} +``` output [1] "There are 22750 rows, corresponding to the transcript IDs" ``` -```r +``` r sprintf("There are %i columns, corresponding to the samples", dim(counts.mat.ibd)[2]) ``` -```{.output} +``` output [1] "There are 590 columns, corresponding to the samples" ``` @@ -537,7 +537,7 @@ Read the sdrf file into R and take a look at the data. Make a list of the potent As a help, the code to read the file in from your data directory is: -```r +``` r samp.info.tb <- read.table(file="./data/E-MTAB-6845.sdrf.txt", sep="\t", header=T, fill=T, check.names=F) ``` @@ -605,7 +605,7 @@ dplyr::glimpse() -```r +``` r samp.info.tb %>% dplyr::select( @@ -630,7 +630,7 @@ samp.info.tb %>% dplyr::glimpse() # view output ``` -```{.output} +``` output Rows: 360 Columns: 3 $ sampleID "PR123_S19", "PR096_S13", "PR146_S14", "PR158_S12", "PR095… diff --git a/episode5.md b/episode5.md index 3f332dd..9ca7cc2 100644 --- a/episode5.md +++ b/episode5.md @@ -50,14 +50,14 @@ download.file(url = "https://zenodo.org/record/8125141/files/counts.mat.ibd.txt" And now read the files into R... -```r +``` r # suppressPackageStartupMessages(library(dplyr, quietly = TRUE)) # suppressPackageStartupMessages(library(ggplot2, quietly = TRUE)) # suppressPackageStartupMessages(library(tibble, quietly = TRUE)) ``` -```r +``` r samp.info.ibd.sel <- read.table(file="./data/ibd.sample.info.txt", sep="\t", header=T, fill=T, check.names=F) counts.mat.ibd <- read.table(file="./data/counts.mat.ibd.txt", sep='\t', header=T, fill=T, check.names=F) @@ -68,7 +68,7 @@ Run the following code to view the histogram giving the frequency of the maximum A simple filtering approach is to remove all genes where the maximum read count for that gene over all samples is below a given threshold. The next step is to determine what this threshold should be. -```r +``` r `%>%` <- magrittr::`%>%` data.frame(max_count = apply(counts.mat.ibd, 1, max, na.rm=TRUE)) %>% @@ -79,12 +79,14 @@ data.frame(max_count = apply(counts.mat.ibd, 1, max, na.rm=TRUE)) %>% ggplot2::scale_x_log10(n.breaks = 6, labels = scales::comma) ``` -```{.warning} -Warning: Transformation introduced infinite values in continuous x-axis +``` warning +Warning in ggplot2::scale_x_log10(n.breaks = 6, labels = scales::comma): log-10 +transformation introduced infinite values. ``` -```{.warning} -Warning: Removed 10 rows containing non-finite values (`stat_bin()`). +``` warning +Warning: Removed 10 rows containing non-finite outside the scale range +(`stat_bin()`). ``` @@ -111,7 +113,7 @@ There is no right answer. However, looking at the histogram, you can see an appr In order to be able to compare read counts between samples, we must first adjust ('normalise') the counts to control for differences in sequence depth and sample composition between samples. To achieve this, we run the following code that will normalise the counts matrix using the median-of-ratios method implemented in the R package `DESeq2`. For more information on the rationale for scaling RNA-Seq counts and a comparison of the different metrics used see [RDMBites | RNAseq expression data](https://www.youtube.com/watch?v=tO2H3zuBouw). -```r +``` r # convert the condition variable to a factor as required by DESeq2 samp.info.ibd.sel[c('condition')] <- lapply(samp.info.ibd.sel[c('condition')], factor) @@ -122,12 +124,12 @@ dds.ibd <- DESeq2::DESeqDataSetFromMatrix( design = ~ condition) ``` -```{.warning} +``` warning Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment' ``` -```r +``` r # calculate the normalised count values using the median-of-ratios method dds.ibd <- dds.ibd %>% DESeq2::estimateSizeFactors() @@ -142,7 +144,7 @@ There are a number of methodologies to calculate the appropriate threshold. One Calculating the Jaccard index between all pairs of samples in a dataset does not however scale well with the dimensionality of the data. Here we use an alternative approach that achieves a very similar result using the Multiset Jaccard index, which is faster to compute. Don't worry too much about the code, the main point is that we find a filter threshold, and filter out all the genes where the max count value is below the threshold, on the basis that these are likely technical noise. Run the code to plot the Multiset Jaccard Index for a series of values for the filter threshold. -```r +``` r # For the purpose of illustration, and to shorten the run time, we sample 5,000 genes from the counts matrix. set.seed(10) @@ -184,6 +186,12 @@ ggplot2::ggplot(data=data.frame(t = t.seq, jacc = ms.jac)) + ggplot2::ylab("Multiset Jaccard Index") ``` +``` warning +Warning in ggplot2::geom_point(ggplot2::aes(x = which.max(ms.jac), y = max(ms.jac)), : All aesthetics have length 1, but the data has 25 rows. +ℹ Please consider using `annotate()` or provide this layer with data containing + a single row. +``` + ::::::::::::::::::::::::::::::::::::: challenge @@ -197,11 +205,11 @@ What value should we use for the low counts threshold? The threshold value is given by the following code, which should return a value close to 10 for this dataset. -```r +``` r (t.hold <- which.max(ms.jac)) ``` -```{.output} +``` output [1] 11 ``` @@ -215,13 +223,13 @@ The threshold value is given by the following code, which should return a value Having determined a threshold, we then filter the raw counts matrix on the rows (genes) that meet the threshold criterion based on the normalised counts. As you can see, around 4,000 genes are removed from the dataset, which is approximately 20% of the genes. These genes are unlikely to contain biologically meaningful information but they might easily bias a machine learning classifier. -```r +``` r counts.mat.ibd.filtered <- counts.mat.ibd[which(apply(counts.ibd.norm, 1, function(x){sum(x > t.hold) >= 1})),] sprintf("Genes filtered: %s; Genes remaining: %s", nrow(counts.mat.ibd)-nrow(counts.mat.ibd.filtered), nrow(counts.mat.ibd.filtered)) ``` -```{.output} +``` output [1] "Genes filtered: 3712; Genes remaining: 19038" ``` @@ -235,20 +243,20 @@ of the general population and a result of biological heterogeneity and technical Run the following code to view the top 10 values of read counts in the raw counts matrix. Compare this with the histogram above, and the mean read count. The largest read count values range from 2MM to over 3MM counts for a gene in a particular sample. These require investigation! -```r +``` r tail(sort(as.matrix(counts.mat.ibd)),10) ``` -```{.output} +``` output [1] 2037946 2038514 2043983 2133125 2238093 2269033 2341479 2683585 3188911 [10] 3191428 ``` -```r +``` r sprintf("The mean read count value: %f", mean(as.matrix(counts.mat.ibd))) ``` -```{.output} +``` output [1] "The mean read count value: 506.731355" ``` @@ -259,7 +267,7 @@ As with low counts, multiple methods exist to identify influential outlier read Run the following code to create DESeq Data Set object from the filtered raw counts matrix, with condition as the experimental factor of interest. -```r +``` r dds.ibd.filt <- DESeq2::DESeqDataSetFromMatrix( countData = as.matrix(counts.mat.ibd.filtered), colData = data.frame(samp.info.ibd.sel, row.names = 'sampleID'), @@ -269,56 +277,56 @@ dds.ibd.filt <- DESeq2::DESeqDataSetFromMatrix( Run `DESeq2` differential expression analysis, which automatically calculates the Cook's distances for every read count. This may take a few minutes to run. -```r +``` r deseq.ibd <- DESeq2::DESeq(dds.ibd.filt) ``` -```{.output} +``` output estimating size factors ``` -```{.output} +``` output estimating dispersions ``` -```{.output} +``` output gene-wise dispersion estimates ``` -```{.output} +``` output mean-dispersion relationship ``` -```{.output} +``` output final dispersion estimates ``` -```{.output} +``` output fitting model and testing ``` -```{.output} +``` output -- replacing outliers and refitting for 1559 genes -- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds) ``` -```{.output} +``` output estimating dispersions ``` -```{.output} +``` output fitting model and testing ``` -```r +``` r cooks.mat <- SummarizedExperiment::assays(deseq.ibd)[["cooks"]] ``` We now calculate the cooks outlier threshold by computing the expected F-distribution for the number of samples in the dataset. -```r +``` r cooks.quantile <- 0.95 m <- ncol(deseq.ibd) # number of samples p <- 3 # number of model parameters (in the three condition case) @@ -329,13 +337,13 @@ h.threshold <- stats::qf(cooks.quantile, p, m - p) Filter the counts matrix to eliminate all genes where the cooks distance is over the outlier threshold. Here you can see that a further ~1,800 genes are filtered out based on having outlier read count values for at least one sample. -```r +``` r counts.mat.ibd.ol.filtered <- counts.mat.ibd.filtered[which(apply(cooks.mat, 1, function(x){(max(x) < h.threshold) >= 1})),] sprintf("Genes filtered: %s; Genes remaining: %s", nrow(counts.mat.ibd.filtered)-nrow(counts.mat.ibd.ol.filtered), nrow(counts.mat.ibd.ol.filtered)) ``` -```{.output} +``` output [1] "Genes filtered: 1776; Genes remaining: 17262" ``` @@ -343,7 +351,7 @@ sprintf("Genes filtered: %s; Genes remaining: %s", nrow(counts.mat.ibd.filtered) Run the same plot as above, and we can see that the genes with very low maximum counts over all samples have been removed. Note that here we are looking at the raw unnormalised counts, so not all maximum counts are below 11, as the filter was applied to the normailsed counts. -```r +``` r data.frame(max_count = apply(counts.mat.ibd.ol.filtered, 1, max, na.rm=TRUE)) %>% ggplot2::ggplot(ggplot2::aes(x = max_count)) + ggplot2::geom_histogram(bins = 200) + diff --git a/episode6.md b/episode6.md index 4c2911f..ee51360 100644 --- a/episode6.md +++ b/episode6.md @@ -39,14 +39,14 @@ download.file(url = "https://zenodo.org/record/8125141/files/counts.mat.ibd.ol.f And now read the files into R... -```r +``` r # suppressPackageStartupMessages(library(dplyr, quietly = TRUE)) # suppressPackageStartupMessages(library(ggplot2, quietly = TRUE)) # suppressPackageStartupMessages(library(tibble, quietly = TRUE)) ``` -```r +``` r samp.info.ibd.sel <- read.table(file="./data/ibd.sample.info.txt", sep="\t", header=T, fill=T, check.names=F) counts.mat.ibd.ol.filtered <- read.table(file="./data/counts.mat.ibd.ol.filtered.txt", sep='\t', header=T, fill=T, check.names=F) @@ -55,7 +55,7 @@ counts.mat.ibd.ol.filtered <- read.table(file="./data/counts.mat.ibd.ol.filtered A simple histogram of the raw counts data illustrates the skewed nature of the distribution. Here we plot a random sample of just 5,000 counts to illustrate the point. -```r +``` r set.seed(seed = 30) `%>%` <- magrittr::`%>%` @@ -75,7 +75,7 @@ counts.mat.ibd.ol.filtered %>% A plot of the mean count against the variance of the counts for a random sample of 5,000 genes (to reduce compute time) illustrates the clear mean variance relationship in the data. The variance is increasing as the mean count increases. -```r +``` r counts.mat.ibd.ol.filtered %>% .[sample(nrow(.), size = 5000, replace = FALSE),] %>% data.frame(row.mean = apply(., 1, mean), @@ -96,7 +96,7 @@ A number of standard transformations exist to change the distribution of RNA-Seq The `vst` transformation is faster to run and more suitable for datasets with a large sample size. Let's conduct the `vst` transformation on our filtered counts data. This may take a few minutes to run on our dataset. -```r +``` r # convert the condition variable to a factor as required by DESeq2 samp.info.ibd.sel[c('condition')] <- lapply(samp.info.ibd.sel[c('condition')], factor) @@ -107,12 +107,12 @@ dds.ibd.filt.ol <- DESeq2::DESeqDataSetFromMatrix( design = ~ condition) ``` -```{.warning} +``` warning Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment' ``` -```r +``` r # calculate the vst count values and extract the counts matrix using the assay() function counts.mat.ibd.vst <- DESeq2::varianceStabilizingTransformation(dds.ibd.filt.ol, blind=FALSE) %>% SummarizedExperiment::assay() @@ -121,7 +121,7 @@ counts.mat.ibd.vst <- DESeq2::varianceStabilizingTransformation(dds.ibd.filt.ol, Plotting the data again, we can see the difference in the distribution of the data following transformation. Although there is still a large number of count values equal to zero, the distribution of vst transformed counts is far less heavily skewed than the original. The dependence of the variance on the mean has essentially been eliminated. -```r +``` r set.seed(seed = 30) counts.mat.ibd.vst %>% @@ -138,7 +138,7 @@ counts.mat.ibd.vst %>% -```r +``` r counts.mat.ibd.vst %>% as.data.frame() %>% .[sample(nrow(.), size = 5000, replace = FALSE),] %>% @@ -160,7 +160,7 @@ Many machine learning algorithms are sensitive to the scale of the data. Even af -```r +``` r counts.mat.ibd.vst %>% as.data.frame() %>% tibble::rownames_to_column("geneID") %>% diff --git a/fig/.DS_Store b/fig/.DS_Store deleted file mode 100644 index 5008ddfcf53c02e82d7eee2e57c38e5672ef89f6..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 6148 zcmeH~Jr2S!425mzP>H1@V-^m;4Wg<&0T*E43hX&L&p$$qDprKhvt+--jT7}7np#A3 zem<@ulZcFPQ@L2!n>{z**++&mCkOWA81W14cNZlEfg7;MkzE(HCqgga^y>{tEnwC%0;vJ&^%eQ zLs35+`xjp>T0