Commit

differences for PR #6
actions-user committed Oct 3, 2023
1 parent a8202a9 commit c94e46d
Showing 9 changed files with 29 additions and 520 deletions.
83 changes: 0 additions & 83 deletions config.yaml

This file was deleted.

10 changes: 6 additions & 4 deletions episode2.md
@@ -72,7 +72,7 @@ Now use the sort in the top right of the screen to sort the results by "Samples"

## Illustrative Dataset: IBD Dataset

In the search results displayed above, the top ranked dataset is E-MTAB-11349: *Whole blood expression profiling of patients with inflammatory bowel diseases in the IBD-Character cohort*. We'll call it the [IBD dataset](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-11349).
In the search results displayed above, the top-ranked dataset is E-MTAB-11349: *Whole blood expression profiling of patients with inflammatory bowel diseases in the IBD-Character cohort*. We'll call it the [IBD dataset](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-11349).

The IBD dataset comprises human samples of patients with inflammatory bowel diseases and controls. The 'Protocols' section explains the main steps in the generation of the data, from RNA extraction and sample preparation, to sequencing and processing of the raw RNA-Seq data. The nucleic acid sequencing protocol gives details of the sequencing platform used to generate the raw fastq data, and the version of the human genome used in the alignment step to generate the count data. The normalization data transformation protocol gives the tools used to normalise raw counts data for sequence depth and sample composition; in this case, normalisation is conducted using the R package [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html). DESeq2 contains a range of functions for the transformation and analysis of RNA-Seq data. Let's look at some of the basic information on this dataset:

@@ -90,9 +90,11 @@ The dataset contains two alternative sets of processed data, a matrix of raw cou

## Downloading and Reading into R

Let's download the SDRF file and the raw counts matrix - these are the two files that contain the information we will need to build a machine learning classification model. In the data files box to the right hand side, check the follwoing two files `E-MTAB-11349.sdrf.txt` and `ArrayExpress-raw.csv`, and save in a folder called `data` in the working directory for your R project. For consistency, rename `ArrayExpress-raw.csv` as `E-MTAB-11349.counts.matrix.csv`.
Let's download the SDRF file and the raw counts matrix - these are the two files that contain the information we will need to build a machine-learning classification model. In the data files box to the right-hand side, check the following two files `E-MTAB-11349.sdrf.txt` and `ArrayExpress-raw.csv`, and save them in a folder called `data` in the working directory for your R project. For consistency, rename `ArrayExpress-raw.csv` as `E-MTAB-11349.counts.matrix.csv`.

For convenience, a copy of the files is also stored on Zenodo. You can run the following code that uses the function `download.file()` to download the files and save them directly into your `data` directory. (You need to have created the `data` directory beforehand.)
For convenience, a copy of the files is also stored on Zenodo. You can run the following code that uses the function `download.file()` to download the files and save them directly into your `data` directory. (You need to have created the `data` directory beforehand.)

To check the working directory in R, use the `getwd()` function, which returns the absolute file path of the current working directory of the R process; if this is not the desired location, use `setwd()` to change it.
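
Before downloading, it can help to confirm where R is working and to create the `data` folder from within R. The following is a minimal sketch using base R only; the path in the commented `setwd()` call is a hypothetical placeholder, not a location from this lesson.

```r
# Check which directory R is currently working in.
current_dir <- getwd()
print(current_dir)

# If this is not your project folder, point R at it first, e.g.
# setwd("path/to/your/project")   # hypothetical path; replace with your own

# Create the `data` folder if it does not already exist.
if (!dir.exists("data")) {
  dir.create("data")
}

# Confirm the folder is now present.
dir.exists("data")
```

Checking with `dir.exists()` first means the code can be re-run without a warning about the folder already existing.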

```r

@@ -106,7 +108,7 @@ download.file(url = "https://zenodo.org/record/8125141/files/E-MTAB-11349.counts

### Raw counts matrix

In RStudio, open your project workbook and read in the raw counts data. Then check the dimensions of the matrix to confirm we have the expected number of samples and transcript IDs.
In RStudio, open your project workbook and read the raw counts data. Then check the dimensions of the matrix to confirm we have the expected number of samples and transcript IDs.


```r
6 changes: 3 additions & 3 deletions episode5.md
@@ -21,15 +21,15 @@ exercises: 2

## Technical Artefacts in RNA-Seq Data

Machine learning classification algorithms are highly sensitive to any feature data characteristic, regardless of scale, that may differ between experimental groups, and will exploit these data characteristic differences when training a model. Given this, it is important to make sure that data input into a machine learning model reflects true biological signal, and not technical artefacts or noise that stems from the experimental process used to generate the data. In simple terms, we need to remove things that aren't "real biological information".
Machine learning classification algorithms are highly sensitive to any feature data characteristic that differs between experimental groups, and will exploit these differences when training a model. Given this, it is important to make sure that data input into a machine learning model reflects true biological signal, and not technical artefacts or noise that stems from the experimental process used to generate the data. In simple terms, we need to remove things that aren't "real biological information".

There are two important sources of noise inherent in RNA-Seq data that may negatively impact machine learning modelling performance, namely low read counts, and influential outlier read counts.

<br>

## Low read counts

Genes with consistently low read count values across all samples in a dataset may be technical or biological stochastic artefacts such as the detection of a transcript from a gene that is not uniformly active in a heterogeneous cell population or as the result of a transcriptional error. Below some count threshold, genes are unlikely to be representative of true biological differences related to the condition of interest. Filtering out low count genes has been shown to increase the classification performance of machine learning classifiers, and to increase the stability of the set of genes selected by a machine learning algorithm in the context of selecting relevant genes.
Genes with consistently low read count values across all samples in a dataset may be technical or biological stochastic artefacts such as the detection of a transcript from a gene that is not uniformly active in a heterogeneous cell population or as the result of a transcriptional error. Below some threshold, differences in counts between the conditions of interest for a given gene are unlikely to be representative of true biological differences between the groups. Filtering out low count genes has been shown to increase the classification performance of machine learning classifiers, and to increase the stability of the set of genes selected by a machine learning algorithm in the context of selecting relevant genes.

### Investigating Low Counts

@@ -64,7 +64,7 @@ counts.mat.ibd <- read.table(file="./data/counts.mat.ibd.txt", sep='\t', header=
# any(is.na(samp.info.ibd.sel))
```

Run the following code to view the histogram giving the frequency of the maximum count for each gene in the sample (plotted on a log10 scale). You'll see in the resulting plot that over 800 genes have no counts in any sample (i.e., maximum = 0), and that there are hundreds of genes where the maximum count for the gene across all samples is below 10. This compares to a median maximum count value of around 250 for the dataset. These low count genes are likely to represent technical noise.
Run the following code to view the histogram giving the frequency of the maximum count for each gene in the sample (plotted on a log10 scale). You'll see in the resulting plot that over 800 genes have no counts in any sample (i.e., maximum = 0), and that there are hundreds of genes where the maximum count for the gene across all samples is below 10. For comparison, the median of these maximum counts across all genes is around 250. These low count genes are likely to represent technical noise.

A simple filtering approach is to remove all genes where the maximum read count for that gene over all samples is below a given threshold. The next step is to determine what this threshold should be.
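
The maximum-count filter described above can be sketched on a small toy matrix. This is an illustration under stated assumptions rather than the lesson's own code: the matrix `counts.toy`, its values, and the threshold of 10 are all hypothetical, and the threshold appropriate for the IBD dataset is worked out in the next step.

```r
# Toy counts matrix (hypothetical values): rows are genes, columns are samples.
counts.toy <- matrix(
  c(0,   0,   0,
    3,   7,   2,
    250, 310, 190,
    12,  0,   45),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Maximum read count for each gene across all samples.
max.counts <- apply(counts.toy, 1, max)

# Illustrative threshold only (an assumption, not a recommended value).
threshold <- 10

# Keep genes whose maximum count reaches the threshold; drop the rest.
keep <- max.counts >= threshold
counts.filtered <- counts.toy[keep, , drop = FALSE]

dim(counts.filtered)
```

Using `drop = FALSE` keeps the result as a matrix even if only one gene passes the filter.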

Binary file removed fig/episode5-rendered-unnamed-chunk-4-1.png
Binary file not shown.
Binary file removed fig/episode6-rendered-unnamed-chunk-5-1.png
Binary file not shown.
Binary file removed fig/episode6-rendered-unnamed-chunk-9-1.png
Binary file not shown.
32 changes: 16 additions & 16 deletions md5sum.txt
@@ -1,17 +1,17 @@
"file" "checksum" "built" "date"
"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2023-07-11"
"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2023-07-11"
"config.yaml" "0dab73289fecfb94fc3103a200936186" "site/built/config.yaml" "2023-07-11"
"index.md" "a02c9c785ed98ddd84fe3d34ddb12fcd" "site/built/index.md" "2023-07-11"
"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2023-07-11"
"episodes/episode1.Rmd" "926badff629b1d85cfbf29a12e8c9291" "site/built/episode1.md" "2023-07-14"
"episodes/episode2.Rmd" "7a9492d6e94b47601461d63c43ad7025" "site/built/episode2.md" "2023-07-20"
"episodes/episode3.Rmd" "5852f58c31251027948529f82784e6f2" "site/built/episode3.md" "2023-07-14"
"episodes/episode4.Rmd" "b8109f31c9c190fa8d5a4db5d7a447e7" "site/built/episode4.md" "2023-07-20"
"episodes/episode5.Rmd" "88a322cefb0d142709d03b57a9946127" "site/built/episode5.md" "2023-07-20"
"episodes/episode6.Rmd" "6be594d58e8b3f916f5018669e609fda" "site/built/episode6.md" "2023-07-20"
"instructors/instructor-notes.md" "cae72b6712578d74a49fea7513099f8c" "site/built/instructor-notes.md" "2023-07-11"
"learners/reference.md" "1c7cc4e229304d9806a13f69ca1b8ba4" "site/built/reference.md" "2023-07-11"
"learners/setup.md" "6927e6379c9d9055194830a858d80eed" "site/built/setup.md" "2023-07-14"
"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2023-07-11"
"renv/profiles/lesson-requirements/renv.lock" "1717e21d111b54d6687f598a7898420a" "site/built/renv.lock" "2023-07-11"
"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2023-10-03"
"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2023-10-03"
"config.yaml" "0dab73289fecfb94fc3103a200936186" "site/built/config.yaml" "2023-10-03"
"index.md" "a02c9c785ed98ddd84fe3d34ddb12fcd" "site/built/index.md" "2023-10-03"
"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2023-10-03"
"episodes/episode1.Rmd" "926badff629b1d85cfbf29a12e8c9291" "site/built/episode1.md" "2023-10-03"
"episodes/episode2.Rmd" "ffacead9b9c7c4d5ca3af86ca7f59799" "site/built/episode2.md" "2023-10-03"
"episodes/episode3.Rmd" "5852f58c31251027948529f82784e6f2" "site/built/episode3.md" "2023-10-03"
"episodes/episode4.Rmd" "b8109f31c9c190fa8d5a4db5d7a447e7" "site/built/episode4.md" "2023-10-03"
"episodes/episode5.Rmd" "134c050517e06409dca8ba84e46118c7" "site/built/episode5.md" "2023-10-03"
"episodes/episode6.Rmd" "6be594d58e8b3f916f5018669e609fda" "site/built/episode6.md" "2023-10-03"
"instructors/instructor-notes.md" "cae72b6712578d74a49fea7513099f8c" "site/built/instructor-notes.md" "2023-10-03"
"learners/reference.md" "1c7cc4e229304d9806a13f69ca1b8ba4" "site/built/reference.md" "2023-10-03"
"learners/setup.md" "2499af5a77e67facf5bac24602034d6a" "site/built/setup.md" "2023-10-03"
"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2023-10-03"
"renv/profiles/lesson-requirements/renv.lock" "8acc272e94be7b3a36f7c0a45900902c" "site/built/renv.lock" "2023-10-03"