# About conquer

The conquer (consistent quantification of external rna-seq data) repository is developed by Charlotte Soneson and Mark D Robinson at the University of Zurich, Switzerland. It is implemented in shiny and provides access to consistently processed public single-cell RNA-seq data sets. Below is a short description of the workflow used to process the raw reads in order to generate the data provided in the repository.

If you use conquer for your work, please cite

The information provided in the columns Brief description, Protocol and Protocol type was inferred and summarized from the information provided by the data generators in the public repositories. We refer to the original descriptions for more detailed information.

## Index building

In order to use Salmon to quantify the transcript abundances in a given sample, we first need to index the corresponding reference transcriptome. For a given organism, we download the fasta files containing cDNA and ncRNA sequences from Ensembl, complement these with ERCC spike-in sequences, and build a Salmon quasi-mapping index for the entire catalog. Note that the scater report for a given data set (available in the scater report column) details the precise version of the transcriptome that was used for the quantification. For data sets with "long" reads (longer than 50 bp) we use the default k=31, while for data sets with "short" reads (typically around 25 bp) we set k=15.
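The choice of k described above can be sketched as follows. This is a minimal illustration, not the code used for conquer: the function names and file names are hypothetical, and treating every read length of 50 bp or less as "short" is an assumption (the text only gives "longer than 50 bp" and "typically around 25 bp"). The command is built but not executed.

```python
from typing import List


def choose_kmer_size(read_length_bp: int) -> int:
    """Pick the Salmon k-mer size from the read length.

    Reads longer than 50 bp get Salmon's default k=31; anything
    shorter is treated as "short" and gets k=15 (assumed cutoff).
    """
    return 31 if read_length_bp > 50 else 15


def salmon_index_command(fasta_path: str, index_dir: str, read_length_bp: int) -> List[str]:
    """Build (but do not run) a `salmon index` command line."""
    k = choose_kmer_size(read_length_bp)
    return ["salmon", "index", "-t", fasta_path, "-i", index_dir, "-k", str(k)]


# Hypothetical file names, for illustration only.
print(" ".join(salmon_index_command("cdna_ncrna_ercc.fa", "salmon_index_k15", 25)))
```

Keeping the rule in one small function makes it easy to build both indexes for an organism once and reuse them across data sets.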

We also create a lookup table relating transcript IDs to the corresponding gene IDs. This information is obtained by parsing the sequence names in the cDNA and ncRNA fasta files. From these names we also obtain the genomic coordinates for each feature.
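As an illustration of this parsing step, the sketch below extracts the transcript ID, gene ID and genomic coordinates from a sequence name. It assumes the usual Ensembl header layout (`>transcript_id type coord_system:assembly:name:start:end:strand gene:gene_id ...`); the function names are hypothetical, and sequences without a `gene:` field (such as ERCC spike-ins) simply get `None` as gene ID.

```python
from typing import Dict, Iterable, Optional, Tuple

Coords = Tuple[str, int, int, int]  # (sequence name, start, end, strand)


def parse_header(header: str) -> Tuple[str, Optional[str], Optional[Coords]]:
    """Parse one FASTA header line in the assumed Ensembl layout."""
    fields = header.lstrip(">").split()
    tx_id = fields[0]
    gene_id = None
    coords = None
    for field in fields[1:]:
        if field.startswith("description:"):
            break  # free text follows; nothing more to parse
        if field.startswith("gene:"):
            gene_id = field[len("gene:"):]
        elif coords is None and field.count(":") == 5:
            # coord_system:assembly:name:start:end:strand
            _, _, seq_name, start, end, strand = field.split(":")
            coords = (seq_name, int(start), int(end), int(strand))
    return tx_id, gene_id, coords


def build_tx2gene(fasta_lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each transcript ID to its gene ID using only the header lines."""
    lookup = {}
    for line in fasta_lines:
        if line.startswith(">"):
            tx_id, gene_id, _ = parse_header(line.strip())
            lookup[tx_id] = gene_id
    return lookup
```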

## Sample list and run matching

The first step is to determine the set of samples included in a given data set. We download a "RunInfo.csv" file for the data set from SRA and a Series Matrix file from GEO, in order to link samples both to individual runs and to phenotypic information. If the data set is not available from GEO, we construct a phenotype data file from the information provided by the corresponding repository.
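Linking runs to samples can be sketched as a simple grouping over the RunInfo table. The column names `Run` and `SampleName` are assumptions about the layout of the downloaded file (for GEO data sets, `SampleName` is assumed to hold the GSM accession), and the accessions in the usage example are made up.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def runs_per_sample(runinfo_text: str,
                    run_col: str = "Run",
                    sample_col: str = "SampleName") -> Dict[str, List[str]]:
    """Group run accessions by sample, given the text of a RunInfo CSV."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(runinfo_text)):
        run, sample = row.get(run_col), row.get(sample_col)
        if not run or not sample:
            continue  # incomplete row
        if run == run_col:
            continue  # repeated header line inside the file
        groups[sample].append(run)
    return dict(groups)


# Made-up accessions, for illustration only.
example = "Run,SampleName\nSRR0000001,GSM0000001\nSRR0000002,GSM0000001\nSRR0000003,GSM0000002\n"
print(runs_per_sample(example))
```

The resulting sample IDs can then be matched against the sample column of the phenotype table.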

## Quality control

For each sample in the data set, we find all the corresponding runs, and download and concatenate the corresponding FastQ files from SRA. There is also an optional step to trim adapters from the reads using cutadapt. Next, we run FastQC to generate a quality control file for each concatenated read file (one or two files per sample depending on whether it was processed with a single-end or paired-end sequencing protocol).
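The concatenation step can be sketched as below. It is a minimal example under the assumption that all run files of a sample are gzip-compressed (or all uncompressed): gzip files can be joined byte by byte and still decompress to the concatenated content. The function name and the toy reads are hypothetical. For paired-end data the function would be called once per mate, with the runs in the same order.

```python
import gzip
import shutil
import tempfile
from pathlib import Path
from typing import Iterable


def concatenate_fastq(run_files: Iterable[Path], out_path: Path) -> Path:
    """Append the FastQ files of all runs of a sample into one file."""
    with open(out_path, "wb") as out:
        for path in run_files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # raw byte copy, no recompression
    return out_path


# Toy demonstration with two single-read "runs".
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    run_a, run_b = tmp / "runA.fastq.gz", tmp / "runB.fastq.gz"
    with gzip.open(run_a, "wt") as fh:
        fh.write("@r1\nACGT\n+\nIIII\n")
    with gzip.open(run_b, "wt") as fh:
        fh.write("@r2\nTTGA\n+\nIIII\n")
    merged = concatenate_fastq([run_a, run_b], tmp / "sample.fastq.gz")
    with gzip.open(merged, "rt") as fh:
        merged_text = fh.read()
```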

## Abundance quantification

After the QC, we run Salmon to estimate the abundance of each transcript from the catalog described above in each sample. The Salmon output files are then compressed in an archive and can be downloaded from conquer (see the salmon archive column).
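A quantification call for one sample might be assembled as in the sketch below, which switches between single-end and paired-end input depending on the number of concatenated read files. The exact Salmon parameters used for a data set are listed in its metadata, not here: the automatic library type (`-l A`) and all paths are assumptions made for the example, and the command is only built, not run.

```python
from typing import List, Sequence


def salmon_quant_command(index_dir: str, fastq_files: Sequence[str], out_dir: str) -> List[str]:
    """Build a `salmon quant` command for one sample (one or two read files)."""
    if len(fastq_files) == 1:
        reads = ["-r", fastq_files[0]]  # single-end
    elif len(fastq_files) == 2:
        reads = ["-1", fastq_files[0], "-2", fastq_files[1]]  # paired-end
    else:
        raise ValueError("expected one (single-end) or two (paired-end) read files")
    # "-l A" asks Salmon to infer the library type (assumed setting).
    return ["salmon", "quant", "-i", index_dir, "-l", "A", *reads, "-o", out_dir]


# Hypothetical paths, for illustration only.
print(" ".join(salmon_quant_command("salmon_index", ["s1_1.fastq.gz", "s1_2.fastq.gz"], "salmon/s1")))
```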

For data obtained with non-full-length library preparation protocols (e.g. targeting only the 3' or 5' end of transcripts), we quantify transcript and gene abundances using the umis pipeline developed by Valentine Svensson. Briefly, we quasimap the reads to the transcriptome using RapMap and use the counting capabilities of umis to obtain feature counts.
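The basic idea behind UMI-based counting is that reads sharing the same cell barcode, UMI and feature are counted once. The sketch below shows only this idea; it is not the umis implementation, which handles further details, and the record layout `(cell_barcode, umi, feature_id)` is a hypothetical simplification.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_umis(records: Iterable[Tuple[str, str, str]]) -> Dict[Tuple[str, str], int]:
    """Count distinct UMIs per (cell barcode, feature) pair.

    Each record is (cell_barcode, umi, feature_id) for one mapped read.
    """
    seen = defaultdict(set)
    for cell, umi, feature in records:
        seen[(cell, feature)].add(umi)  # duplicate UMIs collapse in the set
    return {key: len(umis) for key, umis in seen.items()}


# Toy reads: the first two are duplicates of the same molecule.
reads = [
    ("cellA", "AACG", "tx1"),
    ("cellA", "AACG", "tx1"),
    ("cellA", "GGTC", "tx1"),
    ("cellB", "AACG", "tx2"),
]
print(count_umis(reads))
```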

## Summary report - MultiQC

Once FastQC and Salmon (or RapMap/umis) have been applied to all samples in the data set, we run MultiQC to summarise all the information into one report. This can also be downloaded from conquer (see the MultiQC report column). This report contains quality scores for all the samples and can be used to identify problematic samples and to decide whether the data set is suitable for the user's purposes as it is, or needs to be subsetted.

## Data summarisation

The abundances estimated by Salmon are summarised and provided to the user via conquer in the form of a MultiAssayExperiment object. This object can be downloaded via the buttons in the MultiAssayExperiment column. To generate this object, we first use the tximport package to read the Salmon output into R. This returns both count estimates and TPM estimates for each transcript. Next, we summarise the transcript-level information to the gene level. The gene-level TPM is defined as the sum of the TPMs of the corresponding transcripts, and similarly for the gene-level counts. We also provide "scaled TPMs" (see http://f1000research.com/articles/4-1521/ or the tximport vignette for a discussion), that is, summarised TPMs scaled to a "count scale". In the summarisation step, we make use of the transcript-to-gene lookup table generated above.
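The gene-level sums described above amount to a grouped sum over the transcript-to-gene lookup table. The following is a minimal single-sample sketch in plain Python (conquer itself does this in R via tximport); the function name and the toy values are made up.

```python
from collections import defaultdict
from typing import Dict


def summarise_to_gene(tx_values: Dict[str, float], tx2gene: Dict[str, str]) -> Dict[str, float]:
    """Sum transcript-level values (TPMs or counts) over the transcripts of each gene."""
    gene_values = defaultdict(float)
    for tx_id, value in tx_values.items():
        gene_id = tx2gene.get(tx_id)
        if gene_id is not None:  # skip transcripts missing from the lookup table
            gene_values[gene_id] += value
    return dict(gene_values)


# Toy values for one sample.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
tx_tpm = {"tx1": 10.0, "tx2": 30.0, "tx3": 5.0}
print(summarise_to_gene(tx_tpm, tx2gene))
```

The same function applies unchanged to the estimated counts.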

The provided MultiAssayExperiment object contains two "experiments", corresponding to the gene-level and transcript-level values. The gene-level experiment contains four "assays":

  • TPM
  • count
  • count_lstpm (count-scale length-scaled TPMs)
  • avetxlength (the average transcript length, which can be used as offsets in count models based on the count assay, see http://f1000research.com/articles/4-1521/).
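To make the last two assays more concrete, the sketch below computes an average transcript length per gene and length-scaled TPMs on a count scale for a single sample. It is modelled on the approach described in the tximport documentation, with two stated assumptions: the average length is weighted by transcript abundance, and the scaled values are normalised to the sample's total estimated count. tximport works on whole matrices and handles details such as genes with zero abundance that are left out here. All names and numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def average_tx_length(tx_tpm: Dict[str, float], tx_efflen: Dict[str, float],
                      tx2gene: Dict[str, str]) -> Dict[str, float]:
    """Abundance-weighted average of the effective transcript lengths per gene."""
    weighted = defaultdict(float)
    total = defaultdict(float)
    for tx_id, tpm in tx_tpm.items():
        gene_id = tx2gene[tx_id]
        weighted[gene_id] += tpm * tx_efflen[tx_id]
        total[gene_id] += tpm
    # Genes with zero total abundance are skipped in this sketch.
    return {g: weighted[g] / total[g] for g in total if total[g] > 0}


def length_scaled_counts(gene_tpm: Dict[str, float], gene_avelen: Dict[str, float],
                         total_count: float) -> Dict[str, float]:
    """Scale TPM * average length so that the values sum to the sample's total count."""
    raw = {g: gene_tpm[g] * gene_avelen[g] for g in gene_tpm if g in gene_avelen}
    scale = total_count / sum(raw.values())
    return {g: value * scale for g, value in raw.items()}


# Toy values for one sample.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
tx_tpm = {"tx1": 10.0, "tx2": 30.0, "tx3": 5.0}
tx_efflen = {"tx1": 1000.0, "tx2": 2000.0, "tx3": 500.0}
avelen = average_tx_length(tx_tpm, tx_efflen, tx2gene)
lstpm = length_scaled_counts({"geneA": 40.0, "geneB": 5.0}, avelen, total_count=145.0)
```

In a count model, the logarithm of the average transcript length would enter as an offset alongside the count assay.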

The transcript-level experiment contains three "assays":

  • TPM
  • count
  • efflength (the effective length estimated by Salmon)

The MultiAssayExperiment also contains the phenotypic data (in the colData slot), as well as some metadata for the data set (the genome, the organism, a summary of the Salmon parameters and the fraction of reads that were mapped, and the date when the object was generated). Please note that the format of MultiAssayExperiment objects changed with version 1.1.49 of the MultiAssayExperiment package, and in particular the pData slot is now deprecated in favor of colData. The objects provided in conquer follow the new format.

## Summary report - scater

In order to give users another way of investigating whether a data set is useful for their purposes, we also provide an exploratory analysis report. This is largely based on functions from the scater Bioconductor package, applied to data extracted from the MultiAssayExperiment object. The report calculates and visualises various quality measures for the cells, and provides low-dimensional representations of the cells, colored by different phenotypic annotations.

## Acknowledgements

We would like to thank Simon Andrews for help with FastQC, Mike Love and Valentine Svensson for providing instructions for how to retrieve the URL for the FastQ file(s) of a given SRA run (see here and here), Davis McCarthy for input regarding scater, and Nicholas Hamilton for instructions on how to generate a standardized report based on a provided R object (see here). Finally, we would like to acknowledge the developers of all the tools we use to prepare the data for conquer.

## Presentations/publications

conquer was presented as a poster at the Single Cell Genomics conference in Hinxton, UK, in September 2016. A detailed description of the database and an example of its use in an evaluation of differential expression analysis methods for single-cell RNA-seq data can be found in:

## Code

The code used for conquer is available via GitHub.