The S3 xSeries
class attempts to provide a convenient R representation of a
typical RNA-Seq output consisting of a series of sequencing runs, usually each
of them referring to a different RNA sample, from a different biological source.
Building on this, the S3 xModel
class addresses the need to group several
independent scientific studies for the purpose of meta-analysis or reanalysis.
To describe this hierarchical data structure, we will use a terminology
primarily modeled after the one adopted by the partners of the International
Nucleotide Sequence Database Collaboration (INSDC) in the context of the
Sequence Read Archive (SRA).
See here for a
friendly explanation of the INSDC and the related SRA data model.
- Run: each sequencing run directly generating a single FASTQ file (or
file pair in the case of non-interleaved PE reads). Within the INSDC
framework, Run metadata objects are referred to by an accession following
the pattern
(E|D|S)RR[0-9]{6,}
. - Sample: the set of all Runs from the same biological RNA sample.
The corresponding INSDC metadata objects are Samples (accession
(E|D|S)RS[0-9]{6,}
) or, equivalently, BioSamples (accessionSAM(E|D|N)[A-Z]?[0-9]+
). Even GEO Sample IDs (accessionGSM[0-9]+
) can be found if data were originally brokered to INSDC’s SRA by GEO. - Series: the set of all Samples/Runs pertaining to a given
biological research project or experimental design, across all conditions of
interest to that particular study. This is roughly what INSDC refers to as
Study (accession
(E|D|S)RP[0-9]{6,}
) or, equivalently, BioProject (accessionPRJ(E|D|N)[A-Z][0-9]+
). If data were originally brokered by GEO or ArrayExpress, GEO Series IDs (accessionGSE[0-9]+
) or AE Experiment IDs (accessionE-[A-Z]{4}-[0-9]+
) will be also present in SRA DBs and can be used as Study alias accessions, respectively. - Model: a collection of Series dealing with the same biological model. Since this object is characteristic of the meta-analytic level, which by definition involves multiple independent studies, it has no counterpart in INSDC.
Note
Most of the times, Runs and Samples are the same thing, however it could happen that a single RNA Sample is sequenced through multiple Runs (also referred to as 'technical replicates'). Thus, the only ID that is ensured to be unique on a per-file basis (even in the case of technical replicates), is the Run accession. For this reason, it will be used here as the base reference for the construction of xSeries and xModel data structures.
SeqLoader implements S3 xSeries
and xModel
classes to provide an
integrated representation of RNA-Seq data (i.e., read counts), gene annotation,
and Sample/Run metadata for Series and Models, respectively.
Both new_xSeries
and new_xModel
SeqLoader constructors assume that
RNA-Seq low-level analysis (i.e., read alignment and transcript abundance
quantification) has already been performed and that the starting point is a
matrix of raw or normalized counts and their related metadata. This means that
xSeries
and xModel
objects can only be constructed from files containing
such information and not by direct data input.
With the exclusion of a few methods designed to filter and subset structures
([...]
, pruneRuns
, keepRuns
), all the other methods of the class access
the data structures in read-only mode, leaving the object completely unchanged.
SeqLoader is indeed designed to facilitate high-level analysis operations
(i.e., downstream of the count matrix) such as data representation or
descriptive or inferential statistical analysis, in both absolute and
differential expression contexts.
These S3 objects are constructed from a gene expression matrix and a metadata
table related to a single study. Within the expression matrix, each row
represents a gene (or transcript), while each column refers to a different
sequencing Run, with the possible addition of one or more columns for gene
annotations (e.g., gene symbol, gene name, gene type, ...). In contrast, the
metadata table is assumed to dedicate one row for each Run and one column for
each metadata of interest, among which a mandatory field for Run accessions.
So, xSeries
objects combine information from a typical gene expression matrix
(possibly including gene annotations) with metadata about individual Runs.
More technically, an xSeries
object is a named list containing one element for
each Run of the study in question, and an additional element for gene
annotations (which are common to all Runs). Each Run element is itself a
list containing relevant metadata for the specific Run, and a gene
data
frame with raw or normalized counts for each gene. The annotation
element, on
the contrary, is a simple data frame (see tree graph below). Notice that,
Samples are not explicitly represented in xSeries
objects. However, a method
will be implemented to collapse Runs to the Samples they belong in the case
of technical replicates (see issue #4).
Important
For the purpose of a data re-analysis or meta-analysis, one may not
necessarily be interested in all the samples or experimental conditions
included in the the original study. Rather, just a selection of them is
usually considered. For this reason, recomputed count matrices could feature
less columns (i.e., Runs) compared to the number of rows within the complete
metadata table. By construction, xSeries
objects are made up of all Runs
for which there exist a metadata entry (i.e., usually all the Runs from the
original study), but only the ones included in the loaded count matrices will
also feature a counts
column in the genes
data frame. In any case, one can
always use the pruneRuns
method to easily remove all the Runs with no
count data from xSeries or xModel objects, reducing them to the sole
columns actually present in the count matrix used for object construction.
Structurally speaking, these S3 objects are just collections (lists) of many
xSeries
related to the same biological model. Rather, the focus here is on
functions. Methods implementing generics for this class are primarily meant to
aid in meta-analytic data synthesis (see geneStats
in particular).
Here is a graph of a generic xModel made up of n xSeries. The n-th
Series consists of m Runs and the gene annotation
element. Each Run
contains a gene
dataframe for the counts
and an arbitrary number of
metadata, including the mandatory ena_run
ID.
Model : xModel
│
├──$ Series_1 : xSeries
├──$ Series_2 : xSeries
├──$ ...
└──$ Series_n : xSeries
│
├──$ Run_1 : list
├──$ Run_2 : list
├──$ Run_3 : list
├──$ ...
├──$ Run_m : list
│ │
│ ├──$ ena_sample_title : chr
│ ├──$ geo_series : chr
│ ├──$ geo_sample : chr
│ ├──$ ena_project : chr
│ ├──$ ena_sample : chr
│ ├──$ ena_run : chr
│ ├──$ read_count : int
│ ├──$ library_layout : chr
│ ├──$ extra : int
│ ├──$ ...
│ └──$ genes : data.frame
│ ├──$ IDs : chr
│ └──$ counts : num
│
└──$ annotation : data.frame
├──$ IDs : chr
├──$ SYMBOL : chr
├──$ GENENAME : chr
├──$ GENETYPE : chr
└──$ ...
A number of (hopefully reasonable) assumptions upon file and data organization are made for object construction to be successful.
- Each Series (or Study) is represented by two
CSV
orTSV
files, containing respectively the read counts (along with possible gene annotation) and metadata for all the Runs making up the Series. - All Series related to the same Model are stored in the same directory, here referred to as the target directory for xModel construction. Consistently, each Model is expected to live in a separate target directory.
- Both data and metadata files have names starting with the same Series
accession ID (any of the INSDC BioProject, ENA Study, or GEO Series format
is fine), followed by an underscore (
_
) and either the distinctive stringCountMatrix
ormeta
(ignoring case) if they are counts or metadata, respectively. Any additional characters after this pattern are allowed, as long as the terminal filename extension is eitherCSV
orTSV
(again ignoring case). Examples of well-formed filenames are:GSE205739_CountMatrix_genes_TPM.tsv PRJNA847413_countMATRIX.CSV PRJNA847413_meta.csv SRP379268_MeTaDaTa.TSV
- Metadata tables have proper column names as header, among which only
ena_run
is mandatory, as the only ID that is certainly unique to each FASTQ file in the original study (i.e., Series). - Count tables have proper column names as header, including a mandatory gene
ID column matching the regex
gene.*id|transcript.*id|ENSEMBL|ENSEMBLTRANS
(usually beinggene_id
ortranscript_id
). - Within the count table, each count-containing column is named using the
corresponding
ena_run
ID in the header (along with other possible text strings). - Input counts from
CountMatrix
file can be either raw or normalized counts, but in any case they are supposed to be in linear scale (not log-transformed)
Tip
Most of these requirements are automatically met if the low-level RNA-Seq data analysis is performed using x.FASTQ.