Merge pull request #10 from mskcc/release_candidate_1.0.0
Release candidate 1.0.0
timosong authored Jul 24, 2024
2 parents bcf778b + a5a5efa commit f24a518
Showing 33 changed files with 2,509 additions and 332 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -6,3 +6,4 @@ results/
testing/
testing*
*.pyc
._*
62 changes: 29 additions & 33 deletions README.md
@@ -9,67 +9,63 @@

## Introduction

**mskcc/loki** is a bioinformatics pipeline that ...
**mskcc/loki** is a bioinformatics pipeline that calculates Copy Number Variation (CNV) data from a tumor/normal BAM pair. The pipeline calculates pileups using MSKCC htstools and then runs MSKCC Facets/Facets-suite on them.

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->
![Loki graph](docs/images/Loki.png)

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
1. Calculate pileups ([`htstools`](https://github.com/mskcc/htstools/releases/tag/snp_pileup_0.1.1))
2. Calculate CNV results ([`Facets-suite`](https://github.com/mskcc/facets-suite/releases/tag/2.0.9))

## Usage

:::note
If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how
to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline)
with `-profile test` before running the workflow on actual data.
:::
> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
#### Running nextflow @ MSKCC

<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):
If you are running this pipeline on an MSKCC cluster, make sure Nextflow is properly configured for the HPC environment:

```bash
module load java/jdk-17.0.8
module load singularity/3.7.1
export PATH=$PATH:/path/to/nextflow/binary
export SINGULARITY_TMPDIR=/path/to/network/storage/for/singularity/tmp/files
export NXF_SINGULARITY_CACHEDIR=/path/to/network/storage/for/singularity/cache
```

### Running the pipeline

First, prepare a samplesheet with your input data that looks as follows:

`samplesheet.csv`:

```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
pairId,tumorBam,normalBam,assay,normalType,bedFile
pair_sample,/bam/path/foo_tumor.rg.md.abra.printreads.bam,/bam/path/foo_normal.rg.md.abra.printreads.bam,IMPACT505,MATCHED,NONE
```

Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-->
> [!IMPORTANT]
> Make sure each BAM has an associated index file; either `file.bam.bai` or `file.bai` should work.
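
The index requirement above can be checked before launching a run. The following is a minimal sketch, assuming the index sits in the same directory as the BAM under one of the two names mentioned in the note; the helper name `find_bam_index` is hypothetical and is not part of mskcc/loki.

```python
from pathlib import Path
from typing import Optional


def find_bam_index(bam_path: str) -> Optional[Path]:
    """Return the first index file found next to the BAM, or None."""
    bam = Path(bam_path)
    candidates = [
        bam.with_name(bam.name + ".bai"),  # file.bam.bai
        bam.with_suffix(".bai"),           # file.bai
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_bam_index` on each `tumorBam` and `normalBam` value in the samplesheet and stopping on `None` would catch a missing index before the pipeline starts.
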
Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run mskcc/loki \
-profile <docker/singularity/.../institute> \
nextflow run main.nf \
-profile singularity,test_juno \
--input samplesheet.csv \
--outdir <OUTDIR>
```

:::warning
Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
:::
> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
## Credits

mskcc/loki was originally written by Nikhil Kumar.
mskcc/loki was originally written by Nikhil Kumar [@nikhil](https://github.com/nikhil).

We thank the following people for their extensive assistance in the development of this pipeline:
<!--We thank the following people for their extensive assistance in the development of this pipeline: -->

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->

6 changes: 3 additions & 3 deletions assets/samplesheet.csv
@@ -1,3 +1,3 @@
sample,fastq_1,fastq_2
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
pairId,tumorBam,normalBam,assay,normalType,bedFile
pair_sample,/juno/work/ci/dev/dev_phoenix/test-datasets/bam/foo_tumor.rg.md.abra.printreads.bam,/juno/work/ci/dev/dev_phoenix/test-datasets/bam/foo_normal.rg.md.abra.printreads.bam,IMPACT505,MATCHED,NONE
pair_sample_1,/juno/work/ci/dev/dev_phoenix/test-datasets/bam/bar_tumor.rg.md.abra.printreads.bam,/juno/work/ci/dev/dev_phoenix/test-datasets/bam/bar_normal.rg.md.abra.printreads.bam,IMPACT505,MATCHED,NONE
41 changes: 32 additions & 9 deletions assets/schema_input.json
@@ -7,22 +7,39 @@
"items": {
"type": "object",
"properties": {
"sample": {
"pairId": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample name must be provided and cannot contain spaces"
"errorMessage": "Pair id must be provided and cannot contain spaces"
},
"fastq_1": {
"tumorBam": {
"type": "string",
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"format": "file-path",
"pattern": "^\\S+\\.bam$",
"errorMessage": "Tumor bam file must be provided, cannot contain spaces and must have extension '.bam'"
},
"fastq_2": {
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'",
"normalBam": {
"type": "string",
"format": "file-path",
"pattern": "^\\S+\\.bam$",
"errorMessage": "Normal bam file must be provided, cannot contain spaces and must have extension '.bam'"
},
"assay": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Assay must be provided and cannot contain spaces"
},
"normalType": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "NormalType must be provided and cannot contain spaces"
},
"bedFile": {
"errorMessage": "Bed file to specify genomic regions",
"anyOf": [
{
"type": "string",
"pattern": "^\\S+\\.f(ast)?q\\.gz$"
"pattern": "^\\S+\\.bed$"
},
{
"type": "string",
@@ -31,6 +48,12 @@
]
}
},
"required": ["sample", "fastq_1"]
"required": [
"pairId",
"tumorBam",
"normalBam",
"assay",
"normalType"
]
}
}
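
To see what the new schema patterns accept, the required fields can be tested with Python's `re` module. This is only a rough sketch: the real validation is done by the pipeline's JSON Schema tooling, the function name `row_errors` is hypothetical, and `bedFile` is left out because its second `anyOf` alternative is not visible in this diff.

```python
import re

# Patterns copied from the schema above; "\\S" in JSON is the regex \S.
PATTERNS = {
    "pairId": r"^\S+$",
    "tumorBam": r"^\S+\.bam$",
    "normalBam": r"^\S+\.bam$",
    "assay": r"^\S+$",
    "normalType": r"^\S+$",
}


def row_errors(row: dict) -> list:
    """Return one message per required field that is missing or malformed."""
    errors = []
    for field, pattern in PATTERNS.items():
        value = row.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif re.search(pattern, value) is None:
            errors.append(f"{field}: does not match {pattern}")
    return errors
```

A row such as the `pair_sample` line in `assets/samplesheet.csv` produces an empty list, while a BAM path containing a space or lacking the `.bam` extension is reported.
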
139 changes: 57 additions & 82 deletions bin/check_samplesheet.py
@@ -25,38 +25,33 @@ class RowChecker:
"""

VALID_FORMATS = (
".fq.gz",
".fastq.gz",
".bam",
)

def __init__(
self,
sample_col="sample",
first_col="fastq_1",
second_col="fastq_2",
single_col="single_end",
pairId="pairId",
tumorBam="tumorBam",
normalBam="normalBam",
assay="assay",
normalType="normalType",
bedFile="bedFile",
**kwargs,
):
"""
Initialize the row checker with the expected column names.
Args:
sample_col (str): The name of the column that contains the sample name
(default "sample").
first_col (str): The name of the column that contains the first (or only)
FASTQ file path (default "fastq_1").
second_col (str): The name of the column that contains the second (if any)
FASTQ file path (default "fastq_2").
single_col (str): The name of the new column that will be inserted and
records whether the sample contains single- or paired-end sequencing
reads (default "single_end").
"""
super().__init__(**kwargs)
self._sample_col = sample_col
self._first_col = first_col
self._second_col = second_col
self._single_col = single_col
self._pairId=pairId
self._tumorBam=tumorBam
self._normalBam=normalBam
self._assay=assay
self._normalType=normalType
self._bedFile=bedFile
self._seen = set()
self.modified = []

@@ -69,65 +64,53 @@ def validate_and_transform(self, row):
(values).
"""
self._validate_sample(row)
self._validate_first(row)
self._validate_second(row)
self._validate_pair(row)
self._seen.add((row[self._sample_col], row[self._first_col]))
self._validate_names(row)
self._validate_bams(row)
self._validate_normalType(row)
self._validate_bed_format(row)
self._seen.add((row[self._pairId]))
self.modified.append(row)

def _validate_sample(self, row):
"""Assert that the sample name exists and convert spaces to underscores."""
if len(row[self._sample_col]) <= 0:
raise AssertionError("Sample input is required.")
# Sanitize samples slightly.
row[self._sample_col] = row[self._sample_col].replace(" ", "_")
def _validate_names(self, row):
"""Assert that the sample names exist"""
if len(row[self._pairId]) <= 0:
raise AssertionError("pairId is required.")

def _validate_pairId_format(self, row):
id_value = row[self._pairId]
if "." in id_value:
raise AssertionError("pairId:{} cannot contain any periods ('.') ".format(id_value))

def _validate_first(self, row):
def _validate_bams(self, row):
"""Assert that both BAM entries are non-empty and have the right format."""
if len(row[self._first_col]) <= 0:
raise AssertionError("At least the first FASTQ file is required.")
self._validate_fastq_format(row[self._first_col])

def _validate_second(self, row):
"""Assert that the second FASTQ entry has the right format if it exists."""
if len(row[self._second_col]) > 0:
self._validate_fastq_format(row[self._second_col])

def _validate_pair(self, row):
"""Assert that read pairs have the same file extension. Report pair status."""
if row[self._first_col] and row[self._second_col]:
row[self._single_col] = False
first_col_suffix = Path(row[self._first_col]).suffixes[-2:]
second_col_suffix = Path(row[self._second_col]).suffixes[-2:]
if first_col_suffix != second_col_suffix:
raise AssertionError("FASTQ pairs must have the same file extensions.")
else:
row[self._single_col] = True

def _validate_fastq_format(self, filename):
if len(row[self._tumorBam]) <= 0 or len(row[self._normalBam]) <= 0:
raise AssertionError("Both bam files are required.")
self._validate_bam_format(row[self._tumorBam])
self._validate_bam_format(row[self._normalBam])

def _validate_normalType(self, row):
"""Assert that the normal type exists."""
if len(row[self._normalType]) <= 0:
raise AssertionError("normalType is required.")

def _validate_bam_format(self, filename):
"""Assert that a given filename has one of the expected BAM extensions."""
if not any(filename.endswith(extension) for extension in self.VALID_FORMATS):
raise AssertionError(
f"The FASTQ file has an unrecognized extension: {filename}\n"
f"The BAM file has an unrecognized extension: {filename}\n"
f"It should be one of: {', '.join(self.VALID_FORMATS)}"
)

def validate_unique_samples(self):
"""
Assert that the combination of sample name and FASTQ filename is unique.
In addition to the validation, also rename all samples to have a suffix of _T{n}, where n is the
number of times the same sample exist, but with different FASTQ files, e.g., multiple runs per experiment.
"""
if len(self._seen) != len(self.modified):
raise AssertionError("The pair of sample name and FASTQ must be unique.")
seen = Counter()
for row in self.modified:
sample = row[self._sample_col]
seen[sample] += 1
row[self._sample_col] = f"{sample}_T{seen[sample]}"
def _validate_bed_format(self, row):
"""Assert that a given filename has one of the expected BED extensions."""
filename = row[self._bedFile]
if filename and filename != "NONE":
if not filename.endswith(".bed"):
raise AssertionError(
f"The BED file has an unrecognized extension: {filename}\n"
f"It should be .bed\n"
f"If you would like one generated for you leave it blank or enter 'NONE'\n"
)
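
One Python detail is worth spelling out for `VALID_FORMATS`: parentheses alone do not create a tuple, so a one-element tuple needs a trailing comma. The sketch below is illustrative only; the names `NOT_A_TUPLE`, `has_valid_extension` and `loose_check` are hypothetical and do not appear in `check_samplesheet.py`.

```python
# Parentheses alone do not create a tuple: this is just the string ".bam".
NOT_A_TUPLE = (".bam")

# The trailing comma makes this a one-element tuple.
VALID_FORMATS = (".bam",)


def has_valid_extension(filename: str, valid_formats: tuple = VALID_FORMATS) -> bool:
    # str.endswith accepts a tuple of suffixes directly.
    return filename.endswith(valid_formats)


def loose_check(filename: str) -> bool:
    # Iterating over a string yields single characters, so this passes for
    # any name ending in ".", "b", "a" or "m".
    return any(filename.endswith(ext) for ext in NOT_A_TUPLE)
```

With the tuple form, `any(filename.endswith(extension) for extension in VALID_FORMATS)` and `', '.join(VALID_FORMATS)` both behave as the error message intends.
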


def read_head(handle, num_lines=10):
@@ -164,10 +147,9 @@ def sniff_format(handle):

def check_samplesheet(file_in, file_out):
"""
Check that the tabular samplesheet has the structure expected by nf-core pipelines.
Check that the tabular samplesheet has the structure expected by the ODIN pipeline.
Validate the general shape of the table, expected columns, and each row. Also add
an additional column which records whether one or two FASTQ reads were found.
Validate the general shape of the table, expected columns, and each row.
Args:
file_in (pathlib.Path): The given tabular samplesheet. The format can be either
@@ -179,19 +161,14 @@ def check_samplesheet(file_in, file_out):
This function checks that the samplesheet follows the following structure,
see also the `viral recon samplesheet`_::
sample,fastq_1,fastq_2
SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz
SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz
SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz,
.. _viral recon samplesheet:
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv
pairId,tumorBam,normalBam,assay,normalType,bedFile
SAMPLE_TUMOR.SAMPLE_NORMAL,BAM_TUMOR,BAM_NORMAL,BAITS,NORMAL_TYPE,BED_FILE
"""
required_columns = {"sample", "fastq_1", "fastq_2"}
required_columns = {"pairId","tumorBam","normalBam","assay","normalType","bedFile"}
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_in.open(newline="") as in_handle:
reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle),delimiter=',')
# Validate the existence of the expected header columns.
if not required_columns.issubset(reader.fieldnames):
req_cols = ", ".join(required_columns)
@@ -205,9 +182,7 @@ def check_samplesheet(file_in, file_out):
except AssertionError as error:
logger.critical(f"{str(error)} On line {i + 2}.")
sys.exit(1)
checker.validate_unique_samples()
header = list(reader.fieldnames)
header.insert(1, "single_end")
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_out.open(mode="w", newline="") as out_handle:
writer = csv.DictWriter(out_handle, header, delimiter=",")
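
The header check in `check_samplesheet` can be tried in isolation. This is a minimal sketch, assuming a comma-delimited sheet and skipping the `sniff_format` step; the function name `missing_columns` is hypothetical.

```python
import csv
import io

# Column names taken from the updated required_columns set.
REQUIRED_COLUMNS = {"pairId", "tumorBam", "normalBam", "assay", "normalType", "bedFile"}


def missing_columns(samplesheet_text: str) -> set:
    """Return the required columns that are absent from the header row."""
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    # fieldnames is None for an empty file, so fall back to an empty list.
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])
```

An empty result corresponds to `required_columns.issubset(reader.fieldnames)` being true in the script.
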