Merge pull request #10 from mskcc/release_candidate_1.0.0

Release candidate 1.0.0
mskcc · Jul 24, 2024 · f24a518
2 parents bcf778b + a5a5efa
commit f24a518
Showing 33 changed files with 2,509 additions and 332 deletions.
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,4 @@ results/
 testing/
 testing*
 *.pyc
+._*
diff --git a/README.md b/README.md
@@ -9,67 +9,63 @@
 
 ## Introduction
 
-**mskcc/loki** is a bioinformatics pipeline that ...
+**mskcc/loki** is a bioinformatics pipeline that calculates Copy Number Variation (CNV) mutation data from a Tumor/Normal BAM pair. The pipeline uses MSKCC Facets/Facets-suite and calculates pileups using MSKCC Htstools.
 
-<!-- TODO nf-core:
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
+![Loki graph](docs/images/Loki.png)
 
-<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.   -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
-
-1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+1. Calculate pileups ([`htstools`](https://github.com/mskcc/htstools/releases/tag/snp_pileup_0.1.1))
+2. Calculate CNV results ([`Facets-suite`](https://github.com/mskcc/facets-suite/releases/tag/2.0.9))
 
 ## Usage
 
-:::note
-If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how
-to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline)
-with `-profile test` before running the workflow on actual data.
-:::
+> [!NOTE]
+> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
+
+### Running Nextflow @ MSKCC
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate):
+If you are running this pipeline on an MSKCC cluster, make sure Nextflow is properly configured for the HPC environment:
+
+```bash
+module load java/jdk-17.0.8
+module load singularity/3.7.1
+export PATH=$PATH:/path/to/nextflow/binary
+export SINGULARITY_TMPDIR=/path/to/network/storage/for/singularity/tmp/files
+export NXF_SINGULARITY_CACHEDIR=/path/to/network/storage/for/singularity/cache
+```
+
+### Running the pipeline
 
 First, prepare a samplesheet with your input data that looks as follows:
 
 `samplesheet.csv`:
 
 ```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+pairId,tumorBam,normalBam,assay,normalType,bedFile
+pair_sample,/bam/path/foo_tumor.rg.md.abra.printreads.bam,/bam/path/foo_normal.rg.md.abra.printreads.bam,IMPACT505,MATCHED,NONE
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-
--->
+> [!IMPORTANT]
+> Make sure each BAM has an associated index file; either `file.bam.bai` or `file.bai` should work.
 
 Now, you can run the pipeline using:
 
 <!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
 
 ```bash
-nextflow run mskcc/loki \
-   -profile <docker/singularity/.../institute> \
+nextflow run main.nf \
+   -profile singularity,test_juno \
    --input samplesheet.csv \
    --outdir <OUTDIR>
 ```
 
-:::warning
-Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
-provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
-see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
-:::
+> [!WARNING]
+> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
 
 ## Credits
 
-mskcc/loki was originally written by Nikhil Kumar.
+mskcc/loki was originally written by Nikhil Kumar [@nikhil](https://github.com/nikhil).
 
-We thank the following people for their extensive assistance in the development of this pipeline:
+<!--We thank the following people for their extensive assistance in the development of this pipeline: -->
 
 <!-- TODO nf-core: If applicable, make list of people who have also contributed -->
 

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,3 +1,3 @@
-sample,fastq_1,fastq_2
-SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
-SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
+pairId,tumorBam,normalBam,assay,normalType,bedFile
+pair_sample,/juno/work/ci/dev/dev_phoenix/test-datasets/bam/foo_tumor.rg.md.abra.printreads.bam,/juno/work/ci/dev/dev_phoenix/test-datasets/bam/foo_normal.rg.md.abra.printreads.bam,IMPACT505,MATCHED,NONE
+pair_sample_1,/juno/work/ci/dev/dev_phoenix/test-datasets/bam/bar_tumor.rg.md.abra.printreads.bam,/juno/work/ci/dev/dev_phoenix/test-datasets/bam/bar_normal.rg.md.abra.printreads.bam,IMPACT505,MATCHED,NONE
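The README asks that every BAM in the samplesheet have an index next to it, named either `file.bam.bai` or `file.bai`. A minimal sketch of how that could be checked before launching a run is below. The helper names `bam_index_candidates` and `find_bam_index` are hypothetical and are not part of this PR; the sketch only assumes the two naming conventions stated in the README.

```python
from pathlib import Path


def bam_index_candidates(bam_path):
    """Return the two index locations the README mentions for a BAM file."""
    bam = Path(bam_path)
    # file.bam -> file.bam.bai (suffix appended) and file.bai (suffix replaced)
    return [bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai")]


def find_bam_index(bam_path):
    """Return the first index path that exists on disk, or None if neither does."""
    for candidate in bam_index_candidates(bam_path):
        if candidate.exists():
            return candidate
    return None


if __name__ == "__main__":
    # Example path taken from the README samplesheet; it need not exist locally.
    for path in bam_index_candidates("/bam/path/foo_tumor.rg.md.abra.printreads.bam"):
        print(path)
```

Running this over the `tumorBam` and `normalBam` columns and stopping when `find_bam_index` returns `None` would surface a missing index before the pileup step starts.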
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -7,22 +7,39 @@
     "items": {
         "type": "object",
         "properties": {
-            "sample": {
+            "pairId": {
                 "type": "string",
                 "pattern": "^\\S+$",
-                "errorMessage": "Sample name must be provided and cannot contain spaces"
+                "errorMessage": "Pair id must be provided and cannot contain spaces"
             },
-            "fastq_1": {
+            "tumorBam": {
                 "type": "string",
-                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
-                "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+                "format": "file-path",
+                "pattern": "^\\S+\\.bam$",
+                "errorMessage": "Tumor bam file must be provided, cannot contain spaces and must have extension '.bam'"
             },
-            "fastq_2": {
-                "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'",
+            "normalBam": {
+                "type": "string",
+                "format": "file-path",
+                "pattern": "^\\S+\\.bam$",
+                "errorMessage": "Normal bam file must be provided, cannot contain spaces and must have extension '.bam'"
+            },
+            "assay": {
+                "type": "string",
+                "pattern": "^\\S+$",
+                "errorMessage": "Assay must be provided and cannot contain spaces"
+            },
+            "normalType": {
+                "type": "string",
+                "pattern": "^\\S+$",
+                "errorMessage": "NormalType must be provided and cannot contain spaces"
+            },
+            "bedFile": {
+                "errorMessage": "Bed file to specify genomic regions",
                 "anyOf": [
                     {
                         "type": "string",
-                        "pattern": "^\\S+\\.f(ast)?q\\.gz$"
+                        "pattern": "^\\S+\\.bed$"
                     },
                     {
                         "type": "string",
@@ -31,6 +48,12 @@
                 ]
             }
         },
-        "required": ["sample", "fastq_1"]
+        "required": [
+            "pairId",
+            "tumorBam",
+            "normalBam",
+            "assay",
+            "normalType"
+        ]
     }
 }
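For reviewers who want to try the new schema rules against a row without running the pipeline, the required-column patterns can be mirrored in a few lines of Python. This is only a sketch: `PATTERNS`, `REQUIRED`, and `schema_errors` are hypothetical names, the patterns are copied from the hunk above with the JSON escaping removed, and the optional `bedFile` column is left out because its second `anyOf` alternative is not visible in this diff.

```python
import re

# Patterns copied from assets/schema_input.json (JSON string escapes removed).
PATTERNS = {
    "pairId": r"^\S+$",
    "tumorBam": r"^\S+\.bam$",
    "normalBam": r"^\S+\.bam$",
    "assay": r"^\S+$",
    "normalType": r"^\S+$",
}

# Same order as the "required" array in the schema.
REQUIRED = ["pairId", "tumorBam", "normalBam", "assay", "normalType"]


def schema_errors(row):
    """Return the required columns whose value is missing or fails its pattern."""
    errors = []
    for column in REQUIRED:
        value = row.get(column, "")
        if not re.search(PATTERNS[column], value):
            errors.append(column)
    return errors


if __name__ == "__main__":
    row = {
        "pairId": "pair_sample",
        "tumorBam": "/bam/path/foo_tumor.rg.md.abra.printreads.bam",
        "normalBam": "/bam/path/foo_normal.rg.md.abra.printreads.bam",
        "assay": "IMPACT505",
        "normalType": "MATCHED",
        "bedFile": "NONE",
    }
    print(schema_errors(row))
```

This does not replace the schema validation done by the pipeline; it only makes the intent of each pattern easy to test in isolation.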
diff --git a/bin/check_samplesheet.py b/bin/check_samplesheet.py
@@ -25,38 +25,33 @@ class RowChecker:
     """
 
     VALID_FORMATS = (
-        ".fq.gz",
-        ".fastq.gz",
+        ".bam",
     )
 
     def __init__(
         self,
-        sample_col="sample",
-        first_col="fastq_1",
-        second_col="fastq_2",
-        single_col="single_end",
+        pairId="pairId",
+        tumorBam="tumorBam",
+        normalBam="normalBam",
+        assay="assay",
+        normalType="normalType",
+        bedFile="bedFile",
         **kwargs,
     ):
         """
         Initialize the row checker with the expected column names.
 
         Args:
-            sample_col (str): The name of the column that contains the sample name
-                (default "sample").
-            first_col (str): The name of the column that contains the first (or only)
-                FASTQ file path (default "fastq_1").
-            second_col (str): The name of the column that contains the second (if any)
-                FASTQ file path (default "fastq_2").
-            single_col (str): The name of the new column that will be inserted and
-                records whether the sample contains single- or paired-end sequencing
-                reads (default "single_end").
+
 
         """
         super().__init__(**kwargs)
-        self._sample_col = sample_col
-        self._first_col = first_col
-        self._second_col = second_col
-        self._single_col = single_col
+        self._pairId = pairId
+        self._tumorBam = tumorBam
+        self._normalBam = normalBam
+        self._assay = assay
+        self._normalType = normalType
+        self._bedFile = bedFile
         self._seen = set()
         self.modified = []
 
@@ -69,65 +64,53 @@ def validate_and_transform(self, row):
                 (values).
 
         """
-        self._validate_sample(row)
-        self._validate_first(row)
-        self._validate_second(row)
-        self._validate_pair(row)
-        self._seen.add((row[self._sample_col], row[self._first_col]))
+        self._validate_names(row)
+        self._validate_bams(row)
+        self._validate_normalType(row)
+        self._validate_bed_format(row)
+        self._seen.add((row[self._pairId]))
         self.modified.append(row)
 
-    def _validate_sample(self, row):
-        """Assert that the sample name exists and convert spaces to underscores."""
-        if len(row[self._sample_col]) <= 0:
-            raise AssertionError("Sample input is required.")
-        # Sanitize samples slightly.
-        row[self._sample_col] = row[self._sample_col].replace(" ", "_")
+    def _validate_names(self, row):
+        """Assert that the pairId exists."""
+        if len(row[self._pairId]) <= 0:
+            raise AssertionError("pairId is required.")
+
+    def _validate_pairId_format(self, row):
+        id_value = row[self._pairId]
+        if "." in id_value:
+            raise AssertionError("pairId:{} cannot contain any periods ('.') ".format(id_value))
 
-    def _validate_first(self, row):
+    def _validate_bams(self, row):
         """Assert that the first FASTQ entry is non-empty and has the right format."""
-        if len(row[self._first_col]) <= 0:
-            raise AssertionError("At least the first FASTQ file is required.")
-        self._validate_fastq_format(row[self._first_col])
-
-    def _validate_second(self, row):
-        """Assert that the second FASTQ entry has the right format if it exists."""
-        if len(row[self._second_col]) > 0:
-            self._validate_fastq_format(row[self._second_col])
-
-    def _validate_pair(self, row):
-        """Assert that read pairs have the same file extension. Report pair status."""
-        if row[self._first_col] and row[self._second_col]:
-            row[self._single_col] = False
-            first_col_suffix = Path(row[self._first_col]).suffixes[-2:]
-            second_col_suffix = Path(row[self._second_col]).suffixes[-2:]
-            if first_col_suffix != second_col_suffix:
-                raise AssertionError("FASTQ pairs must have the same file extensions.")
-        else:
-            row[self._single_col] = True
-
-    def _validate_fastq_format(self, filename):
+        if len(row[self._tumorBam]) <= 0 or len(row[self._normalBam]) <= 0:
+            raise AssertionError("Both bam files are required.")
+        self._validate_bam_format(row[self._tumorBam])
+        self._validate_bam_format(row[self._normalBam])
+
+    def _validate_normalType(self, row):
+        """Assert that normalType exists."""
+        if len(row[self._normalType]) <= 0:
+            raise AssertionError("normalType is required.")
+
+    def _validate_bam_format(self, filename):
         """Assert that a given filename has one of the expected FASTQ extensions."""
         if not any(filename.endswith(extension) for extension in self.VALID_FORMATS):
             raise AssertionError(
-                f"The FASTQ file has an unrecognized extension: {filename}\n"
+                f"The BAM file has an unrecognized extension: {filename}\n"
                 f"It should be one of: {', '.join(self.VALID_FORMATS)}"
             )
 
-    def validate_unique_samples(self):
-        """
-        Assert that the combination of sample name and FASTQ filename is unique.
-
-        In addition to the validation, also rename all samples to have a suffix of _T{n}, where n is the
-        number of times the same sample exist, but with different FASTQ files, e.g., multiple runs per experiment.
-
-        """
-        if len(self._seen) != len(self.modified):
-            raise AssertionError("The pair of sample name and FASTQ must be unique.")
-        seen = Counter()
-        for row in self.modified:
-            sample = row[self._sample_col]
-            seen[sample] += 1
-            row[self._sample_col] = f"{sample}_T{seen[sample]}"
+    def _validate_bed_format(self, row):
+        """Assert that a given filename has one of the expected BED extensions."""
+        filename = row[self._bedFile]
+        if filename and filename != "NONE":
+            if not filename.endswith(".bed"):
+                raise AssertionError(
+                    f"The BED file has an unrecognized extension: {filename}\n"
+                    f"It should be .bed\n"
+                    f"If you would like one generated for you leave it blank or enter 'NONE'\n"
+                )
 
 
 def read_head(handle, num_lines=10):
@@ -164,10 +147,9 @@ def sniff_format(handle):
 
 def check_samplesheet(file_in, file_out):
     """
-    Check that the tabular samplesheet has the structure expected by nf-core pipelines.
+    Check that the tabular samplesheet has the structure expected by the ODIN pipeline.
 
-    Validate the general shape of the table, expected columns, and each row. Also add
-    an additional column which records whether one or two FASTQ reads were found.
+    Validate the general shape of the table, expected columns, and each row.
 
     Args:
         file_in (pathlib.Path): The given tabular samplesheet. The format can be either
@@ -179,19 +161,14 @@ def check_samplesheet(file_in, file_out):
         This function checks that the samplesheet follows the following structure,
         see also the `viral recon samplesheet`_::
 
-            sample,fastq_1,fastq_2
-            SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz
-            SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz
-            SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz,
-
-    .. _viral recon samplesheet:
-        https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv
+            pairId,tumorBam,normalBam,assay,normalType,bedFile
+            SAMPLE_TUMOR.SAMPLE_NORMAL,BAM_TUMOR,BAM_NORMAL,BAITS,NORMAL_TYPE,BED_FILE
 
     """
-    required_columns = {"sample", "fastq_1", "fastq_2"}
+    required_columns = {"pairId", "tumorBam", "normalBam", "assay", "normalType", "bedFile"}
     # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
     with file_in.open(newline="") as in_handle:
-        reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
+        reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle), delimiter=",")
         # Validate the existence of the expected header columns.
         if not required_columns.issubset(reader.fieldnames):
             req_cols = ", ".join(required_columns)
@@ -205,9 +182,7 @@ def check_samplesheet(file_in, file_out):
             except AssertionError as error:
                 logger.critical(f"{str(error)} On line {i + 2}.")
                 sys.exit(1)
-        checker.validate_unique_samples()
     header = list(reader.fieldnames)
-    header.insert(1, "single_end")
     # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
     with file_out.open(mode="w", newline="") as out_handle:
         writer = csv.DictWriter(out_handle, header, delimiter=",")
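Review note on `VALID_FORMATS`: `_validate_bam_format` iterates over the constant with `any(filename.endswith(extension) for extension in self.VALID_FORMATS)`, so the constant has to be a real tuple. In Python, parentheses around a single string do not create a tuple; only a trailing comma does. A minimal sketch of the difference follows. `has_valid_extension`, `AS_STRING`, and `AS_TUPLE` are hypothetical names used only for illustration.

```python
def has_valid_extension(filename, valid_formats):
    """Mirror the endswith check used in _validate_bam_format."""
    return any(filename.endswith(extension) for extension in valid_formats)


AS_STRING = (".bam")  # parentheses alone: this is the str ".bam", not a tuple
AS_TUPLE = (".bam",)  # the trailing comma makes a one-element tuple

# Iterating a string yields its characters: ".", "b", "a", "m".
# Any filename ending in one of those characters passes the check.
print(has_valid_extension("reads.sam", AS_STRING))  # True, because it ends with "m"
print(has_valid_extension("reads.sam", AS_TUPLE))   # False
print(has_valid_extension("reads.bam", AS_TUPLE))   # True

# The error message is affected too, since it joins the same constant.
print(", ".join(AS_STRING))  # ., b, a, m
print(", ".join(AS_TUPLE))   # .bam
```

With the one-element tuple, both the extension check and the "It should be one of" message behave as the surrounding code expects.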