Merge branch 'master' of github.com:NCBI-Hackathons/Automatic_Packaging_Bioinformatics
ram194 committed Sep 27, 2017
2 parents c56afa2 + 907b1b4 commit d66b030
Showing 8 changed files with 491 additions and 4 deletions.
73 changes: 69 additions & 4 deletions README.md
@@ -1,16 +1,81 @@
# SPeW: SeqPipeWrap
Pitt-NCBI Hackathon Team
Amanda Poholek (Lead), Argus Sun, Zhou Fang (Ark), Natalie Rittenhouse, Rahil Sethi, Ramya Mallela

## Introduction
Taking a sequencing pipeline from individual pieces on your workstation to a seamless pipeline that can be run by any user is currently a challenge. Here, we create a proof-of-principle simple RNA-seq pipeline using separate Bash shell scripts that are linked together using Nextflow as a pipeline management tool. This is then wrapped using Docker for distribution and seamless running on other workstations without dependency issues.

SPeW is a framework for taking a NextGen Seq pipeline (such as RNA-seq, ChIP-seq or ATAC-seq) in any language and using Nextflow as a pipeline management system to create a flexible, user-friendly pipeline that can be shared in a container platform.

This project was part of the September 2017 Pitt-NCBI Hackathon in Pittsburgh, PA.

## Dependencies
Docker https://www.docker.com/

samtools http://www.htslib.org/

FastQC https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

cutadapt http://cutadapt.readthedocs.io/en/stable/guide.html

tophat2 http://ccb.jhu.edu/software/tophat/index.shtml

bowtie2 https://github.com/BenLangmead/bowtie

cufflinks http://cole-trapnell-lab.github.io/cufflinks/

R https://www.r-project.org/

NextFlow https://www.nextflow.io/

## Methods

![ScreenShot](SPeW_workflow.jpg)

To create the proof-of-principle simple RNA-seq pipeline, we started by writing simple Bash shell scripts for each step required in the analysis. These individual steps were then combined by integrating them into Nextflow. To allow seamless running on any workstation, Docker was then used to wrap the Nextflow code. By wrapping into a container such as Docker, all dependencies required for each step are automatically available on the user's workstation. The Docker image can also be used by Singularity, allowing the code to be run on a high-performance computing (HPC) cluster.
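
A sketch of the corresponding configuration is shown below (the container image matches the `nextflow.config` added in this commit; the `hpc` Singularity profile is an illustrative assumption, not something defined in this repository):

```
process.container = 'mcburton/trimming:latest'

profiles {
    standard {
        docker.enabled = true
    }
    hpc {
        singularity.enabled = true
    }
}
```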

### Using Nextflow to String together Bash Scripts
Nextflow is easily installed by using curl:

```
curl -fsSL get.nextflow.io | bash
```

After installing Nextflow, each Bash module is wrapped as a Nextflow `process`, and the processes are linked by declaring their inputs and outputs (see `trimming_aligning.nf` and `annotate_run.nf` in this commit).
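
A minimal sketch of what such a process looks like (illustrative only; `fastqc.sh`, the input path, and the channel names are placeholders, not files in this repository):

```
#!/usr/bin/env nextflow

// Hypothetical example: wrap one shell script as a Nextflow process.
params.reads = "/path/to/*.fastq.gz"
reads_ch = Channel.fromPath(params.reads)

process fastqc {
    input:
    file reads from reads_ch

    output:
    file "*_fastqc.zip" into qc_reports

    """
    bash fastqc.sh ${reads}
    """
}
```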

## Discussion Notes
### Overview

This repo was created as part of an NCBI hackathon at the University of Pittsburgh in September of 2017. During this 3-day event, several discussions occurred about the best options to accomplish our goal of creating a method to take a NextGen Seq pipeline into a flexible format that can be easily shared. This is a summary of those discussions, edited by those who were present.

Most Seq pipelines are inherently a string of pre-made programs linked together by the creator according to their personal preferences or prior experience, including comfort with specific programming languages, knowledge of what algorithms exist, and word of mouth about which option is "best" to use. This typically limits sharing the pipeline with another user, who may want to make small modifications or may need to install new software in order to run it. In addition, users may choose to run their pipeline on a local machine, in the cloud, or on an HPC. Thus, putting a pipeline into a format or framework that manages the pipeline, allows easy modifications by future users, and can be readily shared would be of tremendous use to many beginner or intermediate bioinformatics users.

### Test case

As a starting point, we generated a basic RNA-seq pipeline that was made up of individual modules in a bash shell script. These modules are as follows:
1) FastQC
2) Trimming
3) Alignment
4) Annotation
5) Differential Gene Expression (DGE)

Inputs and outputs were defined in Nextflow to link processes together (a sketch of this linking is shown below). All outputs were placed in a final directory that contained final products as well as intermediate files.
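
A sketch of how two of these modules are linked and how their outputs are collected into a results directory (a simplified illustration under assumed script names and paths, not the exact contents of `trimming_aligning.nf`):

```
params.in = "/path/to/*.fastq.gz"
read_files = Channel.fromPath(params.in)

process trimming {
    publishDir 'results', mode: 'copy'

    input:
    file reads from read_files

    output:
    file 'trimmed_*.fastq.gz' into trimmed_reads

    """
    bash trimming.sh -i ${reads}
    """
}

process alignment {
    publishDir 'results', mode: 'copy'

    input:
    file trimmed from trimmed_reads

    output:
    file 'aligned_*' into aligned_out

    """
    bash align2.sh ${trimmed}
    """
}
```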

### Workflow Management Strategy Discussion with a Group of ~25 Computational Biologists and Data Scientists

Next, we needed to determine a method to link these modules together such that inputs and outputs are managed for you and a future user could add or change a module. Our ideal scenario includes an option where the pipeline can make decisions for you: for example, given output X, the next module, for which output X becomes the input, could be either module Y or module Z. In addition, we considered how to package the pipeline. The options were either wrapping each module in its own container (i.e., Docker) and linking these together within a larger container, or building the whole pipeline and putting it into one container.

The majority of the discussion was around how to link the modules together. Several pipeline management systems were discussed, including Nextflow, Snakemake, CWL (Common Workflow Language) and Jupyter notebooks. This conversation included the entire Hackathon group (5 teams) of roughly 25 people total. Ultimately we selected Nextflow, because we think it has the best options for our needs. There was not universal agreement about this, and we believe this will be a trial-and-error process. One downside was that no one in the room had any experience with Nextflow, so part of this project was testing whether it was indeed a good option.

CWL was widely dismissed by pretty much all members present as being too labor-intensive to use. A few people with CWL experience relayed how difficult and frustrating it was to use, and the time it took to learn was considered not worth the effort.

Snakemake was dismissed as being less flexible than Nextflow. Many participants thought that it is mostly Python-oriented, although others confirmed that this is not the case.

Nextflow was chosen because it can use any language, manages inputs and outputs, and is meant to be easily wrapped in a container.

A large part of the discussion included Jupyter notebooks as an alternative to Nextflow. This was considered a good in-between for intermediate-level bioinformaticians who want to crack open the containers and customize them for particular use cases. Many raised the issue of needing to install Jupyter; however, no installation should be necessary if Jupyter runs "headless" inside a container, and Jupyter would only be used by those cracking open the containers. However, we feel it is important to be able to encompass all languages, so this option may have inherent limitations, but it may be attractive for others in the future.

Based on these discussions, we chose Nextflow, and here we report how useful and flexible it ended up being. We hope this may help other users make more informed choices about how to manage their pipelines and wrap them for distribution.

## Further Directions
We hope to add additional modules and modify the current modules, as well as add the ability to enter the pipeline at any point.
182 changes: 182 additions & 0 deletions annotate_module_v2.bash
@@ -0,0 +1,182 @@
#!/bin/bash
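# Annotation/counting module for the SPeW RNA-seq pipeline:
# - optionally assembles transcripts with Cufflinks/Cuffmerge (-a Cufflinks),
#   otherwise uses the supplied reference GTF
# - generates per-sample HTSeq counts and combines them into htseq_count.txt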

module purge
module load compiler/java/1.8.0_65-oracle
module load samtools/1.2-gcc5.2.0
module load tophat/2.1.1
module load bowtie2/2.3.2-gcc5.2.0
module load cufflinks/2.2.1
module load HTSeq/0.9.0

while getopts ha:b:p:g: option
do
case "${option}"
in
h) echo -ne "Usage:\$ bash $0 -a annotate-method -b bam_directory -p sample_phenotype -g reference_gtf\n"
echo -ne "Options\n-a: Method of annotation with values Cufflinks or Reference (default: Reference)\n"
echo -ne "-b: Input bam directory full-path(default: current working directory)\n"
echo -ne "-p: Input sample phynotype file in directory full-path(default: current working directory)\n"
echo -ne "-g: Input Reference gtf file full-path(default: current working directory)\n"
exit 1;;
a) annot=${OPTARG};;
b) bamdir=${OPTARG};;
p) sm_phtype=${OPTARG};;
g) refseq_gtf=${OPTARG};;
esac

done

if [ -z "$annot" ]
then
annot="Reference"
fi

if [ -z "$bamdir" ]
then
echo "No gtf path specified. Looking in current $(pwd)"
dir=$(pwd)
bam=$(ls ${dir}/*.bam)
fi

if [ -z "$sm_phtype" ]
then
echo "No gtf path specified. Looking in current $(pwd)"
dir=$(pwd)
sm_phtype=${dir}/sample_phenotype.txt
fi

if [ -z "$refseq_gtf" ]
then
echo "No gtf path specified. Looking in current $(pwd)"
dir=$(pwd)
refseq_gtf=$(ls ${dir}/*.gtf)
fi

#################################################

if [ $annot = "Cufflinks" ]

then

####### Assembly #################################

### cufflinks ###################

echo "Running cufflinks..."

for file in $(ls ${bamdir}/*.bam)

do

sample=$(basename $file .bam)

echo "Sample: $sample"

cufflinks_out=${sample}_cufflinks

if [ $(samtools view -c -f 1 $file) -gt 0 ]

then

cufflinks -p 1 -g $refseq_gtf -o $cufflinks_out --library-type fr-firststrand $file 2> ${cufflinks_out}.log &

else

cufflinks -p 1 -g $refseq_gtf -o $cufflinks_out $file 2> ${cufflinks_out}.log &

fi

echo "$cufflinks_out/transcripts.gtf" >> assembly_list.txt

done

wait

#### cuffmerge ###############################

cuffmerge_out=cuffmerge_output
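
# NOTE: ${index} (reference genome fasta prefix passed to cuffmerge -s) is not
# set anywhere in this script and is assumed to be defined in the environment.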

cuffmerge -s ${index}.fa -g $refseq_gtf -p 1 -o $cuffmerge_out assembly_list.txt

input_gtf=${cuffmerge_out}/merged.gtf

#####################################################

else

input_gtf=$refseq_gtf

fi

#### HTSeq counts #########################

out_count=HTSeq_count

mkdir -p $out_count

#### Checking and sorting bam files for paired-end reads

echo "Checking and sorting bam files for paired-end reads"

for file in $(ls ${bamdir}/*.bam)

do

if [ $(samtools view -c -f 1 $file) -gt 0 ]

then

mkdir -p ${bamdir}/name_sorted

sample=$(basename $file .bam)

echo "Sample: $sample"

# To run HTSeq on paired end sort bam file by name
samtools sort -n $file ${bamdir}/name_sorted/${sample}.name_sorted &

fi

done

wait

##### Running HTSeq

echo "HTSeq count"

for file in $(ls ${bamdir}/*.bam)

do

sample=$(basename $file .bam)

echo "Sample: $sample"

if [ $(samtools view -c -f 1 $file) -gt 0 ]

then

htseq-count -f bam --stranded=no ${bamdir}/name_sorted/${sample}.name_sorted.bam $input_gtf > ${out_count}/${sample}_count.no_strand.txt &

else

htseq-count -f bam --stranded=no $file $input_gtf > ${out_count}/${sample}_count.no_strand.txt &

fi

done

wait

###### Combining files

cd $out_count

echo "gene_id" > gene_id; cat *.txt | cut -f 1 | sort -u >> gene_id

while read line; do sm=$(echo "$line" | cut -f 1); echo "$sm" > ${sm}_count; cat ${sm}*.txt | sort -k 1,1 | cut -f 2 >> ${sm}_count; done < $sm_phtype

paste gene_id *_count > htseq_count.txt

############################################################
17 changes: 17 additions & 0 deletions annotate_run.nf
@@ -0,0 +1,17 @@
#!/usr/bin/env nextflow

process test {

output:
stdout result

script:
"""
annotate_module_v2.bash -a Reference -b /ihome/uchandran/ras143/ncbi-workshop/RNAseq_files -p /ihome/uchandran/ras143/ncbi-workshop/RNAseq_files/sample_phenotype.txt -g /ihome/uchandran/ras143/ncbi-workshop/RNAseq_files/mm10_genes.gtf
"""
}

result.subscribe {
println it.trim()
}
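
// Usage sketch (assumption: Nextflow is installed and the hard-coded input
// paths above exist on the target system):
//   nextflow run annotate_run.nf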

2 changes: 2 additions & 0 deletions nextflow.config
@@ -0,0 +1,2 @@
process.container = 'mcburton/trimming:latest'
docker.enabled = true
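// The two settings above run every process inside the 'mcburton/trimming'
// Docker image. On an HPC without Docker, Singularity could be enabled
// instead (assumption: Singularity is available), e.g. singularity.enabled = true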
46 changes: 46 additions & 0 deletions trimming_aligning.nf
@@ -0,0 +1,46 @@
#!/usr/bin/env nextflow
/* Testing for NCBI SpEW project, trimming & aligning modules
* Zhou (Ark) Fang [email protected]
* Argus Sun [email protected]
*/
inPath = "/zfs1/ncbi-workshop/AP/nextflow/"
inFiles = "$inPath/*.fastq.gz"
singleEnd = true
adapter1 = "ADATPER_FWD"
adapter2 = "ADATPER_REV"

inFiles

params.in = "/zfs1/ncbi-workshop/AP/RNAseq/*.fastq"
sequences = file(params.in)
/* records = "/zfs1/ncbi-workshop/AP/RNAseq/AlignedReads/" */




process trimming{
script:
if (singleEnd==true)
"""
bash trimming.sh -i ${inPath} -s -a1 ${adapter1} -a2 ${adapter2}
"""
else
"""
bash /zfs1/ncbi-workshop/AP/nextflow/trimming.sh -i ${inPath} -a1 ${adapter1} -a2 ${adapter2}
"""
}



process alignment {

input:
file '*.fastq.gz'

output:
file 'aligned_*' into records

"""
bash /zfs1/ncbi-workshop/AP/nextflow/align2.sh
"""
}
52 changes: 52 additions & 0 deletions trimming_aligning2.nf
@@ -0,0 +1,52 @@
#!/usr/bin/env nextflow
/* Testing for NCBI SpEW project, trimming module
* Zhou (Ark) Fang [email protected]
*/
inPath = "/zfs1/ncbi-workshop/AP/nextflow/"
inFiles = "$inPath/*.fastq.gz"
singleEnd = true
adapter1 = "ADATPER_FWD"
adapter2 = "ADATPER_REV"

inFiles

params.in = "/zfs1/ncbi-workshop/AP/RNAseq/*.fastq"
sequences = file(params.in)
/* records = "/zfs1/ncbi-workshop/AP/RNAseq/AlignedReads/" */

process modules{
script"
"""
bash module load bowtie2
bash module load tophat
bash module load cutadapt
"""
}
}rocess trimming{
script:
if (singleEnd==true)
"""
bash trimming.sh -i ${inPath} -s -a1 ${adapter1} -a2 ${adapter2}
"""
else
"""
bash /zfs1/ncbi-workshop/AP/nextflow/trimming.sh -i ${inPath} -a1 ${adapter1} -a2 ${adapter2}
"""
}
process alignment {
input:
file '*.fastq.gz'
output:
file 'aligned_*' into records
"""
bash /zfs1/ncbi-workshop/AP/nextflow/align2.sh
"""
}