Merge branch 'master' of github.com:NCBI-Hackathons/Automatic_Packaging_Bioinformatics
ram194 committed Sep 27, 2017
2 parents c56afa2 + 907b1b4 commit d66b030
Showing 8 changed files with 491 additions and 4 deletions.
73 changes: 69 additions & 4 deletions README.md
@@ -1,16 +1,81 @@
# SPeW: SeqPipeWrap
Pitt-NCBI Hackathon Team
Amanda Poholek (Lead), Argus Sun, Zhou Fang (Ark), Natalie Rittenhouse, Rahil Sethi, Ramya Mallela

## Introduction
Taking a sequencing pipeline from individual pieces on your workstation to a seamless pipeline that can be run by any user is currently a challenge. Here, we create a proof-of-principle simple RNA-seq pipeline using separate Bash shell scripts that are linked together using Nextflow as a pipeline management tool. This is then wrapped using Docker for distribution and seamless running on other workstations without dependency issues.

SPeW is a framework for taking a NextGen Seq pipeline (such as RNA-seq, ChIP-seq or ATAC-seq) in any language and using Nextflow as a pipeline management system to create a flexible, user-friendly pipeline that can be shared in a container platform.

This project was part of the September 2017 Pitt-NCBI Hackathon in Pittsburgh, PA.

## Dependencies
Docker https://www.docker.com/

samtools http://www.htslib.org/

FastQC https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

cutadapt http://cutadapt.readthedocs.io/en/stable/guide.html

tophat2 http://ccb.jhu.edu/software/tophat/index.shtml

bowtie2 https://github.com/BenLangmead/bowtie

cufflinks http://cole-trapnell-lab.github.io/cufflinks/

R https://www.r-project.org/

NextFlow https://www.nextflow.io/

## Methods

![ScreenShot](SPeW_workflow.jpg)

To create the proof-of-principle simple RNA-seq pipeline, we started by writing simple Bash shell scripts for each step required in the analysis. These individual steps were then combined by integrating them into Nextflow. To allow seamless running on any workstation, Docker was then used to wrap the Nextflow code. By wrapping into a container such as Docker, all dependencies required for each step are automatically available on the user's workstation. The Docker image can also be used by Singularity, allowing the code to be run on a high-performance computing (HPC) cluster.
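
A sketch of the corresponding configuration is shown below (the container image matches the `nextflow.config` added in this commit; the `hpc` Singularity profile is an illustrative assumption, not something defined in this repository):

```
process.container = 'mcburton/trimming:latest'

profiles {
    standard {
        docker.enabled = true
    }
    hpc {
        singularity.enabled = true
    }
}
```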

### Using Nextflow to String together Bash Scripts
Nextflow is easily installed by using curl:

```
curl -fsSL get.nextflow.io | bash
```

After installing Nextflow, each Bash module is wrapped as a Nextflow `process`, and the processes are linked by declaring their inputs and outputs (see `trimming_aligning.nf` and `annotate_run.nf` in this commit).
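
A minimal sketch of what such a process looks like (illustrative only; `fastqc.sh`, the input path, and the channel names are placeholders, not files in this repository):

```
#!/usr/bin/env nextflow

// Hypothetical example: wrap one shell script as a Nextflow process.
params.reads = "/path/to/*.fastq.gz"
reads_ch = Channel.fromPath(params.reads)

process fastqc {
    input:
    file reads from reads_ch

    output:
    file "*_fastqc.zip" into qc_reports

    """
    bash fastqc.sh ${reads}
    """
}
```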

## Discussion Notes
### Overview

This repo was created as part of an NCBI hackathon at the University of Pittsburgh in September of 2017. During this 3-day event, several discussions occurred about the best options to accomplish our goal of creating a method to take a NextGen Seq pipeline into a flexible format that can be easily shared. This is a summary of those discussions, edited by those who were present.

Most Seq pipelines are inherently a string of pre-made programs linked together by the creator according to their personal preferences or prior experience, including comfort with specific programming languages, knowledge of what algorithms exist, and word of mouth about which option is "best" to use. This typically limits sharing the pipeline with another user, who may want to make small modifications or may need to install new software in order to run it. In addition, users may choose to run their pipeline on a local machine, in the cloud, or on an HPC. Thus, putting a pipeline into a format or framework that manages the pipeline, allows easy modifications by future users, and can be readily shared would be of tremendous use to many beginner or intermediate bioinformatics users.

### Test case

As a starting point, we generated a basic RNA-seq pipeline that was made up of individual modules in a bash shell script. These modules are as follows:
1) FastQC
2) Trimming
3) Alignment
4) Annotation
5) Differential Gene Expression (DGE)

Inputs and outputs were defined in Nextflow to link processes together (a sketch of this linking is shown below). All outputs were placed in a final directory that contained final products as well as intermediate files.
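
A sketch of how two of these modules are linked and how their outputs are collected into a results directory (a simplified illustration under assumed script names and paths, not the exact contents of `trimming_aligning.nf`):

```
params.in = "/path/to/*.fastq.gz"
read_files = Channel.fromPath(params.in)

process trimming {
    publishDir 'results', mode: 'copy'

    input:
    file reads from read_files

    output:
    file 'trimmed_*.fastq.gz' into trimmed_reads

    """
    bash trimming.sh -i ${reads}
    """
}

process alignment {
    publishDir 'results', mode: 'copy'

    input:
    file trimmed from trimmed_reads

    output:
    file 'aligned_*' into aligned_out

    """
    bash align2.sh ${trimmed}
    """
}
```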

### Workflow Management Strategy Discussion with a Group of ~25 Computational Biologists and Data Scientists

Next, we needed to determine a method to link these modules together such that inputs and outputs are managed for you and a future user could add or change a module. Our ideal scenario includes an option where the pipeline can make decisions for you: for example, given output X, the next module, for which output X becomes the input, could be either module Y or module Z. In addition, we considered how to package the pipeline. The options were either wrapping each module in its own container (i.e., Docker) and linking these together within a larger container, or building the whole pipeline and putting it into one container.

The majority of the discussion was around how to link the modules together. Several pipeline management systems were discussed, including Nextflow, Snakemake, CWL (Common Workflow Language) and Jupyter notebooks. This conversation included the entire Hackathon group (5 teams) of roughly 25 people total. Ultimately we selected Nextflow, because we think it has the best options for our needs. There was not universal agreement about this, and we believe this will be a trial-and-error process. One downside was that no one in the room had any experience with Nextflow, so part of this project was testing whether it was indeed a good option.

CWL was widely dismissed by pretty much all members present as being too labor-intensive to use. A few people with CWL experience relayed how difficult and frustrating it was to use, and the time it took to learn was considered not worth the effort.

Snakemake was dismissed as being less flexible than Nextflow. Many participants thought that it is mostly Python-oriented, although others confirmed that this is not the case.

Nextflow was chosen because it can use any language, manages inputs and outputs, and is meant to be easily wrapped in a container.

A large part of the discussion included Jupyter notebooks as an alternative to Nextflow. This was considered a good in-between for intermediate-level bioinformaticians who want to crack open the containers and customize them for particular use cases. Many raised the issue of needing to install Jupyter; however, no installation should be necessary if Jupyter runs "headless" inside a container, and Jupyter would only be used by those cracking open the containers. However, we feel it is important to be able to encompass all languages, so this option may have inherent limitations, but it may be attractive for others in the future.

Based on these discussions, we chose Nextflow, and here we report how useful and flexible it ended up being. We hope this may help other users make more informed choices about how to manage their pipelines and wrap them for distribution.

## Further Directions
We hope to add additional modules and modify the current modules, as well as add the ability to enter the pipeline at any point.
182 changes: 182 additions & 0 deletions annotate_module_v2.bash
@@ -0,0 +1,182 @@
#!/bin/bash
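# Annotation/counting module for the SPeW RNA-seq pipeline:
# - optionally assembles transcripts with Cufflinks/Cuffmerge (-a Cufflinks),
#   otherwise uses the supplied reference GTF
# - generates per-sample HTSeq counts and combines them into htseq_count.txt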

module purge
module load compiler/java/1.8.0_65-oracle
module load samtools/1.2-gcc5.2.0
module load tophat/2.1.1
module load bowtie2/2.3.2-gcc5.2.0
module load cufflinks/2.2.1
module load HTSeq/0.9.0

while getopts ha:b:p:g: option
do
case "${option}"
in
h) echo -ne "Usage:\$ bash $0 -a annotate-method -b bam_directory -p sample_phenotype -g reference_gtf\n"
echo -ne "Options\n-a: Method of annotation with values Cufflinks or Reference (default: Reference)\n"
echo -ne "-b: Input bam directory full-path(default: current working directory)\n"
echo -ne "-p: Input sample phynotype file in directory full-path(default: current working directory)\n"
echo -ne "-g: Input Reference gtf file full-path(default: current working directory)\n"
exit 1;;
a) annot=${OPTARG};;
b) bamdir=${OPTARG};;
p) sm_phtype=${OPTARG};;
g) refseq_gtf=${OPTARG};;
esac

done

if [ -z "$annot" ]
then
annot="Reference"
fi

if [ -z "$bamdir" ]
then
echo "No gtf path specified. Looking in current $(pwd)"
dir=$(pwd)
bam=$(ls ${dir}/*.bam)
fi

if [ -z "$sm_phtype" ]
then
echo "No gtf path specified. Looking in current $(pwd)"
dir=$(pwd)
sm_phtype=${dir}/sample_phenotype.txt
fi

if [ -z "$refseq_gtf" ]
then
echo "No gtf path specified. Looking in current $(pwd)"
dir=$(pwd)
refseq_gtf=$(ls ${dir}/*.gtf)
fi

#################################################

if [ $annot = "Cufflinks" ]

then

####### Assembly #################################

### cufflinks ###################

echo "Running cufflinks..."

for file in $(ls ${bamdir}/*.bam)

do

sample=$(basename $file .bam)

echo "Sample: $sample"

cufflinks_out=${sample}_cufflinks

if [ $(samtools view -c -f 1 $file) -gt 0 ]

then

cufflinks -p 1 -g $refseq_gtf -o $cufflinks_out --library-type fr-firststrand $file 2> ${cufflinks_out}.log &

else

cufflinks -p 1 -g $refseq_gtf -o $cufflinks_out $file 2> ${cufflinks_out}.log &

fi

echo "$cufflinks_out/transcripts.gtf" >> assembly_list.txt

done

wait

#### cuffmerge ###############################

cuffmerge_out=cuffmerge_output
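
# NOTE: ${index} (reference genome fasta prefix passed to cuffmerge -s) is not
# set anywhere in this script and is assumed to be defined in the environment.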

cuffmerge -s ${index}.fa -g $refseq_gtf -p 1 -o $cuffmerge_out assembly_list.txt

input_gtf=${cuffmerge_out}/merged.gtf

#####################################################

else

input_gtf=$refseq_gtf

fi

#### HTSeq counts #########################

out_count=HTSeq_count

mkdir -p $out_count

#### Checking and sorting bam files for paired-end reads

echo "Checking and sorting bam files for paired-end reads"

for file in $(ls ${bamdir}/*.bam)

do

if [ $(samtools view -c -f 1 $file) -gt 0 ]

then

mkdir -p ${bamdir}/name_sorted

sample=$(basename $file .bam)

echo "Sample: $sample"

# To run HTSeq on paired end sort bam file by name
samtools sort -n $file ${bamdir}/name_sorted/${sample}.name_sorted &

fi

done

wait

##### Running HTSeq

echo "HTSeq count"

for file in $(ls ${bamdir}/*.bam)

do

sample=$(basename $file .bam)

echo "Sample: $sample"

if [ $(samtools view -c -f 1 $file) -gt 0 ]

then

htseq-count -f bam --stranded=no ${bamdir}/name_sorted/${sample}.name_sorted.bam $input_gtf > ${out_count}/${sample}_count.no_strand.txt &

else

htseq-count -f bam --stranded=no $file $input_gtf > ${out_count}/${sample}_count.no_strand.txt &

fi

done

wait

###### Combining files

cd $out_count

echo "gene_id" > gene_id; cat *.txt | cut -f 1 | sort -u >> gene_id

while read line; do sm=$(echo "$line" | cut -f 1); echo "$sm" > ${sm}_count; cat ${sm}*.txt | sort -k 1,1 | cut -f 2 >> ${sm}_count; done < $sm_phtype

paste gene_id *_count > htseq_count.txt

############################################################
17 changes: 17 additions & 0 deletions annotate_run.nf
@@ -0,0 +1,17 @@
#!/usr/bin/env nextflow

process test {

output:
stdout result

script:
"""
annotate_module_v2.bash -a Reference -b /ihome/uchandran/ras143/ncbi-workshop/RNAseq_files -p /ihome/uchandran/ras143/ncbi-workshop/RNAseq_files/sample_phenotype.txt -g /ihome/uchandran/ras143/ncbi-workshop/RNAseq_files/mm10_genes.gtf
"""
}

result.subscribe {
println it.trim()
}
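
// Usage sketch (assumption: Nextflow is installed and the hard-coded input
// paths above exist on the target system):
//   nextflow run annotate_run.nf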

2 changes: 2 additions & 0 deletions nextflow.config
@@ -0,0 +1,2 @@
process.container = 'mcburton/trimming:latest'
docker.enabled = true
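// The two settings above run every process inside the 'mcburton/trimming'
// Docker image. On an HPC without Docker, Singularity could be enabled
// instead (assumption: Singularity is available), e.g. singularity.enabled = true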
46 changes: 46 additions & 0 deletions trimming_aligning.nf
@@ -0,0 +1,46 @@
#!/usr/bin/env nextflow
/* Testing for NCBI SpEW project, trimming & aligning modules
* Zhou (Ark) Fang [email protected]
* Argus Sun [email protected]
*/
inPath = "/zfs1/ncbi-workshop/AP/nextflow/"
inFiles = "$inPath/*.fastq.gz"
singleEnd = true
adapter1 = "ADATPER_FWD"
adapter2 = "ADATPER_REV"

inFiles

params.in = "/zfs1/ncbi-workshop/AP/RNAseq/*.fastq"
sequences = file(params.in)
/* records = "/zfs1/ncbi-workshop/AP/RNAseq/AlignedReads/" */




process trimming{
script:
if (singleEnd==true)
"""
bash trimming.sh -i ${inPath} -s -a1 ${adapter1} -a2 ${adapter2}
"""
else
"""
bash /zfs1/ncbi-workshop/AP/nextflow/trimming.sh -i ${inPath} -a1 ${adapter1} -a2 ${adapter2}
"""
}



process alignment {

input:
file '*.fastq.gz'

output:
file 'aligned_*' into records

"""
bash /zfs1/ncbi-workshop/AP/nextflow/align2.sh
"""
}
52 changes: 52 additions & 0 deletions trimming_aligning2.nf
@@ -0,0 +1,52 @@
#!/usr/bin/env nextflow
/* Testing for NCBI SpEW project, trimming module
* Zhou (Ark) Fang [email protected]
*/
inPath = "/zfs1/ncbi-workshop/AP/nextflow/"
inFiles = "$inPath/*.fastq.gz"
singleEnd = true
adapter1 = "ADATPER_FWD"
adapter2 = "ADATPER_REV"

inFiles

params.in = "/zfs1/ncbi-workshop/AP/RNAseq/*.fastq"
sequences = file(params.in)
/* records = "/zfs1/ncbi-workshop/AP/RNAseq/AlignedReads/" */

process modules{
script"
"""
bash module load bowtie2
bash module load tophat
bash module load cutadapt
"""
}
}rocess trimming{
script:
if (singleEnd==true)
"""
bash trimming.sh -i ${inPath} -s -a1 ${adapter1} -a2 ${adapter2}
"""
else
"""
bash /zfs1/ncbi-workshop/AP/nextflow/trimming.sh -i ${inPath} -a1 ${adapter1} -a2 ${adapter2}
"""
}
process alignment {
input:
file '*.fastq.gz'
output:
file 'aligned_*' into records
"""
bash /zfs1/ncbi-workshop/AP/nextflow/align2.sh
"""
}