This repository implements the Nextflow tutorial workflow using the Scientific Filesystem (SCIF) and Docker/Singularity to provide a reproducible research environment. Is an adaptation of snakemake.scif, but using nextflow instead. Also I have just discovered this try of @vsoch and I am adapting her cool documentation to this case.
These are the steps to follow in order to develop a new reproducible and transparent workflow:
- SCIF recipes: Discoverability and Transparency by installing our software in the container via a Scientific Filesystem. SCIF also lets us install the same dependencies across container technologies.
- Singularity/Docker container: this will handle the Reproducibility of our pipeline and the software it is going to need. Moreover Docker and Singularity HUB makes deployment and sharing a lot easier.
- Nextflow: enables scalable and reproducible scientific workflows managing software containers natively like a charm. It allows the adaptation of pipelines written in the most common scripting languages. Moreover its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters, being able to abstract the users both from where and even how they compute without any concern about the multiple dependencies a scientific workflow can have.
- Testing: circleCI (TO DO)
You will need to install nextflow and singularity. Singularity must be installed with administrative privileges, usually system wide, and nextflow can be "installed" to your present working directory. It's also recommended to define your Singularity cache to be somewhere (like a scratch space) where you have more disk space. For example:
export SINGULARITY_CACHEDIR=$SCRATCH/.singularity
If you are working on your local machine, this location defaults to $HOME/.singularity
, which
is appropriate. Once you have both, you only have to type:
# With the nextflow executable in the present working directory
./nextflow run BU-ISCIII/nextflow-scif -profile singularity
# With the nextflow executable installed system wide, and on the PATH
nextflow run BU-ISCIII/nextflow-scif -profile singularity
Thats it! When you finish, a "results" folder will appear with the content of your analysis.
$ ls results/
bwa DAG.dot DAG.svg report.html timeline.html trace.txt vcf
You can preview the
So...Let's get this done... and by "this" we are referring to a complete description of how this recipe was derived, and the components of the software that just ran. The idea here is that if you understand the underlying components, you can easily build your own nextflow pipelines.
When building pipelines, you can think of it like baking a cake. We have entire recipes for creating our final products (containers), and within those recipes ingredients (software) that we need to add. In this first part, we will talk about the three recipes in this repository, the Dockerfile for the Docker container, the Singularity recipe for a Singularity container, and the app_recipes/*.scif recipes.
A scientific filesystem is useful because it allows me to write one recipe for my various software, and then install easily in different containers or on my host. How do you know when you find a recipe? When you find a recipe for a scientific filesystem (SCIF), you will see a file with extension *.scif. For example, in this repository:
- app_recipes/*.scif are the recipes for the scientific filesystem that will be installed in both the Docker and Singularity containers to be run with nextflow. SCIF is flexible in that there can be many different internal applications defined in one file, however if we want we can put them in individual files and install them equivalently, as we show in this example.
For example, given the apps "samtools", "bwa" and "bcftools" and using a three recipes files , I would install like:
scif install app_recipes/samtools_v1.9_centos7.scif
scif install app_recipes/bwa_v0.7.17_centos7.scif
scif install app_recipes/bcftools_v1.9_centos7.scif
but I could also define the different applications in only one file and install them the next way.
scif install nextflow-scif.scif
This level of modularity is up to the user. Programatically, this install would be equivalent. This is a purely "human information" decision. If I am building a single container to share with my paper, I would opt to only need one file. If I am providing a builder service to users and want to easily install recipes in a modular fashion, I would want to do the second, which is this case. Here we are preparing a model for developing new workflows as quickly as possible, so app_recipes should contain a collection of app recipes that can be combined in the Singularity/Docker recipe for creating the needed container for the workflow being developed. In fact, probably we will convert app_recipes to a Github repository, so it will be used as remote repository in all the workflows we will develop, making its update easier.
This means that, for any SCIF recipe and a host of interest (Docker or Singularity container, or your computer) you can install the same recipes.
SCIF makes the installation to be so tidy, it makes me wanna cry, a scif folder is created in /
of your container and the tree directory will look like this:
/scif/apps/
samtools/
bin/
lib/
scif/
runscript.help
runscript
env/
01-base.sh
90-environment.sh
bwa/
....
Also SCIF will provide a separate variable environment for each app (even we are going to mess it a little bit in order to get nextflow functioning), and some cool ENV variables $SCIF_APP(BIN|ENV|...) that seem promising for a lot of functionality. You can read more on Singularity docs and SCIF docs.
Dockerfile is the recipe for building our container. You will also notice the installation is simple - we start with a container base that was equivalently used by the creator of the pipeline with system / host dependencies. In this case, I have used a base docker container of centos 7, this is because we use this SO in our lab and our HPC, and it is useful for me to have homogeneous installation recipes in both containers, my workstation and HPC (at least for now that we are figuring all this stuff out).
Then simply install the SCIF recipe to it. That comes down to these three commands:
RUN pip install scif # Install scif from pypi
ADD app_recipes/*.scif /opt # Add the recipe to the container
RUN scif install /opt/samtools.scif # Install it to the container
Now a little about the "mess", you can see that we added export PATH variable with bin folder for each app, this should not be necessary because app/bin folder is added to the PATH automaticaly when the app is run, BUT nextflow functionality expects to have access to the executables in the path or you have to provide the full path.
# Docker Recipe
ENV PATH=${PATH}:/scif/apps/samtools/bin
# Singularity Recipe
echo 'export PATH=${PATH}:/scif/apps/samtools/bin' >> $SINGULARITY_ENVIRONMENT
In order to take advantage of complete nextflow functionality I decided to assume this mistake in the use of SCIF, which will make the apps envs not being totally separated. You can read the Notes section for the pros and cons of this approximation and an insight of another posibility I don't like so much (or I consider less useful for our situation).
NOTE: it would be cool to use $SCIF_APPBIN_samtools, instead of the full path /scif/apps/samtools/bin
, but I tried and this variable must be set after PATH env variable, because I can't get it working.
The recipe file for a Singularity container is the file Singularity. The format of the recipe file is different, but installing the scientific filesystem, again from the recipes app_recipes is performed with the same commands:
pip install scif # Install scif from pypi
scif install /opt/samtools_v1.9_centos7.scif # Install it to the container
The only missing command to add the recipe to the container is because Singularity recipes allow you to do this in a %files
section that functions like a cp
%files
samtools_v1.9_centos7.scif /opt
Once we have our recipe or in order to create our installation commands with ease, we need to know how to build and interact with containers.
You can interact with a container in many different ways and for differente purposes. For example, during the development of the container, I took a strategy to start with a base, interactively shell into it, and test installation and running of things. To do this we can directly use a centos container from Dockerhub, singularity is what I prefer:
sudo singularity shell -w docker://centos:latest
Or docker if you want:
docker run -it centos
Once you have your recipe prepared you need to build your image:
With Singularity
sudo singularity build nextflow-scif.img Singularity
Or with Docker:
## . points to the path where docker will import the context for image creation. If you are in the repository path you have only to do:
docker build -t nextflow-scif .
# Dockerfile and context will be automatically imported to the docker daemon.
Command run, in Singularity and Docker will execute environment variables and the script configured in %runscript
section in Singularity or ENTRYPOINT in Dockerfile. In this case only scif is configured, being the command that will let us interact with the applications installed in the container.
NOTE: We can't define an ENTRYPOINT in docker because Nextflow doesn't support docker containers with entrypoint at the moment, it seems in its docs that have implemented some twist but is already deprecated. This means that for interacting with the applications we have to explicitly use scif command after docker run.
# Docker
docker run nextflow-scif scif
# Singularity
./nextflow-scif
singularity run nextflow-scif
The output will be something like this:
Scientific Filesystem [v0.0.71]
usage: scif [-h] [--debug] [--quiet] [--writable]
{version,pyshell,shell,preview,help,install,inspect,run,apps,dump,exec}
...
scientific filesystem tools
optional arguments:
-h, --help show this help message and exit
--debug use verbose logging to debug.
--quiet suppress print output
--writable, -w for relevant commands, if writable SCIF is needed
actions:
actions for Scientific Filesystem
{version,pyshell,shell,preview,help,install,inspect,run,apps,dump,exec}
scif actions
version show software version
pyshell Interactive python shell to scientific filesystem
shell shell to interact with scientific filesystem
preview preview changes to a filesytem
help look at help for an app, if it exists.
install install a recipe on the filesystem
inspect inspect an attribute for a scif installation
run entrypoint to run a scientific filesystem
apps list apps installed
dump dump recipe
exec execute a command to a scientific filesystem
The strength of SCIF is that it will always show you the applications installed in a container, and then provide predictable commands for inspecting, running, or otherwise interacting with them. For example, if I find the container, without any prior knowledge I can reveal the applications inside:
# Docker
docker run nextflow-scif scif apps
# Singularity
./nextflow-scif apps
# or
singularity run nextflow-scif apps
The output will look like this:
samtools
bwa
bcftools
We can also obtain help about a particular application
# Docker
docker run nextflow-scif scif help samtools
# Singularity
./nextflow-scif help samtools
# or
singularity run nextflow-scif help samtools
The output will look like this:
Samtools
and then inspecting
# Docker
docker run nextflow-scif scif inspect samtools
# Singularity
./nextflow-scif inspect samtools
# or
singularity run nextflow-scif inspect samtools
Output:
{
"samtools": {
"apprun": [
" exec samtools \"$@\""
],
"apphelp": [
" Samtools"
],
"applabels": [
"VERSION 1.9",
]
}
}
The creator of the container didn't write any complicated scripts to have this happen - the help text is just a chunk of text in a block of the recipe. The labels that are parsed to json, are also just written easily on two lines. This means that the creator can spend less time worry about exposing this. If you can write a text file, you can make your applications programatically parseable. You can check how this is done adding free text to %apphelp
and %applabel
in samtools_v1.9_centos7.scif.
NOTE: %appenv
can be also used for environment variables particular of each application. We are not using this because of nextflow functionality. See Notes.
I can easily shell into the container in the context of an application, meaning that the environment is sourced, etc. THIS IS NOT WORKING: USING SHELL OR EXEC ON A APP RUNS %RUNSCRIPT OF THE APP, TRY REMOVING THE EXPORT PATHS
# Docker
docker run -it nextflow-scif scif shell samtools
# Singularity
./nextflow-scif shell samtools
Output:
[samtools] executing /bin/bash
root@d002e338b88b:/scif/apps/samtools# env | grep PATH
LD_LIBRARY_PATH=/scif/apps/samtools/lib
PATH=/scif/apps/samtools/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Notice how I'm in the app's context (in it's application folder) and that it's bin is added to the path? I can also shell in without a specific application context, but still have all the SCIF global variables available to me.
We can still shell into the "general" container:
# Docker
docker run -it nextflow-scif shell
# Singularity
./nextflow-scif shell
# or
singularity shell nextflow-scif
```bash
Output:
WARNING No app selected, will run default ['/bin/bash'] executing /bin/bash root@055a34619d17:/scif# ls apps data
Or thanks our "mess" in which we have exported bin app folders to the path we can execute each app directly from singularity container:
docker run nextflow-scif samtools
singularity exec nextflow-scif samtools
#### Running Applications
Before we get into creating a pipeline, look how easy it is to run an application. Without scif, we would have to have known that samtools is installed, and then executed the command to the container. But with the scientific filesystem, we discovered the app (shown above) and then we can just run it. The `run` command maps to the entrypoint, as was defined by the creator:
```bash
# Docker
docker run nextflow-scif run samtools
# Singularity
./nextflow-scif run samtools
Output:
Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.18 (r982:295)
Usage: samtools <command> [options]
Command: view SAM<->BAM conversion
sort sort alignment file
mpileup multi-way pileup
depth compute the depth
faidx index/extract FASTA
tview text alignment viewer
index index alignment
idxstats BAM index stats (r595 or later)
fixmate fix mate information
flagstat simple stats
calmd recalculate MD/NM tags and '=' bases
merge merge sorted alignments
rmdup remove PCR duplicates
reheader replace BAM header
cat concatenate BAMs
targetcut cut fosmid regions (for fosmid pool only)
phase phase heterozygotes
[samtools] executing /bin/bash /scif/apps/samtools/scif/runscript
And executing any command in the context of the application is possible too: THIS IS NOT WORKING. This is equal to the issue shown with shell, due to out little twist for get nextflow working, here we have all the apps added to the PATH, but this should be the correct functioning of SCIF:
# Docker
docker run nextflow-scif scif exec samtools env | grep PATH
# Singularity
./nextflow-scif exec samtools env | grep PATH
Output:
LD_LIBRARY_PATH=/scif/apps/samtools/lib
PATH=/scif/apps/samtools/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Note that for above, you will get more output with the Singularity container, as it shares the environment with the host. Whether we are using Docker or Singularity, the actions going on internally with the scientific filesystem client are the same. Given a simple enough pipeline, we could stop here, and just issue a series of commands to run the different apps.
Once you have your container completely functional you can upload it to Docker Hub or Singularity Hub for version control and sharing. We are uploading it to Docker Hub since Singularity can easily build or run a container from there.
If you have build your docker image in your container you can do:
# See your images
docker images
# You have to create an account on Docker Hub and a new repository which is very easy.
# Then log to docker hub from the command line.
docker login --username=yourhubusername [email protected]
# You can tag your image.
docker tag bb38976d03cf buisciii/nextflow-scif:1.0
# You can push
docker push bu-isciii/nextflow-scif
Nextflow enables scalable and reproducible scientific workflows managing software containers natively like a charm. It allows the adaptation of pipelines written in the most common scripting languages. Moreover its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters, being able to abstract the users both from where and even how they compute without any concern about the multiple dependencies a scientific workflow can have. We will need a nextflow config file and a main.nf script for start using nextflow.
In the nextflow config file we can customize all parameters, environment variables, profiles, etc. we will need in our pipeline. If you want to know more about config files for nextflow you can check it here.
In this case we have created a "simple" config file to use as a template for more complex configuration, using nf-core as a model. It contains some sections with information manifest
and process default parameters.
Following that we have my favourite part where you can configure profiles
for executing your nextflow pipeline. With this you can specify -profile standard/docker/singularity
and the pipeline will run in your local machine, in your singularity image or in your docker image. Moreover you can add further configuration limiting the memory or the number of cpus (ever per process), or any custom variables you need for running in your institution HPC.
The section looks like this:
profiles {
standard {
includeConfig 'conf/base.config'
}
docker {
includeConfig 'conf/docker.config'
}
singularity {
includeConfig 'conf/singularity.config'
}
hpc_isciii {
// TODO. with modules.
}
testing {
// TODO
}
aws {
// TO DO
}
none {
// Don't load any config (for use with custom home configs)
}
}
We have separated the configuration in several config files, the singularity and docker config files will enabled singularity and docker functionality and the path/url to the container:
singularity {
enabled = true
}
process {
// Path to container
container = 'buisciii/nextflow-scif:1.0'
executor = 'local'
}
The main nextflow script main.nf creates the pipeline concatenating the output of one step with another, and managing the input and output. You can check the nextflow docs for mastering te creation of nextflow pipelines.
Here we configured a simple template for a basic pipeline. It will be useful as an start point for developing more complex pipelines, with help and logging function templates, and space for variable checking and email reporting.
Now, let's have some fun and run our precious brand new pipeline with our great containers:
This command is the simplier, it will run the pipeline with all parameters by default, and standard profile, which means that it will be run in your local machine and you need to have all software installed and in your PATH.
# Simpler one
nextflow run main.nf
Now, let's use our container with docker or with singularity:
nextflow run main.nf -profile singularity
nextflow run main.nf -profile docker
Just as simple as that, nextflow handles the use of our singularity/docker container adding docker run
or singularity exec
before each script in the process configured in main.nf!!
But this gets better because for running this pipeline you don't even need to download this repository, you only need to have installed nextflow and Singularity/Docker and type:
nextflow run BU-ISCIII/nextflow-scif -profile singularity
and nextflow will pull the github repository, download the docker image and build it into a singularity image and run your pipeline using only one line of command!!
Also thanks to scif you can also interact with the image pulled from nextflow, in this case is a singularity image build from the docker image in DockerHub and by default is saved in work/singularity
. Since this image is created from the Dockerfile without an entrypoint as we saw before we have to interact with:
# For instance inspecting bwa app
singularity exec buisciii-nextflow-scif-1.0.img scif inspect bwa
This is not much a problem because in normal HPC usage we will be using a Singularity image build locally by us with Singulary Recipe, Docker will be used more for easy sharing.
Of course this is functioning with test data included in the github repository, but you can easily supply your own reads, reference genome and output dir using parameters. You can see the help of the pipeline:
nextflow run BU-ISCIII/nextflow-scif --help
Output:
N E X T F L O W ~ version 0.29.0
Launching `BU-ISCIII/nextflow-scif` [angry_albattani] - revision: d768dfd571 [develop]
=========================================
BU-ISCIII/nextflow-scif DEMO PIPELINE v0.1
=========================================
Usage:
The typical command for running the pipeline is as follows:
nextflow run BU-ISCIII/nextflow-scif -profile standard
Pipeline arguments:
--reads Path to input data (must be surrounded with quotes).
--genome Path to reference genome.
--outdir Output dir.
--help show this message.
-profile Hardware config to use. standard/docker/singularity. Default: standard.
and run it with your own data:
nextflow run BU-ISCIII/nextflow-scif -profile singularity --reads *.fastq --genome genome.fa --outdir results
NOTE: If you are using singularity as a container you have to take into consideration that it mounts
TODO Use cirCI or Travis?? For deployment and testing
Till this point everything is working fine because we are using the data located in our current directory, and singularity mounts $PWD
by default. But what if we are running in our HPC-cluster, and our data is in /processing_Data or other custom directory?
TODO: Try in different locations, with data and reference genome in different folders. Use of --bind ? . Need to pre-create paths inside the containers?? Improvement of I/O metadata mounting only the project folder instead of all processing_Data?? Note: for each app scif/app/{app}/data folder is created. Thats a possibility for mounting. We have to make some performance test in our cluster.
- We have created a container with all the software needed for our pipeline with Singularity and Docker using the same SCI-F recipes. This will allow you to deploy new containers really quicky for new pipelines once you have created a good collection of scif apps recipes.
- Using SCI-F we can access all our apps installed separately from our container. Besides our container is no longer a "black-box" and we can obtain information about what apps are installed in our container, with its labels and help.
- Our container is also functional with the native container commands, and we can exec any command in our container without the use of SCI-F which made it flexible enough for its use with nextflow.
- We have deploy our container to DockerHub so new users don't have to build the image theirselves.
- Finally we have created a basic nextflow workflow, with three different profile configurations {standard, singularity, docker}, this way you can use the same pipeline running in your local computer with your software installed, or with the singularity or docker image. Besides, you can create more profiles for its usage in a HPC with
module
configured, or even in a AWS server.
The idea is to have a workflow independent of where you are computing it, and a container system which allows you to assure reproducibility and sharing, including the maximun information possible for its use.
In order to get this working I have to make a decision: put SCIF on the "top" or put nextflow instead. I will try to explain it:
-
Nextflow on the top:
- Nextflow provides a natively functionality for using singularity and docker. Using it you just set the commands the pipeline is going to run, and nextflow handles the use of
docker run
orsingularity exec
before this commands with the container you supplied in the configuration. - This means that the executables for each app MUST be accesible when you run/exec from the container, not each of the apps with SCI-F.
- Thus, we are being forced to add the executables to the PATH; and somehow mess up SCIF goal of separate apps environment adding the bin app folder to the PATH only when the app is run. Being this the main con of this approximation.
- On the other hand, our main.nf will remain untouch and will be able to run in a local machine, a cluster or a container without any modification. Which was I was looking for.
- For this to work your minimun needs in your system are nextflow and Docker/Singularity if you are using docker or singularity profile.
- Nextflow provides a natively functionality for using singularity and docker. Using it you just set the commands the pipeline is going to run, and nextflow handles the use of
-
SCIF on the top (I am going to make a branch testing this): - > I tried but it didn't work, I get all the apps being recognized but when the process in nextflow where going to be run the files weren't found. This is because nextflow uses a temp folder called work for all executions, and when you use scif run bwa, the command is run in ${SCIF_APPROOT} instead on nextflow folder.
- Another aproximation will be to install nextflow inside the containers. This way you can only interact with the apps commands through SCIF which will be the only entrypoint in Singularity and Docker.
- Nextflow in this case will not use Singularity or Docker profile, because it is INSIDE one, so this config parameters must be set to false.
- main.nf script would have to be modified and
scif run
must be add before each command. - The pros of this approximation will be that apps will be in separate environments as SCI-F plans to, and you won't need to have nextflow installed in your system. If you use the image, just with docker or singularity installed you are set.
- However you could only use this nextflow pipeline with the image prepare for this, or have scif installed in your local/hpc system, and all the apps you are going to use inside scif which may not be the case (as in our HPC where we don't have install permissions)
-
Alternative middle point configuration -> I have this working in feature/nextflow-scif-alt
- I install nextflow inside the container like the option above, and each app environment is separated, but I have to add all app bin folders to the path prior to run nextflow command in %apprun.
- Also I had to cd to ${SCIF_DATA}, and bind this folder with my $PWD in the singularity command.
- For more info please chech feature/nextflow-scif-alt branch in this repository, and README-ALT.md file.
I suppose this is a decision you have to make, one could think that I can use my singularity image in my HPC system and forgot about modules and stuff, I will have all the software I need, in the version I need and whenever I need it. However even if this is going to be the case (we are still testing performance, we are very newbies using containers) I always like to have flexibility, and nextflow provides me in theory that "I don't mind where I compute" flexibility.