Sendu Bala edited this page Jul 29, 2013 · 30 revisions

VRPipe is a pipeline management system. Although still under development, it is already used in production: it did the bulk of the data processing for the 1000 Genomes Project, and it continues to automate the running of software for even larger-scale sequencing projects at the Sanger Institute.

A pipeline management system lets you define a series of commands you wish to run (the 'pipeline', where each command typically corresponds to a 'step' in VRPipe's parlance). You can then put a data set through that pipeline, and the system ensures that the data is passed to each command and that the commands run successfully in the correct order. The main benefit arises when you need to run an identical (except for file paths) set of command lines on many input data files: the pipeline management system can run these independent series of commands in parallel, potentially completing the work on 1000 input files in the time it would take to work on 1 (assuming you have a 1000+ CPU cluster).
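The idea above can be sketched in miniature. This is not VRPipe's actual API (VRPipe is written in Perl and does far more, such as cluster submission and state tracking); it is only a hypothetical illustration of the two core properties: for any one input, steps run strictly in order, while independent inputs run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(steps, input_path):
    # For a single input, each step must complete before the next starts,
    # with each step's output feeding the following step.
    data = input_path
    for step in steps:
        data = step(data)
    return data

def run_datasource(steps, inputs, workers=4):
    # Independent inputs have no dependencies on one another, so they can
    # be processed concurrently -- on a real cluster, one job per input.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: run_pipeline(steps, p), inputs))

# Toy steps standing in for command lines that differ only in file paths.
steps = [lambda p: p + ".bam", lambda p: p + ".sorted"]
print(run_datasource(steps, ["sample1", "sample2"]))
# → ['sample1.bam.sorted', 'sample2.bam.sorted']
```

With 1000 inputs and enough workers, the wall-clock time approaches that of a single input, which is the parallelism benefit described above.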

Features

  • Easy to define Steps and Pipelines
  • Optimal memory reservation for jobs
  • Batching of jobs (even from different pipelines) based on their compute requirements
  • Automatic job retries on job failure
  • Quick and easy access to job errors for diagnosis
  • Quick and easy failed job resubmission (i.e. after you've fixed the problem)
  • Detailed monitoring of current status (e.g. how much of a pipeline has been completed so far)
  • Email notification on pipeline success or failure
  • Recorded job statistics (run time, memory usage)
  • Recorded history, such that given an output file VRPipe can tell you exactly how that file was made
  • Searchable output files by metadata
  • Automation for on-going projects where new input files arrive over time
  • Automatic handling of discovered mistakes in the input data, 'withdrawing' bad output files already created from the bad inputs and redoing any work necessary (for example, if a pipeline merged some good inputs and some bad inputs, the merge would get repeated with just the good inputs)
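
The withdrawal behaviour in the last feature can be sketched as follows. This is a hypothetical illustration, not VRPipe's real code: the function names and the set-based bookkeeping are invented for the example, which only shows the principle that an output derived from a later-discovered bad input is withdrawn and the step is redone with the good inputs alone.

```python
def merge(inputs):
    # Stand-in for a real merge step (e.g. merging BAM files); here we
    # just join the sorted input names so the output records its inputs.
    return "+".join(sorted(inputs))

def rerun_after_withdrawal(all_inputs, bad_inputs, outputs):
    good = [i for i in all_inputs if i not in set(bad_inputs)]
    stale = merge(all_inputs)
    if stale in outputs:
        outputs.remove(stale)   # withdraw the output built from bad input
    outputs.add(merge(good))    # redo the merge with only the good inputs
    return outputs

outputs = {merge(["a.bam", "b.bam", "c.bam"])}
outputs = rerun_after_withdrawal(["a.bam", "b.bam", "c.bam"], ["b.bam"], outputs)
print(outputs)
# → {'a.bam+c.bam'}
```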

Guides

Future Plans

  • Improve web interface
  • Complete POD documentation, and improve/extend this wiki
  • Add Postgres support