From 731a2b4a92a28c5ff35058071ce5551fed808d62 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Fri, 3 Jul 2020 17:15:37 +0200 Subject: [PATCH 01/22] HPC: add a draft write-up of datalad run with HTCondor --- docs/beyond_basics/101-170-dataladrun.rst | 378 ++++++++++++++++++++++ docs/beyond_basics/basics-hpc.rst | 12 + docs/beyond_basics/intro.rst | 1 + 3 files changed, 391 insertions(+) create mode 100644 docs/beyond_basics/101-170-dataladrun.rst create mode 100644 docs/beyond_basics/basics-hpc.rst diff --git a/docs/beyond_basics/101-170-dataladrun.rst b/docs/beyond_basics/101-170-dataladrun.rst new file mode 100644 index 000000000..8115f63de --- /dev/null +++ b/docs/beyond_basics/101-170-dataladrun.rst @@ -0,0 +1,378 @@ +.. _runhpc: + +DataLad-centric analysis with job scheduling and parallel computing +------------------------------------------------------------------- + +This section is a write-up of how DataLad can be used on a scientific computational cluster with a job scheduler for reproducible and FAIR data analyses at scale. +More concretely, it shows an example of containerized `fMRIprep `_ preprocessing on the `eNKI `_ neuroimaging dataset, scheduled with `HTCondor `_. +While the choice of containerized pipeline and job scheduler are specific in this example, the general setup is generic and could be used with any containerized pipeline and any job scheduling system. + +Why job scheduling? +^^^^^^^^^^^^^^^^^^^ + +On scientific compute clusters, job scheduling systems such as `HTCondor `_ or `slurm `_ are used to distribute computational jobs across the available computing infrastructure and manage the overall workload of the cluster. +This allows for efficient and fair use of available resources across a group of users, and it brings the potential for highly parallelized computations of jobs and thus vastly faster analyses. + +One common way to use a job scheduler, for example, is to process all subjects of a dataset independently and as parallel as the current workload of the compute cluster allows instead of serially (i.e., "one after the other"). +In such a setup, each subject-specific analysis becomes a single job, and the job scheduler fits as many jobs as it can on available :term:`compute node`\s. +If a large analysis can be split into many independent jobs, using a job scheduler to run them in parallel thus yields great performance advantages in addition to fair compute resource distribution across all users. + +.. findoutmore:: How is a job scheduler used? + + Depending on the job scheduler your system is using, the looks of your typical job scheduling differ, but the general principle is the same. + + Typically, a job scheduler is used *non-interactively*, and a *job* (i.e., any command or series of commands you want run) is *submitted* to the scheduler. + This submission starts with a "submit" command of the given job scheduler (such as ``condor_submit`` for HTCondor or ``sbatch`` for slurm) followed by a command, script, or *batch/submit-file* that contains job definitions and (potentially) compute resource requirements. + + The job scheduler takes the submitted jobs, *queues* them up in a central queue, and monitors the available compute resources (i.e., :term:`compute node`\s) of the cluster. + As soon as a computational resource is free, it matches a job from the queue to the available resource and computes the job on this node. + Usually, a single submission queues up multiple (dozens, hundreds, or thousands of) jobs. 
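+   To make this more concrete, here is a minimal, purely illustrative sketch of two such submissions -- the file names are placeholders, and the exact resource options depend on your cluster's configuration::
+
+      # submit a script as a job with slurm, requesting one CPU and 4GB of memory
+      $ sbatch --cpus-per-task=1 --mem=4G my_job.sh
+      # the HTCondor equivalent: a submit file describes the executable, its arguments, and resource requirements
+      $ condor_submit my_job.submit
+
+   Afterwards, tools such as ``squeue`` (slurm) or ``condor_q`` (HTCondor) display the state of your queued jobs.
+   The concrete submit file used for the analysis in this section is shown further below.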
+ +Where are the difficulties in parallel computing with DataLad? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In order to capture as much provenance as possible, analyses are best ran with a :command:`datalad run` or :command:`datalad containers-run` command, as these commands can capture and link all relevant components of an analysis, starting from code and results to input data and computational environment. + +Note, though, that when parallelizing jobs and computing them with provenance capture, *each individual job* needs to be wrapped in a ``run`` command, and not only the submission of the jobs to the job scheduler -- and this requires multiple parallel ``run`` commands on the same dataset. +Multiple simultaneous ``datalad (containers-)run`` invocations in the same dataset are, however, problematic: + +- Operations carried out during one :command:`run` command can lead to modifications that prevent a second, slightly later ``run`` command from being started +- The :command:`datalad save` command at the end of of :command:`datalad run` could save modifications that originate from a different job, leading to mis-associated provenance +- A number of *concurrency issues*, unwanted interactions of processes when they run simultaneously, can arise and lead to internal command failures + +Some of these problems can be averted by invoking the ``(containers-)run`` command with the ``--explicit`` [#f1]_ flag. +This doesn't solve all of the above problems, though, and may not be applicable to the computation at hand -- for example because all jobs write to a similar file or the result files are not known beforehand. +Below, a complete, largely platform and scheduling-system agnostic containerized analysis workflow is outlined that addressed the outlined problems. + +Processing FAIRly *and* in parallel +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. note:: + + FAIR *and* parallel processing requires out-of-the-box thinking, and many creative approaches can lead to success. + Here is **one** approach that led to a provenance-tracked, computationally reproducible, and parallel preprocessing workflow, but many more can work. + `We are eager to hear about yours `_. + +The key to the success of this workflow lies in creating it completely job-scheduling and platform agnostic, such that it can be deployed as a subject-specific job anywhere, with any job scheduling system. +Instead of computing job results in the same dataset over all jobs, temporary, :term:`ephemeral clone`\s are created to hold individual, subject-specific results, and those results are pushed back into the target dataset in the end. +The "creative" bits involved in this parallelized processing workflow boiled down to the following tricks: + +- Individual jobs (in this case, subject-specific analyses) are computed in throw-away dataset clones to avoid unwanted interactions between ``save`` commands. +- Moreover, beyond computing in job-specific, temporary locations, individual job results are also saved into uniquely identified :term:`branch`\es to enable simple pushing back of the results into the target dataset. +- The jobs constitute a complete DataLad-centric workflow in the form of a simple bash script, including dataset build-up and tear-down routines in a throw-away location, result computation, and result publication back to the target dataset. 
Thus, instead of submitting a ``datalad run`` command to the job scheduler, the job submission is a single script, and this submission is easily adapted to various job scheduling call formats. +- Right after successful job termination, the target dataset contains as many :term:`branch`\es as jobs, with each branch containing the results of one job. A manual :term:`merge` aggregates all results into the :term:`master` branch of the dataset. + +Walkthrough +^^^^^^^^^^^ + +The goal of the following analysis was standard preprocessing using `fMRIprep `_ on neuroimaging data of 1300 subjects in the `eNKI `_ dataset. +In order to associate input data, containerized pipeline, and outputs, the analysis was carried out in a DataLad dataset and with the :command:`datalad containers-run` command. +Here's a walkthrough of what was done and how. + +Starting point: Datasets for software and input data +"""""""""""""""""""""""""""""""""""""""""""""""""""" + +At the beginning of this endeavour, two important analysis components already exist as DataLad datasets: + +1. The input data +2. The containerized pipeline + +Following the :ref:`YODA principles `, each of these components is a standalone dataset. +While the input dataset creation is straightforwards, some thinking went into the creation of containerized pipeline dataset to set it up in a way that allows it to be installed as a subdataset and invoked from the superdataset. +If you are interested in this, find the details in the findoutmore below. + +.. findoutmore:: pipeline dataset creation + + We start with a dataset:: + + $ datalad create pipelines + [INFO ] Creating a new annex repo at /data/projects/enki/pipeline + create(ok): /data/projects/enki/pipeline (dataset) + $ cd pipelines + + As one of tools used in the pipeline, `freesurfer `_, requires a license file, this license file needs to be added into the dataset. + Only then can this dataset be moved around flexibly and also to different machines. + In order to have the license file available right away, it is saved ``--to-git`` and not annexed [#f2]_:: + + $ cp . + $ datalad save --to-git -m "add freesurfer license file" fs-license.txt + + Finally, we add a container with the pipeline to the dataset using :command:`datalad containers-add` [#f3]_. + The important part is the configuration of the container -- it has to be done in a way that makes the container usable in any superdataset the pipeline dataset. + + Depending on how the container needs to be called, the configuration differs. + In the case of an fMRIprep run, we want to be able to invoke the container from a superdataset. + The superdataset contains input data and ``pipelines`` dataset as subdatasets, and will collect all of the results. + Thus, these are arguments we want to supply the invocation with (following `fMRIprep's documentation `_) during a ``containers-run`` command:: + + $ datalad containers-run \ + [...] + \ + --n_cpus \ + --participant-label \ + [...] + + Note how this list does not include bind-mounts of the necessary directories or of the freesurfer license -- this makes the container invocation convenient and easy for any user. + Starting an fMRIprep run requires only a ``datalad containers-run`` with all of the desired fMRIprep options. + + This convenience for the user requires that all of the bind-mounts should be taken care of -- in a generic way -- in the container call specification, though. 
+ Here is how this is done:: + + $ datalad containers-add fmriprep \ + --url TODO \ + --call-fmt singularity run --cleanenv -B "$PWD" {img} {cmd} --fs-license-file "$PWD/{img_dspath}/freesurfer_license.txt" + + During a :command:`datalad containers-run` command, the ``--call-fmt`` specification will be used to call the container. + The placeholders ``{img}`` and ``{cmd}`` will be replaced with the container (``{img}``) and the command given to ``datalad containers-run`` (``{cmd}``). + Thus, the ``--cleanenv`` flag (`recommended by fMRIprep `_) as well as bind-mounts are handled prior to the container invocation, and the ``--fs-license-file`` option with a path to the license file within the container is appended to the command. + Bind-mounting the working directory (``-B "$PWD"``) makes sure to bind mount the directory from which the container is being called, which should be the superdataset that contains input data and ``pipelines`` subdataset. + With these bind-mounts, input data and the freesurfer license file within ``pipelines`` are available in the container. + + With such a setup, the ``pipelines`` dataset can be installed in any dataset and will work out of the box. + +Analysis dataset setup +"""""""""""""""""""""" + +An analysis dataset consists of the following components: + +- input data as a subdataset +- ``pipelines`` container dataset as a subdataset +- subdatasets to hold the results + +Following the benchmarks and tips in the chapter :ref:`chapter_gobig`, the amount of files produced by fMRIprep on 1300 subjects requires two datasets to hold them. +In this particular computation, following the naming scheme and structure of fMRIpreps output directories, one subdataset is created for the `freesurfer `_ results of fMRIprep in a subdataset called ``freesurfer``, and one for the minimally preprocessed input data in a subdataset called ``fmriprep``. + +Here is an overview of the directory structure in the superdataset:: + + superds + ├── code # directory + │   └── pipelines # subdataset with fMRIprep + ├── fmriprep # subdataset for results + ├── freesurfer # subdataset for results + └── sourcedata # subdataset with BIDS-formatted data + ├── sourcedata # subdataset with raw data + ├── sub-A00008326 # directory + ├── sub-... + + +Workflow script +""""""""""""""" + +The general complexity of concurrent ``datalad (containers-)run`` commands arises when they are carried out in the same dataset. +Therefore, the strategy is to create throw-away dataset clone for all jobs. + +.. findoutmore:: how does one create throw-away clones? + + One way to do this are :term:`ephemeral clone`\s, an alternative is to make :term:`git-annex` disregard the datasets annex completely using ``git annex dead here``. + +This involves a build-up and tear-down routine for each job: Clone the analysis dataset hierarchy into a temporary location, run the computation, push the results, remove temporary dataset [#f4]_. + +To give you a first idea, a sketch of this is below. Fine-tuning and the complete script are shown in the findoutmore afterwards:: + + # everything is running under /tmp inside a compute job, /tmp is a performant local filesystem + $ cd /tmp + + # clone the superdataset + $ datalad clone /data/project/enki/superds ds + $ cd ds + + # get first-level subdatasets + $ datalad get -n -r -R1 . 
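+   # (-n = --no-data: only install the subdatasets themselves, do not fetch any file content yet)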
+ + # make git-annex disregard the clones - they are meant to be thrown away + $ git submodule foreach --recursive git annex dead here + + # checkout unique branches (names derived from job IDs) in both subdatasets + # to enable pushing the results without interference from other jobs + $ git -C fmriprep checkout -b "job-$JOBID" + $ git -C freesurfer checkout -b "job-$JOBID" + + # call fmriprep with datalad containers-run + $ datalad containers-run \ + -m "fMRIprep $subid" \ + --explicit \ + -o freesurfer -o fmriprep \ + -i "$1" \ + -n code/pipelines/fmriprep \ + sourcedata . participant \ + --n_cpus 1 \ + --skip-bids-validation \ + -w .git/tmp/wdir \ + --participant-label "$subid" \ + --random-seed 12345 \ + --skull-strip-fixed-seed \ + --md-only-boilerplate \ + --output-spaces MNI152NLin6Asym \ + --use-aroma \ + --cifti-output + + # push back the results + $ datalad push -d fmriprep --to origin + $ datalad push -d freesurfer --to origin + # job handler should clean up workspace + +Pending a few yet missing safe guards against concurrency issues and to enable re-running computations, such a script can be submitted to any job scheduler with a subject ID and a job ID as identifiers for the fMRIprep run and branch names. + +.. findoutmore:: Fine-tuning: Enable re-running and safe-guard concurrency issues + + Two important fine-tunings are missing: + For one, cloning and pushing *can* still run into concurrency issues in the case when one job clones the original dataset while another job is currently pushing into this dataset. + Therefore, a trick can make sure that no two clone or push commands are executed at the same time. + This trick uses `file locking `_, in particular the tool `flock `_, to prevent exactly concurrent processes. + This is done by prepending ``clone`` and ``push`` commands with ``flock --verbose $DSLOCKFILE``, where ``$DSLOCKFILE`` is a textfile placed into ``.git/`` at the time of job submission (further details in the submit file in the next section) + + The second issue concerns the ability to rerun a computation quickly: + If fMRIprep finds preexisting results, it will fail to run. + Therefore, all outputs of a job are attempted to be removed before the jobs is started [#f5]_:: + + (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) + (cd freesurfer && rm -rf fsaverage "$subid") + + With this in place, the only things missing are a :term:`shebang` at the top of the script, and some shell settings for robust scripting with verbose log files (``set -e -u -x``). + You can find the full script with rich comments in the next findoutmore. + +.. findoutmore:: See the complete bash script + + This script is placed in ``code/fmriprep_participant_job``: + + .. 
code-block:: bash + + #!/bin/bash + + # fail whenever something is fishy, use -x to get verbose logfiles + set -e -u -x + + # we pass in "sourcedata/sub-...", extract subject id from it + subid=$(basename $1) + + # this is all running under /tmp inside a compute job, /tmp is a performant + # local filesystem + cd /tmp + # get the output dataset, which includes the inputs as well + # flock makes sure that this does not interfere with another job + # finishing at the same time, and pushing its results back + # importantly, we clone from the location that we want to push the + # results too + flock --verbose $DSLOCKFILE \ + datalad clone /data/project/enki/super ds + + # all following actions are performed in the context of the superdataset + cd ds + # obtain all first-level subdatasets: + # dataset with fmriprep singularity container and pre-configured + # pipeline call; also get the output dataset to prep them for output + # consumption, we need to tune them for this particular job, sourcedata + # important: because we will push additions to the result datasets back + # at the end of the job, the installation of these result datasets + # must happen from the location we want to push back too + datalad get -n -r -R1 . + # let git-annex know that we do not want to remember any of these clones + # (we could have used an --ephemeral clone, but that might deposite data + # of failed jobs at the origin location, if the job runs on a shared + # filesystem -- let's stay self-contained) + git submodule foreach --recursive git annex dead here + + # checkout new branches in both subdatasets + # this enables us to store the results of this job, and push them back + # without interference from other jobs + git -C fmriprep checkout -b "job-$JOBID" + git -C freesurfer checkout -b "job-$JOBID" + # create workdir for fmriprep inside to simplify singularity call + # PWD will be available in the container + mkdir -p .git/tmp/wdir + # pybids (inside fmriprep) gets angry when it sees dangling symlinks + # of .json files -- wipe them out, spare only those that belong to + # the participant we want to process in this job + find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete + + # next one is important to get job-reruns correct. We remove all anticipated + # output, such that fmriprep isn't confused by the presence of stale + # symlinks. Otherwise we would need to obtain and unlock file content. + # But that takes some time, for no reason other than being discarded + # at the end + (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) + (cd freesurfer && rm -rf fsaverage "$subid") + + # the meat of the matter, add actual parameterization after --participant-label + datalad containers-run \ + -m "fMRIprep $subid" \ + --explicit \ + -o freesurfer -o fmriprep \ + -i "$1" \ + -n code/pipelines/fmriprep \ + sourcedata . 
participant \ + --n_cpus 1 \ + --skip-bids-validation \ + -w .git/tmp/wdir \ + --participant-label "$subid" \ + --random-seed 12345 \ + --skull-strip-fixed-seed \ + --md-only-boilerplate \ + --output-spaces MNI152NLin6Asym \ + --use-aroma \ + --cifti-output + # selectively push outputs only + # ignore root dataset, despite recorded changes, needs coordinated + # merge at receiving end + flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin + flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin + + # job handler should clean up workspace + + +Job submission +"""""""""""""" + +With this script set up, job submission boils down to invoking the script for each participant with a participant identifier that determines on which subject the job runs, and setting two environment variables - one the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``. +Job scheduler such as HTCondor have syntax that can identify subject IDs from consistently named directories, for example, and the submit file is thus lean. + +You can find the submit file used in this analyses in the findoutmore below. + +.. findoutmore:: HTCondor submit file + + .. code-block:: bash + + universe = vanilla + # this is currently necessary, because otherwise the + # bundles git in git-annex-standalone breaks + # but it should be removed eventually + get_env = True + # resource requirements for each job + request_cpus = 1 + request_memory = 20G + request_disk = 210G + + executable = $ENV(PWD)/code/fmriprep_participant_job + + # the job expects to environment variables for labeling and synchronization + environment = "JOBID=$(Cluster).$(Process) DSLOCKFILE=$ENV(PWD)/.git/datalad_lock" + log = $ENV(PWD)/../logs/$(Cluster).$(Process).log + output = $ENV(PWD)/../logs/$(Cluster).$(Process).out + error = $ENV(PWD)/../logs/$(Cluster).$(Process).err + arguments = $(subid) + # find all participants, based on the subdirectory names in the source dataset + # each relative path to such a subdirectory with become the value of `subid` + # and another job is queued. Will queue a total number of jobs matching the + # number of matching subdirectories + queue subid matching dirs sourcedata/sub-* + + +Merging results +""""""""""""""" + +TODO - need to ask mih how he did it and how merge conflicts were solved. + + + + +.. rubric:: Footnotes + +.. [#f1] To re-read about :command:`datalad run`'s ``--explicit`` option, take a look into the section :ref:`run5`. + +.. [#f2] If the distinction between annexed and unannexed files is new to you, please read section :ref:`symlink` + +.. [#f3] Note that this requires the ``datalad containers`` extension. Find an overview of all datalad extensions in :ref:`extensions_intro`. + +.. [#f4] Clean-up routines can, in the case of common job schedulers, be taken care of by performing everything in compute node specific ``/tmp`` directories that are wiped clean after job termination. + +.. [#f5] The brackets around the commands are called *command grouping* in bash, and yield a subshell environment: `www.gnu.org/software/bash/manual/html_node/Command-Grouping.html `_. \ No newline at end of file diff --git a/docs/beyond_basics/basics-hpc.rst b/docs/beyond_basics/basics-hpc.rst new file mode 100644 index 000000000..22c9e30da --- /dev/null +++ b/docs/beyond_basics/basics-hpc.rst @@ -0,0 +1,12 @@ +.. _chapter_hpc: + +Computing on clusters +--------------------- + +.. figure:: ../artwork/src/cluster.svg + +.. 
toctree:: + :maxdepth: 1 + :caption: Strategies for high performance computing with DataLad + + 101-170-dataladrun diff --git a/docs/beyond_basics/intro.rst b/docs/beyond_basics/intro.rst index 98836cccb..c9a54ec7e 100644 --- a/docs/beyond_basics/intro.rst +++ b/docs/beyond_basics/intro.rst @@ -22,6 +22,7 @@ associated usecases. basics-scaling basics-retrospective basics-specialpurpose + basics-hpc .. figure:: /artwork/src/hero.svg :width: 70% From fa6fc55587f71dc05702d990790679f2e05bac2d Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Fri, 3 Jul 2020 17:15:57 +0200 Subject: [PATCH 02/22] Gloss: HTC, HCP, compute node, ephemeral clone --- docs/glossary.rst | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/docs/glossary.rst b/docs/glossary.rst index a723a9b71..3633b196c 100644 --- a/docs/glossary.rst +++ b/docs/glossary.rst @@ -74,6 +74,8 @@ Glossary Container images are *built* from :term:`container recipe` files. They are a static filesystem inside a file, populated with the software specified in the recipe, and some initial configuration. + compute node + A compute node is an individual computer, part of a :term:`high-performance computing (HPC)` or :term:`high-throughput computing (HTC)` cluster. DataLad dataset A DataLad dataset is a Git repository that may or may not have a data annex that is used to @@ -128,7 +130,10 @@ Glossary You can find out a bit more on environment variable :ref:`in this footnote `. ephemeral clone - TODO + dataset clones that share the annex with the dataset they were cloned from, without :term:`git-annex` being aware of it. + On a technical level, this is achieved via symlinks. + They can be created with the ``--reckless ephemeral`` option of :command:`datalad clone`. + force-push Git concept; Enforcing a :command:`git push` command with the ``--force`` @@ -185,6 +190,13 @@ Glossary You can read about more about Pattern Matching in `Bash's Docs `_. + high-performance computing (HPC) + Aggregating computing power from a bond of computers in a way that delivers higher performance than a typical desktop computer in order to solve computing tasks that require high computing power or demand a lot of disk space or memory. + + + high-throughput computing (HTC) + A computing environment build from a bond of computers and tuned to deliver large amounts of computational power to allow parallel processing of independent computational jobs. For more information, see `this Wikipedia entry `_. + http Hypertext Transfer Protocol; A protocol for file transfer over a network. @@ -437,4 +449,4 @@ Glossary The Windows Subsystem for Linux, a compatibility layer for running Linux destributions on recent versions of Windows. Find out more `here `__. zsh - A Unix shell. \ No newline at end of file + A Unix shell. From 6c9c07a55cf06a0b8c78f4cee3327b56f26103bb Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 13 Jul 2020 10:41:09 +0200 Subject: [PATCH 03/22] Help: add ref handle to non bare push error --- docs/basics/101-135-help.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/basics/101-135-help.rst b/docs/basics/101-135-help.rst index 756b06569..f926fdcf3 100644 --- a/docs/basics/101-135-help.rst +++ b/docs/basics/101-135-help.rst @@ -316,6 +316,8 @@ this means that the sibling contains changes that your local dataset does not ye know about. It can be fixed by updating from the sibling first with a :command:`datalad update --merge`. +.. 
_nonbarepush: + Here is a different push rejection:: $ datalad push --to roommate From c6ac1c4a85ec4161461d778d2f5ebc80bbb1769d Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 13 Jul 2020 10:41:33 +0200 Subject: [PATCH 04/22] Minor formatting: m-dash instead of n-dash --- docs/basics/101-135-help.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/basics/101-135-help.rst b/docs/basics/101-135-help.rst index f926fdcf3..276d79f3a 100644 --- a/docs/basics/101-135-help.rst +++ b/docs/basics/101-135-help.rst @@ -331,7 +331,7 @@ As you can see, the :term:`git-annex branch` was pushed successfully, but updati the ``master`` branch was rejected: ``[remote rejected] (branch is currently checked out) [publish(/home/me/dl-101/DataLad-101)]``. In this particular case, this is because it was an attempt to push from ``DataLad-101`` to the ``roommate`` sibling that was created in chapter :ref:`chapter_collaboration`. -This is a special case of pushing, because it - in technical terms - is a push +This is a special case of pushing, because it -- in technical terms -- is a push to a non-bare repository. Unlike :term:`bare Git repositories`, non-bare repositories can not be pushed to at all times. To fix this, you either want to `checkout another branch `_ From 769acbc8cf104fe423daa2d5f50a2bb1ce6fde4f Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 13 Jul 2020 10:42:22 +0200 Subject: [PATCH 05/22] run: tweaks to the section on run in parallel --- docs/beyond_basics/101-170-dataladrun.rst | 51 ++++++++++++++++------- 1 file changed, 37 insertions(+), 14 deletions(-) diff --git a/docs/beyond_basics/101-170-dataladrun.rst b/docs/beyond_basics/101-170-dataladrun.rst index 8115f63de..6015f4947 100644 --- a/docs/beyond_basics/101-170-dataladrun.rst +++ b/docs/beyond_basics/101-170-dataladrun.rst @@ -13,7 +13,7 @@ Why job scheduling? On scientific compute clusters, job scheduling systems such as `HTCondor `_ or `slurm `_ are used to distribute computational jobs across the available computing infrastructure and manage the overall workload of the cluster. This allows for efficient and fair use of available resources across a group of users, and it brings the potential for highly parallelized computations of jobs and thus vastly faster analyses. -One common way to use a job scheduler, for example, is to process all subjects of a dataset independently and as parallel as the current workload of the compute cluster allows instead of serially (i.e., "one after the other"). +Consider one common way to use a job scheduler: processing all subjects of a dataset independently and as parallel as the current workload of the compute cluster allows instead of serially (i.e., "one after the other"). In such a setup, each subject-specific analysis becomes a single job, and the job scheduler fits as many jobs as it can on available :term:`compute node`\s. If a large analysis can be split into many independent jobs, using a job scheduler to run them in parallel thus yields great performance advantages in addition to fair compute resource distribution across all users. @@ -53,12 +53,19 @@ Processing FAIRly *and* in parallel Here is **one** approach that led to a provenance-tracked, computationally reproducible, and parallel preprocessing workflow, but many more can work. `We are eager to hear about yours `_. 
-The key to the success of this workflow lies in creating it completely job-scheduling and platform agnostic, such that it can be deployed as a subject-specific job anywhere, with any job scheduling system. -Instead of computing job results in the same dataset over all jobs, temporary, :term:`ephemeral clone`\s are created to hold individual, subject-specific results, and those results are pushed back into the target dataset in the end. +**General background**: We need to preprocess data from 1300 participants with a containerized pipeline. +All data lies in a single dataset. +The preprocessing results will encompass several TB and about half a million files, and will therefore need to be split into two result datasets. + +The keys to the success of this workflow lie in + +- creating it completely *job-scheduling* and *platform agnostic*, such that the workflow can be deployed as a subject-specific job anywhere, with any job scheduling system, and ... +- instead of computing job results in the same dataset over all jobs, temporary, :term:`ephemeral clone`\s are created to hold individual, subject-specific results, and those results are pushed back into the target dataset in the end. + The "creative" bits involved in this parallelized processing workflow boiled down to the following tricks: - Individual jobs (in this case, subject-specific analyses) are computed in throw-away dataset clones to avoid unwanted interactions between ``save`` commands. -- Moreover, beyond computing in job-specific, temporary locations, individual job results are also saved into uniquely identified :term:`branch`\es to enable simple pushing back of the results into the target dataset. +- Moreover, beyond computing in job-specific, temporary locations, individual job results are also saved into uniquely identified :term:`branch`\es to enable simple pushing back of the results into the target dataset [#f6]_. - The jobs constitute a complete DataLad-centric workflow in the form of a simple bash script, including dataset build-up and tear-down routines in a throw-away location, result computation, and result publication back to the target dataset. Thus, instead of submitting a ``datalad run`` command to the job scheduler, the job submission is a single script, and this submission is easily adapted to various job scheduling call formats. - Right after successful job termination, the target dataset contains as many :term:`branch`\es as jobs, with each branch containing the results of one job. A manual :term:`merge` aggregates all results into the :term:`master` branch of the dataset. @@ -83,11 +90,11 @@ If you are interested in this, find the details in the findoutmore below. .. findoutmore:: pipeline dataset creation - We start with a dataset:: + We start with a dataset (called ``pipelines`` in this example):: $ datalad create pipelines - [INFO ] Creating a new annex repo at /data/projects/enki/pipeline - create(ok): /data/projects/enki/pipeline (dataset) + [INFO ] Creating a new annex repo at /data/projects/enki/pipelines + create(ok): /data/projects/enki/pipelines (dataset) $ cd pipelines As one of tools used in the pipeline, `freesurfer `_, requires a license file, this license file needs to be added into the dataset. @@ -100,7 +107,7 @@ If you are interested in this, find the details in the findoutmore below. Finally, we add a container with the pipeline to the dataset using :command:`datalad containers-add` [#f3]_. 
The important part is the configuration of the container -- it has to be done in a way that makes the container usable in any superdataset the pipeline dataset. - Depending on how the container needs to be called, the configuration differs. + Depending on how the container/pipeline needs to be called, the configuration differs. In the case of an fMRIprep run, we want to be able to invoke the container from a superdataset. The superdataset contains input data and ``pipelines`` dataset as subdatasets, and will collect all of the results. Thus, these are arguments we want to supply the invocation with (following `fMRIprep's documentation `_) during a ``containers-run`` command:: @@ -165,9 +172,12 @@ Therefore, the strategy is to create throw-away dataset clone for all jobs. One way to do this are :term:`ephemeral clone`\s, an alternative is to make :term:`git-annex` disregard the datasets annex completely using ``git annex dead here``. -This involves a build-up and tear-down routine for each job: Clone the analysis dataset hierarchy into a temporary location, run the computation, push the results, remove temporary dataset [#f4]_. +Using throw-away clones involves a build-up and tear-down routine for each job: Clone the analysis dataset hierarchy into a temporary location, run the computation, push the results, remove temporary dataset [#f4]_. +All of this is done in a single script, which will be submitted as a job. -To give you a first idea, a sketch of this is below. Fine-tuning and the complete script are shown in the findoutmore afterwards:: +To give you a first idea, a sketch of this is below in a :term:`bash` (shell) script. +Using `shell `_ as the language for this script is a straight-forward choice as it allows you to script the DataLad workflow just as you would type it into your terminal, but other languages (e.g., using :ref:`DataLad's Python API ` or system calls in languages such as Matlab) would work as well. +Fine-tuning and the complete script are shown in the findoutmore afterwards:: # everything is running under /tmp inside a compute job, /tmp is a performant local filesystem $ cd /tmp @@ -176,7 +186,7 @@ To give you a first idea, a sketch of this is below. Fine-tuning and the complet $ datalad clone /data/project/enki/superds ds $ cd ds - # get first-level subdatasets + # get first-level subdatasets (-R1 = --recursion-limit 1) $ datalad get -n -r -R1 . # make git-annex disregard the clones - they are meant to be thrown away @@ -187,7 +197,8 @@ To give you a first idea, a sketch of this is below. Fine-tuning and the complet $ git -C fmriprep checkout -b "job-$JOBID" $ git -C freesurfer checkout -b "job-$JOBID" - # call fmriprep with datalad containers-run + # call fmriprep with datalad containers-run. Use all relevant fMRIprep + # arguments for your usecase $ datalad containers-run \ -m "fMRIprep $subid" \ --explicit \ @@ -212,6 +223,7 @@ To give you a first idea, a sketch of this is below. Fine-tuning and the complet # job handler should clean up workspace Pending a few yet missing safe guards against concurrency issues and to enable re-running computations, such a script can be submitted to any job scheduler with a subject ID and a job ID as identifiers for the fMRIprep run and branch names. +The concrete calling/submission of this script is shown in the paragraph :ref:`jobsubmit`, but on a procedural level, this workflow sketch takes care .. 
findoutmore:: Fine-tuning: Enable re-running and safe-guard concurrency issues @@ -319,6 +331,7 @@ Pending a few yet missing safe guards against concurrency issues and to enable r # job handler should clean up workspace +.. _jobsubmit: Job submission """""""""""""" @@ -328,7 +341,7 @@ Job scheduler such as HTCondor have syntax that can identify subject IDs from co You can find the submit file used in this analyses in the findoutmore below. -.. findoutmore:: HTCondor submit file +.. findoutmore:: HTCondor submit file fmriprep_all_participants.submit .. code-block:: bash @@ -356,13 +369,21 @@ You can find the submit file used in this analyses in the findoutmore below. # number of matching subdirectories queue subid matching dirs sourcedata/sub-* +All it takes to submit is a single ``condor_submit fmriprep_all_participants.submit``. Merging results """"""""""""""" +Once all jobs have finished, the results lie in individual branches of the output datasets. +In this concrete example, the subdatasets ``fmriprep`` and ``freesurfer`` will each have 1300 branches that hold individual job results. +The only thing left to do now is merging all of these branches into :term:`master` -- and potentially solve any merge conflicts that arise. + TODO - need to ask mih how he did it and how merge conflicts were solved. +Recomputing results +""""""""""""""""""" +TODO .. rubric:: Footnotes @@ -375,4 +396,6 @@ TODO - need to ask mih how he did it and how merge conflicts were solved. .. [#f4] Clean-up routines can, in the case of common job schedulers, be taken care of by performing everything in compute node specific ``/tmp`` directories that are wiped clean after job termination. -.. [#f5] The brackets around the commands are called *command grouping* in bash, and yield a subshell environment: `www.gnu.org/software/bash/manual/html_node/Command-Grouping.html `_. \ No newline at end of file +.. [#f5] The brackets around the commands are called *command grouping* in bash, and yield a subshell environment: `www.gnu.org/software/bash/manual/html_node/Command-Grouping.html `_. + +.. [#f6] To find out why a different branch is required to enable easy pushing back to the original dataset, please checkout the explanation on :ref:`pushing to non-bare repositories ` in the section on :ref:`help`. \ No newline at end of file From 200ea7378d39773b4c3f7acadbce6565d94effa1 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 13 Jul 2020 11:27:36 +0200 Subject: [PATCH 06/22] Go Big: reference HPC chapter --- docs/beyond_basics/101-160-gobig.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/beyond_basics/101-160-gobig.rst b/docs/beyond_basics/101-160-gobig.rst index cccd24395..b480e1baf 100644 --- a/docs/beyond_basics/101-160-gobig.rst +++ b/docs/beyond_basics/101-160-gobig.rst @@ -22,6 +22,7 @@ and points to benchmarks, rules of thumb, and general solutions. Upcoming sections demonstrate how one can attempt large-scale analyses with DataLad, and how to fix things up when dataset sizes got out of hand. +The upcoming chapter :ref:`chapter_hpc`, finally, extends this chapter with advice and examples from large scale analyses on computational clusters. 
Why scaling up Git repos can become difficult ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ From 6cbdae275fa588de8dbc9906d48510bb7faac6f8 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 13 Jul 2020 11:27:55 +0200 Subject: [PATCH 07/22] HPC: reference scaling up chapter --- docs/beyond_basics/101-170-dataladrun.rst | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/beyond_basics/101-170-dataladrun.rst b/docs/beyond_basics/101-170-dataladrun.rst index 6015f4947..53783dc75 100644 --- a/docs/beyond_basics/101-170-dataladrun.rst +++ b/docs/beyond_basics/101-170-dataladrun.rst @@ -3,6 +3,10 @@ DataLad-centric analysis with job scheduling and parallel computing ------------------------------------------------------------------- +.. note:: + + It is advised to read the previous chapter :ref:`chapter_gobig` prior to this one + This section is a write-up of how DataLad can be used on a scientific computational cluster with a job scheduler for reproducible and FAIR data analyses at scale. More concretely, it shows an example of containerized `fMRIprep `_ preprocessing on the `eNKI `_ neuroimaging dataset, scheduled with `HTCondor `_. While the choice of containerized pipeline and job scheduler are specific in this example, the general setup is generic and could be used with any containerized pipeline and any job scheduling system. From 038b06a6ed85e0be18d2b28b20a1518ffc6ccff0 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 13 Jul 2020 11:28:26 +0200 Subject: [PATCH 08/22] add short introductory section for HPC chapter --- docs/beyond_basics/101-169-cluster.rst | 25 +++++++++++++++++++++++++ docs/beyond_basics/basics-hpc.rst | 1 + 2 files changed, 26 insertions(+) create mode 100644 docs/beyond_basics/101-169-cluster.rst diff --git a/docs/beyond_basics/101-169-cluster.rst b/docs/beyond_basics/101-169-cluster.rst new file mode 100644 index 000000000..f418f0939 --- /dev/null +++ b/docs/beyond_basics/101-169-cluster.rst @@ -0,0 +1,25 @@ +.. _hpc: + +DataLad on High Throughput or High Performance Compute Clusters +--------------------------------------------------------------- + +For efficient computing of large analysis, to comply to best computing practices, or to fulfil the requirements that `responsible system administrators `_ impose, users may turn to computational clusters such as :term:`high-performance computing (HPC)` or :term:`high-throughput computing (HTC)` infrastructure for data analysis, back-up, or storage. + +This chapter is a collection of useful resources and examples that aims to help you get started with DataLad-centric workflows on clusters. +We hope to grow this chapter further, so please `get in touch `_ if you want to share your use case or seek more advice. + +Pointers to content in other chapters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To find out more about centralized storage solutions, you may want to checkout the usecase :ref:`usecase_datastore` or the section :ref:`riastore`. + +DataLad installation on a cluster +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Users of a compute cluster generally do not have administrative privileges (sudo rights) and thus can not install software as easily as on their own, private machine. +In order to get DataLad and its underlying tools installed, you can either `bribe (kindly ask) your system administrator `_ [#f1]_ or install everything for your own user only following the instructions in the paragraph :ref:`norootinstall` of the :ref:`installation page `. + + +.. rubric:: Footnotes + +.. 
[#f1] You may not need to bribe your system administrator if you are kind to them. Consider frequent gestures of appreciation, or send a geeky T-Shirt for `SysAdminDay `_ (the last Friday in July) -- Sysadmins do amazing work! \ No newline at end of file diff --git a/docs/beyond_basics/basics-hpc.rst b/docs/beyond_basics/basics-hpc.rst index 22c9e30da..377c32e7f 100644 --- a/docs/beyond_basics/basics-hpc.rst +++ b/docs/beyond_basics/basics-hpc.rst @@ -9,4 +9,5 @@ Computing on clusters :maxdepth: 1 :caption: Strategies for high performance computing with DataLad + 101-169-cluster 101-170-dataladrun From deba479ab68cf824bf6d10a6b580ea4dba3b3edf Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 13 Jul 2020 11:28:46 +0200 Subject: [PATCH 09/22] install: add reference handle to non-root install instruction --- docs/intro/installation.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/intro/installation.rst b/docs/intro/installation.rst index d632a2c26..827ab9369 100644 --- a/docs/intro/installation.rst +++ b/docs/intro/installation.rst @@ -240,6 +240,7 @@ Subsequently, DataLad can be installed via ``pip``. Alternatively, DataLad can be installed together with :term:`Git` and :term:`git-annex` via ``conda`` as outlined in the section below. +.. _norootinstall: Linux-machines with no root access (e.g. HPC systems) """"""""""""""""""""""""""""""""""""""""""""""""""""" From 74c3bcd9b69b4fd0ba3e33930f2b2c1429d90788 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 13 Jul 2020 15:45:34 +0200 Subject: [PATCH 10/22] trigger build again From 68b7c1609c9b4b1d677c159dd43704ba2eac5f98 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 13 Jul 2020 16:01:42 +0200 Subject: [PATCH 11/22] trigger build again From ac78233b64bfc5e19905769831a7d3a7a606a5f4 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 13 Jul 2020 16:13:15 +0200 Subject: [PATCH 12/22] small typo --- docs/beyond_basics/101-170-dataladrun.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/beyond_basics/101-170-dataladrun.rst b/docs/beyond_basics/101-170-dataladrun.rst index 53783dc75..86d7897c9 100644 --- a/docs/beyond_basics/101-170-dataladrun.rst +++ b/docs/beyond_basics/101-170-dataladrun.rst @@ -340,7 +340,7 @@ The concrete calling/submission of this script is shown in the paragraph :ref:`j Job submission """""""""""""" -With this script set up, job submission boils down to invoking the script for each participant with a participant identifier that determines on which subject the job runs, and setting two environment variables - one the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``. +With this script set up, job submission boils down to invoking the script for each participant with a participant identifier that determines on which subject the job runs, and setting two environment variables -- one the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``. Job scheduler such as HTCondor have syntax that can identify subject IDs from consistently named directories, for example, and the submit file is thus lean. You can find the submit file used in this analyses in the findoutmore below. 
From edd26cc17abfdc8feccf447009a89052768b4621 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Wed, 5 Aug 2020 16:45:04 +0200 Subject: [PATCH 13/22] hpc-run: finalize initial draft --- docs/beyond_basics/101-170-dataladrun.rst | 107 ++++++++++++++++++---- 1 file changed, 88 insertions(+), 19 deletions(-) diff --git a/docs/beyond_basics/101-170-dataladrun.rst b/docs/beyond_basics/101-170-dataladrun.rst index 86d7897c9..21b731b18 100644 --- a/docs/beyond_basics/101-170-dataladrun.rst +++ b/docs/beyond_basics/101-170-dataladrun.rst @@ -91,6 +91,7 @@ At the beginning of this endeavour, two important analysis components already ex Following the :ref:`YODA principles `, each of these components is a standalone dataset. While the input dataset creation is straightforwards, some thinking went into the creation of containerized pipeline dataset to set it up in a way that allows it to be installed as a subdataset and invoked from the superdataset. If you are interested in this, find the details in the findoutmore below. +Also note that there is a large collection of pre-existing container datasets available at `github.com/ReproNim/containers `_. .. findoutmore:: pipeline dataset creation @@ -176,18 +177,20 @@ Therefore, the strategy is to create throw-away dataset clone for all jobs. One way to do this are :term:`ephemeral clone`\s, an alternative is to make :term:`git-annex` disregard the datasets annex completely using ``git annex dead here``. -Using throw-away clones involves a build-up and tear-down routine for each job: Clone the analysis dataset hierarchy into a temporary location, run the computation, push the results, remove temporary dataset [#f4]_. +Using throw-away clones involves a build-up and tear-down routine for each job but works well since datasets are by nature made for collaboration [#f7]_: Clone the analysis dataset hierarchy into a temporary location, run the computation, push the results, remove temporary dataset [#f4]_. + All of this is done in a single script, which will be submitted as a job. -To give you a first idea, a sketch of this is below in a :term:`bash` (shell) script. +To give you a first idea, a sketch of this is in the :term:`bash` (shell) script below. Using `shell `_ as the language for this script is a straight-forward choice as it allows you to script the DataLad workflow just as you would type it into your terminal, but other languages (e.g., using :ref:`DataLad's Python API ` or system calls in languages such as Matlab) would work as well. Fine-tuning and the complete script are shown in the findoutmore afterwards:: - # everything is running under /tmp inside a compute job, /tmp is a performant local filesystem + # everything is running under /tmp inside a compute job, + # /tmp is job-specific local filesystem not shared between jobs $ cd /tmp # clone the superdataset - $ datalad clone /data/project/enki/superds ds + $ datalad clone /data/project/enki/super ds $ cd ds # get first-level subdatasets (-R1 = --recursion-limit 1) @@ -227,7 +230,7 @@ Fine-tuning and the complete script are shown in the findoutmore afterwards:: # job handler should clean up workspace Pending a few yet missing safe guards against concurrency issues and to enable re-running computations, such a script can be submitted to any job scheduler with a subject ID and a job ID as identifiers for the fMRIprep run and branch names. 
-The concrete calling/submission of this script is shown in the paragraph :ref:`jobsubmit`, but on a procedural level, this workflow sketch takes care +The concrete calling/submission of this script is shown in the paragraph :ref:`jobsubmit`, but on a procedural level, this workflow sketch takes care of everything that needs to be done apart from combining all computed results afterwards. .. findoutmore:: Fine-tuning: Enable re-running and safe-guard concurrency issues @@ -335,26 +338,28 @@ The concrete calling/submission of this script is shown in the paragraph :ref:`j # job handler should clean up workspace +Pending modifications to paths provided in clone locations, the above script and dataset setup is generic enough to be run on different systems and with different job schedulers. + .. _jobsubmit: Job submission """""""""""""" -With this script set up, job submission boils down to invoking the script for each participant with a participant identifier that determines on which subject the job runs, and setting two environment variables -- one the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``. -Job scheduler such as HTCondor have syntax that can identify subject IDs from consistently named directories, for example, and the submit file is thus lean. +Job submission now only boils down to invoking the script for each participant with a participant identifier that determines on which subject the job runs, and setting two environment variables -- one the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``. +Job scheduler such as HTCondor have syntax that can identify subject IDs from consistently named directories, for example, and the submit file can thus be lean even though it queues up more than 1000 jobs. You can find the submit file used in this analyses in the findoutmore below. -.. findoutmore:: HTCondor submit file fmriprep_all_participants.submit +.. findoutmore:: HTCondor submit file + + The following submit file was created and saved in ``code/fmriprep_all_participants.submit``: .. code-block:: bash universe = vanilla - # this is currently necessary, because otherwise the - # bundles git in git-annex-standalone breaks - # but it should be removed eventually get_env = True - # resource requirements for each job + # resource requirements for each job, determined by + # investigating the demands of a single test job request_cpus = 1 request_memory = 20G request_disk = 210G @@ -373,21 +378,83 @@ You can find the submit file used in this analyses in the findoutmore below. # number of matching subdirectories queue subid matching dirs sourcedata/sub-* -All it takes to submit is a single ``condor_submit fmriprep_all_participants.submit``. +All it takes to submit is a single ``condor_submit ``. Merging results """"""""""""""" Once all jobs have finished, the results lie in individual branches of the output datasets. -In this concrete example, the subdatasets ``fmriprep`` and ``freesurfer`` will each have 1300 branches that hold individual job results. +In this concrete example, the subdatasets ``fmriprep`` and ``freesurfer`` will each have more than 1000 branches that hold individual job results. The only thing left to do now is merging all of these branches into :term:`master` -- and potentially solve any merge conflicts that arise. +Usually, merging branches is done using the ``git merge`` command with a branch specification. 
+For example, in order to merge one job branch into the :term:`master` :term:`branch`, one would need to be on ``master`` and run ``git merge ``.
+Given that the subdatasets each contain >1000 branches, and that each ``merge`` would lead to a commit, in order to not inflate the history of the dataset with hundreds of merge commits, two `Octopus merges `_ were done -- one in each subdataset (``fmriprep`` and ``freesurfer``).
+
+.. findoutmore:: What is an octopus merge?
+
+   Usually, a commit that arises from a merge has two *parent* commits: The *first parent* is the branch the merge is being performed from, in the example above, ``master``. The *second parent* is the branch that was merged into the first.
+
+   However, ``git merge`` is capable of merging more than two branches simultaneously if more than a single branch name is given to the command.
+   The resulting merge commit has as many parents as were involved in the merge.
+   If a commit has more than two parents, it is affectionately called an "Octopus" merge.
+
+   Octopus merges require merge-conflict-free situations, and will not be carried out whenever manual resolution of conflicts is needed.
+
+The merge command can be assembled quickly.
+As all result branches were named ``job-``, a complete list of branches is obtained with the following command::
+
+   $ git branch -l | grep 'job-' | tr -d ' '
+
+This command line call translates to: "list all branches; of those, show only the ones that contain ``job-``; and remove (``tr -d``) all whitespace".
+The resulting list can be given to ``git merge`` as in
+
+.. code-block:: bash
+
+   $ git merge -m "Merge results from job cluster XY" $(git branch -l | grep 'job-' | tr -d ' ')
+
+**Merging with merge conflicts**
+
+If a merge conflict arises when attempting an octopus merge like the one above, the merge is aborted automatically.
+This is what it looks like::
+
+   $ git merge -m "Merge results from job cluster 107890" $(git branch -l | grep 'job-' | tr -d ' ')
+   Fast-forwarding to: job-107890.0
+   Trying simple merge with job-107890.1
+   Simple merge did not work, trying automatic merge.
+   ERROR: logs/CITATION.md: Not merging symbolic link changes.
+   fatal: merge program failed
+   Automated merge did not work.
+   Should not be doing an octopus.
+   Merge with strategy octopus failed.
+
+This merge conflict arose in the ``fmriprep`` subdataset and originated from the fact that each job generated a ``CITATION.md`` file with minimal individual changes.
+
+.. findoutmore:: How to fix this?
+
+   As the file ``CITATION.md`` does not contain meaningful changes between jobs, one of the files was kept (e.g., copied into a temporary location, or brought back to life afterwards with ``git cat-file``), and all ``CITATION.md`` files of all branches were deleted prior to the merge.
+   Here is a bash loop that would do exactly that::
+
+      $ for b in $(git branch -l | grep 'job-' | tr -d ' '); do ( git checkout -b m$b $b && git rm logs/CITATION.md && git commit --amend --no-edit ) ; done
+
+   Afterwards, the merge command succeeds.
+
+**Merging without merge conflicts**
+
+If no merge conflicts arise and the octopus merge is successful, all results are aggregated in the ``master`` branch.
+The commit log looks like a work of modern art when visualized with tools such as :term:`tig`:
+
+..
figure:: ../artwork/src/octopusmerge_tig.png + + +Summary +""""""" + +Once all jobs are computed in parallel and the resulting branches merged, the superdataset is populated with two subdatasets that hold the preprocessing results. +Each result contains a machine-readable record of provenance on when, how, and by whom it was computed. +From this point, the results in the subdatasets can be used for further analysis, while a record of how they were preprocessed is attached to them. -TODO .. rubric:: Footnotes @@ -402,4 +469,6 @@ TODO .. [#f5] The brackets around the commands are called *command grouping* in bash, and yield a subshell environment: `www.gnu.org/software/bash/manual/html_node/Command-Grouping.html `_. -.. [#f6] To find out why a different branch is required to enable easy pushing back to the original dataset, please checkout the explanation on :ref:`pushing to non-bare repositories ` in the section on :ref:`help`. \ No newline at end of file +.. [#f6] To find out why a different branch is required to enable easy pushing back to the original dataset, please checkout the explanation on :ref:`pushing to non-bare repositories ` in the section on :ref:`help`. + +.. [#f7] For an analogy, consider a group of software developers: Instead of adding code changes to the main :term:`branch` of a repository, they develop in their own repository clones and on dedicated, individual feature branches. This allows them to integrate their changes back into the original repository with as little conflict as possible. \ No newline at end of file From d5b78c0cb48befb2b9c961b86f48c5d8266ce606 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Wed, 5 Aug 2020 16:48:56 +0200 Subject: [PATCH 14/22] retrigger build From 5fa670ba26276122aaf71e34dbb66c10154fafa0 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Tue, 11 Aug 2020 07:54:50 +0200 Subject: [PATCH 15/22] trigger build again From 794d83e9ed15ab665d86d860e6c83859601d19fc Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Fri, 16 Oct 2020 16:22:35 +0200 Subject: [PATCH 16/22] run: rewrite material into general, simplistic workflow --- docs/beyond_basics/101-170-dataladrun.rst | 540 +++++++++------------- 1 file changed, 223 insertions(+), 317 deletions(-) diff --git a/docs/beyond_basics/101-170-dataladrun.rst b/docs/beyond_basics/101-170-dataladrun.rst index 21b731b18..d806ddea7 100644 --- a/docs/beyond_basics/101-170-dataladrun.rst +++ b/docs/beyond_basics/101-170-dataladrun.rst @@ -8,8 +8,9 @@ DataLad-centric analysis with job scheduling and parallel computing It is advised to read the previous chapter :ref:`chapter_gobig` prior to this one This section is a write-up of how DataLad can be used on a scientific computational cluster with a job scheduler for reproducible and FAIR data analyses at scale. -More concretely, it shows an example of containerized `fMRIprep `_ preprocessing on the `eNKI `_ neuroimaging dataset, scheduled with `HTCondor `_. -While the choice of containerized pipeline and job scheduler are specific in this example, the general setup is generic and could be used with any containerized pipeline and any job scheduling system. +It showcases the general principles behind parallel processing of DataLad-centric workflows with containerized pipelines. +This section lays the groundwork to the next section, a walkthrough through a more complex real life example of containerized `fMRIprep `_ preprocessing on the `eNKI `_ neuroimaging dataset, scheduled with `HTCondor `_. 
+While this chapter demonstrates specific containerized pipelines and job schedulers, the general setup is generic and could be used with any containerized pipeline and any job scheduling system. Why job scheduling? ^^^^^^^^^^^^^^^^^^^ @@ -31,329 +32,281 @@ If a large analysis can be split into many independent jobs, using a job schedul The job scheduler takes the submitted jobs, *queues* them up in a central queue, and monitors the available compute resources (i.e., :term:`compute node`\s) of the cluster. As soon as a computational resource is free, it matches a job from the queue to the available resource and computes the job on this node. Usually, a single submission queues up multiple (dozens, hundreds, or thousands of) jobs. + If you are interested in a tutorial for HTCondor, checkout the `INM-7 HTcondor Tutorial `_. Where are the difficulties in parallel computing with DataLad? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In order to capture as much provenance as possible, analyses are best ran with a :command:`datalad run` or :command:`datalad containers-run` command, as these commands can capture and link all relevant components of an analysis, starting from code and results to input data and computational environment. -Note, though, that when parallelizing jobs and computing them with provenance capture, *each individual job* needs to be wrapped in a ``run`` command, and not only the submission of the jobs to the job scheduler -- and this requires multiple parallel ``run`` commands on the same dataset. +Note, though, that when parallelizing jobs and computing them with provenance capture, *each individual job* needs to be wrapped in a ``run`` command, not only the submission of the jobs to the job scheduler -- and this requires multiple parallel ``run`` commands on the same dataset. Multiple simultaneous ``datalad (containers-)run`` invocations in the same dataset are, however, problematic: - Operations carried out during one :command:`run` command can lead to modifications that prevent a second, slightly later ``run`` command from being started -- The :command:`datalad save` command at the end of of :command:`datalad run` could save modifications that originate from a different job, leading to mis-associated provenance +- The :command:`datalad save` command at the end of :command:`datalad run` could save modifications that originate from a different job, leading to mis-associated provenance - A number of *concurrency issues*, unwanted interactions of processes when they run simultaneously, can arise and lead to internal command failures Some of these problems can be averted by invoking the ``(containers-)run`` command with the ``--explicit`` [#f1]_ flag. This doesn't solve all of the above problems, though, and may not be applicable to the computation at hand -- for example because all jobs write to a similar file or the result files are not known beforehand. -Below, a complete, largely platform and scheduling-system agnostic containerized analysis workflow is outlined that addressed the outlined problems. +Below, you can find a complete, largely platform and scheduling-system agnostic containerized analysis workflow that addressed the outlined problems. -Processing FAIRly *and* in parallel -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Processing FAIRly *and* in parallel -- General workflow +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. note:: FAIR *and* parallel processing requires out-of-the-box thinking, and many creative approaches can lead to success. 
- Here is **one** approach that led to a provenance-tracked, computationally reproducible, and parallel preprocessing workflow, but many more can work. + Here is **one** approach that leads to a provenance-tracked, computationally reproducible, and parallel preprocessing workflow, but many more can work. `We are eager to hear about yours `_. -**General background**: We need to preprocess data from 1300 participants with a containerized pipeline. -All data lies in a single dataset. -The preprocessing results will encompass several TB and about half a million files, and will therefore need to be split into two result datasets. +**General setup**: The overall setup consists of a data analysis with a containerized pipeline (i.e., a software container that performs a single or a set of analyses). +Results will be aggregated into a top-level analysis dataset while the input dataset and a "pipeline" dataset (with a configured software container) exist as subdatasets. +The analysis is carried out on a computational cluster that uses a job scheduling system to distribute compute jobs. -The keys to the success of this workflow lie in +The "creative" bits involved in this parallelized processing workflow boil down to the following tricks: -- creating it completely *job-scheduling* and *platform agnostic*, such that the workflow can be deployed as a subject-specific job anywhere, with any job scheduling system, and ... -- instead of computing job results in the same dataset over all jobs, temporary, :term:`ephemeral clone`\s are created to hold individual, subject-specific results, and those results are pushed back into the target dataset in the end. +- Individual jobs (for example subject-specific analyses) are computed in **throw-away dataset clones** to avoid unwanted interactions between parallel jobs. +- Beyond computing in job-specific, temporary locations, individual job results are also saved into uniquely identified :term:`branch`\es to enable simple **pushing back of the results** into the target dataset. +- The jobs constitute a complete DataLad-centric workflow in the form of a simple bash script, including dataset build-up and tear-down routines in a throw-away location, result computation, and result publication back to the target dataset. + Thus, instead of submitting a ``datalad run`` command to the job scheduler, **the job submission is a single script**, and this submission is easily adapted to various job scheduling call formats. +- Right after successful completion of all jobs, the target dataset contains as many :term:`branch`\es as jobs, with each branch containing the results of one job. + A manual :term:`merge` aggregates all results into the :term:`master` branch of the dataset. -The "creative" bits involved in this parallelized processing workflow boiled down to the following tricks: +The keys to the success of this workflow lie in -- Individual jobs (in this case, subject-specific analyses) are computed in throw-away dataset clones to avoid unwanted interactions between ``save`` commands. -- Moreover, beyond computing in job-specific, temporary locations, individual job results are also saved into uniquely identified :term:`branch`\es to enable simple pushing back of the results into the target dataset [#f6]_. -- The jobs constitute a complete DataLad-centric workflow in the form of a simple bash script, including dataset build-up and tear-down routines in a throw-away location, result computation, and result publication back to the target dataset. 
Thus, instead of submitting a ``datalad run`` command to the job scheduler, the job submission is a single script, and this submission is easily adapted to various job scheduling call formats. -- Right after successful job termination, the target dataset contains as many :term:`branch`\es as jobs, with each branch containing the results of one job. A manual :term:`merge` aggregates all results into the :term:`master` branch of the dataset. +- creating it completely *job-scheduling* and *platform agnostic*, such that the workflow can be deployed as a subject/...-specific job anywhere, with any job scheduling system, and ... +- instead of computing job results in the same dataset over all jobs, temporary clones are created to hold individual, job-specific results, and those results are pushed back into the target dataset in the end ... +- while all dataset components (input data, containerized pipeline) are reusable and the results completely provenance-tracked. -Walkthrough -^^^^^^^^^^^ +Step-by-Step +"""""""""""" -The goal of the following analysis was standard preprocessing using `fMRIprep `_ on neuroimaging data of 1300 subjects in the `eNKI `_ dataset. -In order to associate input data, containerized pipeline, and outputs, the analysis was carried out in a DataLad dataset and with the :command:`datalad containers-run` command. -Here's a walkthrough of what was done and how. +To get an idea of the general setup of parallel provenance-tracked computations, consider a data analysis dataset... -Starting point: Datasets for software and input data -"""""""""""""""""""""""""""""""""""""""""""""""""""" +.. code-block:: bash -At the beginning of this endeavour, two important analysis components already exist as DataLad datasets: + $ datalad create parallel_analysis + [INFO ] Creating a new annex repo at /tmp/parallel_analysis + [INFO ] Scanning for unlocked files (this may take some time) + create(ok): /tmp/parallel_analysis (dataset) + $ cd parallel_analysis -1. The input data -2. The containerized pipeline +... with input data as a subdataset ... -Following the :ref:`YODA principles `, each of these components is a standalone dataset. -While the input dataset creation is straightforwards, some thinking went into the creation of containerized pipeline dataset to set it up in a way that allows it to be installed as a subdataset and invoked from the superdataset. -If you are interested in this, find the details in the findoutmore below. -Also note that there is a large collection of pre-existing container datasets available at `github.com/ReproNim/containers `_. +.. code-block:: bash -.. findoutmore:: pipeline dataset creation + $ datalad clone -d . /path/to/my/rawdata + [INFO ] Scanning for unlocked files (this may take some time) + install(ok): /tmp/parallel_analysis/rawdata (dataset) + add(ok): /tmp/parallel_analysis/rawdata (file) + add(ok): /tmp/parallel_analysis/.gitmodules (file) + save(ok): /tmp/parallel_analysis (dataset) + action summary: + add (ok: 2) + install (ok: 1) + save (ok: 1) + +... and a dataset with a containerized pipeline (for example from the `ReproNim container-collection `_ [#f2]_) as another subdataset: + +.. findoutmore:: Why do I add the pipeline as a subdataset? + + You could also add and configure the container using ``datalad containers-add`` to the top-most dataset. + This solution makes the container less usable, though. + If you have more than one application for a container, keeping it as a standalone dataset can guarantee easier reuse. 
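+   As a rough illustration of that less reusable alternative (the container name and image path below are purely hypothetical), registering a container directly in the top-level analysis dataset would boil down to::
+
+      $ datalad containers-add mycontainer --url /path/to/mycontainer.simg
+
+   Every other analysis dataset that wants to use the same pipeline would have to repeat this step, whereas a standalone container dataset only needs to be cloned.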
+ For an example on how to create such a dataset yourself, please checkout the Findoutmore in :ref:`pipelineenki` in the real-life walkthrough in the next section. + +.. code-block:: + + $ datalad clone -d . https://github.com/ReproNim/containers.git + [INFO ] Scanning for unlocked files (this may take some time) + install(ok): /tmp/parallel_analysis/containers (dataset) + add(ok): /tmp/parallel_analysis/containers (file) + add(ok): /tmp/parallel_analysis/.gitmodules (file) + save(ok): /tmp/parallel_analysis (dataset) + action summary: + add (ok: 2) + install (ok: 1) + save (ok: 1) + +The analysis aims to process the ``rawdata`` with a pipeline from ``containers`` and collect the outcomes in the toplevel ``parallel_analysis`` dataset -- FAIRly and in parallel, using ``datalad containers-run``. + +One way to conceptualize the workflow is by taking the perspective of a single compute job. +This job consists of whatever you may want to parallelize over. + +.. findoutmore:: What are common analysis types to parallelize over? + + The key to using a job scheduler and parallelization is to break down an analysis into smaller, loosely coupled computing tasks that can be distributed across a compute cluster. + Among common analysis setups that are suitable for parallelization are computations that can be split into several analysis that each run on one subset of the data -- such one or some out of many subjects, acquisitions, or files. + The large computation "preprocess 200 subjects" can be split into 200 times the job "preprocess 1 subject", for example. + Commonly parallelized computations are also analyses that need to be ran with a range of different parameters, where each parameter configuration can constitute one job. + The latter type of parallelization is for example the case in simulation studies. + +Say your raw data contains continuous moisture measurements in the Arctic, taken over the course of 10 years. +Each file in your dataset contains the data of a single day. +You are interested in a daily aggregate, and are therefore parallelizing across files -- each compute job will run an analysis pipeline on one datafile. + +What you will submit as a job with a job scheduler is not a ``datalad containers-run`` call, but a shell script that contains all relevant data analysis steps. +Using `shell `_ as the language for this script is a straight-forward choice as it allows you to script the DataLad workflow just as you would type it into your terminal, but other languages (e.g., using :ref:`DataLad's Python API ` or system calls in languages such as Matlab) would work as well. - We start with a dataset (called ``pipelines`` in this example):: +**Building the job**: - $ datalad create pipelines - [INFO ] Creating a new annex repo at /data/projects/enki/pipelines - create(ok): /data/projects/enki/pipelines (dataset) - $ cd pipelines +``datalad (containers-)run`` does not support concurrent execution in the *same* dataset clone. +The solution is as easy as it is stubborn: We simply create one throw-away dataset clone for each job. - As one of tools used in the pipeline, `freesurfer `_, requires a license file, this license file needs to be added into the dataset. - Only then can this dataset be moved around flexibly and also to different machines. - In order to have the license file available right away, it is saved ``--to-git`` and not annexed [#f2]_:: +.. findoutmore:: how does one create throw-away clones? - $ cp . 
- $ datalad save --to-git -m "add freesurfer license file" fs-license.txt + One way to do this are :term:`ephemeral clone`\s, an alternative is to make :term:`git-annex` disregard the datasets annex completely using ``git annex dead here``. + The latter is more appropriate for this context -- we could use an ephemeral clone, but that might deposit data of failed jobs at the origin location, if the job runs on a shared filesystem -- let's stay self-contained. - Finally, we add a container with the pipeline to the dataset using :command:`datalad containers-add` [#f3]_. - The important part is the configuration of the container -- it has to be done in a way that makes the container usable in any superdataset the pipeline dataset. +Using throw-away clones involves a build-up, result-push, and tear-down routine for each job but this works well since datasets are by nature made for such decentralized, collaborative workflows. +We treat cluster compute nodes like contributors to the analyses that clone the analysis dataset hierarchy into a temporary location, run the computation, push the results, and remove their temporary dataset again [#f3]_. +All of this routine is done in a single script, which will be submitted as a job. +Here, we build the general structure of this script. - Depending on how the container/pipeline needs to be called, the configuration differs. - In the case of an fMRIprep run, we want to be able to invoke the container from a superdataset. - The superdataset contains input data and ``pipelines`` dataset as subdatasets, and will collect all of the results. - Thus, these are arguments we want to supply the invocation with (following `fMRIprep's documentation `_) during a ``containers-run`` command:: +The compute job clones the dataset to a unique place, so that it can run a containers-run command inside it without interfering with any other job. +The first part of the script is therefore to navigate to a unique location, and clone the analysis dataset to it. - $ datalad containers-run \ - [...] - \ - --n_cpus \ - --participant-label \ - [...] +.. findoutmore:: How can I get a unique location? - Note how this list does not include bind-mounts of the necessary directories or of the freesurfer license -- this makes the container invocation convenient and easy for any user. - Starting an fMRIprep run requires only a ``datalad containers-run`` with all of the desired fMRIprep options. + On common HTCondor setups, ``/tmp`` directories in individual jobs are job-specific local Filesystem not shared between jobs -- i.e., unique locations! + An alternative is to create a unique temporary directory, e.g., with the ``mktemp -d`` command on Unix systems. - This convenience for the user requires that all of the bind-mounts should be taken care of -- in a generic way -- in the container call specification, though. - Here is how this is done:: +.. code-block:: bash - $ datalad containers-add fmriprep \ - --url TODO \ - --call-fmt singularity run --cleanenv -B "$PWD" {img} {cmd} --fs-license-file "$PWD/{img_dspath}/freesurfer_license.txt" + # go into unique location + $ cd /tmp + # clone the analysis dataset + $ datalad clone /path/to/parallel_analysis ds + $ cd ds - During a :command:`datalad containers-run` command, the ``--call-fmt`` specification will be used to call the container. - The placeholders ``{img}`` and ``{cmd}`` will be replaced with the container (``{img}``) and the command given to ``datalad containers-run`` (``{cmd}``). 
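+If ``/tmp`` is not job-specific on a given system, the same effect can be achieved with the ``mktemp -d`` alternative mentioned in the Findoutmore above -- a rough sketch, using the same placeholder clone path as above::
+
+   # create and enter a unique temporary directory instead of /tmp
+   $ cd "$(mktemp -d)"
+   $ datalad clone /path/to/parallel_analysis ds
+   $ cd ds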
-   Thus, the ``--cleanenv`` flag (`recommended by fMRIprep `_) as well as bind-mounts are handled prior to the container invocation, and the ``--fs-license-file`` option with a path to the license file within the container is appended to the command.
-   Bind-mounting the working directory (``-B "$PWD"``) makes sure to bind mount the directory from which the container is being called, which should be the superdataset that contains input data and ``pipelines`` subdataset.
-   With these bind-mounts, input data and the freesurfer license file within ``pipelines`` are available in the container.
+This dataset clone is *temporary*: It will exist over the course of one analysis/job only, but before it is purged, all of the results it computed will be pushed to the original dataset.
+This requires a safe-guard: If the original dataset receives the results from the dataset clone, it knows about the clone and its state.
+In order to protect the results from accidental synchronization upon deletion of the linked dataset clone, the clone should be created as a "throw-away clone" right from the start.
+By running ``git annex dead here``, :term:`git-annex` disregards the clone, preventing the deletion of data in the clone from affecting the original dataset.

-   With such a setup, the ``pipelines`` dataset can be installed in any dataset and will work out of the box.
+.. code-block:: bash

-Analysis dataset setup
-""""""""""""""""""""""
+   $ git annex dead here

-An analysis dataset consists of the following components:
+The ``datalad push`` to the original clone location of a dataset needs to be prepared carefully.
+The job computes one result of many and saves it, thus creating new data and a new entry with the run-record in the dataset history.
+But each job is unaware of the results and :term:`commit`\s produced by other jobs.
+Should all jobs push back their results to the original place (the :term:`master` :term:`branch` of the original dataset), the individual jobs would conflict with each other or, worse, overwrite each other (if you don't have the default push configuration of Git).

-- input data as a subdataset
-- ``pipelines`` container dataset as a subdataset
-- subdatasets to hold the results
+The general procedure and standard :term:`Git` workflow for collaboration, therefore, is to create a change on a different, unique :term:`branch`, push this different branch, and integrate the changes into the original master branch via a :term:`merge` in the original dataset [#f4]_.

-Following the benchmarks and tips in the chapter :ref:`chapter_gobig`, the amount of files produced by fMRIprep on 1300 subjects requires two datasets to hold them.
-In this particular computation, following the naming scheme and structure of fMRIpreps output directories, one subdataset is created for the `freesurfer `_ results of fMRIprep in a subdataset called ``freesurfer``, and one for the minimally preprocessed input data in a subdataset called ``fmriprep``.
+In order to do this, prior to executing the analysis, the script will *checkout* a unique new branch in the analysis dataset.
+The most convenient name for the branch is the Job-ID, an identifier under which the job scheduler runs an individual job.
+This makes it easy to associate a result (via its branch) with the log, error, or output files that the job scheduler produces [#f5]_, and the real-life example will demonstrate these advantages more concretely.

-Here is an overview of the directory structure in the superdataset::
+.. code-block:: bash

-   superds
-   ├── code                       # directory
-   │   └── pipelines              # subdataset with fMRIprep
-   ├── fmriprep                   # subdataset for results
-   ├── freesurfer                 # subdataset for results
-   └── sourcedata                 # subdataset with BIDS-formatted data
-       ├── sourcedata             # subdataset with raw data
-       ├── sub-A00008326          # directory
-       ├── sub-...
+   # git checkout -b creates a new branch and checks it out
+   $ git checkout -b "job-$JOBID"

+Importantly, the ``$JOBID`` isn't hardcoded into the script but can be given to the script as an environment or input variable at the time of job submission.
+The code snippet above uses a bash environment variable (``$JOBID``).
+It will be defined in the job submission -- this is shown and explained in detail in the respective paragraph below.

-Workflow script
-"""""""""""""""
+Next, it is time for the ``containers-run`` command.
+The invocation will depend on the container and dataset configuration (both of which are demonstrated in the real-life example in the next section), and below, we pretend that the pipeline invocation only needs an input file and an output file.
+The input file is specified via a bash variable (``$inputfile``) that will be defined in the script and provided at the time of job submission via a command line argument from the job scheduler, and the output file name is derived from the input file name.

-The general complexity of concurrent ``datalad (containers-)run`` commands arises when they are carried out in the same dataset.
-Therefore, the strategy is to create throw-away dataset clone for all jobs.
+.. code-block:: bash

-.. findoutmore:: how does one create throw-away clones?
+   $ datalad containers-run \
+     -m "Computing results for $inputfile" \
+     --explicit \
+     --output "aggregate_${inputfile}" \
+     --input "rawdata/$inputfile" \
+     -n code/containers/mycontainer \
+     '{inputs}' '{outputs}'

-   One way to do this are :term:`ephemeral clone`\s, an alternative is to make :term:`git-annex` disregard the datasets annex completely using ``git annex dead here``.
+After the ``containers-run`` execution in the script, the results can be pushed back to the dataset :term:`sibling` ``origin`` [#f6]_::

-Using throw-away clones involves a build-up and tear-down routine for each job but works well since datasets are by nature made for collaboration [#f7]_: Clone the analysis dataset hierarchy into a temporary location, run the computation, push the results, remove temporary dataset [#f4]_.
+   $ datalad push --to origin

-All of this is done in a single script, which will be submitted as a job.
-To give you a first idea, a sketch of this is in the :term:`bash` (shell) script below.
-Using `shell `_ as the language for this script is a straight-forward choice as it allows you to script the DataLad workflow just as you would type it into your terminal, but other languages (e.g., using :ref:`DataLad's Python API ` or system calls in languages such as Matlab) would work as well.
-Fine-tuning and the complete script are shown in the findoutmore afterwards::
+Pending a few still missing safe-guards against concurrency issues and the definition of job-specific (environment) variables, such a script can be submitted to any job scheduler with identifiers for input and output files and a job ID that determines the branch name.
+This workflow sketch takes care of everything that needs to be done apart from combining all computed results afterwards.
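+A quick sanity check of this sketch is possible without any job scheduler: run it once by hand for a single input.
+The call below is only an illustration -- it assumes that the steps above are collected into a script (for example ``code/analysis_job.sh``, as suggested further below) that reads the input file name from its first command line argument, and it uses a made-up job ID and input file name::
+
+   $ JOBID=test.0 bash code/analysis_job.sh acquisition_2010-01-01_.txt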
- # everything is running under /tmp inside a compute job, - # /tmp is job-specific local filesystem not shared between jobs - $ cd /tmp +.. findoutmore:: Fine-tuning: Safe-guard concurrency issues - # clone the superdataset - $ datalad clone /data/project/enki/super ds - $ cd ds + An important fine-tuning is missing: + Cloning and pushing *can* still run into concurrency issues in the case when one job clones the original dataset while another job is currently pushing its results into this dataset. + Therefore, a trick can make sure that no two clone or push commands are executed at *exactly* the same time. + This trick uses `file locking `_, in particular the tool `flock `_, to prevent exactly concurrent processes. + This is done by prepending ``clone`` and ``push`` commands with ``flock --verbose $DSLOCKFILE``, where ``$DSLOCKFILE`` is a text file placed into ``.git/`` at the time of job submission, provided via environment variable (see below and the paragraph "Job submission"). + This is a non-trivial process, but luckily, you don't need to understand file locking or ``flock`` in order to follow along -- just make sure that you copy the usage of ``$DSLOCKFILE`` in the script and in the job submission. - # get first-level subdatasets (-R1 = --recursion-limit 1) - $ datalad get -n -r -R1 . +.. findoutmore:: Variable definition - # make git-annex disregard the clones - they are meant to be thrown away - $ git submodule foreach --recursive git annex dead here + There are two ways to define variables that a script can use: + The first is by defining environment variables, and passing this environment to the compute job. + This can be done in the job submission file. + To set and pass down the job-ID and a lock file in HTCondor, one can supply the following line in the job submission file:: - # checkout unique branches (names derived from job IDs) in both subdatasets - # to enable pushing the results without interference from other jobs - $ git -C fmriprep checkout -b "job-$JOBID" - $ git -C freesurfer checkout -b "job-$JOBID" + environment = "JOBID=$(Cluster).$(Process) DSLOCKFILE=$ENV(PWD)/.git/datalad_lock" - # call fmriprep with datalad containers-run. Use all relevant fMRIprep - # arguments for your usecase - $ datalad containers-run \ - -m "fMRIprep $subid" \ - --explicit \ - -o freesurfer -o fmriprep \ - -i "$1" \ - -n code/pipelines/fmriprep \ - sourcedata . participant \ - --n_cpus 1 \ - --skip-bids-validation \ - -w .git/tmp/wdir \ - --participant-label "$subid" \ - --random-seed 12345 \ - --skull-strip-fixed-seed \ - --md-only-boilerplate \ - --output-spaces MNI152NLin6Asym \ - --use-aroma \ - --cifti-output - - # push back the results - $ datalad push -d fmriprep --to origin - $ datalad push -d freesurfer --to origin - # job handler should clean up workspace - -Pending a few yet missing safe guards against concurrency issues and to enable re-running computations, such a script can be submitted to any job scheduler with a subject ID and a job ID as identifiers for the fMRIprep run and branch names. -The concrete calling/submission of this script is shown in the paragraph :ref:`jobsubmit`, but on a procedural level, this workflow sketch takes care of everything that needs to be done apart from combining all computed results afterwards. - -.. 
findoutmore:: Fine-tuning: Enable re-running and safe-guard concurrency issues
-
-   Two important fine-tunings are missing:
-   For one, cloning and pushing *can* still run into concurrency issues in the case when one job clones the original dataset while another job is currently pushing into this dataset.
-   Therefore, a trick can make sure that no two clone or push commands are executed at the same time.
-   This trick uses `file locking `_, in particular the tool `flock `_, to prevent exactly concurrent processes.
-   This is done by prepending ``clone`` and ``push`` commands with ``flock --verbose $DSLOCKFILE``, where ``$DSLOCKFILE`` is a textfile placed into ``.git/`` at the time of job submission (further details in the submit file in the next section)
+   The second way is via shell script command line arguments.
+   Everything that is given as a command line argument to the script can be accessed in the script in the order of their appearance via ``$``.
+   A script invoked with ``bash myscript.sh `` can access ``inputfile`` with ``$1``, ``parameter`` with ``$2``, and ```` with ``$3``.
+   If the job scheduler takes care of iterating through input file names, the relevant input variable for the simplistic example could thus be defined in the script as follows::
+
+      inputfile=$1

-   The second issue concerns the ability to rerun a computation quickly:
-   If fMRIprep finds preexisting results, it will fail to run.
-   Therefore, all outputs of a job are attempted to be removed before the jobs is started [#f5]_::
+With fine-tuning and variable definitions in place, the only things missing are a :term:`shebang` at the top of the script, and some shell settings for robust scripting with verbose log files (``set -e -u -x``).
+Here is what the full general script looks like:

-      (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv)
-      (cd freesurfer && rm -rf fsaverage "$subid")
+.. code-block:: bash

-   With this in place, the only things missing are a :term:`shebang` at the top of the script, and some shell settings for robust scripting with verbose log files (``set -e -u -x``).
-   You can find the full script with rich comments in the next findoutmore.
+   #!/bin/bash
+
+   # fail whenever something is fishy, use -x to get verbose logfiles
+   set -e -u -x
+
+   # we pass arbitrary arguments via job scheduler and can use them as variables
+   inputfile=$1
+   ...
+
+   # go into unique location
+   cd /tmp
+   # clone the analysis dataset. flock makes sure that this does not interfere
+   # with another job finishing and pushing results back at the same time
+   flock --verbose $DSLOCKFILE datalad clone /path/to/parallel_analysis ds
+   cd ds
+   # announce the clone to be temporary
+   git annex dead here
+   # checkout a unique branch
+   git checkout -b "job-$JOBID"
+   # run the job
+   datalad containers-run \
+     -m "Computing results for $inputfile" \
+     --explicit \
+     --output "aggregate_${inputfile}" \
+     --input "rawdata/$inputfile" \
+     -n code/containers/mycontainer \
+     '{inputs}' '{outputs}'
+   # push, with filelocking as a safe-guard
+   flock --verbose $DSLOCKFILE datalad push --to origin

-
-.. findoutmore:: See the complete bash script
+   # Done - job handler should clean up workspace

-   This script is placed in ``code/fmriprep_participant_job``:
+It's a short script that encapsulates a complete workflow.
+Think of it as the sequence of necessary DataLad commands you would need to run in order to compute a job.
+You can save this script into your analysis dataset, e.g., as ``code/analysis_job.sh``.

-   ..
code-block:: bash +**Job submission**: - #!/bin/bash - - # fail whenever something is fishy, use -x to get verbose logfiles - set -e -u -x - - # we pass in "sourcedata/sub-...", extract subject id from it - subid=$(basename $1) - - # this is all running under /tmp inside a compute job, /tmp is a performant - # local filesystem - cd /tmp - # get the output dataset, which includes the inputs as well - # flock makes sure that this does not interfere with another job - # finishing at the same time, and pushing its results back - # importantly, we clone from the location that we want to push the - # results too - flock --verbose $DSLOCKFILE \ - datalad clone /data/project/enki/super ds - - # all following actions are performed in the context of the superdataset - cd ds - # obtain all first-level subdatasets: - # dataset with fmriprep singularity container and pre-configured - # pipeline call; also get the output dataset to prep them for output - # consumption, we need to tune them for this particular job, sourcedata - # important: because we will push additions to the result datasets back - # at the end of the job, the installation of these result datasets - # must happen from the location we want to push back too - datalad get -n -r -R1 . - # let git-annex know that we do not want to remember any of these clones - # (we could have used an --ephemeral clone, but that might deposite data - # of failed jobs at the origin location, if the job runs on a shared - # filesystem -- let's stay self-contained) - git submodule foreach --recursive git annex dead here - - # checkout new branches in both subdatasets - # this enables us to store the results of this job, and push them back - # without interference from other jobs - git -C fmriprep checkout -b "job-$JOBID" - git -C freesurfer checkout -b "job-$JOBID" - # create workdir for fmriprep inside to simplify singularity call - # PWD will be available in the container - mkdir -p .git/tmp/wdir - # pybids (inside fmriprep) gets angry when it sees dangling symlinks - # of .json files -- wipe them out, spare only those that belong to - # the participant we want to process in this job - find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete - - # next one is important to get job-reruns correct. We remove all anticipated - # output, such that fmriprep isn't confused by the presence of stale - # symlinks. Otherwise we would need to obtain and unlock file content. - # But that takes some time, for no reason other than being discarded - # at the end - (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) - (cd freesurfer && rm -rf fsaverage "$subid") - - # the meat of the matter, add actual parameterization after --participant-label - datalad containers-run \ - -m "fMRIprep $subid" \ - --explicit \ - -o freesurfer -o fmriprep \ - -i "$1" \ - -n code/pipelines/fmriprep \ - sourcedata . 
participant \ - --n_cpus 1 \ - --skip-bids-validation \ - -w .git/tmp/wdir \ - --participant-label "$subid" \ - --random-seed 12345 \ - --skull-strip-fixed-seed \ - --md-only-boilerplate \ - --output-spaces MNI152NLin6Asym \ - --use-aroma \ - --cifti-output - # selectively push outputs only - # ignore root dataset, despite recorded changes, needs coordinated - # merge at receiving end - flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin - flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin - - # job handler should clean up workspace - -Pending modifications to paths provided in clone locations, the above script and dataset setup is generic enough to be run on different systems and with different job schedulers. - -.. _jobsubmit: - -Job submission -"""""""""""""" - -Job submission now only boils down to invoking the script for each participant with a participant identifier that determines on which subject the job runs, and setting two environment variables -- one the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``. -Job scheduler such as HTCondor have syntax that can identify subject IDs from consistently named directories, for example, and the submit file can thus be lean even though it queues up more than 1000 jobs. - -You can find the submit file used in this analyses in the findoutmore below. +Job submission now only boils down to invoking the script for each participant with the relevant command line arguments (e.g., input and output files for the our artificial example) and the necessary environment variables (e.g., the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``). +Job scheduler such as HTCondor can typically do this with automatic variables. +They for example have syntax that can identify subject IDs or consecutive file numbers from consistently named directory structure, access the job ID, loop through a predefined list of values or parameters, or use various forms of pattern matching. +Examples of this are demonstrated `here `_. +Thus, the submit file takes care of defining hundreds or thousands of variables, but can still be lean even though it queues up hundreds or thousands of jobs. +Here is a submit file that could be employed: .. findoutmore:: HTCondor submit file - The following submit file was created and saved in ``code/fmriprep_all_participants.submit``: - .. code-block:: bash universe = vanilla @@ -364,7 +317,7 @@ You can find the submit file used in this analyses in the findoutmore below. request_memory = 20G request_disk = 210G - executable = $ENV(PWD)/code/fmriprep_participant_job + executable = $ENV(PWD)/code/analysis_job.sh # the job expects to environment variables for labeling and synchronization environment = "JOBID=$(Cluster).$(Process) DSLOCKFILE=$ENV(PWD)/.git/datalad_lock" @@ -372,29 +325,25 @@ You can find the submit file used in this analyses in the findoutmore below. output = $ENV(PWD)/../logs/$(Cluster).$(Process).out error = $ENV(PWD)/../logs/$(Cluster).$(Process).err arguments = $(subid) - # find all participants, based on the subdirectory names in the source dataset - # each relative path to such a subdirectory with become the value of `subid` - # and another job is queued. 
Will queue a total number of jobs matching the - # number of matching subdirectories - queue subid matching dirs sourcedata/sub-* + # find all input data, based on the file names in the source dataset. + # Each relative path to such a file name will become the value of `inputfile` + # This will queue as many jobs as file names match the pattern + queue inputfile matching files rawdata/acquisition_*_.txt All it takes to submit is a single ``condor_submit ``. -Merging results -""""""""""""""" -Once all jobs have finished, the results lie in individual branches of the output datasets. -In this concrete example, the subdatasets ``fmriprep`` and ``freesurfer`` will each have more than 1000 branches that hold individual job results. +**Merging results**: +Once all jobs are finished, the results lie in individual branches of the original dataset. The only thing left to do now is merging all of these branches into :term:`master` -- and potentially solve any merge conflicts that arise. Usually, merging branches is done using the ``git merge`` command with a branch specification. For example, in order to merge one job branch into the :term:`master` :term:`branch`, one would need to be on ``master`` and run ``git merge ``. -Given that the subdatasets each contain >1000 branches, and that each ``merge`` would lead to a commit, in order to not inflate the history of the dataset with hundreds of merge commits, two `Octopus merges `_ were done - one in each subdataset (``fmriprep`` and ``freesurfer``). +Given that the parallel job execution could have created thousands of branches, and that each ``merge`` would lead to a commit, in order to not inflate the history of the dataset with hundreds of :term:`merge` commits, one can do a single `Octopus merges `_ of all branches at once. .. findoutmore:: What is an octopus merge? Usually a commit that arises from a merge has two *parent* commits: The *first parent* is the branch the merge is being performed from, in the example above, ``master``. The *second parent* is the branch that was merged into the first. - However, ``git merge`` is capable of merging more than two branches simultaneously if more than a single branch name is given to the command. The resulting merge commit has as many parent as were involved in the merge. If a commit has more than two parents, if is affectionately called an "Octopus" merge. @@ -402,73 +351,30 @@ Given that the subdatasets each contain >1000 branches, and that each ``merge`` Octopus merges require merge-conflict-free situations, and will not be carried out whenever manual resolution of conflicts is needed. The merge command can be assembled quickly. -As all result branches were named ``job-``, a complete list of branches is obtained with the following command:: +If all result branches were named ``job-``, a complete list of branches is obtained with the following command:: $ git branch -l | grep 'job-' | tr -d ' ' -This command line call translates to: "list all branches, of all branches, show me those that contain ``job-``, and remove (``tr -d``) all whitespace. -This can be given to ``git merge`` as in +This command line call translates to: "list all branches. Of those branches, show me those that contain ``job-``, and remove (``tr -d``) all whitespace." +This call can be given to ``git merge`` as in .. 
code-block:: bash $ git merge -m "Merge results from job cluster XY" $(git branch -l | grep 'job-' | tr -d ' ') -**Merging with merge conflicts** - -When attempting an octopus merge like the one above and a merge conflict arises, the merge is aborted automatically. This is what it looks like:: - - $ git merge -m "Merge results from job cluster 107890" $(git branch -l | grep 'job-' | tr -d ' ') - Fast-forwarding to: job-107890.0 - Trying simple merge with job-107890.1 - Simple merge did not work, trying automatic merge. - ERROR: logs/CITATION.md: Not merging symbolic link changes. - fatal: merge program failed - Automated merge did not work. - Should not be doing an octopus. - Merge with strategy octopus failed. - -This merge conflict arose in the ``fmriprep`` subdataset an originated from the fact that each job generated a ``CITATION.md`` file with minimal individual changes. - -.. findoutmore:: How to fix this? - - As the file ``CITATION.md`` does not contain meaningful changes between jobs, one of the files was kept (e.g., copied into a temporary location, or brought back to life afterwards with ``git cat-file``), and all ``CITATION.md`` files of all branches were deleted prior to the merge. - Here is a bash loop that would do exactly that:: - - $ for b in $(git branch -l | grep 'job-' | tr -d ' '); - do ( git checkout -b m$b $b && git rm logs/CITATION.md && git commit --amend --no-edit ) ; - done - - Afterwards, the merge command succeeds - -**Merging without merge conflicts** - -If no merge conflicts arise and the octopus merge is successful, all results are aggregated in the ``master`` branch. -The commit log looks like a work of modern art when visualized with tools such as :term:`tig`: - -.. figure:: ../artwork/src/octopusmerge_tig.png - - -Summary -""""""" - -Once all jobs are computed in parallel and the resulting branches merged, the superdataset is populated with two subdatasets that hold the preprocessing results. -Each result contains a machine-readable record of provenance on when, how, and by whom it was computed. -From this point, the results in the subdatasets can be used for further analysis, while a record of how they were preprocessed is attached to them. - - +Voilà -- the results of all provenance-tracked job executions merged into the original dataset. +If you are interested in seeing this workflow applied in a real analysis, read on into the next section, :ref:`hcpenki`. .. rubric:: Footnotes .. [#f1] To re-read about :command:`datalad run`'s ``--explicit`` option, take a look into the section :ref:`run5`. -.. [#f2] If the distinction between annexed and unannexed files is new to you, please read section :ref:`symlink` - -.. [#f3] Note that this requires the ``datalad containers`` extension. Find an overview of all datalad extensions in :ref:`extensions_intro`. +.. [#f2] The `ReproNim container-collection `_ is a DataLad dataset that contains a range of preconfigured containers for neuroimaging. -.. [#f4] Clean-up routines can, in the case of common job schedulers, be taken care of by performing everything in compute node specific ``/tmp`` directories that are wiped clean after job termination. +.. [#f3] Clean-up routines can, in the case of common job schedulers, be taken care of by performing everything in compute node specific ``/tmp`` directories that are wiped clean after job termination. -.. 
[#f5] The brackets around the commands are called *command grouping* in bash, and yield a subshell environment: `www.gnu.org/software/bash/manual/html_node/Command-Grouping.html `_. +.. [#f4] For an analogy, consider a group of software developers: Instead of adding code changes to the main :term:`branch` of a repository, they develop in their own repository clones and on dedicated, individual feature branches. This allows them to integrate their changes back into the original repository with as little conflict as possible. To find out why a different branch is required to enable easy pushing back to the original dataset, please checkout the explanation on :ref:`pushing to non-bare repositories ` in the section on :ref:`help`. -.. [#f6] To find out why a different branch is required to enable easy pushing back to the original dataset, please checkout the explanation on :ref:`pushing to non-bare repositories ` in the section on :ref:`help`. +.. [#f5] Job schedulers can commonly produce log, error, and output files and it is advisable to save them for each job. Usually, job schedulers make it convenient to save them with a job-ID as an identifier. An example of this for HTCondor is shown in the Findoutmore in :ref:`jobsubmit`. -.. [#f7] For an analogy, consider a group of software developers: Instead of adding code changes to the main :term:`branch` of a repository, they develop in their own repository clones and on dedicated, individual feature branches. This allows them to integrate their changes back into the original repository with as little conflict as possible. \ No newline at end of file +.. [#f6] When a dataset is cloned from any location, this original location is by default known as the :term:`sibling`/:term:`remote` ``origin`` to the clone. From cbfce1a6c46e5d5e09226a60ac677a7cb626644e Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 19 Oct 2020 16:22:44 +0200 Subject: [PATCH 17/22] Run: Make ENKI preprocessing an additional walkthrough --- docs/beyond_basics/101-171-enki.rst | 403 ++++++++++++++++++++++++++++ docs/beyond_basics/basics-hpc.rst | 1 + 2 files changed, 404 insertions(+) create mode 100644 docs/beyond_basics/101-171-enki.rst diff --git a/docs/beyond_basics/101-171-enki.rst b/docs/beyond_basics/101-171-enki.rst new file mode 100644 index 000000000..471158019 --- /dev/null +++ b/docs/beyond_basics/101-171-enki.rst @@ -0,0 +1,403 @@ +.. _hcpenki: + +Walkthrough: Parallel ENKI preprocessing with fMRIprep +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The previous section has been an overview on parallel, provenance-tracked computations in DataLad datasets. +While the general workflow entails a complete setup, its usually easier to understand it by seeing it applied to a concrete usecase. +Its even more informative if that usecase includes some complexities that do not exist in the "picture-perfect" example but are likely to arise in real life. +Therefore, the following walk-through in this section is a write-up of an existing and successfully executed analysis. + +Its goal was standard data preprocessing using `fMRIprep `_ on neuroimaging data of 1300 subjects in the `eNKI `_ dataset. +In order to associate input data, containerized pipeline, and outputs, the analysis was carried out in a DataLad dataset and with the :command:`datalad containers-run` command. 
+The pipeline dataset was created with a custom configuration to make it generalizable, and, due to the additional complexity of a large quantity of results, the output was collected in subdatasets. + +.. _pipelineenki: + +Starting point: Datasets for software and input data +"""""""""""""""""""""""""""""""""""""""""""""""""""" + +At the beginning of this endeavour, two important analysis components already exist as DataLad datasets: + +1. The input data +2. The containerized pipeline + +Following the :ref:`YODA principles `, each of these components is a standalone dataset. +While the input dataset creation is straightforwards, some thinking went into the creation of containerized pipeline dataset to set it up in a way that allows it to be installed as a subdataset and invoked from the superdataset. +If you are interested in this, find the details in the findoutmore below. +Also note that there is a large collection of pre-existing container datasets available at `github.com/ReproNim/containers `_. + +.. findoutmore:: pipeline dataset creation + + We start with a dataset (called ``pipelines`` in this example):: + + $ datalad create pipelines + [INFO ] Creating a new annex repo at /data/projects/enki/pipelines + create(ok): /data/projects/enki/pipelines (dataset) + $ cd pipelines + + As one of tools used in the pipeline, `freesurfer `_, requires a license file, this license file needs to be added into the dataset. + Only then can this dataset be moved around flexibly and also to different machines. + In order to have the license file available right away, it is saved ``--to-git`` and not annexed [#f1]_:: + + $ cp . + $ datalad save --to-git -m "add freesurfer license file" fs-license.txt + + Finally, we add a container with the pipeline to the dataset using :command:`datalad containers-add` [#f2]_. + The important part is the configuration of the container -- it has to be done in a way that makes the container usable in any superdataset the pipeline dataset. + + Depending on how the container/pipeline needs to be called, the configuration differs. + In the case of an fMRIprep run, we want to be able to invoke the container from a superdataset. + The superdataset contains input data and ``pipelines`` dataset as subdatasets, and will collect all of the results. + Thus, these are arguments we want to supply the invocation with (following `fMRIprep's documentation `_) during a ``containers-run`` command:: + + $ datalad containers-run \ + [...] + \ + --n_cpus \ + --participant-label \ + [...] + + Note how this list does not include bind-mounts of the necessary directories or of the freesurfer license -- this makes the container invocation convenient and easy for any user. + Starting an fMRIprep run requires only a ``datalad containers-run`` with all of the desired fMRIprep options. + + This convenience for the user requires that all of the bind-mounts should be taken care of -- in a generic way -- in the container call specification, though. + Here is how this is done:: + + $ datalad containers-add fmriprep \ + --url /data/project/singularity/fmriprep-20.2.0.simg \ + --call-fmt singularity run --cleanenv -B "$PWD" {img} {cmd} --fs-license-file "$PWD/{img_dspath}/freesurfer_license.txt" + + During a :command:`datalad containers-run` command, the ``--call-fmt`` specification will be used to call the container. + The placeholders ``{img}`` and ``{cmd}`` will be replaced with the container (``{img}``) and the command given to ``datalad containers-run`` (``{cmd}``). 
+ Thus, the ``--cleanenv`` flag (`recommended by fMRIprep `_) as well as bind-mounts are handled prior to the container invocation, and the ``--fs-license-file`` option with a path to the license file within the container is appended to the command. + Bind-mounting the working directory (``-B "$PWD"``) makes sure to bind mount the directory from which the container is being called, which should be the superdataset that contains input data and ``pipelines`` subdataset. + With these bind-mounts, input data and the freesurfer license file within ``pipelines`` are available in the container. + + With such a setup, the ``pipelines`` dataset can be installed in any dataset and will work out of the box. + +Analysis dataset setup +"""""""""""""""""""""" + +The size of the input dataset and the nature of preprocessing results with fMRIprep constitute an additional complexity: +Based on the amount of input data and test runs of fMRIprep on single subjects, we estimated that the preprocessing results from fMRIprep would encompass several TB in size and about half a million files. +This amount of files is too large to be stored in a single dataset, though, and results will therefore need to be split into two result datasets. +These will be included as direct subdatasets of the toplevel analysis dataset. +This is inconvenient -- it separates results (in the result subdatasets) from their provenance (the run-records in the top-level dataset) -- but inevitable given the dataset size. +A final analysis dataset will consist of the following components: + +- input data as a subdataset +- ``pipelines`` container dataset as a subdataset +- subdatasets to hold the results + +Following the benchmarks and tips in the chapter :ref:`chapter_gobig`, the amount of files produced by fMRIprep on 1300 subjects requires two datasets to hold them. +In this particular computation, following the naming scheme and structure of fMRIpreps output directories, one subdataset is created for the `freesurfer `_ results of fMRIprep in a subdataset called ``freesurfer``, and one for the minimally preprocessed input data in a subdataset called ``fmriprep``. + +Here is an overview of the directory structure in the superdataset:: + + superds + ├── code # directory + │   └── pipelines # subdataset with fMRIprep + ├── fmriprep # subdataset for results + ├── freesurfer # subdataset for results + └── sourcedata # subdataset with BIDS-formatted data + ├── sourcedata # subdataset with raw data + ├── sub-A00008326 # directory + ├── sub-... + +When running fMRIprep on a smaller set of subjects, or a containerized pipeline that produces fewer files, saving results into subdatasets isn't necessary. + +Workflow script +""""""""""""""" + +Based on the general principles introduced in the previous section, there is a sketch of the workflow in the :term:`bash` (shell) script below. +It still lacks ``fMRIprep`` specific fine-tuning -- the complete script is shown in the findoutmore afterwards. +This initial sketch serves to highlight key differences and adjustments due to the complexity and size of the analysis, explained below and highlighted in the script as well: + +* **Getting subdatasets**: The empty result subdatasets wouldn't be installed in the clone automatically -- ``datalad get -n -r -R1 .`` installs all first-level subdatasets so that they are available to be populated with results. +* **recursive throw-away clones**: In the simpler general workflow, we ran ``git annex dead here`` in the topmost dataset. 
+ This dataset contains the results within subdatasets. + In order to make them "throw-away" as well, the ``git annex dead here`` configuration needs to be applied recursively for all datasets with ``git submodule foreach --recursive git annex dead here``. +* **Checkout unique branches in the subdataset**: Since the results will be pushed from the subdatasets, it is in there that unique branches need to be checked out. + We're using ``git -C `` to apply a command in dataset under ``path``. +* **Complex container call**: The ``containers-run`` command is more complex because it supplies all desired ``fMRIprep`` arguments. +* **Push the subdatasets only**: We only need to push the results, i.e., there is one push per each subdataset. + +.. code-block:: bash + :emphasize-lines: 10, 13, 19-20, 24, 43-44 + + # everything is running under /tmp inside a compute job, + # /tmp is job-specific local filesystem not shared between jobs + $ cd /tmp + + # clone the superdataset with locking + $ flock --verbose $DSLOCKFILE datalad clone /data/project/enki/super ds + $ cd ds + + # get first-level subdatasets (-R1 = --recursion-limit 1) + $ datalad get -n -r -R1 . + + # make git-annex disregard the clones - they are meant to be thrown away + $ git submodule foreach --recursive git annex dead here + + # checkout unique branches (names derived from job IDs) in both subdatasets + # to enable pushing the results without interference from other jobs + # In a setup with no subdatasets, "-C " would be stripped, + # and a new branch would be checked out in the superdataset instead. + $ git -C fmriprep checkout -b "job-$JOBID" + $ git -C freesurfer checkout -b "job-$JOBID" + + # call fmriprep with datalad containers-run. Use all relevant fMRIprep + # arguments for your usecase + $ datalad containers-run \ + -m "fMRIprep $subid" \ + --explicit \ + -o freesurfer -o fmriprep \ + -i "$1" \ + -n code/pipelines/fmriprep \ + sourcedata . participant \ + --n_cpus 1 \ + --skip-bids-validation \ + -w .git/tmp/wdir \ + --participant-label "$subid" \ + --random-seed 12345 \ + --skull-strip-fixed-seed \ + --md-only-boilerplate \ + --output-spaces MNI152NLin6Asym \ + --use-aroma \ + --cifti-output + + # push back the results + $ flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin + $ flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin + # job handler should clean up workspace + +Just like the general script from the last section, this script can be submitted to any job scheduler -- here with a subject ID as a ``$subid`` command line variable and a job ID as environment variable as identifiers for the fMRIprep run and branch names. +At this point, the workflow misses a tweak that is necessary in fMRIprep to enable re-running computations. + +.. findoutmore:: Fine-tuning: Enable re-running + + If you want to make sure that your dataset is set up in a way that you have the ability to rerun a computation quickly, the following fMRIprep-specific consideration is important: + If fMRIprep finds preexisting results, it will fail to run. + Therefore, all outputs of a job need to be removed before the jobs is started [#f3]_. + We can simply add an attempt to do this in the script:: + + (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) + (cd freesurfer && rm -rf fsaverage "$subid") + + With this in place, the only things missing are a :term:`shebang` at the top of the script, and some shell settings for robust scripting with verbose log files (``set -e -u -x``). 
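+
+    For example, the first lines of the complete script (shown in full in the next findoutmore) look like this::
+
+        #!/bin/bash
+
+        # fail whenever something is fishy, use -x to get verbose logfiles
+        set -e -u -x
+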
+ You can find the full script with rich comments in the next findoutmore. + +.. findoutmore:: See the complete bash script + + This script is placed in ``code/fmriprep_participant_job``: + + .. code-block:: bash + + #!/bin/bash + + # fail whenever something is fishy, use -x to get verbose logfiles + set -e -u -x + + # we pass in "sourcedata/sub-...", extract subject id from it + subid=$(basename $1) + + # this is all running under /tmp inside a compute job, /tmp is a performant + # local filesystem + cd /tmp + # get the output dataset, which includes the inputs as well + # flock makes sure that this does not interfere with another job + # finishing at the same time, and pushing its results back + # importantly, we clone from the location that we want to push the + # results too + flock --verbose $DSLOCKFILE \ + datalad clone /data/project/enki/super ds + + # all following actions are performed in the context of the superdataset + cd ds + # obtain all first-level subdatasets: + # dataset with fmriprep singularity container and pre-configured + # pipeline call; also get the output dataset to prep them for output + # consumption, we need to tune them for this particular job, sourcedata + # important: because we will push additions to the result datasets back + # at the end of the job, the installation of these result datasets + # must happen from the location we want to push back too + datalad get -n -r -R1 . + # let git-annex know that we do not want to remember any of these clones + # (we could have used an --ephemeral clone, but that might deposite data + # of failed jobs at the origin location, if the job runs on a shared + # filesystem -- let's stay self-contained) + git submodule foreach --recursive git annex dead here + + # checkout new branches in both subdatasets + # this enables us to store the results of this job, and push them back + # without interference from other jobs + git -C fmriprep checkout -b "job-$JOBID" + git -C freesurfer checkout -b "job-$JOBID" + # create workdir for fmriprep inside to simplify singularity call + # PWD will be available in the container + mkdir -p .git/tmp/wdir + # pybids (inside fmriprep) gets angry when it sees dangling symlinks + # of .json files -- wipe them out, spare only those that belong to + # the participant we want to process in this job + find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete + + # next one is important to get job-reruns correct. We remove all anticipated + # output, such that fmriprep isn't confused by the presence of stale + # symlinks. Otherwise we would need to obtain and unlock file content. + # But that takes some time, for no reason other than being discarded + # at the end + (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) + (cd freesurfer && rm -rf fsaverage "$subid") + + # the meat of the matter, add actual parameterization after --participant-label + datalad containers-run \ + -m "fMRIprep $subid" \ + --explicit \ + -o freesurfer -o fmriprep \ + -i "$1" \ + -n code/pipelines/fmriprep \ + sourcedata . 
participant \ + --n_cpus 1 \ + --skip-bids-validation \ + -w .git/tmp/wdir \ + --participant-label "$subid" \ + --random-seed 12345 \ + --skull-strip-fixed-seed \ + --md-only-boilerplate \ + --output-spaces MNI152NLin6Asym \ + --use-aroma \ + --cifti-output + # selectively push outputs only + # ignore root dataset, despite recorded changes, needs coordinated + # merge at receiving end + flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin + flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin + + # job handler should clean up workspace + +Pending modifications to paths provided in clone locations, the above script and dataset setup is generic enough to be run on different systems and with different job schedulers. + +.. _jobsubmit: + +Job submission +"""""""""""""" + +Job submission now only boils down to invoking the script for each participant with a participant identifier that determines on which subject the job runs, and setting two environment variables -- one the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``. +Job scheduler such as HTCondor have syntax that can identify subject IDs from consistently named directories, for example, and the submit file can thus be lean even though it queues up more than 1000 jobs. + +You can find the submit file used in this analyses in the findoutmore below. + +.. findoutmore:: HTCondor submit file + + The following submit file was created and saved in ``code/fmriprep_all_participants.submit``: + + .. code-block:: bash + + universe = vanilla + get_env = True + # resource requirements for each job, determined by + # investigating the demands of a single test job + request_cpus = 1 + request_memory = 20G + request_disk = 210G + + executable = $ENV(PWD)/code/fmriprep_participant_job + + # the job expects to environment variables for labeling and synchronization + environment = "JOBID=$(Cluster).$(Process) DSLOCKFILE=$ENV(PWD)/.git/datalad_lock" + log = $ENV(PWD)/../logs/$(Cluster).$(Process).log + output = $ENV(PWD)/../logs/$(Cluster).$(Process).out + error = $ENV(PWD)/../logs/$(Cluster).$(Process).err + arguments = $(subid) + # find all participants, based on the subdirectory names in the source dataset + # each relative path to such a subdirectory with become the value of `subid` + # and another job is queued. Will queue a total number of jobs matching the + # number of matching subdirectories + queue subid matching dirs sourcedata/sub-* + +All it takes to submit is a single ``condor_submit ``. + +Merging results +""""""""""""""" + +Once all jobs have finished, the results lie in individual branches of the output datasets. +In this concrete example, the subdatasets ``fmriprep`` and ``freesurfer`` will each have more than 1000 branches that hold individual job results. +The only thing left to do now is merging all of these branches into :term:`master` -- and potentially solve any merge conflicts that arise. +Usually, merging branches is done using the ``git merge`` command with a branch specification. +For example, in order to merge one job branch into the :term:`master` :term:`branch`, one would need to be on ``master`` and run ``git merge ``. +Given that the subdatasets each contain >1000 branches, and that each ``merge`` would lead to a commit, in order to not inflate the history of the dataset with hundreds of merge commits, two `Octopus merges `_ were done - one in each subdataset (``fmriprep`` and ``freesurfer``). + +.. 
findoutmore:: What is an octopus merge?
+
+    Usually a commit that arises from a merge has two *parent* commits: The *first parent* is the branch the merge is being performed from, in the example above, ``master``. The *second parent* is the branch that was merged into the first.
+
+    However, ``git merge`` is capable of merging more than two branches simultaneously if more than a single branch name is given to the command.
+    The resulting merge commit has as many parents as branches that were merged into it.
+    If a commit has more than two parents, it is affectionately called an "Octopus" merge.
+
+    Octopus merges require merge-conflict-free situations, and will not be carried out whenever manual resolution of conflicts is needed.
+
+The merge command can be assembled quickly.
+As all result branches were named ``job-``, a complete list of branches is obtained with the following command::
+
+    $ git branch -l | grep 'job-' | tr -d ' '
+
+This command line call translates to: "list all branches, show only those that contain ``job-``, and remove (``tr -d``) all whitespace".
+This can be given to ``git merge`` as in
+
+.. code-block:: bash
+
+   $ git merge -m "Merge results from job cluster XY" $(git branch -l | grep 'job-' | tr -d ' ')
+
+**Merging with merge conflicts**
+
+When a merge conflict arises during an octopus merge like the one above, the merge is aborted automatically. This is what it looks like::
+
+    $ git merge -m "Merge results from job cluster 107890" $(git branch -l | grep 'job-' | tr -d ' ')
+    Fast-forwarding to: job-107890.0
+    Trying simple merge with job-107890.1
+    Simple merge did not work, trying automatic merge.
+    ERROR: logs/CITATION.md: Not merging symbolic link changes.
+    fatal: merge program failed
+    Automated merge did not work.
+    Should not be doing an octopus.
+    Merge with strategy octopus failed.
+
+This merge conflict arose in the ``fmriprep`` subdataset and originated from the fact that each job generated a ``CITATION.md`` file with minimal individual changes.
+
+.. findoutmore:: How to fix this?
+
+    As the file ``CITATION.md`` does not contain meaningful changes between jobs, one of the files was kept (e.g., copied into a temporary location, or brought back to life afterwards with ``git cat-file``), and all ``CITATION.md`` files of all branches were deleted prior to the merge.
+    Here is a bash loop that would do exactly that::
+
+        $ for b in $(git branch -l | grep 'job-' | tr -d ' ');
+            do ( git checkout -b m$b $b && git rm logs/CITATION.md && git commit --amend --no-edit ) ;
+          done
+
+    Afterwards, the merge command succeeds.
+
+**Merging without merge conflicts**
+
+If no merge conflicts arise and the octopus merge is successful, all results are aggregated in the ``master`` branch.
+The commit log looks like a work of modern art when visualized with tools such as :term:`tig`:
+
+.. figure:: ../artwork/src/octopusmerge_tig.png
+
+
+Summary
+"""""""
+
+Once all jobs are computed in parallel and the resulting branches merged, the superdataset is populated with two subdatasets that hold the preprocessing results.
+Each result contains a machine-readable record of provenance on when, how, and by whom it was computed.
+From this point, the results in the subdatasets can be used for further analysis, while a record of how they were preprocessed is attached to them.
+
+
+.. rubric:: Footnotes
+
+.. [#f1] If the distinction between annexed and unannexed files is new to you, please read section :ref:`symlink`.
+
+.. [#f2] Note that this requires the ``datalad containers`` extension. Find an overview of all datalad extensions in :ref:`extensions_intro`.
+
+.. [#f3] The parentheses around the commands are called *command grouping* in bash, and yield a subshell environment: `www.gnu.org/software/bash/manual/html_node/Command-Grouping.html `_.
\ No newline at end of file
diff --git a/docs/beyond_basics/basics-hpc.rst b/docs/beyond_basics/basics-hpc.rst
index 377c32e7f..8b7208aa0 100644
--- a/docs/beyond_basics/basics-hpc.rst
+++ b/docs/beyond_basics/basics-hpc.rst
@@ -11,3 +11,4 @@ Computing on clusters

    101-169-cluster
    101-170-dataladrun
+   101-171-enki

From 3f212fab7ce7dfdb38d740792804895b5a12b623 Mon Sep 17 00:00:00 2001
From: Adina Wagner
Date: Tue, 20 Oct 2020 16:26:28 +0200
Subject: [PATCH 18/22] Tweaks and typos

---
 docs/beyond_basics/101-170-dataladrun.rst | 105 +++++++++++++---------
 docs/beyond_basics/101-171-enki.rst       |  48 ++++------
 2 files changed, 84 insertions(+), 69 deletions(-)

diff --git a/docs/beyond_basics/101-170-dataladrun.rst b/docs/beyond_basics/101-170-dataladrun.rst
index d806ddea7..8d8994fe9 100644
--- a/docs/beyond_basics/101-170-dataladrun.rst
+++ b/docs/beyond_basics/101-170-dataladrun.rst
@@ -3,22 +3,29 @@
 DataLad-centric analysis with job scheduling and parallel computing
 -------------------------------------------------------------------

+There are data analyses that consist of running a handful of scripts on a handful of files.
+Those analyses can be done in a couple of minutes or hours on your private computer.
+But there are also analyses that are so large -- either in terms of computations, or with regard to the amount of data that they are run on -- that it would take days or even weeks to complete them.
+The latter type of analyses typically requires a compute cluster, a job scheduler, and parallelization.
+The question is: How can they become as reproducible and provenance-tracked as the simple, singular analyses that were showcased in the handbook so far, and that comfortably fit on a private computer?
+
 .. note::

    It is advised to read the previous chapter :ref:`chapter_gobig` prior to this one

 This section is a write-up of how DataLad can be used on a scientific computational cluster with a job scheduler for reproducible and FAIR data analyses at scale.
 It showcases the general principles behind parallel processing of DataLad-centric workflows with containerized pipelines.
-This section lays the groundwork to the next section, a walkthrough through a more complex real life example of containerized `fMRIprep `_ preprocessing on the `eNKI `_ neuroimaging dataset, scheduled with `HTCondor `_.
 While this chapter demonstrates specific containerized pipelines and job schedulers, the general setup is generic and could be used with any containerized pipeline and any job scheduling system.
+This section lays the groundwork for the next section, a walk-through of a real-life example of containerized `fMRIprep `_ preprocessing on the `eNKI `_ neuroimaging dataset, scheduled with `HTCondor `_.
+

 Why job scheduling?
 ^^^^^^^^^^^^^^^^^^^

 On scientific compute clusters, job scheduling systems such as `HTCondor `_ or `slurm `_ are used to distribute computational jobs across the available computing infrastructure and manage the overall workload of the cluster.
 This allows for efficient and fair use of available resources across a group of users, and it brings the potential for highly parallelized computations of jobs and thus vastly faster analyses.
-Consider one common way to use a job scheduler: processing all subjects of a dataset independently and as parallel as the current workload of the compute cluster allows instead of serially (i.e., "one after the other"). +Consider one common way to use a job scheduler: processing all subjects of a dataset independently and as parallel as the current workload of the compute cluster allows -- instead of serially "one after the other". In such a setup, each subject-specific analysis becomes a single job, and the job scheduler fits as many jobs as it can on available :term:`compute node`\s. If a large analysis can be split into many independent jobs, using a job scheduler to run them in parallel thus yields great performance advantages in addition to fair compute resource distribution across all users. @@ -32,15 +39,16 @@ If a large analysis can be split into many independent jobs, using a job schedul The job scheduler takes the submitted jobs, *queues* them up in a central queue, and monitors the available compute resources (i.e., :term:`compute node`\s) of the cluster. As soon as a computational resource is free, it matches a job from the queue to the available resource and computes the job on this node. Usually, a single submission queues up multiple (dozens, hundreds, or thousands of) jobs. - If you are interested in a tutorial for HTCondor, checkout the `INM-7 HTcondor Tutorial `_. + If you are interested in a tutorial for HTCondor, checkout the `INM-7 HTCondor Tutorial `_. Where are the difficulties in parallel computing with DataLad? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In order to capture as much provenance as possible, analyses are best ran with a :command:`datalad run` or :command:`datalad containers-run` command, as these commands can capture and link all relevant components of an analysis, starting from code and results to input data and computational environment. -Note, though, that when parallelizing jobs and computing them with provenance capture, *each individual job* needs to be wrapped in a ``run`` command, not only the submission of the jobs to the job scheduler -- and this requires multiple parallel ``run`` commands on the same dataset. -Multiple simultaneous ``datalad (containers-)run`` invocations in the same dataset are, however, problematic: +But in order to compute parallel jobs with provenance capture, *each individual job* needs to be wrapped in a ``run`` command, not only the submission of the jobs to the job scheduler. +This requires multiple parallel ``run`` commands on the same dataset. +But: Multiple simultaneous ``datalad (containers-)run`` invocations in the same dataset are problematic. - Operations carried out during one :command:`run` command can lead to modifications that prevent a second, slightly later ``run`` command from being started - The :command:`datalad save` command at the end of :command:`datalad run` could save modifications that originate from a different job, leading to mis-associated provenance @@ -67,7 +75,7 @@ The "creative" bits involved in this parallelized processing workflow boil down - Individual jobs (for example subject-specific analyses) are computed in **throw-away dataset clones** to avoid unwanted interactions between parallel jobs. - Beyond computing in job-specific, temporary locations, individual job results are also saved into uniquely identified :term:`branch`\es to enable simple **pushing back of the results** into the target dataset. 
-- The jobs constitute a complete DataLad-centric workflow in the form of a simple bash script, including dataset build-up and tear-down routines in a throw-away location, result computation, and result publication back to the target dataset. +- The jobs constitute a complete DataLad-centric workflow in the form of a simple **bash script**, including dataset build-up and tear-down routines in a throw-away location, result computation, and result publication back to the target dataset. Thus, instead of submitting a ``datalad run`` command to the job scheduler, **the job submission is a single script**, and this submission is easily adapted to various job scheduling call formats. - Right after successful completion of all jobs, the target dataset contains as many :term:`branch`\es as jobs, with each branch containing the results of one job. A manual :term:`merge` aggregates all results into the :term:`master` branch of the dataset. @@ -81,7 +89,7 @@ The keys to the success of this workflow lie in Step-by-Step """""""""""" -To get an idea of the general setup of parallel provenance-tracked computations, consider a data analysis dataset... +To get an idea of the general setup of parallel provenance-tracked computations, consider a :ref:`YODA-compliant ` data analysis dataset... .. code-block:: bash @@ -108,13 +116,6 @@ To get an idea of the general setup of parallel provenance-tracked computations, ... and a dataset with a containerized pipeline (for example from the `ReproNim container-collection `_ [#f2]_) as another subdataset: -.. findoutmore:: Why do I add the pipeline as a subdataset? - - You could also add and configure the container using ``datalad containers-add`` to the top-most dataset. - This solution makes the container less usable, though. - If you have more than one application for a container, keeping it as a standalone dataset can guarantee easier reuse. - For an example on how to create such a dataset yourself, please checkout the Findoutmore in :ref:`pipelineenki` in the real-life walkthrough in the next section. - .. code-block:: $ datalad clone -d . https://github.com/ReproNim/containers.git @@ -128,25 +129,32 @@ To get an idea of the general setup of parallel provenance-tracked computations, install (ok: 1) save (ok: 1) +.. findoutmore:: Why do I add the pipeline as a subdataset? + + You could also add and configure the container using ``datalad containers-add`` to the top-most dataset. + This solution makes the container less usable, though. + If you have more than one application for a container, keeping it as a standalone dataset can guarantee easier reuse. + For an example on how to create such a dataset yourself, please checkout the Findoutmore in :ref:`pipelineenki` in the real-life walk-through in the next section. + + The analysis aims to process the ``rawdata`` with a pipeline from ``containers`` and collect the outcomes in the toplevel ``parallel_analysis`` dataset -- FAIRly and in parallel, using ``datalad containers-run``. One way to conceptualize the workflow is by taking the perspective of a single compute job. This job consists of whatever you may want to parallelize over. +For an arbitrary example, say your raw data contains continuous moisture measurements in the Arctic, taken over the course of 10 years. +Each file in your dataset contains the data of a single day. +You are interested in a daily aggregate, and are therefore parallelizing across files -- each compute job will run an analysis pipeline on one datafile. .. 
findoutmore:: What are common analysis types to parallelize over?

     The key to using a job scheduler and parallelization is to break down an analysis into smaller, loosely coupled computing tasks that can be distributed across a compute cluster.
-    Among common analysis setups that are suitable for parallelization are computations that can be split into several analysis that each run on one subset of the data -- such one or some out of many subjects, acquisitions, or files.
+    Among common analysis setups that are suitable for parallelization are computations that can be split into several analyses that each run on one subset of the data -- such as one (or some) out of many subjects, acquisitions, or files.
     The large computation "preprocess 200 subjects" can be split into 200 times the job "preprocess 1 subject", for example.
-    Commonly parallelized computations are also analyses that need to be ran with a range of different parameters, where each parameter configuration can constitute one job.
-    The latter type of parallelization is for example the case in simulation studies.
-
-Say your raw data contains continuous moisture measurements in the Arctic, taken over the course of 10 years.
-Each file in your dataset contains the data of a single day.
-You are interested in a daily aggregate, and are therefore parallelizing across files -- each compute job will run an analysis pipeline on one datafile.
+    In simulation studies, a commonly parallelized task concerns analyses that need to be run with a range of different parameters, where each parameter configuration can constitute one job.

 What you will submit as a job with a job scheduler is not a ``datalad containers-run`` call, but a shell script that contains all relevant data analysis steps.
-Using `shell `_ as the language for this script is a straight-forward choice as it allows you to script the DataLad workflow just as you would type it into your terminal, but other languages (e.g., using :ref:`DataLad's Python API ` or system calls in languages such as Matlab) would work as well.
+Using `shell `_ as the language for this script is a straight-forward choice as it allows you to script the DataLad workflow just as you would type it into your terminal.
+Other languages (e.g., using :ref:`DataLad's Python API ` or system calls in languages such as Matlab) would work as well, though.

 **Building the job**:

@@ -156,19 +164,20 @@ The solution is as easy as it is stubborn: We simply create one throw-away datas

 .. findoutmore:: how does one create throw-away clones?

     One way to do this are :term:`ephemeral clone`\s, an alternative is to make :term:`git-annex` disregard the datasets annex completely using ``git annex dead here``.
-    The latter is more appropriate for this context -- we could use an ephemeral clone, but that might deposit data of failed jobs at the origin location, if the job runs on a shared filesystem -- let's stay self-contained.
+    The latter is more appropriate for this context -- we could use an ephemeral clone, but that might deposit data of failed jobs at the origin location, if the job runs on a shared filesystem.

-Using throw-away clones involves a build-up, result-push, and tear-down routine for each job but this works well since datasets are by nature made for such decentralized, collaborative workflows.
-We treat cluster compute nodes like contributors to the analyses that clone the analysis dataset hierarchy into a temporary location, run the computation, push the results, and remove their temporary dataset again [#f3]_.
-All of this routine is done in a single script, which will be submitted as a job.
-Here, we build the general structure of this script.
+Using throw-away clones involves a build-up, result-push, and tear-down routine for each job.
+It sounds complex and tedious, but this actually works well since datasets are by nature made for such decentralized, collaborative workflows.
+We treat cluster compute nodes like contributors to the analyses: They clone the analysis dataset hierarchy into a temporary location, run the computation, push the results, and remove their temporary dataset again [#f3]_.
+The complete routine is done in a single script, which will be submitted as a job.
+Here, we build the general structure of this script, piece by piece.

-The compute job clones the dataset to a unique place, so that it can run a containers-run command inside it without interfering with any other job.
+The compute job clones the dataset to a unique place, so that it can run a ``containers-run`` command inside it without interfering with any other job.
 The first part of the script is therefore to navigate to a unique location, and clone the analysis dataset to it.

 .. findoutmore:: How can I get a unique location?

-    On common HTCondor setups, ``/tmp`` directories in individual jobs are job-specific local Filesystem not shared between jobs -- i.e., unique locations!
+    On common HTCondor setups, ``/tmp`` directories in individual jobs are a job-specific local filesystem that is not shared between jobs -- i.e., unique locations!
     An alternative is to create a unique temporary directory, e.g., with the ``mktemp -d`` command on Unix systems.

 .. code-block:: bash

@@ -181,7 +190,7 @@ The first part of the script is therefore to navigate to a unique location, and

 This dataset clone is *temporary*: It will exist over the course of one analysis/job only, but before it is being purged, all of the results it computed will be pushed to the original dataset.
 This requires a safe-guard: If the original dataset receives the results from the dataset clone, it knows about the clone and its state.
-In order to protect the results from accidental synchronization upon deletion of the linked dataset clone, the clone should be created as a "trow-away clone" right from the start.
+In order to protect the results from someone accidentally synchronizing (updating) the dataset from its linked clone after it has been deleted, the clone should be created as a "throw-away clone" right from the start.
 By running ``git annex dead here``, :term:`git-annex` disregards the clone, preventing the deletion of data in the clone to affect the original dataset.

 .. code-block:: bash

@@ -189,7 +198,7 @@ By running ``git annex dead here``, :term:`git-annex` disregards the clone, prev

    $ git annex dead here

 The ``datalad push`` to the original clone location of a dataset needs to be prepared carefully.
-The job computes one result of many and saves it, thus creating new data and a new entry with the run-record in the dataset history.
+The job computes *one* result (out of many results) and saves it, thus creating new data and a new entry with the run-record in the dataset history.
 But each job is unaware of the results and :term:`commit`\s produced by other branches.
 Should all jobs push back the results to the original place (the :term:`master` :term:`branch` of the original dataset), the individual jobs would conflict with each other or, worse, overwrite each other (if you don't have the default push configuration of Git).
@@ -205,11 +214,11 @@ This makes it easy to associate a result (via its branch) with the log, error, o $ git checkout -b "job-$JOBID" Importantly, the ``$JOB-ID`` isn't hardcoded into the script but it can be given to the script as an environment or input variable at the time of job submission. -The code snippet above uses a bash environment variable (``$JOBID``). +The code snippet above uses a bash :term:`environment variable` (``$JOBID``, as indicated by the all-upper-case variable name with a leading ``$``). It will be defined in the job submission -- this is shown and explained in detail in the respective paragraph below. -Next, its ``time for the containers-run`` command. -The invocation will depend on the container and dataset configuration (both of which are demonstrated in the real-life example in the next section), and below, we pretend that the pipeline invocation only needs an input file and an output file. +Next, its time for the :command:`containers-run` command. +The invocation will depend on the container and dataset configuration (both of which are demonstrated in the real-life example in the next section), and below, we pretend that the container invocation only needs an input file and an output file. These input file is specified via a bash variables (``$inputfile``) that will be defined in the script and provided at the time of job submission via command line argument from the job scheduler, and the output file name is based on the input file name. .. code-block:: bash @@ -227,7 +236,7 @@ After the ``containers-run`` execution in the script, the results can be pushed $ datalad push --to origin -Pending a few yet missing safe guards against concurrency issues and the definition job-specific (environment) variables, such a script can be submitted to any job scheduler with identifiers for input files, output files, and a job ID as identifiers for the branch names. +Pending a few yet missing safe guards against concurrency issues and the definition of job-specific (environment) variables, such a script can be submitted to any job scheduler with identifiers for input files, output files, and a job ID as identifiers for the branch names. This workflow sketch takes care of everything that needs to be done apart from combining all computed results afterwards. .. findoutmore:: Fine-tuning: Safe-guard concurrency issues @@ -242,7 +251,7 @@ This workflow sketch takes care of everything that needs to be done apart from c .. findoutmore:: Variable definition There are two ways to define variables that a script can use: - The first is by defining environment variables, and passing this environment to the compute job. + The first is by defining :term:`environment variable`\s, and passing this environment to the compute job. This can be done in the job submission file. To set and pass down the job-ID and a lock file in HTCondor, one can supply the following line in the job submission file:: @@ -294,11 +303,11 @@ Here's how the full general script looks like. Its a short script that encapsulates a complete workflow. Think of it as the sequence of necessary DataLad commands you would need to do in order to compute a job. -You can save this script into your analysis dataset, e.g., as ``code/analysis_job.sh`` +You can save this script into your analysis dataset, e.g., as ``code/analysis_job.sh``, and make it executable (such that it is executed automatically by the program specified in the :term:`shebang`)using ``chmod +x code/analysis_job.sh``. 
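+As a rough sketch only -- the clone location, the registered container name, the output naming, and the pipeline invocation below are placeholders that depend on your own setup, and the ``$JOBID`` and ``$DSLOCKFILE`` environment variables are assumed to be provided at submission time as described above -- such a ``code/analysis_job.sh`` could be assembled from the pieces above like this:
+
+.. code-block:: bash
+
+   #!/bin/bash
+   # fail whenever something is fishy, use -x to get verbose logfiles
+   set -e -u -x
+
+   # the input file to process is passed as the first command line argument
+   inputfile="$1"
+
+   # work in a unique, job-specific location
+   cd "$(mktemp -d)"
+
+   # clone the analysis dataset; flock prevents simultaneous clones or pushes
+   # from interfering with one another
+   flock --verbose "$DSLOCKFILE" datalad clone <path/or/url/of/analysis/dataset> ds
+   cd ds
+
+   # make git-annex disregard this clone - it is meant to be thrown away
+   git annex dead here
+
+   # compute on a job-specific branch named after the job ID
+   git checkout -b "job-$JOBID"
+
+   # run the containerized pipeline with provenance capture
+   datalad containers-run \
+     -m "Compute results for $inputfile" \
+     -i "$inputfile" \
+     -o "<output file name>" \
+     -n <registered container name> \
+     "<pipeline invocation>"
+
+   # push the results on the job branch back to the original dataset
+   flock --verbose "$DSLOCKFILE" datalad push --to origin
+
+   # the job scheduler is expected to clean up the temporary working directory
+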
**Job submission**: -Job submission now only boils down to invoking the script for each participant with the relevant command line arguments (e.g., input and output files for the our artificial example) and the necessary environment variables (e.g., the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``). +Job submission now only boils down to invoking the script for each participant with the relevant command line arguments (e.g., input files for our artificial example) and the necessary environment variables (e.g., the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``). Job scheduler such as HTCondor can typically do this with automatic variables. They for example have syntax that can identify subject IDs or consecutive file numbers from consistently named directory structure, access the job ID, loop through a predefined list of values or parameters, or use various forms of pattern matching. Examples of this are demonstrated `here `_. @@ -324,12 +333,28 @@ Here is a submit file that could be employed: log = $ENV(PWD)/../logs/$(Cluster).$(Process).log output = $ENV(PWD)/../logs/$(Cluster).$(Process).out error = $ENV(PWD)/../logs/$(Cluster).$(Process).err - arguments = $(subid) + arguments = $(inputfile) # find all input data, based on the file names in the source dataset. - # Each relative path to such a file name will become the value of `inputfile` + # The pattern matching below finds all *files* that match the path + # "rawdata/acquisition_*.txt". + # Each relative path to such a file name will become the value of `inputfile`, + # the argument given to the executable (the shell script). # This will queue as many jobs as file names match the pattern queue inputfile matching files rawdata/acquisition_*_.txt + How would the first few jobs look like that this submit file queues up? + It would send out the commands + + .. code-block:: bash + + ./code/analysis_job.sh rawdata/acquisition_day1year1_.txt + ./code/analysis_job.sh rawdata/acquisition_day2year1_.txt + [...] + + and each of them are send to a compute node with at least 1 CPU, 20GB of RAM and 210GB of disk space. + The log, output, and error files are saved under a HTCondor-specific Process and Cluster ID in a log file directory (which would need to be created for HTCondor!). + Two environment variables, ``JOBID`` (defined from HTCondor-specific Process and Cluster IDs) and ``DSLOCKFILE`` (for file locking), will be defined on the compute node. + All it takes to submit is a single ``condor_submit ``. diff --git a/docs/beyond_basics/101-171-enki.rst b/docs/beyond_basics/101-171-enki.rst index 471158019..b4e7cf70f 100644 --- a/docs/beyond_basics/101-171-enki.rst +++ b/docs/beyond_basics/101-171-enki.rst @@ -1,16 +1,25 @@ .. _hcpenki: Walkthrough: Parallel ENKI preprocessing with fMRIprep -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +------------------------------------------------------ The previous section has been an overview on parallel, provenance-tracked computations in DataLad datasets. While the general workflow entails a complete setup, its usually easier to understand it by seeing it applied to a concrete usecase. Its even more informative if that usecase includes some complexities that do not exist in the "picture-perfect" example but are likely to arise in real life. 
Therefore, the following walk-through in this section is a write-up of an existing and successfully executed analysis.
-Its goal was standard data preprocessing using `fMRIprep `_ on neuroimaging data of 1300 subjects in the `eNKI `_ dataset.
-In order to associate input data, containerized pipeline, and outputs, the analysis was carried out in a DataLad dataset and with the :command:`datalad containers-run` command.
-The pipeline dataset was created with a custom configuration to make it generalizable, and, due to the additional complexity of a large quantity of results, the output was collected in subdatasets.
+The analysis
+^^^^^^^^^^^^
+
+The analysis goal was standard data preprocessing using `fMRIprep `_ on neuroimaging data of 1300 subjects in the `eNKI `_ dataset.
+This computational task is ideal for parallelization: Each subject can be preprocessed individually, and fMRIprep is a containerized pipeline that can be pointed to a specific subject to preprocess.
+A single preprocessing run takes between 6 and 8 hours per subject -- roughly 1300 x 7 hours of serial computing, but only about 7 hours of computing time when all subjects are processed in parallel.
+
+eNKI was transformed into a DataLad dataset beforehand, and to set up the analysis, the fMRIprep container was placed -- with a custom configuration to make it generalizable -- into a new dataset called ``pipelines``.
+Both of these datasets, input data and ``pipelines`` dataset, became subdatasets of a data analysis superdataset.
+In order to associate input data, containerized pipeline, and outputs, the analysis was carried out in a toplevel analysis DataLad dataset and with the :command:`datalad containers-run` command.
+Finally, as an additional complexity, the large quantity of results made it necessary to collect the output in subdatasets.
+

 .. _pipelineenki:

@@ -36,7 +45,7 @@ Also note that there is a large collection of pre-existing container datasets av
      create(ok): /data/projects/enki/pipelines (dataset)
      $ cd pipelines

-   As one of tools used in the pipeline, `freesurfer `_, requires a license file, this license file needs to be added into the dataset.
+   As one of the tools used in fMRIprep's pipeline, `freesurfer `_, requires a license file, this license file needs to be added into the dataset.
   Only then can this dataset be moved around flexibly and also to different machines.
   In order to have the license file available right away, it is saved ``--to-git`` and not annexed [#f1]_::

@@ -47,7 +56,7 @@ Also note that there is a large collection of pre-existing container datasets av
   The important part is the configuration of the container -- it has to be done in a way that makes the container usable in any superdataset the pipeline dataset.

   Depending on how the container/pipeline needs to be called, the configuration differs.
-   In the case of an fMRIprep run, we want to be able to invoke the container from a superdataset.
+   In the case of an fMRIprep run, we want to be able to invoke the container from a data analysis superdataset.
   The superdataset contains input data and ``pipelines`` dataset as subdatasets, and will collect all of the results.
Thus, these are arguments we want to supply the invocation with (following `fMRIprep's documentation `_) during a ``containers-run`` command:: @@ -179,8 +188,8 @@ At this point, the workflow misses a tweak that is necessary in fMRIprep to enab If you want to make sure that your dataset is set up in a way that you have the ability to rerun a computation quickly, the following fMRIprep-specific consideration is important: If fMRIprep finds preexisting results, it will fail to run. - Therefore, all outputs of a job need to be removed before the jobs is started [#f3]_. - We can simply add an attempt to do this in the script:: + Therefore, all outputs of a job need to be removed before the job is started [#f3]_. + We can simply add an attempt to do this in the script (it wouldn't do any harm if there is nothing to be removed):: (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) (cd freesurfer && rm -rf fsaverage "$subid") @@ -324,28 +333,9 @@ Merging results Once all jobs have finished, the results lie in individual branches of the output datasets. In this concrete example, the subdatasets ``fmriprep`` and ``freesurfer`` will each have more than 1000 branches that hold individual job results. The only thing left to do now is merging all of these branches into :term:`master` -- and potentially solve any merge conflicts that arise. -Usually, merging branches is done using the ``git merge`` command with a branch specification. -For example, in order to merge one job branch into the :term:`master` :term:`branch`, one would need to be on ``master`` and run ``git merge ``. -Given that the subdatasets each contain >1000 branches, and that each ``merge`` would lead to a commit, in order to not inflate the history of the dataset with hundreds of merge commits, two `Octopus merges `_ were done - one in each subdataset (``fmriprep`` and ``freesurfer``). - -.. findoutmore:: What is an octopus merge? - - Usually a commit that arises from a merge has two *parent* commits: The *first parent* is the branch the merge is being performed from, in the example above, ``master``. The *second parent* is the branch that was merged into the first. - - - However, ``git merge`` is capable of merging more than two branches simultaneously if more than a single branch name is given to the command. - The resulting merge commit has as many parent as were involved in the merge. - If a commit has more than two parents, if is affectionately called an "Octopus" merge. - - Octopus merges require merge-conflict-free situations, and will not be carried out whenever manual resolution of conflicts is needed. - -The merge command can be assembled quickly. -As all result branches were named ``job-``, a complete list of branches is obtained with the following command:: - - $ git branch -l | grep 'job-' | tr -d ' ' +As explained in the previous section, the necessary merging was done with `Octopus merges `_ -- one in each subdataset (``fmriprep`` and ``freesurfer``). -This command line call translates to: "list all branches, of all branches, show me those that contain ``job-``, and remove (``tr -d``) all whitespace. -This can be given to ``git merge`` as in +The merge command was assembled with the trick introduced in the previous section, based on job-ID-named branches: .. 
code-block:: bash From 4474e83c9fc15d8f3942b518ec38e65e74dd95cc Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Thu, 22 Oct 2020 17:42:24 +0200 Subject: [PATCH 19/22] checklist for fmriprep on datalad datasets --- docs/beyond_basics/101-172-checklist.rst | 241 +++++++++++++++++++++++ docs/beyond_basics/basics-hpc.rst | 1 + 2 files changed, 242 insertions(+) create mode 100644 docs/beyond_basics/101-172-checklist.rst diff --git a/docs/beyond_basics/101-172-checklist.rst b/docs/beyond_basics/101-172-checklist.rst new file mode 100644 index 000000000..7494167da --- /dev/null +++ b/docs/beyond_basics/101-172-checklist.rst @@ -0,0 +1,241 @@ +.. _inm7checklist: + +Checklist for the impatient: Preprocess a DataLad dataset with fMRIprep +----------------------------------------------------------------------- + +Let's say you have a BIDS-structured DataLad dataset with input data that you want to preprocess with `fMRIprep `_ using HTCondor, but you can't be bothered to read and understand more than a few pages of user documentation. +Here is a step-by-step bullet point instruction that may get you to where you want to be, but doesn't enforce any learning upon you. +It will only work if the data you want to preprocess is already a DataLad dataset and in a BIDS-compliant structure. + +.. admonition:: Placeholders + + Throughout the checklist, the following placeholders need to be replaced with whatever applies to your project: + + - ``projectfolder``: This is your 1TB project folder under ``/data/project/`` on juseless + - ``processed``: This is an arbitrary name that you call the folder to hold preprocessing results + - ``BIDS``: This is your BIDS-compliant input data in a DataLad dataset + +1. Create an analysis dataset +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Go to your project folder on juseless:: + + $ cd /data/project/ + +Create a new dataset, using the YODA-procedure. +This dataset should not be a subdataset of anything at this point. + +.. code-block:: bash + + $ datalad create -c yoda + +It will contain the outputs of fMRIprep (fMRIprep will write its output into two folders it creates, ``fmriprep`` and ``freesurfer``), and scripts related to HTCondor job creation and submission. +Input data and fMRIprep container will be subdatasets. + +Finally, create a new directory ``logs`` outside of the analysis dataset -- this is where HTCondor's log files will be stored. + +.. code-block:: bash + + $ mkdir ../logs + +2. Install your BIDS compliant input dataset as a subdataset +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Go into your newly created dataset:: + + $ cd + +Install your BIDS-compliant input dataset as a subdataset. +We call the subdataset ``sourcedata``. +If you decide to go for a different name you will need to exchange the word "sourcedata" in all other scripts with whatever else you decided to call the dataset. + +.. code-block:: bash + + $ datalad clone -d . path/to/ sourcedata + +3. Install an fMRIprep container dataset as a subdataset +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There is a preconfigured container dataset with fMRIprep available on juseless. +You should install it as a subdataset. + +.. code-block:: bash + + $ datalad clone -d . TODO code/pipelines + +You can find out how to create such a container dataset and its configuration in paragraph :ref:`pipelineenki` of the previous section. + +4. Build a workflow script +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Due to concurrency issues, parallel execution can't happen in the same dataset. 
+Therefore, you need to create a workflow script that handles individual job execution in a temporary location on the compute node and push its results back to your dataset. +This workflow is defined in a workflow script, and this workflow script is defined below. + +Depending on how many files you (roughly) expect to produce, pick either "For small datasets (200k files or less)" or "For large datasets (200k files or more)" below. +It is important to estimate the amount of result files (not including files in fMRIpreps working directory) and pick the correct section -- too many files can make datasets slow or dysfunctional, and the workflow file needs to be adjusted to overcome this. + +A conservative estimate for the amount of files a fMRIprep invocation produces is between 500 and 700 files. +As this amount is dependent on the data structure (the types of acquisitions and amount of files), you could run fMRIprep on a single subject of your dataset, check the amount of produced files, and extrapolate beforehand. +If you don't want to do this, here are a few benchmarks: + +- freesurfer generally produces ~350 files +- eNKI processing (previous section) results in about 500 files per subject +- preprocessing of ``HCP_structural_preprocessed`` data results in about 400 files per subject +- UKBiobank preprocessing leads to about 450 files per subject + +For small datasets (200k files or less) +""""""""""""""""""""""""""""""""""""""" + +If you expect fewer than 200k output files, take the workflow script below, replace the placeholders with the required information, and save it as ``fmriprep_participant_job`` into the ``code/`` directory. + +.. code-block:: bash + + #!/bin/bash + set -e -u -x + + subid=$(basename $1) + + cd /tmp + flock --verbose $DSLOCKFILE datalad clone /data/project// ds + + cd ds + datalad get -n -r -R1 . + git annex dead here + + git checkout -b "job-$JOBID" + + mkdir -p .git/tmp/wdir + find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete + + # add your required fMRIprep parametrization + datalad containers-run \ + -m "fMRIprep $subid" \ + --explicit \ + -o freesurfer -o fmriprep \ + -i "$1" \ + -n code/pipelines/fmriprep \ + sourcedata . participant \ + --n_cpus 1 \ + --skip-bids-validation \ + -w .git/tmp/wdir \ + --participant-label "$subid" \ + --random-seed 12345 \ + --skull-strip-fixed-seed \ + --md-only-boilerplate \ + --output-spaces MNI152NLin6Asym \ + --use-aroma \ + --cifti-output + # selectively push outputs only + # ignore root dataset, despite recorded changes, needs coordinated + # merge at receiving end + flock --verbose $DSLOCKFILE datalad push --to origin + +Save the addition of this workflow file:: + + $ datalad save -m "added fmriprep preprocessing workflow" code/fmriprep_participant_job + +For large datasets +"""""""""""""""""" + +If you expect more than 200k result files, first create two subdatasets:: + + $ datalad create -d . fmriprep + $ datalad create -d . freesurfer + +If you run ``datalad subdatasets`` afterwards in the root of your dataset you should see four subdatasets listed. +Then, take the workflow script below, replace the placeholders with the required information, and save it as ``fmriprep_participant_job`` into the ``code/`` directory. + +.. code-block:: bash + + #!/bin/bash + set -e -u -x + + subid=$(basename $1) + + cd /tmp + flock --verbose $DSLOCKFILE datalad clone /data/project// ds + + cd ds + datalad get -n -r -R1 . 
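+    # make git-annex disregard this clone and all subdataset clones -
+    # they are meant to be thrown away after the results have been pushed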
+ git submodule foreach --recursive git annex dead here + + git -C fmriprep checkout -b "job-$JOBID" + git -C freesurfer checkout -b "job-$JOBID" + + mkdir -p .git/tmp/wdir + find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete + + (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) + (cd freesurfer && rm -rf fsaverage "$subid") + + # add your required fMRIprep parametrization + datalad containers-run \ + -m "fMRIprep $subid" \ + --explicit \ + -o freesurfer -o fmriprep \ + -i "$1" \ + -n code/pipelines/fmriprep \ + sourcedata . participant \ + --n_cpus 1 \ + --skip-bids-validation \ + -w .git/tmp/wdir \ + --participant-label "$subid" \ + --random-seed 12345 \ + --skull-strip-fixed-seed \ + --md-only-boilerplate \ + --output-spaces MNI152NLin6Asym \ + --use-aroma \ + --cifti-output + + flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin + flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin + +Save the addition of this workflow file:: + + $ datalad save -m "added fmriprep preprocessing workflow" code/fmriprep_participant_job + +5. Build a HTCondor submit file +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To be able to submit the jobs, create a file called ``code/fmriprep_all_participants.submit`` with the following contents: + +.. code-block:: bash + + + universe = vanilla + get_env = True + # resource requirements for each job, determined by + # investigating the demands of a single test job + request_cpus = 1 + request_memory = 20G + request_disk = 210G + + executable = $ENV(PWD)/code/fmriprep_participant_job + + # the job expects to environment variables for labeling and synchronization + environment = "JOBID=$(Cluster).$(Process) DSLOCKFILE=$ENV(PWD)/.git/datalad_lock" + log = $ENV(PWD)/../logs/$(Cluster).$(Process).log + output = $ENV(PWD)/../logs/$(Cluster).$(Process).out + error = $ENV(PWD)/../logs/$(Cluster).$(Process).err + arguments = $(subid) + # find all participants, based on the subdirectory names in the source dataset + # each relative path to such a subdirectory with become the value of `subid` + # and another job is queued. Will queue a total number of jobs matching the + # number of matching subdirectories + queue subid matching dirs sourcedata/sub-* + +Save the addition of this submit file:: + + $ datalad save -m "added fmriprep preprocessing workflow" code/fmriprep_all_participants.submit + +6. Submit the job +^^^^^^^^^^^^^^^^^ + +In the root of your dataset, run + +.. code-block:: bash + + condor_submit code/fmriprep_all_participants.submit + +7. \ No newline at end of file diff --git a/docs/beyond_basics/basics-hpc.rst b/docs/beyond_basics/basics-hpc.rst index 8b7208aa0..b9e6f06ef 100644 --- a/docs/beyond_basics/basics-hpc.rst +++ b/docs/beyond_basics/basics-hpc.rst @@ -12,3 +12,4 @@ Computing on clusters 101-169-cluster 101-170-dataladrun 101-171-enki + 101-172-checklist From 116533ee0a8bf72fcf37020751672aea685b682a Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Mon, 2 Nov 2020 11:08:25 +0100 Subject: [PATCH 20/22] checklist: Condense fmriprep preprocessing to checklist --- docs/beyond_basics/101-172-checklist.rst | 279 ++++++++++++++--------- docs/beyond_basics/basics-hpc.rst | 1 + 2 files changed, 177 insertions(+), 103 deletions(-) diff --git a/docs/beyond_basics/101-172-checklist.rst b/docs/beyond_basics/101-172-checklist.rst index 7494167da..468864437 100644 --- a/docs/beyond_basics/101-172-checklist.rst +++ b/docs/beyond_basics/101-172-checklist.rst @@ -1,11 +1,39 @@ -.. 
_inm7checklist:
-
-Checklist for the impatient: Preprocess a DataLad dataset with fMRIprep
------------------------------------------------------------------------
+.. _inm7checklistfmriprep_:
+
+Checklists for the impatient: Preprocess a DataLad dataset with fMRIprep
+-------------------------------------------------------------------------
+
+Let's say you have a BIDS-structured DataLad dataset with input data that you want to preprocess with `fMRIprep `_ using HTCondor.
+Here is a step-by-step bullet point instruction that contains all required steps -- in the case of a fully-standard-no-special-cases analysis setup, with all necessary preparations, e.g., input dataset creation and BIDS validation, being done already.
+It may get you to where you want to be, but it is doomed to fail when your analysis does not align completely with the setup that this example works with, and it will in all likelihood not enable you to understand and solve the task you are facing yourself if need be.
+
+.. admonition:: Requirements and implicit assumptions
+
+    The following must be true about your data analysis. Else, adjustments are necessary:
+
+    - The data that you want to preprocess (i.e., your ``sourcedata``) is a DataLad dataset.
+      If not, read the first three chapters of the Basics and the section :ref:`dataladdening`, and turn it into a dataset.
+    - You want to preprocess the data with `fMRIprep `_.
+    - ``sourcedata`` is BIDS-compliant, at least BIDS-compliant enough that fMRIprep is able to run with no fatal errors.
+      If not, go to the `BIDS starterkit `_ to read about it, contact the INM-7 data management people, and provide information about what you need -- they can help you get started.
+    - The scripts assume that your project is in a `project folder `_ (e.g., ``/data/project/fancyproject/fmripreppreprocessinganalysis``, where ``fancyproject`` is your `project folder `_ and ``fmripreppreprocessinganalysis`` is the data analysis dataset that you will create).
+      If your data analysis is somewhere else (e.g., some subdirectories down in the project folder), you need to adjust the absolute paths that point to it in the workflow script.
+
+
+.. findoutmore:: How much do I need to learn in order to understand everything that is going on?
+
+    A lot.
+    That is not to say that it is an inhumane and needlessly complicated effort.
+    We're scientists, trying to do a complicated task not only somehow, but also well.
+    The goal is that an analysis of yours can be discovered in a decade by someone who does not know you and has no means of reaching you, ever, but that this person is able to understand and hopefully even recompute what you have done in a matter of minutes, from information that your analysis provides on its own.
+    `While this is the way science should function, it is not yet something that is commonly accomplished `_.
+    DataLad can help with this complex task.
+    But it comes at the expense of learning to use the tool.
+    If you want to learn, there are enough resources.
+    Read the :ref:`basics-intro` of the handbook, understand as much as you can, and ask about things you don't understand.
+
+To adjust the commands in the checklist to your own data analysis endeavour, please replace any placeholder (enclosed in ``<`` and ``>``) with your own information.
+
-Let's say you have a BIDS-structured DataLad dataset with input data that you want to preprocess with `fMRIprep `_ using HTCondor, but you can't be bothered to read and understand more than a few pages of user documentation.
-Here is a step-by-step bullet point instruction that may get you to where you want to be, but doesn't enforce any learning upon you. -It will only work if the data you want to preprocess is already a DataLad dataset and in a BIDS-compliant structure. .. admonition:: Placeholders @@ -15,6 +43,7 @@ It will only work if the data you want to preprocess is already a DataLad datase - ``processed``: This is an arbitrary name that you call the folder to hold preprocessing results - ``BIDS``: This is your BIDS-compliant input data in a DataLad dataset + 1. Create an analysis dataset ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -36,7 +65,7 @@ Finally, create a new directory ``logs`` outside of the analysis dataset -- this .. code-block:: bash - $ mkdir ../logs + $ mkdir logs 2. Install your BIDS compliant input dataset as a subdataset ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -84,112 +113,111 @@ If you don't want to do this, here are a few benchmarks: - preprocessing of ``HCP_structural_preprocessed`` data results in about 400 files per subject - UKBiobank preprocessing leads to about 450 files per subject -For small datasets (200k files or less) -""""""""""""""""""""""""""""""""""""""" +.. findoutmore:: For small datasets (200k files or less) -If you expect fewer than 200k output files, take the workflow script below, replace the placeholders with the required information, and save it as ``fmriprep_participant_job`` into the ``code/`` directory. + If you expect fewer than 200k output files, take the workflow script below, replace the placeholders with the required information, and save it as ``fmriprep_participant_job`` into the ``code/`` directory. -.. code-block:: bash + .. code-block:: bash - #!/bin/bash - set -e -u -x - - subid=$(basename $1) - - cd /tmp - flock --verbose $DSLOCKFILE datalad clone /data/project// ds - - cd ds - datalad get -n -r -R1 . - git annex dead here - - git checkout -b "job-$JOBID" - - mkdir -p .git/tmp/wdir - find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete - - # add your required fMRIprep parametrization - datalad containers-run \ - -m "fMRIprep $subid" \ - --explicit \ - -o freesurfer -o fmriprep \ - -i "$1" \ - -n code/pipelines/fmriprep \ - sourcedata . participant \ - --n_cpus 1 \ - --skip-bids-validation \ - -w .git/tmp/wdir \ - --participant-label "$subid" \ - --random-seed 12345 \ - --skull-strip-fixed-seed \ - --md-only-boilerplate \ - --output-spaces MNI152NLin6Asym \ - --use-aroma \ - --cifti-output - # selectively push outputs only - # ignore root dataset, despite recorded changes, needs coordinated - # merge at receiving end - flock --verbose $DSLOCKFILE datalad push --to origin + #!/bin/bash + set -e -u -x -Save the addition of this workflow file:: + subid=$(basename $1) - $ datalad save -m "added fmriprep preprocessing workflow" code/fmriprep_participant_job + cd /tmp + flock --verbose $DSLOCKFILE datalad clone /data/project// ds -For large datasets -"""""""""""""""""" + cd ds + datalad get -n -r -R1 . + git annex dead here -If you expect more than 200k result files, first create two subdatasets:: + git checkout -b "job-$JOBID" - $ datalad create -d . fmriprep - $ datalad create -d . freesurfer + mkdir -p .git/tmp/wdir + find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete -If you run ``datalad subdatasets`` afterwards in the root of your dataset you should see four subdatasets listed. 
-Then, take the workflow script below, replace the placeholders with the required information, and save it as ``fmriprep_participant_job`` into the ``code/`` directory. + # add your required fMRIprep parametrization + datalad containers-run \ + -m "fMRIprep $subid" \ + --explicit \ + -o freesurfer -o fmriprep \ + -i "$1" \ + -n code/pipelines/fmriprep \ + sourcedata . participant \ + --n_cpus 1 \ + --skip-bids-validation \ + -w .git/tmp/wdir \ + --participant-label "$subid" \ + --random-seed 12345 \ + --skull-strip-fixed-seed \ + --md-only-boilerplate \ + --output-spaces MNI152NLin6Asym \ + --use-aroma \ + --cifti-output + # selectively push outputs only + # ignore root dataset, despite recorded changes, needs coordinated + # merge at receiving end + flock --verbose $DSLOCKFILE datalad push --to origin -.. code-block:: bash - #!/bin/bash - set -e -u -x - - subid=$(basename $1) - - cd /tmp - flock --verbose $DSLOCKFILE datalad clone /data/project// ds - - cd ds - datalad get -n -r -R1 . - git submodule foreach --recursive git annex dead here - - git -C fmriprep checkout -b "job-$JOBID" - git -C freesurfer checkout -b "job-$JOBID" - - mkdir -p .git/tmp/wdir - find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete - - (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) - (cd freesurfer && rm -rf fsaverage "$subid") - - # add your required fMRIprep parametrization - datalad containers-run \ - -m "fMRIprep $subid" \ - --explicit \ - -o freesurfer -o fmriprep \ - -i "$1" \ - -n code/pipelines/fmriprep \ - sourcedata . participant \ - --n_cpus 1 \ - --skip-bids-validation \ - -w .git/tmp/wdir \ - --participant-label "$subid" \ - --random-seed 12345 \ - --skull-strip-fixed-seed \ - --md-only-boilerplate \ - --output-spaces MNI152NLin6Asym \ - --use-aroma \ - --cifti-output - - flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin - flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin +.. findoutmore:: For large datasets + + If you expect more than 200k result files, first create two subdatasets:: + + $ datalad create -d . fmriprep + $ datalad create -d . freesurfer + + If you run ``datalad subdatasets`` afterwards in the root of your dataset you should see four subdatasets listed. + Then, take the workflow script below, replace the placeholders with the required information, and save it as ``fmriprep_participant_job`` into the ``code/`` directory. + + .. code-block:: bash + + #!/bin/bash + set -e -u -x + + subid=$(basename $1) + + cd /tmp + flock --verbose $DSLOCKFILE datalad clone /data/project// ds + + cd ds + datalad get -n -r -R1 . + git submodule foreach --recursive git annex dead here + + git -C fmriprep checkout -b "job-$JOBID" + git -C freesurfer checkout -b "job-$JOBID" + + mkdir -p .git/tmp/wdir + find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete + + (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) + (cd freesurfer && rm -rf fsaverage "$subid") + + # add your required fMRIprep parametrization + datalad containers-run \ + -m "fMRIprep $subid" \ + --explicit \ + -o freesurfer -o fmriprep \ + -i "$1" \ + -n code/pipelines/fmriprep \ + sourcedata . 
participant \ + --n_cpus 1 \ + --skip-bids-validation \ + -w .git/tmp/wdir \ + --participant-label "$subid" \ + --random-seed 12345 \ + --skull-strip-fixed-seed \ + --md-only-boilerplate \ + --output-spaces MNI152NLin6Asym \ + --use-aroma \ + --cifti-output + + flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin + flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin + +Then, make the script executable:: + + $ chmod +x code/fmriprep_participant_job Save the addition of this workflow file:: @@ -238,4 +266,49 @@ In the root of your dataset, run condor_submit code/fmriprep_all_participants.submit -7. \ No newline at end of file +7. Monitor the job +^^^^^^^^^^^^^^^^^^ + +Use `standard HTCondor commands `_ to monitor your job, and check on it if it is ``held``. + +.. findoutmore:: What kind of content can I expect in which file? + + - ``*.log`` files: You will find no DataLad-related output in this file, only information from HTCondor + - ``*.out`` files: You will find messages such as successful datalad operation result summaries (``get(ok)``, ``install(ok)``, ...) and workflow output from fmriprep. Here is an example:: + + install(ok): /tmp/ds (dataset) + flock: getting lock took 3.562222 seconds + flock: executing datalad + update(ok):../../ /tmp/ds/code/pipelines (dataset) + configure-sibling(ok):../../ /tmp/ds/code/pipelines (sibling) + install(ok): /tmp/ds/code/pipelines (dataset) + update(ok):../ /tmp/ds/sourcedata (dataset) + configure-sibling(ok):../ /tmp/ds/sourcedata (sibling) + install(ok): /tmp/ds/sourcedata (dataset) + action summary: + configure-sibling (ok: 2) + install (ok: 2) + update (ok: 2) + dead here ok + (recording state in git...) + get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/anat/sub-A00010893_ses-DS2_T1w.nii.gz (file) [from inm7-storage...] + get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/dwi/sub-A00010893_ses-DS2_dwi.bval (file) [from inm7-storage...] + get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/dwi/sub-A00010893_ses-DS2_dwi.bvec (file) [from inm7-storage...] + get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/dwi/sub-A00010893_ses-DS2_dwi.nii.gz (file) [from inm7-storage...] + get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-breathhold_acq-1400_bold.nii.gz (file) [from inm7-storage...] + get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-checkerboard_acq-1400_bold.nii.gz (file) [from inm7-storage...] + get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-checkerboard_acq-645_bold.nii.gz (file) [from inm7-storage...] + get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-rest_acq-1400_bold.nii.gz (file) [from inm7-storage...] + get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-rest_acq-645_bold.nii.gz (file) [from inm7-storage...] + get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-rest_acq-cap_bold.nii.gz (file) [from inm7-storage...] + get(ok): /tmp/ds/sourcedata/sub-A00010893 (directory) + get(ok): /tmp/ds/code/pipelines/.datalad/environments/fmriprep/image (file) [from origin-2...] + 201023-12:36:57,535 nipype.workflow IMPORTANT: + + Running fMRIPREP version 20.1.1: + * BIDS dataset path: /tmp/ds/sourcedata. + * Participant list: ['A00010893']. + * Run identifier: 20201023-123648_216eb011-9b7f-4f2b-8d43-482bf4795041. + * Output spaces: MNI152NLin6Asym:res-native. + * Pre-run FreeSurfer's SUBJECTS_DIR: /tmp/ds/freesurfer. 
+ 201023-12:37:33,593 nipype.workflow INFO: diff --git a/docs/beyond_basics/basics-hpc.rst b/docs/beyond_basics/basics-hpc.rst index b9e6f06ef..775e1f74b 100644 --- a/docs/beyond_basics/basics-hpc.rst +++ b/docs/beyond_basics/basics-hpc.rst @@ -13,3 +13,4 @@ Computing on clusters 101-170-dataladrun 101-171-enki 101-172-checklist + 101-173-matlab From c0bdf42c6e335337a356b9e8fc5b1c3a28f0b80d Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Tue, 3 Nov 2020 08:18:08 +0100 Subject: [PATCH 21/22] BF: Fix a trailing slash in the script --- docs/beyond_basics/101-171-enki.rst | 2 +- docs/beyond_basics/101-172-checklist.rst | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/beyond_basics/101-171-enki.rst b/docs/beyond_basics/101-171-enki.rst index b4e7cf70f..30b313fc2 100644 --- a/docs/beyond_basics/101-171-enki.rst +++ b/docs/beyond_basics/101-171-enki.rst @@ -249,7 +249,7 @@ At this point, the workflow misses a tweak that is necessary in fMRIprep to enab # pybids (inside fmriprep) gets angry when it sees dangling symlinks # of .json files -- wipe them out, spare only those that belong to # the participant we want to process in this job - find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete + find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"'*' -delete # next one is important to get job-reruns correct. We remove all anticipated # output, such that fmriprep isn't confused by the presence of stale diff --git a/docs/beyond_basics/101-172-checklist.rst b/docs/beyond_basics/101-172-checklist.rst index 468864437..dc171e0bb 100644 --- a/docs/beyond_basics/101-172-checklist.rst +++ b/docs/beyond_basics/101-172-checklist.rst @@ -134,7 +134,7 @@ If you don't want to do this, here are a few benchmarks: git checkout -b "job-$JOBID" mkdir -p .git/tmp/wdir - find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete + find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"'*' -delete # add your required fMRIprep parametrization datalad containers-run \ @@ -188,7 +188,7 @@ If you don't want to do this, here are a few benchmarks: git -C freesurfer checkout -b "job-$JOBID" mkdir -p .git/tmp/wdir - find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete + find sourcedata -mindepth 2 -name '*.json' -a ! 
-wholename "$1"'*' -delete (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) (cd freesurfer && rm -rf fsaverage "$subid") From 2e53f45bb237d5dbb52dc13a56d6458622c1fe5c Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Tue, 3 Nov 2020 10:31:52 +0100 Subject: [PATCH 22/22] Finish a first draft for a ENKI checklist --- docs/beyond_basics/101-172-checklist.rst | 175 ++++++++++++++++++++--- docs/beyond_basics/101-173-matlab.rst | 9 ++ 2 files changed, 167 insertions(+), 17 deletions(-) create mode 100644 docs/beyond_basics/101-173-matlab.rst diff --git a/docs/beyond_basics/101-172-checklist.rst b/docs/beyond_basics/101-172-checklist.rst index dc171e0bb..ff052261f 100644 --- a/docs/beyond_basics/101-172-checklist.rst +++ b/docs/beyond_basics/101-172-checklist.rst @@ -42,7 +42,7 @@ To adjust the commands in the checklist to your own data analysis endeavour, ple - ``projectfolder``: This is your 1TB project folder under ``/data/project/`` on juseless - ``processed``: This is an arbitrary name that you call the folder to hold preprocessing results - ``BIDS``: This is your BIDS-compliant input data in a DataLad dataset - + - ``cluster``: This is the cluster ID HTCondor assigns to your jobs (you will see it once your jobs are submitted) 1. Create an analysis dataset ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -67,8 +67,8 @@ Finally, create a new directory ``logs`` outside of the analysis dataset -- this $ mkdir logs -2. Install your BIDS compliant input dataset as a subdataset -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Install your BIDS compliant input dataset as a subdataset +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Go into your newly created dataset:: @@ -82,8 +82,8 @@ If you decide to go for a different name you will need to exchange the word "sou $ datalad clone -d . path/to/ sourcedata -3. Install an fMRIprep container dataset as a subdataset -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Install an fMRIprep container dataset as a subdataset +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ There is a preconfigured container dataset with fMRIprep available on juseless. You should install it as a subdataset. @@ -94,8 +94,8 @@ You should install it as a subdataset. You can find out how to create such a container dataset and its configuration in paragraph :ref:`pipelineenki` of the previous section. -4. Build a workflow script -^^^^^^^^^^^^^^^^^^^^^^^^^^ +Build a workflow script +^^^^^^^^^^^^^^^^^^^^^^^ Due to concurrency issues, parallel execution can't happen in the same dataset. Therefore, you need to create a workflow script that handles individual job execution in a temporary location on the compute node and push its results back to your dataset. @@ -223,8 +223,8 @@ Save the addition of this workflow file:: $ datalad save -m "added fmriprep preprocessing workflow" code/fmriprep_participant_job -5. Build a HTCondor submit file -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Build a HTCondor submit file +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To be able to submit the jobs, create a file called ``code/fmriprep_all_participants.submit`` with the following contents: @@ -257,8 +257,8 @@ Save the addition of this submit file:: $ datalad save -m "added fmriprep preprocessing workflow" code/fmriprep_all_participants.submit -6. Submit the job -^^^^^^^^^^^^^^^^^ +Submit the job +^^^^^^^^^^^^^^ In the root of your dataset, run @@ -266,15 +266,30 @@ In the root of your dataset, run condor_submit code/fmriprep_all_participants.submit -7. 
Monitor the job
-^^^^^^^^^^^^^^^^^^
+Monitor the job
+^^^^^^^^^^^^^^^
+
+Use `standard HTCondor commands `_ to monitor your job.
+Your jobs should be listed as either "idle" (waiting to be run) or "run"::
+
+
+    -- Schedd: head1.htc.inm7.de : <10.0.8.10:9618?... @ 11/03/20 10:07:19
+   OWNER  BATCH_NAME      SUBMITTED   DONE   RUN   IDLE  TOTAL  JOB_IDS
+   adina  ID: 323991     11/3  08:16     _    151    303    454  323991.0
+
-Use `standard HTCondor commands `_ to monitor your job, and check on it if it is ``held``.
+If they are being ``held``, you should check on them (see the `INM-7 docs `_ for info and commands).
+
-.. findoutmore:: What kind of content can I expect in which file?
+HTCondor will also write log files into your project directory in ``/data/project//logs``.
+You should examine the contents of those files to monitor jobs and troubleshoot problems.
+The Findoutmores below detail what type of content can be expected in each file.
+
- - ``*.log`` files: You will find no DataLad-related output in this file, only information from HTCondor
- - ``*.out`` files: You will find messages such as successful datalad operation result summaries (``get(ok)``, ``install(ok)``, ...) and workflow output from fmriprep. Here is an example::
+.. findoutmore:: What kind of content can I expect in log files?
+
+   ``*.log`` files will contain no DataLad-related output, only information from HTCondor.
+
+.. findoutmore:: What kind of content can I expect in out files?
+
+   ``*.out`` files contain messages such as successful datalad operation result summaries (``get(ok)``, ``install(ok)``, ...) and workflow output from fmriprep. Here is an example::

      install(ok): /tmp/ds (dataset)
      flock: getting lock took 3.562222 seconds
      flock: executing datalad
      update(ok):../../ /tmp/ds/code/pipelines (dataset)
      configure-sibling(ok):../../ /tmp/ds/code/pipelines (sibling)
      install(ok): /tmp/ds/code/pipelines (dataset)
      update(ok):../ /tmp/ds/sourcedata (dataset)
      configure-sibling(ok):../ /tmp/ds/sourcedata (sibling)
      install(ok): /tmp/ds/sourcedata (dataset)
      action summary:
        configure-sibling (ok: 2)
        install (ok: 2)
        update (ok: 2)
      dead here ok
      (recording state in git...)
      get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/anat/sub-A00010893_ses-DS2_T1w.nii.gz (file) [from inm7-storage...]
      get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/dwi/sub-A00010893_ses-DS2_dwi.bval (file) [from inm7-storage...]
      get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/dwi/sub-A00010893_ses-DS2_dwi.bvec (file) [from inm7-storage...]
      get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/dwi/sub-A00010893_ses-DS2_dwi.nii.gz (file) [from inm7-storage...]
      get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-breathhold_acq-1400_bold.nii.gz (file) [from inm7-storage...]
      get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-checkerboard_acq-1400_bold.nii.gz (file) [from inm7-storage...]
      get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-checkerboard_acq-645_bold.nii.gz (file) [from inm7-storage...]
      get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-rest_acq-1400_bold.nii.gz (file) [from inm7-storage...]
      get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-rest_acq-645_bold.nii.gz (file) [from inm7-storage...]
      get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-rest_acq-cap_bold.nii.gz (file) [from inm7-storage...]
      get(ok): /tmp/ds/sourcedata/sub-A00010893 (directory)
      get(ok): /tmp/ds/code/pipelines/.datalad/environments/fmriprep/image (file) [from origin-2...]
      201023-12:36:57,535 nipype.workflow IMPORTANT:

           Running fMRIPREP version 20.1.1:
           * BIDS dataset path: /tmp/ds/sourcedata.
           * Participant list: ['A00010893'].
           * Run identifier: 20201023-123648_216eb011-9b7f-4f2b-8d43-482bf4795041.
           * Output spaces: MNI152NLin6Asym:res-native.
           * Pre-run FreeSurfer's SUBJECTS_DIR: /tmp/ds/freesurfer.
      201023-12:37:33,593 nipype.workflow INFO:
+      [...]
+
+.. findoutmore:: What kind of content can I expect in err files?
+
+   ``*.err`` files will contain any message that is sent to the `"stderr" output stream `_.
+   With the setup detailed in this checklist, there are three different things that could end up in those files:
+
+   - fMRIprep tracebacks. Those are actual, troublesome errors that require action.
+   - log messages from DataLad. In most cases, those messages are fine and do not require action.
+   - log messages from the script. In most cases, those messages are fine and do not require action.
+
+   fMRIprep will send Python tracebacks into this file.
+   If this happens, the pipeline has crashed, and you should investigate the error.
+   Here is an example::
+
+      You are using fMRIPrep-20.1.1, and a newer version of fMRIPrep is available: 20.2.0.
+      Please check out our documentation about how and when to upgrade:
+      https://fmriprep.readthedocs.io/en/latest/faq.html#upgrading
+      Process Process-2:
+      Traceback (most recent call last):
+        File "/usr/local/miniconda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
+          self.run()
+        File "/usr/local/miniconda/lib/python3.7/multiprocessing/process.py", line 99, in run
+          self._target(*self._args, **self._kwargs)
+        File "/usr/local/miniconda/lib/python3.7/site-packages/fmriprep/cli/workflow.py", line 84, in build_workflow
+          retval["workflow"] = init_fmriprep_wf()
+        File "/usr/local/miniconda/lib/python3.7/site-packages/fmriprep/workflows/base.py", line 64, in init_fmriprep_wf
+          single_subject_wf = init_single_subject_wf(subject_id)
+        File "/usr/local/miniconda/lib/python3.7/site-packages/fmriprep/workflows/base.py", line 292, in init_single_subject_wf
+          func_preproc_wf = init_func_preproc_wf(bold_file)
+        File "/usr/local/miniconda/lib/python3.7/site-packages/fmriprep/workflows/bold/base.py", line 261, in init_func_preproc_wf
+          tr=metadata.get("RepetitionTime")),
+        File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 611, in __init__
+          from_file=from_file, resource_monitor=resource_monitor, **inputs
+        File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 183, in __init__
+          self.inputs = self.input_spec(**inputs)
+        File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/specs.py", line 66, in __init__
+          super(BaseTraitedSpec, self).__init__(**kwargs)
+        File "/usr/local/miniconda/lib/python3.7/site-packages/traits/trait_handlers.py", line 172, in error
+          value )
+      traits.trait_errors.TraitError: The 'tr' trait of a FunctionalSummaryInputSpec instance must be a float, but a value of None was specified.
+
+   DataLad will send all of its logging messages, i.e., messages that start with ``[INFO]``, ``[WARNING]``, or ``[ERROR]``, into this file.
+   Unless it is an error message, the presence of DataLad log messages in the ``*.err`` files is not worrisome, but only a byproduct of how Unix systems handle input and output communication.
+   In most cases, you will see ``[INFO]`` messages that state the progress of the task at hand.
+   Note that there is also one ``ConnectionOpenFailedError`` included as an INFO message -- while this looks like trouble, it is only informing you that the first of several clone targets has not worked out::
+
+      [INFO] Cloning dataset to Dataset(/tmp/ds)
+      [INFO] Attempting to clone from /data/project/enki/processed to /tmp/ds
+      [INFO] Completed clone attempts for Dataset(/tmp/ds)
+
+      + cd ds
+
+      + datalad get -n -r -R1 .
+      [INFO] Installing Dataset(/tmp/ds) to get /tmp/ds recursively
+      [INFO] Cloning dataset to Dataset(/tmp/ds/code/pipelines)
+      [INFO] Attempting to clone from /data/project/enki/processed/code/pipelines to /tmp/ds/code/pipelines
+      [INFO] Completed clone attempts for Dataset(/tmp/ds/code/pipelines)
+      [INFO] Cloning dataset to Dataset(/tmp/ds/fmriprep)
+      [INFO] Attempting to clone from /data/project/enki/processed/fmriprep to /tmp/ds/fmriprep
+      [INFO] Completed clone attempts for Dataset(/tmp/ds/fmriprep)
+      [INFO] Cloning dataset to Dataset(/tmp/ds/freesurfer)
+      [INFO] Attempting to clone from /data/project/enki/processed/freesurfer to /tmp/ds/freesurfer
+      [INFO] Completed clone attempts for Dataset(/tmp/ds/freesurfer)
+      [INFO] Cloning dataset to Dataset(/tmp/ds/sourcedata)
+      [INFO] Attempting to clone from /data/project/enki/processed/sourcedata to /tmp/ds/sourcedata
+      [INFO] Start check out things
+      [INFO] Completed clone attempts for Dataset(/tmp/ds/sourcedata)
+      [INFO] hanke4@judac.fz-juelich.de: Permission denied (publickey).
+      [INFO] ConnectionOpenFailedError: 'ssh -fN -o ControlMaster=auto -o ControlPersist=15m -o ControlPath=/home/mih/.cache/datalad/sockets/64c612f8 judac.fz-juelich.de' failed with exitcode 255 [Failed to open SSH connection (could not start ControlMaster process)]
+
+      + git submodule foreach --recursive git annex dead here
+
+      + git -C fmriprep checkout -b job-107890.1168
+      Switched to a new branch 'job-107890.1168'
+
+      + git -C freesurfer checkout -b job-107890.1168
+      Switched to a new branch 'job-107890.1168'
+
+      + mkdir -p .git/tmp/wdir
+
+      + find sourcedata -mindepth 2 -name '*.json' -a '!' -wholename 'sourcedata/sub-A00081239/*' -delete
+
+      + cd fmriprep
+
+      + rm -rf logs sub-A00081239 sub-A00081239.html dataset_description.json desc-aparcaseg_dseg.tsv desc-aseg_dseg.tsv
+
+      + cd freesurfer
+
+      + rm -rf fsaverage sub-A00081239
+
+      + datalad containers-run -m 'fMRIprep sub-A00081239' --explicit -o freesurfer -o fmriprep -i sourcedata/sub-A00081239/ -n code/pipelines/fmriprep sourcedata . participant --n_cpus 1 --skip-bids-validation -w .git/tmp/wdir --participant-label sub-A00081239 --random-seed 12345 --skull-strip-fixed-seed --md-only-boilerplate --output-spaces MNI152NLin6Asym --use-aroma --cifti-output
+      [INFO] Making sure inputs are available (this may take some time)
+      [INFO] == Command start (output follows) =====
+      [INFO] == Command exit (modification check follows) =====
+
+      + flock --verbose /data/project/enki/processed/.git/datalad_lock datalad push -d fmriprep --to origin
+      [INFO] Determine push target
+      [INFO] Push refspecs
+      [INFO] Start enumerating objects
+      [INFO] Start counting objects
+      [INFO] Start compressing objects
+      [INFO] Start writing objects
+      [INFO] Start resolving deltas
+      [INFO] Finished
+      [INFO] Transfer data
+      [INFO] Start annex operation
+      [INFO] sub-A00081239.html
+      [INFO] sub-A00081239/anat/sub-A00081239_desc-aparcaseg_dseg.nii.gz
+      [...]
+
+   Note that the ``fmriprep_participant_job`` script's own log messages are also included in this file.
+   Those are the lines that start with a ``+`` and simply show which line of the workflow script is currently being executed.
+
+
+Merge the results
+^^^^^^^^^^^^^^^^^
+
+fMRIprep writes out a ``CITATION.md`` file in each job.
+These files contain a general summary, such as the number of sessions that have been processed.
+If those differ between subjects, a straight :term:`merge` will fail.
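+One way to find out up front whether the merge will be clean is to compare the ``CITATION.md`` blobs across all job branches.
+The snippet below is only a sketch; it makes the same assumption as the cleanup loop further down, namely that each job branch carries the file at ``logs/CITATION.md``.
+If it prints a single hash, the files are identical everywhere and the merge should go through without conflicts::
+
+   # print the unique blob hashes of logs/CITATION.md across all job branches;
+   # a single line of output means the file is identical on every branch
+   for b in $(git branch -l | grep 'job-' | tr -d ' '); do
+       git rev-parse "$b:logs/CITATION.md"
+   done | sort -u
+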
+You can safely try it out first, though (the command would abort if it can't perform the operation):: + + git merge -m "Merge results from job cluster " $(git branch -l | grep 'job-' | tr -d ' ') + +If this fails, copy the contents of one ``CITATION.md`` file into the :term:`master` branch:: + + TODO - catfile command + +Afterwards, delete the ``CITATION.md`` files in all branches with the following command:: + + for b in $(git branch -l | grep 'job-' | tr -d ' '); + do ( git checkout -b m$b $b && git rm logs/CITATION.md && git commit --amend --no-edit ) ; + done + +Lastly, repeat the merge command from above:: + + git merge -m "Merge results from job cluster " $(git branch -l | grep 'job-' | tr -d ' ') diff --git a/docs/beyond_basics/101-173-matlab.rst b/docs/beyond_basics/101-173-matlab.rst new file mode 100644 index 000000000..f403bd6cb --- /dev/null +++ b/docs/beyond_basics/101-173-matlab.rst @@ -0,0 +1,9 @@ +.. _inm7checklistmatlab_: + +Checklists for the impatient: Process a DataLad dataset with MatLab +------------------------------------------------------------------- + +.. todo:: + + Find someone who would like to play a matlab analysis through with me. + Maybe Susanne or Nevena.