checklist: Condense fmriprep preprocessing to checklist
adswa committed Nov 2, 2020
1 parent b5954b0 commit 2cfa9b4
Showing 2 changed files with 177 additions and 103 deletions.
279 changes: 176 additions & 103 deletions docs/beyond_basics/101-172-checklist.rst
@@ -1,11 +1,39 @@
.. _inm7checklistfmriprep:

Checklists for the impatient: Preprocess a DataLad dataset with fMRIprep
------------------------------------------------------------------------

Let's say you have a BIDS-structured DataLad dataset with input data that you want to preprocess with `fMRIprep <https://fmriprep.readthedocs.io/>`_ using HTCondor.
Here is a step-by-step bullet-point instruction that contains all required steps -- for the case of a fully-standard-no-special-cases analysis setup, with all necessary preparations, e.g., input dataset creation and BIDS validation, already done.
It may get you to where you want to be, but it is doomed to fail when your analysis does not align completely with the setup that this example works with, and it will in all likelihood not enable you to understand and solve the task you are facing yourself if need be.

.. admonition:: Requirements and implicit assumptions

   The following must be true about your data analysis. Otherwise, adjustments are necessary:

   - The data that you want to preprocess (i.e., your ``sourcedata``) is a DataLad dataset.
     If not, read the first three chapters of the Basics and the section :ref:`dataladdening`, and turn it into a dataset.
   - You want to preprocess the data with `fMRIprep <https://fmriprep.readthedocs.io/>`_.
   - ``sourcedata`` is BIDS-compliant, at least BIDS-compliant enough that fMRIprep is able to run with no fatal errors.
     If not, go to the `BIDS starterkit <https://github.com/bids-standard/bids-starter-kit>`_ to read about it, contact the INM-7 data management people, and provide information about what you need -- they can help you get started.
   - The scripts assume that your project is in a `project folder <https://docs.inm7.de/cluster/data/>`_ (e.g., ``/data/project/fancyproject/fmripreppreprocessinganalysis``, where ``fancyproject`` is your `project folder <https://docs.inm7.de/cluster/data/>`_ and ``fmripreppreprocessinganalysis`` is the data analysis dataset that you will create).
     If your data analysis is somewhere else (e.g., some subdirectories down in the project folder), you need to adjust the absolute paths that point to it in the workflow script.


.. findoutmore:: How much do I need to learn in order to understand everything that is going on?

   A lot.
   That is not to say that it is an inhumane and needlessly complicated effort.
   We're scientists, trying to do a complicated task not only somehow, but well.
   The goal is that an analysis of yours can be discovered in a decade by someone who does not know you and has no means of ever reaching you, and that this person is nevertheless able to understand and hopefully even recompute what you have done in a matter of minutes, from information that your analysis provides on its own.
   `While this is the way science should function, it is not yet something that is commonly accomplished <https://www.nature.com/articles/d41586-020-02462-7>`_.
   DataLad can help with this complex task, but that help comes at the expense of learning to use the tool.
   If you want to learn, there are enough resources:
   read the :ref:`basics-intro` of the handbook, understand as much as you can, and ask about things you don't understand.

To adjust the commands in the checklist to your own data analysis endeavour, please replace any placeholder (enclosed in ``<`` and ``>``) with your own information.

.. admonition:: Placeholders

@@ -15,6 +43,7 @@

   - ``processed``: This is an arbitrary name for the folder that will hold the preprocessing results
   - ``BIDS``: This is your BIDS-compliant input data in a DataLad dataset


1. Create an analysis dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
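
The actual commands for this step are collapsed in this diff view.
As a rough orientation, here is a minimal sketch of what this step typically involves -- inferred from the paths the workflow scripts below rely on (``code/pipelines/fmriprep``); the container source is an assumption for illustration, not part of this commit:

.. code-block:: bash

   # create the analysis/output dataset (named after your <processed> placeholder)
   $ datalad create <processed>
   $ cd <processed>
   # register a container collection that provides fMRIprep; the job scripts
   # below reference it as code/pipelines/fmriprep
   $ datalad clone -d . https://github.com/ReproNim/containers.git code/pipelines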

@@ -36,7 +65,7 @@ Finally, create a new directory ``logs`` outside of the analysis dataset -- this

.. code-block:: bash

   $ mkdir logs
2. Install your BIDS compliant input dataset as a subdataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
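
The commands for this step are likewise collapsed in this view.
A minimal sketch, assuming your input data lives in your project folder -- the source path is a placeholder, but the subdataset name ``sourcedata`` is what the workflow scripts below expect:

.. code-block:: bash

   # run in the root of <processed>: registers the input as the subdataset "sourcedata"
   $ datalad clone -d . /data/project/<projectfolder>/<BIDS> sourcedata
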
@@ -84,112 +113,111 @@ If you don't want to do this, here are a few benchmarks:
- preprocessing of ``HCP_structural_preprocessed`` data results in about 400 files per subject
- UKBiobank preprocessing leads to about 450 files per subject
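
As a quick back-of-the-envelope check (the subject count below is made up for illustration):

.. code-block:: bash

   # 500 subjects at ~400 output files each (HCP-like) already reaches the threshold
   $ echo $((500 * 400))
   200000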

.. findoutmore:: For small datasets (200k files or less)

   If you expect fewer than 200k output files, take the workflow script below, replace the placeholders with the required information, and save it as ``fmriprep_participant_job`` into the ``code/`` directory.

   .. code-block:: bash

      #!/bin/bash
      set -e -u -x

      subid=$(basename $1)

      cd /tmp
      flock --verbose $DSLOCKFILE datalad clone /data/project/<projectfolder>/<processed> ds
      cd ds
      datalad get -n -r -R1 .
      git annex dead here
      git checkout -b "job-$JOBID"
      mkdir -p .git/tmp/wdir
      find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete

      # add your required fMRIprep parametrization
      datalad containers-run \
         -m "fMRIprep $subid" \
         --explicit \
         -o freesurfer -o fmriprep \
         -i "$1" \
         -n code/pipelines/fmriprep \
         sourcedata . participant \
         --n_cpus 1 \
         --skip-bids-validation \
         -w .git/tmp/wdir \
         --participant-label "$subid" \
         --random-seed 12345 \
         --skull-strip-fixed-seed \
         --md-only-boilerplate \
         --output-spaces MNI152NLin6Asym \
         --use-aroma \
         --cifti-output

      # selectively push outputs only
      # ignore root dataset, despite recorded changes, needs coordinated
      # merge at receiving end
      flock --verbose $DSLOCKFILE datalad push --to origin

.. findoutmore:: For large datasets

   If you expect more than 200k result files, first create two subdatasets::

      $ datalad create -d . fmriprep
      $ datalad create -d . freesurfer

   If you run ``datalad subdatasets`` afterwards in the root of your dataset you should see four subdatasets listed.
   Then, take the workflow script below, replace the placeholders with the required information, and save it as ``fmriprep_participant_job`` into the ``code/`` directory.

   .. code-block:: bash

      #!/bin/bash
      set -e -u -x

      subid=$(basename $1)

      cd /tmp
      flock --verbose $DSLOCKFILE datalad clone /data/project/<projectfolder>/<processed> ds
      cd ds
      datalad get -n -r -R1 .
      git submodule foreach --recursive git annex dead here
      git -C fmriprep checkout -b "job-$JOBID"
      git -C freesurfer checkout -b "job-$JOBID"
      mkdir -p .git/tmp/wdir
      find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"/'*' -delete
      (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv)
      (cd freesurfer && rm -rf fsaverage "$subid")

      # add your required fMRIprep parametrization
      datalad containers-run \
         -m "fMRIprep $subid" \
         --explicit \
         -o freesurfer -o fmriprep \
         -i "$1" \
         -n code/pipelines/fmriprep \
         sourcedata . participant \
         --n_cpus 1 \
         --skip-bids-validation \
         -w .git/tmp/wdir \
         --participant-label "$subid" \
         --random-seed 12345 \
         --skull-strip-fixed-seed \
         --md-only-boilerplate \
         --output-spaces MNI152NLin6Asym \
         --use-aroma \
         --cifti-output

      flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin
      flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin

Then, make the script executable::

   $ chmod +x code/fmriprep_participant_job

Save the addition of this workflow file::

   $ datalad save -m "added fmriprep preprocessing workflow" code/fmriprep_participant_job

@@ -238,4 +266,49 @@ In the root of your dataset, run

   condor_submit code/fmriprep_all_participants.submit
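
The submit file itself is collapsed in this diff.
Here is a sketch of what ``code/fmriprep_all_participants.submit`` commonly looks like in this kind of workflow -- all values below are assumptions for illustration; note how it provides the ``JOBID`` and ``DSLOCKFILE`` environment variables that the job script relies on, and writes into the ``logs`` directory created earlier:

.. code-block:: bash

   universe       = vanilla
   getenv         = True
   # the per-participant job script saved above
   executable     = $ENV(PWD)/code/fmriprep_participant_job
   request_cpus   = 1
   request_memory = 20G
   # variables the job script expects; the lock file serializes clone/push operations
   environment    = "JOBID=$(Cluster).$(Process) DSLOCKFILE=$ENV(PWD)/.condor_datalad_lock"
   log            = ../logs/$(Cluster).$(Process).log
   output         = ../logs/$(Cluster).$(Process).out
   error          = ../logs/$(Cluster).$(Process).err
   arguments      = $(subid)
   # one job per subject directory in the input dataset
   queue subid matching dirs sourcedata/sub-*
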
7. Monitor the job
^^^^^^^^^^^^^^^^^^

Use `standard HTCondor commands <https://docs.inm7.de/htcondor/commands/>`_ to monitor your job, and check on it if it goes into the ``held`` state.
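
For instance (standard HTCondor commands, nothing specific to this workflow):

.. code-block:: bash

   # list your queued and running jobs
   $ condor_q
   # show held jobs together with the reason they were held
   $ condor_q -hold
   # release a held job again after fixing the problem
   $ condor_release <cluster-id>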

.. findoutmore:: What kind of content can I expect in which file?

   - ``*.log`` files: You will find no DataLad-related output in these files, only information from HTCondor.
   - ``*.out`` files: You will find messages such as successful DataLad operation result summaries (``get(ok)``, ``install(ok)``, ...) and workflow output from fMRIprep. Here is an example::

       install(ok): /tmp/ds (dataset)
       flock: getting lock took 3.562222 seconds
       flock: executing datalad
       update(ok):../../ /tmp/ds/code/pipelines (dataset)
       configure-sibling(ok):../../ /tmp/ds/code/pipelines (sibling)
       install(ok): /tmp/ds/code/pipelines (dataset)
       update(ok):../ /tmp/ds/sourcedata (dataset)
       configure-sibling(ok):../ /tmp/ds/sourcedata (sibling)
       install(ok): /tmp/ds/sourcedata (dataset)
       action summary:
         configure-sibling (ok: 2)
         install (ok: 2)
         update (ok: 2)
       dead here ok
       (recording state in git...)
       get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/anat/sub-A00010893_ses-DS2_T1w.nii.gz (file) [from inm7-storage...]
       get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/dwi/sub-A00010893_ses-DS2_dwi.bval (file) [from inm7-storage...]
       get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/dwi/sub-A00010893_ses-DS2_dwi.bvec (file) [from inm7-storage...]
       get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/dwi/sub-A00010893_ses-DS2_dwi.nii.gz (file) [from inm7-storage...]
       get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-breathhold_acq-1400_bold.nii.gz (file) [from inm7-storage...]
       get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-checkerboard_acq-1400_bold.nii.gz (file) [from inm7-storage...]
       get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-checkerboard_acq-645_bold.nii.gz (file) [from inm7-storage...]
       get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-rest_acq-1400_bold.nii.gz (file) [from inm7-storage...]
       get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-rest_acq-645_bold.nii.gz (file) [from inm7-storage...]
       get(ok): /tmp/ds/sourcedata/sub-A00010893/ses-DS2/func/sub-A00010893_ses-DS2_task-rest_acq-cap_bold.nii.gz (file) [from inm7-storage...]
       get(ok): /tmp/ds/sourcedata/sub-A00010893 (directory)
       get(ok): /tmp/ds/code/pipelines/.datalad/environments/fmriprep/image (file) [from origin-2...]
       201023-12:36:57,535 nipype.workflow IMPORTANT:

           Running fMRIPREP version 20.1.1:
           * BIDS dataset path: /tmp/ds/sourcedata.
           * Participant list: ['A00010893'].
           * Run identifier: 20201023-123648_216eb011-9b7f-4f2b-8d43-482bf4795041.
           * Output spaces: MNI152NLin6Asym:res-native.
           * Pre-run FreeSurfer's SUBJECTS_DIR: /tmp/ds/freesurfer.
       201023-12:37:33,593 nipype.workflow INFO:
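
One step that this excerpt of the checklist does not show: each job pushes its results to a ``job-$JOBID`` branch (this is what the ``git checkout -b "job-$JOBID"`` lines in the workflow scripts prepare), so after all jobs have finished, these branches still need to be merged -- the "coordinated merge at receiving end" that the script comments mention.
A sketch of how such a merge is typically done, assuming all job branches match ``job-*``:

.. code-block:: bash

   # in the root of the output dataset: octopus-merge all job branches
   $ git merge -m "Merge results from job cluster" $(git branch -l | grep 'job-' | tr -d ' ')
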
1 change: 1 addition & 0 deletions docs/beyond_basics/basics-hpc.rst
@@ -13,3 +13,4 @@ Computing on clusters
   101-170-dataladrun
   101-171-enki
   101-172-checklist
   101-173-matlab
