Bulletpoint Checklists for large-scale preprocessing #601

Open
wants to merge 22 commits into
base: inm7
4 changes: 3 additions & 1 deletion docs/basics/101-135-help.rst
@@ -316,6 +316,8 @@ this means that the sibling contains changes that your local dataset does not ye
know about. It can be fixed by updating from the sibling first with a
:command:`datalad update --merge`.
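
The fix described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name ``sync_then_push`` and the ``DATALAD`` variable are made up for this example, and ``-s`` is used to name the sibling to update from.

```shell
# Hypothetical helper: update from a sibling, then retry the push.
# DATALAD defaults to the real executable; set DATALAD=echo to only
# print the commands that would be run instead of executing them.
DATALAD="${DATALAD:-datalad}"

sync_then_push() {
    sibling="$1"
    # Merge the sibling's changes into the local dataset first,
    # and push only if that update succeeded.
    "$DATALAD" update --merge -s "$sibling" &&
        "$DATALAD" push --to "$sibling"
}
```

With the sibling from the example above, this would be called as ``sync_then_push roommate``. Chaining with ``&&`` means a failed or conflicting merge stops the helper before another push is attempted.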

.. _nonbarepush:

Here is a different push rejection::

$ datalad push --to roommate
@@ -329,7 +331,7 @@ As you can see, the :term:`git-annex branch` was pushed successfully, but updati
the ``master`` branch was rejected: ``[remote rejected] (branch is currently checked out) [publish(/home/me/dl-101/DataLad-101)]``.
In this particular case, this is because it was an attempt to push from ``DataLad-101``
to the ``roommate`` sibling that was created in chapter :ref:`chapter_collaboration`.
This is a special case of pushing, because it - in technical terms - is a push
This is a special case of pushing, because it -- in technical terms -- is a push
to a non-bare repository. Unlike :term:`bare Git repositories`, non-bare
repositories cannot always be pushed to. To fix this, you either want to
`checkout another branch <https://git-scm.com/docs/git-checkout>`_
1 change: 1 addition & 0 deletions docs/beyond_basics/101-160-gobig.rst
@@ -22,6 +22,7 @@ and points to benchmarks, rules of thumb, and general solutions.
Upcoming sections demonstrate how one can attempt
large-scale analyses with DataLad, and how to fix things up when dataset sizes
got out of hand.
Finally, the upcoming chapter :ref:`chapter_hpc` extends this chapter with advice and examples from large-scale analyses on computational clusters.

Why scaling up Git repos can become difficult
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
25 changes: 25 additions & 0 deletions docs/beyond_basics/101-169-cluster.rst
@@ -0,0 +1,25 @@
.. _hpc:

DataLad on High Throughput or High Performance Compute Clusters
---------------------------------------------------------------

For efficient computing of large analyses, to comply with best computing practices, or to fulfil the requirements that `responsible system administrators <https://xkcd.com/705/>`_ impose, users may turn to computational clusters such as :term:`high-performance computing (HPC)` or :term:`high-throughput computing (HTC)` infrastructure for data analysis, back-up, or storage.

This chapter is a collection of useful resources and examples that aims to help you get started with DataLad-centric workflows on clusters.
We hope to grow this chapter further, so please `get in touch <https://github.com/datalad-handbook/book/issues/new/>`_ if you want to share your use case or seek more advice.

Pointers to content in other chapters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To find out more about centralized storage solutions, you may want to check out the use case :ref:`usecase_datastore` or the section :ref:`riastore`.

DataLad installation on a cluster
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Users of a compute cluster generally do not have administrative privileges (sudo rights) and thus cannot install software as easily as on their own private machine.
In order to get DataLad and its underlying tools installed, you can either `bribe (kindly ask) your system administrator <https://hsto.org/getpro/habr/post_images/02e/e3b/369/02ee3b369a0326760a160004aca631dc.jpg>`_ [#f1]_ or install everything for your own user only, following the instructions in the paragraph :ref:`norootinstall` of the :ref:`installation page <install>`.


.. rubric:: Footnotes

.. [#f1] You may not need to bribe your system administrator if you are kind to them. Consider frequent gestures of appreciation, or send a geeky T-Shirt for `SysAdminDay <https://en.wikipedia.org/wiki/System_Administrator_Appreciation_Day>`_ (the last Friday in July) -- Sysadmins do amazing work!
405 changes: 405 additions & 0 deletions docs/beyond_basics/101-170-dataladrun.rst

Large diffs are not rendered by default.

393 changes: 393 additions & 0 deletions docs/beyond_basics/101-171-enki.rst

Large diffs are not rendered by default.

455 changes: 455 additions & 0 deletions docs/beyond_basics/101-172-checklist.rst

Large diffs are not rendered by default.

9 changes: 9 additions & 0 deletions docs/beyond_basics/101-173-matlab.rst
@@ -0,0 +1,9 @@
.. _inm7checklistmatlab_:

Checklists for the impatient: Process a DataLad dataset with MATLAB
-------------------------------------------------------------------

.. todo::

   Find someone who would like to walk through a MATLAB analysis with me.
   Maybe Susanne or Nevena.
16 changes: 16 additions & 0 deletions docs/beyond_basics/basics-hpc.rst
@@ -0,0 +1,16 @@
.. _chapter_hpc:

Computing on clusters
---------------------

.. figure:: ../artwork/src/cluster.svg

.. toctree::
   :maxdepth: 1
   :caption: Strategies for high performance computing with DataLad

   101-169-cluster
   101-170-dataladrun
   101-171-enki
   101-172-checklist
   101-173-matlab
1 change: 1 addition & 0 deletions docs/beyond_basics/intro.rst
@@ -22,6 +22,7 @@ associated usecases.
basics-scaling
basics-retrospective
basics-specialpurpose
basics-hpc

.. figure:: /artwork/src/hero.svg
:width: 70%
16 changes: 14 additions & 2 deletions docs/glossary.rst
@@ -74,6 +74,8 @@ Glossary
Container images are *built* from :term:`container recipe` files.
They are a static filesystem inside a file, populated with the software specified in the recipe, and some initial configuration.

compute node
A compute node is an individual computer, part of a :term:`high-performance computing (HPC)` or :term:`high-throughput computing (HTC)` cluster.

DataLad dataset
A DataLad dataset is a Git repository that may or may not have a data annex that is used to
@@ -128,7 +130,10 @@ Glossary
You can find out a bit more on environment variables :ref:`in this footnote <envvars>`.

ephemeral clone
TODO
Dataset clones that share the annex with the dataset they were cloned from, without :term:`git-annex` being aware of it.
On a technical level, this is achieved via symlinks.
They can be created with the ``--reckless ephemeral`` option of :command:`datalad clone`.


force-push
Git concept; Enforcing a :command:`git push` command with the ``--force``
@@ -185,6 +190,13 @@ Glossary
You can read more about Pattern Matching in
`Bash's Docs <https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Pattern-Matching>`_.

high-performance computing (HPC)
Aggregating computing power from a network of computers in a way that delivers higher performance than a typical desktop computer in order to solve computing tasks that require high computing power or demand a lot of disk space or memory.


high-throughput computing (HTC)
A computing environment built from a network of computers and tuned to deliver large amounts of computational power to allow parallel processing of independent computational jobs. For more information, see `this Wikipedia entry <https://en.wikipedia.org/wiki/High-throughput_computing>`_.

http
Hypertext Transfer Protocol; A protocol for file transfer over a network.

@@ -437,4 +449,4 @@ Glossary
The Windows Subsystem for Linux, a compatibility layer for running Linux distributions on recent versions of Windows. Find out more `here <https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux>`__.

zsh
A Unix shell.
A Unix shell.
1 change: 1 addition & 0 deletions docs/intro/installation.rst
@@ -240,6 +240,7 @@ Subsequently, DataLad can be installed via ``pip``.
Alternatively, DataLad can be installed together with :term:`Git` and
:term:`git-annex` via ``conda`` as outlined in the section below.

.. _norootinstall:

Linux-machines with no root access (e.g. HPC systems)
"""""""""""""""""""""""""""""""""""""""""""""""""""""