Add guide on downloading files in parallel #415

Draft
1 change: 1 addition & 0 deletions doc/conf.py
@@ -41,6 +41,7 @@
"python": ("https://docs.python.org/3/", None),
"pandas": ("http://pandas.pydata.org/pandas-docs/stable/", None),
"requests": ("https://requests.readthedocs.io/en/latest/", None),
"filelock": ("https://py-filelock.readthedocs.io/en/latest/", None),
}

# Autosummary pages will be generated by sphinx-autogen instead of sphinx-build
1 change: 1 addition & 0 deletions doc/index.rst
@@ -134,6 +134,7 @@ Are you a **scientist** or researcher? Pooch can help you too!
progressbars.rst
unpacking.rst
decompressing.rst
parallel-downloads.rst

.. toctree::
:caption: Reference
51 changes: 51 additions & 0 deletions doc/parallel-downloads.rst
@@ -0,0 +1,51 @@
.. _paralleldownloads:

Parallel downloads
==================

When :func:`pooch.retrieve` or :meth:`pooch.Pooch.fetch` runs in several
parallel processes, Pooch will trigger multiple downloads of the same file(s).
Although no `race condition <https://en.wikipedia.org/wiki/Race_condition>`_
happens in this process, downloading the same file multiple times is not
desirable: it slows down the fetching process and consumes more bandwidth than
necessary.

A solution to this problem is to create a `lock file
<https://en.wikipedia.org/wiki/File_locking#Lock_files>`_ that allows only
one process to download the desired file and forces all the other processes to
wait until it finishes. Once the lock is released, the waiting processes fetch
the file directly from the cache instead of downloading it again.
Lock files can be easily created through the :mod:`filelock` package.
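
The pattern described above can be sketched with the standard library alone.
The snippet below is only an illustration under stated assumptions:
``simple_lock``, ``fetch_once`` and ``fake_download`` are hypothetical names
that belong to neither Pooch nor :mod:`filelock`, the download is simulated by
writing a small file, and threads stand in for processes to keep the example
short. It is not how :mod:`filelock` is implemented, and it does not handle
stale lock files or timeouts.

```python
import os
import tempfile
import threading
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def simple_lock(lock_path, poll=0.01):
    """Hold an exclusive lock file (hypothetical helper, illustration only)."""
    while True:
        try:
            # O_EXCL makes the creation fail if the lock file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll)  # another worker holds the lock: wait and retry
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)  # release the lock for the next worker


def fetch_once(cache_file, lock_path, download):
    """Call ``download`` only if ``cache_file`` is missing, one worker at a time."""
    with simple_lock(lock_path):
        if not cache_file.exists():
            download(cache_file)
    return cache_file


calls = []


def fake_download(target):
    # Stand-in for a real download: record the call and write a small file
    calls.append(target)
    time.sleep(0.05)
    target.write_text("data")


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "tiny-data.txt"
    lock_path = Path(tmp) / "foo.lock"
    workers = [
        threading.Thread(
            target=fetch_once, args=(cache_file, lock_path, fake_download)
        )
        for _ in range(4)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    n_downloads = len(calls)
    content = cache_file.read_text()
```

The first worker to create the lock file performs the simulated download.
Every other worker waits, finds the cached file once it gets the lock, and
skips the download.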

For example, let's create a ``download.py`` file that acquires a lock file
before calling the :func:`pooch.retrieve` function.

.. code:: python

    # file: download.py
    import sys

    import filelock
    import pooch

    # Only one process at a time can hold the lock; the others block here
    lock = filelock.FileLock("foo.lock")
    with lock:
        file_path = pooch.retrieve(
            url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt",
            known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
            path="my_dir",
        )

    # Perform tasks with this file using different parameters passed as argument
    parameter = sys.argv[1]  # get parameter from first argument
    ...  # perform tasks using the file and the parameter

We can run this script several times in parallel by ending each command with
the Bash ampersand (``&``), which sends it to the background:

.. code:: bash

    python download.py 1 &
    python download.py 2 &
    python download.py 3 &

Since we are using a lock file, only one of these processes will take care of
the download. The rest will wait for it to finish, and then fetch the file from
the cache. All further tasks that ``download.py`` performs with the different
arguments will then run in parallel as usual.
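
The same parallel launch can also be driven from Python instead of Bash. The
following is a minimal sketch using only the standard library;
``run_in_parallel`` is a hypothetical helper that is not part of Pooch, and it
assumes that the script accepts its parameter as the first command-line
argument, as ``download.py`` does.

```python
import subprocess
import sys


def run_in_parallel(script, parameters):
    """Start one Python process per parameter and return their exit codes."""
    # Popen returns immediately, so all processes run at the same time
    processes = [
        subprocess.Popen([sys.executable, script, str(parameter)])
        for parameter in parameters
    ]
    # Wait for every process to finish and collect the exit codes in order
    return [process.wait() for process in processes]


# Example call (assumes download.py is in the current directory):
# exit_codes = run_in_parallel("download.py", [1, 2, 3])
```

An exit code of zero for every entry means that all runs finished without
raising an error.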
1 change: 1 addition & 0 deletions env/requirements-docs.txt
@@ -2,3 +2,4 @@
sphinx==7.2.*
sphinx-book-theme==1.1.*
sphinx-design==0.5.*
filelock
1 change: 1 addition & 0 deletions environment.yml
@@ -21,6 +21,7 @@ dependencies:
- sphinx==7.2.*
- sphinx-book-theme==1.1.*
- sphinx-design==0.5.*
- filelock
# Style
- pathspec
- black>=20.8b1