The package currently supports parallelization using Dask. However, there are some important points to note:

- No CLI option for parallelization: the parallelization feature is not exposed via the command-line interface (CLI).
- Intermittent Dask issues: we are encountering the following error intermittently when using Dask for parallelization:
cat transfer_to_os_1y_U.err
/home/users/acc/.conda/envs/env_cylc/lib/python3.10/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40343 instead
warnings.warn(
/home/users/acc/.conda/envs/env_cylc/lib/python3.10/site-packages/distributed/client.py:3164: UserWarning: Sending large graph of size 18.33 GiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
warnings.warn(
2024-06-27 13:15:58,818 - distributed.protocol.core - CRITICAL - Failed to Serialize
Traceback (most recent call last):
File "/home/users/acc/.conda/envs/env_cylc/lib/python3.10/site-packages/distributed/protocol/core.py", line 109, in dumps
frames[0] = msgpack.dumps(msg, default=_encode_default, use_bin_type=True)
File "/home/users/acc/.conda/envs/env_cylc/lib/python3.10/site-packages/msgpack/__init__.py", line 36, in packb
return Packer(**kwargs).pack(o)
File "msgpack/_packer.pyx", line 294, in msgpack._cmsgpack.Packer.pack
File "msgpack/_packer.pyx", line 300, in msgpack._cmsgpack.Packer.pack
File "msgpack/_packer.pyx", line 297, in msgpack._cmsgpack.Packer.pack
File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
File "msgpack/_packer.pyx", line 272, in msgpack._cmsgpack.Packer._pack
ValueError: memoryview is too large
2024-06-27 13:15:58,822 - distributed.comm.utils - ERROR - memoryview is too large
This error occurs in roughly 30% of runs; the remaining 70% complete successfully. I have experimented with the memory size, the number of jobs, and the heartbeat timing, but none of these adjustments resolves the issue consistently.
Some users have mentioned that upgrading Dask might help; see Dask Issue #7552.
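Independent of a version bump, the "Sending large graph of size 18.33 GiB" warning indicates that the input data is being embedded in the task graph itself, and an oversized graph frame is also a plausible trigger for msgpack's "memoryview is too large" failure. Below is a minimal sketch of the scatter-then-futures pattern the warning recommends; `upload_chunk` and the `chunks` payload are invented placeholders for this package's real upload routine and data, not its actual API:

```python
import numpy as np
from distributed import Client

def upload_chunk(chunk, index):
    # Hypothetical stand-in for the real per-chunk object-store upload.
    return index, chunk.nbytes

client = Client()  # connect to (or start) a cluster

# Placeholder for the large in-memory payload that currently gets
# baked into the task graph.
chunks = [np.zeros((1000, 1000)) for _ in range(4)]

# Scatter the data to the workers first: each submitted task then
# carries only a small Future handle, so the scheduler never has to
# msgpack-serialize the multi-GiB payload inside the graph.
scattered = client.scatter(chunks)

futures = [client.submit(upload_chunk, c, i)
           for i, c in enumerate(scattered)]
print(client.gather(futures))
```

If scattering removes the warning but the intermittent failure persists, that would point away from graph size as the root cause.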
Potential Solutions:

- Add a flag to allow users to select the number of workers (sketched after this list).
- Consider alternative approaches:
  - Update the Dask version.
  - Change the parallelization method (e.g., threads, Dask delayed; also sketched after this list).
  - Test the upload process with a different object store (e.g., Oracle).
  - Monitor memory usage more closely during job execution.
  - Submit multiple smaller SLURM jobs instead of one large job.
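To make the worker-count flag and the Dask-delayed alternative concrete, here is a hedged sketch combining both: an illustrative `--n-workers` flag (not an existing option of this package's CLI) that sizes a `LocalCluster`, with the per-file work submitted through `dask.delayed`; `process_file` is a hypothetical placeholder for the real upload step:

```python
import argparse

import dask
from distributed import Client, LocalCluster

def process_file(path):
    # Hypothetical per-file work; substitute the real upload routine.
    return path

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-workers", type=int, default=4,
                        help="number of Dask workers (illustrative flag)")
    parser.add_argument("paths", nargs="+", help="files to process")
    args = parser.parse_args()

    # Size the cluster from the CLI instead of hard-coding it.
    cluster = LocalCluster(n_workers=args.n_workers)
    client = Client(cluster)

    # dask.delayed builds one lazy task per file; dask.compute runs
    # them in parallel on the cluster attached via the Client.
    tasks = [dask.delayed(process_file)(p) for p in args.paths]
    results = dask.compute(*tasks)
    print(results)

if __name__ == "__main__":
    main()
```

Exposing the worker count this way would also make it easier to bisect whether the serialization failure correlates with cluster size.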