Add parallelisation to the cli and test parallelisation options #17

soutobias opened this issue Oct 2, 2024 · 0 comments · May be fixed by #19
The package currently supports parallelization using Dask. However, there are some important points to note:

  1. No CLI Option for Parallelization: The parallelization feature is not exposed via the command-line interface (CLI).

  2. Intermittent Dask Issues: We are encountering the following error intermittently when using Dask for parallelization:

cat  transfer_to_os_1y_U.err
/home/users/acc/.conda/envs/env_cylc/lib/python3.10/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40343 instead
  warnings.warn(
/home/users/acc/.conda/envs/env_cylc/lib/python3.10/site-packages/distributed/client.py:3164: UserWarning: Sending large graph of size 18.33 GiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(
2024-06-27 13:15:58,818 - distributed.protocol.core - CRITICAL - Failed to Serialize
Traceback (most recent call last):
  File "/home/users/acc/.conda/envs/env_cylc/lib/python3.10/site-packages/distributed/protocol/core.py", line 109, in dumps
    frames[0] = msgpack.dumps(msg, default=_encode_default, use_bin_type=True)
  File "/home/users/acc/.conda/envs/env_cylc/lib/python3.10/site-packages/msgpack/__init__.py", line 36, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 294, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 300, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 297, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 272, in msgpack._cmsgpack.Packer._pack
ValueError: memoryview is too large
2024-06-27 13:15:58,822 - distributed.comm.utils - ERROR - memoryview is too large

This issue only occurs in about 30% of runs; the remaining 70% complete successfully. I have experimented with the memory size, the number of jobs, and the heartbeat timing, but none of these changes resolves the issue consistently.
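One of the warnings above suggests scattering the data ahead of time and using futures so that the task graph itself stays small. A minimal sketch of what that could look like, assuming a hypothetical upload_chunk function and a chunks iterable standing in for the package's own upload routine and data:

# Sketch only: pre-scatter each payload so ~18 GiB of data is not embedded in the graph.
# upload_chunk and chunks are hypothetical placeholders, not names from this package.
from dask.distributed import Client

client = Client()  # starts a local cluster; pass a scheduler address to reuse an existing one

futures = []
for chunk in chunks:
    scattered = client.scatter(chunk)                 # ship the data to a worker up front
    futures.append(client.submit(upload_chunk, scattered))

results = client.gather(futures)                      # block until every upload finishes

Whether this avoids the intermittent "memoryview is too large" failure would still need to be tested, but it directly addresses the "Sending large graph" warning.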

Some users have mentioned that upgrading Dask might help, as referenced here: Dask Issue #7552.

Potential Solutions:

  1. Add a flag to allow users to select the number of workers (a minimal sketch follows this list).

  2. Consider alternative approaches:

    • Update the Dask version.
    • Change the parallelization method (e.g., threads, Dask delayed); see the second sketch after this list.
    • Test the upload process with a different object store (e.g., Oracle).
    • Monitor memory usage more closely during job execution.
    • Submit multiple smaller SLURM jobs instead of one large job.
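For option 1, a rough sketch of exposing the worker count on the CLI, assuming an argparse-based interface (the package may use a different CLI framework, and the --n-workers flag name is illustrative only):

# Sketch: let the user choose how many Dask workers to start.
import argparse
from dask.distributed import Client, LocalCluster

parser = argparse.ArgumentParser()
parser.add_argument("--n-workers", type=int, default=4,
                    help="number of Dask workers to start (default: 4)")
args = parser.parse_args()

cluster = LocalCluster(n_workers=args.n_workers)  # size the local cluster from the flag
client = Client(cluster)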
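For the threads / Dask delayed alternative, a sketch of running the same hypothetical upload_chunk tasks on the local threaded scheduler, which bypasses the distributed scheduler and its serialization path entirely:

# Sketch: lazy tasks via dask.delayed, executed with the threaded scheduler.
# upload_chunk and chunks are hypothetical placeholders.
import dask

tasks = [dask.delayed(upload_chunk)(chunk) for chunk in chunks]
results = dask.compute(*tasks, scheduler="threads")  # returns a tuple of results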
soutobias self-assigned this on Nov 5, 2024