I tried running your script (I also added a couple of commands: hostname -f
and date) and I am not getting any errors on either of the hosts.
Do you still experience the problem on midway2-0411? Are there any other
nodes like that?
Is there any number of cores for which the program is guaranteed not to
crash?
Might it simply be a matter of running out of memory as the error
messages seem to suggest?
Does your program crash at the beginning, right after submission or
after running for some time?
On 02/19/2017 11:35 AM, [email protected] via RT wrote:
Dear Hossein,
It seems to be node dependent. For example, if I just do:
sbatch --nodelist midway2-0411 /home/pdeperio/170219-thread_test/test.sh # Empty script, just SBATCH setup
Hi Patrick,
I tried running your script (I also added a couple of commands: hostname -f
and date) and I am not getting any errors on either of the hosts.
Do you still experience the problem on midway2-0411? Are there any other
nodes like that?
Thank you,
Igor
Begin forwarded message:
From: Patrick de Perio [email protected]
Subject: Re: [rcc.uchicago.edu #11361] Midway2 "can't start new thread"
Date: February 19, 2017 at 12:35:26 PM EST
To: [email protected]
Cc: Joseph John Howlett [email protected]
Dear Hossein,
It seems to be node dependent. For example, if I just do:
sbatch --nodelist midway2-0411 /home/pdeperio/170219-thread_test/test.sh # Empty script, just SBATCH setup
On Feb 14, 2017, at 9:30 PM, Hossein Pourreza via RT [email protected] wrote:
Do you have a simple Python script using those packages that I can run to reproduce that error message? There might be an option to avoid creating unnecessary threads.
Thanks
Hossein
From: [email protected] via RT [[email protected]]
Sent: Tuesday, February 14, 2017 3:48 PM
Subject: Re: [rcc.uchicago.edu #11361] Midway2 "can't start new thread"
Dear Hossein,
It should be Python only. We're not so familiar with the details of the dask and pymongo packages; could they be requesting many threads?
The task should only require 1 core, so I don't want to occupy more than that if possible.
Patrick
On Feb 14, 2017, at 3:15 PM, Hossein Pourreza via RT [email protected] wrote:
Dear Patrick,
Sorry about the inconvenience. Do your jobs use Python only, or a combination of Python and MPI, for example? Is there any reason that you do not want to use more than one core? It looks like your code tries to do multi-threading, and the mechanism in the Linux kernel (called cgroup) is preventing your job from using more than one assigned core.
I tried to generate a similar error message by running an OpenMP code using multiple threads on one core on Midway1, but my code works.
I'm observing the following error when running jobs on Midway2 nodes with 1 CPU per task:
slurmstepd-midway2-0411: error: task/cgroup: unable to add task[pid=28708] to memory cg '(null)'
which then seems to lead to several instances (for some fraction of jobs) of the following error:
File "/project/lgrandi/anaconda3/envs/pax_v6.3.2/lib/python3.4/threading.py", line 841, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
the last of which arises from a few packages (dask, pymongo).
This does not occur on Midway1 or if we specify >1 CPU per task.
Any idea what's wrong or how to fix it?
Thank you,
Patrick
On 02/27/2017 04:18 PM, Igor Yakushin wrote:
I tried running your script (I also added a couple of commands: hostname -f
and date) and I am not getting any errors on either of the hosts.
Do you still experience the problem on midway2-0411? Are there any other
nodes like that?
Is there any number of cores for which the program is guaranteed not to
crash?
Might it simply be a matter of running out of memory as the error
messages seem to suggest?
Does your program crash at the beginning, right after submission or
after running for some time?
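To help answer the questions above, a small diagnostic could be run at the top of the batch script on the affected node. The following is only a minimal sketch using the Python standard library: the function name `describe_limits` and the choice of fields are illustrative assumptions, and the SLURM variables are read defensively because they may not be set for every job.

```python
import os
import resource
import socket
import threading


def describe_limits():
    """Collect host name and per-process limits relevant to thread creation."""
    info = {
        # Same information as `hostname -f`
        "host": socket.getfqdn(),
        # (soft, hard) limit on processes/threads for this user
        "max_processes": resource.getrlimit(resource.RLIMIT_NPROC),
        # (soft, hard) limit on virtual address space, in bytes
        "address_space": resource.getrlimit(resource.RLIMIT_AS),
        # (soft, hard) limit on the main thread's stack size, in bytes
        "stack": resource.getrlimit(resource.RLIMIT_STACK),
        # Stack size used for new Python threads (0 means platform default)
        "thread_stack_size": threading.stack_size(),
        # Values exported by SLURM, if present (None otherwise)
        "slurm_cpus_per_task": os.environ.get("SLURM_CPUS_PER_TASK"),
        "slurm_mem_per_node": os.environ.get("SLURM_MEM_PER_NODE"),
    }
    # sched_getaffinity is Linux-only, so guard the call
    if hasattr(os, "sched_getaffinity"):
        info["allowed_cpus"] = sorted(os.sched_getaffinity(0))
    return info


if __name__ == "__main__":
    for key, value in describe_limits().items():
        print(key, "=", value)
```

Comparing this output between midway2-0411 and a node where the job succeeds might show whether a memory or process limit differs, though it cannot by itself confirm the cause.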
On 02/19/2017 11:35 AM, [email protected] via RT wrote:
Dear Hossein,
It seems to be node dependent. For example, if I just do:
sbatch --nodelist midway2-0411 /home/pdeperio/170219-thread_test/test.sh # Empty script, just SBATCH setup
it returns the error, while:
returns no error.
Hi Patrick,
I tried running your script (I also added a couple of commands: hostname -f
and date) and I am not getting any errors on either of the hosts.
Do you still experience the problem on midway2-0411? Are there any other
nodes like that?
Thank you,
Igor
Begin forwarded message:
From: Patrick de Perio [email protected]
Subject: Re: [rcc.uchicago.edu #11361] Midway2 "can't start new thread"
Date: February 19, 2017 at 12:35:26 PM EST
To: [email protected]
Cc: Joseph John Howlett [email protected]
Dear Hossein,
It seems to be node dependent. For example, if I just do:
sbatch --nodelist midway2-0411 /home/pdeperio/170219-thread_test/test.sh # Empty script, just SBATCH setup
it returns the error, while:
returns no error.
Thank you,
Patrick
On Feb 14, 2017, at 9:30 PM, Hossein Pourreza via RT [email protected] wrote:
Do you have a simple Python script using those packages that I can run to reproduce that error message? There might be an option to avoid creating unnecessary threads.
Thanks
Hossein
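One possible shape for such a reproducer, independent of dask and pymongo, is sketched below. It is an illustrative assumption rather than the script used in this ticket: the helper name `count_startable_threads` and the cap of 64 threads are hypothetical. It starts idle threads until either the cap is reached or Python raises the same RuntimeError seen in the traceback.

```python
import threading


def count_startable_threads(cap=64):
    """Start up to `cap` idle threads; return (number started, error or None)."""
    release = threading.Event()
    started = []
    error = None
    try:
        for _ in range(cap):
            # Each thread just blocks until the event is set
            thread = threading.Thread(target=release.wait)
            thread.start()  # raises RuntimeError if no thread can be created
            started.append(thread)
    except RuntimeError as exc:
        error = exc
    finally:
        # Let every thread that did start finish, then wait for it
        release.set()
        for thread in started:
            thread.join()
    return len(started), error


if __name__ == "__main__":
    count, err = count_startable_threads()
    print("threads started:", count)
    print("error:", err)
```

If the count falls short of the cap on one node but not on another under the same SBATCH settings, that would point toward a node-level limit rather than the packages themselves.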
From: [email protected] via RT [[email protected]]
Sent: Tuesday, February 14, 2017 3:48 PM
Subject: Re: [rcc.uchicago.edu #11361] Midway2 "can't start new thread"
Dear Hossein,
It should be Python only. We're not so familiar with the details of the dask and pymongo packages; could they be requesting many threads?
The task should only require 1 core, so I don't want to occupy more than that if possible.
Patrick
On Feb 14, 2017, at 3:15 PM, Hossein Pourreza via RT [email protected] wrote:
Dear Patrick,
Sorry about the inconvenience. Do your jobs use Python only, or a combination of Python and MPI, for example? Is there any reason that you do not want to use more than one core? It looks like your code tries to do multi-threading, and the mechanism in the Linux kernel (called cgroup) is preventing your job from using more than one assigned core.
I tried to generate a similar error message by running an OpenMP code using multiple threads on one core on Midway1, but my code works.
Thanks
Hossein
From: "[email protected] via RT" [email protected]
Reply-To: "[email protected]" [email protected]
Date: Tuesday, February 14, 2017 at 12:25 PM
Subject: [rcc.uchicago.edu #11361] Midway2 "can't start new thread"
Tue Feb 14 12:25:52 2017: Request 11361 was acted upon.
Transaction: Ticket created by [email protected]
Queue: General
Subject: Midway2 "can't start new thread"
Owner: Nobody
Requestors: [email protected]
Status: new
Ticket <URL: https://rt.rcc.uchicago.edu/rt/Ticket/Display.html?id=11361 >
Dear RCC,
I'm observing the following error when running jobs on Midway2 nodes with 1 CPU per task:
slurmstepd-midway2-0411: error: task/cgroup: unable to add task[pid=28708] to memory cg '(null)'
which then seems to lead to several instances (for some fraction of jobs) of the following error:
File "/project/lgrandi/anaconda3/envs/pax_v6.3.2/lib/python3.4/threading.py", line 841, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
the last of which arises from a few packages (dask, pymongo).
This does not occur on Midway1 or if we specify >1 CPU per task.
Any idea what's wrong or how to fix it?
Thank you,
Patrick