Midway2 "can't start new thread" #92

Open
pdeperio opened this issue Mar 20, 2017 · 0 comments

On 02/27/2017 04:18 PM, Igor Yakushin wrote:

I tried running your script (I also added a couple of commands: hostname -f
and date) and I am not getting any errors on either of the hosts.
Do you still experience the problem on midway2-0411? Are there any other
nodes like that?

Is there any number of cores for which the program is guaranteed not to
crash?
Might it simply be a matter of running out of memory as the error
messages seem to suggest?
Does your program crash at the beginning, right after submission or
after running for some time?

On 02/19/2017 11:35 AM, [email protected] via RT wrote:
Dear Hossein,

It seems to be node dependent. For example, if I just do:

sbatch --nodelist midway2-0411 /home/pdeperio/170219-thread_test/test.sh  # Empty script, just SBATCH setup

it returns the error, while:

sbatch --nodelist midway2-0417 /home/pdeperio/170219-thread_test/test.sh

returns no error.

Hi Patrick,
I tried running your script (I also added a couple of commands: hostname -f
and date) and I am not getting any errors on either of the hosts.
Do you still experience the problem on midway2-0411? Are there any other
nodes like that?
Thank you,
Igor

Begin forwarded message:

From: Patrick de Perio [email protected]
Subject: Re: [rcc.uchicago.edu #11361] Midway2 "can't start new thread"
Date: February 19, 2017 at 12:35:26 PM EST
To: [email protected]
Cc: Joseph John Howlett [email protected]

Dear Hossein,

It seems to be node dependent. For example, if I just do:

sbatch --nodelist midway2-0411 /home/pdeperio/170219-thread_test/test.sh  # Empty script, just SBATCH setup

it returns the error, while:

sbatch --nodelist midway2-0417 /home/pdeperio/170219-thread_test/test.sh

returns no error.

Thank you,
Patrick

On Feb 14, 2017, at 9:30 PM, Hossein Pourreza via RT [email protected] wrote:

Do you have a simple Python code using those packages that I can run to regenerate that error message? There might be an option to avoid creating unnecessary threads.

Thanks
Hossein
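
A minimal reproducer of the kind requested here might look like the sketch below. It is not code from the thread and does not use dask or pymongo; it only exercises the standard-library `threading` module, which is where the traceback in the original report ends. The function name `count_startable_threads` and the limit of 64 are assumptions chosen for illustration.

```python
import threading


def count_startable_threads(limit):
    """Try to start up to `limit` idle threads; return how many started."""
    release = threading.Event()
    threads = []
    try:
        for _ in range(limit):
            # Each thread just blocks until released, so all of them
            # exist at the same time.
            t = threading.Thread(target=release.wait)
            t.start()  # raises RuntimeError if the OS refuses a new thread
            threads.append(t)
    except RuntimeError as exc:
        print("failed after {} threads: {}".format(len(threads), exc))
    finally:
        # Let every started thread finish, then wait for it.
        release.set()
        for t in threads:
            t.join()
    return len(threads)


if __name__ == "__main__":
    # 64 is an arbitrary illustrative limit, not a value from the ticket.
    print("started", count_startable_threads(64), "threads")
```

Submitting this with 1 CPU per task to both midway2-0411 and midway2-0417 would show whether the failure depends on the node rather than on the packages involved.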


From: [email protected] via RT [[email protected]]
Sent: Tuesday, February 14, 2017 3:48 PM
Subject: Re: [rcc.uchicago.edu #11361] Midway2 "can't start new thread"

Dear Hossein,

It should be Python only. We're not very familiar with the details of the dask and pymongo packages; could they be requesting many threads?

The task should only require 1 core, so I don't want to occupy more than that if possible.

Patrick

On Feb 14, 2017, at 3:15 PM, Hossein Pourreza via RT [email protected] wrote:

Dear Patrick,

Sorry about the inconvenience. Do your jobs use Python only, or a combination of Python and MPI, for example? Is there any reason you do not want to use more than one core? It looks like your code tries to do multi-threading, and the mechanism in the Linux kernel (called cgroup) is preventing your job from using more than one assigned core.

I tried to generate a similar error message by running an OpenMP code using multiple threads on one core on Midway1, but my code works.

Thanks
Hossein

From: "[email protected] via RT" [email protected]
Reply-To: "[email protected]" [email protected]
Date: Tuesday, February 14, 2017 at 12:25 PM
Subject: [rcc.uchicago.edu #11361] Midway2 "can't start new thread"

Tue Feb 14 12:25:52 2017: Request 11361 was acted upon.
Transaction: Ticket created by [email protected]
Queue: General
Subject: Midway2 "can't start new thread"
Owner: Nobody
Requestors: [email protected]
Status: new
Ticket <URL: https://rt.rcc.uchicago.edu/rt/Ticket/Display.html?id=11361 >

Dear RCC,

I'm observing the following error when running jobs on Midway2 nodes with 1 CPU per task:

     slurmstepd-midway2-0411: error: task/cgroup: unable to add task[pid=28708] to memory cg '(null)'

which then seems to lead to several instances (for some fraction of jobs) of the following error:

     File "/project/lgrandi/anaconda3/envs/pax_v6.3.2/lib/python3.4/threading.py", line 841, in start
         _start_new_thread(self._bootstrap, ())
     RuntimeError: can't start new thread

the last of which arises from a few packages (dask, pymongo).

This does not occur on Midway1 or if we specify >1 CPU per task.

Any idea what's wrong or how to fix it?

Thank you,
Patrick
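
To compare a failing node with a working one, a job could log the limits it actually runs under. The sketch below is an assumption about which values are worth recording, not a diagnosis from the thread: it reports the host name, the CPUs available to the process, and a few resource limits that can make thread creation fail. The helper name `describe_limits` is hypothetical.

```python
import os
import resource
import socket


def describe_limits():
    """Collect per-process limits that can affect thread creation."""
    info = {"host": socket.gethostname(), "cpu_count": os.cpu_count()}
    # sched_getaffinity is Linux-only; it lists the CPUs this process
    # is allowed to run on, which may be fewer than cpu_count.
    if hasattr(os, "sched_getaffinity"):
        info["usable_cpus"] = len(os.sched_getaffinity(0))
    # Each limit is a (soft, hard) pair; resource.RLIM_INFINITY means
    # "unlimited". Not every limit exists on every platform.
    for name in ("RLIMIT_NPROC", "RLIMIT_AS", "RLIMIT_STACK"):
        if hasattr(resource, name):
            info[name] = resource.getrlimit(getattr(resource, name))
    return info


if __name__ == "__main__":
    for key, value in sorted(describe_limits().items()):
        print("{}: {}".format(key, value))
```

Running this at the top of the job on both nodes gives two outputs that can be compared line by line. It does not inspect the cgroup memory settings named in the slurmstepd error; those would need to be checked separately.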
