Troubleshooting with ptq #167

Open
manoaman opened this issue Feb 7, 2024 · 11 comments
manoaman commented Feb 7, 2024

Hi @william-silversmith ,

I have a situation where my igneous execution gets stuck at one point and does not seem to progress, and I don't see any notable logs or output. Would you be able to guide me on how to troubleshoot what is causing the issue?

Thanks,
-m

ptq status ./queue/                                                                                                                                                                                                                             
 
Inserted: 5586
Enqueued: 1 (0.0% left)
Completed: 5617 (100.6%)
Leased: 1 (100.0% of queue)
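As an aside on reading these numbers: Completed can exceed Inserted because python-task-queue provides at-least-once delivery, so a task that stalls, gets re-leased, and later finishes can be counted as completed more than once. A minimal sketch reproducing the percentages from the counts above:

```python
# Recompute the ptq status percentages from the reported counts.
inserted, enqueued, completed, leased = 5586, 1, 5617, 1

pct_left = round(enqueued / inserted * 100, 1)        # work still enqueued
pct_completed = round(completed / inserted * 100, 1)  # >100% when tasks ran twice
pct_leased = round(leased / enqueued * 100, 1)        # queued tasks currently leased

print(pct_left, pct_completed, pct_leased)  # → 0.0 100.6 100.0
```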
william-silversmith (Contributor) commented:

Hi m! What kind of task are you running? Can you let me know what its parameters are?

Is your CPU busy, or is your network or disk doing any work?


manoaman commented Feb 7, 2024

Hi Will,

Notable parameters used in xfer task:

igneous image xfer --mip 0 --chunk-size 128,128,64 --fill-missing --sharded

Utilizing 36 CPU cores.

Indeed, the disk has been unstable at times, although both CPU and network seem fine to me. Any suggestions for how to identify the cause?
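For reference, the full invocation presumably looked something like the following. The source/destination paths are placeholders, and the --queue flag placement is an assumption about how the queue directory was wired up, not something stated in the thread:

```shell
# Sketch of the full transfer invocation (paths are hypothetical; the
# --queue flag usage is an assumption, not taken from the thread).
igneous image xfer ./source_layer ./dest_layer \
  --mip 0 --chunk-size 128,128,64 --fill-missing --sharded \
  --queue ./queue/

# Workers then drain the queue (the -p flag mentioned later in the thread):
igneous execute -p 36 ./queue/
```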

william-silversmith (Contributor) commented:

How's your memory usage? Sharded transfers can potentially use a lot of memory. If you start swapping, that would cause low utilization of the network and CPU, but weird access patterns on disk.

Try setting the memory parameter lower (which will create more shards but make each task smaller).


manoaman commented Feb 7, 2024

I'm using a compute node with 36 CPU cores and 1 TB of memory. I understand the default is 3.5 GB. Should I set it much smaller than the default value? Maybe 1 GB? (--memory 1000000000)

  --memory INTEGER                Task memory limit in bytes. Task shape will
                                  be chosen to fit and maximize downsamples.
                                  [default: 3500000000.0]

william-silversmith (Contributor) commented:

Hmm... 1TB should be more than enough. Can you check how much RAM is being used?


manoaman commented Feb 7, 2024

I reset the queue and am running it once again. Looking at the htop summary, total memory usage is fluctuating between 130 GB and 150 GB at the moment, gradually increasing.

william-silversmith (Contributor) commented:

If it gets stuck again and RAM isn't a problem, one thing you can try is turning off parallelism and seeing if it executes.
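A minimal sketch of that, assuming the igneous execute subcommand and the -p flag used earlier in this thread, where -p 1 runs without worker multiprocessing:

```shell
# Drain the queue with a single process to rule out multiprocessing deadlocks
# (the subcommand name and flag are assumptions based on the thread).
igneous execute -p 1 ./queue/
```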


manoaman commented Feb 9, 2024

Hi @william-silversmith ,

I tried running without the -p 36 parameter this time after observing another hang, but igneous still seems to be stuck in the same state, judging from ptq status.

Inserted: 5586
Enqueued: 1 (0.0% left)
Completed: 5610 (100.4%)
Leased: 1 (100.0% of queue)

The htop command shows two processes: one keeps running at CPU 0.0%, MEM 0.1%, and the other doesn't seem to be using any resources. Both show state "S", which seems to mean sleeping?
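For what it's worth, "S" in htop is interruptible sleep: the process is blocked waiting on something, often I/O or a lock, which is consistent with a stalled filesystem (state "D", uninterruptible sleep, would point even more strongly at stuck disk I/O). A stdlib sketch of reading that state directly on Linux; proc_state is a hypothetical helper, not part of igneous:

```python
import os

def proc_state(pid):
    """Return the one-letter state ('R', 'S', 'D', ...) from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The comm field can contain spaces, so split after its closing ')'.
    return data.rsplit(")", 1)[1].split()[0]

print(proc_state(os.getpid()))  # 'R': this process is running while it reads
```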

-m


manoaman commented Feb 9, 2024

Testing whether the storage has a problem by switching to different storage.
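One quick way to compare the two storage systems is to time a synchronous write on each mount. A minimal stdlib sketch, where fsync_latency is a hypothetical helper (not part of igneous or ptq):

```python
import os
import tempfile
import time

def fsync_latency(path, size=1 << 20):
    """Time a 1 MiB write + fsync on the filesystem containing `path`."""
    payload = os.urandom(size)
    fd, name = tempfile.mkstemp(dir=path)
    try:
        t0 = time.perf_counter()
        os.write(fd, payload)
        os.fsync(fd)  # force the data to disk so we measure real I/O
        return time.perf_counter() - t0
    finally:
        os.close(fd)
        os.unlink(name)

print(f"{fsync_latency(tempfile.gettempdir()) * 1000:.1f} ms")
```

Running this against both mounts and comparing the numbers (and their variance) can confirm whether the suspect filesystem is the one stalling.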

william-silversmith (Contributor) commented:

This is a good strategy! Let me know how it goes.

manoaman commented:

Okay, so this definitely had something to do with the storage. When I switched over to different storage, I no longer see the issue where igneous hangs.
