Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lockup: Print tasks which have been on-cpu for too long #109

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

richl9
Copy link
Contributor

@richl9 richl9 commented Sep 30, 2024

  1. add helper for scanning potential lockups on cpus and dumping their info
  2. add prio in runq dump

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Sep 30, 2024
Copy link
Member

@brenns10 brenns10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks interesting. I think my biggest question is whether the number of seconds on CPU is actually a reliable indicator of some sort of lockup, and whether 0 is a good default for that parameter.

Comment on lines 56 to 58
"""
Corelens Module for lockup
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This text appears in the corelens -L output. It should be only one line, and useful to tell users what it does. EG:

Suggested change
"""
Corelens Module for lockup
"""
"""Print tasks which have been on-cpu for too long"""

"--time",
"-t",
type=float,
default=0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default of 0 seems unhelpful, since that prints any running task, right? Can we specify a more useful default?

Does --time specify the amount of time that a task has been on-CPU without the opportunity to schedule? Or is it just for the current time slice? If there's a low load, it would be totally reasonable for a task to be on-CPU for a long time, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Forgot to change it back after testing. Was thinking about default = 2. What do you think? And --time is for the current time slice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable, we can always adjust it in the future.

"-t",
type=float,
default=0,
help="list all the processes that have been running more than <time> seconds",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify this: is it the total amount of seconds the task has been running? OR just since the last schedule? Or just since the last cond_resched()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got them time from task_lastrun2now(), so it's the duration from task last run timestamp to now.

Comment on lines 93 to 94
f' COMMAND: "{comm}"',
f" RUNTIME: {timestamp_str(run_time)}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this string after a comma makes it a second argument to print().

The rest of these strings are specified without commas, which instead performs string concatenation. Could you please get rid of the comma between these two lines?

"""
Print tasks which are in the RT and CFS runqueues on each CPU
Print tasks which are in the RT and CFS runqueues on each CPU. Specify min_run_time_seconds to x to only print
processes running more than x seconds. Set skip_swapper = True to not include swapper process in the ouput.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no skip_swapper arg here. Either add it or delete the sentence.

@richl9 richl9 changed the title Richard/lockup lockup: Print tasks which have been on-cpu for too long Oct 18, 2024
@richl9 richl9 requested a review from brenns10 October 19, 2024 01:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants