OpTestPCI: Ignore harmless timeout message. #493

Closed

sammj
Contributor

@sammj sammj commented May 6, 2019

When a sufficient amount of debugging has been enabled[0] Skiboot will
produce messages of the form
PHB#0030[8:0]: TRACE: Timeout waiting for link up
This is harmless and only represents an empty slot so filter it out on
all platforms.

Fixes #491

[0] nvram -p ibm,skiboot --update-config pci-tracing=true

Signed-off-by: Samuel Mendoza-Jonas [email protected]
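
The kind of filtering proposed here could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual OpTestPCI change: the names `HARMLESS_PATTERNS` and `filter_harmless` are hypothetical, and the second sample line is invented.

```python
import re

# Hypothetical list of patterns for messages treated as harmless.
# The single entry matches the message form quoted in the description.
HARMLESS_PATTERNS = [
    re.compile(r"PHB#[0-9a-fA-F]+\[\d+:\d+\]: TRACE: Timeout waiting for link up"),
]


def filter_harmless(lines):
    """Return only the lines that match none of the harmless patterns."""
    return [
        line
        for line in lines
        if not any(pattern.search(line) for pattern in HARMLESS_PATTERNS)
    ]


# The first line is the message quoted above; the second is an invented placeholder.
sample = [
    "PHB#0030[8:0]: TRACE: Timeout waiting for link up",
    "PHB#0030[8:0]: some other message",
]
remaining = filter_harmless(sample)
print(remaining)
```

Only lines that survive the filter would then be considered by whatever check decides that a test has failed.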

@debmc
Collaborator

debmc commented May 6, 2019

> harmless

I would think we already addressed this by having skiboot not report this as an error (by lowering the severity of the skiboot TRACE). We may be going down a path of defensive filtering here, and where would we stop? The proper fix is in skiboot, and we should catch this failure and manually clean the system if needed.

In the original bug where this was reported, 177004, the Witherspoon had old DD2.x procs that were updated to DD2.3s, and there was also a pcie-max-link-speed=3 nvram parameter that needed to be removed. If this fix had been upstream, we would have missed this misconfiguration, and there could have been other performance-related issues, timing changes, etc. that we might never have found or that might never have surfaced. I did see the new nvram patch where skiboot nvram parameters will have ASCII art printed out, but again I don't think adding this type of defensive filtering in op-test is a good direction. We add filtering for messages that are truly to be ignored, and this particular one-off will be ignored once the proper debug level is set in skiboot for the tracing messages.

I don't think this is needed, IMHO.

@sammj
Contributor Author

sammj commented May 7, 2019

I can see what you're saying, but putting aside its log level, this message is informational only AFAICS, so it seems an overreaction to fail a test on it. Indeed, even once Skiboot changes the level for this to NOTICE, op-test could still fail on it depending on the log level at runtime.
With the 177004 bug, I believe the timeout message was not directly related to the pcie-max-link-speed parameter being set, but the point about clearing NVRAM debug options is good. I'm currently working on a solution for this for #492.

@debmc
Collaborator

debmc commented May 7, 2019

> log-level at runtime

I think the command will not output PR_NOTICE messages:

grep ',[432]\].*PHB#.* Link down' /sys/firmware/opal/msglog
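
To illustrate why that pattern skips NOTICE-level lines, here is a small sketch that applies the same regular expression in Python. It assumes msglog lines carry a `[timestamp,level]` prefix and that PR_NOTICE corresponds to level 5; the sample lines and the name `LEVEL_PATTERN` are invented for illustration.

```python
import re

# Same pattern as the grep command above: a log level of 4, 3, or 2
# immediately before "]", followed later by a PHB "Link down" message.
LEVEL_PATTERN = re.compile(r",[432]\].*PHB#.* Link down")

# Invented sample lines in an assumed "[timestamp,level] message" layout.
sample_msglog = [
    "[   10.000000000,3] PHB#0030[8:0]: Link down",  # level 3: matched
    "[   10.000000001,5] PHB#0030[8:0]: Link down",  # level 5 (assumed PR_NOTICE): skipped
    "[   10.000000002,4] PHB#0030[8:0]: Link up",    # level 4 but different text: skipped
]

matches = [line for line in sample_msglog if LEVEL_PATTERN.search(line)]
for line in matches:
    print(line)
```

Under these assumptions, only the level 3 line is reported, so a message logged at NOTICE would not trip the check.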

@debmc
Collaborator

debmc commented May 7, 2019

> clearing NVRAM debug options is good

Yes.

@sammj
Contributor Author

sammj commented May 7, 2019

Ah right, it will ignore NOTICE. I still think we should filter this for existing Skiboot versions, though, and handle warning about or clearing NVRAM options in a separate patch.

@debmc
Collaborator

debmc commented May 7, 2019

> filter this for existing Skiboot versions

Since this is the first time we have seen this, I'm not sure it warrants introducing a maintenance item for the short window this will exist.

@oohal
Contributor

oohal commented Nov 5, 2019

I think this got resolved in skiboot, so eh.

@oohal oohal closed this Nov 5, 2019
Successfully merging this pull request may close these issues.

spurious failures in PCI tests caused by pci-tracing nvram option