Handle gracefully TPMI initilization fail (with missing MCFG tables) #749

Closed

@ppalucki ppalucki commented May 27, 2024

In environments without access to the MCFG tables, such as an unprivileged Docker container without the relevant directories mapped, or a VM, I would expect pcm to collect a limited set of metrics rather than fail.

Instead, after SRF support (TPMI) was added with this commit, we get this:

TPMI is initialized in initUncoreObjects by the following code:

  • if (TPMIHandle::getNumInstances() == (size_t)num_sockets)

which implicitly calls:

  • PFSInstances::get() to get the singleton,
  • processDVSEC() -> forAllIntelDevices() to do discovery, and then
  • getMCFGRecords() -> PciHandleMM::getMCFGRecords() -> readMCfg(),
  • and finally openMcfgTable,

which throws an "anonymous" exception when the MCFG files are not available (the files cannot be opened):

    if (mcfg_handle < 0)
    {
        throw std::exception();
    }

The exception above is not handled anywhere and propagates up to the main routine, resulting in a fatal error like this:

docker run -ti --cap-add SYS_ADMIN --cap-add SYS_RAWIO -e PCM_NO_MSR=1 -e PCM_NO_PERF=0 -e PCM_USE_UNCORE_PERF=1 ghcr.io/intel/pcm

output is:

Package thermal spec power: 0 Watt; Package minimum power: 0 Watt; Package maximum power: 0 Watt;

Can't open MCFG table. Check permission of /sys/firmware/acpi/tables/MCFG
Can't open MCFG table. Check permission of /sys/firmware/acpi/tables/MCFG1
Can't open MCFG table. Check permission of /pcm/sys/firmware/acpi/tables/MCFG
Can't open MCFG table. Check permission of /pcm/sys/firmware/acpi/tables/MCFG1
WARNING: enumeration of devices in UncorePMUDiscovery failed
INFO: Linux perf interface to program uncore PMUs is present
INFO: using Linux perf interface to program uncore PMUs because env variable PCM_USE_UNCORE_PERF=1
INFO: Secure Boot detected. Using Linux perf for uncore PMU programming.
Socket 0: 2 memory controllers detected with total number of 6 channels. 3 UPI ports detected. 2 M2M (mesh to memory)/B2CMI blocks detected. 0 HBM M2M blocks detected. 0 EDC/HBM channels detected. 0 Home Agents detected. 3 M3UPI/B2UPI blocks detected.
Socket 1: 2 memory controllers detected with total number of 6 channels. 3 UPI ports detected. 2 M2M (mesh to memory)/B2CMI blocks detected. 0 HBM M2M blocks detected. 0 EDC/HBM channels detected. 0 Home Agents detected. 3 M3UPI/B2UPI blocks detected.
Can't open MCFG table. Check permission of /sys/firmware/acpi/tables/MCFG
Can't open MCFG table. Check permission of /sys/firmware/acpi/tables/MCFG1
Can't open MCFG table. Check permission of /pcm/sys/firmware/acpi/tables/MCFG
Can't open MCFG table. Check permission of /pcm/sys/firmware/acpi/tables/MCFG1
terminate called without an active exception...

and pcm-sensor-server stops.

The last message in the output, "terminate called without an active exception...", is uninformative and misleading, because missing access to the MCFG tables is not a blocker. As with the earlier Uncore discovery step (see "WARNING: enumeration of devices in UncorePMUDiscovery failed" above), it should only make some metrics unavailable.

In other words, I would expect pcm/pcm-sensor-server to keep running with only the TPMI metrics missing. I assume we need to catch the exception one level up, in initUncoreObjects, as proposed in this pull request.
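To illustrate the idea of catching the exception one level up, here is a minimal sketch. It is not the actual pcm code: discoverTpmiInstances and tryInitTpmi are hypothetical stand-ins for the discovery path and for the guarded check in initUncoreObjects, and the hard-coded instance count is an assumption made only so the example runs.

```cpp
#include <cstddef>
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the discovery path that ends in openMcfgTable.
// It throws when no MCFG file can be opened.
std::size_t discoverTpmiInstances(bool mcfgReadable)
{
    if (!mcfgReadable)
    {
        throw std::runtime_error("cannot open MCFG table files");
    }
    return 2; // assumption: pretend one TPMI instance per socket was found
}

// Hypothetical guard placed where the uncore objects are set up.
// Returns true when TPMI is usable. On failure it records and prints a
// warning and returns false, so the caller can continue with other metrics.
bool tryInitTpmi(bool mcfgReadable, std::size_t numSockets, std::string & warning)
{
    try
    {
        return discoverTpmiInstances(mcfgReadable) == numSockets;
    }
    catch (const std::exception & e)
    {
        warning = std::string("ERROR: Could not initialize TPMI. Exception details: ") + e.what();
        std::cerr << warning << "\n";
        return false;
    }
}
```

A caller would invoke tryInitTpmi(mcfgReadable, num_sockets, warning) and simply skip the TPMI-based metrics when it returns false, instead of letting the exception reach the main routine.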

Additionally, this pull request replaces the "anonymous" std::exception() with std::runtime_error(), so we get a more detailed warning.

When running this PR:

# 1) build
docker build . -t pcm-local
# 2) run 
docker run -ti --cap-add SYS_ADMIN --cap-add SYS_RAWIO -e PCM_NO_MSR=1 -e PCM_NO_PERF=0 -e PCM_USE_UNCORE_PERF=1 pcm-local

the output is friendlier, and the failure doesn't block collection of other metrics:

...
Can't open MCFG table. Check permission of /sys/firmware/acpi/tables/MCFG
Can't open MCFG table. Check permission of /sys/firmware/acpi/tables/MCFG1
Can't open MCFG table. Check permission of /pcm/sys/firmware/acpi/tables/MCFG
Can't open MCFG table. Check permission of /pcm/sys/firmware/acpi/tables/MCFG1
ERROR: Could not initialize TPMI. Uncore frequency metrics will be unavailable. Exception details: cannot open /[pcm]/sys/firmware/acpi/tables/MCFG* files
...
Starting plain HTTP server on http://localhost:9738/

Generally, IMO it would help future debugging a lot if we replaced "plain" std::exception with std::runtime_error. Then, when we run with PCM_NO_MAIN_EXCEPTION_HANDLER=1 and have forgotten to handle an exception, we can easily identify its source (instead of debugging every place that throws std::exception).
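A small sketch of why the exception type matters for debugging: a std::runtime_error carries the message passed to its constructor, while what() on a plain std::exception returns only an implementation-defined generic string. The helper names below (whatOf, throwAnonymous, throwDescriptive) are hypothetical and exist only for this illustration.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical helper: runs a callable and returns what() of the
// std::exception it throws, or an empty string if nothing is thrown.
template <class F>
std::string whatOf(F && f)
{
    try
    {
        f();
    }
    catch (const std::exception & e)
    {
        return e.what();
    }
    return "";
}

// "Anonymous" exception: what() is a generic, implementation-defined string.
void throwAnonymous()
{
    throw std::exception();
}

// Descriptive exception: what() returns the message given here.
void throwDescriptive()
{
    throw std::runtime_error("cannot open MCFG table files");
}
```

Calling whatOf(throwDescriptive) yields the message that names the failing step, whereas whatOf(throwAnonymous) gives no hint about where the exception came from.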

@ppalucki ppalucki changed the title [BLOCKER] Handle TPMI initilization for systems without access to MCFG tables. [BLOCKER] Handle gracefully TPMI initilization fail (for enviornment without access to MCFG tables) May 27, 2024
@ppalucki ppalucki changed the title [BLOCKER] Handle gracefully TPMI initilization fail (for enviornment without access to MCFG tables) Handle gracefully TPMI initilization fail (for enviornment without access to MCFG tables unprivileged docker/vm) May 27, 2024
@ppalucki ppalucki changed the title Handle gracefully TPMI initilization fail (for enviornment without access to MCFG tables unprivileged docker/vm) Handle gracefully TPMI initilization fail (with missing MCFG tables) May 27, 2024

rdementi commented Jun 4, 2024

merged via #752

@rdementi rdementi closed this Jun 4, 2024