Creating AK triggers TPM failure with hardware TPM on some Intel systems #367

Open
lkatalin opened this issue Aug 24, 2022 · 9 comments

Comments

@lkatalin

I'm submitting this on behalf of my team, including @kkaarreell and @sergio-correia.

While running tests on the Rust Keylime agent using a hardware TPM, the team noticed that running the Create AK function puts the TPM into a failed state on certain Intel systems (we don't have an exhaustive list). These same hardware TPMs work fine with our other Python agent, which does not use rust-tss-esapi, leading us to believe this error may be rust-tss-esapi related. There is a minimal reproducer here that fails on this line.

  • OS: RHEL 9.1 *see note
  • TPM info: 'Issuer: C = DE, O = Infineon Technologies AG, OU = OPTIGA(TM) TPM2.0, CN = Infineon OPTIGA(TM) RSA Manufacturing CA 007'

Note: We don't currently have a Fedora test machine with hardware TPM as of today, but can get one if necessary. @sergio-correia did test this on F35 in the past week and reports similar results.

All tpm2_ commands produce the expected output before running the mini reproducer. The output related to the error is shown below:

This shows how the mini reproducer process hangs on ak::create_ak():

Hello, TPM world!
Preparing to kill the TPM of those Intel Whitley, Wilson City 2S, Ice Lake machines...
Preparing for ek::create_ek_object()
Preparing for ak::create_ak()

Output from dmesg after running the reproducer:

[ 1269.275809] tpm tpm0: tpm2_save_context: failed with a TPM error 0x0101
[ 1269.284334] tpm tpm0: A TPM error (257) occurred flushing context
[ 1269.292270] tpm tpm0: A TPM error (257) occurred flushing context
[ 1269.300268] tpm tpm0: A TPM error (257) occurred flushing context
[ 1269.306366] tpm tpm0: tpm2_commit_space: error -14

with that 257 error indicating...

# tpm2_rc_decode 257
tpm:error(2.0): commands not being accepted because of a TPM failure
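For readers without tpm2_rc_decode at hand, here is a minimal sketch of how the decimal 257 from dmesg maps onto the 0x0101 in the first log line. It assumes the TPM 2.0 response-code bit layout (bit 7 selects the format; in format zero, bit 8 marks a TPM 2.0 code and the low seven bits carry the error number). The helper name `split_tpm2_rc` is made up for illustration and is not part of any TPM library.

```python
def split_tpm2_rc(rc: int) -> dict:
    """Split a TPM 2.0 response code into its bit fields (illustrative helper)."""
    if rc & 0x080:
        # Bit 7 set: format-one code, tied to a handle, session, or parameter.
        return {
            "format": 1,
            "error": rc & 0x03F,               # error number, bits 0-5
            "is_parameter": bool(rc & 0x040),  # bit 6
            "number": (rc >> 8) & 0xF,         # handle/session/parameter index
        }
    # Bit 7 clear: format-zero code.
    return {
        "format": 0,
        "error": rc & 0x07F,           # error number, bits 0-6
        "tpm2": bool(rc & 0x100),      # bit 8: TPM 2.0 code
        "vendor": bool(rc & 0x400),    # bit 10: vendor-defined
        "warning": bool(rc & 0x800),   # bit 11: warning rather than error
    }


TPM_RC_FAILURE = 0x101  # "commands not being accepted because of a TPM failure"

rc = 257  # decimal value printed by the kernel in dmesg
fields = split_tpm2_rc(rc)
print(hex(rc), fields, rc == TPM_RC_FAILURE)
```

The same helper applied to the 0x84 reported by Esys_GetCapability gives a format-one code with error number 4 and no handle or parameter index, which matches the "handle(unk)" in the decoded message.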

Here is the output of tpm2_pcrread after running the reproducer:

# tpm2_pcrread
ERROR:tcti:src/tss2-tcti/tcti-device.c:501:Tss2_Tcti_Device_Init() timeout waiting for response from fd 3 
WARNING:esys:src/tss2-esys/api/Esys_GetCapability.c:303:Esys_GetCapability_Finish() Received TPM Error 
ERROR:esys:src/tss2-esys/api/Esys_GetCapability.c:107:Esys_GetCapability() Esys Finish ErrorCode (0x00000084) 
ERROR: Esys_GetCapability(0x84) - tpm:handle(unk):value is out of range or is not correct for the context
ERROR: Unable to run tpm2_pcrread

And tpm2_gettestresult (the value for data before the test was 0001f9db000000000000):

# tpm2_gettestresult
ERROR:tcti:src/tss2-tcti/tcti-device.c:501:Tss2_Tcti_Device_Init() timeout waiting for response from fd 3 
status:   success
data:   0001f9db0000000001d6
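To make the change in the test-result data easier to see, here is a small sketch that compares the two hex strings quoted above byte by byte. It only reports which offsets differ; the meaning of those bytes is manufacturer-specific and is not interpreted here. The function name `changed_offsets` is made up for illustration.

```python
def changed_offsets(before: bytes, after: bytes):
    """Return (offset, old_byte, new_byte) for every position that differs."""
    if len(before) != len(after):
        raise ValueError("buffers must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(before, after)) if a != b]


# Values copied from the report: before and after running the reproducer.
before = bytes.fromhex("0001f9db000000000000")
after = bytes.fromhex("0001f9db0000000001d6")

for offset, old, new in changed_offsets(before, after):
    print(f"offset {offset}: 0x{old:02x} -> 0x{new:02x}")
```

Only the last two bytes change (0x0000 becomes 0x01d6); the first eight bytes are identical in both readings.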

Any ideas on this?

@ionut-arm
Member

Hey, thanks for reporting this!

Any chance you can get a backtrace of where the example hangs, or at least figure out which call does it (at any level of abstraction lower than create_ak)?

Output from dmesg after running the reproducer:

Is that output before the process was killed or after?

There are 4 handles that get flushed following that example: three sessions (1 for the EK, 2 for the AK), and the primary EK object. The tpm2_save_context and tpm2_commit_space errors seem to come from the kernel, though it's difficult to tell exactly where things go awry, and thus why.

Unfortunately it's kinda difficult to figure things out just from the code, especially since it seems to only fail on some select hardware. Perhaps it has something to do with the limitations of those TPMs (e.g., in terms of available memory), and we don't follow some portion of the spec in our calls?

@Superhepper
Collaborator

This is very strange.

The only time I have encountered something that hangs using the tss-esapi crate was when I had two different processes trying to access the TPM simultaneously without using a resource manager.

What version of tpm2-tss is used in the minimal reproduction example? Or what versions have you tested with and gotten the same error?

If you get a chance to run it on actual hardware, could you enable tpm2-tss trace output while running the minimal reproduction example and post what gets printed?

I will try to reproduce using our integration tests.
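On Linux, the kernel exposes its in-kernel resource manager as /dev/tpmrm0, next to the raw /dev/tpm0 node. The sketch below shows one way a test script could prefer the resource-managed node when both exist. It assumes the `device:<path>` form of the TCTI configuration string; the function name `pick_tcti` and the injectable `exists` parameter are made up for illustration.

```python
import os


def pick_tcti(candidates=("/dev/tpmrm0", "/dev/tpm0"), exists=os.path.exists):
    """Return a device TCTI string for the first node that exists, else None.

    /dev/tpmrm0 is listed first so the kernel resource manager is preferred
    over raw access to /dev/tpm0.
    """
    for path in candidates:
        if exists(path):
            return f"device:{path}"
    return None


print(pick_tcti())
```

Checking which node the failing process actually opens would help rule the concurrent-access theory in or out.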

@kkaarreell

The only time I have encountered something that hangs using the tss-esapi crate was when I had two different processes trying to access the TPM simultaneously without using a resource manager.

Hello,
this is in fact possible. We are observing this issue on a system with a kernel IMA policy enabled, so it is not just the Rust agent accessing the TPM but also the kernel adding IMA measurements. But I don't have detailed knowledge of this, so maybe it is not relevant.

@Superhepper
Collaborator

It could be. But it is kind of strange that it only happens with the tss-esapi crate. It should happen for anything that uses the ESAPI, even Python code.

@ionut-arm
Member

It should happen for anything that uses the ESAPI, even Python code.

Maybe? I don't think we understand very well why it hangs when multiple threads try to access the TPM without an RM... Maybe we could reproduce some of the errors, but if you only use the TPM simulator you won't get the calls going through the kernel stack.

@sergio-correia

I got my hands on the machine today so I was able to run some tests.

Here's the output of the example with TSS2_LOG=all+ERROR,marshal+TRACE,tcti+DEBUG: https://gist.github.com/sergio-correia/234bcc244cdff2a8b0d08f85017192d1 - I am running this on CentOS Stream 9, tpm2-tss 3.0.3, but I tried 3.2.0 and it looks similar: it just hangs at the last call.

The device seems to be IFX0740:00: 2.0 TPM (device-id 0x1B, rev-id 16)
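For anyone repeating this capture, here is a minimal sketch of running a command with the TSS2_LOG value quoted above and keeping whatever was logged even if the process hangs. It assumes tpm2-tss writes its log to stderr (the default unless redirected). The function name `run_with_tss_trace` is made up, and the child process below is a stand-in that only echoes the variable so the sketch runs without a TPM.

```python
import os
import subprocess
import sys

# Log setting quoted from the comment above.
TSS2_LOG_SETTING = "all+ERROR,marshal+TRACE,tcti+DEBUG"


def run_with_tss_trace(command, timeout_s=60.0):
    """Run `command` with TSS2_LOG set; return (exit_code_or_None, stderr_text)."""
    env = dict(os.environ, TSS2_LOG=TSS2_LOG_SETTING)
    try:
        done = subprocess.run(
            command, env=env, capture_output=True, text=True, timeout=timeout_s
        )
        return done.returncode, done.stderr
    except subprocess.TimeoutExpired as exc:
        # A hang is the failure under investigation: keep the partial log.
        partial = exc.stderr or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return None, partial


# Stand-in child: prints the TSS2_LOG value it received on stderr.
code, log = run_with_tss_trace(
    [sys.executable, "-c", "import os, sys; sys.stderr.write(os.environ['TSS2_LOG'])"]
)
print(code, log)
```

To use it for real, replace the stand-in command with the path to the reproducer binary. An exit code of None means the timeout was hit, which corresponds to the hang at the last call described above.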

@Superhepper
Collaborator

Superhepper commented Aug 28, 2022

I have been looking at this a little bit and have not detected anything strange so far. The only thing I can see is that create_ak sometimes has three object handles active at the same time, and I have seen in the past that some TPM simulators can report errors (depending on how they are compiled) when three transient objects are in use at the same time and space for a fourth one is needed.

Do you happen to have the link to the Python code that works?

@lkatalin
Author

I have seen in the past that some TPM simulators can report errors (depending on how they are compiled) when three transient objects are in use at the same time and space for a fourth one is needed

Yes, this sounds familiar as we saw it before with swtpm, although I don't know enough to make a stronger connection between that error and this one.

Do you happen to have the link to the Python code that works?

These are the tpm2_tss and tpm2_tools libraries used by the Python agent. I'm not sure where something corresponding to create_ak() is happening in that code or in the libraries, although the Python agent code is generally here. @sergio-correia @kkaarreell do you know anything more specific?

@lkatalin
Author

lkatalin commented Aug 31, 2022

Update: Looks like this is the equivalent code in the Python agent, specifically using this from the above TPM libraries.

if self.tools_version == "3.2":
    command = ["tpm2_getpubak", "-E", hex(ek_handle), "-k", "0x81010008", "-g", asym_alg, "-D", hash_alg, "-s", sign_alg, "-f", akpubfile.name, "-e", owner_pw, "-P", aik_pw, "-o", owner_pw]

elif self.tools_version in ["4.0", "4.2"]:
    command = ["tpm2_createak", "-C", hex(ek_handle), "-c", secpath, "-G", asym_alg, "-g", hash_alg, "-s", sign_alg, "-u", akpubfile.name, "-p", aik_pw, "-P", owner_pw]
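As a standalone illustration of the excerpt above, the sketch below wraps the same two branches in a function so the resulting argument list can be inspected without a TPM. The flags are copied from the excerpt. The function name `build_ak_command`, its parameter names, and the example values passed at the bottom are placeholders for illustration, not taken from the Keylime code.

```python
def build_ak_command(tools_version, ek_handle, asym_alg, hash_alg, sign_alg,
                     akpub_path, aik_pw, owner_pw, secpath=None):
    """Build the AK creation command line for the given tpm2-tools version."""
    if tools_version == "3.2":
        return [
            "tpm2_getpubak", "-E", hex(ek_handle), "-k", "0x81010008",
            "-g", asym_alg, "-D", hash_alg, "-s", sign_alg,
            "-f", akpub_path, "-e", owner_pw, "-P", aik_pw, "-o", owner_pw,
        ]
    if tools_version in ("4.0", "4.2"):
        return [
            "tpm2_createak", "-C", hex(ek_handle), "-c", secpath,
            "-G", asym_alg, "-g", hash_alg, "-s", sign_alg,
            "-u", akpub_path, "-p", aik_pw, "-P", owner_pw,
        ]
    raise ValueError(f"unsupported tools version: {tools_version}")


# Placeholder values, chosen only to show the shape of the result.
cmd = build_ak_command(
    "4.2", ek_handle=0x81010007, asym_alg="rsa", hash_alg="sha256",
    sign_alg="rsassa", akpub_path="ak.pub", aik_pw="akpass",
    owner_pw="ownerpass", secpath="ak.ctx",
)
print(" ".join(cmd))
```

The point of comparison for this issue is that the Python agent shells out to the tpm2-tools binaries, one process per command, rather than driving ESAPI calls in-process the way the Rust agent does.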


5 participants