Use of indices may leak information #29

Closed
manicprogrammer opened this issue May 18, 2020 · 15 comments
Labels
pending-close The issue will be closed in 7 days if there are no objections.

Comments

@manicprogrammer

manicprogrammer commented May 18, 2020

The use of indices in revealedIndices and the totalStatements value leaks information.

This leakage removes or undermines any zero-knowledge claims. It may not be the intent of this specification to minimize leakage; if so, this issue can be disregarded.

The use of revealed indices leaks information: it proves not only that information was withheld, but in some cases which information was withheld, and in more significant cases it may leak implicit values.

Take for instance the following two snippets of an attestation on two different subjects:
Subject A

"maritalStatus": "M", --statement 7
"dependents": [
	{
		"fullName": "Alice" --statement 8
	},
	{
		"fullName: "Bob"  --statement 9
	}
]
"employmentType": "F" --statement 10

Subject B

"maritalStatus": "M", --statement 7
"employmentType": "F" --statement 8

If each subject wished to reveal only maritalStatus and employmentType, the below is how each subject's proof might look:

Subject A

"proof": {
   ...
   "revealedIndices": [
         7,
         10
   ]
...

Subject B

"proof": {
   ...
   "revealedIndices": [
         7,
         8
   ]
...

I just leaked, in a common structure, that Subject A has 2 dependents and Subject B has 0 dependents.

If I know by convention or strict schema that, for a given credential type or a credential from a given issuer, maritalStatus is followed by the dependents (if any exist) and then by employmentType, then the indices tell me the number of dependents even though the subject did not wish to reveal that information.
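
To make that inference concrete, below is a minimal sketch in Python of what a verifier could compute from the two proofs above. The bracketing assumption (maritalStatus immediately precedes any dependent statements and employmentType immediately follows them) is hypothetical and stands in for a verifier's knowledge of the credential's canonical form.

# Minimal sketch; the ordering assumption is hypothetical and stands in
# for a verifier's knowledge of the credential's canonical form.
def infer_dependent_count(revealed):
    """Infer the number of un-revealed statements from the gap between
    two revealed statements known to bracket them."""
    marital = next((i for i, k in revealed.items() if k == "maritalStatus"), None)
    employment = next((i for i, k in revealed.items() if k == "employmentType"), None)
    if marital is None or employment is None:
        return None  # not enough revealed positions to bracket the gap
    # Under the assumed ordering, every index strictly between the two
    # bracketing statements must be an un-revealed dependent statement.
    return employment - marital - 1

print(infer_dependent_count({7: "maritalStatus", 10: "employmentType"}))  # Subject A -> 2
print(infer_dependent_count({7: "maritalStatus", 8: "employmentType"}))   # Subject B -> 0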

Even if the dependents statements were a deeper set of nested statements normalized into some variable number of statements, I can still differentiate between subjects with 0 dependents and subjects with > 0 dependents.

There are a multitude of variations on this theme by which an expected or known ordinal listing will leak undesired information.

If the intent is not to leak undisclosed information, then the proof must be structured so that not only is the redacted data not disclosed, but it is also not shown that any other data existed.

@tplooker
Contributor

@manicprogrammer, thanks for raising this. I had identified it before, but it slipped my mind to add a note to the privacy considerations section of this specification.

I would probably further clarify the issue you raise with the following.

The feasibility of a verifier discerning additional information about the un-revealed statements in a proof, based on the revealed indices and total statements, relies on the following assumptions.

  1. The verifier understands the original shape of the total data set that was signed well enough to determine which statements are un-revealed at certain indices. (It is important to highlight that this can be very hard to do when the verifier does not know the node identifiers and/or whether blank nodes were used for the un-revealed statements, as these factors affect the canonical form, in particular the ordering of the statements.)
  2. The representation of the signed information has to be in a form that enables this kind of leakage. Put another way, this problem appears to really only affect signing over arrays of elements.
  3. The revealed indices have to actually leak the information; i.e. in your example the information would only leak for certain choices of which statements to reveal from the original proof.

In summary it would be great to add this to the privacy considerations section of the specification.

Also, because what you have cited only occurs in very particular cases (i.e. the criteria outlined above, something we should continue to clarify), do you mind updating the title of the issue to "Use of indices may leak information"? It's not a foregone conclusion or absolute that the use of indices will always leak information.

@manicprogrammer changed the title from "Use of indices leaks information" to "Use of indices may leak information" May 18, 2020
@manicprogrammer
Author

manicprogrammer commented May 18, 2020

Of course. I added "may" to the title as requested, but it is definite that, by design, the use of indices always provides information beyond the specific statements chosen to be disclosed; it is just a matter of how much. The indices explicitly state that information was redacted, and they may leak more specific information by providing positional and quantitative details about both disclosed and non-disclosed data.

Again, if this is outside the privacy guarantees of the scheme then that is cool, but it seems it would be highly limiting to the utility of the scheme.

If the scheme itself is not meant to be robust against potentially leaking non-disclosed metadata then no need to read further. Every scheme has its strengths and weaknesses and its bounds of operation.

But if it is...

As for 1): if not leaking data relies on the verifier not knowing the form of the original data, then this is a problem for many domains of use. It may be that this scheme is not a candidate for those domains. I see the world as one where the verifier almost certainly knows the potential shape of the original data, in order to know what data to request and the context around it.

For 2): it certainly affects more than arrays of information. Well-shaped data can minimize leakage, but relying on a flawless data shape, regardless of how it is selectively disclosed, to keep the scheme from leaking will lead to repeated failures by implementers, especially if the implementer needs to meet both an unknown shape (no. 1) and a maximally safe shape (no. 2). People will screw it up. I know I will. Protecting against this is hard even if you don't provide the indices, due to potential dependencies between statements; but that occurs between related statements, whereas shape analysis can yield fully unrelated data.

In regards to 3): obviously an index itself doesn't reveal the content at that index. But knowledge of the existence of a statement at a defined position can tell a lot of information someone did not wish to disclose, and in the simple example it provides significant data about something they actively chose not to disclose.

My example was specific in order to demonstrate the outcome. There are lots of data contexts and shapes that may not leak additional data, but there are lots where they will. As long as the indices are provided, you are guaranteed to reveal, beyond the disclosed data, the existence of non-disclosed data and its positioning, and you may leak other significant data that can be derived from that (even more so when looking at a large set of claims from a common issuer).

Again, if this is outside the privacy guarantees of the scheme then that is cool; I just wanted to ensure it was either recognized and found not pertinent, or made visible if pertinent.

@NickDarvey

I feel like I might be missing something, but would any of this be resolved by making all fields mandatory in schemas using this kind of proof?
Like in your dependents example, everyone would have that field; it's just that for some it would be [] (or, for other kinds of fields, null).

Is there something about why that is semantically incorrect? Would it help here?

@tplooker
Contributor

tplooker commented May 19, 2020

@manicprogrammer, thanks, and also to clarify: I did not mean my comment to give the impression that the point you raise is in any way insignificant or not valid. It is important for us to make sure that these limitations and considerations are known and highlighted in the spec.

> Again, if this is outside the privacy guarantees of the scheme then that is cool, but it seems it would be highly limiting to the utility of the scheme.

To an extent; as is the case with any technology, there is a recognised set of boundaries within which it can operate, and I think eliminating all possible unwanted information disclosure through the mechanism you describe would be difficult to achieve.

> As for 1): if not leaking data relies on the verifier not knowing the form of the original data, then this is a problem for many domains of use. It may be that this scheme is not a candidate for those domains. I see the world as one where the verifier almost certainly knows the potential shape of the original data, in order to know what data to request and the context around it.

I would say that to discern information as in your example requires deeper knowledge than just the schema of the information (which I agree would probably be quite trivial to obtain in many cases). Much of the ordering of the resulting statements relies not only on the schema or vocab used but also on the information itself, e.g. the node identifiers and the allocation of blank node identifiers during the normalisation process.
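
To see why, here is a small sketch using the pyld library's URDNA2015 canonicalization (assuming pyld is installed; the @vocab URL is a placeholder, not from the spec). The dependents become blank nodes whose canonical labels, and hence positions in the sorted statement list, are derived from the data itself rather than from the schema alone.

# Sketch using pyld's URDNA2015 canonicalization; requires `pip install pyld`.
# The @vocab URL is a placeholder, not something the spec prescribes.
from pyld import jsonld

doc = {
    "@context": {"@vocab": "https://example.com/vocab#"},
    "maritalStatus": "M",
    "dependents": [{"fullName": "Alice"}, {"fullName": "Bob"}],
    "employmentType": "F",
}

# Each dependent becomes a blank node; its _:c14nN label, and therefore its
# position in the sorted n-quads output, is assigned during canonicalization.
print(jsonld.normalize(doc, {"algorithm": "URDNA2015", "format": "application/n-quads"}))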

> For 2): it certainly affects more than arrays of information. Well-shaped data can minimize leakage, but relying on a flawless data shape, regardless of how it is selectively disclosed, to keep the scheme from leaking will lead to repeated failures by implementers, especially if the implementer needs to meet both an unknown shape (no. 1) and a maximally safe shape (no. 2). People will screw it up. I know I will. Protecting against this is hard even if you don't provide the indices, due to potential dependencies between statements; but that occurs between related statements, whereas shape analysis can yield fully unrelated data.

Yes, I would agree that similar leakages could occur for representations other than arrays, for example purely knowing whether a value is present in an assertion or not; however, I think arrays are the more interesting case.

> In regards to 3): obviously an index itself doesn't reveal the content at that index. But knowledge of the existence of a statement at a defined position can tell a lot of information someone did not wish to disclose, and in the simple example it provides significant data about something they actively chose not to disclose.

I think perhaps this point was a little lost; what I meant to say is that not all proofs on the example you gave would leak the information. For example, say I revealed only my maritalStatus and not my employmentType; in this instance it would be hard for you to determine how many dependents I had, which is un-revealed.

@tplooker
Contributor

@NickDarvey you could; however, this would cause a bloat issue and add complexity to the issuance and presentation protocols, to the point where the complexity would defeat the value being pursued.

For example, say you are signing an array: what number of elements do you choose to sign as a default such that, when the array is populated in other assertions, it doesn't lead to the leakage outlined?
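
A rough sketch of that trade-off, where MAX_DEPENDENTS is an arbitrary placeholder rather than anything the spec prescribes: padding every array to a fixed maximum makes all credentials normalize to the same statement count, but it bloats every credential and still fails the moment a subject exceeds the cap.

# Rough sketch of the padding trade-off; MAX_DEPENDENTS is an arbitrary
# placeholder, not something the spec prescribes.
MAX_DEPENDENTS = 8

def pad_dependents(dependents):
    if len(dependents) > MAX_DEPENDENTS:
        # The cap is a hard functional limit: exceed it and you must either
        # truncate or issue a differently shaped (and therefore leaky) credential.
        raise ValueError("more dependents than the schema can hide")
    # Pad with nulls so every credential yields the same statement count.
    return dependents + [None] * (MAX_DEPENDENTS - len(dependents))

print(pad_dependents(["Alice", "Bob"]))  # 2 real values + 6 padding statements
print(pad_dependents([]))                # 8 padding statements signed for nothing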

@NickDarvey

Ahh I didn't realize you could selectively disclose items in an array. Understood!

@manicprogrammer
Author

All good. My goal in this issue is to ensure that certain limitations, assumptions, and boundaries are recognized, so they can be mitigated or documented for the scheme. The goal of the issue is not to ensure they don't exist. No scheme has perfect everything; there are always trade-offs, which is why different schemes for different purposes are needed.

@manicprogrammer
Author

Is the resolution action for this issue to document the assumptions (such as that explicitly providing shape information and the existence of non-disclosed data is within the implementer's data handling constraints) and the warnings/limitations (whatever word is proper) about how leakage can occur?

@manicprogrammer
Author

To clarify, just in case it is not clear: when I say "The goal of the issue is not to ensure they don't exist," I mean it is not the goal, but I clearly think this is a significant limitation in privacy assurances if the scheme is formed this way, and it could mute any ability for it to be adopted for anything but the most narrow scopes.

@tplooker
Contributor

tplooker commented May 19, 2020

@manicprogrammer I'm more than open to suggestions that would help improve the scheme in this respect, provided they do not impose an exponential increase in the complexity of the scheme. I do think this issue highlights, at the very least, that a section of the spec must be devoted to describing a set of considerations around this topic and potential mitigants that can be used, such as how the information is represented.

Do you have any potential solutions or are you aware of any other selective disclosure approaches that do not suffer from this problem?

@tplooker
Contributor

Also, as a side note for those interested in this issue: you may have noticed #30 removing the fields totalStatements and revealedIndices. This does not mean that information has gone away; it is just now encoded into the actual proofValue.

@manicprogrammer
Author

@tplooker I'll look at the code, which I have not dug into fully. I suspect you would already recognize any reasonable means of doing that, and it's just a trade-off choice between size, complexity, and other parameters. So, no, I don't have a ready proposal for a solution, or even know if there is one that would meet the other requirements. It would almost certainly be more verbose, with a much larger proof. Other redacted signature schemes I have looked at that address this scenario have a witness/proof per statement, so you get a linear size increase with each revealed statement; you could easily add 1200 bits per revealed statement, which is not what you are looking for here.

I was hoping that what I would find here, through an implementation and the use of bilinear curve/map signatures, was the best of both worlds: a succinct single proof that didn't expose the existence of redacted statements. There might be a mechanism to do that in this scheme with some tweaks, and there might not; but if the purpose of the scheme is not to be concerned with that, then there is no need to try to derive it. I am not wanting to push this scheme toward a different purpose.

@manicprogrammer
Author

As a final follow-up on this, I did review the pertinent parts of the code, and also the Rust BBS crate documentation and the papers it is based on. As you know, it is inherent in the BBS+ implementation in use that key construction takes a value for the discrete number of messages that key will sign, and the BBS+ implementation further requires the verifier to have knowledge of the message indices to do verification.
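
For illustration, here is a toy model in plain Python of what the verifier must be handed before any cryptographic check can even run. This is not the Rust crate's actual API, just a sketch of the information flow.

# Toy model of the verification interface described above; NOT the Rust BBS
# crate's API, just an illustration of what the verifier is necessarily given.
from dataclasses import dataclass

@dataclass
class PublicKey:
    message_count: int  # fixed at key construction in BBS+

@dataclass
class Proof:
    revealed: dict  # index -> revealed message; indices travel in the clear

def verify(pk, proof):
    # Even this skeletal check shows the leak: the verifier learns
    # message_count and every revealed index before any pairing math runs.
    if any(i >= pk.message_count for i in proof.revealed):
        return False
    # ... the real BBS+ pairing checks would go here ...
    return True

pk = PublicKey(message_count=10)
proof = Proof(revealed={6: "M", 9: "F"})  # 0-based indices, for illustration
print(verify(pk, proof))  # True, but the total count and positions leaked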

So, as @tplooker mentioned, short of something really major, that potential leakage is just the nature of this specific approach. We can't have it all. :-)

@OR13
Contributor

OR13 commented Apr 25, 2023

I suggest we close this and focus on adding security considerations / guidance, as I noted in #60

@OR13 added the pending-close label ("The issue will be closed in 7 days if there are no objections.") Apr 25, 2023
@OR13
Contributor

OR13 commented May 1, 2023

Pending close for 1 week, no objections, closing

@OR13 closed this as completed May 1, 2023