PercentCodec performs destructive decoding of valid UTF-8 byte sequences #667
On the surface it would seem that the environment whitelist configuration needs to be adjusted to allow the canonicalized data in the format you're expecting. From the documentation in the class, the workflow is implemented as intended: the canonicalized value is what gets validated against the whitelist. If you're using the default configurations distributed for ESAPI, then there may be an issue with one of the regexes in relation to the canonicalized form of this input. |
In the previous version of the ESAPI jar, 2.2.1.1, the raw input was directly passed to checkWhitelist. Is there any way to get around this? |
Well, yes. HTTP headers were never defined for anything other than ASCII. You have to encode characters represented with bytes larger than 127 (0x7F). Use percent encoding on non-ASCII data. UTF-8 headers are informally supported by some browsers and web application servers, but it's best practice not to let the environment make those decisions for you. From ESAPI's perspective we could discuss allowing a canonicalize-less version of that method, but you can solve your problem today by percent encoding and calling it done. |
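For illustration, a minimal JDK-only sketch of that advice; the Content-Disposition line is just one common place it applies, not something ESAPI mandates:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class HeaderEncodeDemo {
    public static void main(String[] args) {
        // Percent-encode the non-ASCII value before it reaches the header,
        // so the header value itself stays pure ASCII.
        String fileName = "测试.txt";
        String encoded = URLEncoder.encode(fileName, StandardCharsets.UTF_8);
        System.out.println(encoded); // %E6%B5%8B%E8%AF%95.txt
        // Caveat: URLEncoder targets form encoding and emits '+' for spaces;
        // replace "+" with "%20" if the consumer expects strict RFC 3986.
        // e.g. response.setHeader("Content-Disposition",
        //          "attachment; filename*=UTF-8''" + encoded);
    }
}
```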
First off, let me say that we knew that such "breakage" was possible. It was noted in ESAPI issue #588 (specifically, in my comment at #588 (comment)). IMO, it was a bug and we ought not to be relying on a bug for backward compatibility. I believe the way it behaves now is correct (unless of course we botched something in the attempted fix; looking back on issue #588, the PR #611 which "closed" it appears to have only touched "test" files. @xeno6696 - could you take a look and provide an explanation? Thanks.) @xeno6696 wrote:
Yes, we could do that. @planetlevel even mentioned something like that in issue #588: providing something like a method variant that skips canonicalization. |
@kwwall the first commit in the sequence, 92a96ec, was the actual logic change. There was some reverse-engineering in #588 breaking apart what I did and why, but the actual logic change was just adding the data parameter to the whitelist/blacklist calls, which somehow got missed back in '17. Everything else was testing, as you pointed out. Like this question, it was mostly a "behavior changed; is it the right behavior?" kind of question/analysis. We are still doing the right thing here. |
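For readers following along, here is a stripped-down, hypothetical sketch (not ESAPI source) of the flow being described: the canonicalized data is what gets matched against the whitelist, while the raw input is kept for error reporting. The regex is a stand-in, not ESAPI's actual HTTPHeaderValue pattern, and canonicalize() is a placeholder for Encoder.canonicalize().

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class WhitelistCheckSketch {
    // Stand-in whitelist; ESAPI's real pattern lives in validation.properties.
    private static final Pattern HEADER_VALUE =
            Pattern.compile("^[a-zA-Z0-9()\\-=\\*\\.\\?;,+/:&_ ]*$");

    static String getValid(String context, String input) {
        String data = canonicalize(input);     // decode %XX etc. first
        checkWhitelist(context, data, input);  // match on data, report input
        return data;
    }

    static void checkWhitelist(String context, String data, String orig) {
        if (!HEADER_VALUE.matcher(data).matches()) {
            throw new IllegalArgumentException(
                    context + ": invalid input (original: " + orig + ")");
        }
    }

    // Placeholder for Encoder.canonicalize(); real ESAPI runs all codecs.
    private static String canonicalize(String input) {
        return URLDecoder.decode(input, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(getValid("demo", "report.txt"));    // passes
        System.out.println(getValid("demo", "%E6%B5%8B.txt")); // throws: decoded 测 fails the whitelist
    }
}
```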
@xeno6696 - My bad. Not sure how I missed that commit, especially when I was looking for it. I think I had false expectations that maybe you swapped the formal parameters 'input' and 'orig' in the private method definition rather than the actual parameter order in the call. Anyhow, thanks for the quick response. I knew it had to be there somewhere. That said, I should have done better calling out more attention to issue #588 and our fix to it in the 2.2.3.0 release notes, and how it potentially could break "backward compatibility" because of the bug fix. I will take the blame for that. Lastly, unless someone can point to a revised RFC or a W3C standards document that allows HTTP header values to be non-ASCII, I think that we should close this, and if someone such as @Aakash4396 wants something like the aforementioned canonicalization-skipping method, they can open a separate feature request. I'll leave this issue open for a few more days to give it some more time for discussion, but unless someone can point out that non-ASCII header values (and names even??) are now permitted, I will opt to close this as 'wontfix'. |
I had some free time over lunch. I created the following unit test:
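(The test body itself did not survive extraction from the page. What follows is a minimal sketch of what such a test plausibly looked like, assuming JUnit 4 and the stock ESAPI encoder; the class and method names are placeholders. On the affected versions this assertion fails, which is the bug under discussion.)

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.owasp.esapi.ESAPI;

public class PercentCodecUtf8Test {

    // A percent-encoded UTF-8 sequence for "测试" should canonicalize back
    // to the original characters; the Latin-1-based PercentCodec instead
    // decodes each %XX byte to its own char, so this fails on affected versions.
    @Test
    public void canonicalizePercentEncodedUtf8() {
        String encoded = "%E6%B5%8B%E8%AF%95";
        assertEquals("测试", ESAPI.encoder().canonicalize(encoded));
    }
}
```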
At any rate @Aakash4396, if the characters you're getting at the time canonicalize is called are the percent-encoded form, a test like the one above should tell you quickly whether canonicalization is the step mangling them. If it fails, it's possible platform encoding could be getting in the way. Windows 10 standardized on UTF-8 as the default byte encoding, but if you're running anything prior to that it's cp-1252, and that can transform bytes under your feet. The same error can happen in your browser or on the target webserver, if some default somewhere is set to something other than UTF-8. |
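A quick way to check the platform-encoding suspicion (plain JDK, nothing ESAPI-specific):

```java
import java.nio.charset.Charset;

public class PlatformEncodingCheck {
    public static void main(String[] args) {
        // If these print windows-1252 rather than UTF-8, bytes can be
        // transformed before ESAPI ever sees them.
        System.out.println(System.getProperty("file.encoding"));
        System.out.println(Charset.defaultCharset());
    }
}
```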
Hi @xeno6696, the use case is setting an HTTP response header whose value contains a file name with non-ASCII characters such as 测试. |
Ah, see, you needed to LEAD with that information. You would have saved a ton of time and energy if you had said that up front. At any rate, I should have a solution ready quickly. Here's the culprit: when I converted the API over to be non-destructive to higher-order bytes, I only converted the HTMLEntityCodec, leaving the rest as an exercise for the future. The future is now. lol |
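To make the "destructive" part concrete, here is a small demo of the behavior being described, as it stands on the affected versions (assuming the public org.owasp.esapi.codecs.PercentCodec API):

```java
import org.owasp.esapi.codecs.PercentCodec;

public class DestructiveDecodeDemo {
    public static void main(String[] args) {
        PercentCodec codec = new PercentCodec();
        // The three UTF-8 bytes E6 B5 8B together encode U+6D4B ("测"),
        // but each %XX escape is decoded to its own char, i.e. as Latin-1,
        // yielding U+00E6 U+00B5 U+008B instead.
        String decoded = codec.decode("%E6%B5%8B");
        System.out.println(decoded);          // gibberish, not "测"
        System.out.println(decoded.length()); // 3, not 1
    }
}
```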
@xeno6696 - Now that you have a better grasp of the issue, do you think you could maybe suggest a revision of the issue title? I think what is listed here is way too broad, as seen by the rabbit trail that we went down. |
@Aakash4396 -- only partly in jest, one thing that would probably speed up a fix for this is if you could submit a PR to fix it. We'd credit you in the next release notes for your assistance. And if you can't do that, at least maybe contribute a few other cases where it is failing. But @xeno6696 would have to be the one who gives you an estimate of when it could be fixed, since I don't know his schedule and I'm not sure what is wrong or how to fix this. I don't think I understand enough about codepoints to know how to fix it, and I expect that is partly what is involved here. |
I started working on it, but timelines when dealing with something that isn't a job are funky things. It should be in the next release--hopefully within a couple months--but honestly if I convert one codec, I should convert them all. PercentCodec has a thorn in that, IIRC, it's also based on Latin-1, so to do this correctly I should create a version of each, and we should consider moving the default encoding to purposefully shake up some things. =-) |
All of that is to say, I'll get to it when I get to it; if you want it done faster, we accept MOST PRs. I could share a patch for my halfway conversion as it sits now if you want to run with it. |
I missed @planetlevel 's link, but yes, I intend to ensure we support UTF-8 encoding/decoding. |
Just a thought... Does the OWASP Java Encoder Project already handle this? If so, maybe we can steal some code from them. (They also are licensed under the New BSD License.) They don't have decoders, so that would still need to be written though. |
It's not as helpful as you might think--they do so without any of the decoding baggage, which is exactly what's broken here. The decoding. Encoding is simple: have a codePoint/byte, get an encoded URI string. Because Java is UTF-16 under the hood, characters that can be represented by integer codepoints are trivial to convert into UTF-8 URI strings. On decode we need to detect the correct byte string representation, which is why their code isn't super useful to us. This is also--I suspect--a key reason they chose only to encode. That entire problem goes away. But at any rate, decode logic needs to have the ability to detect multi-byte UTF-8 sequences in the percent-encoded input. So the easy part is converting the existing codec to do Unicode; the harder part is ensuring we properly detect and handle these URL-to-byte sequences. It's not super hard, but like I said, I need the better part of a day to get in the flow. |
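A sketch of the byte-accumulation idea (a hypothetical helper, not ESAPI's planned implementation, with no error handling for malformed escapes): collect the raw %XX bytes first, then decode the whole run as UTF-8, so multi-byte sequences survive as single code points.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public final class Utf8PercentDecoder {

    public static String decode(String input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < input.length(); ) {
            if (input.charAt(i) == '%' && i + 2 < input.length()) {
                // Accumulate the raw byte; defer charset interpretation.
                bytes.write(Integer.parseInt(input.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                byte[] raw = String.valueOf(input.charAt(i)).getBytes(StandardCharsets.UTF_8);
                bytes.write(raw, 0, raw.length);
                i++;
            }
        }
        // Interpret the whole byte run as UTF-8 in one pass.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("%E6%B5%8B%E8%AF%95")); // prints "测试"
    }
}
```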
@xeno6696 - Tell you what. I'll give you the weekend off to work on it! 😂 |
@xeno6696 - Note when you extend the various codecs to handle Unicode, there is a Javadoc comment in the Encoder interface that speaks to this behavior; take a look at whether it needs to change as part of this work. |
Right. The issue here is that our encoding/decoding only handles Latin-1 byte renderings. ‘%E6%B5%8B’ is rendering into incompatible and unrecoverable gibberish. |
Oh wait… you’re saying the doc is out of step. |
Tell me how to rewrite that sentence & I'll revise it. I have lots of other Javadoc revisions to Encoder, so it will save you a merge conflict. I'm fixing issue #686. The fix was to delete 3 lines and replace them with one, but the doc changes are extensive. -kevin |
So @kwwall, that comment can stay. The mistakes in the current implementation actually break that contract. Not blaming whoever came before, but their mindset was clearly pre-UTF-8's dominance. I would actually be interested to do a performance test of the Encoder project vs. ESAPI once this is finished. On paper theirs should be faster, but the change I'm working on has us treating all the underlying data as raw ints, which should give significant memory and speed advantages. |
Just so you know @kwwall, I took some time to double-check my thinking here, and I'm questioning the approach I took to rewrite this part of the library back in '17. Yes, I made a ton of bugs go away that relate to encoding issues, but something doesn't smell right. I double-checked the char data type, and it's only good for the 16-bit range, i.e. the Basic Multilingual Plane. So... I think that's what forced me into jumping to integers: going by the java.lang docs, int was the smallest data type available for me to be able to capture code points beyond U+FFFF. I wonder, before I go too far on this one, if I shouldn't set up a series of boundary tests and consider confining my implementation more firmly to that 21-bit range (up to U+10FFFF) that character encodings actually encompass? This is where I wish I took better notes sometimes. |
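The char/int boundary being described is easy to see from the JDK itself:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String s = "𝄞"; // U+1D11E, outside the 16-bit BMP that char covers
        System.out.println(s.length());        // 2: stored as a surrogate pair
        System.out.println(s.codePointAt(0));  // 119070 (0x1D11E): needs an int
        // The 21-bit ceiling for all of Unicode:
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff
    }
}
```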
Hi, is there any further progress on this ticket? We are impacted by this issue. |
I'm sure you're impacted @dlgma0, but you're dealing with volunteer hours. I won't be able to make next week's release. The reason why I didn't code this back in '17 is because it dawned on me--like it did this morning--that handling the kinds of characters you're talking about requires the percent codec to have UTF-8 encoding detection for code points > 0x00FF. In short, the ASCII case is a straightforward swap, because the default implementation of this WAS ASCII only. But to handle anything beyond that, the decode side needs real multi-byte sequence detection, and that's the part that still has to be written. |
That's okay, Matt. We're all volunteers here. Anyone who needs this urgently could help by submitting a PR that includes suitable tests; we will do a code review on it and, if it all looks good, merge it to our 'develop' branch, as well as being sure to give them proper credit in the next release notes. |
@xeno6696 Thanks for your response! Understood. Could you please add a comment to this thread once the issue is fixed, noting which release it lands in? :) |
Oh trust me, I will. Hopefully future codec updaters can build from my notes. |
Seemingly somewhat related to issue #377, but I'm not the expert here. I dropped a link to a technical report on "Unicode Security Considerations" there that may be useful here as well. |
If special characters (测试) are present in the name of a file, then the ESAPI validator fails to match the input. These characters are percent-encoded when passed to the validator, but internally the call goes to

```java
ESAPI.validator().getValidInput("setHeader", strippedValue, "HTTPHeaderValue", sc.getIntProp("HttpUtilities.MaxHeaderValueSize"), false);
```

which internally calls

```java
getValidInput(context, input, type, maxLength, allowNull, true);
```

Point to be noted: `input` is the string that has to be validated against the regex. Further, it calls the `StringValidationRule` class's `getValid(context, input)` method. The main problem occurs there: the input that is passed in is decoded via `encoder.canonicalize(input)` into `data`, in order to print it in the log message. But when these parameters are passed to `checkWhitelist(context, data, input)`, they are passed in reverse order, because internally `checkWhitelist` uses `data` to match against the regex, instead of using `input`. The actual call to `checkWhitelist` should be `checkWhitelist(context, input, data)`.
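For reference, a hypothetical repro of the reported failure (the file name and the max-length value 4096 are stand-ins; on affected versions the canonicalized form of the encoded 测试 no longer matches the HTTPHeaderValue whitelist):

```java
import org.owasp.esapi.ESAPI;
import org.owasp.esapi.errors.ValidationException;

public class HeaderValidationRepro {
    public static void main(String[] args) {
        String strippedValue = "%E6%B5%8B%E8%AF%95.txt"; // percent-encoded file name
        try {
            String valid = ESAPI.validator().getValidInput(
                    "setHeader", strippedValue, "HTTPHeaderValue", 4096, false);
            System.out.println("accepted: " + valid);
        } catch (ValidationException e) {
            // Canonicalization mangles the UTF-8 bytes, so the whitelist check fails.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```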