Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong encoding of non-ASCII characters in JSON #40

Open
jscissr opened this issue Mar 12, 2024 · 3 comments
Open

Wrong encoding of non-ASCII characters in JSON #40

jscissr opened this issue Mar 12, 2024 · 3 comments

Comments

@jscissr
Copy link

jscissr commented Mar 12, 2024

Non-ASCII characters are encoded incorrectly by rdb-cli dump.rdb json.

Example:
Add a key to redis with SET demo "Müller".
Run rdb-cli dump.rdb json.
The result is:

[{
    "demo":"M\u00c3\u00bcller"
}]

After unescaping, we get "Müller".

For comparison, rdbtools (which does not work with newer redis versions) outputs:

[{
"demo":"M\u00fcller"}]

The simplest way to fix this is to avoid escaping non-ASCII characters entirely, and output them as is:

diff --git a/src/ext/handlersToJson.c b/src/ext/handlersToJson.c
index c5addf7..7a7e299 100644
--- a/src/ext/handlersToJson.c
+++ b/src/ext/handlersToJson.c
@@ -65,7 +65,7 @@ static void outputPlainEscaping(RdbxToJson *ctx, char *p, size_t len) {
             case '\t': fprintf(ctx->outfile, "\\t"); break;
             case '\b': fprintf(ctx->outfile, "\\b"); break;
             default:
-                fprintf(ctx->outfile, (isprint(*p)) ? "%c" : "\\u%04x", (unsigned char)*p);
+                fprintf(ctx->outfile, ((unsigned char)*p > 127 || isprint(*p)) ? "%c" : "\\u%04x", (unsigned char)*p);
         }
         p++;
     }

With this change, the result is:

[{
    "demo":"Müller"
}]
@moticless
Copy link
Collaborator

Hello @jscissr,

You've raised a crucial aspect of encoding that hasn't received much attention until now.

Your proposed solution may not be suitable for handling binary data. From my perspective, the appropriate approach would involve expanding the functionality of the rdb-cli tool, particularly when dealing with JSON format, to include an option for specifying the encoding of the data. Either UTF-8 or raw data. Supporting UTF-8 would necessitate integrating a UTF-8 decoder. While it's important to have this capability, implementing it won't be a quick fix.

@jscissr
Copy link
Author

jscissr commented Mar 22, 2024

Yes, my solution doesn't handle binary data, but that's because JSON itself doesn't support that. If you want to support binary, you would have to base64-encode all strings.

If on the other hand all strings in your redis database are UTF-8 encoded, then my solution just passes through the strings as is, and you get a UTF-8 encoded JSON. You don't need to decode the UTF-8, you can just pass it through.

rdbtools parses UTF-8 and generates unicode escape sequences, but that is not necessary and increases file size.

@moticless
Copy link
Collaborator

If on the other hand all strings in your redis database are UTF-8 encoded

We cannot make that assumption. Unless the user explicitly requests the use of UTF-8, the tool should default to printing what could be considered as raw data to prevent potential data corruption.

The suggested conversion from UTF-8 to Unicode overlooks the fact that UTF-8 character encoding can vary in length, with sequences of up to 6 bytes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants