Message consumer garbles subjects with non-ascii characters #1143

Closed
cjohansen opened this issue May 14, 2024 · 10 comments
Labels
enhancement Enhancement to existing functionality

Comments

@cjohansen

cjohansen commented May 14, 2024

Observed behavior

I set up a consumer for a subject containing the non-ASCII character ø, e.g. test.løp. Running nats subscribe 'test.løp' picks up messages published with nats publish test.løp "Hello", but the jnats consumer receives the same message with the subject "test.lᅢᄌp".

jnats is able to publish messages with non-ascii characters correctly.

Expected behavior

I expect the subject received by jnats to be read as the UTF-8 string "test.løp", not "test.lᅢᄌp".

Server and client version

jnats 2.17.7
nats-server 2.10.12

Host environment

macOS

Steps to reproduce

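package nats.example;

import io.nats.client.Connection;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.Subscription;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

public class Demo {
  // connect, flush and nextMessage throw checked exceptions, so main declares them.
  public static void main(String[] args) throws Exception {
    // Connection is AutoCloseable; try-with-resources closes it on exit.
    try (Connection nc = Nats.connect("nats://localhost:4222")) {
      // Subscribe and flush before publishing so the message cannot be missed.
      Subscription sub = nc.subscribe("test.>");
      nc.flush(Duration.ofSeconds(1));

      String message = "Hello world!";
      nc.publish("test.løp", message.getBytes(StandardCharsets.UTF_8));

      // nextMessage returns null if nothing arrives within the timeout.
      Message msg = sub.nextMessage(Duration.ofSeconds(1));
      if (msg == null) {
        System.out.println("No message received");
        return;
      }

      System.out.printf("Received \"%s\" on \"%s\"%n",
                new String(msg.getData(), StandardCharsets.UTF_8),
                msg.getSubject());
    }
  }
}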
@cjohansen cjohansen added the defect Suspected defect such as a bug or regression label May 14, 2024
@scottf
Contributor

scottf commented May 14, 2024

Understood. The server does accept UTF-8 subjects, but for cross-client compatibility we decided at one point that the clients would not, and we suggest using ASCII only, as noted in our docs: https://docs.nats.io/nats-concepts/subjects

That being said, we discussed this topic today and decided that it is okay to let clients opt in to this behavior, so I will add this to my todo list.

@scottf scottf added enhancement Enhancement to existing functionality and removed defect Suspected defect such as a bug or regression labels May 14, 2024
@cjohansen
Author

Ah, I see. I thought I read somewhere that subjects were allowed to use "any printable character", but I can't find the reference. Happy to hear it will be fixed 👍

@scottf
Contributor

scottf commented May 15, 2024

> Ah, I see. I thought I read somewhere that subjects were allowed to use "any printable character", but I can't find the reference. Happy to hear it will be fixed 👍

Probably in ADR-6

@roeschter
Collaborator

roeschter commented Jun 21, 2024

I clarified the "ASCII only" rule for subjects and other names in NATS: https://docs.nats.io/nats-concepts/subjects

We don't support UTF-8 for the same reasons it's a bad idea to use Unicode for names on the web. Names need to be shared in all kinds of documents, written to log files and sometimes typed by people. Once you allow anything outside ASCII you open the floodgates, and you can no longer work with people's systems in other countries.

If you want to know where this can lead, I suggest some computer archaeology into the 90s, when Microsoft "localized" Visual Basic by actually translating the language keywords.

@cjohansen
Author

If this is how you want to do it, that's fine. I will add: The developer experience would be much smoother if both the server and the clients did the same validation on this. We ran into this problem because of inconsistent behavior between the server, CLI and Java SDK. Had non-ascii characters been outright disallowed this wouldn't have been an issue.

If you don't want to tighten validations to avoid breaking backwards compatibility then I completely agree. But in that case, making the Java SDK behave the same way as the server and CLI (i.e. more permissive, by reading subject names as UTF-8) would have given fewer surprises. It could even issue a warning when non-ASCII characters are detected at creation time.
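To make the warning idea above concrete, here is a minimal sketch of such a check. It uses only the Java standard library; `SubjectCheck`, `isAscii` and `warnIfNonAscii` are hypothetical names for illustration and are not part of jnats.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not a jnats API): flags subjects that contain non-ASCII characters. */
final class SubjectCheck {
  private SubjectCheck() {}

  /** Returns true if every character of the subject can be encoded as 7-bit ASCII. */
  static boolean isAscii(String subject) {
    return StandardCharsets.US_ASCII.newEncoder().canEncode(subject);
  }

  /** Prints a warning for non-ASCII subjects and returns the subject unchanged. */
  static String warnIfNonAscii(String subject) {
    if (!isAscii(subject)) {
      System.err.println("Warning: subject \"" + subject + "\" contains non-ASCII characters");
    }
    return subject;
  }

  public static void main(String[] args) {
    System.out.println(isAscii("test.lop"));       // prints "true"
    System.out.println(isAscii("test.l\u00f8p"));  // prints "false" (\u00f8 is ø)
  }
}
```

A client could call something like `warnIfNonAscii` when a subscription or publish subject is created, which keeps the permissive behavior while still surfacing the risk.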

@roeschter
Collaborator

roeschter commented Jun 21, 2024

Correction: There was a bit of a mixup between recommendations for subject usage and support for UTF-8.

We decided to bring back the UTF-8 support for most clients. In general UTF-8 should work. But it may be optional for some client implementations.

We recommend NOT using non-ASCII characters in subjects, as this can cause all kinds of issues in configuration files and command-line tools, and with people simply being able to read and configure the subjects. I'm speaking as somebody who has worked with the Unicode standard in IT systems since 1995(!!!). If you use non-ASCII in "names", it's at your own risk.

@cjohansen
Author

Thanks for clarifying, this sounds good to me 👍😊

@roeschter
Collaborator

roeschter commented Jun 21, 2024

I have found the issue (the subject is not UTF-8 decoded for incoming messages) and will suggest a fix.

PS: It's a feature. We are planning to optionally allow UTF-8 again in a future release. The code exists but is disabled for the current release.

@scottf
Contributor

scottf commented Jun 28, 2024

@cjohansen I have merged PR #1169.

This allows you to turn on UTF-8 support via the connect options with the supportUTF8Subjects builder method.

You can publish with a UTF-8 subject whether or not this flag is set, since on the outgoing path we simply convert strings to byte arrays using UTF-8 character encoding anyway.

Incoming messages are different: decoding a subject that might be UTF-8 takes a separate code path, so the option is required there.
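The asymmetry between the two paths can be illustrated with a small standalone model. This is a sketch, not jnats code: `SubjectCodec`, `encode`, `decode` and the `utf8Subjects` flag are hypothetical names, and using ISO-8859-1 as the stand-in for the default incoming path is an assumption chosen only because it maps one byte to one character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative model only; these names are hypothetical and not jnats internals. */
final class SubjectCodec {
  private SubjectCodec() {}

  /** Outgoing path: the subject string is always encoded as UTF-8 bytes. */
  static byte[] encode(String subject) {
    return subject.getBytes(StandardCharsets.UTF_8);
  }

  /**
   * Incoming path: the charset depends on the opt-in flag.
   * ISO-8859-1 is an assumed stand-in for a one-byte-per-character decode.
   */
  static String decode(byte[] wire, boolean utf8Subjects) {
    Charset cs = utf8Subjects ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    return new String(wire, cs);
  }

  public static void main(String[] args) {
    byte[] wire = encode("test.l\u00f8p");      // \u00f8 is ø
    System.out.println(wire.length);            // prints 9: ø takes two bytes in UTF-8
    System.out.println(decode(wire, true));     // round-trips to the original subject
    System.out.println(decode(wire, false));    // ø comes back as two separate characters
  }
}
```

With the flag off, the eight-character subject comes back as nine characters, which is the kind of garbling reported at the top of this issue. In a real application the equivalent switch is the supportUTF8Subjects builder method on the connect options mentioned above.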

@scottf scottf closed this as completed Jun 28, 2024
@cjohansen
Author

Awesome, thanks a lot 😊👍
