Severe request/reply performance hit when using allow_responses map #6058

Open
jack7803m opened this issue Oct 30, 2024 · 2 comments · May be fixed by #6064

@jack7803m

Observed behavior

When using the allow_responses map in user permissions with max set, request/reply takes a significant performance hit depending on the value of expires. If the max value is set but the expires value is unset, performance continually degrades over time.
Even with expires: "10s", request/reply throughput drops by about 70% (on my machine).

I'm not sure if there's any way around this: although I haven't looked into the server code, I would imagine that caching the request subjects indefinitely to enforce the max reply count would cause memory usage to creep up and performance to degrade over time.
That said, I think this should either be fixed (if possible) or at least documented, so it's clear that using the allow_responses map has an impact on request/reply performance. It might also be a good idea to disallow specifying max without an expiry, since that configuration seems to always lead to a memory leak.

Expected behavior

Performance should not degrade, or the performance impact of allow_responses should be documented.

Server and client version

v2.10.20

Host environment

Tested with the latest Docker image on an Ubuntu server, and separately with the server binary in a WSL Ubuntu environment.

Steps to reproduce

nats.conf with no issues:

host: 0.0.0.0

authorization {
  users = [
    {user: test, password: test, permissions: { publish: ">", subscribe: ">" }}
  ]
}

nats.conf with worse performance:

host: 0.0.0.0

authorization {
  users = [
    {user: test, password: test, permissions: { publish: ">", subscribe: ">", allow_responses: { max: 1, expires: "10s" } }}
  ]
}

nats.conf with memory leak:

host: 0.0.0.0

authorization {
  users = [
    {user: test, password: test, permissions: { publish: ">", subscribe: ">", allow_responses: { max: 1 } }}
  ]
}

Using the NATS CLI (nats) and NATS server (nats-server) binaries in three terminals:

./nats-server -c nats.conf
./nats bench test --sub 1 --reply --user test --password test
./nats bench test --pub 1 --request --user test --password test

@jack7803m jack7803m added the defect Suspected defect such as a bug or regression label Oct 30, 2024
@jack7803m
Author

I spent some time tracing back to where the response permissions are actually used, and as I expected, the replies are stored in a map.

https://github.com/nats-io/nats-server/blob/1ee2b8a11af89c8c2c2281105bee851e53250771/server/client.go#L3635C1-L3642C3

It looks like the major performance problem (confirmed locally with the CPU profiler) is coming from client.pruneReplyPerms(), because it loops over the entire map and gets called far too often. Maybe it should be called once every replyPermLimit messages instead of on every message once the map length exceeds replyPermLimit? Testing that change locally showed a massive performance improvement, but I'm not sure how it would affect memory usage on a system that isn't receiving messages all the time. Maybe it would be better to check based on a time interval?

I'm quite inexperienced with this repository, so a sanity check from any of the regular maintainers would be great.
I'm planning to submit a PR with a fix tomorrow, probably using a combination of message-count and time-interval checking, roughly along the lines of the sketch below.
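
To make that idea concrete, here is a rough sketch of what a combined count-plus-interval throttle could look like. This is not the actual nats-server code; the type names, fields, and the 4096 threshold below are all illustrative.

// Sketch: only scan the reply map when it has grown past a threshold AND a
// minimum interval has passed since the last prune, so the O(n) scan is not
// paid on every delivered message. All names here are illustrative and are
// not nats-server internals.
package sketch

import "time"

const replyPermLimit = 4096 // illustrative threshold, not the server's constant

type replyEntry struct {
	remaining int       // responses still allowed (the "max" budget)
	deadline  time.Time // zero value means no expires was configured
}

type replyTracker struct {
	replies   map[string]*replyEntry // reply subject -> permission entry
	lastPrune time.Time
	minPrune  time.Duration // never prune more often than this
}

// maybePrune removes expired entries, but only when the map is large enough
// and enough time has passed since the previous prune.
func (rt *replyTracker) maybePrune(now time.Time) {
	if len(rt.replies) <= replyPermLimit || now.Sub(rt.lastPrune) < rt.minPrune {
		return
	}
	for subj, e := range rt.replies {
		if !e.deadline.IsZero() && now.After(e.deadline) {
			delete(rt.replies, subj)
		}
	}
	rt.lastPrune = now
}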

@neilalexander neilalexander self-assigned this Oct 30, 2024
@neilalexander
Member

I think you're right that there's a problem with how these entries are cleaned up.

If there's no expiry timestamp, then a lack of response can effectively mean the entries live forever. Likewise, if there's a max response count and we don't get that many responses, they too can live forever.

My feeling is that any configuration with no expires should be invalid, so we should probably either error in that case or set a default expires value.

We might also want to run a goroutine that cleans up on a regular schedule instead of cleaning on each call to deliverMsg; otherwise the work to clean up the map just compounds.
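
As a rough illustration of that scheduled-cleanup idea (a sketch only, with made-up names rather than nats-server internals), the scan could live in a goroutine driven by a ticker so the delivery path never pays for it:

// Sketch: a background goroutine prunes expired reply entries once per
// interval, keeping the map scan off the message delivery path.
package sketch

import (
	"sync"
	"time"
)

type scheduledPruner struct {
	mu      sync.Mutex
	expires map[string]time.Time // reply subject -> expiry deadline
}

// run blocks until stop is closed, pruning expired entries once per interval.
// Callers would start it with: go p.run(time.Second, stopCh)
func (p *scheduledPruner) run(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case now := <-t.C:
			p.mu.Lock()
			for subj, deadline := range p.expires {
				if now.After(deadline) {
					delete(p.expires, subj)
				}
			}
			p.mu.Unlock()
		case <-stop:
			return
		}
	}
}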

Will have a think about it a bit more.
