Severe request/reply performance hit when using allow_responses map #6058

Open
jack7803m opened this issue Oct 30, 2024 · 2 comments · May be fixed by #6064

@jack7803m

Observed behavior

When using the allow_responses map in user permissions with max set, request/reply takes a significant performance hit depending on the value of expires. If the max value is set but the expires value is unset, performance continually degrades over time.
Even with expires: "10s", request/reply throughput drops by about 70% (on my machine).

I'm not sure if there's any way around this: although I haven't looked into the server code, I would imagine that caching the request subjects indefinitely to enforce the max reply count would cause memory usage to creep up and performance to degrade over time.
That said, I think this should either be fixed (if possible) or at least documented, so it's clear that using the allow_responses map has an impact on request/reply performance. It might also be a good idea to disallow specifying max without an expiry, since that configuration seems to always lead to a memory leak.

Expected behavior

Performance should not degrade, or the performance impact of allow_responses should be documented.

Server and client version

v2.10.20

Host environment

Tested with the latest Docker image on an Ubuntu server, and separately with the server binary in a WSL Ubuntu environment.

Steps to reproduce

nats.conf with no issues:

host: 0.0.0.0

authorization {
  users = [
    {user: test, password: test, permissions: { publish: ">", subscribe: ">" }}
  ]
}

nats.conf with worse performance:

host: 0.0.0.0

authorization {
  users = [
    {user: test, password: test, permissions: { publish: ">", subscribe: ">", allow_responses: { max: 1, expires: "10s" } }}
  ]
}

nats.conf with memory leak:

host: 0.0.0.0

authorization {
  users = [
    {user: test, password: test, permissions: { publish: ">", subscribe: ">", allow_responses: { max: 1 } }}
  ]
}

Using the NATS CLI (nats) and NATS server (nats-server) binaries in three terminals:

./nats-server -c nats.conf
./nats bench test --sub 1 --reply --user test --password test
./nats bench test --pub 1 --request --user test --password test

@jack7803m jack7803m added the defect Suspected defect such as a bug or regression label Oct 30, 2024
@jack7803m
Author

I spent some time tracing back to where the response permissions are actually used, and as I expected, the replies are stored in a map.

https://github.com/nats-io/nats-server/blob/1ee2b8a11af89c8c2c2281105bee851e53250771/server/client.go#L3635C1-L3642C3

It looks like the major performance problem (confirmed locally with the CPU profiler) is coming from client.pruneReplyPerms(), because it loops over the entire map and gets called far too often. Maybe it should be called once every replyPermLimit messages instead of on every message once the map length exceeds replyPermLimit? Testing that change locally showed a massive performance improvement, but I'm not sure how it would affect memory usage on a system that isn't receiving messages all the time. Maybe it would be better to check based on a time interval?

I'm quite inexperienced with this repository, so a sanity check from any of the regular maintainers would be great.
I'm planning to submit a PR with a fix tomorrow, probably using a combination of message-count and time-interval checking, roughly along the lines of the sketch below.
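
To make that idea concrete, here is a rough sketch of what a combined count-plus-interval throttle could look like. This is not the actual nats-server code; the type names, fields, and the 4096 threshold below are all illustrative.

// Sketch: only scan the reply map when it has grown past a threshold AND a
// minimum interval has passed since the last prune, so the O(n) scan is not
// paid on every delivered message. All names here are illustrative and are
// not nats-server internals.
package sketch

import "time"

const replyPermLimit = 4096 // illustrative threshold, not the server's constant

type replyEntry struct {
	remaining int       // responses still allowed (the "max" budget)
	deadline  time.Time // zero value means no expires was configured
}

type replyTracker struct {
	replies   map[string]*replyEntry // reply subject -> permission entry
	lastPrune time.Time
	minPrune  time.Duration // never prune more often than this
}

// maybePrune removes expired entries, but only when the map is large enough
// and enough time has passed since the previous prune.
func (rt *replyTracker) maybePrune(now time.Time) {
	if len(rt.replies) <= replyPermLimit || now.Sub(rt.lastPrune) < rt.minPrune {
		return
	}
	for subj, e := range rt.replies {
		if !e.deadline.IsZero() && now.After(e.deadline) {
			delete(rt.replies, subj)
		}
	}
	rt.lastPrune = now
}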

@neilalexander neilalexander self-assigned this Oct 30, 2024
@neilalexander
Member

I think you're right that there's a problem with how these entries are cleaned up.

If there's no expiry timestamp, then a lack of response can effectively mean the entries live forever. Likewise, if there's a max response count and we don't get that many responses, they too can live forever.

My feeling is that any configuration with no expires should be invalid, so we should probably either error in that case or set a default expires value.

We might also want to run a goroutine that cleans up on a regular schedule instead of cleaning on each call to deliverMsg; otherwise the work to clean up the map just compounds.
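
As a rough illustration of that scheduled-cleanup idea (a sketch only, with made-up names rather than nats-server internals), the scan could live in a goroutine driven by a ticker so the delivery path never pays for it:

// Sketch: a background goroutine prunes expired reply entries once per
// interval, keeping the map scan off the message delivery path.
package sketch

import (
	"sync"
	"time"
)

type scheduledPruner struct {
	mu      sync.Mutex
	expires map[string]time.Time // reply subject -> expiry deadline
}

// run blocks until stop is closed, pruning expired entries once per interval.
// Callers would start it with: go p.run(time.Second, stopCh)
func (p *scheduledPruner) run(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case now := <-t.C:
			p.mu.Lock()
			for subj, deadline := range p.expires {
				if now.After(deadline) {
					delete(p.expires, subj)
				}
			}
			p.mu.Unlock()
		case <-stop:
			return
		}
	}
}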

Will have a think about it a bit more.
