
Allow benchmark-worker heap size to flex with the size of the machine #78

Open

wants to merge 1 commit into main

Conversation

tmgstevens
Contributor

No description provided.

@tmgstevens requested a review from a team as a code owner on November 22, 2023 11:22
Member

@StephanDollberg left a comment

Not a JVM tuning expert, but those sound sensible.

The only one I'd be concerned about is the GC threads option, which is being set to 32. Would this be a problem on smaller machines?
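
For context, a minimal sketch of the setting in question, assuming standard HotSpot flag names (the count 32 comes from this discussion; the flag names are my assumption, since the diff isn't shown here):

```sh
# Assumed shape of the explicit GC-thread settings under discussion
# (ParallelGCThreads/ConcGCThreads are standard HotSpot flags; 32 is the
# value mentioned above).
JVM_GC_OPTS="-XX:ParallelGCThreads=32 -XX:ConcGCThreads=32"

# Dropping these lets the JVM's ergonomics size the GC thread pools from the
# number of available CPUs, which is friendlier to smaller machines.
```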

@tmgstevens
Contributor Author

I think I took that from upstream. We could probably take that one out and let the JVM choose the number of threads. What do you think?

@StephanDollberg
Member

> I think I took that from upstream. We could probably take that one out and let the JVM choose the number of threads. What do you think?

Yeah, let's maybe take those out.

@travisdowns
Member

@tmgstevens coming back to this one: I don't see this long list of tuning options upstream; over there it is just a quite simple -Xms4G -Xmx4G -XX:+UseG1GC.

Maybe this is tuning for a big box?

I think host-size-specific tuning might be better off in the deployment & launch scripts for specific scenarios, as we have in the Redpanda folder, since things like very large heap sizes (as we had hardcoded in the past) and thread counts are tied to specific benchmark deployments with machines of a certain size. The idea is that you override HEAP_OPTS when you launch the worker.

I do like the idea of letting the heap "flex" with the machine size, because we also currently run into problems with the 8G heap on small machines, where we get an OOME.

What if we just put in the "flex" part (which is swapping out Xms/Xmx for MaxRAMPercentage and friends) for now?
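
A minimal sketch of that "flex" swap, assuming the worker's heap settings live in a single JVM_MEM-style shell variable (the 75.0 is illustrative, not a value proposed in this PR):

```sh
# Fixed sizing (the 8G case mentioned above, which OOMEs on small machines):
#   JVM_MEM="-Xms8G -Xmx8G -XX:+UseG1GC"

# Flex sizing: let the heap scale with the machine's RAM instead.
JVM_MEM="-XX:InitialRAMPercentage=75.0 -XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"
```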

@travisdowns
Member

Ping @tmgstevens.

@tmgstevens
Contributor Author

@travisdowns it comes from here: https://github.com/openmessaging/benchmark/blob/master/driver-kafka/deploy/ssd-deployment/deploy.yaml#L229

The other salient reference is https://github.com/confluentinc/benchmark/blob/10x-performance-blog/driver-kafka/deploy/confluent-deployment/deploy.yaml#L335

But now I see https://github.com/redpanda-data/openmessaging-benchmark/blob/main/driver-redpanda/deploy/deploy.yaml#L389 as well.

So the purpose of this change was specifically for when Ansible isn't being used, i.e. the Helm chart install. Otherwise, the settings in the Ansible are going to take precedence, although tbh we should change that so it's not hardcoded to 50GB.

@travisdowns
Member

Thanks @tmgstevens. I didn't know the Ansible actually just modified the start script in situ with a regex. Among other things, this means we'd better be careful that the JVM_MEM assignment remains on a single line (i.e., if it used a \ + newline continuation, this would break) and that the internal logic remains compatible with overriding that variable.
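
To make the single-line constraint concrete, a sketch of the two shapes in the start script, assuming the Ansible regex matches line by line (only the JVM_MEM name comes from this thread; the values are illustrative):

```sh
# Safe for a line-oriented in-situ regex: one assignment on one line.
JVM_MEM="-XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"

# Would break the substitution: a backslash continuation splits the
# assignment across lines, so the regex no longer sees the whole value.
#   JVM_MEM="-XX:MaxRAMPercentage=75.0 \
#            -XX:+UseG1GC"
```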

Anyway, back to the change:

Here's a concrete suggestion: what if we just make these variables properly overridable by env vars (same names) passed in by the caller? Then the Helm chart (and, arguably, Ansible) can just set those env vars to whatever they want depending on the scenario they are setting up. One default to rule them all doesn't seem like it's going to work anyway, since we presumably have different scenarios we might set up for.
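
A minimal sketch of that pattern in the worker start script, under the assumption that the variable keeps its current name and the worker binary lives at bin/benchmark-worker (both the default value and the binary path are illustrative):

```sh
# In the start script: honour a caller-provided JVM_MEM, otherwise fall back
# to a scenario-neutral default (value here is illustrative).
DEFAULT_JVM_MEM="-XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"
JVM_MEM="${JVM_MEM:-$DEFAULT_JVM_MEM}"

# Caller side (Helm chart, Ansible, or by hand) picks per-scenario values:
#   JVM_MEM="-Xms4G -Xmx4G -XX:+UseG1GC" bin/benchmark-worker
```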
