[Bug][minimal reproducible demo] High variability across batch inference runs #1729
Comments
Also able to confirm this; I also get it with flashinfer on vLLM.
@FredericOdermatt @jonzhep This is very helpful. We will take a close look this week and hopefully fix it soon.
This is what I got when running your example commands (normal server start) on 8xH100 with the current main (87a7cfa). It can basically reproduce what you said, although not as bad as what you show. I will start investigating. May I know the hardware you are using? You can also get that by running `python3 -m sglang.check_env`.
I was running this on either 8 RTX A6000s or 4 A100s. The plot above is from the RTX A6000s.

python3 -m sglang.check_env
Checklist
Describe the bug
Background
This bug might be related to #1316.
When asking the model a block of questions it should answer with yes, followed by a block of questions that should be answered with no, a degradation in quality can be observed for some runs when running the same data many times.
Standard
lmsysorg/sglang:v0.3.3.post1-cu121-srt
Asking the same 40 yes and 40 no questions 200 times and recording logit averages.
Blue: average yes logit (post-softmax) for questions that should be answered yes.
Orange: average yes logit (post-softmax) for questions that should be answered no.
(please check the minimal reproducible sample here)
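To make the metric concrete, here is a minimal sketch of how the blue/orange averages described above could be computed. This is not the reporter's actual script; the function names and the two-logit (yes, no) pairing are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_yes_probability(yes_no_logits):
    """Average post-softmax probability of the 'yes' token.

    yes_no_logits: list of (yes_logit, no_logit) pairs, one per question
    in a block.
    """
    probs = [softmax([y, n])[0] for y, n in yes_no_logits]
    return sum(probs) / len(probs)

# Questions that should be answered yes: yes logit dominates (blue line).
blue = average_yes_probability([(5.0, 1.0), (4.0, 0.5)])
# Questions that should be answered no: yes logit is low (orange line).
orange = average_yes_probability([(0.5, 4.0), (1.0, 5.0)])
```

In a healthy run the blue average stays near 1 and the orange average near 0; the reported bug is that these averages drift across repeated runs of identical data.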
Restricted
lmsysorg/sglang:v0.3.3.post1-cu121-srt
Adding the following flags and running 100 times:
Observations
The model consistently predicts the yes token for questions that should be answered with yes when set up correctly.
On v0.2.6, runs are equally yes when asking only the yes questions (simply commenting out the 40 questions that should be answered with no).
This observation makes me suspect a caching mechanism.
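Since the degradation only shows up in some runs, one way to quantify it is to flag runs whose average post-softmax yes probability (for the yes-question block) drops below a cutoff. A minimal sketch, with a purely illustrative 0.9 threshold not taken from the report:

```python
def flag_degraded_runs(run_averages, threshold=0.9):
    """Return indices of runs whose average yes probability for the
    yes-question block falls below the threshold.

    run_averages: one average post-softmax yes probability per run.
    The default threshold of 0.9 is an assumed, illustrative cutoff.
    """
    return [i for i, p in enumerate(run_averages) if p < threshold]

# Synthetic example: five repeated runs, two of them degraded.
averages = [0.97, 0.96, 0.55, 0.98, 0.62]
degraded = flag_degraded_runs(averages)
```

Counting flagged runs under the normal versus restricted server configurations would give a single number to compare when bisecting which flag (radix cache, CUDA graphs, etc.) introduces the variability.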
Further notes
Reproduction
Current minimal reproducible example here
Normal server start
python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 --random-seed 42 --tp-size 8 --dp-size 1 --host 0.0.0.0 --port 30001
Restricted server start
python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 --attention-backend triton --sampling-backend pytorch --disable-radix-cache --disable-regex-jump-forward --disable-cuda-graph --disable-cuda-graph-padding --disable-disk-cache --disable-custom-all-reduce --disable-mla --random-seed 42 --tp-size 8 --dp-size 1 --host 0.0.0.0 --port 30001
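Once either server is up, per-question yes probabilities can be recovered from first-token logprobs. The helper below is a sketch assuming an OpenAI-style completions response, where `top_logprobs` for the first generated token is a mapping from token string to log probability; whether the tokenizer emits `"yes"` or `" yes"` (with a leading space) is model-dependent, so both are checked.

```python
import math

def yes_probability(top_logprobs):
    """Probability mass assigned to the 'yes' token.

    top_logprobs: dict mapping token strings to log probabilities for
    the first generated token, in the OpenAI-style completions format
    (an assumption about the response shape, not verified here).
    """
    for token in ("yes", " yes"):
        if token in top_logprobs:
            return math.exp(top_logprobs[token])
    # 'yes' not in the returned top-k: treat its mass as negligible.
    return 0.0

# Hypothetical first-token logprobs from one completion:
lp = {" yes": -0.05, " no": -3.2}
p_yes = yes_probability(lp)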
Environment
Environment for problematic runs
lmsysorg/sglang:v0.3.3.post1-cu121-srt