[Bug][minimal reproducible demo] High variability across batch inference runs #1729

Open · 5 tasks done
FredericOdermatt opened this issue Oct 20, 2024 · 4 comments
Labels: bug (Something isn't working)

FredericOdermatt commented Oct 20, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Background

This bug might be related to #1316.

When the model is asked a block of questions it should answer with "yes", followed by a block of questions it should answer with "no", a degradation in quality can be observed on some runs when the same data is run many times.

Standard lmsysorg/sglang:v0.3.3.post1-cu121-srt

Asking the same 40 yes-questions and 40 no-questions 200 times and recording the average post-softmax probability of the "yes" token.
Blue: questions that should be answered "yes" (average "yes" probability, post-softmax).
Orange: questions that should be answered "no" (average "yes" probability, post-softmax).
(Please check the minimal reproducible sample here.)
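For reference, a rough sketch of this kind of measurement loop (not the author's actual gist): it queries the server started in the Reproduction section on port 30001 through its OpenAI-compatible endpoint, reads the top logprobs of the first generated token, and averages the post-softmax "yes" probability over each question block. The placeholder questions are illustrative, the real demo batches the questions rather than sending them one by one, and the assumption that this sglang version returns top logprobs in the standard OpenAI response shape is untested here.

# Sketch only: assumes the "Normal server start" command below is running on
# port 30001 and that its OpenAI-compatible endpoint supports logprobs.
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30001/v1", api_key="EMPTY")
MODEL = "mistralai/Mixtral-8x22B-Instruct-v0.1"

YES_QUESTIONS = ["Is water wet? Answer yes or no."]   # placeholders for the
NO_QUESTIONS = ["Is fire cold? Answer yes or no."]    # 40 + 40 real questions

def yes_probability(question: str) -> float:
    # Probability mass (post-softmax) placed on a leading "yes" answer token.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        max_tokens=1,
        temperature=0.0,
        logprobs=True,
        top_logprobs=10,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return sum(math.exp(t.logprob) for t in top
               if t.token.strip().lower() == "yes")

for run in range(200):
    yes_avg = sum(map(yes_probability, YES_QUESTIONS)) / len(YES_QUESTIONS)
    no_avg = sum(map(yes_probability, NO_QUESTIONS)) / len(NO_QUESTIONS)
    print(f"run {run}: yes-block avg {yes_avg:.3f}, no-block avg {no_avg:.3f}")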

[Figure: per-run average "yes" probability for the yes block (blue) and no block (orange), standard server, 200 runs]

Restricted lmsysorg/sglang:v0.3.3.post1-cu121-srt

Adding the following flags and running 100 times:

--attention-backend triton --sampling-backend pytorch --disable-radix-cache --disable-regex-jump-forward --disable-cuda-graph --disable-cuda-graph-padding --disable-disk-cache --disable-custom-all-reduce --disable-mla

[Figure: per-run average "yes" probability with the restricted flags, 100 runs; no degraded runs]

Observations

  • We see that Mixtral-8x22B should place an average of around 0.93 probability mass on the "yes" token for questions that should be answered with yes when set up correctly.
  • For the current Docker image (v0.3.3.post1), some intermittent runs drop as low as 0.5 on average.
  • A more restricted setup (disabling caches etc.) does not show the deteriorating behavior.
  • The behavior happens irrespective of the random seed choice.
  • I was able to reproduce the behavior on sglang v0.2.6 as well.
  • The behavior does not happen if all correct answers are yes (simply commenting out the 40 questions that should be answered with no).
    [Figure: yes-only run shows no degradation across runs]
    This observation makes me suspect a caching mechanism (see the sketch below).
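One way to probe that suspicion (a sketch under the assumption that this server version exposes sglang's /flush_cache route; the exact method may differ by version): flush the radix cache between repetitions of the measurement loop and check whether the intermittent outlier runs disappear, which would point at cross-run prefix reuse rather than within-run batching.

import requests

def flush_radix_cache(base_url: str = "http://localhost:30001") -> None:
    # Ask the running sglang server to drop its prefix/radix cache.
    # Assumption: the route accepts POST; some versions register it as GET.
    resp = requests.post(f"{base_url}/flush_cache", timeout=30)
    resp.raise_for_status()

# Calling this once at the top of each repetition isolates cross-run cache
# reuse; --disable-radix-cache (used in the restricted setup above) removes
# the cache entirely.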

Further notes

  • I haven't checked yet whether the long prompt is really necessary (see the minimal example); I can run that experiment when I next get the chance.

Reproduction

Current minimal reproducible example here

Normal server start

python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 --random-seed 42 --tp-size 8 --dp-size 1 --host 0.0.0.0 --port 30001

Restricted server start
python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 --attention-backend triton --sampling-backend pytorch --disable-radix-cache --disable-regex-jump-forward --disable-cuda-graph --disable-cuda-graph-padding --disable-disk-cache --disable-custom-all-reduce --disable-mla --random-seed 42 --tp-size 8 --dp-size 1 --host 0.0.0.0 --port 30001

Environment

Environment for problematic runs
lmsysorg/sglang:v0.3.3.post1-cu121-srt

zhyncs added the bug (Something isn't working) label on Oct 21, 2024

jonzhep commented Oct 22, 2024

I am also able to confirm this; I also get it with FlashInfer on vLLM.

merrymercy commented Oct 24, 2024

@FredericOdermatt @jonzhep This is very helpful. We will take a close look this week and hopefully fix it soon.

merrymercy commented Oct 24, 2024

This is what I got when running your example commands (Normal server start) on 8xH100 with the current main (87a7cfa)

[Figure: yes_no_logits, per-run averages on 8xH100 with current main (87a7cfa)]

This basically reproduces what you describe, although not as badly as what you show. I will start investigating. May I know the hardware you are using? You can get that by running python3 -m sglang.check_env

FredericOdermatt (Author) commented

I was running this on either 8x RTX A6000 or 4x A100. The plot above is from the RTX A6000s.

python3 -m sglang.check_env
Python: 3.10.15 (main, Sep  7 2024, 18:35:33) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA RTX A6000
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.183.06
PyTorch: 2.4.0+cu121
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.45.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.10
fastapi: 0.115.0
hf_transfer: 0.1.8
huggingface_hub: 0.25.2
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.2
uvicorn: 0.31.1
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.12
openai: 1.51.2
tiktoken: 0.8.0
anthropic: Module Not Found
litellm: Module Not Found
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU1    NV4      X      NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU2    NODE    NODE     X      NV4     SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU3    NODE    NODE    NV4      X      SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NV4     NODE    NODE    64-127,192-254  1               N/A
GPU5    SYS     SYS     SYS     SYS     NV4      X      NODE    NODE    64-127,192-254  1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV4     64-127,192-254  1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV4      X      64-127,192-254  1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576
