[Bug][minimal reproducible demo] High variability across batch inference runs #1729

Open · 5 tasks done
FredericOdermatt opened this issue Oct 20, 2024 · 4 comments
Labels: bug (Something isn't working)

FredericOdermatt commented Oct 20, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Background

This bug might be related to #1316.

When the model is asked a block of questions it should answer with "yes", followed by a block of questions it should answer with "no", a degradation in quality can be observed on some runs when the same data is run many times.

Standard lmsysorg/sglang:v0.3.3.post1-cu121-srt

Asking the same 40 yes-questions and 40 no-questions 200 times and recording the average post-softmax probability of the "yes" token.
Blue: questions that should be answered "yes" (average "yes" probability, post-softmax).
Orange: questions that should be answered "no" (average "yes" probability, post-softmax).
(Please check the minimal reproducible sample here.)
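For reference, a rough sketch of this kind of measurement loop (not the author's actual gist): it queries the server started in the Reproduction section on port 30001 through its OpenAI-compatible endpoint, reads the top logprobs of the first generated token, and averages the post-softmax "yes" probability over each question block. The placeholder questions are illustrative, the real demo batches the questions rather than sending them one by one, and the assumption that this sglang version returns top logprobs in the standard OpenAI response shape is untested here.

# Sketch only: assumes the "Normal server start" command below is running on
# port 30001 and that its OpenAI-compatible endpoint supports logprobs.
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30001/v1", api_key="EMPTY")
MODEL = "mistralai/Mixtral-8x22B-Instruct-v0.1"

YES_QUESTIONS = ["Is water wet? Answer yes or no."]   # placeholders for the
NO_QUESTIONS = ["Is fire cold? Answer yes or no."]    # 40 + 40 real questions

def yes_probability(question: str) -> float:
    # Probability mass (post-softmax) placed on a leading "yes" answer token.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        max_tokens=1,
        temperature=0.0,
        logprobs=True,
        top_logprobs=10,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return sum(math.exp(t.logprob) for t in top
               if t.token.strip().lower() == "yes")

for run in range(200):
    yes_avg = sum(map(yes_probability, YES_QUESTIONS)) / len(YES_QUESTIONS)
    no_avg = sum(map(yes_probability, NO_QUESTIONS)) / len(NO_QUESTIONS)
    print(f"run {run}: yes-block avg {yes_avg:.3f}, no-block avg {no_avg:.3f}")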

[Figure: per-run average "yes" probability for the yes block (blue) and no block (orange), standard server, 200 runs]

Restricted lmsysorg/sglang:v0.3.3.post1-cu121-srt

Adding the following flags and running 100 times:

--attention-backend triton --sampling-backend pytorch --disable-radix-cache --disable-regex-jump-forward --disable-cuda-graph --disable-cuda-graph-padding --disable-disk-cache --disable-custom-all-reduce --disable-mla

[Figure: per-run average "yes" probability with the restricted flags, 100 runs; no degraded runs]

Observations

  • We see that Mixtral-8x22B should place an average of around 0.93 probability mass on the "yes" token for questions that should be answered with yes when set up correctly.
  • For the current Docker image (v0.3.3.post1), some intermittent runs drop as low as 0.5 on average.
  • A more restricted setup (disabling caches etc.) does not show the deteriorating behavior.
  • The behavior happens irrespective of the random seed choice.
  • I was able to reproduce the behavior on sglang v0.2.6 as well.
  • The behavior does not happen if all correct answers are yes (simply commenting out the 40 questions that should be answered with no).
    [Figure: yes-only run shows no degradation across runs]
    This observation makes me suspect a caching mechanism (see the sketch below).
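One way to probe that suspicion (a sketch under the assumption that this server version exposes sglang's /flush_cache route; the exact method may differ by version): flush the radix cache between repetitions of the measurement loop and check whether the intermittent outlier runs disappear, which would point at cross-run prefix reuse rather than within-run batching.

import requests

def flush_radix_cache(base_url: str = "http://localhost:30001") -> None:
    # Ask the running sglang server to drop its prefix/radix cache.
    # Assumption: the route accepts POST; some versions register it as GET.
    resp = requests.post(f"{base_url}/flush_cache", timeout=30)
    resp.raise_for_status()

# Calling this once at the top of each repetition isolates cross-run cache
# reuse; --disable-radix-cache (used in the restricted setup above) removes
# the cache entirely.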

Further notes

  • I haven't checked yet whether the long prompt is really necessary (see the minimal example); I can run that experiment when I next get the chance.

Reproduction

Current minimal reproducible example here

Normal server start

python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 --random-seed 42 --tp-size 8 --dp-size 1 --host 0.0.0.0 --port 30001

Restricted server start
python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 --attention-backend triton --sampling-backend pytorch --disable-radix-cache --disable-regex-jump-forward --disable-cuda-graph --disable-cuda-graph-padding --disable-disk-cache --disable-custom-all-reduce --disable-mla --random-seed 42 --tp-size 8 --dp-size 1 --host 0.0.0.0 --port 30001

Environment

Environment for problematic runs
lmsysorg/sglang:v0.3.3.post1-cu121-srt

zhyncs added the bug (Something isn't working) label on Oct 21, 2024

jonzhep commented Oct 22, 2024

I am also able to confirm this; I also get it with FlashInfer on vLLM.

merrymercy commented Oct 24, 2024

@FredericOdermatt @jonzhep This is very helpful. We will take a close look this week and hopefully fix it soon.

merrymercy commented Oct 24, 2024

This is what I got when running your example commands (Normal server start) on 8xH100 with the current main (87a7cfa)

[Figure: yes_no_logits, per-run averages on 8xH100 with current main (87a7cfa)]

This basically reproduces what you describe, although not as badly as what you show. I will start investigating. May I know the hardware you are using? You can get that by running python3 -m sglang.check_env

FredericOdermatt (Author) commented

I was running this on either 8x RTX A6000 or 4x A100. The plot above is from the RTX A6000s.

python3 -m sglang.check_env
Python: 3.10.15 (main, Sep  7 2024, 18:35:33) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA RTX A6000
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.183.06
PyTorch: 2.4.0+cu121
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.45.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.10
fastapi: 0.115.0
hf_transfer: 0.1.8
huggingface_hub: 0.25.2
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.2
uvicorn: 0.31.1
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.12
openai: 1.51.2
tiktoken: 0.8.0
anthropic: Module Not Found
litellm: Module Not Found
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU1    NV4      X      NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU2    NODE    NODE     X      NV4     SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU3    NODE    NODE    NV4      X      SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NV4     NODE    NODE    64-127,192-254  1               N/A
GPU5    SYS     SYS     SYS     SYS     NV4      X      NODE    NODE    64-127,192-254  1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV4     64-127,192-254  1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV4      X      64-127,192-254  1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576
