[Bug] Got error with awq_marlin quantization args. #1792

Open

liangzelang opened this issue Oct 25, 2024 · 0 comments

liangzelang commented Oct 25, 2024

Checklist

  • I have searched related issues but cannot get the expected help.

  • The bug has not been fixed in the latest version.

  • Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

  • If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.

  • Please use English, otherwise it will be closed.

Describe the bug

I used the AutoAWQ tool to quantize the Deepseek-V2 model. The quantization script is as follows and produces a quantized model; I expected to obtain a model in awq_marlin quantization format.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


model_path = 'path/to/Deepseek-V2'
quant_path = 'path/to/Deepseek-V2_marlin'
quant_config = { "zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

The config.json corresponding to the quantized model is as follows.

{
  "_name_or_path": "/path/to/Deepseek-V2",
  "architectures": [
    "DeepseekV2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_deepseek.DeepseekV2Config",
    "AutoModel": "modeling_deepseek.DeepseekV2Model",
    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
  },
  "aux_loss_alpha": 0.001,
  "bos_token_id": 100000,
  "eos_token_id": 100001,
  "ep_size": 1,
  "first_k_dense_replace": 1,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 12288,
  "kv_lora_rank": 512,
  "max_position_embeddings": 163840,
  "model_type": "deepseek_v2",
  "moe_intermediate_size": 1536,
  "moe_layer_freq": 1,
  "n_group": 8,
  "n_routed_experts": 80,
  "n_shared_experts": 2,
  "norm_topk_prob": false,
  "num_attention_heads": 128,
  "num_experts_per_tok": 6,
  "num_hidden_layers": 60,
  "num_key_value_heads": 128,
  "pretraining_tp": 1,
  "q_lora_rank": 1536,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "awq",
    "version": "marlin",
    "zero_point": false
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 0.707,
    "mscale_all_dim": 0.707,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  },
  "rope_theta": 10000,
  "routed_scaling_factor": 16.0,
  "scoring_func": "softmax",
  "seq_aux": true,
  "tie_word_embeddings": false,
  "topk_group": 3,
  "topk_method": "group_limited_greedy",
  "torch_dtype": "float16",
  "transformers_version": "4.45.2",
  "use_cache": false,
  "v_head_dim": 128,
  "vocab_size": 102400
}

Then, I used SGLang to run the quantized model with the following command.

python -m sglang.launch_server --trust-remote-code --model-path $MODEL_PATH --port $SERVER_PORT --quantization awq_marlin --tp 4 --mem-fraction-static 0.9

And got the following error:

[2024-10-25 18:12:30 TP0] Traceback (most recent call last):
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1115, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank)
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 146, in __init__
    self.tp_worker = TpModelWorker(
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 58, in __init__
    self.model_runner = ModelRunner(
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 147, in __init__
    self.load_model()
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 234, in load_model
    self.vllm_model_config = VllmModelConfig(
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/config.py", line 227, in __init__
    self._verify_quantization()
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/config.py", line 296, in _verify_quantization
    raise ValueError(
ValueError: Quantization method specified in the model config (awq) does not match the quantization method specified in the `quantization` argument (awq_marlin).
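
As far as I can tell, this is a plain comparison between the quant_method that AutoAWQ writes into quantization_config (which stays "awq" even when "version" is "marlin") and the --quantization argument. Below is a minimal sketch of that kind of check, assuming the logic is roughly what the error message implies; the function name and path are illustrative, not vLLM's actual code.

import json

def verify_quantization(config_path: str, requested: str) -> None:
    # Read the quant_method that AutoAWQ recorded in config.json
    # ("awq" here, even though "version" is "marlin").
    with open(config_path) as f:
        recorded = json.load(f)["quantization_config"]["quant_method"]
    if recorded != requested:
        raise ValueError(
            f"Quantization method specified in the model config ({recorded}) "
            f"does not match the quantization method specified in the "
            f"`quantization` argument ({requested})."
        )

# verify_quantization("path/to/Deepseek-V2_marlin/config.json", "awq_marlin")  # raises ValueError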

If I instead launch with --quantization awq (matching the quant_method in quantization_config), I also get an error.

  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1115, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank)
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 146, in __init__
    self.tp_worker = TpModelWorker(
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 58, in __init__
    self.model_runner = ModelRunner(
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 147, in __init__
    self.load_model()
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 251, in load_model
    self.model = get_model(
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 341, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 170, in _initialize_model
    return build_model(
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 155, in build_model
    return model_class(config=hf_config,
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 648, in __init__
    self.model = DeepseekV2Model(config, cache_config, quant_config)
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 608, in __init__
    [
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 609, in <listcomp>
    DeepseekV2DecoderLayer(
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 551, in __init__
    self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 113, in __init__
    self.experts = FusedMoE(
  File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 192, in __init__
    assert self.quant_method is not None
AssertionError
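
This assertion seems to fire because the AWQ quantization config does not hand a quant method to the fused MoE layers that Deepseek-V2 uses, so FusedMoE ends up with None. Below is a minimal sketch of that failure pattern, under the assumption that per-layer quant methods are resolved by layer type; the class and method names are illustrative, not vLLM's actual code.

class LinearLayer:
    pass

class FusedMoELayer:
    pass

class AWQLikeConfig:
    # Assumption: only plain linear layers get an AWQ quant method;
    # other layer types (e.g. a fused MoE layer) fall through to None.
    def get_quant_method(self, layer):
        if isinstance(layer, LinearLayer):
            return "awq_linear_method"
        return None

quant_config = AWQLikeConfig()
moe_layer = FusedMoELayer()
quant_method = quant_config.get_quant_method(moe_layer)
assert quant_method is not None  # AssertionError, mirroring the traceback above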

So how can I run an awq_marlin (or marlin) quantized model with SGLang?

Reproduction

  1. Quantize the Deepseek-V2 model; in fact, you can use a small model to reproduce the issue.

  2. Run the quantized model with SGLang.

Environment

python -m sglang.check_env
Python: 3.10.15 | packaged by conda-forge | (main, Oct 16 2024, 01:24:24) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H800
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.90.07
PyTorch: 2.4.0+cu121
sglang: 0.3.4
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.45.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.10
fastapi: 0.115.2
hf_transfer: 0.1.8
huggingface_hub: 0.26.0
interegular: 0.3.3
packaging: 24.1
PIL: 11.0.0
psutil: 6.1.0
pydantic: 2.9.2
uvicorn: 0.32.0
uvloop: 0.21.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.12
openai: 1.52.0
tiktoken: 0.8.0
anthropic: 0.36.2

Hypervisor vendor: KVM
ulimit soft: 1048576
