4.5. Qwen Series¶
Introduction¶
The Qwen series is a family of large language models developed by Tongyi Lab under Alibaba Group. Under the core product name "Tongyi Qianwen", the series covers a wide range of scenarios, from basic language understanding and text generation to multimodal and code-generation tasks.
Qwen3-8B¶
Inference and performance testing for this model require 1 Enflame GCU.
Model Download¶
url: Qwen3-8B
branch: main
commit id: 7a760cb9
git lfs install
git clone https://www.modelscope.cn/Qwen/Qwen3-8B.git
requirements¶
python3 -m pip install transformers==4.51.0
export ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1
Note:
Environment requirements: Python 3.10; any transformers >= 4.51.0 works.
Setting the environment variable ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION enables torch-gcu's automatic migration feature, as sketched below.
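As a minimal sketch of what automatic migration means in practice (assuming the plugin is importable as torch_gcu; both that module name and the redirection of "cuda" calls are assumptions, not confirmed by this document), CUDA-style PyTorch code is expected to run on the GCU without source changes:
import torch
import torch_gcu  # assumed plugin import; registers the GCU backend with PyTorch

# With ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 set, this CUDA-style
# allocation is expected to be migrated onto a GCU transparently.
x = torch.randn(2, 2).cuda()
y = x @ x
print(y.device)  # should report a GCU device rather than a CUDA one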
Online Inference Example¶
python3 -m sglang.launch_server --model-path [ path of qwen3-8b ] --host 0.0.0.0 --port 8089 --dp-size 1 --tp-size 1 --trust-remote-code --reasoning-parser qwen3
Note:
--port: any port on the local machine that is not already in use.
--context-length: sets the model's maximum context length, i.e. the total number of tokens it can handle.
--reasoning-parser: enables reasoning-content separation at inference time and selects the matching reasoning parser (e.g. deepseek-r1, qwen3). When enabled, the model output is automatically split into the reasoning process (reasoning_content) and the final answer (content), which simplifies structured processing and downstream use. It is typically used with reasoning models such as DeepSeek R1 and Qwen3 that emit <think>...</think> tags (see the official documentation for details). ⚠️ To obtain the separated structured result, --reasoning-parser must be specified at server startup; otherwise the /separate_reasoning endpoint has no effect. An OpenAI-style sketch follows this note.
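Once the server runs with --reasoning-parser qwen3, the OpenAI-compatible chat endpoint should also return the two parts as separate fields. A minimal sketch (the reasoning_content field name follows SGLang's reasoning-parser convention and is an assumption here; the model path placeholder matches the launch command above):
import requests

resp = requests.post(
    "http://localhost:8089/v1/chat/completions",
    json={
        "model": "[ path of qwen3-8b ]",
        "messages": [{"role": "user", "content": "What is 1+3?"}],
    },
).json()
msg = resp["choices"][0]["message"]
print(msg.get("reasoning_content"))  # separated reasoning, if the parser is enabled
print(msg["content"])                # final answer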
Mode selection
Thinking mode
enable_thinking=True
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
By default, Qwen3 has thinking capability enabled, similar to QwQ-32B, meaning the model uses its reasoning ability to improve the quality of generated responses. For example, when enable_thinking=True is set explicitly, or the default is kept in tokenizer.apply_chat_template, the model enters thinking mode.
For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the defaults in generation_config.json). Do not use greedy decoding, as it can degrade quality and lead to endless repetition. For more detailed guidance, see the Best Practices section of the official documentation.
Non-thinking mode
enable_thinking=False
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
In this mode, the model does not generate any thinking content.
For non-thinking mode, we recommend Temperature=0.7, TopP=0.8, TopK=20, and MinP=0. For more detailed guidance, see the Best Practices section of the official documentation. A per-turn override is sketched below.
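In addition to the enable_thinking template argument, Qwen3's usage notes describe a soft switch: when enable_thinking=True, appending /think or /no_think to a user message toggles the mode for that turn only. A minimal sketch of the per-turn override (behavior follows Qwen3's documented soft-switch convention, not anything specific to this platform):
messages = [
    {"role": "user", "content": "What is 1+3? /no_think"}  # disables thinking for this turn only
]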
Thinking-mode example (reasoning-content separation)
from transformers import AutoTokenizer
import requests
from sglang.utils import wait_for_server, print_highlight, terminate_process
port = 8089
messages = [
{
"role": "user",
"content": "What is 1+3?",
}
]
tokenizer = AutoTokenizer.from_pretrained("/xxx/llm_weights/Qwen3-8B/")
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # True is the default value for enable_thinking
)
gen_url = f"http://localhost:{port}/generate"
gen_data = {
"text": input,
"sampling_params": {
"skip_special_tokens": False,
"max_new_tokens": 1024,
"temperature": 0.6,
"top_p": 0.95,
},
}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]
parse_url = f"http://localhost:{port}/separate_reasoning"
separate_reasoning_data = {
"text": gen_response,
"reasoning_parser": "qwen3",
}
separate_reasoning_response_json = requests.post(
parse_url, json=separate_reasoning_data
).json()
print_highlight("==== Reasoning ====")
print_highlight(separate_reasoning_response_json["reasoning_text"])
print_highlight("==== Text ====")
print_highlight(separate_reasoning_response_json["text"])
Output
==== Reasoning ====
Okay, let's see. The user is asking "What is 1+3?" That seems straightforward, but maybe they want a detailed explanation. Let me break it down.
First, I know that addition is one of the basic arithmetic operations. So 1 plus 3 means combining the quantities of 1 and 3. Let me visualize it. If I have one object and then add three more, how many do I have in total? Let's count: 1, 2, 3, 4. So that's four.
Wait, but maybe they want a more formal approach. In mathematics, addition is defined as combining two numbers to get their sum. The numbers 1 and 3 are both positive integers. Adding them together would result in 4.
Is there any other way to think about this? Maybe using number lines. Starting at 1 and moving 3 units to the right would land me at 4. That's another way to visualize it.
Alternatively, using algebraic terms, if I let a = 1 and b = 3, then a + b = 4.
I should also consider if there's any context where this might not be the case, but in standard arithmetic, 1 + 3 is definitely 4. Maybe the user is testing if I know basic math, or they might have a trick question. But I don't see any trick here.
Let me double-check. 1 + 3. Yes, 1 plus 3 equals 4. No other possible answer in standard arithmetic. So the answer should be 4.
I think that's all. It's a simple question, but it's good to make sure there's no misunderstanding. Maybe the user is just confirming their own knowledge or starting a conversation. Either way, the answer is 4.
==== Text ====
The sum of 1 and 3 is calculated by combining the two numbers.
**Step-by-Step Explanation:**
1. Start with the number 1.
2. Add 3 to it: $1 + 3$.
3. Count the total: $1 + 3 = 4$.
**Answer:**
$1 + 3 = \boxed{4}$
Non-thinking-mode example
from transformers import AutoTokenizer
import requests
from sglang.utils import wait_for_server, print_highlight, terminate_process
port = 8089
messages = [
{
"role": "user",
"content": "What is 1+3?",
}
]
tokenizer = AutoTokenizer.from_pretrained("/xxx/llm_weights/Qwen3-8B/")
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
gen_url = f"http://localhost:{port}/generate"
gen_data = {
"text": input,
"sampling_params": {
"skip_special_tokens": False,
"max_new_tokens": 1024,
"temperature": 0.7,
"top_p": 0.8,
},
}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]
print_highlight("==== Original Output ====")
print_highlight(gen_response)
Output
==== Original Output ====
1 + 3 equals 4.
Performance Testing¶
Following the online example, start the SGLang server first.
Start the server
# start the server
python3 -m sglang.launch_server --model-path [ path of qwen3-8b ] --host 0.0.0.0 --port 8089 --dp-size 1 --tp-size 1 --trust-remote-code
Performance test
# performance test
python3 -m sglang.gcu_bench_serving --backend sglang --dataset-name random --random-range-ratio 1.0 --num-prompts 16 --random-input-len 1024 --random-output-len 1024 --host 0.0.0.0 --port 8089
Note:
--dataset-path specifies where the dataset is stored; otherwise it is read from /tmp/ by default. The default dataset is sharegpt, whose file name is ShareGPT_V3_unfiltered_cleaned_split.json.
--dataset-name selects a different dataset.
--num-prompts, --random-input-len, --random-output-len, and related parameters customize the test scale and the input/output lengths. See the official documentation and the script comments for the full details; an example follows this note.
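For instance, a run against a local copy of the default sharegpt dataset might look like the following (the /data path is an assumption; the flags mirror those of sglang.bench_serving):
python3 -m sglang.gcu_bench_serving --backend sglang --dataset-name sharegpt --dataset-path /data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 16 --host 0.0.0.0 --port 8089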
Accuracy Validation¶
Install lm-eval
python3.10 -m pip install 'lm_eval[all]'
Start the server
# start the server
python3.10 -m sglang.launch_server --model-path [ path of qwen3-8b ] --host 0.0.0.0 --port 8089 --dp-size 1 --tp-size 1 --trust-remote-code --reasoning-parser qwen3
Accuracy test
# ifeval accuracy test
lm_eval --model local-completions --model_args model=[ path of qwen3-8b ],tokenizer=[ path of qwen3-8b ],base_url=http://0.0.0.0:8089/v1/completions,max_retries=3, --tasks ifeval --trust_remote_code --batch_size 8 --gen_kwargs temperature=0 --seed 0
Output
|Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval| 4|none | 0|inst_level_loose_acc |↑ |0.4161|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.3885|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.2754|± |0.0192|
| | |none | 0|prompt_level_strict_acc|↑ |0.2421|± |0.0184|
Qwen2.5-VL-3B-Instruct¶
Inference and performance testing for this model require 1 Enflame GCU.
Model Download¶
url: Qwen2.5-VL-3B-Instruct
branch: main
commit id: 66285546d2b821cf421d4f5eb2576359d3770cd3
Download the contents at the above url into a local Qwen2.5-VL-3B-Instruct folder.
requirements¶
python3 -m pip install transformers==4.48.3 datasets==3.5.0
export ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1
Note:
Environment requirements: Python 3.10; any transformers >= 4.48.3 works.
The environment variable ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 enables torch-gcu's automatic migration feature.
Online Inference Example¶
Start the server
python3 -m sglang.launch_server --model-path [ path of Qwen2.5-VL-3B-Instruct ] --host 0.0.0.0 --port 8089 --dp-size 1 --tp-size 1 --trust-remote-code
Note:
--port: any port on the local machine that is not already in use.
--context-length: sets the model's maximum context length (total tokens).
Client sends a request
import requests
from sglang.utils import print_highlight
url = f"http://localhost:8089/v1/chat/completions"
data = {
"model": "[ path of Qwen2.5-VL-3B-Instruct ]",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
}
response = requests.post(url, json=data)
print(response.json())
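Since Qwen2.5-VL is a vision-language model, the same OpenAI-compatible endpoint also accepts image inputs via the standard image_url content format. A minimal sketch (the image URL is a hypothetical placeholder; any reachable image works):
import requests

url = "http://localhost:8089/v1/chat/completions"
data = {
    "model": "[ path of Qwen2.5-VL-3B-Instruct ]",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/demo.jpg"},  # hypothetical image URL
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
}
response = requests.post(url, json=data)
print(response.json())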
Performance Testing¶
Following the online example, start the SGLang server first.
Start the server
python3 -m sglang.launch_server --model-path [ path of Qwen2.5-VL-3B-Instruct ] --host 0.0.0.0 --port 8089 --dp-size 1 --tp-size 1 --trust-remote-code
Performance test
python3 -m sglang.gcu_bench_serving --backend sglang --dataset-name random --random-range-ratio 1.0 --num-prompts 16 --random-input-len 1024 --random-output-len 1024 --host 0.0.0.0 --port 8089 --dataset-path [ path of dataset ]
Note:
--dataset-path specifies where the dataset is stored; otherwise it is read from /tmp/ by default. The default dataset is sharegpt:
dataset file name: ShareGPT_V3_unfiltered_cleaned_split.json;
download address: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json;
--dataset-name selects a different dataset.
--num-prompts, --random-input-len, --random-output-len, and related parameters customize the test scale and the input/output lengths. See the official documentation and the script comments for details.
Accuracy Validation¶
Install evalscope
python3.10 -m pip install 'evalscope[all]'
Start the server
python3.10 -m sglang.launch_server --model-path [ path of Qwen2.5-VL-3B-Instruct ] \
--host 0.0.0.0 --port 30000 --dp-size 1 --tp-size 1 --trust-remote-code --mem-fraction-static 0.7 --chat-template qwen2-vl \
--disable-radix-cache --chunked-prefill-size -1
Accuracy test
from evalscope import TaskConfig, run_task
task_cfg_dict = TaskConfig(
    work_dir='outputs',
    eval_backend='VLMEvalKit',
    eval_config={
        'data': ['MMMU_DEV_VAL'],
        'mode': 'all',
        'model': [
            {
                'api_base': 'http://localhost:30000/v1/chat/completions',
                'key': 'EMPTY',
                'name': 'CustomAPIModel',
                'temperature': 0.0,
                'type': 'Qwen2.5-VL-3B-Instruct',
                'img_size': -1,
                'video_llm': False,
                'max_tokens': 1024,
            }
        ],
        'reuse': False,
        'nproc': 16,
        'judge': 'exact_matching',
    },
    timeout=3600,
)
run_task(task_cfg_dict)
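If the snippet above is saved as a standalone script, say a hypothetical eval_mmmu.py, it can be run with python3.10 eval_mmmu.py once the server is reachable; result files and logs land in the work_dir ('outputs') configured in the TaskConfig.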
Viewing results: the accuracy results are printed to the terminal.
----------------------------------- ------------------ -------------------
split dev validation
Overall 0.42 0.42
Accounting 0.6 0.3
Agriculture 0.4 0.5333333333333333
Architecture_and_Engineering 0.0 0.23333333333333334
Art 0.2 0.3333333333333333
Art_Theory 0.4 0.5333333333333333
Basic_Medical_Science 1.0 0.6
Biology 0.4 0.4
Chemistry 0.4 0.3
Clinical_Medicine 0.2 0.6
Computer_Science 0.4 0.36666666666666664
Design 0.8 0.5333333333333333
Diagnostics_and_Laboratory_Medicine 0.6 0.43333333333333335
Economics 0.4 0.3333333333333333
Electronics 0.4 0.3
Energy_and_Power 0.6 0.2
Finance 0.6 0.26666666666666666
Geography 0.0 0.4
History 0.8 0.5333333333333333
Literature 0.6 0.9
Manage 0.4 0.4
Marketing 0.0 0.5
Materials 0.4 0.3333333333333333
Math 0.4 0.36666666666666664
Mechanical_Engineering 0.0 0.2
Music 0.0 0.3
Pharmacy 0.4 0.5
Physics 0.6 0.3
Psychology 0.8 0.6
Public_Health 0.0 0.5333333333333333
Sociology 0.8 0.4666666666666667
Art & Design 0.35 0.425
Business 0.4 0.36
Health & Medicine 0.44 0.5333333333333333
Humanities & Social Science 0.75 0.625
Science 0.36 0.35333333333333333
Tech & Engineering 0.3142857142857143 0.30952380952380953
----------------------------------- ------------------ -------------------
Qwen3-32B¶
Inference and performance testing for this model require 2 Enflame GCUs.
Model Download¶
url: Qwen3-32B
branch: main
commit id: d47b0d4
git lfs install
git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
requirements¶
python3 -m pip install transformers==4.51.0
export ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1
Note:
Environment requirements: Python 3.10; any transformers >= 4.51.0 works.
The environment variable ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION enables torch-gcu's automatic migration feature.
Online Inference Example¶
python3 -m sglang.launch_server --model-path [ path of qwen3-32b ] --host 0.0.0.0 --port 8089 --dp-size 1 --tp-size 2 --trust-remote-code --reasoning-parser qwen3
Note:
--port: any port on the local machine that is not already in use.
--context-length: sets the model's maximum context length (total tokens).
--reasoning-parser: enables reasoning-content separation at inference time and selects the matching reasoning parser (e.g. deepseek-r1, qwen3). When enabled, the model output is automatically split into the reasoning process (reasoning_content) and the final answer (content), which simplifies structured processing and downstream use. It is typically used with reasoning models such as DeepSeek R1 and Qwen3 that emit <think>...</think> tags (see the official documentation for details). ⚠️ To obtain the separated structured result, --reasoning-parser must be specified at server startup; otherwise the /separate_reasoning endpoint has no effect.
Mode selection
Thinking mode
enable_thinking=True
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
By default, Qwen3 has thinking capability enabled, similar to QwQ-32B, meaning the model uses its reasoning ability to improve the quality of generated responses. For example, when enable_thinking=True is set explicitly, or the default is kept in tokenizer.apply_chat_template, the model enters thinking mode.
For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the defaults in generation_config.json). Do not use greedy decoding, as it can degrade quality and lead to endless repetition. For more detailed guidance, see the Best Practices section of the official documentation.
Non-thinking mode
enable_thinking=False
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
In this mode, the model does not generate any thinking content.
For non-thinking mode, we recommend Temperature=0.7, TopP=0.8, TopK=20, and MinP=0. For more detailed guidance, see the Best Practices section of the official documentation.
Thinking-mode example (reasoning-content separation)
from transformers import AutoTokenizer
import requests
from sglang.utils import wait_for_server, print_highlight, terminate_process
port = 8089
messages = [
{
"role": "user",
"content": "What is 1+3?",
}
]
tokenizer = AutoTokenizer.from_pretrained("/xxx/llm_weights/Qwen3-32B/")
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # True is the default value for enable_thinking
)
gen_url = f"http://localhost:{port}/generate"
gen_data = {
"text": input,
"sampling_params": {
"skip_special_tokens": False,
"max_new_tokens": 1024,
"temperature": 0.6,
"top_p": 0.95,
},
}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]
parse_url = f"http://localhost:{port}/separate_reasoning"
separate_reasoning_data = {
"text": gen_response,
"reasoning_parser": "qwen3",
}
separate_reasoning_response_json = requests.post(
parse_url, json=separate_reasoning_data
).json()
print_highlight("==== Reasoning ====")
print_highlight(separate_reasoning_response_json["reasoning_text"])
print_highlight("==== Text ====")
print_highlight(separate_reasoning_response_json["text"])
Output
<|im_start|>user
What is 1+3?<|im_end|>
<|im_start|>assistant
==== Reasoning ====
Okay, so the user is asking "What is 1+3?" Hmm, that seems straightforward, but maybe I should break it down to make sure I'm not missing anything.
Let me start by recalling basic addition. When you add 1 and 3 together, you're combining the quantities. So 1 plus 3... Let me visualize it. If I have one object and then add three more objects, how many do I have in total? Let's count: 1, then 2, 3, 4. So that's four. Wait, is there any chance this is a trick question? Sometimes people ask simple questions to test if you're paying attention or to see if there's a different interpretation.
But in standard arithmetic, 1+3 is definitely 4. Maybe they want to see the process? Let me think again. 1 is a single unit, and 3 is three units. Adding them together gives four units. There's no ambiguity here unless there's some context I'm missing, like different number bases or something. For example, in base 10, which is standard, 1+3 is 4.
If it were in another base, like base 4, then 1+3 would be 10, but the question doesn't specify a base, so I should assume base 10. Also, maybe they want a written explanation? Let me confirm once more. 1 plus 3: 1 + 3 = 4. Yes, that's correct. I don't see any reason to doubt that. So the answer should be 4.
==== Text ====
The sum of 1 and 3 is **4**.
In standard arithmetic (base 10), adding 1 and 3 results in 4. If there were a different context (e.g., a non-decimal number system), the question would typically specify that.
**Answer:** 4.
Non-thinking-mode example
from transformers import AutoTokenizer
import requests
from sglang.utils import wait_for_server, print_highlight, terminate_process
port = 8089
messages = [
{
"role": "user",
"content": "What is 1+3?",
}
]
tokenizer = AutoTokenizer.from_pretrained("/xxx/llm_weights/Qwen3-32B/")
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
gen_url = f"http://localhost:{port}/generate"
gen_data = {
"text": input,
"sampling_params": {
"skip_special_tokens": False,
"max_new_tokens": 1024,
"temperature": 0.7,
"top_p": 0.8,
},
}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]
print_highlight("==== Original Output ====")
print_highlight(gen_response)
Output
==== Original Output ====
1 + 3 equals 4.
Performance Testing¶
Following the online example, start the SGLang server first.
Start the server
# start the server
python3 -m sglang.launch_server --model-path [ path of qwen3-32b ] --host 0.0.0.0 --port 8089 --dp-size 1 --tp-size 2 --trust-remote-code
Performance test
# performance test
python3 -m sglang.gcu_bench_serving --backend sglang --dataset-name random --random-range-ratio 1.0 --num-prompts 16 --random-input-len 1024 --random-output-len 1024 --host 0.0.0.0 --port 8089
Note:
--dataset-path specifies where the dataset is stored; otherwise it is read from /tmp/ by default. The default dataset is sharegpt, whose file name is ShareGPT_V3_unfiltered_cleaned_split.json.
--dataset-name selects a different dataset.
--num-prompts, --random-input-len, --random-output-len, and related parameters customize the test scale and the input/output lengths. See the official documentation and the script comments for details.
Accuracy Validation¶
Install lm-eval
python3.10 -m pip install 'lm_eval[all]'
Accuracy test
# start the server
python3 -m sglang.launch_server --model-path [ path of qwen3-32b ] --host 0.0.0.0 --port 8089 --dp-size 1 --tp-size 4 --trust-remote-code --reasoning-parser qwen3
# mmlu accuracy test
lm_eval --model local-completions --model_args model=[ path of qwen3-32b ],tokenizer=[ path of qwen3-32b ],base_url=http://0.0.0.0:8089/v1/completions,max_retries=3,max_length=32768, --tasks mmlu --trust_remote_code --batch_size 1 --num_fewshot 0 --seed 0
Output
|Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|-----------------|------:|------|-----:|------|---|-----:|---|------|
|mmlu | 2|none | |acc |↑ |0.8083|± |0.0032|
|- humanities | 2|none | |acc |↑ |0.7233|± |0.0063|
|- other | 2|none | |acc |↑ |0.8375|± |0.0064|
|- social sciences| 2|none | |acc |↑ |0.8915|± |0.0055|
|- stem | 2|none | |acc |↑ |0.8252|± |0.0066|
QwQ-32B¶
Inference and performance testing for this model require 2 Enflame GCUs.
Model Download¶
url: QwQ-32B
branch: main
commit id: 976055f8c83f394f35dbd3ab09a285a984907bd0
Download the contents at the above url into a local QwQ-32B folder.
requirements¶
python3 -m pip install transformers==4.48.3
python3 -m pip install lm_eval==0.4.8
export ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1
Note:
Environment requirements: Python 3.10; any transformers >= 4.48.3 and any lm_eval >= 0.4.8 work.
The environment variable ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 enables torch-gcu's automatic migration feature.
Online Inference Example¶
Start the server
# start the server
python3 -m sglang.launch_server --model-path [ path of QwQ-32B ] --host 0.0.0.0 --port 8089 --dp-size 1 --tp-size 2 --trust-remote-code --reasoning-parser deepseek-r1
Note:
--port: any port on the local machine that is not already in use.
--context-length: sets the model's maximum context length (total tokens).
Client sends a request
# client sends a request:
import requests
from sglang.utils import print_highlight

port = 8089  # must match the --port used when launching the server
url = f"http://localhost:{port}/v1/chat/completions"
data = {
"model": "[ path of QwQ-32B ]",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
}
response = requests.post(url, json=data)
print_highlight(response.json())
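Because the server was launched with --reasoning-parser deepseek-r1, the returned message should carry the reasoning separately from the answer. A minimal sketch of reading both fields from the response above (the reasoning_content field name follows SGLang's reasoning-parser convention and is an assumption):
msg = response.json()["choices"][0]["message"]
print_highlight(msg.get("reasoning_content"))  # separated reasoning, if any
print_highlight(msg["content"])                # final answer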
Performance Testing¶
Following the online example, start the SGLang server first.
Start the server
# start the server
python3 -m sglang.launch_server --model-path [ path of QwQ-32B ] --host 0.0.0.0 --port 8089 --dp-size 1 --tp-size 2 --trust-remote-code --reasoning-parser deepseek-r1
Performance test
# performance test
python3 -m sglang.gcu_bench_serving --backend sglang --dataset-name random --random-range-ratio 1.0 --num-prompts 16 --random-input-len 1024 --random-output-len 1024 --host 0.0.0.0 --port 8089
Note:
--dataset-path specifies where the dataset is stored; otherwise it is read from /tmp/ by default. The default dataset is sharegpt, whose file name is ShareGPT_V3_unfiltered_cleaned_split.json.
--dataset-name selects a different dataset.
--num-prompts, --random-input-len, --random-output-len, and related parameters customize the test scale and the input/output lengths. See the official documentation and the script comments for details.
Accuracy Validation¶
Install lm-eval
python3.10 -m pip install 'lm_eval[all]'
Accuracy test
# start the server
python3 -m sglang.launch_server --model-path [ path of QwQ-32B ] --context-length 32768 \
--host 0.0.0.0 --port 30000 --dp-size 1 --tp-size 4 --dtype bfloat16 --chunked-prefill-size 16384 --mem-fraction-static 0.8 \
--max-prefill-tokens 32768 --reasoning-parser deepseek-r1
# accuracy test
lm_eval --model local-completions --model_args model=[ path of QwQ-32B ],tokenizer=[ path of QwQ-32B ],base_url=http://0.0.0.0:30000/v1/completions,max_retries=3, --tasks ifeval --trust_remote_code --gen_kwargs temperature=0 --seed 0 --batch_size 5 --num_fewshot 5
Note:
--tasks ifeval specifies the ifeval task for the test.
Viewing results
The test results are printed to the terminal.
|Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval| 4|none | 0|inst_level_loose_acc |↑ |0.4281|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.3861|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.2957|± |0.0196|
| | |none | 0|prompt_level_strict_acc|↑ |0.2440|± |0.0185|