3.20. Qwen

Qwen-1_8B-Chat

Model download

url: Qwen-1_8B-Chat
branch: main
commit id: 1d0f68d

Download everything under the URL above into the Qwen-1_8B-Chat folder.
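The download step can be sketched as follows, assuming the model URL points at a git repository. `REPO_URL` and the helper name `fetch_at_commit` are placeholders for illustration and are not part of the original guide; large weight files may additionally require git-lfs.

```shell
# Sketch only: fetch a model repository and pin it to a documented commit.
# fetch_at_commit <repo-url> <dest-dir> <branch> <commit>
fetch_at_commit() {
  local repo_url=$1 dest=$2 branch=$3 commit=$4
  # Clone the requested branch into the destination folder.
  git clone --quiet --branch "$branch" "$repo_url" "$dest" || return 1
  # Pin the working tree to the documented commit (detached HEAD).
  git -C "$dest" checkout --quiet "$commit"
}

# Example with a placeholder URL (not a real address):
# fetch_at_commit "$REPO_URL" Qwen-1_8B-Chat main 1d0f68d
```

The same helper applies to the other models in this section by swapping in their folder name, branch, and commit id.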

Install dependencies

pip3 install tiktoken

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of hf_Qwen-1_8B-Chat_model] \
 --output-len=20 \
 --demo=te \
 --dtype=bfloat16

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of hf_Qwen-1_8B-Chat_model] \
 --max-model-len=8192 \
 --tokenizer=[path of hf_Qwen-1_8B-Chat_model] \
 --input-len=4096 \
 --output-len=4096 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=bfloat16 \
 --enforce-eager

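Notes:

The maximum max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
dtype can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.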
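To read the time to first token, the performance command above can be rerun with --output-len=1. The sketch below only prints that command for review; `MODEL_DIR` and the helper name `ttft_cmd` are placeholders for illustration, and the remaining flags are copied from the performance test above.

```shell
# Sketch: print a perf command that fixes --output-len=1 so the reported
# latency corresponds to time_to_first_token_latency.
# ttft_cmd <model-dir> <input-len>
ttft_cmd() {
  local model_dir=$1 input_len=$2
  echo python3 -m vllm_utils.benchmark_test --perf \
    "--model=${model_dir}" \
    "--tokenizer=${model_dir}" \
    --max-model-len=8192 \
    "--input-len=${input_len}" \
    --output-len=1 \
    --num-prompts=1 \
    --block-size=64 \
    --dtype=bfloat16 \
    --enforce-eager
}

# MODEL_DIR is a placeholder for the local Qwen-1_8B-Chat folder.
ttft_cmd "${MODEL_DIR:-/path/to/Qwen-1_8B-Chat}" 4096
```

Copy the printed line into a shell to run it; the same pattern applies to the other models in this section with their own flags.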

Qwen-7B

Model download

url: Qwen-7B
branch: main
commit id: ef3c5c9

Download everything under the URL above into the qwen_7b folder.

Install dependencies

pip3 install tiktoken

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of hf_qwen_model] \
 --output-len=16 \
 --demo=te \
 --dtype=float16

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of hf_qwen_model] \
 --max-model-len=8192 \
 --tokenizer=[path of hf_qwen_model] \
 --input-len=4096 \
 --output-len=4096 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16

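Notes:

The maximum max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
dtype can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen-7B-Chat

Model download

url: Qwen-7B-Chat
branch: main
commit id: 8867b2a8cc5e83bce0be47bb4155a9427dc23dd0

Download everything under the URL above into the Qwen-7B-Chat folder.

Install dependencies

pip3 install tiktoken

Batch offline inference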

python3 -m vllm_utils.benchmark_test \
 --model=[path of hf_Qwen-7B-Chat_model] \
 --output-len=20 \
 --demo=te \
 --dtype=bfloat16

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of hf_Qwen-7B-Chat_model] \
 --max-model-len=1024 \
 --tokenizer=[path of hf_Qwen-7B-Chat_model] \
 --input-len=512 \
 --output-len=512 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=bfloat16 \
 --enforce-eager

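Notes:

The maximum max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
dtype can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.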
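The performance commands in this section choose input-len and output-len so that their sum equals max-model-len (512 + 512 = 1024 here, 4096 + 4096 = 8192 for Qwen-7B). The following is a minimal sketch of a pre-flight check, assuming the prompt and the generated tokens must together fit within max-model-len; the helper name `fits_context` is made up for illustration.

```shell
# fits_context <input-len> <output-len> <max-model-len>
# Succeeds when the prompt plus the generated tokens fit in the context window.
fits_context() {
  local input_len=$1 output_len=$2 max_model_len=$3
  [ $((input_len + output_len)) -le "$max_model_len" ]
}

if fits_context 512 512 1024; then
  echo "512 + 512 fits in 1024"
fi
if ! fits_context 4096 4096 1024; then
  echo "4096 + 4096 does not fit in 1024; raise --max-model-len (up to 8192 for this model)"
fi
```

Running a check like this before adjusting the lengths avoids launching a benchmark with values the configured context window cannot hold.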

Qwen-14B-Chat

Model download

url: Qwen-14B-Chat
branch: main
commit id: cdaff79

Download everything under the URL above into the Qwen-14B-Chat folder.

Install dependencies

pip3 install tiktoken

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of hf_Qwen-14B-Chat_model] \
 --output-len=20 \
 --demo=te \
 --dtype=float16

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of hf_Qwen-14B-Chat_model] \
 --max-model-len=2048 \
 --tokenizer=[path of hf_Qwen-14B-Chat_model] \
 --input-len=1024 \
 --output-len=1024 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --enforce-eager

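Notes:

The maximum max-model-len supported by this model is 2048.
input-len, output-len, and num-prompts can be adjusted as needed.
dtype can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen-72B-Chat

Inference and performance testing for this model require four Enflame GCUs.

Model download

url: Qwen-72B-Chat
branch: main
commit id: 6eb5569

Download everything under the URL above into the Qwen-72B-Chat folder.

Install dependencies

pip3 install tiktoken

Batch offline inference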

python3 -m vllm_utils.benchmark_test \
 --model=[path of hf_Qwen-72B-Chat_model] \
 --tensor-parallel-size=4 \
 --output-len=256 \
 --demo=te \
 --dtype=float16 \
 --max-model-len=2048 \
 --gpu-memory-utilization 0.945

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of hf_Qwen-72B-Chat_model] \
 --tensor-parallel-size=4 \
 --max-model-len=2048 \
 --tokenizer=[path of hf_Qwen-72B-Chat_model] \
 --input-len=1024 \
 --output-len=1024 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --gpu-memory-utilization 0.945

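Notes:

On four cards, this model supports a max-model-len of 8192 with ECC off and 2048 with ECC on.
input-len, output-len, and num-prompts can be adjusted as needed.
dtype can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-7B

Model download

url: Qwen1.5-7B
branch: main
commit id: e52fa2e

Download everything under the URL above into the Qwen1.5-7B folder.

Install dependencies

pip3 install tiktoken

Batch offline inference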

python3 -m vllm_utils.benchmark_test \
 --model=[path of hf_Qwen1.5-7B_model] \
 --output-len=20 \
 --demo=te \
 --dtype=float16 \
 --max-model-len=16384

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of hf_Qwen1.5-7B_model] \
 --max-model-len=16384 \
 --tokenizer=[path of hf_Qwen1.5-7B_model] \
 --input-len=8192 \
 --output-len=8192 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16

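Notes:

The maximum max-model-len supported by this model is 16384.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-14B-Chat

Model download

url: Qwen1.5-14B-Chat
branch: main
commit id: 17e11c306ed235e970c9bb8e5f7233527140cdcf

Download everything under the URL above into the Qwen1.5-14B-Chat folder.

Install dependencies

pip3 install tiktoken

Batch offline inference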

python3 -m vllm_utils.benchmark_test \
 --model=[path of hf_Qwen1.5-14B-Chat_model] \
 --output-len=20 \
 --demo=te \
 --dtype=float16 \
 --max-model-len=8192

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of hf_Qwen1.5-14B-Chat_model] \
 --max-model-len=8192 \
 --tokenizer=[path of hf_Qwen1.5-14B-Chat_model] \
 --input-len=4096 \
 --output-len=4096 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16

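Notes:

The maximum max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-32B

Inference and performance testing for this model require two Enflame GCUs.

Model download

url: Qwen1.5-32B
branch: main
commit id: cefef80dc06a65f89d1d71d0adbc56d335ca2490

Download everything under the URL above into the Qwen1.5-32B folder.

Install dependencies

pip3 install tiktoken

Batch offline inference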

python3 -m vllm_utils.benchmark_test \
 --model=[path of hf_Qwen1.5-32B_model] \
 --tensor-parallel-size=2 \
 --output-len=20 \
 --demo=te \
 --dtype=float16 \
 --max-model-len=2048

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of hf_Qwen1.5-32B_model] \
 --tensor-parallel-size=2 \
 --max-model-len=4096 \
 --tokenizer=[path of hf_Qwen1.5-32B_model] \
 --input-len=2048 \
 --output-len=2048 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16

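Notes:

The maximum max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-72B-Chat

Inference and performance testing for this model require four Enflame GCUs.

Model download

url: Qwen1.5-72B-Chat
branch: main
commit id: 1a6ccc1215278f962c794b1848c710c29ef4053d

Download everything under the URL above into the Qwen1.5-72B-Chat folder.

Install dependencies

pip3 install tiktoken

Batch offline inference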

python3 -m vllm_utils.benchmark_test \
 --model=[path of hf_Qwen1.5-72B-Chat_model] \
 --tensor-parallel-size=4 \
 --output-len=20 \
 --demo=te \
 --dtype=float16 \
 --max-model-len=2048 \
 --gpu-memory-utilization=0.945

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of hf_Qwen1.5-72B-Chat_model] \
 --tensor-parallel-size=4 \
 --max-model-len=2048 \
 --tokenizer=[path of hf_Qwen1.5-72B-Chat_model] \
 --input-len=1024 \
 --output-len=1024 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --gpu-memory-utilization=0.945

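Notes:

The maximum max-model-len supported by this model is 2048.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-14B-Chat-w8a16

Inference and performance testing for this model require one Enflame GCU.

Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract QWen1.5-14b-chat-w8a16.tar, then copy all of its contents into the QWen1.5-14b-chat_w8a16 folder.
The QWen1.5-14b-chat_w8a16 directory is structured as follows: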

QWen1.5-14b-chat_w8a16/
            ├── config.json
            ├── generation_config.json
            ├── model.safetensors
            ├── quantize_config.json
            ├── tokenizer.json
            ├── tokenizer_config.json
            ├── merges.txt
            ├── tops_quantize_info.json
            └── vocab.json
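Before running inference, the extracted folder can be compared against the listing above. This is a minimal sketch; the helper name `check_model_dir` is made up for illustration, and the file names are taken from the listing.

```shell
# check_model_dir <dir>: report any file from the listing above that is missing.
# Returns 0 when every file is present, 1 otherwise.
check_model_dir() {
  local dir=$1 missing=0 f
  for f in config.json generation_config.json model.safetensors \
           quantize_config.json tokenizer.json tokenizer_config.json \
           merges.txt tops_quantize_info.json vocab.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_model_dir QWen1.5-14b-chat_w8a16 && echo "layout OK"
```

The other quantized models in this section ship slightly different file sets, so the list in the loop would need to be adapted to each listing.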

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of QWen1.5-14b-chat_w8a16] \
 --demo=te \
 --dtype=float16 \
 --quantization=w8a16 \
 --output-len=20 \
 --max-model-len=2048

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of QWen1.5-14b-chat_w8a16] \
 --input-len=4096 \
 --output-len=4096 \
 --num-prompts=1 \
 --block-size=64 \
 --max-model-len=8192 \
 --dtype=float16 \
 --quantization=w8a16 \
 --enforce-eager

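Notes:

A single GCU supports a max-model-len of up to 16k; to use the 32k max-model-len that the model itself supports, set --tensor-parallel-size=2.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.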
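The first note can be sketched as a small helper that picks the card count from the requested context length. It assumes 16k means 16384 tokens and 32k means 32768; `MODEL_DIR`, the helper name `tp_size_for`, and the 16384/16384 length split are illustrative choices, while the remaining flags are copied from the performance test above. The command is only printed, not executed.

```shell
# tp_size_for <max-model-len>: card count implied by the note above
# (1 card up to 16k tokens, 2 cards beyond that, up to the model's 32k).
tp_size_for() {
  local max_model_len=$1
  if [ "$max_model_len" -le 16384 ]; then
    echo 1
  else
    echo 2
  fi
}

# MODEL_DIR is a placeholder; the 16384/16384 split is only an illustration.
MAX_LEN=32768
echo python3 -m vllm_utils.benchmark_test --perf \
  "--model=${MODEL_DIR:-/path/to/QWen1.5-14b-chat_w8a16}" \
  --input-len=16384 \
  --output-len=16384 \
  --num-prompts=1 \
  --block-size=64 \
  "--max-model-len=${MAX_LEN}" \
  --dtype=float16 \
  --quantization=w8a16 \
  "--tensor-parallel-size=$(tp_size_for "$MAX_LEN")" \
  --enforce-eager
```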

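Qwen-14B-Chat-w8a16

Inference and performance testing for this model require one Enflame GCU.

Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract Qwen-14B-Chat-w8a16.tar, then copy all of its contents into the Qwen-14B-Chat_w8a16 folder.
The Qwen-14B-Chat_w8a16 directory is structured as follows: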

Qwen-14B-Chat_w8a16/
  ├── config.json
  ├── configuration_qwen.py
  ├── model.safetensors
  ├── quantize_config.json
  ├── qwen.tiktoken
  ├── tokenization_qwen.py
  ├── tokenizer_config.json
  └── tops_quantize_info.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen-14B-Chat_w8a16] \
 --demo=te \
 --dtype=float16 \
 --quantization=w8a16 \
 --output-len=20

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen-14B-Chat_w8a16] \
 --input-len=1024 \
 --output-len=1024 \
 --num-prompts=1 \
 --block-size=64 \
 --max-model-len=2048 \
 --dtype=float16 \
 --quantization=w8a16 \
 --enforce-eager

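Notes:

A single GCU supports a max-model-len of 2048.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen-72B-Chat-w8a16

Inference and performance testing for this model require four Enflame GCUs.

Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract Qwen-72B-Chat-w8a16.tar, then copy all of its contents into the Qwen-72B-Chat_w8a16 folder.
The Qwen-72B-Chat_w8a16 directory is structured as follows: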

Qwen-72B-Chat_w8a16/
  ├── config.json
  ├── configuration_qwen.py
  ├── model.safetensors
  ├── quantize_config.json
  ├── qwen.tiktoken
  ├── tokenization_qwen.py
  ├── tokenizer_config.json
  └── tops_quantize_info.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen-72B-Chat_w8a16] \
 --demo=te \
 --dtype=float16 \
 --quantization=w8a16 \
 --output-len=20 \
 --max-model-len=2048 \
 --tensor-parallel-size=4

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen-72B-Chat_w8a16] \
 --input-len=1024 \
 --output-len=1024 \
 --num-prompts=1 \
 --block-size=64 \
 --max-model-len=2048 \
 --dtype=float16 \
 --quantization=w8a16 \
 --tensor-parallel-size=4

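Notes:

Four GCUs support a max-model-len of 2048.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-32B-w8a16

Inference and performance testing for this model require one Enflame GCU.

Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract qwen1.5-32b-w8a16.tar, then copy all of its contents into the qwen1.5_32b_w8a16 folder.
The qwen1.5_32b_w8a16 directory is structured as follows: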

qwen1.5_32b_w8a16/
            ├── config.json
            ├── generation_config.json
            ├── model.safetensors
            ├── quantize_config.json
            ├── merges.txt
            ├── vocab.json
            ├── tokenizer.json
            ├── tokenizer_config.json
            └── tops_quantize_info.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of qwen1.5_32b_w8a16] \
 --demo=te \
 --output-len=256 \
 --dtype=float16 \
 --quantization w8a16 \
 --max-model-len 4096

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of qwen1.5_32b_w8a16] \
 --max-model-len=4096 \
 --tokenizer=[path of qwen1.5_32b_w8a16] \
 --input-len=2048 \
 --output-len=2048 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16

Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Serving mode

# Start the server
python3 -m vllm.entrypoints.openai.api_server --model=[path of qwen1.5_32b_w8a16]  \
 --max-model-len=4096  \
 --disable-log-requests  \
 --gpu-memory-utilization=0.945  \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16

# Start the client
python3 -m vllm_utils.benchmark_serving --backend=vllm  \
 --dataset-name=random  \
 --model=[path of qwen1.5_32b_w8a16]  \
 --num-prompts=10  \
 --random-input-len=4   \
 --random-output-len=300  \
 --trust-remote-code

Notes:

To keep the input and output lengths fixed, the test uses a randomly generated dataset.
num-prompts, random-input-len, and random-output-len can be adjusted as needed.
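Besides the benchmark client, the server started above can be queried directly through its OpenAI-compatible completions endpoint. The sketch below assumes the server listens on vLLM's default address (localhost:8000) and that the model name is the same path passed to --model; `MODEL_DIR`, the prompt, and the helper name `completion_payload` are placeholders for illustration.

```shell
# completion_payload <model> <prompt> <max-tokens>: build the JSON body for /v1/completions.
# Assumes the model path and prompt contain no characters that need JSON escaping.
completion_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' "$1" "$2" "$3"
}

# Example request (adjust the address if --host/--port were set on the server):
# curl -s http://localhost:8000/v1/completions \
#   -H "Content-Type: application/json" \
#   -d "$(completion_payload "$MODEL_DIR" "Hello" 32)"

# Print the payload for review; MODEL_DIR is a placeholder.
completion_payload "${MODEL_DIR:-/path/to/qwen1.5_32b_w8a16}" "Hello" 32
```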

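Qwen1.5-MoE-A2.7B

Inference and performance testing for this model require one Enflame GCU.

Model download

url: Qwen1.5-MoE-A2.7B
branch: main
commit id: 1a758c5

Download everything under the URL above into the Qwen1.5-MoE-A2.7B folder.

Batch offline inference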

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5-MoE-A2.7B] \
 --output-len=256 \
 --demo=te \
 --dtype=bfloat16 \
 --max-model-len=8192

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5-MoE-A2.7B] \
 --max-model-len=8192 \
 --tokenizer=[path of Qwen1.5-MoE-A2.7B] \
 --input-len=4096 \
 --output-len=4096 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=bfloat16

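Notes:

The maximum max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen2-7B

Inference and performance testing for this model require one Enflame GCU.

Model download

url: Qwen2-7B
branch: main
commit id: da7ff8fb

Download everything under the URL above into the Qwen2-7B folder.

Batch offline inference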

python3 -m vllm_utils.benchmark_test \
    --demo='te' \
    --model=[path of Qwen2-7B] \
    --tokenizer=[path of Qwen2-7B] \
    --num-prompts 1 \
    --max-model-len=32768 \
    --block-size=64 \
    --output-len=256 \
    --device=gcu \
    --dtype=float16 \
    --tensor-parallel-size=1 \
    --gpu-memory-utilization=0.945

Performance test

python3 -m vllm_utils.benchmark_test \
    --perf \
    --model [path of Qwen2-7B] \
    --tensor-parallel-size 1 \
    --max-model-len=32768 \
    --input-len=8000 \
    --output-len=8000 \
    --dtype=float16 \
    --device gcu \
    --num-prompts=1 \
    --block-size=64 \
    --gpu-memory-utilization=0.945

Notes:

A single GCU supports a max-model-len of 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen2-7B-Instruct

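Inference and performance testing for this model require one Enflame GCU.

Model download

url: Qwen2-7B-Instruct
branch: main
commit id: 39c0a5ab

Download everything under the URL above into the Qwen2-7B-Instruct folder.

Batch offline inference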

python3 -m vllm_utils.benchmark_test \
    --demo='te' \
    --model=[path of Qwen2-7B-Instruct] \
    --tokenizer=[path of Qwen2-7B-Instruct] \
    --num-prompts 1 \
    --block-size=64 \
    --output-len=256 \
    --device=gcu \
    --dtype=bfloat16 \
    --tensor-parallel-size=1 \
    --max-model-len=32768 \
    --gpu-memory-utilization=0.945

Performance test

python3 -m vllm_utils.benchmark_test \
    --perf \
    --model [path of Qwen2-7B-Instruct] \
    --tensor-parallel-size 1 \
    --max-model-len=32768 \
    --input-len=8000 \
    --output-len=8000 \
    --dtype=bfloat16 \
    --device gcu \
    --num-prompts=1 \
    --block-size=64 \
    --gpu-memory-utilization=0.945

Notes:

A single GCU supports a max-model-len of 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen2-72B-padded-w8a16

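Inference and performance testing for this model require eight Enflame GCUs, matching the --tensor-parallel-size=8 setting used in the commands below.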

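Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract Qwen2-72B-padded-w8a16.tar, then copy all of its contents into the Qwen2-72B-padded-w8a16 folder.
The Qwen2-72B-padded-w8a16 directory is structured as follows: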

Qwen2-72B-padded-w8a16
├── config.json
├── generation_config.json
├── merges.txt
├── model.safetensors
├── quantize_config.json
├── tokenizer_config.json
├── tokenizer.json
├── tops_quantize_info.json
└── vocab.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
    --demo='te' \
    --model=[path of Qwen2-72B-padded-w8a16] \
    --tokenizer=[path of Qwen2-72B-padded-w8a16] \
    --num-prompts 1 \
    --max-model-len=32768 \
    --block-size=64 \
    --output-len=256 \
    --device=gcu \
    --dtype=float16 \
    --tensor-parallel-size=8 \
    --gpu-memory-utilization=0.945

Performance test

python3 -m vllm_utils.benchmark_test \
    --perf \
    --model [path of Qwen2-72B-padded-w8a16] \
    --tensor-parallel-size 8 \
    --max-model-len=32768 \
    --input-len=8000 \
    --output-len=8000  \
    --dtype=float16 \
    --device gcu \
    --num-prompts=1  \
    --block-size=64 \
    --gpu-memory-utilization=0.945

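Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen2-72B-Instruct

Inference and performance testing for this model require eight Enflame GCUs.

Model download

url: Qwen2-72B-Instruct
branch: main
commit id: da7ff8fb

Download everything under the URL above into the Qwen2-72B-Instruct folder.

Batch offline inference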

python3 -m vllm_utils.benchmark_test \
    --demo='te' \
    --model=[path of Qwen2-72B-Instruct]  \
    --tokenizer=[path of Qwen2-72B-Instruct]  \
    --num-prompts 1 \
    --block-size=64 \
    --max-model-len=32768 \
    --output-len=256 \
    --device=gcu \
    --dtype=float16 \
    --tensor-parallel-size=8 \
    --gpu-memory-utilization=0.945

Performance test

python3 -m vllm_utils.benchmark_test \
    --perf \
    --model [path of Qwen2-72B-Instruct] \
    --tensor-parallel-size 8 \
    --max-model-len=32768 \
    --input-len=8000 \
    --output-len=8000 \
    --dtype=float16 \
    --device gcu \
    --num-prompts=1 \
    --block-size=64 \
    --gpu-memory-utilization=0.945

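Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen2-1.5B-Instruct

Inference and performance testing for this model require one Enflame GCU.

Model download

url: Qwen2-1.5B-Instruct
branch: main
commit id: ba1cf18

Download everything under the URL above into the Qwen2-1.5B-Instruct folder.

Batch offline inference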

python3 -m vllm_utils.benchmark_test \
    --demo='te' \
    --model=[path of Qwen2-1.5B-Instruct] \
    --tokenizer=[path of Qwen2-1.5B-Instruct] \
    --num-prompts 1 \
    --block-size=64 \
    --output-len=256 \
    --device=gcu \
    --dtype=bfloat16 \
    --tensor-parallel-size=1 \
    --max-model-len=32768

Performance test

python3 -m vllm_utils.benchmark_test \
    --perf \
    --model [path of Qwen2-1.5B-Instruct] \
    --tensor-parallel-size 1 \
    --max-model-len=32768 \
    --input-len=8000 \
    --output-len=8000 \
    --dtype=bfloat16 \
    --device gcu \
    --num-prompts=1 \
    --block-size=64

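Notes:

A single GCU supports a max-model-len of 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-4B

Inference and performance testing for this model require one Enflame GCU.

Model download

url: Qwen1.5-4B
branch: main
commit id: a66363a0c24e2155c561e4b53c658b1d3965474e

Download everything under the URL above into the Qwen1.5-4B folder.

Install dependencies

pip3 install tiktoken

Batch offline inference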

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5-4B_model] \
 --tensor-parallel-size=1 \
 --output-len=256 \
 --demo=te \
 --max-model-len=32768 \
 --dtype=bfloat16

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5-4B_model] \
 --max-model-len=32768 \
 --tokenizer=[path of Qwen1.5-4B_model] \
 --input-len=2048 \
 --output-len=1024 \
 --tensor-parallel-size=1 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=bfloat16

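Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-4B-Chat

Inference and performance testing for this model require one Enflame GCU.

Model download

url: Qwen1.5-4B-Chat
branch: main
commit id: a7a4d4945d28bac955554c9abd2f74a71ebbf22f

Download everything under the URL above into the Qwen1.5-4B-Chat folder.

Install dependencies

pip3 install tiktoken

Batch offline inference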

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5-4B-Chat_model] \
 --max-model-len=32768 \
 --tensor-parallel-size=1 \
 --output-len=256 \
 --demo=te \
 --dtype=bfloat16

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5-4B-Chat_model] \
 --tokenizer=[path of Qwen1.5-4B-Chat_model] \
 --input-len=2048 \
 --output-len=1024 \
 --tensor-parallel-size=1 \
 --max-model-len=32768 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=bfloat16

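Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-32B-Chat-w8a16

Inference and performance testing for this model require two Enflame GCUs.

Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract Qwen1.5-32B-Chat-w8a16.tar, then copy all of its contents into the Qwen1.5-32B-Chat-w8a16 folder.
The Qwen1.5-32B-Chat-w8a16 directory is structured as follows: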

Qwen1.5-32B-Chat-w8a16
  ├── config.json
  ├── model.safetensors
  ├── quantize_config.json
  ├── tokenizer_config.json
  ├── tokenizer.json
  ├── tops_quantize_info.json
  └── vocab.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5-32B-Chat-w8a16] \
 --demo=te \
 --dtype=float16 \
 --quantization=w8a16 \
 --output-len=256 \
 --tensor-parallel-size=2 \
 --max-model-len=32768

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5-32B-Chat-w8a16] \
 --input-len=512 \
 --output-len=128 \
 --num-prompts=16 \
 --block-size=64 \
 --tensor-parallel-size=2 \
 --max-model-len=32768 \
 --dtype=float16 \
 --quantization=w8a16

Notes:

Two GCUs support a max-model-len of 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Serving mode

# Start the server
python3 -m vllm.entrypoints.openai.api_server --model=[path of Qwen1.5-32B-Chat-w8a16]  \
 --tensor-parallel-size 2 \
 --max-model-len=4096  \
 --disable-log-requests  \
 --gpu-memory-utilization=0.945  \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16

# Start the client
python3 -m vllm_utils.benchmark_serving --backend=vllm  \
 --dataset-name=random  \
 --model=[path of Qwen1.5-32B-Chat-w8a16]  \
 --num-prompts=10  \
 --random-input-len=4   \
 --random-output-len=300  \
 --trust-remote-code

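Notes:

To keep the input and output lengths fixed, the test uses a randomly generated dataset.
num-prompts, random-input-len, and random-output-len can be adjusted as needed.

Qwen1.5-72B-w8a16

Inference and performance testing for this model require eight Enflame GCUs.

Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract Qwen1.5-72B-w8a16.tar, then copy all of its contents into the Qwen1.5-72B-w8a16 folder.
The Qwen1.5-72B-w8a16 directory is structured as follows: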

Qwen1.5-72B-w8a16
  ├── config.json
  ├── model.safetensors
  ├── quantize_config.json
  ├── tokenizer_config.json
  ├── tokenizer.json
  ├── tops_quantize_info.json
  └── vocab.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5-72B-w8a16_model] \
 --demo=te \
 --dtype=float16 \
 --quantization=w8a16 \
 --tensor-parallel-size=8 \
 --output-len=256 \
 --max-model-len=32768

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5-72B-w8a16_model] \
 --input-len=512 \
 --output-len=128 \
 --num-prompts=1 \
 --block-size=64 \
 --max-model-len=32768 \
 --tensor-parallel-size=8 \
 --dtype=float16 \
 --quantization=w8a16

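Notes:

Eight GCUs support a max-model-len of 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-72B-Chat-w8a16

Inference and performance testing for this model require eight Enflame GCUs.

Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract Qwen1.5-72B-Chat-w8a16.tar, then copy all of its contents into the Qwen1.5-72B-Chat-w8a16 folder.
The Qwen1.5-72B-Chat-w8a16 directory is structured as follows: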

Qwen1.5-72B-Chat-w8a16
  ├── config.json
  ├── generation_config.json
  ├── merges.txt
  ├── model.safetensors
  ├── quantize_config.json
  ├── tokenizer_config.json
  ├── tokenizer.json
  ├── tops_quantize_info.json
  └── vocab.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5-72B-Chat-w8a16_model] \
 --demo=te \
 --dtype=float16 \
 --tensor-parallel-size=8 \
 --quantization=w8a16 \
 --output-len=256 \
 --max-model-len=32768

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5-72B-Chat-w8a16_model] \
 --input-len=512 \
 --output-len=128 \
 --num-prompts=1 \
 --block-size=64 \
 --tensor-parallel-size=8 \
 --max-model-len=32768 \
 --dtype=float16 \
 --quantization=w8a16

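Notes:

Eight GCUs support a max-model-len of 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-32B-w4a16

Inference and performance testing for this model require one Enflame GCU.

Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract Qwen1.5-32B-w4a16.tar, then copy all of its contents into the Qwen1.5_32B_w4a16_gptq folder.
The Qwen1.5_32B_w4a16_gptq directory is structured as follows: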

Qwen1.5_32B_w4a16_gptq/
  ├── config.json
  ├── vocab.json
  ├── generation_config.json
  ├── model.safetensors
  ├── quantize_config.json
  ├── merges.txt
  ├── tokenizer.json
  ├── tokenizer_config.json
  └── tops_quantize_info.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5_32B_w4a16_gptq] \
 --demo=te \
 --dtype=float16 \
 --quantization=gptq \
 --output-len=256

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5_32B_w4a16_gptq] \
 --input-len=1024 \
 --output-len=500 \
 --num-prompts=1 \
 --block-size=64 \
 --max-model-len=32768 \
 --dtype=float16 \
 --quantization=gptq

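Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

qwen2-72b-instruct-gptq-int4

Inference and performance testing for this model require four Enflame GCUs.

Model download

url: qwen2-72b-instruct-gptq-int4
branch: master
commit id: c7e75f6b

Download everything under the URL above into the qwen2-72b-instruct-gptq-int4 folder.

Batch offline inference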

python3 -m vllm_utils.benchmark_test \
 --model=[path of qwen2-72b-instruct-gptq-int4] \
 --tensor-parallel-size=4 \
 --max-model-len=32768 \
 --output-len=512 \
 --demo=te \
 --dtype=float16 \
 --device=gcu \
 --quantization=gptq

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of qwen2-72b-instruct-gptq-int4] \
 --max-model-len=32768 \
 --tokenizer=[path of qwen2-72b-instruct-gptq-int4] \
 --input-len=1024 \
 --output-len=500 \
 --num-prompts=1 \
 --tensor-parallel-size=4 \
 --block-size=64 \
 --quantization=gptq \
 --device=gcu

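Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-32B-w4a16c8

Inference and performance testing for this model require one Enflame GCU.

Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract Qwen1.5-32B-w4a16c8.tar, then copy all of its contents into the Qwen1.5_32B_w4a16c8 folder.
The Qwen1.5_32B_w4a16c8 directory is structured as follows: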

Qwen1.5_32B_w4a16c8/
  ├── config.json
  ├── vocab.json
  ├── generation_config.json
  ├── model.safetensors
  ├── quantize_config.json
  ├── int8_kv_cache.json
  ├── merges.txt
  ├── tokenizer.json
  ├── tokenizer_config.json
  └── tops_quantize_info.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5_32B_w4a16c8] \
 --demo=te \
 --dtype=float16 \
 --quantization-param-path=[path of int8_kv_cache.json] \
 --kv-cache-dtype=int8 \
 --output-len=256

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5_32B_w4a16c8] \
 --input-len=1024 \
 --output-len=500 \
 --num-prompts=1 \
 --block-size=64 \
 --max-model-len=32768 \
 --dtype=float16 \
 --quantization-param-path=[path of int8_kv_cache.json] \
 --kv-cache-dtype=int8

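Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen2-72B-Instruct-w4a16c8

Inference and performance testing for this model require two Enflame GCUs.

Model download

url: Qwen2-72B-Instruct-GPTQ-Int4
branch: main
commit id: 9d456ed
You also need int8_kv_cache.json; contact your sales representative to obtain EGC access to download it.

Place the downloaded Qwen2-72B-Instruct-GPTQ-Int4 and int8_kv_cache.json in the Qwen2_72B_Instruct_w4a16c8 folder.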
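Assembling the folder can be sketched as follows. This assumes the checkpoint files are meant to sit at the top level of Qwen2_72B_Instruct_w4a16c8 (since --model points at that folder) next to int8_kv_cache.json; the helper name `assemble_w4a16c8` and the example paths are made up for illustration.

```shell
# assemble_w4a16c8 <downloaded-gptq-dir> <int8_kv_cache.json> <target-dir>
# Copies the GPTQ checkpoint files and the KV-cache quantization parameter
# file into one folder.
assemble_w4a16c8() {
  local src=$1 kv_json=$2 dest=$3
  mkdir -p "$dest" || return 1
  # "src/." copies the directory contents, including hidden files.
  cp -r "$src"/. "$dest"/ || return 1
  cp "$kv_json" "$dest/int8_kv_cache.json"
}

# Example (placeholder paths):
# assemble_w4a16c8 ./Qwen2-72B-Instruct-GPTQ-Int4 ./int8_kv_cache.json ./Qwen2_72B_Instruct_w4a16c8
```

With this layout, --quantization-param-path can point at the int8_kv_cache.json inside the target folder.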

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen2_72B_Instruct_w4a16c8] \
 --demo=te \
 --dtype=float16 \
 --quantization-param-path=[path of int8_kv_cache.json] \
 --kv-cache-dtype=int8 \
 --output-len=256

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen2_72B_Instruct_w4a16c8] \
 --input-len=1024 \
 --output-len=500 \
 --num-prompts=1 \
 --block-size=64 \
 --max-model-len=32768 \
 --dtype=float16 \
 --quantization-param-path=[path of int8_kv_cache.json] \
 --kv-cache-dtype=int8

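Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-72B-w4a16

Inference and performance testing for this model require eight Enflame GCUs.

Model download

url: Qwen1.5-72B-Chat-awq
branch: master
commit id: 4b52e410

Batch offline inference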

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5-72B-Chat-awq] \
 --demo=te \
 --dtype=float16 \
 --tensor-parallel-size 8 \
 --output-len=512 \
 --device gcu \
 --enforce-eager

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5-72B-Chat-awq] \
 --tensor-parallel-size 8 \
 --input-len=20480 \
 --output-len=1024 \
 --num-prompts=1 \
 --block-size=64 \
 --max-model-len=32768 \
 --dtype=float16 \
 --device gcu \
 --enforce-eager

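Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen2-57B-A14B

Inference and performance testing for this model require four Enflame GCUs.

Model download

url: Qwen2-57B-A14B
branch: master
commit id: d8cb5700

Batch offline inference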

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen2-57B-A14B] \
 --demo=te \
 --dtype=bfloat16 \
 --tensor-parallel-size 4 \
 --max-model-len=8192 \
 --output-len=512 \
 --device gcu

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen2-57B-A14B] \
 --tensor-parallel-size 4 \
 --input-len=4096 \
 --output-len=4096 \
 --num-prompts=1 \
 --block-size=64 \
 --max-model-len=8192 \
 --dtype=bfloat16 \
 --device gcu

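Notes:

The maximum max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-110B-Chat-w8a16

Inference and performance testing for this model require eight Enflame GCUs.

Model download

To download the weights, contact your sales representative to obtain EGC access.

Download and extract QWen1.5-110B-Chat-w8a16.tar, then copy all of its contents into the QWen1.5-110B-Chat_w8a16 folder.
The QWen1.5-110B-Chat_w8a16 directory is structured as follows: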

QWen1.5-110B-Chat_w8a16/
            ├── config.json
            ├── generation_config.json
            ├── model.safetensors
            ├── quantize_config.json
            ├── tokenizer.json
            ├── tokenizer_config.json
            ├── merges.txt
            ├── tops_quantize_info.json
            └── vocab.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5-110B-Chat] \
 --tensor-parallel-size=8 \
 --output-len=256 \
 --demo=te \
 --max-model-len=32768 \
 --dtype=float16

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5-110B-Chat] \
 --max-model-len=32768 \
 --tokenizer=[path of Qwen1.5-110B-Chat] \
 --input-len=1000 \
 --output-len=3000 \
 --tensor-parallel-size=8 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16

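Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Qwen1.5-32B-Chat-w4a16c8

Inference and performance testing for this model require one Enflame GCU.

Model download

url: Qwen1.5-32B-Chat-GPTQ-Int4
branch: main
commit id: 226cd6ec86d885563fb5c7c2c4560a035564f20f
You also need int8_kv_cache.json; contact your sales representative to obtain EGC access to download it.

Place the downloaded Qwen1.5-32B-Chat-GPTQ-Int4 and int8_kv_cache.json in the Qwen1.5_32B_Chat_w4a16c8 folder.

Batch offline inference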

python3 -m vllm_utils.benchmark_test \
 --model=[path of Qwen1.5_32B_Chat_w4a16c8] \
 --demo=te \
 --dtype=float16 \
 --quantization-param-path=[path of int8_kv_cache.json] \
 --kv-cache-dtype=int8 \
 --output-len=256

Performance test

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Qwen1.5_32B_Chat_w4a16c8] \
 --input-len=1024 \
 --output-len=500 \
 --num-prompts=1 \
 --block-size=64 \
 --max-model-len=32768 \
 --dtype=float16 \
 --quantization-param-path=[path of int8_kv_cache.json] \
 --kv-cache-dtype=int8

Notes:

The maximum max-model-len supported by this model is 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.