4.2. DeepSeek Series

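Introduction

The DeepSeek series is a family of large language models developed by DeepSeek, a company based in Hangzhou. The series performs well in both Chinese and English, and is particularly competitive at reasoning, code generation, and multi-turn dialogue.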

DeepSeek-R1

Inference and performance testing for this model require 32 Enflame GCUs.

Model download

url: DeepSeek-R1-awq

requirements

python3 -m pip install transformers==4.48.3
python3 -m pip install langdetect==1.0.9
python3 -m pip install immutabledict==4.2.1

export ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1

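Notes:

Environment: python3.10; any transformers version >= 4.48.3 works;

The environment variable ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 enables the torch-gcu automatic migration feature;

--dist-init-addr: choose one machine as the master node, then fill in the master node's host IP and any unused port;

GLOO_SOCKET_IFNAME: run ifconfig -a, find the interface whose inet field matches the machine's actual IP, and use that interface name as the value;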
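
The interface lookup described in the last note can be scripted. The following is a minimal sketch, assuming the addresses are listed with `ip -o -4 addr show` (one address per line) rather than `ifconfig -a`, because that format is easier to parse. The helper name `find_interface` and the sample addresses are hypothetical.

```python
from typing import Optional


def find_interface(ip_addr_output: str, host_ip: str) -> Optional[str]:
    """Return the interface whose IPv4 address equals host_ip, or None.

    Expects the one-line-per-address format printed by `ip -o -4 addr show`.
    """
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Expected layout: "<index>: <ifname> inet <addr>/<prefix> ..."
        if len(fields) >= 4 and fields[2] == "inet":
            if fields[3].split("/")[0] == host_ip:
                return fields[1]
    return None


# Hypothetical sample output; real interface names and addresses will differ.
SAMPLE = (
    "1: lo    inet 127.0.0.1/8 scope host lo\n"
    "2: eth0    inet 10.0.0.187/24 brd 10.0.0.255 scope global eth0\n"
)

print(find_interface(SAMPLE, "10.0.0.187"))  # eth0
```

The returned name is the value to pass as GLOO_SOCKET_IFNAME on that machine.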

Online inference example

Start the server

#187 master node
env GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 \
python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  \
--node-rank 0 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  \
--max-prefill-tokens 4096 --chunked-prefill-size -1

#188
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 \
OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_2 python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] \
--host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  --node-rank 1 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code \
--cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache --max-prefill-tokens 4096 --chunked-prefill-size -1

#186
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 \
OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_3 python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] \
--host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  --node-rank 2 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code \
--cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  --max-prefill-tokens 4096 --chunked-prefill-size -1

#189
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True \
DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_4 \
python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746 \
--node-rank 3 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code \
--cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  --max-prefill-tokens 4096 --chunked-prefill-size -1

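Notes:

--port: can be set to any unused port on the local machine;

--context-length: sets the model's maximum context length in tokens (prompt plus generated output);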
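
The four launch commands above differ only in --node-rank (and in the per-node OUTLINES_CACHE_DIR environment variable). As a rough sketch, the argument list for each node could be generated from one template, which keeps the shared flags from drifting apart. The helper `build_launch_args`, the model path, and the master IP below are hypothetical, and the environment variables still have to be set separately on each node.

```python
import shlex

# Flags shared by every node, copied from the commands above.
COMMON_FLAGS = [
    "--host", "0.0.0.0", "--port", "5000",
    "--nnodes", "4", "--dp-size", "8", "--tp-size", "4", "--ep-size", "32",
    "--enable-ep-moe", "--trust-remote-code",
    "--cuda-graph-max-bs", "64", "--mem-fraction-static", "0.65",
    "--disable-radix-cache", "--max-prefill-tokens", "4096",
    "--chunked-prefill-size", "-1",
]


def build_launch_args(model_path: str, master_ip: str, node_rank: int) -> list[str]:
    """Build the sglang.launch_server argument list for one node."""
    return [
        "python3.10", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--dist-init-addr", f"{master_ip}:8746",
        "--node-rank", str(node_rank),
        *COMMON_FLAGS,
    ]


# Hypothetical model path and master IP, for illustration only.
for rank in range(4):
    print(shlex.join(build_launch_args("/models/deepseek-r1-awq", "10.0.0.187", rank)))
```

Each printed line is the command for one node; only the value after --node-rank changes between them.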

Start the router on the same machine as the master server

#router:
#187
python3.10 -m sglang_router.launch_router --worker-urls http://[master node ip]:5000 --host 0.0.0.0 --port 30000

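Notes:

--port: can be set to any unused port on the local machine;

--context-length: sets the model's maximum context length in tokens (prompt plus generated output);

--dist-init-addr: choose one machine as the master node, then fill in the master node's host IP and any unused port;

GLOO_SOCKET_IFNAME: run ifconfig -a, find the interface whose inet field matches the machine's actual IP, and use that interface name as the value;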

Send a request from the client

import requests
from sglang.utils import print_highlight

# The router started above listens on port 30000.
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "[ path of DeepSeek-R1-awq ]",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
response.raise_for_status()  # fail early on an HTTP error status
print_highlight(response.json())
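
The client script above prints the whole JSON response. If only the answer text is needed, it can be pulled out of the response body. This is a small sketch that assumes the server returns the usual OpenAI-style chat-completions shape (`choices[0].message.content`); the helper `extract_reply` and the sample dictionary are illustrative and not real server output.

```python
def extract_reply(completion: dict) -> str:
    """Return the assistant text of the first choice in a chat completion."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {completion}")
    return choices[0]["message"]["content"]


# Hand-written sample in the OpenAI chat-completions shape, not real server output.
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Paris."}}
    ]
}
print(extract_reply(sample))  # Paris.
```

With the client script, this would be called as `extract_reply(response.json())`.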

Performance testing

Start the sglang server first, as in the online example.

Start the server

#server：
#187 master node
env GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 \
python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  \
--node-rank 0 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 64 --mem-fraction-static 0.65 \
--disable-radix-cache  --max-prefill-tokens 4096 --chunked-prefill-size -1

#188
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_2 \
python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  \
--node-rank 1 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache --max-prefill-tokens 4096 --chunked-prefill-size -1

#186
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_3 \
python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  \
--node-rank 2 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  --max-prefill-tokens 4096 --chunked-prefill-size -1

#189
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_4 \
python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  \
--node-rank 3 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  --max-prefill-tokens 4096 --chunked-prefill-size -1

Start the router on the same machine as the master server

#router:
#187
python3.10 -m sglang_router.launch_router --worker-urls http://[master node ip]:5000 --host 0.0.0.0 --port 30000

Send requests from the client

# Performance test
python3.10 -m sglang.gcu_bench_serving --backend sglang --dataset-name random --random-input 1000 --random-output 700 --random-range-ratio 1 --num-prompts 32 --host [master node ip] --port 30000 --extra-request-body '{"temperature": 0.7, "top_k": 50, "top_p": 0.95, "repetition_penalty": 1.0}' --output-file "deepseekv3_nnode4.jsonl"

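Notes:

Use --dataset-path to specify where the dataset is stored; otherwise it is read from /tmp/ by default;

The sharegpt dataset is used by default (file name: ShareGPT_V3_unfiltered_cleaned_split.json); use --dataset-name to select another dataset;

--num-prompts: the number of prompts to send, which is the batch size in this test;

--random-input: the input length in tokens;

--random-output: the output length in tokens;

--mem-fraction-static: the fraction of device memory reserved for static allocations (model weights and the KV cache); the remainder is used by activations and other allocations;

--max-prefill-tokens: the maximum number of tokens in a single prefill; adjust as needed;

--cuda-graph-max-bs: the maximum batch size that CUDA graph needs to capture;

--dist-init-addr: choose one machine as the master node, then fill in the master node's host IP and any unused port;

GLOO_SOCKET_IFNAME: run ifconfig -a, find the interface whose inet field matches the machine's actual IP, and use that interface name as the value;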
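
To make the size of the workload concrete, the token counts implied by the benchmark command above can be worked out directly. The sketch below assumes that --random-range-ratio 1 makes every request use exactly the requested input and output lengths, as in the upstream sglang benchmark script; whether gcu_bench_serving behaves identically is an assumption. The helper `workload_tokens` is hypothetical.

```python
def workload_tokens(num_prompts: int, input_len: int, output_len: int) -> dict:
    """Token totals for a fixed-length random workload."""
    return {
        "input_tokens": num_prompts * input_len,
        "output_tokens": num_prompts * output_len,
        "total_tokens": num_prompts * (input_len + output_len),
    }


# Values from the benchmark command above.
print(workload_tokens(num_prompts=32, input_len=1000, output_len=700))
# {'input_tokens': 32000, 'output_tokens': 22400, 'total_tokens': 54400}
```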

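Accuracy verification

This section uses IFEval as the example. For more details or for other benchmarks, see the README file in each benchmark's folder under sglang/benchmark.

Start the sglang server first.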

#187 master node
env GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 \
python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  \
--node-rank 0 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  \
--max-prefill-tokens 4096 --chunked-prefill-size -1 --reasoning-parser deepseek-r1

#188
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True  \
DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_2 \
python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746 \
--node-rank 1 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code \
--cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache --max-prefill-tokens 4096 --chunked-prefill-size -1 --reasoning-parser deepseek-r1

#186
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True \
DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_3 \
python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746 \
--node-rank 2 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code \
--cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  --max-prefill-tokens 4096 --chunked-prefill-size -1 --reasoning-parser deepseek-r1

#189
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True \
DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_4 \
python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746 \
--node-rank 3 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code \
--cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  --max-prefill-tokens 4096 --chunked-prefill-size -1 --reasoning-parser deepseek-r1

Start the router on the same machine as the master server

#router:
#187
python3.10 -m sglang_router.launch_router --worker-urls http://[master node ip]:5000 --host 0.0.0.0 --port 30000

Install lm-eval:

pip install lm-eval[api]==0.4.8

Run the accuracy verification

lm_eval --model local-chat-completions \
--model_args model=[path of deepseek-r1-awq],tokenizer=[path of deepseek-r1-awq],base_url=http://[master node ip]:30000/v1/chat/completions,num_concurrent=1,max_retries=3,max_length=32768 \
--tasks ifeval \
--output_path /home/lm_eval_datasets_0.8.0_w4a16_bs1/ \
--trust_remote_code --batch_size 1 --log_samples --apply_chat_template --gen_kwargs temperature=0.7,top_k=50,top_p=0.95,repetition_penalty=1.0,max_gen_toks=16384 --seed 0,0,0,0

Notes:

--tasks ifeval: runs the test with the ifeval task;

View the results:

The test results are printed to the terminal

|Tasks |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval|      4|none  |     0|inst_level_loose_acc   |↑  |0.9173|±  |   N/A|
|      |       |none  |     0|inst_level_strict_acc  |↑  |0.8837|±  |   N/A|
|      |       |none  |     0|prompt_level_loose_acc |↑  |0.8780|±  |0.0141|
|      |       |none  |     0|prompt_level_strict_acc|↑  |0.8318|±  |0.0161|
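
When the scores need to be compared across runs, the table above can be turned into a dictionary. The parser below is only a sketch: it assumes the pipe-separated column order shown in the table (metric name in the fifth column, value in the seventh), and `parse_metrics` is a hypothetical helper, not part of lm-eval.

```python
def parse_metrics(table: str) -> dict[str, float]:
    """Map each metric name in an lm_eval results table to its value."""
    metrics = {}
    for line in table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Columns: Tasks, Version, Filter, n-shot, Metric, arrow, Value, ±, Stderr
        if len(cells) < 7:
            continue
        try:
            metrics[cells[4]] = float(cells[6])
        except ValueError:
            continue  # header and separator rows have no numeric value
    return metrics


# Rows copied from the table above.
TABLE = """
|Tasks |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval|      4|none  |     0|inst_level_loose_acc   |↑  |0.9173|±  |   N/A|
|      |       |none  |     0|inst_level_strict_acc  |↑  |0.8837|±  |   N/A|
|      |       |none  |     0|prompt_level_loose_acc |↑  |0.8780|±  |0.0141|
|      |       |none  |     0|prompt_level_strict_acc|↑  |0.8318|±  |0.0161|
"""

print(parse_metrics(TABLE)["prompt_level_strict_acc"])  # 0.8318
```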

DeepSeek-R1-Distill-Qwen-14B

Inference and performance testing for this model require 1 Enflame GCU.

Model download

url: DeepSeek-R1-Distill-Qwen-14B
branch: main
commit id: 1df8507178afcc1bef68cd8c393f61a886323761

Download the contents at the url above into a local folder named DeepSeek-R1-Distill-Qwen-14B.

requirements

python3 -m pip install transformers==4.48.3
python3 -m pip install pynvml==12.0.0
python3 -m pip install lm_eval==0.4.8
python3 -m pip install langdetect==1.0.9
python3 -m pip install immutabledict==4.2.1

export ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1

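Notes:

Environment: python3.10; any transformers version >= 4.48.3 works;

The environment variable ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 enables the torch-gcu automatic migration feature;

Online inference example

Start the server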

python3 -m sglang.launch_server --model-path [ path of DeepSeek-R1-Distill-Qwen-14B ] --host 0.0.0.0 --port 8089  --dp-size 1 --tp-size 1 --trust-remote-code

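Notes:

--port: can be set to any unused port on the local machine;

--context-length: sets the model's maximum context length in tokens (prompt plus generated output);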
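
The server takes some time to load the model, so a client that starts immediately may fail to connect. One way to handle this is to poll until the server answers. The sketch below is a generic retry loop; it assumes the server exposes a GET /health endpoint, as upstream sglang does, which has not been verified for this build. The helpers `http_ok` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     delay: float = 5.0) -> bool:
    """Call probe until it returns True or the attempts run out."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            time.sleep(delay)  # wait before the next attempt
    return False
```

For the server started above, this could be called as `wait_until_ready(lambda: http_ok("http://localhost:8089/health"))` before sending the first request.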

Send a request from the client

import requests
from sglang.utils import print_highlight

# The server started above listens on port 8089.
url = "http://localhost:8089/v1/chat/completions"
data = {
    "model": "[ path of DeepSeek-R1-Distill-Qwen-14B ]",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
response.raise_for_status()  # fail early on an HTTP error status
print_highlight(response.json())

Performance testing

Start the sglang server first, as in the online example.

Start the server

python3 -m sglang.launch_server --model-path [ path of DeepSeek-R1-Distill-Qwen-14B ] --host 0.0.0.0 --port 8089  --dp-size 1 --tp-size 1 --trust-remote-code --context-length 32768 --max-prefill-tokens 256 --mem-fraction-static 0.9

Run the performance test

python3 -m sglang.gcu_bench_serving --backend sglang --dataset-name random --random-range-ratio 1.0 --num-prompts 1 --random-input-len 1024 --random-output-len 1024 --host 0.0.0.0 --port 8089

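Notes:

Use --dataset-path to specify where the dataset is stored; otherwise it is read from /tmp/ by default;

The sharegpt dataset is used by default (file name: ShareGPT_V3_unfiltered_cleaned_split.json); use --dataset-name to select another dataset;

Use --num-prompts, --random-input-len, --random-output-len, and related parameters to customize the test size and the input and output lengths. See the official documentation and the script comments for the full parameter list and usage;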
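
Since the test size is controlled entirely by command-line flags, a sweep over several batch sizes can be prepared by generating one command per setting. The sketch below uses only the flags that appear in the command above; the helper `bench_command` and the batch sizes in the loop are illustrative and do not come from the source.

```python
import shlex


def bench_command(num_prompts: int, input_len: int = 1024,
                  output_len: int = 1024, port: int = 8089) -> list[str]:
    """Build one gcu_bench_serving invocation using the flags shown above."""
    return [
        "python3", "-m", "sglang.gcu_bench_serving",
        "--backend", "sglang",
        "--dataset-name", "random",
        "--random-range-ratio", "1.0",
        "--num-prompts", str(num_prompts),
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


# The batch sizes below are illustrative, not values from the source.
for batch_size in (1, 4, 16):
    print(shlex.join(bench_command(batch_size)))
```

Each list can also be passed to `subprocess.run` once the server is up.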

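Accuracy verification

This section uses MMLU as the example. For more details or for other benchmarks, see the README file in each benchmark's folder under sglang/benchmark.

Dataset preparation:

Download address:
- url: MMLU

Download the contents at the url above into a local folder named mmlu_dataset.

Accuracy test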

lm_eval --model sglang \
    --model_args pretrained=[ path of DeepSeek-R1-Distill-Qwen-14B ],host=0.0.0.0,port=8089,dp_size=1,tp_size=1,dtype=bfloat16,chunked_prefill_size=8192,mem_fraction_static=0.7  \
    --tasks mmlu --num_fewshot 0 --trust_remote_code --output_path ./outputs/mmlu_qwen_14b/ --log_samples --seed 0 --verbosity DEBUG --show_config \
    --batch_size 2

View the results

The test results are printed to the terminal

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7298|±  |0.0036|
| - humanities     |      2|none  |      |acc   |↑  |0.6508|±  |0.0066|
| - other          |      2|none  |      |acc   |↑  |0.7776|±  |0.0072|
| - social sciences|      2|none  |      |acc   |↑  |0.8297|±  |0.0067|
| - stem           |      2|none  |      |acc   |↑  |0.7031|±  |0.0079|

DeepSeek-R1-Distill-Llama-8B

Inference and performance testing for this model require 1 Enflame GCU.

Model download

url: DeepSeek-R1-Distill-Llama-8B
branch: main
commit id: 6a6f4aa4197940add57724a7707d069478df56b1

Download the contents at the url above into a local folder named DeepSeek-R1-Distill-Llama-8B.

requirements

python3 -m pip install transformers==4.48.3 datasets==3.5.0

export ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1

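Notes:

Environment: python3.10; any transformers version >= 4.48.3 works;

The environment variable ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 enables the torch-gcu automatic migration feature;

Online inference example

Start the server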

python3 -m sglang.launch_server --model-path [ path of DeepSeek-R1-Distill-Llama-8B ] --host 0.0.0.0 --port 8089  --dp-size 1 --tp-size 1 --trust-remote-code

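Notes:

--port: can be set to any unused port on the local machine;

--context-length: sets the model's maximum context length in tokens (prompt plus generated output);

Send a request from the client

import requests
from sglang.utils import print_highlight

# The server started above listens on port 8089.
url = "http://localhost:8089/v1/chat/completions"
data = {
    "model": "[ path of DeepSeek-R1-Distill-Llama-8B ]",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
response.raise_for_status()  # fail early on an HTTP error status
print_highlight(response.json())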

Performance testing

Start the sglang server first, as in the online example.

Start the server

python3 -m sglang.launch_server --model-path [ path of DeepSeek-R1-Distill-Llama-8B ] --host 0.0.0.0 --port 8089  --dp-size 1 --tp-size 1 --trust-remote-code

Run the performance test

python3 -m sglang.gcu_bench_serving --backend sglang --dataset-name random --random-range-ratio 1.0 --num-prompts 16 --random-input-len 1024 --random-output-len 1024   --host 0.0.0.0 --port 8089

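Notes:

Use --dataset-path to specify where the dataset is stored; otherwise it is read from /tmp/ by default;

The sharegpt dataset is used by default (file name: ShareGPT_V3_unfiltered_cleaned_split.json); use --dataset-name to select another dataset;

Use --num-prompts, --random-input-len, --random-output-len, and related parameters to customize the test size and the input and output lengths. See the official documentation and the script comments for the full parameter list and usage;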

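Accuracy verification

This section uses MMLU as the example. For more details or for other benchmarks, see the README file in each benchmark's folder under sglang/benchmark.

Dataset preparation

Download address:
- url: MMLU

Download the contents at the url above into a local folder named mmlu_dataset.

- Download the sglang source code: https://github.com/sgl-project/sglang

Start the server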

python3 -m sglang.launch_server  --model-path [ path of DeepSeek-R1-Distill-Llama-8B ] --port 30000  --disable-radix-cache  --chunked-prefill-size -1 --mem-fraction-static 0.8

Accuracy test

python3 sglang/benchmark/mmlu/bench_sglang.py --nsub 10 --port 30000 --data_dir [ path of MMLU ] # Test 10 subjects

View the results

# The test results are stored in a file named result.jsonl in the current directory
cat result.jsonl | grep -oP '"accuracy": \K\d+\.\d+'
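
The same extraction can be done in Python when the accuracy is needed in a script. This is a sketch under one assumption taken from the grep pattern above: each line of result.jsonl is a JSON object that may contain a numeric "accuracy" key. The helper `read_accuracies` and the sample record are hypothetical.

```python
import json


def read_accuracies(jsonl_text: str) -> list[float]:
    """Collect the "accuracy" value from every JSON line that has one."""
    values = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if "accuracy" in record:
            values.append(float(record["accuracy"]))
    return values


# Hypothetical record; only the "accuracy" key is implied by the grep above.
sample = '{"task": "mmlu", "accuracy": 0.5}\n'
print(read_accuracies(sample))  # [0.5]
```

For the real file, pass in the file contents, for example `read_accuracies(open("result.jsonl").read())`.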