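3.1. Introduction to SGLang

SGLang is a fast serving framework for large language models and vision-language models.

Key innovations of SGLang:

RadixAttention

Uses a radix tree to manage the KV cache, so that shared prefixes are reused across the turns of a multi-turn conversation. Impact: in multi-turn tasks, the cache hit rate improves by 3-5x, which significantly reduces latency.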
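The prefix-reuse idea can be illustrated with a small sketch. It is not SGLang's implementation: it uses a plain token-level trie instead of a compressed radix tree, stores no KV tensors, and the class name `PrefixCache` is hypothetical. It only shows how a second turn that repeats the conversation history finds its prefix already cached.

```python
class PrefixCache:
    """Toy token-level trie; each node stands for one cached token position."""

    def __init__(self):
        self.root = {}

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record a full token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()
turn1 = [1, 2, 3, 4]         # hypothetical token ids: system prompt + first user message
turn2 = turn1 + [5, 6, 7]    # same history plus the next turn

first_hit = cache.match_prefix(turn1)    # nothing is cached yet
cache.insert(turn1)
second_hit = cache.match_prefix(turn2)   # the shared history is found
print(first_hit, second_hit)
```

In a real server the matched prefix length decides how many tokens can skip prefill; only the unmatched suffix needs new computation.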

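Structured output support

Constrained decoding, implemented with regular expressions and a finite-state machine (FSM), generates structured data (for example, JSON) directly.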
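A minimal sketch of FSM-constrained decoding follows. It is an illustration under simplifying assumptions, not SGLang's code: tokens are single characters, the FSM is built from a fixed list of allowed strings rather than compiled from a regular expression, and `fake_scores` is a hypothetical stand-in for model logits.

```python
def build_fsm(allowed_strings):
    """Build a character-level FSM that accepts exactly the given strings."""
    transitions = {0: {}}
    accepting = set()
    next_id = 1
    for s in allowed_strings:
        state = 0
        for ch in s:
            if ch not in transitions[state]:
                transitions[state][ch] = next_id
                transitions[next_id] = {}
                next_id += 1
            state = transitions[state][ch]
        accepting.add(state)
    return transitions, accepting


def constrained_decode(score_fn, transitions, accepting):
    """Greedy decoding in which only tokens allowed by the FSM can be picked."""
    state, out = 0, ""
    while state not in accepting:
        allowed = transitions[state]   # tokens with an outgoing edge from this state
        scores = score_fn(out)
        best = max(allowed, key=lambda ch: scores.get(ch, float("-inf")))
        out += best
        state = allowed[best]
    return out


def fake_scores(prefix):
    # Hypothetical model scores: the model prefers "x", which the FSM never allows.
    return {"x": 5.0, "f": 2.0, "t": 1.0, "a": 0.5, "l": 0.4,
            "s": 0.3, "e": 0.2, "r": 0.1, "u": 0.1}


transitions, accepting = build_fsm(["true", "false"])
result = constrained_decode(fake_scores, transitions, accepting)
print(result)
```

The unconstrained model would emit "x" first; masking to the FSM's outgoing edges forces the output to stay inside the allowed language.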

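Compiler-inspired design

The front-end DSL simplifies the programming of complex tasks, while the back-end runtime optimizes scheduling and resource allocation.

SGLang-GCU software installation

Install sglang and all of its dependencies automatically with a single command: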

./TopsRider*.run -C sgl-kernel -y --python python3.10

Verifying archive integrity...  100%   MD5 checksums are OK. All good.
Uncompressing ENFLAME TOPSRIDER PACKAGE  100%
Logging file: /tmp/topsinstaller/TopsRider20250521-085545.log
[1/17] Install TopsPlatform Package
[2/17] Install Enflame TopsFactor Library
[3/17] Install Enflame TopsAten
[4/17] Install Enflame TopsRider SDK
[5/17] Install Enflame Collective Communications Library
[6/17] Install Benchmark For Enflame Collective Communications Library
[7/17] Install Enflame Tops Graph Compiler
[8/17] Install Enflame Tops Graph Compiler for Python 3.10
[9/17] Install Torch GCU for PyTorch 2.6.x Package
[10/17] Install flash-attn GCU for torch.2.6.0 Python 3.10
[11/17] Install xformers 0.0.29 GCU for Python 3.10
[12/17] Install tops-extension GCU for torch.2.6.0 Python 3.10
[13/17] Install VLLM 0.8.0 GCU for Python 3.9 above
[14/17] Install Trition GCU
[15/17] Install GCU for python 3.10
[16/17] Install Enflame Kernel Library for SGLang Python 3.10
[17/17] Install libtorch for GCU to /usr/local/topsrider/libtorch_gcu
Install Finished. 17 installed.

Note: you can also select the components to install manually through the GUI of TopsRider*.run.

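SGLang server parameters

Model, processor and tokenizer

No.	Parameter	Description
1	model_path	Path to the model weights. Can be a local folder or a Hugging Face repo ID.
2	tokenizer_path	Path to the tokenizer. Defaults to model_path.
3	trust_remote_code	Whether to allow execution of remote code (custom model code from the model repository). Defaults to False.
4	device	Device to run on. Defaults to GCU.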

Serving: HTTP & API

No.	Parameter	Description
1	host	Host of the HTTP server. Defaults to "127.0.0.1".
2	port	Service port. Defaults to 30000.

Parallelism

No.	Parameter	Description
1	tp_size	Tensor parallelism size. Defaults to 1.
2	dp_size	Data parallelism size. Defaults to 1.
3	ep_size	Expert parallelism size. Defaults to 1.

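Multi-node distributed serving

No.	Parameter	Description
1	dist_init_addr	TCP address used to initialize the PyTorch distributed backend.
2	nnodes	Total number of nodes in the cluster.
3	node_rank	Rank (ID) of this node among the nnodes nodes of the distributed setup.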

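Sampling

No.	Parameter	Description
1	top_p	Sample from the smallest sorted set of tokens whose cumulative probability exceeds top_p. With top_p = 1 this reduces to unrestricted sampling from all tokens.
2	top_k	Sample randomly from the k highest-probability tokens.
3	min_p	Sample only from tokens whose probability is greater than min_p * (the highest token probability).
4	temperature	Controls randomness when sampling the next token. temperature = 0 corresponds to greedy sampling; higher temperatures produce more diverse output.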
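The four sampling parameters can be read as successive filters on the next-token distribution. The sketch below is a plain-Python illustration of the definitions in the table, not SGLang's sampler; the function names and the order in which the filters are applied are assumptions made for the example.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0 here)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_candidates(probs, top_k=None, top_p=1.0, min_p=0.0):
    """Return the token ids that survive top_k, top_p and min_p, most likely first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    # top_p: smallest sorted prefix whose cumulative probability reaches top_p
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # min_p: drop tokens below min_p * (highest probability)
    threshold = min_p * probs[order[0]]
    return [i for i in kept if probs[i] >= threshold]


def sample(logits, temperature=1.0, rng=random, **filters):
    """Pick one token id; temperature = 0 means greedy (argmax)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = apply_temperature(logits, temperature)
    kept = filter_candidates(probs, **filters)
    return rng.choices(kept, weights=[probs[i] for i in kept])[0]


probs = [0.5, 0.3, 0.15, 0.05]
print(filter_candidates(probs, top_k=3))    # keeps the 3 most likely ids
print(filter_candidates(probs, top_p=0.7))  # 0.5 + 0.3 already covers 0.7
print(filter_candidates(probs, min_p=0.2))  # threshold is 0.2 * 0.5 = 0.1
print(sample([1.0, 3.0, 2.0], temperature=0))
```

With an OpenAI-compatible endpoint these values are normally sent per request, so the server-side filters are applied to each generation independently.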

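Advanced features

Quantization

SGLang supports model quantization in two forms: offline and online. Offline quantization loads model weights that are already quantized (for example GPTQ, AWQ, FP8/INT4) and needs no extra arguments. Online quantization quantizes the weights dynamically at load time through the --quantization or --torchao-config argument and supports several quantization strategies. Common quantization types include GPTQ, AWQ, FP8, INT4 and W8A8 (see the official quantization documentation for details).

Note: offline quantization, that is, loading pre-quantized model weights directly, is currently recommended. The quantized models mainly supported at present are DeepSeekV3 gptq and DeepSeekR1 awq.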

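Offline quantization

To load an already-quantized model, you only need the model weights and configuration files. To repeat: if the model has been quantized offline, do not add the --quantization argument when starting the engine; the quantization method is parsed from the downloaded Hugging Face configuration file. For example, the DeepSeek V3/R1 models used here are already quantized and can be loaded directly.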
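To check which method a downloaded checkpoint declares before launching, something like the following can be used. It is a sketch that assumes the usual Hugging Face layout, where `config.json` carries a `quantization_config` object with a `quant_method` field; the helper name `detect_quant_method` is hypothetical, and how SGLang itself parses the file is not shown here.

```python
import json
import tempfile
from pathlib import Path


def detect_quant_method(model_dir):
    """Return the quantization method declared in config.json, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")


# Demo with a throwaway directory standing in for a downloaded checkpoint.
with tempfile.TemporaryDirectory() as model_dir:
    fake_config = {"quantization_config": {"quant_method": "awq", "bits": 4}}
    (Path(model_dir) / "config.json").write_text(json.dumps(fake_config))
    method = detect_quant_method(model_dir)

print(method)
```

A result of None suggests the checkpoint is not pre-quantized, in which case the offline path described above does not apply.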

Start the server

env GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True \
DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] \
--host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  --node-rank 0 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe \
--trust-remote-code --cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  --max-prefill-tokens 4096 --chunked-prefill-size -1

#188
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True \
DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_2 python3.10 -m sglang.launch_server \
--model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746 \
--node-rank 1 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache --max-prefill-tokens 4096 --chunked-prefill-size -1

#186
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True \
DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_3 python3.10 -m sglang.launch_server \
--model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746 \
--node-rank 2 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  --max-prefill-tokens 4096 --chunked-prefill-size -1

#189
env  GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True \
DS_V3_PARALLEL_MAX_TOTAL_TOKENS=1024 OUTLINES_CACHE_DIR=/home/deepseek_r1_test/.cache/outlines_4 python3.10 -m sglang.launch_server \
--model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746 \
--node-rank 3 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 64 --mem-fraction-static 0.65 --disable-radix-cache  --max-prefill-tokens 4096 --chunked-prefill-size -1

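Notes:

--port: can be set to any port that is not in use on the local machine;

--context-length: sets the model's maximum context length in tokens (prompt plus generated tokens);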
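Whether a candidate port is currently unused can be checked before launching. The helper below is a small sketch using only the standard library; the name `is_port_free` is hypothetical, and the check is only a snapshot, since another process may take the port between the check and the launch.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True


# 5000 is the server port and 8746 the dist-init port used in the commands above.
for candidate in (5000, 8746):
    print(candidate, is_port_free(candidate))
```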

Start the router on the same machine as the master server

#187
python3.10 -m sglang_router.launch_router --host 0.0.0.0 --worker-urls http://10.12.116.187:5000

Notes:

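--dist-init-addr: choose one machine as the master node and fill in its host IP together with any unused port;

GLOO_SOCKET_IFNAME: run ifconfig -a, find the interface whose inet field matches the machine's actual IP, and use that interface name as the value;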
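The interface lookup can also be scripted. The sketch below parses the one-line-per-address output of `ip -o -4 addr show` instead of `ifconfig -a`; using the `ip` tool is an assumption about the host, and `find_ifname` as well as the sample addresses are hypothetical.

```python
import re


def find_ifname(ip_addr_output, target_ip):
    """Return the interface whose IPv4 address equals target_ip, or None."""
    for line in ip_addr_output.splitlines():
        # Each line looks like: "2: eth0    inet 10.0.0.5/24 brd ... scope global eth0"
        m = re.match(r"\d+:\s+(\S+)\s+inet\s+([\d.]+)/", line)
        if m and m.group(2) == target_ip:
            return m.group(1)
    return None


# Sample text in the format printed by `ip -o -4 addr show`.
sample = (
    "1: lo    inet 127.0.0.1/8 scope host lo\\       valid_lft forever preferred_lft forever\n"
    "2: eth0    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth0\\       valid_lft forever preferred_lft forever\n"
)
ifname = find_ifname(sample, "10.0.0.5")
print(ifname)
```

On a real host the input text could come from `subprocess.run(["ip", "-o", "-4", "addr", "show"], capture_output=True, text=True).stdout`, and the returned name is what goes into GLOO_SOCKET_IFNAME.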

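Send a request from the client

import requests
from sglang.utils import print_highlight

# Port of the endpoint to query: 30000 is the router default, 5000 the server port used above.
port = 30000
url = f"http://localhost:{port}/v1/chat/completions"
data = {
    "model": "[ path of DeepSeek-R1-Distill-Qwen-1.5B ]",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

# Send the chat completion request and print the JSON response.
response = requests.post(url, json=data)
print_highlight(response.json())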

Distributed deployment

The following example deploys DeepSeek-R1-awq on 4 machines with 32 S60 cards in total, using DP=8, TP=4, EP=32.

Overview:

[Figure: deployment overview]

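IP configuration:

Master node IP:Port:

[master node ip]:8746

Port 8746 is used to establish distributed communication, and port 5000 is the service port of the master node.

Worker node IPs: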

[From node ip 01]
[From node ip 02]
[From node ip 03]

Server configuration:

Master node configuration:

Run the following command on the server at [master node ip]:

env GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=256 python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  --node-rank 0 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 16 --mem-fraction-static 0.7

GLOO_SOCKET_IFNAME is the name of the network interface that carries [master node ip].

Worker node configuration:

Run the following command on the server at [From node ip 01]:

env GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=256 OUTLINES_CACHE_DIR=/home/.cache/outlines_2 python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  --node-rank 1 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 16 --mem-fraction-static 0.7

Run the following command on the server at [From node ip 02]:

env GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=256 OUTLINES_CACHE_DIR=/home/.cache/outlines_2 python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  --node-rank 2 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 16 --mem-fraction-static 0.7

Run the following command on the server at [From node ip 03]:

env GLOO_SOCKET_IFNAME=[GLOO_SOCKET_IFNAME] TORCH_ECCL_AVOID_RECORD_STREAMS=1 TORCH_GCU_ENABLE_AUTO_MIGRATION=1 DS_V3_PARALLEL=True DS_V3_PARALLEL_MAX_TOTAL_TOKENS=256 OUTLINES_CACHE_DIR=/home/.cache/outlines_2 python3.10 -m sglang.launch_server --model-path [path of deepseek-r1-awq] --host 0.0.0.0 --port 5000 --dist-init-addr [master node ip]:8746  --node-rank 3 --nnodes 4 --dp-size 8 --tp-size 4 --ep-size 32 --enable-ep-moe --trust-remote-code --cuda-graph-max-bs 16 --mem-fraction-static 0.7

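--dist-init-addr specifies the IP and port of the master node used for distributed initialization; it determines how the nodes discover each other and establish communication. The GLOO_SOCKET_IFNAME environment variable selects the network interface (such as eth0 or bond0) that the PyTorch Gloo communication backend binds to in a distributed deployment. This ensures that inter-node traffic uses the intended physical NIC and avoids nodes failing to connect, or communicating inefficiently, because the wrong default interface was chosen. The IP in --dist-init-addr must match the IP of the interface named by GLOO_SOCKET_IFNAME, so that distributed communication runs over the same physical network.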
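The four launch commands above differ only in --node-rank and in whether OUTLINES_CACHE_DIR is set, so they can be generated from one template. The helper below is a sketch that mirrors the flags and environment variables of this example; `build_launch_cmd` is a hypothetical name, and the IP, model path and interface name in the demo are placeholders, not real values.

```python
import shlex


def build_launch_cmd(node_rank, master_ip, model_path, ifname, nnodes=4):
    """Assemble the launch command for one node, mirroring the flags used above."""
    env = {
        "GLOO_SOCKET_IFNAME": ifname,
        "TORCH_ECCL_AVOID_RECORD_STREAMS": "1",
        "TORCH_GCU_ENABLE_AUTO_MIGRATION": "1",
        "DS_V3_PARALLEL": "True",
        "DS_V3_PARALLEL_MAX_TOTAL_TOKENS": "256",
    }
    if node_rank > 0:
        # The worker commands above additionally set an outlines cache directory.
        env["OUTLINES_CACHE_DIR"] = "/home/.cache/outlines_2"
    args = [
        "python3.10", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", "0.0.0.0", "--port", "5000",
        "--dist-init-addr", f"{master_ip}:8746",
        "--node-rank", str(node_rank), "--nnodes", str(nnodes),
        "--dp-size", "8", "--tp-size", "4", "--ep-size", "32",
        "--enable-ep-moe", "--trust-remote-code",
        "--cuda-graph-max-bs", "16", "--mem-fraction-static", "0.7",
    ]
    env_part = [f"{key}={shlex.quote(value)}" for key, value in env.items()]
    return " ".join(["env", *env_part, *map(shlex.quote, args)])


# Placeholder values; every node passes the same master IP but its own rank.
commands = [
    build_launch_cmd(rank, "10.0.0.1", "/models/deepseek-r1-awq", "eth0")
    for rank in range(4)
]
for cmd in commands:
    print(cmd)
```

The interface name may differ from node to node, so in practice `ifname` would be looked up on each machine rather than shared as it is in this demo.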

Router configuration:

Run the following command on the server at [master node ip]:

python3.10 -m sglang_router.launch_router --host 0.0.0.0 --worker-urls http://[master node ip]:5000

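The router is deployed on the same machine as the master server. When the model is deployed in distributed mode, as in the example above, the only worker URL that can be added to the router is that of the master server. The router uses port 30000 by default.

Send requests from the client: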

python3.10 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1000 --random-output 700 --random-range-ratio 1 --num-prompts 32 \
--host [master node ip] --port 30000 --output-file "deepseek_nnode4.jsonl"

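--host [master node ip]: IP of the server where the router runs.

--port 30000: port the router is bound to.

--random-input: maximum input length.

--random-output: maximum output length.

--num-prompts: number of prompts sent, which acts here as the batch size, i.e. the request concurrency.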