Preface

Version information

Date	Version	Author	New features
20240618	v1.0	Enflame

Overview

text-generation-inference is an open-source framework for deploying large language model text generation services.

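Installation and usage

Hardware and software requirements

OS: Ubuntu 20.04
Python: 3.10
GCU: Enflame S60

Using the Enflame Docker image

Refer to the "TopsRider Software Stack Installation Manual" to obtain the official Enflame TGI image.
Create a container with docker. The container already includes the build and runtime environment the project needs, along with the required TopsRider components. For example: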

docker run -itd --privileged --network host -v /home/:/home/ --name tgi-test artifact.enflame.cn/enflame_docker_release/amd64_ubuntu2204_tgi:3.1.20240612.1

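After entering the container, change to the directory `/usr/local/topsrider/src/text-generation-inference` and extract the archive. The project source will then be located at /usr/local/topsrider/src/text-generation-inference/text-generation-inference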

cd /usr/local/topsrider/src/text-generation-inference
tar -zxvf text-generation-inference_3.1.20240612.tar.gz

# make install installs text-generation-launcher, text-generation-router, and text-generation-server
make install
# make install-benchmark installs the benchmark tool
make install-benchmark

If the image cannot be obtained, a local installation is also possible (not recommended).

Install Rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"

Install protoc

PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
unzip -o $PROTOC_ZIP -d /usr/local 'include/*'

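Install the TopsRider environment and its dependencies by following the "TopsRider Software Stack Installation Manual". The dependencies include TopsPlatform Package, Enflame TopsAten, Enflame Collective Communications Library, Torch GCU, xformers GCU, VLLM GCU, tops-extension GCU, and others.
Obtain the project source code by contacting the development engineers. After extracting the source, enter the project directory and run the build and install commands:

# make install installs text-generation-launcher, text-generation-router, and text-generation-server
make install
# make install-benchmark installs the benchmark tool
make install-benchmark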

User guide

Quick start

Starting the server

text-generation-launcher \
--model-id path/huggingface-Model/ \
--port 8080 \
--hostname 127.0.0.1 \
--max-input-length 1024 \
--max-total-tokens 4096 \
--trust-remote-code \
--num-shard 1

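Here, model-id must be the path to a model downloaded from Hugging Face, max-total-tokens must be set within the range the model supports, and num-shard is the number of cards the model needs (the default is a single card).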
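The launch command above can also be assembled programmatically, which makes it easier to check the flag values before starting the server. The following is a minimal sketch rather than part of the Enflame tooling: the helper name `build_launcher_command` is hypothetical, and the check that max-input-length stays below max-total-tokens reflects how the upstream launcher is understood to validate these flags, which is assumed to hold for this build as well.

```python
import shlex


def build_launcher_command(model_path, num_shard=1, max_input_length=1024,
                           max_total_tokens=4096, port=8080,
                           hostname="127.0.0.1"):
    """Return the text-generation-launcher argument list for the flags shown above.

    Hypothetical helper for illustration; it only uses flags that appear in
    this guide.
    """
    if num_shard < 1:
        raise ValueError("num_shard must be at least 1")
    # The prompt and the generated tokens share one token budget, so the
    # input limit is assumed to have to stay strictly below the total.
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    return [
        "text-generation-launcher",
        "--model-id", str(model_path),
        "--port", str(port),
        "--hostname", hostname,
        "--max-input-length", str(max_input_length),
        "--max-total-tokens", str(max_total_tokens),
        "--trust-remote-code",
        "--num-shard", str(num_shard),
    ]


if __name__ == "__main__":
    # Print the command as a shell-ready string without running it.
    print(shlex.join(build_launcher_command("path/huggingface-Model/")))
```

The resulting list can be passed to `subprocess.Popen` if the server should be started from a script.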

Client requests

The /generate or /generate_stream route can be queried:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

The service can also be called from Python:

import requests
headers = {
    "Content-Type": "application/json",
}
data = {
    'inputs': 'What is Deep Learning?',
    'parameters': {
        'max_new_tokens': 20,
    },
}
response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)
print(response.json())
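The /generate_stream route can be consumed from Python in a similar way. The sketch below uses only the standard library (urllib instead of requests) and assumes the stream follows the upstream text-generation-inference format: server-sent events whose `data:` lines carry a JSON object with a `token.text` field. The function names are hypothetical, and the format should be confirmed against the deployed server.

```python
import json
import urllib.request


def parse_sse_line(raw_line):
    """Decode one line of the event stream; return its JSON payload or None."""
    if isinstance(raw_line, bytes):
        raw_line = raw_line.decode("utf-8")
    line = raw_line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines and other SSE fields are skipped
    return json.loads(line[len("data:"):].strip())


def stream_generate(prompt, max_new_tokens=20,
                    url="http://127.0.0.1:8080/generate_stream"):
    """Yield the text of each generated token as it arrives (assumed format)."""
    body = json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # the response object iterates line by line
            event = parse_sse_line(raw_line)
            if event is not None and "token" in event:
                yield event["token"]["text"]


# Example usage, with the server from the quick start running:
#     for piece in stream_generate("What is Deep Learning?"):
#         print(piece, end="", flush=True)
```

Parsing is kept in its own function so that it can be checked without a running server.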

Benchmark requests

Once the server is running, performance results can be collected with a benchmark request:

text-generation-benchmark --tokenizer-name [path of huggingface model]

To disable the interactive window, set the environment variable DISABLE_CROSSTERM=1:

DISABLE_CROSSTERM=1 text-generation-benchmark  -t [path of huggingface model]

All performance results are stored in benchmark_output.txt.
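Because the benchmark talks to a server that is already running, a script that starts both may need to wait until the model has finished loading. The following is a minimal sketch of such a wait loop. It assumes the server exposes the `/health` route found in upstream text-generation-inference, which should be verified for this build; the function names and the timeout values are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def http_health_probe(url="http://127.0.0.1:8080/health", timeout=2.0):
    """Return True if the (assumed) health route answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready".
        return False


def wait_until_ready(probe=http_health_probe, deadline_s=600.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or deadline_s seconds have passed."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


# Example usage:
#     if not wait_until_ready():
#         raise SystemExit("server did not become ready in time")
```

The probe is passed in as a parameter so that a different readiness check can be substituted if the health route is not available.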

For models whose tokenizer is not compatible, try generating "tokenizer.json" with the conversion tool in transformers:

python -m transformers.convert_slow_tokenizers_checkpoints_to_fast --tokenizer_name LlamaTokenizer --dump_path test --checkpoint_name baichuan2-7B-base
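Before running the benchmark, it can be useful to check whether a model folder already contains a "tokenizer.json" file, since the conversion is only needed when it is missing. This is a small sketch with a hypothetical helper name. It checks only that the file exists, not that its contents are valid.

```python
from pathlib import Path


def has_fast_tokenizer(model_dir):
    """Return True if model_dir contains a tokenizer.json file."""
    return (Path(model_dir) / "tokenizer.json").is_file()


# Example usage:
#     if not has_fast_tokenizer("baichuan2-7B-base"):
#         print("tokenizer.json not found; run the conversion tool first")
```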

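Model inference

baichuan2-7b

Model download. Users may download the open-source dataset themselves. This document only provides the download link and makes no commitments regarding the open-source dataset; all consequences and risks arising from its use are borne by the user.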

url: Baichuan2-7B-Base
branch:main
commit id:364ead367078c68c8deef6a319053302b330aa1f

Download everything under the path given by the url above into the baichuan2-7B-base folder.
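If the model is hosted on the Hugging Face Hub, pinning the download to the commit id listed above keeps the files reproducible. The sketch below assumes the third-party `huggingface_hub` package is installed and uses its `snapshot_download` function. The repository id is not given in this guide, so it is left as a parameter, and the value in the usage comment is a placeholder to be filled in.

```python
def snapshot_kwargs(repo_id, commit_id, target_dir):
    """Build the keyword arguments for a commit-pinned snapshot download."""
    return {
        "repo_id": repo_id,       # not specified in this guide; supplied by the caller
        "revision": commit_id,    # pins the download to one exact commit
        "local_dir": target_dir,  # files are placed directly in this folder
    }


def download_model(repo_id, commit_id, target_dir):
    """Download all files of the repository at the given commit."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(**snapshot_kwargs(repo_id, commit_id, target_dir))


# Example usage (replace the placeholder repository id):
#     download_model("<org>/Baichuan2-7B-Base",
#                    "364ead367078c68c8deef6a319053302b330aa1f",
#                    "baichuan2-7B-base")
```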

Performance test

DISABLE_CROSSTERM=1 text-generation-benchmark --tokenizer-name [path of baichuan2-7B-base] --batch-size=1 --sequence-length=1024 --decode-length=1024

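baichuan2-13b

Model download. Users may download the open-source dataset themselves. This document only provides the download link and makes no commitments regarding the open-source dataset; all consequences and risks arising from its use are borne by the user.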

url: Baichuan2-13B-Base
branch:main
commit id:c6f590cab590cf33e78ad834dbd5f9bd6df34a94

Download everything under the path given by the url above into the baichuan2-13B-base folder.

Performance test

After the server has started, run the following test:

DISABLE_CROSSTERM=1 text-generation-benchmark --tokenizer-name [path of baichuan2-13B-base] --batch-size=1 --sequence-length=1024 --decode-length=1024

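llama2-70b

Model download. Users may download the open-source dataset themselves. This document only provides the download link and makes no commitments regarding the open-source dataset; all consequences and risks arising from its use are borne by the user.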

url: Llama-2-70b-hf
branch:main
commit id:6aa89cf

Download everything under the path given by the url above into the Llama2-70b folder.

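Performance test

The model needs 4 cards, so start the server with --num-shard 4. After the server has started, run the following test: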

DISABLE_CROSSTERM=1 text-generation-benchmark --tokenizer-name [path of Llama2-70b] --batch-size=1 --sequence-length=1024 --decode-length=1024
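The three performance tests above differ only in the tokenizer path, so they can be generated by one small helper. The following is a sketch under the assumption that the benchmark is driven from a Python script. The helper names are hypothetical, and only flags and the environment variable that appear in this guide are used.

```python
import os
import shlex
import subprocess


def build_benchmark_command(tokenizer_path, batch_size=1,
                            sequence_length=1024, decode_length=1024):
    """Return the text-generation-benchmark argument list used in this guide."""
    return [
        "text-generation-benchmark",
        "--tokenizer-name", str(tokenizer_path),
        f"--batch-size={batch_size}",
        f"--sequence-length={sequence_length}",
        f"--decode-length={decode_length}",
    ]


def benchmark_env(base_env=None):
    """Copy the environment and disable the interactive window."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISABLE_CROSSTERM"] = "1"
    return env


def run_benchmark(tokenizer_path, **options):
    """Run one benchmark against an already running server."""
    command = build_benchmark_command(tokenizer_path, **options)
    return subprocess.run(command, env=benchmark_env(), check=True)


if __name__ == "__main__":
    # Only print the command; the folder name is the one used earlier.
    print("DISABLE_CROSSTERM=1",
          shlex.join(build_benchmark_command("baichuan2-7B-base")))
```

Calling `run_benchmark` once per model folder, after the matching server has started, reproduces the tests in this section.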