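Pipeline-Parallel Inference

Environment Dependencies

Pipeline-parallel inference for large models requires the following environment:

Open MPI 4.0.5;
Enflame TopsRider;
Python 3.8;
the topsdistinfer_pipeline_infer_*.deb package (installed).

Install the Python dependencies: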

python3.8 -m pip install -r /usr/local/gcu/TopsDistInfer/requirements.txt
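
Before building any engine it can help to confirm that the tools from the dependency list are on PATH. The following is a minimal sketch; `have_tool` is a hypothetical helper, and the executable names `python3.8` and `mpirun` are assumed to be the ones that Python 3.8 and Open MPI install.

```shell
# Report whether a command is available on PATH (0 = found, 1 = missing).
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Executable names are assumptions based on the dependency list above.
for tool in python3.8 mpirun; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

The check only looks for the executables; it does not verify the Open MPI version or the TopsRider installation.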

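Running Model Inference

Change to the installation directory /usr/local/gcu/TopsDistInfer/ before running model inference.

GPT2 xl

Running GPT2-xl inference on one gcu210 card:

Place the onnx and json files generated by the method in Chapter 5 in the gpt2-xl_split_2 folder.
Generate the engines with the following commands: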

python3 -m build_scripts.gpt-xl --model_path gpt2-xl_split_2/gpt-xl_1of2-op13-fp32-N.onnx --max_seq=1024 --batchsize 2 
python3 -m build_scripts.gpt-xl --model_path gpt2-xl_split_2/gpt-xl_2of2-op13-fp32-N.onnx --max_seq=1024 --batchsize 2 

Download the official GPT2-XL tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_python_GPT2/, including: config.json, tokenizer.json, vocab.json, merges.txt.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 2 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_gpt2-xl_split_2_max_1024_bs2.json --model_path <path/to/gpt2-xl_split_2>
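
In the examples in this section, the value passed to `mpirun -n` is always twice the number of gcu210 cards (one card uses 2 processes, two cards use 4, four cards use 8). The sketch below encodes that observation; the constant is inferred from the commands in this section and is not a documented rule, and `ranks_for_cards` is a hypothetical helper.

```shell
# Processes per gcu210 card, as observed in this section's examples
# (an assumption inferred from the commands, not a documented constant).
RANKS_PER_CARD=2

# Print the process count for a given number of cards.
ranks_for_cards() {
    echo $(( $1 * RANKS_PER_CARD ))
}

ranks_for_cards 1   # matches the GPT2-xl example: mpirun -n 2
```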

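ChatGLM 6b (manually split version)

Running ChatGLM-6B inference on two gcu210 cards:

Place the onnx models in the chatglm_models folder; contact the Enflame business team to obtain the onnx files.
Generate the engines with the following commands and move the generated bin files into the chatglm_6b_manual folder: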

  python3 -m build_scripts.chatglm_manual --stage=0 --max_seq=2048 --batch_size=2
  python3 -m build_scripts.chatglm_manual --stage=1 --max_seq=2048 --batch_size=2
  python3 -m build_scripts.chatglm_manual --stage=2 --max_seq=2048 --batch_size=2
  python3 -m build_scripts.chatglm_manual --stage=3 --max_seq=2048 --batch_size=2

Download the official ChatGLM-6B config and tokenizer files to ./src/pre_process_chatglm/model/, including: config.json, configuration_chatglm.py, ice_text.model, tokenization_chatglm.py, tokenizer_config.json.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_chatglm_split_4_bs2.json --model_path <path/to/chatglm_6b_manual>

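Running ChatGLM-6B inference on one gcu210 card:

Place the onnx models in the chatglm_models folder; contact the Enflame business team to obtain the onnx files.
Generate the engines with the following commands and move the generated bin files into the chatglm_6b_manual folder: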

  python3 -m build_scripts.chatglm_manual --num_stages=2 --stage=0 --max_seq=2048 --batch_size=1
  python3 -m build_scripts.chatglm_manual --num_stages=2 --stage=1 --max_seq=2048 --batch_size=1

Download the official ChatGLM-6B config and tokenizer files to ./src/pre_process_chatglm/model/, including: config.json, configuration_chatglm.py, ice_text.model, tokenization_chatglm.py, tokenizer_config.json.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 2 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_chatglm_split_2.json --model_path <path/to/chatglm_6b_manual>
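
The build step above asks for the generated bin files to be moved into chatglm_6b_manual. The following is a minimal sketch of that move, assuming the engines are written to a known directory with a `.bin` extension; the document does not state the output location, and `collect_bins` is a hypothetical helper.

```shell
# Move every *.bin file from a source directory into a target directory.
# Both the source location and the .bin extension are assumptions.
collect_bins() {
    src=$1; dest=$2
    mkdir -p "$dest"
    for f in "$src"/*.bin; do
        [ -e "$f" ] || continue   # no matches: the glob stays literal, skip it
        mv "$f" "$dest"/
    done
}

# Example (not executed here): collect_bins . chatglm_6b_manual
```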

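LLAMA 7b

Running LLAMA-7B inference on two gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the llama_7b_split_4 folder.
Generate the engines with the following command and move the generated bin files into the llama_7b_split_4 folder: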

  python3 -m build_scripts.llama_7b --json_path /usr/local/gcu/TopsDistInfer/build_configs/llama_7b_split_4 --onnx_path ./llama_7b_split_4

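Download the official llama-7b-hf tokenizer file to /usr/local/gcu/TopsDistInfer/src/pre_process_LLAMA/, namely tokenizer.model.
Install the packages listed in requirements.txt with the following command:

pip3 install -r /usr/local/gcu/TopsDistInfer/src/pre_process_LLAMA/requirements.txt

Run inference with the following commands: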

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_llama_7b_split_4.json --model_path <path/to/llama_7b_split_4>

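ALPACA 7b

Running ALPACA-7B inference on two gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the alpaca_7b folder.
Generate the engines with the following command and move the generated bin files into the alpaca_7b folder: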

python3 -m build_engine.gcu --json_path /usr/local/gcu/TopsDistInfer/build_configs/alpaca_7b_split_4_max_1024_mix16_3pg --onnx_path alpaca_7b/

Download the official ALPACA-7b tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_Alpaca_7B/, including: added_tokens.json, special_tokens_map.json, tokenizer_config.json, tokenizer.model.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_alpaca_7b_split_4_max_1024_mix16_3pg.json --model_path <path/to/alpaca_7b>
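
Each model section lists tokenizer files that must be downloaded before inference. The following is a minimal sketch that reports which of them are still absent, using the ALPACA 7b list as the example; `missing_files` is a hypothetical helper and not part of TopsDistInfer.

```shell
# Print the names of required files that are absent from a directory.
# Returns 0 when everything is present, 1 otherwise.
missing_files() {
    dir=$1; shift
    rc=0
    for name in "$@"; do
        if [ ! -f "$dir/$name" ]; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}

# File list taken from the ALPACA 7b step above.
missing_files /usr/local/gcu/TopsDistInfer/src/pre_process_Alpaca_7B \
    added_tokens.json special_tokens_map.json tokenizer_config.json tokenizer.model \
    || echo "download the files listed above before running mpirun"
```

The same function can be reused for the other models by passing their pre_process directory and file list.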

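VICUNA 13b

Running VICUNA-13B inference on four gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the vicuna_13b folder.
Generate the engines with the following command and move the generated bin files into the vicuna_13b folder: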

python3 -m build_engine.gcu --json_path /usr/local/gcu/TopsDistInfer/build_configs/vicuna_13b_split_8_max_1024_mix16_3pg --onnx_path ./vicuna_13b/

Download the official VICUNA-13b tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_Vicuna_13B/, including: special_tokens_map.json, tokenizer_config.json, tokenizer.model.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 8 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_vicuna_13b_split_8_max_1024_mix16_3pg.json --model_path <path/to/vicuna_13b>
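
Most launch steps in this section share the same three lines and differ only in the process count, run config file, and model path. The sketch below prints those lines for review instead of running them; `pipeline_cmd` is a hypothetical helper, and `./vicuna_13b` stands in for the real model path.

```shell
# Print the environment setup and mpirun command for a pipeline run (dry run).
# Arguments: number of processes, run config file, model directory.
pipeline_cmd() {
    echo "export EFRT_CLUSTER_AS_DEVICE=true"
    echo "export ECCL_RUNTIME_3_0_ENABLE=true"
    echo "mpirun -n $1 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=$2 --model_path $3"
}

pipeline_cmd 8 run_configs/run_config_vicuna_13b_split_8_max_1024_mix16_3pg.json ./vicuna_13b
```

Printing first makes it easy to compare the generated command with the one documented for the model before launching anything on the hardware.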

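ChatGLM 6b (automatically split version)

Running CHATGLM_AUTO inference on two gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the chatglm_6b folder.
Generate the engines with the following command and move the generated bin files into the chatglm_6b folder: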

python3 -m build_engine.gcu --json_path /usr/local/gcu/TopsDistInfer/build_configs/chatglm_auto_split_4_max_1024_mix16_3pg --onnx_path ./chatglm_6b/

Download the official ChatGLM-6b tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_ChatGLM_AUTO_6B/, including: ice_text.model, tokenization_chatglm.py, tokenizer_config.json.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_chatglm_auto_split_4_max_1024_mix16_3pg.json --model_path <path/to/chatglm_6b>

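LLAMA2 7b

Running LLAMA2-7B inference on two gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the llama2_7b_split_4 folder.
Generate the engines with the following command and move the generated bin files into the llama2_7b_split_4 folder: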

python3 -m build_scripts.llama_7b --json_path /usr/local/gcu/TopsDistInfer/build_configs/llama2_7b_split_4 --onnx_path ./llama2_7b_split_4

Download the official Llama-2-7b-hf tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_LLAMA/, including: tokenizer.json, tokenizer_config.json, special_tokens_map.json, tokenizer.model.
Install the packages listed in requirements.txt with the following command:

pip3 install -r /usr/local/gcu/TopsDistInfer/src/pre_process_LLAMA/requirements.txt

Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_llama2_7b_split_4.json --model_path <path/to/llama2_7b_split_4>
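
In every command in this section, the value passed to `mpirun -n` equals the N in the `split_N` part of the run config file name. The following is a minimal sketch that reads N back from the file name; it relies on that naming pattern holding, which is an observation about these examples rather than a documented guarantee, and `split_count` is a hypothetical helper.

```shell
# Extract N from a "..._split_N..." run config name; print nothing if absent.
split_count() {
    echo "$1" | sed -n 's/.*_split_\([0-9][0-9]*\).*/\1/p'
}

split_count run_configs/run_config_llama2_7b_split_4.json   # prints 4
```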

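BLOOMZ 7b

Running BLOOMZ-7B1 inference on two gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the bloomz-7b1 folder.
Generate the engines with the following commands and move the generated bin files into the bloomz-7b1 folder: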

python3 -m build_scripts.bloomz_7b1 --model_path ./bloomz-7b1/ --stage_id 0 --max_bs 1 --max_seq 2048 --kvcache_dtype fp16
python3 -m build_scripts.bloomz_7b1 --model_path ./bloomz-7b1/ --stage_id 1 --max_bs 1 --max_seq 2048 --kvcache_dtype fp16
python3 -m build_scripts.bloomz_7b1 --model_path ./bloomz-7b1/ --stage_id 2 --max_bs 1 --max_seq 2048 --kvcache_dtype fp16
python3 -m build_scripts.bloomz_7b1 --model_path ./bloomz-7b1/ --stage_id 3 --max_bs 1 --max_seq 2048 --kvcache_dtype fp16

Download the official BLOOMZ-7B1 tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_bloomz_7b1/, including: special_tokens_map.json, tokenizer.json, tokenizer_config.json.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_bloomz_7b1_split_4_max_2048_mix16_3pg.json --model_path <path/to/bloomz-7b1>

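Baichuan 7b

Running BAICHUAN-7B inference on two gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the /usr/local/gcu/TopsDistInfer/baichuan_7b_models folder.
Download the official baichuan-7b tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_Baichuan_7B/, including: tokenizer.model, tokenizer_config.json, special_tokens_map.json.
Generate the engines with the following command and move the generated bin files into the baichuan_7b_split_4_max_2048_mix16_3pg folder: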

python3 -m  build_scripts.baichuan_7b --model_path=/usr/local/gcu/TopsDistInfer/baichuan_7b_models/ --json_path=/usr/local/gcu/TopsDistInfer/build_configs/baichuan_7b_split_4_max_2048_mix16_3pg --max_seq=2048 --batch_size=1

Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_baichuan_7b_split_4_max_2048_mix16_3pg.json --model_path <path/to/baichuan_7b_models>

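ChatGLM2-6b

Running ChatGLM2-6b inference on two gcu210 cards:

Place the onnx and json files generated by the method in Chapter 5 in the chatglm2_models folder.
Generate the engines with the following command and move the generated bin files into the chatglm2_models folder: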

python3.8 -m  build_engine.gcu --json_path /usr/local/gcu/TopsDistInfer/build_configs/chatglm2_6b_split_4/ --onnx_path ./chatglm2_models

Download the official ChatGLM2-6B config and tokenizer files to ./src/pre_process_chatglm2/tokenizer/, including: config.json, configuration_chatglm.py, tokenizer.model, tokenization_chatglm.py, tokenizer_config.json.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_chatglm2_split_4.json --model_path <path/to/chatglm2_models>

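ChatGLM3-6b

Running ChatGLM3-6b inference on two gcu210 cards:

Place the onnx and json files generated by the method in Chapter 5 in the chatglm3_split folder.
Generate the engines with the following command and move the generated bin files into the chatglm3_split folder: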

python3.8 -m build_engine.gcu --json_path /usr/local/gcu/TopsDistInfer/build_configs/chatglm3_6b_split_4/ --onnx_path ./chatglm3_split

Download the official ChatGLM3-6b tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_chatglm3/, including: tokenization_chatglm.py, tokenizer.model, tokenizer_config.json.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_chatglm3_split_4.json --model_path <path/to/chatglm3_split>

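OPT 13b

Running OPT-13B inference on four gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the opt_13b_models folder.
Generate the engines with the following command and move the generated bin files into the opt_13b_models folder: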

python3 -m build_engine.gcu --json_path /usr/local/gcu/TopsDistInfer/build_configs/opt_13b_split_8_max_1024_mix16_3pg/ --onnx_path ./opt_13b_models

Download the official OPT-13b tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_OPT_13B/, including: special_tokens_map.json, tokenizer_config.json, vocab.json, merges.txt.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 8 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_opt_13b_split_8_max_1024_mix16_3pg.json --model_path=opt_13b_models

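STARCODERBASE 15B

Running STARCODERBASE-15B inference on four gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the starcoderbase_15b_split_8 folder.
Generate the engines with the following commands and move the generated bin files into the starcoderbase_15b_split_8 folder: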

python3 -m build_scripts.starcoderbase_15b --model_path ./starcoderbase_15b_split_8/ --stage_id 0 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.starcoderbase_15b --model_path ./starcoderbase_15b_split_8/ --stage_id 1 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.starcoderbase_15b --model_path ./starcoderbase_15b_split_8/ --stage_id 2 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.starcoderbase_15b --model_path ./starcoderbase_15b_split_8/ --stage_id 3 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.starcoderbase_15b --model_path ./starcoderbase_15b_split_8/ --stage_id 4 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.starcoderbase_15b --model_path ./starcoderbase_15b_split_8/ --stage_id 5 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.starcoderbase_15b --model_path ./starcoderbase_15b_split_8/ --stage_id 6 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.starcoderbase_15b --model_path ./starcoderbase_15b_split_8/ --stage_id 7 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16

Download the official STARCODERBASE-15B tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_starcoderbase_15b/, including: tokenizer.json, tokenizer_config.json.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 8 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_starcoderbase_15b_split_8_max_1024_mix16_3pg.json --model_path=<path/to/starcoderbase_15b_split_8>
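
The eight build commands above differ only in `--stage_id`, so they can be generated by a loop. The sketch below prints the commands for review instead of running them; `stage_build_cmds` is a hypothetical helper, and the flags are copied from the commands in this section.

```shell
# Print one build command per pipeline stage (dry run).
# Arguments: module name, model directory, number of stages, max sequence length.
stage_build_cmds() {
    module=$1; model_dir=$2; stages=$3; max_seq=$4
    i=0
    while [ "$i" -lt "$stages" ]; do
        echo "python3 -m build_scripts.$module --model_path $model_dir --stage_id $i --max_bs 1 --max_seq $max_seq --kvcache_dtype fp16"
        i=$((i + 1))
    done
}

stage_build_cmds starcoderbase_15b ./starcoderbase_15b_split_8/ 8 1024
```

The same call shape covers the other models in this section that are built with `--stage_id`, for example `stage_build_cmds bloomz_7b1 ./bloomz-7b1/ 4 2048`.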

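Aquila-7b

Running Aquila-7b inference on two gcu210 cards:

Place the onnx and json files generated by the method in Chapter 5 in the aquila_7b_split_4 folder.
Generate the engines with the following commands: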

python3 -m build_scripts.aquila_7b --model_path /usr/local/gcu/TopsDistInfer/aquila_7b_split_4/aquila1of4-op13-fp32-N.onnx --max_seq=2048 --batchsize 1
python3 -m build_scripts.aquila_7b --model_path /usr/local/gcu/TopsDistInfer/aquila_7b_split_4/aquila2of4-op13-fp32-N.onnx --max_seq=2048 --batchsize 1
python3 -m build_scripts.aquila_7b --model_path /usr/local/gcu/TopsDistInfer/aquila_7b_split_4/aquila3of4-op13-fp32-N.onnx --max_seq=2048 --batchsize 1
python3 -m build_scripts.aquila_7b --model_path /usr/local/gcu/TopsDistInfer/aquila_7b_split_4/aquila4of4-op13-fp32-N.onnx --max_seq=2048 --batchsize 1

Download the official Aquila-7B tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_python_Aquila_7B/, including: tokenizer.json, vocab.json, tokenizer_config.json.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_aquila_split_4_bs1.json
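
The Aquila build commands above take one split ONNX file each, named with a `<k>of<N>` index. The sketch below generates those commands from the file-name pattern and prints them for review; `split_build_cmds` is a hypothetical helper, and the pattern and flags are copied from the commands above.

```shell
# Print the build command for each of the N split ONNX files (dry run).
# Arguments: module name, model directory, file-name prefix, number of parts.
split_build_cmds() {
    module=$1; dir=$2; prefix=$3; parts=$4
    k=1
    while [ "$k" -le "$parts" ]; do
        echo "python3 -m build_scripts.$module --model_path $dir/${prefix}${k}of${parts}-op13-fp32-N.onnx --max_seq=2048 --batchsize 1"
        k=$((k + 1))
    done
}

split_build_cmds aquila_7b /usr/local/gcu/TopsDistInfer/aquila_7b_split_4 aquila 4
```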

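Internlm-7b

Running Internlm-7b inference on two gcu210 cards:

Place the onnx and json files generated by the method in Chapter 5 in the internlm_7b_split_4 folder.
Generate the engines with the following commands: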

python3 -m build_scripts.internlm_7b --model_path /usr/local/gcu/TopsDistInfer/internlm_7b_split_4/internlm1of4-op13-fp32-N.onnx --max_seq=2048 --batchsize 1
python3 -m build_scripts.internlm_7b --model_path /usr/local/gcu/TopsDistInfer/internlm_7b_split_4/internlm2of4-op13-fp32-N.onnx --max_seq=2048 --batchsize 1
python3 -m build_scripts.internlm_7b --model_path /usr/local/gcu/TopsDistInfer/internlm_7b_split_4/internlm3of4-op13-fp32-N.onnx --max_seq=2048 --batchsize 1
python3 -m build_scripts.internlm_7b --model_path /usr/local/gcu/TopsDistInfer/internlm_7b_split_4/internlm4of4-op13-fp32-N.onnx --max_seq=2048 --batchsize 1

Download the official internlm-7B tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_python_InternLM_7B/, including: special_tokens_map.json, tokenization_internlm.py, tokenizer_config.json, tokenizer.model.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_internlm_7b_split_4_bs1.json

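LLAMA 13b

Running LLAMA-13B inference on four gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the llama_13b_split_8 folder.
Generate the engines with the following command and move the generated bin files into the llama_13b_split_8 folder: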

python3 -m build_scripts.llama_13b --json_path /usr/local/gcu/TopsDistInfer/build_configs/llama_13b_split_8 --onnx_path ./llama_13b_split_8

Download the official Llama-13b-hf tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_LLAMA_13B/, including: special_tokens_map.json, tokenizer.model.
Install the packages listed in requirements.txt with the following command:

pip3 install -r /usr/local/gcu/TopsDistInfer/src/pre_process_LLAMA_13B/requirements.txt

Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 8 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_llama_13b_split_8.json --model_path <path/to/llama_13b_split_8>

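LLAMA2 13b

Running LLAMA2-13B inference on four gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the llama2_13b_split_8 folder.
Generate the engines with the following command and move the generated bin files into the llama2_13b_split_8 folder: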

python3 -m build_scripts.llama_13b --json_path /usr/local/gcu/TopsDistInfer/build_configs/llama2_13b_split_8 --onnx_path ./llama2_13b_split_8

Download the official Llama-2-13b-hf tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_LLAMA_13B/, including: tokenizer.json, tokenizer_config.json, special_tokens_map.json, tokenizer.model.
Install the packages listed in requirements.txt with the following command:

pip3 install -r /usr/local/gcu/TopsDistInfer/src/pre_process_LLAMA_13B/requirements.txt

Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 8 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_llama2_13b_split_8.json --model_path <path/to/llama2_13b_split_8>

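LLAMA2 13b chat

Running LLAMA2-13B-Chat inference on four gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the llama2_13b_chat_split_8 folder.
Generate the engines with the following command and move the generated bin files into the llama2_13b_chat_split_8 folder: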

python3 -m build_scripts.llama_13b --json_path /usr/local/gcu/TopsDistInfer/build_configs/llama2_13b_chat_split_8 --onnx_path ./llama2_13b_chat_split_8

Download the official Llama-2-13b-chat-hf tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_LLAMA_13B/, including: tokenizer.json, tokenizer_config.json, special_tokens_map.json, tokenizer.model.
Install the packages listed in requirements.txt with the following command:

pip3 install -r /usr/local/gcu/TopsDistInfer/src/pre_process_LLAMA_13B/requirements.txt

Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 8 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_llama2_13b_chat_split_8.json --model_path <path/to/llama2_13b_chat_split_8>

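BLOOM 7b1

Running BLOOM-7B1 inference on two gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the bloom-7b1 folder.
Generate the engines with the following commands and move the generated bin files into the bloom-7b1 folder: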

python3 -m build_scripts.bloom_7b1 --model_path ./bloom-7b1/ --stage_id 0 --max_bs 1 --max_seq 2048 --kvcache_dtype fp16
python3 -m build_scripts.bloom_7b1 --model_path ./bloom-7b1/ --stage_id 1 --max_bs 1 --max_seq 2048 --kvcache_dtype fp16
python3 -m build_scripts.bloom_7b1 --model_path ./bloom-7b1/ --stage_id 2 --max_bs 1 --max_seq 2048 --kvcache_dtype fp16
python3 -m build_scripts.bloom_7b1 --model_path ./bloom-7b1/ --stage_id 3 --max_bs 1 --max_seq 2048 --kvcache_dtype fp16

Download the official BLOOM-7B1 tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_bloom_7b1/, including: special_tokens_map.json, tokenizer.json, tokenizer_config.json.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_bloom_7b1_split_4_max_2048_mix16_3pg.json --model_path <path/to/bloom-7b1>

Baichuan2 7b

Running BAICHUAN2-7B inference on two gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the /usr/local/gcu/TopsDistInfer/baichuan2_7b_models folder.
Download the official baichuan2-7b tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_Baichuan2_7B/, including: tokenizer.model, tokenizer_config.json, special_tokens_map.json, tokenization_baichuan.py.
Generate the engines with the following command and move the generated bin files into the baichuan2_7b_split_4_max_2048_mix16_3pg folder:

python3 -m  build_scripts.baichuan_7b --model_path=/usr/local/gcu/TopsDistInfer/baichuan2_7b_models/ --json_path=/usr/local/gcu/TopsDistInfer/build_configs/baichuan2_7b_split_4_max_2048_mix16_3pg --max_seq=2048 --batch_size=1

Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 4 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_baichuan2_7b_split_4_max_2048_mix16_3pg.json --model_path=<path/to/baichuan2_7b_models>

Baichuan2 13b base

Running BAICHUAN2-13B-Base inference on four gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the /usr/local/gcu/TopsDistInfer/baichuan2_13b_models folder.
Download the official baichuan2-13b tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_Baichuan2_13B/, including: tokenizer.model, tokenizer_config.json, special_tokens_map.json, tokenization_baichuan.py.
Generate the engines with the following command and move the generated bin files into the baichuan2_13b_split_8_max_1024_mix16_3pg folder:

python3 -m  build_scripts.baichuan2_13b --model_path=/usr/local/gcu/TopsDistInfer/baichuan2_13b_models/ --json_path=/usr/local/gcu/TopsDistInfer/build_configs/baichuan2_13b_split_8_max_1024_mix16_3pg --max_seq=1024 --batch_size=1

Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 8 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_baichuan2_13b_split_8_max_1024_mix16_3pg.json --model_path=<path/to/baichuan2_13b_models>

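WIZARDCODER 15B

Running WIZARDCODER-15B inference on four gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the wizardcoder_15b_split_8 folder.
Generate the engines with the following commands and move the generated bin files into the wizardcoder_15b_split_8 folder: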

python3 -m build_scripts.wizardcoder_15b --model_path ./wizardcoder_15b_split_8/ --stage_id 0 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.wizardcoder_15b --model_path ./wizardcoder_15b_split_8/ --stage_id 1 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.wizardcoder_15b --model_path ./wizardcoder_15b_split_8/ --stage_id 2 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.wizardcoder_15b --model_path ./wizardcoder_15b_split_8/ --stage_id 3 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.wizardcoder_15b --model_path ./wizardcoder_15b_split_8/ --stage_id 4 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.wizardcoder_15b --model_path ./wizardcoder_15b_split_8/ --stage_id 5 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.wizardcoder_15b --model_path ./wizardcoder_15b_split_8/ --stage_id 6 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16
python3 -m build_scripts.wizardcoder_15b --model_path ./wizardcoder_15b_split_8/ --stage_id 7 --max_bs 1 --max_seq 1024 --kvcache_dtype fp16

Download the official WIZARDCODER-15B tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_wizardcoder_15b/, including: tokenizer.json, tokenizer_config.json, added_tokens.json, special_tokens_map.json.
Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 8 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_wizardcoder_15b_split_8_max_1024.json --model_path=<path/to/wizardcoder_15b_split_8>

Baichuan2 13b chat

Running BAICHUAN2-13B-Chat inference on four gcu210 cards:

Place the onnx and weights files generated by the method in Chapter 5 in the /usr/local/gcu/TopsDistInfer/baichuan2_13b_chat_models folder.
Download the official baichuan2-13b tokenizer files to /usr/local/gcu/TopsDistInfer/src/pre_process_Baichuan2_13B/, including: tokenizer.model, tokenizer_config.json, special_tokens_map.json, tokenization_baichuan.py.
Generate the engines with the following command and move the generated bin files into the baichuan2_13b_chat_split_8_max_1024_mix16_3pg folder:

python3 -m  build_scripts.baichuan2_13b --model_path=/usr/local/gcu/TopsDistInfer/baichuan2_13b_chat_models/ --json_path=/usr/local/gcu/TopsDistInfer/build_configs/baichuan2_13b_chat_split_8_max_1024_mix16_3pg --max_seq=1024 --batch_size=1

Run inference with the following commands:

export EFRT_CLUSTER_AS_DEVICE=true
export ECCL_RUNTIME_3_0_ENABLE=true
mpirun -n 8 --allow-run-as-root Pipeline_Text_Generate_micro_batch --run_config_file=run_configs/run_config_baichuan2_13b_chat_split_8_max_1024_mix16_3pg.json --model_path=<path/to/baichuan2_13b_chat_models>