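5.5. Llama

Llama2-70b

Inference and performance testing for this model require four Enflame GCUs.

Model download

url:llama2-70b
branch:main
commit id:6aa89cf

Download everything under the path given by the url above into the llama-2-70b-hf folder.

Batch offline inference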

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama-2-70b-hf] \
 --tensor-parallel-size=4 \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --gpu-memory-utilization=0.9

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama-2-70b-hf] \
 --tensor-parallel-size=4 \
 --max-model-len=4096 \
 --tokenizer=[path of llama-2-70b-hf] \
 --input-len=512 \
 --output-len=240 \
 --num-prompts=1 \
 --block-size=64 \
 --gpu-memory-utilization=0.9 \
 --dtype=float16

Notes:

This model supports a max-model-len of up to 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.
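The length limits in these notes can be checked before launching a run. The sketch below assumes the usual vLLM constraint that the prompt length plus the number of generated tokens must not exceed max-model-len; the helper name fits_context is hypothetical and is not part of vllm_utils.

```python
def fits_context(input_len: int, output_len: int, max_model_len: int) -> bool:
    """Return True when prompt tokens plus generated tokens fit in the context window.

    Assumption: requests whose total length exceeds max_model_len are
    rejected or truncated, as is usual for vLLM.
    """
    return input_len + output_len <= max_model_len


# Values from the Llama2-70b performance-test command: 512 + 240 <= 4096.
print(fits_context(512, 240, 4096))
# A 4096-token prompt leaves no room for generated tokens.
print(fits_context(4096, 240, 4096))
```

A check like this is cheap to run when adjusting input-len and output-len by hand.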

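MMLU evaluation with OpenCompass

Install OpenCompass

Follow the OpenCompass installation steps.

Note: OpenCompass 0.3.1 is recommended. If installing the dependencies pulls in a version that is inconsistent with torch_gcu, reinstall the matching version manually.

Note: the following dependencies are required: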

python3 -m pip install opencv-python==4.9.0.80
python3 -m pip install huggingface-hub==0.25.2
# for x86_64
python3 -m pip install torchvision==0.21.0+cpu -i https://download.pytorch.org/whl/cpu
# for aarch64
python3 -m pip install torchvision==0.21.0
# for x86_64 and python_version>=3.10
python3 -m pip install importlib-metadata==8.5.0
# for aarch64 and python_version>=3.10
python3 -m pip install importlib-metadata==4.6.4
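The platform-specific pins above could be selected automatically. This is a minimal sketch that only encodes the conditions stated in the comments above; the function name extra_pip_commands is hypothetical, and the script prints the commands rather than running them.

```python
import platform
import sys


def extra_pip_commands(machine: str, py_version: tuple) -> list:
    """Return the pip argument lists that apply to the given architecture and Python version."""
    pip = ["python3", "-m", "pip", "install"]
    commands = [
        pip + ["opencv-python==4.9.0.80"],
        pip + ["huggingface-hub==0.25.2"],
    ]
    if machine == "x86_64":
        commands.append(pip + ["torchvision==0.21.0+cpu", "-i", "https://download.pytorch.org/whl/cpu"])
        if py_version >= (3, 10):
            commands.append(pip + ["importlib-metadata==8.5.0"])
    elif machine == "aarch64":
        commands.append(pip + ["torchvision==0.21.0"])
        if py_version >= (3, 10):
            commands.append(pip + ["importlib-metadata==4.6.4"])
    return commands


# Print the commands that match the current machine.
for cmd in extra_pip_commands(platform.machine(), sys.version_info[:2]):
    print(" ".join(cmd))
```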

Prepare the config file

Save the configuration below as a Python file at the following path inside OpenCompass: configs/models/llama/vllm_llama2_70b.py

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='llama2-70b-vllm',
        path='/path/to/Llama2-70b',
        max_out_len=100,
        max_seq_len=2048,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=0, num_procs=1),
        model_kwargs=dict(device='gcu',
                          tensor_parallel_size=4,
                          enforce_eager=True)
    )
]

Run the following commands:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 run.py \
 --models=vllm_llama2_70b \
 --datasets=mmlu_gen

Meta-Llama-3-8B

Model download

url:Meta-Llama-3-8B
branch:master
commit id:e4260355

Download everything under the path given by the url above into the Meta-Llama-3-8B folder.

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3-8B] \
 --demo=te \
 --dtype=float16 \
 --output-len=20 \
 --max-model-len=64

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3-8B] \
 --max-model-len=8192 \
 --tokenizer=[path of Meta-Llama-3-8B] \
 --input-len=2048 \
 --output-len=2048 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --device gcu

Notes:

This model supports a max-model-len of up to 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.
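One way to use the output-len=1 measurement is to subtract it from the latency of a longer run to estimate the average per-token decode latency. This is a sketch under the assumption that the reported latency is the end-to-end time for a single prompt (num-prompts=1); the function name and the numbers in the example are placeholders, not measured results.

```python
def mean_decode_latency(total_latency: float, ttft: float, output_len: int) -> float:
    """Average time per generated token after the first one.

    total_latency: latency of a run that generates output_len tokens.
    ttft: latency of the same run repeated with output-len=1.
    """
    if output_len < 2:
        raise ValueError("output_len must be at least 2 to have decode steps")
    return (total_latency - ttft) / (output_len - 1)


# Placeholder numbers for illustration only (seconds).
print(mean_decode_latency(total_latency=2.0, ttft=0.5, output_len=4))
```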

Meta-Llama-3-70B

Model download

url:Meta-Llama-3-70B
branch:master
commit id:0061f2a0

Download everything under the path given by the url above into the Meta-Llama-3-70B folder.

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3-70B] \
 --tensor-parallel-size=4 \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --max-model-len=4096

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3-70B] \
 --tensor-parallel-size=4 \
 --max-model-len=8192 \
 --device "gcu" \
 --input-len=4096 \
 --output-len=4096 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16

Notes:

This model supports a max-model-len of up to 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.
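Because input-len and output-len are adjustable, a sweep over several lengths can be generated programmatically. The sketch below only builds argument lists, reusing the flags from the Meta-Llama-3-70B performance-test command; build_perf_command, the placeholder path, and the chosen length pairs are illustrative assumptions.

```python
def build_perf_command(model_path, input_len, output_len, tp_size=4, max_model_len=8192):
    """Build the argument list for one vllm_utils.benchmark_test --perf run."""
    # Assumption: prompt plus generated tokens must fit within max-model-len.
    if input_len + output_len > max_model_len:
        raise ValueError("input_len + output_len exceeds max_model_len")
    return [
        "python3", "-m", "vllm_utils.benchmark_test", "--perf",
        f"--model={model_path}",
        f"--tensor-parallel-size={tp_size}",
        f"--max-model-len={max_model_len}",
        "--device=gcu",
        f"--input-len={input_len}",
        f"--output-len={output_len}",
        "--num-prompts=1",
        "--block-size=64",
        "--dtype=float16",
    ]


# Hypothetical sweep; replace the path with the real Meta-Llama-3-70B directory.
for in_len, out_len in [(1024, 1024), (2048, 2048), (4096, 4096)]:
    print(" ".join(build_perf_command("/path/to/Meta-Llama-3-70B", in_len, out_len)))
```

Each list can be passed to subprocess.run(cmd, check=True) to execute the runs one after another.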

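Llama2-13b-w8a16_gptq

Inference and performance testing for this model require one Enflame GCU.

Model download

To download the weights, contact your sales representative to have EGC access enabled.

Download llama2-13b-w8a16_gptq.tar and extract it, then copy the entire contents of the archive into the llama2-13b-w8a16_gptq folder.
The llama2-13b-w8a16_gptq directory structure is as follows: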

llama2-13b-w8a16_gptq/
├── config.json
├── generation_config.json
├── model.safetensors
├── quantize_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
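After extracting the archive, the directory can be checked against the listing above. The following is a minimal sketch; missing_files is a hypothetical helper, and the file names are copied from the listing.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "quantize_config.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "tokenizer.model",
]


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# An empty list means every expected file is present.
print(missing_files("llama2-13b-w8a16_gptq"))
```

The same helper works for the other quantized models by passing their own file list as the expected argument.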

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama2-13b-w8a16_gptq] \
 --demo=te \
 --dtype=float16 \
 --output-len=20 \
 --quantization gptq \
 --max-model-len 64

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama2-13b-w8a16_gptq] \
 --max-model-len=4096 \
 --tokenizer=[path of llama2-13b-w8a16_gptq] \
 --input-len=128 \
 --output-len=3968 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --quantization gptq \
 --enforce-eager

Notes:

This model supports a max-model-len of up to 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.

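Llama2-70b-w8a16_gptq

Inference and performance testing for this model require two Enflame GCUs.

Model download

To download the weights, contact your sales representative to have EGC access enabled.

Download llama2-70b-w8a16_gptq.tar and extract it, then copy the entire contents of the archive into the llama2-70b-w8a16_gptq folder.
The llama2-70b-w8a16_gptq directory structure is as follows: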

llama2-70b-w8a16_gptq/
├── config.json
├── generation_config.json
├── model.safetensors
├── quantize_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama2-70b-w8a16_gptq] \
 --demo=te \
 --dtype=float16 \
 --output-len=20 \
 --quantization gptq \
 --tensor-parallel-size=2

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama2-70b-w8a16_gptq] \
 --tensor-parallel-size=2 \
 --max-model-len=4096 \
 --device gcu \
 --input-len=2048 \
 --output-len=2048 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --quantization gptq

Notes:

This model supports a max-model-len of up to 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.

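Llama3-8b-w8a16_gptq

Inference and performance testing for this model require one Enflame GCU.

Model download

To download the weights, contact your sales representative to have EGC access enabled.

Download llama3-8b-w8a16_gptq.tar and extract it, then copy the entire contents of the archive into the llama3-8b-w8a16_gptq folder.
The llama3-8b-w8a16_gptq directory structure is as follows: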

llama3-8b-w8a16_gptq/
├── config.json
├── model.safetensors
├── quantize_config.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama3-8b-w8a16_gptq] \
 --demo=te \
 --dtype=float16 \
 --quantization gptq \
 --max-model-len 64 \
 --output-len 20

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama3-8b-w8a16_gptq] \
 --max-model-len=8192 \
 --tokenizer=[path of llama3-8b-w8a16_gptq] \
 --input-len=2048 \
 --output-len=2048 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --quantization gptq \
 --device gcu

Notes:

This model supports a max-model-len of up to 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.

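Llama3-70b-w8a16_gptq

Inference and performance testing for this model require two Enflame GCUs.

Model download

To download the weights, contact your sales representative to have EGC access enabled.

Download llama3-70b-w8a16_gptq.tar and extract it, then copy the entire contents of the archive into the llama3-70b-w8a16_gptq folder.
The llama3-70b-w8a16_gptq directory structure is as follows: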

llama3-70b-w8a16_gptq/
├── config.json
├── model.safetensors
├── quantize_config.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama3-70b-w8a16_gptq] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --quantization gptq \
 --tensor-parallel-size=2 \
 --gpu-memory-utilization=0.945 \
 --max-model-len=4096

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama3-70b-w8a16_gptq] \
 --tensor-parallel-size=2 \
 --max-model-len=8192 \
 --device "gcu" \
 --input-len=4096 \
 --output-len=4096 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --quantization gptq \
 --gpu-memory-utilization=0.945

Notes:

This model supports a max-model-len of up to 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.

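Meta-Llama-3.1-8B-Instruct

Inference and performance testing for this model require one Enflame GCU.

Model download

url: Meta-Llama-3.1-8B-Instruct
branch: main
commit id: 8c22764

Download everything under the path given by the url above into the Meta-Llama-3.1-8B-Instruct folder.

Requirements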

python3 -m pip install transformers==4.48.2

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3.1-8B-Instruct] \
 --demo=te \
 --dtype=bfloat16 \
 --output-len=256 \
 --device=gcu \
 --max-model-len=32768 \
 --tensor-parallel-size 1 \
 --gpu-memory-utilization 0.9

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3.1-8B-Instruct] \
 --max-model-len=32768 \
 --tokenizer=[path of Meta-Llama-3.1-8B-Instruct] \
 --input-len=8192 \
 --output-len=512 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=bfloat16 \
 --device gcu \
 --tensor-parallel-size 1 \
 --gpu-memory-utilization 0.9

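Notes:

This model supports a max-model-len of up to 131072; a single card can run 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.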

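Meta-Llama-3.1-70B-Instruct

Inference and performance testing for this model require eight Enflame GCUs.

Model download

url: Meta-Llama-3.1-70B-Instruct
branch: master
commit id: b6444261

Download everything under the path given by the url above into the Meta-Llama-3.1-70B-Instruct folder.

Requirements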

python3 -m pip install transformers==4.48.2

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3.1-70B-Instruct] \
 --tensor-parallel-size=8 \
 --demo=te \
 --max-model-len=32768 \
 --dtype=bfloat16 \
 --device=gcu \
 --output-len=256 \
 --gpu-memory-utilization 0.9

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3.1-70B-Instruct] \
 --max-model-len=32768 \
 --tokenizer=[path of Meta-Llama-3.1-70B-Instruct] \
 --tensor-parallel-size=8 \
 --input-len=8192 \
 --output-len=512 \
 --num-prompts=1 \
 --block-size=64 \
 --device=gcu \
 --dtype=bfloat16 \
 --gpu-memory-utilization 0.9

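Notes:

This model supports a max-model-len of up to 131072; running 32768 requires eight cards.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.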

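Llama3-70b-w4a16

Inference and performance testing for this model require two Enflame GCUs.

Model download

To download the weights, contact your sales representative to have EGC access enabled.

Download Meta-Llama-3-70B_W4A16_GPTQ.tar and extract it, then copy the entire contents of the archive into the llama3-70b-w4a16 folder.
The llama3-70b-w4a16 directory structure is as follows: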

llama3-70b-w4a16/
  ├── config.json
  ├── model.safetensors
  ├── quantize_config.json
  ├── tokenizer_config.json
  ├── tokenizer.json
  └── tops_quantize_info.json

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama3-70b-w4a16] \
 --tensor-parallel-size=2 \
 --max-model-len=8192 \
 --output-len=512 \
 --demo=te \
 --dtype=float16 \
 --quantization=gptq \
 --device=gcu

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama3-70b-w4a16] \
 --device=gcu \
 --max-model-len=8192 \
 --tokenizer=[path of llama3-70b-w4a16] \
 --input-len=2048 \
 --output-len=1024 \
 --num-prompts=1 \
 --tensor-parallel-size=2 \
 --block-size=64

Notes:

This model supports a max-model-len of up to 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.

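Meta-Llama-3.1-70B-Instruct-w4a16

Inference and performance testing for this model require four Enflame GCUs.

Model download

To download the weights, contact your sales representative to have EGC access enabled.

Download Meta-Llama-3.1-70B-Instruct_W4A16_AWQ.tar and extract it, then copy the entire contents of the archive into the Meta-Llama-3.1-70B-Instruct_W4A16_AWQ folder.
The Meta-Llama-3.1-70B-Instruct_W4A16_AWQ directory structure is as follows: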

Meta-Llama-3.1-70B-Instruct_W4A16_AWQ/
├── config.json
├── model.safetensors
├── quantize_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
├── tokenizer.model
└── tops_quantize_info.json

Requirements

python3 -m pip install transformers==4.48.2

Batch offline inference

python3 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3.1-70B-Instruct_W4A16_AWQ] \
 --tensor-parallel-size=4 \
 --max-model-len=32768 \
 --dtype=float16 \
 --device=gcu \
 --output-len=256 \
 --demo=te \
 --gpu-memory-utilization=0.9 \
 --quantization=awq

Performance testing

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3.1-70B-Instruct_W4A16_AWQ] \
 --tokenizer=[path of Meta-Llama-3.1-70B-Instruct_W4A16_AWQ] \
 --tensor-parallel-size=4 \
 --max-model-len=32768 \
 --dtype=float16 \
 --device=gcu \
 --input-len=31744 \
 --output-len=1024 \
 --num-prompts=1 \
 --block-size=64 \
 --gpu-memory-utilization=0.8 \
 --quantization=awq

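Notes:

This model supports a max-model-len of up to 131072; four S60 cards can run 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.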

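Llama-3.3-70B-Instruct

Inference and performance testing for this model require eight Enflame GCUs.

Model download

url: Llama-3.3-70B-Instruct
branch: master
commit id: a5b145fa

Download everything under the path given by the url above into the Llama-3.3-70B-Instruct folder.

Batch offline inference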

python3 -m vllm_utils.benchmark_test \
 --model=[path of Llama-3.3-70B-Instruct] \
 --dtype=bfloat16 \
 --max-model-len=32768 \
 --tensor-parallel-size=8 \
 --output-len=256 \
 --demo=te \
 --gpu-memory-utilization=0.9 \
 --device=gcu

Serving mode

# Start the server
python3 -m vllm.entrypoints.openai.api_server \
 --model=[path of Llama-3.3-70B-Instruct] \
 --tokenizer=[path of Llama-3.3-70B-Instruct] \
 --dtype=bfloat16 \
 --max-model-len=32768 \
 --tensor-parallel-size=8 \
 --block-size=64 \
 --gpu-memory-utilization=0.9 \
 --disable-log-stats \
 --device=gcu

# Start the client
python3 -m vllm_utils.benchmark_serving \
 --backend vllm \
 --model=[path of Llama-3.3-70B-Instruct] \
 --tokenizer=[path of Llama-3.3-70B-Instruct] \
 --request-rate=inf \
 --random-input-len=1024 \
 --random-output-len=1024 \
 --num-prompts=1 \
 --dataset-name=random \
 --ignore-eos \
 --strict-in-out-len

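Notes:

This model supports a max-model-len of up to 131072; eight S60 cards can run 32768.
To keep the input and output lengths fixed, the test uses a randomly generated dataset.
num-prompts, random-input-len, and random-output-len can be adjusted as needed.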
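Once the server is running, it can also be queried directly without the benchmark client. The sketch below assumes the server listens on vLLM's default address http://localhost:8000 and exposes the OpenAI-compatible /v1/completions route, and that the model name equals the path passed to --model; build_completion_request and send are illustrative helpers, not part of vllm_utils.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64):
    """Build a POST request for the OpenAI-compatible completions route."""
    payload = {
        "model": model,  # assumed to match the --model value used by the server
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, send(build_completion_request("http://localhost:8000", "[path of Llama-3.3-70B-Instruct]", "Hello"))["choices"][0]["text"] would return the generated text, provided the server was started as shown above.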