3.16. llama¶

llama-65b¶

Inference and performance testing for this model require four Enflame GCUs.

Model download¶

url: llama-65b
branch: llama_v1
commit id: 57b0eb62de0636e75af471e49e2f1862d908d9d8

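Follow download to fetch the llama-65b model, and place all of its contents in the llama-65b folder.
Follow convert_llama_weights_to_hf.py to convert the downloaded model files to the Hugging Face Transformers format, and store all converted output in the llama-65b-hf folder.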
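The conversion step can be sketched as follows. This is a minimal sketch, not the vendor's procedure: it assumes the script is the one shipped with Hugging Face Transformers and that it accepts --input_dir, --model_size and --output_dir (check the script's --help for the version you have). The helper name build_convert_command and the script location are hypothetical.

```python
import shlex


def build_convert_command(input_dir, output_dir, model_size="65B",
                          script="convert_llama_weights_to_hf.py"):
    """Return the argv list for the LLaMA-to-Hugging-Face conversion script.

    The flag names are an assumption based on the Transformers script;
    `script` should point at wherever that file lives on your machine.
    """
    return [
        "python3", script,
        "--input_dir", input_dir,
        "--model_size", model_size,
        "--output_dir", output_dir,
    ]


# Folder names follow the text above; nothing is executed here.
cmd = build_convert_command("llama-65b", "llama-65b-hf")
print(shlex.join(cmd))
```

Pass the printed command to a shell, or hand `cmd` to subprocess.run, once the paths are confirmed.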

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama-65b-hf] \
 --demo=te \
 --tensor-parallel-size=4 \
 --dtype=float16 \
 --output-len=256

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama-65b-hf] \
 --tensor-parallel-size=4 \
 --max-model-len=2048 \
 --tokenizer=[path of llama-65b-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16

Notes:

The max-model-len supported by this model is 2048.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.
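When adjusting these values, a small helper can keep the lengths consistent. The sketch below only assembles the performance-test command from flags that already appear in this section. The function name build_perf_command is hypothetical, and the check that input-len plus output-len must fit within max-model-len is an assumption based on how vLLM normally bounds a sequence, not something stated by this guide.

```python
import shlex


def build_perf_command(model_path, input_len, output_len, max_model_len,
                       num_prompts=1, tensor_parallel_size=1,
                       dtype="float16", block_size=64):
    """Assemble the vllm_utils.benchmark_test --perf command line."""
    # Assumption: prompt tokens plus generated tokens must fit in the context window.
    if input_len + output_len > max_model_len:
        raise ValueError(
            f"input_len + output_len = {input_len + output_len} "
            f"exceeds max_model_len = {max_model_len}")
    return [
        "python3", "-m", "vllm_utils.benchmark_test", "--perf",
        f"--model={model_path}",
        f"--tensor-parallel-size={tensor_parallel_size}",
        f"--max-model-len={max_model_len}",
        f"--tokenizer={model_path}",
        f"--input-len={input_len}",
        f"--output-len={output_len}",
        f"--num-prompts={num_prompts}",
        f"--block-size={block_size}",
        f"--dtype={dtype}",
    ]


# llama-65b settings from this section; use output_len=1 to read the
# reported latency as time_to_first_token_latency.
print(shlex.join(build_perf_command("/path/to/llama-65b-hf", 128, 128, 2048,
                                    tensor_parallel_size=4)))
```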

llama2-7b¶

Model download¶

url: llama2-7b
branch: main
commit id: 3f025b

Download everything under the path given by the url above into the llama-2-7b-hf folder.

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama-2-7b-hf] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --gpu-memory-utilization=0.945

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama-2-7b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of llama-2-7b-hf] \
 --input-len=3968 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --dtype=float16 \
 --gpu-memory-utilization=0.945

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

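MMLU evaluation with OpenCompass¶

Install OpenCompass

Follow the OpenCompass installation steps.

Note: OpenCompass 0.2.1 is recommended. If installing the dependencies pulls in a package version that is inconsistent with torch_gcu, reinstall the correct version manually.

Prepare the config file

Save the following configuration as a Python file at this path inside OpenCompass: configs/models/llama/vllm_llama2_7b.py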

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='llama2-7b-vllm',
        path='/path/to/Llama-2-7b-hf/',
        max_out_len=1,
        max_seq_len=4096,
        batch_size=32,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=0, num_procs=1),
        model_kwargs=dict(device='gcu',
                          gpu_memory_utilization=0.7,
                          enforce_eager=True)
    )
]

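Modify opencompass/models/vllm.py in OpenCompass to add a get_ppl method. The method uses torch and np, so make sure torch and numpy (as np) are imported at the top of that file.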

    def get_ppl(self,
                inputs: List[str],
                mask_length: Optional[List[int]] = None) -> List[float]:
        assert mask_length is None, 'mask_length is not supported'
        bsz = len(inputs)

        # tokenize
        prompt_tokens = [self.tokenizer(x, truncation=True,
                                        add_special_tokens=False,
                                        max_length=self.max_seq_len - 1
                                        )['input_ids'] for x in inputs]
        max_prompt_size = max([len(t) for t in prompt_tokens])
        total_len = min(self.max_seq_len, max_prompt_size)
        tokens = torch.zeros((bsz, total_len)).long()
        for k, t in enumerate(prompt_tokens):
            num_token = min(total_len, len(t))
            tokens[k, :num_token] = torch.tensor(t[-num_token:]).long()
        # forward
        generation_kwargs = {}
        generation_kwargs.update(self.generation_kwargs)
        global ce_loss, bz_idx
        ce_loss = []
        bz_idx = 0

        def logits_hook(logits):
            global ce_loss, bz_idx
            # compute ppl
            shift_logits = logits[..., :-1, :].contiguous().float()
            shift_labels = tokens[bz_idx:bz_idx + logits.shape[0],
                                  1:logits.shape[1]
                                  ].contiguous().to(logits.device)
            bz_idx += logits.shape[0]
            shift_logits = shift_logits.view(-1, shift_logits.size(-1))
            shift_labels = shift_labels.view(-1)
            loss_fct = torch.nn.CrossEntropyLoss(
                reduction='none', ignore_index=0)
            loss = loss_fct(shift_logits, shift_labels).view(
                logits.shape[0], -1)
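            # shift_labels was flattened above; restore the batch dimension
            # so that non-padding labels are counted per sequence.
            lens = (shift_labels.view(logits.shape[0], -1) != 0
                    ).sum(-1).cpu().numpy()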
            ce_loss.append(loss.sum(-1).cpu().detach().numpy() / lens)
            return logits

        generation_kwargs['full_logits_processors'] = [logits_hook]
        generation_kwargs['max_tokens'] = 1
        sampling_kwargs = SamplingParams(**generation_kwargs)
        outputs = self.model.generate(
            None, sampling_kwargs, prompt_tokens, use_tqdm=False)
        ce_loss = np.concatenate(ce_loss)
        return ce_loss
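To make the quantity returned by get_ppl concrete, here is a dependency-free sketch of the same calculation for a single sequence: position i is scored against token i + 1, padding id 0 is ignored, and the loss is averaged over the remaining labels. The function names and the toy data are illustrative only and are not part of OpenCompass or vLLM.

```python
import math

PAD_ID = 0  # get_ppl pads with 0 and passes ignore_index=0


def log_softmax(row):
    """Numerically stable log-softmax over one row of vocabulary scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def sequence_loss(logits, tokens):
    """Mean next-token negative log-likelihood over non-padding labels.

    logits: one list of vocabulary scores per position.
    tokens: token ids, same length as logits.
    """
    total, count = 0.0, 0
    # Position i predicts token i + 1: drop the last logit row and the first token.
    for row, label in zip(logits[:-1], tokens[1:]):
        if label == PAD_ID:
            continue  # padded positions do not contribute
        total -= log_softmax(row)[label]
        count += 1
    return total / count


# Toy input: vocabulary of 3 ids, uniform scores, last token is padding.
uniform = [[0.0, 0.0, 0.0]] * 4
print(sequence_loss(uniform, [1, 2, 1, 0]))
```

With uniform scores every label has probability 1/3, so the result equals ln 3 regardless of how many padded positions follow.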

Run the following command

python3 run.py \
 --models=vllm_llama2_7b \
 --datasets=mmlu_ppl \
 --max-partition-size=10000000

llama2-13b¶

Model download¶

url: llama2-13b
branch: main
commit id: 638c8be

Download everything under the path given by the url above into the llama-2-13b-hf folder.

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama-2-13b-hf] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --gpu-memory-utilization=0.945

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama-2-13b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of llama-2-13b-hf] \
 --input-len=128 \
 --output-len=3968 \
 --num-prompts=1 \
 --block-size=64 \
 --gpu-memory-utilization=0.945 \
 --dtype=float16

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

llama2-70b¶

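Inference and performance testing for this model require four Enflame GCUs.

Model download¶

url: llama2-70b
branch: main
commit id: 6aa89cf

Download everything under the path given by the url above into the llama-2-70b-hf folder.

Batch offline inference¶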

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama-2-70b-hf] \
 --tensor-parallel-size=4 \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --gpu-memory-utilization=0.945

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama-2-70b-hf] \
 --tensor-parallel-size=4 \
 --max-model-len=4096 \
 --tokenizer=[path of llama-2-70b-hf] \
 --input-len=128 \
 --output-len=3968 \
 --num-prompts=1 \
 --block-size=64 \
 --gpu-memory-utilization=0.945 \
 --dtype=float16

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

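MMLU evaluation with OpenCompass¶

Install OpenCompass

Follow the OpenCompass installation steps.

Note: OpenCompass 0.2.1 is recommended. If installing the dependencies pulls in a package version that is inconsistent with torch_gcu, reinstall the correct version manually.

Prepare the config file

Save the following configuration as a Python file at this path inside OpenCompass: configs/models/llama/vllm_llama2_70b.py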

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='llama2-70b-vllm',
        path='/path/to/Llama2-70b',
        max_out_len=100,
        max_seq_len=2048,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=0, num_procs=1),
        model_kwargs=dict(device='gcu',
                          tensor_parallel_size=4,
                          enforce_eager=True)
    )
]

Run the following commands

export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 run.py \
 --models=vllm_llama2_70b \
 --datasets=mmlu_gen \
 --max-partition-size=10000000

chinese-llama-2-7b¶

Model download¶

url: chinese-llama-2-7b
branch: main
commit id: c40cf9a

Download everything under the path given by the url above into the chinese-llama-2-7b-hf folder.

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of chinese-llama-2-7b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-7b-hf] \
 --dtype=float16 \
 --demo=tc  \
 --output-len=256

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of chinese-llama-2-7b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-7b-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=4 \
 --block-size=64 \
 --dtype=float16

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

chinese-llama-2-7b-16k¶

Model download¶

url: chinese-llama-2-7b-16k
branch: main
commit id: c934a79

Download everything under the path given by the url above into the chinese-llama-2-7b-16k-hf folder.

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of chinese-llama-2-7b-16k-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-7b-16k-hf] \
 --dtype=float16 \
 --demo=tc \
 --output-len=256

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of chinese-llama-2-7b-16k-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-7b-16k-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=4 \
 --block-size=64 \
 --dtype=float16

Notes:

The max-model-len supported by this model is 16384.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

chinese-llama-2-13b¶

Model download¶

url: chinese-llama-2-13b
branch: main
commit id: 043f8d2

Download everything under the path given by the url above into the chinese-llama-2-13b-hf folder.

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of chinese-llama-2-13b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-13b-hf] \
 --dtype=float16 \
 --demo=tc  \
 --output-len=256

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of chinese-llama-2-13b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-13b-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=2 \
 --block-size=64 \
 --dtype=float16

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

chinese-llama-2-13b-16k¶

Model download¶

url: chinese-llama-2-13b-16k
branch: main
commit id: 1c90d65

Download everything under the path given by the url above into the chinese-llama-2-13b-16k-hf folder.

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of chinese-llama-2-13b-16k-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-13b-16k-hf] \
 --dtype=float16 \
 --demo=tc  \
 --output-len=256

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of chinese-llama-2-13b-16k-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-13b-16k-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=2 \
 --block-size=64 \
 --dtype=float16

Notes:

The max-model-len supported by this model is 16384.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Meta-Llama-3-8B¶

Model download¶

url: Meta-Llama-3-8B
branch: master
commit id: e4260355

Download everything under the path given by the url above into the Meta-Llama-3-8B folder.

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3-8B] \
 --demo=te \
 --dtype=float16 \
 --output-len=20 \
 --max-model-len=64

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3-8B] \
 --max-model-len=8192 \
 --tokenizer=[path of Meta-Llama-3-8B] \
 --input-len=128 \
 --output-len=3968 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --enforce-eager

Notes:

The max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Meta-Llama-3-70B¶

Model download¶

url: Meta-Llama-3-70B
branch: master
commit id: 0061f2a0

Download everything under the path given by the url above into the Meta-Llama-3-70B folder.

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3-70B] \
 --tensor-parallel-size=4 \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --max-model-len=4096

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3-70B] \
 --tensor-parallel-size=4 \
 --max-model-len=8192 \
 --tokenizer=[path of Meta-Llama-3-70B] \
 --input-len=1024 \
 --output-len=7168 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16

Notes:

The max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

llama2-7b-w8a16¶

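Inference and performance testing for this model require one Enflame GCU.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download the llama2-7b-w8a16.tar file and extract it, then copy all of the archive's contents into the llama2-7b-w8a16 folder.
The llama2-7b-w8a16 directory structure is as follows: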

llama2-7b-w8a16/
├── config.json
├── generation_config.json
├── model.safetensors
├── quantize_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
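Before launching inference it can help to confirm that the extracted folder matches the listing above. The following is a minimal sketch using only the Python standard library; the helper name missing_files is hypothetical, and the expected file list is copied from the directory tree shown for llama2-7b-w8a16 (other quantized models in this section ship slightly different files).

```python
from pathlib import Path

# File names taken from the llama2-7b-w8a16 directory listing.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "quantize_config.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "tokenizer.model",
]


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# An empty list means the folder contains every file from the listing.
print(missing_files("llama2-7b-w8a16"))
```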

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama2-7b-w8a16] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --quantization w8a16 \
 --max-model-len=64

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama2-7b-w8a16] \
 --max-model-len=4096 \
 --tokenizer=[path of llama2-7b-w8a16] \
 --input-len=128 \
 --output-len=3968 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16 \
 --enforce-eager

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

llama2-13b-w8a16¶

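Inference and performance testing for this model require one Enflame GCU.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download the llama2-13b-w8a16.tar file and extract it, then copy all of the archive's contents into the llama2-13b-w8a16 folder.
The llama2-13b-w8a16 directory structure is as follows: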

llama2-13b-w8a16/
├── config.json
├── generation_config.json
├── model.safetensors
├── quantize_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama2-13b-w8a16] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --quantization w8a16 \
 --max-model-len 64

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama2-13b-w8a16] \
 --max-model-len=4096 \
 --tokenizer=[path of llama2-13b-w8a16] \
 --input-len=128 \
 --output-len=3968 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16 \
 --enforce-eager

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

llama2-70b-w8a16¶

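Inference and performance testing for this model require two Enflame GCUs.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download the llama2-70b-w8a16.tar file and extract it, then copy all of the archive's contents into the llama2-70b-w8a16 folder.
The llama2-70b-w8a16 directory structure is as follows: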

llama2-70b-w8a16/
├── config.json
├── generation_config.json
├── model.safetensors
├── quantize_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama2-70b-w8a16] \
 --demo=te \
 --dtype=float16 \
 --output-len=20 \
 --quantization w8a16 \
 --tensor-parallel-size=2

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama2-70b-w8a16] \
 --tensor-parallel-size=2 \
 --max-model-len=4096 \
 --tokenizer=[path of llama2-70b-w8a16] \
 --input-len=128 \
 --output-len=3968 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

llama3-8b-w8a16¶

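Inference and performance testing for this model require one Enflame GCU.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download the llama3-8b-w8a16.tar file and extract it, then copy all of the archive's contents into the llama3-8b-w8a16 folder.
The llama3-8b-w8a16 directory structure is as follows: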

llama3-8b-w8a16/
├── config.json
├── model.safetensors
├── quantize_config.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama3-8b-w8a16] \
 --demo=te \
 --dtype=float16 \
 --quantization w8a16 \
 --max-model-len 64 \
 --output-len 20

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama3-8b-w8a16] \
 --max-model-len=8192 \
 --tokenizer=[path of llama3-8b-w8a16] \
 --input-len=128 \
 --output-len=3968 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16 \
 --enforce-eager

Notes:

The max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

llama3-70b-w8a16¶

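Inference and performance testing for this model require two Enflame GCUs.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download the llama3-70b-w8a16.tar file and extract it, then copy all of the archive's contents into the llama3-70b-w8a16 folder.
The llama3-70b-w8a16 directory structure is as follows: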

llama3-70b-w8a16/
├── config.json
├── model.safetensors
├── quantize_config.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama3-70b-w8a16] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --quantization w8a16 \
 --tensor-parallel-size=2 \
 --gpu-memory-utilization=0.945 \
 --max-model-len=4096

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama3-70b-w8a16] \
 --tensor-parallel-size=2 \
 --max-model-len=8192 \
 --tokenizer=[path of llama3-70b-w8a16] \
 --input-len=1024 \
 --output-len=7168 \
 --num-prompts=1 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16 \
 --gpu-memory-utilization=0.945

Notes:

The max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Meta-Llama-3.1-8B-Instruct¶

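Inference and performance testing for this model require one Enflame GCU.

Model download¶

url: Meta-Llama-3.1-8B-Instruct
branch: main
commit id: 8c22764

Download everything under the path given by the url above into the Meta-Llama-3.1-8B-Instruct folder.

Batch offline inference¶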

python3 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3.1-8B-Instruct] \
 --demo=te \
 --dtype=bfloat16 \
 --output-len=256 \
 --device=gcu \
 --max-model-len=32768

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3.1-8B-Instruct] \
 --max-model-len=32768 \
 --tokenizer=[path of Meta-Llama-3.1-8B-Instruct] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=8 \
 --block-size=64 \
 --dtype=bfloat16

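Notes:

The max-model-len supported by this model is 131072; a single card can run up to 32768.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.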

llama2-7b-w4a16¶

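Inference and performance testing for this model require one Enflame GCU.

Model download¶

url: Llama-2-7B-Chat-GPTQ
branch: main
commit id: d5ad9310836dd91b6ac6133e2e47f47394386cea

Download everything under the path given by the url above into the Llama-2-7B-Chat-GPTQ folder.

Batch offline inference¶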

python3 -m vllm_utils.benchmark_test \
    --demo='te' \
    --model=[path of Llama-2-7B-Chat-GPTQ] \
    --tokenizer=[path of Llama-2-7B-Chat-GPTQ] \
    --num-prompts 1 \
    --block-size=64 \
    --output-len=256 \
    --device=gcu \
    --dtype=float16 \
    --quantization=gptq \
    --gpu-memory-utilization=0.945 \
    --tensor-parallel-size=1

Performance test¶

python3 -m vllm_utils.benchmark_test \
    --perf \
    --model=[path of Llama-2-7B-Chat-GPTQ] \
    --tensor-parallel-size 1 \
    --max-model-len=2048 \
    --input-len=1024 \
    --output-len=1024 \
    --dtype=float16 \
    --device gcu \
    --num-prompts 1 \
    --block-size=64 \
    --quantization=gptq \
    --gpu-memory-utilization=0.945

Notes:

The max-model-len supported by this model is 2048.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Meta-Llama-3.1-70B-Instruct¶

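Inference and performance testing for this model require eight Enflame GCUs.

Model download¶

url: Meta-Llama-3.1-70B-Instruct
branch: master
commit id: b6444261

Download everything under the path given by the url above into the Meta-Llama-3.1-70B-Instruct folder.

Batch offline inference¶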

python3 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3.1-70B-Instruct] \
 --tensor-parallel-size=8 \
 --demo=te \
 --max-model-len=32768 \
 --dtype=bfloat16 \
 --device=gcu \
 --output-len=256

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3.1-70B-Instruct] \
 --tensor-parallel-size=8 \
 --max-model-len=32768 \
 --tokenizer=[path of Meta-Llama-3.1-70B-Instruct] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --dtype=bfloat16

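Notes:

The max-model-len supported by this model is 131072; running with 32768 requires 8 cards.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.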

llama3-70b-w4a16¶

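Inference and performance testing for this model require two Enflame GCUs.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download the Meta-Llama-3-70B_W4A16_GPTQ.tar file and extract it, then copy all of the archive's contents into the llama3-70b-w4a16 folder.
The llama3-70b-w4a16 directory structure is as follows: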

llama3-70b-w4a16/
  ├── config.json
  ├── model.safetensors
  ├── quantize_config.json
  ├── tokenizer_config.json
  ├── tokenizer.json
  └── tops_quantize_info.json

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
 --model=[path of llama3-70b-w4a16] \
 --tensor-parallel-size=2 \
 --max-model-len=8192 \
 --output-len=512 \
 --demo=te \
 --dtype=float16 \
 --quantization=gptq \
 --device=gcu

Performance test¶

python3 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama3-70b-w4a16] \
 --device=gcu \
 --max-model-len=8192 \
 --tokenizer=[path of llama3-70b-w4a16] \
 --input-len=2048 \
 --output-len=1024 \
 --num-prompts=1 \
 --tensor-parallel-size=2 \
 --block-size=64

Notes:

The max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

llama2-7b-w4a16c8¶

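Inference and performance testing for this model require one Enflame GCU.

Model download¶

url: Llama-2-7B-Chat-GPTQ
branch: main
commit id: d5ad9310836dd91b6ac6133e2e47f47394386cea

Download everything under the path given by the url above into the Llama-2-7B-Chat-w4a16c8 folder.
For the int8_kv_cache.json file, contact your sales representative to obtain EGC access, then copy the file into the Llama-2-7B-Chat-w4a16c8 folder.

Batch offline inference¶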

python3 -m vllm_utils.benchmark_test \
    --demo='te' \
    --model=[path of Llama-2-7B-Chat-w4a16c8] \
    --tokenizer=[path of Llama-2-7B-Chat-w4a16c8] \
    --output-len=128 \
    --device=gcu \
    --dtype=float16 \
    --quantization=gptq \
    --quantization-param-path=[path of int8_kv_cache.json] \
    --kv-cache-dtype=int8

Performance test¶

python3 -m vllm_utils.benchmark_test \
    --perf \
    --model=[path of Llama-2-7B-Chat-w4a16c8] \
    --tensor-parallel-size 1 \
    --max-model-len=4096 \
    --input-len=1024 \
    --output-len=1024 \
    --dtype=float16 \
    --device gcu \
    --num-prompts 1 \
    --block-size=64 \
    --quantization=gptq \
    --quantization-param-path=[path of int8_kv_cache.json] \
    --kv-cache-dtype=int8

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

llama2-70b-w4a16c8¶

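Inference and performance testing for this model require four Enflame GCUs.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download the llama2_70b_w4a16c8.tar file and extract it, then copy all of the archive's contents into the llama2_70b_w4a16c8 folder.
The llama2_70b_w4a16c8 directory structure is as follows: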

.
├── config.json
├── int8_kv_cache.json
├── model.safetensors
├── quantize_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
├── tokenizer.model
└── tops_quantize_info.json

Batch offline inference¶

python3 -m vllm_utils.benchmark_test \
    --demo='te' \
    --model=[path of llama2_70b_w4a16c8] \
    --tokenizer=[path of llama2_70b_w4a16c8] \
    --tensor-parallel-size=4 \
    --output-len=128 \
    --device=gcu \
    --dtype=float16 \
    --quantization=gptq \
    --quantization-param-path=[path of int8_kv_cache.json] \
    --kv-cache-dtype=int8

Performance test¶

python3 -m vllm_utils.benchmark_test \
    --perf \
    --model=[path of llama2_70b_w4a16c8] \
    --tensor-parallel-size 4 \
    --max-model-len=4096 \
    --input-len=1024 \
    --output-len=1024 \
    --dtype=float16 \
    --device gcu \
    --num-prompts 1 \
    --block-size=64 \
    --quantization=gptq \
    --quantization-param-path=[path of int8_kv_cache.json] \
    --kv-cache-dtype=int8

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.

Llama-2-13B-chat-GPTQ¶

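Inference and performance testing for this model require one Enflame GCU.

Model download¶

url: Llama-2-13B-chat-GPTQ
branch: main
commit id: ea078917a7e91c896787c73dba935f032ae658e9

Download everything under the path given by the url above into the Llama-2-13B-chat-GPTQ folder.

Batch offline inference¶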

python3 -m vllm_utils.benchmark_test \
    --demo='te' \
    --model=[path of Llama-2-13B-chat-GPTQ] \
    --tokenizer=[path of Llama-2-13B-chat-GPTQ] \
    --num-prompts 1 \
    --block-size=64 \
    --output-len=256 \
    --device=gcu \
    --dtype=float16 \
    --quantization=gptq \
    --gpu-memory-utilization=0.945 \
    --tensor-parallel-size=1

Performance test¶

python3 -m vllm_utils.benchmark_test \
    --perf \
    --model=[path of Llama-2-13B-chat-GPTQ] \
    --tensor-parallel-size 1 \
    --max-model-len=4096 \
    --input-len=512 \
    --output-len=512 \
    --dtype=float16 \
    --device gcu \
    --num-prompts 1 \
    --block-size=64 \
    --quantization=gptq \
    --gpu-memory-utilization=0.945

Notes:

The max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency in the output is the time_to_first_token_latency.