3.13. llama¶

llama-65b¶

Inference and performance testing for this model require four Enflame GCUs.

Model download¶

url: llama-65b
branch: llama_v1
commit id: 57b0eb62de0636e75af471e49e2f1862d908d9d8

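Follow the download instructions to fetch the llama-65b model, and place all of the downloaded content in the llama-65b folder.
Then use convert_llama_weights_to_hf.py to convert the downloaded model files to the Hugging Face transformers format, and store all of the converted output in the llama-65b-hf folder.

Batch offline inference¶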

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of llama-65b-hf] \
 --demo=te \
 --tensor-parallel-size=4 \
 --dtype=float16 \
 --output-len=256

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama-65b-hf] \
 --tensor-parallel-size=4 \
 --max-model-len=2048 \
 --tokenizer=[path of llama-65b-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=8 \
 --block-size=64 \
 --dtype=float16

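Notes:

The maximum max-model-len supported by this model is 2048.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.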
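
The last note suggests a simple way to separate prefill cost from decode cost: measure once with output-len set to 1 and once with a longer output, then divide the difference by the number of extra tokens. The sketch below is only an illustration. It assumes the reported latency is an end-to-end time for one run in seconds; the function name and the sample numbers are made up for the example and are not output from benchmark_test.

```python
def decode_latency_per_token(total_latency, ttft, output_len):
    """Average time per generated token after the first one.

    total_latency: latency measured with --output-len=output_len
    ttft: latency measured with --output-len=1 (time to first token)
    """
    if output_len < 2:
        raise ValueError("output_len must be at least 2")
    if total_latency < ttft:
        raise ValueError("total_latency cannot be smaller than ttft")
    # The first token is covered by ttft, so output_len - 1 tokens remain.
    return (total_latency - ttft) / (output_len - 1)


# Hypothetical measurements, in seconds.
per_token = decode_latency_per_token(total_latency=4.0, ttft=0.19,
                                     output_len=128)
print(f"{per_token * 1000:.1f} ms per decoded token")
```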

llama2-7b¶

Model download¶

url: llama2-7b
branch: main
commit id: 3f025b

Download everything under the url above into the llama-2-7b-hf folder.

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of llama-2-7b-hf] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --gpu-memory-utilization=0.945
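
The command-line flags above appear to correspond to vLLM engine arguments of the same name. As a rough sketch, and not the benchmark_test implementation, a similar run could be written against vLLM's Python API as shown below. The helper names build_engine_kwargs and run_offline are hypothetical, device="gcu" is taken from the OpenCompass config later in this section, and the mapping from flags to keyword arguments is an assumption.

```python
def build_engine_kwargs(model_path, tensor_parallel_size=1, dtype="float16",
                        max_model_len=None, gpu_memory_utilization=None,
                        quantization=None):
    """Collect keyword arguments for vllm.LLM, skipping unset options."""
    kwargs = {
        "model": model_path,
        "tensor_parallel_size": tensor_parallel_size,
        "dtype": dtype,
        "device": "gcu",  # assumption: same value as the OpenCompass config
    }
    optional = {
        "max_model_len": max_model_len,
        "gpu_memory_utilization": gpu_memory_utilization,
        "quantization": quantization,
    }
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


def run_offline(prompts, output_len=256, **engine_kwargs):
    """Generate greedy completions for a list of prompt strings."""
    # Imported lazily so build_engine_kwargs works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs)
    params = SamplingParams(temperature=0, max_tokens=output_len)
    return [out.outputs[0].text for out in llm.generate(prompts, params)]


# Usage on a machine with the GCU build of vLLM installed:
#   kwargs = build_engine_kwargs("/path/to/llama-2-7b-hf",
#                                gpu_memory_utilization=0.945)
#   print(run_offline(["Hello, my name is"], **kwargs))
print(build_engine_kwargs("/path/to/llama-2-7b-hf",
                          gpu_memory_utilization=0.945))
```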

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama-2-7b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of llama-2-7b-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --dtype=float16 \
 --gpu-memory-utilization=0.945

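Notes:

The maximum max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.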

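MMLU evaluation with OpenCompass¶

Install OpenCompass

Follow the OpenCompass installation steps.

Note: OpenCompass 0.2.1 is recommended. If installing the dependencies pulls in package versions that are inconsistent with torch_gcu, reinstall them manually.

Prepare the config file

Save the following configuration as a Python file at configs/models/llama/vllm_llama2_7b.py inside the OpenCompass directory.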

from opencompass.models import VLLM
 
models = [
    dict(
        type=VLLM,
        abbr='llama2-7b-vllm',
        path='/path/to/Llama-2-7b-hf/',
        max_out_len=1,
        max_seq_len=4096,
        batch_size=32,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=0, num_procs=1),
        model_kwargs=dict(device='gcu',
                          gpu_memory_utilization=0.7,
                          enforce_eager=True)
    )
]

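Modify opencompass/models/vllm.py in OpenCompass and add the following get_ppl method. The method relies on torch, numpy (as np), List, Optional, and SamplingParams being imported in that module, so add any import that is missing.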

    def get_ppl(self,
                inputs: List[str],
                mask_length: Optional[List[int]] = None) -> List[float]:
        assert mask_length is None, 'mask_length is not supported'
        bsz = len(inputs)

        # tokenize
        prompt_tokens = [self.tokenizer(x, truncation=True,
                                        add_special_tokens=False,
                                        max_length=self.max_seq_len - 1
                                        )['input_ids'] for x in inputs]
        max_prompt_size = max([len(t) for t in prompt_tokens])
        total_len = min(self.max_seq_len, max_prompt_size)
        tokens = torch.zeros((bsz, total_len)).long()
        for k, t in enumerate(prompt_tokens):
            num_token = min(total_len, len(t))
            tokens[k, :num_token] = torch.tensor(t[-num_token:]).long()
        # forward
        generation_kwargs = {}
        generation_kwargs.update(self.generation_kwargs)
        global ce_loss, bz_idx
        ce_loss = []
        bz_idx = 0

        def logits_hook(logits):
            global ce_loss, bz_idx
            # compute ppl
            shift_logits = logits[..., :-1, :].contiguous().float()
            shift_labels = tokens[bz_idx:bz_idx + logits.shape[0],
                                  1:logits.shape[1]
                                  ].contiguous().to(logits.device)
            bz_idx += logits.shape[0]
            shift_logits = shift_logits.view(-1, shift_logits.size(-1))
            shift_labels = shift_labels.view(-1)
            loss_fct = torch.nn.CrossEntropyLoss(
                reduction='none', ignore_index=0)
            loss = loss_fct(shift_logits, shift_labels).view(
                logits.shape[0], -1)
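            # Count non-padding labels per sequence; shift_labels was
            # flattened above, so restore the batch dimension first.
            lens = (shift_labels.view(logits.shape[0], -1) != 0
                    ).sum(-1).cpu().numpy()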
            ce_loss.append(loss.sum(-1).cpu().detach().numpy() / lens)
            return logits

        generation_kwargs['full_logits_processors'] = [logits_hook]
        generation_kwargs['max_tokens'] = 1
        sampling_kwargs = SamplingParams(**generation_kwargs)
        outputs = self.model.generate(
            None, sampling_kwargs, prompt_tokens, use_tqdm=False)
        ce_loss = np.concatenate(ce_loss)
        return ce_loss
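
To make the shift inside logits_hook easier to follow, here is a dependency-free sketch of the same quantity for a single sequence: the logits at position t are scored against the token at position t + 1, padding id 0 is ignored, and the sum is divided by the number of scored tokens. It is a simplified illustration in plain Python, not the code that runs on the device; the function names and the toy inputs are made up for the example.

```python
import math

PAD_ID = 0  # same padding id that get_ppl ignores via ignore_index=0


def log_softmax(row):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def sequence_loss(logits, tokens):
    """Mean next-token cross-entropy for one sequence.

    logits[t] holds the scores produced after reading tokens[0..t],
    so it is compared with tokens[t + 1] (the "shift").
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], tokens[1:]):
        if target == PAD_ID:
            continue  # padded positions do not contribute
        total -= log_softmax(step_logits)[target]
        count += 1
    return total / count if count else float("nan")


# Toy example: vocabulary of 3 ids (0 is padding), sequence [1, 2, 1, pad].
tokens = [1, 2, 1, 0]
uniform = [[0.0, 0.0, 0.0]] * 4
print(sequence_loss(uniform, tokens))  # uniform scores give a loss of ln(3)
```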

Run the following command:

python3 run.py \
 --models=vllm_llama2_7b \
 --datasets=mmlu_ppl \
 --max-partition-size=10000000
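
With a PPL-style dataset such as mmlu_ppl, each candidate answer is filled into the prompt, scored with get_ppl, and the candidate with the lowest loss is taken as the prediction. The snippet below is a simplified illustration of that selection step only, not OpenCompass's own implementation; the function name and the loss values are hypothetical.

```python
def pick_answer(losses_by_option):
    """Return the option whose filled-in prompt has the lowest loss."""
    if not losses_by_option:
        raise ValueError("no candidate options given")
    return min(losses_by_option, key=losses_by_option.get)


# Hypothetical per-option losses for one multiple-choice question.
losses = {"A": 2.31, "B": 1.87, "C": 2.95, "D": 2.40}
print(pick_answer(losses))
```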

llama2-13b¶

Model download¶

url: llama2-13b
branch: main
commit id: 638c8be

Download everything under the url above into the llama-2-13b-hf folder.

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of llama-2-13b-hf] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --gpu-memory-utilization=0.945

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama-2-13b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of llama-2-13b-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --gpu-memory-utilization=0.945 \
 --dtype=float16

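Notes:

The maximum max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.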

llama2-70b¶

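Inference and performance testing for this model require four Enflame GCUs.

Model download¶

url: llama2-70b
branch: main
commit id: 6aa89cf

Download everything under the url above into the llama-2-70b-hf folder.

Batch offline inference¶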

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of llama-2-70b-hf] \
 --tensor-parallel-size=4 \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --gpu-memory-utilization=0.945

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama-2-70b-hf] \
 --tensor-parallel-size=4 \
 --max-model-len=4096 \
 --tokenizer=[path of llama-2-70b-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=8 \
 --block-size=64 \
 --gpu-memory-utilization=0.945 \
 --dtype=float16

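Notes:

The maximum max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.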

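MMLU evaluation with OpenCompass¶

Install OpenCompass

Follow the OpenCompass installation steps.

Note: OpenCompass 0.2.1 is recommended. If installing the dependencies pulls in package versions that are inconsistent with torch_gcu, reinstall them manually.

Prepare the config file

Save the following configuration as a Python file at configs/models/llama/vllm_llama2_70b.py inside the OpenCompass directory.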

from opencompass.models import VLLM
 
models = [
    dict(
        type=VLLM,
        abbr='llama2-70b-vllm',
        path='/path/to/Llama2-70b',
        max_out_len=100,
        max_seq_len=2048,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=0, num_procs=1),
        model_kwargs=dict(device='gcu',
                          tensor_parallel_size=4,
                          enforce_eager=True)
    )
]

Run the following commands:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 run.py \
 --models=vllm_llama2_70b \
 --datasets=mmlu_gen \
 --max-partition-size=10000000

chinese-llama-2-7b¶

Model download¶

url: chinese-llama-2-7b
branch: main
commit id: c40cf9a

Download everything under the url above into the chinese-llama-2-7b-hf folder.

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of chinese-llama-2-7b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-7b-hf] \
 --dtype=float16 \
 --demo=tc  \
 --output-len=256

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of chinese-llama-2-7b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-7b-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=4 \
 --block-size=64 \
 --dtype=float16

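Notes:

The maximum max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.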

chinese-llama-2-7b-16k¶

Model download¶

url: chinese-llama-2-7b-16k
branch: main
commit id: c934a79

Download everything under the url above into the chinese-llama-2-7b-16k-hf folder.

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of chinese-llama-2-7b-16k-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-7b-16k-hf] \
 --dtype=float16 \
 --demo=tc \
 --output-len=256

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of chinese-llama-2-7b-16k-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-7b-16k-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=4 \
 --block-size=64 \
 --dtype=float16

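Notes:

The maximum max-model-len supported by this model is 16384.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.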
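
The commands above pass --max-model-len=4096 even though this model supports up to 16384, so longer prompts need a larger value. The check below is a small sketch of that constraint. It assumes benchmark_test follows the usual vLLM rule that prompt tokens plus generated tokens must fit within max-model-len; the function name and the sample lengths are hypothetical.

```python
def check_lengths(input_len, output_len, max_model_len):
    """Return True if a request of this shape fits in the context window."""
    if input_len < 1 or output_len < 1:
        raise ValueError("input_len and output_len must be positive")
    return input_len + output_len <= max_model_len


# The commands above use 4096; the model itself supports up to 16384.
for max_len in (4096, 16384):
    ok = check_lengths(input_len=8000, output_len=128, max_model_len=max_len)
    print(f"max-model-len={max_len}: {'fits' if ok else 'too long'}")
```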

chinese-llama-2-13b¶

Model download¶

url: chinese-llama-2-13b
branch: main
commit id: 043f8d2

Download everything under the url above into the chinese-llama-2-13b-hf folder.

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of chinese-llama-2-13b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-13b-hf] \
 --dtype=float16 \
 --demo=tc  \
 --output-len=256

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of chinese-llama-2-13b-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-13b-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=2 \
 --block-size=64 \
 --dtype=float16

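Notes:

The maximum max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.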

chinese-llama-2-13b-16k¶

Model download¶

url: chinese-llama-2-13b-16k
branch: main
commit id: 1c90d65

Download everything under the url above into the chinese-llama-2-13b-16k-hf folder.

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of chinese-llama-2-13b-16k-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-13b-16k-hf] \
 --dtype=float16 \
 --demo=tc  \
 --output-len=256

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of chinese-llama-2-13b-16k-hf] \
 --max-model-len=4096 \
 --tokenizer=[path of chinese-llama-2-13b-16k-hf] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=2 \
 --block-size=64 \
 --dtype=float16

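Notes:

The maximum max-model-len supported by this model is 16384.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.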

Meta-Llama-3-8B¶

Model download¶

url: Meta-Llama-3-8B
branch: master
commit id: e4260355

Download everything under the url above into the Meta-Llama-3-8B folder.

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3-8B] \
 --demo=te \
 --dtype=float16 \
 --output-len=256

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3-8B] \
 --max-model-len=8192 \
 --tokenizer=[path of Meta-Llama-3-8B] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --dtype=float16

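Notes:

The maximum max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.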

Meta-Llama-3-70B¶

Model download¶

url: Meta-Llama-3-70B
branch: master
commit id: 0061f2a0

Download everything under the url above into the Meta-Llama-3-70B folder.

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of Meta-Llama-3-70B] \
 --tensor-parallel-size=4 \
 --demo=te \
 --dtype=float16 \
 --output-len=256

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of Meta-Llama-3-70B] \
 --tensor-parallel-size=4 \
 --max-model-len=8192 \
 --tokenizer=[path of Meta-Llama-3-70B] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --dtype=float16

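Notes:

The maximum max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.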

llama2-7b-w8a16¶

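Inference and performance testing for this model require one Enflame GCU.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download and extract llama2-7b-w8a16.tar, then copy all of its contents into the llama2-7b-w8a16 folder.
The llama2-7b-w8a16 directory structure is as follows: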

llama2-7b-w8a16/
├── config.json
├── generation_config.json
├── model.safetensors
├── quantize_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
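
Before launching inference it can be useful to confirm that the extracted folder matches the listing above. The following is a minimal sketch for that check; the file names are copied from the listing, while the function name and the relative folder path are assumptions made for the example.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "quantize_config.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "tokenizer.model",
]


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# Prints an empty list when the folder is complete.
print(missing_files("llama2-7b-w8a16"))
```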

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of llama2-7b-w8a16] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --quantization w8a16

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama2-7b-w8a16] \
 --max-model-len=4096 \
 --tokenizer=[path of llama2-7b-w8a16] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16

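Notes:

The maximum max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.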

llama2-13b-w8a16¶

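Inference and performance testing for this model require one Enflame GCU.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download and extract llama2-13b-w8a16.tar, then copy all of its contents into the llama2-13b-w8a16 folder.
The llama2-13b-w8a16 directory structure is as follows: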

llama2-13b-w8a16/
├── config.json
├── generation_config.json
├── model.safetensors
├── quantize_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of llama2-13b-w8a16] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --quantization w8a16

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama2-13b-w8a16] \
 --max-model-len=4096 \
 --tokenizer=[path of llama2-13b-w8a16] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16

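Notes:

The maximum max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.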

llama2-70b-w8a16¶

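Inference and performance testing for this model require two Enflame GCUs.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download and extract llama2-70b-w8a16.tar, then copy all of its contents into the llama2-70b-w8a16 folder.
The llama2-70b-w8a16 directory structure is as follows: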

llama2-70b-w8a16/
├── config.json
├── generation_config.json
├── model.safetensors
├── quantize_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of llama2-70b-w8a16] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --quantization w8a16 \
 --tensor-parallel-size=2

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama2-70b-w8a16] \
 --tensor-parallel-size=2 \
 --max-model-len=4096 \
 --tokenizer=[path of llama2-70b-w8a16] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16

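Notes:

The maximum max-model-len supported by this model is 4096.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.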
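
As a rough illustration of why this w8a16 variant runs with tensor-parallel-size=2 while the float16 llama2-70b section uses 4, the arithmetic below compares the weight footprint per card. It is a back-of-the-envelope estimate under stated assumptions: w8a16 is taken to mean 8-bit weights with 16-bit activations, the parameter count is the nominal 70 billion, weights are split evenly across cards, and quantization scales, the KV cache, and activations are ignored. The function name is hypothetical.

```python
def weight_gib_per_card(num_params, bytes_per_weight, tensor_parallel_size):
    """Approximate weight memory per card in GiB, weights only."""
    total_bytes = num_params * bytes_per_weight
    return total_bytes / tensor_parallel_size / 2**30


PARAMS_70B = 70e9  # nominal parameter count, an approximation

fp16 = weight_gib_per_card(PARAMS_70B, bytes_per_weight=2,
                           tensor_parallel_size=4)
w8a16 = weight_gib_per_card(PARAMS_70B, bytes_per_weight=1,
                            tensor_parallel_size=2)
print(f"float16, 4 cards: {fp16:.1f} GiB per card")
print(f"w8a16,   2 cards: {w8a16:.1f} GiB per card")
```

Under these assumptions the two configurations place the same amount of weight data on each card, which is consistent with halving the card count when the weights are stored in 8 bits.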

llama3-8b-w8a16¶

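Inference and performance testing for this model require one Enflame GCU.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download and extract llama3-8b-w8a16.tar, then copy all of its contents into the llama3-8b-w8a16 folder.
The llama3-8b-w8a16 directory structure is as follows: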

llama3-8b-w8a16/
├── config.json
├── model.safetensors
├── quantize_config.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of llama3-8b-w8a16] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --quantization w8a16

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama3-8b-w8a16] \
 --max-model-len=8192 \
 --tokenizer=[path of llama3-8b-w8a16] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16

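Notes:

The maximum max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.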

llama3-70b-w8a16¶

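Inference and performance testing for this model require two Enflame GCUs.

Model download¶

To download the weights, contact your sales representative to obtain EGC access.

Download and extract llama3-70b-w8a16.tar, then copy all of its contents into the llama3-70b-w8a16 folder.
The llama3-70b-w8a16 directory structure is as follows: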

llama3-70b-w8a16/
├── config.json
├── model.safetensors
├── quantize_config.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

Batch offline inference¶

python3.8 -m vllm_utils.benchmark_test \
 --model=[path of llama3-70b-w8a16] \
 --demo=te \
 --dtype=float16 \
 --output-len=256 \
 --quantization w8a16 \
 --tensor-parallel-size=2 \
 --gpu-memory-utilization=0.945

Performance test¶

python3.8 -m vllm_utils.benchmark_test --perf \
 --model=[path of llama3-70b-w8a16] \
 --tensor-parallel-size=2 \
 --max-model-len=8192 \
 --tokenizer=[path of llama3-70b-w8a16] \
 --input-len=128 \
 --output-len=128 \
 --num-prompts=64 \
 --block-size=64 \
 --dtype=float16 \
 --quantization w8a16 \
 --gpu-memory-utilization=0.945

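Notes:

The maximum max-model-len supported by this model is 8192.
input-len, output-len, and num-prompts can be adjusted as needed.
When output-len is set to 1, the latency reported in the output is the time_to_first_token_latency.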