4.6. LLaMA Series¶

Introduction¶

LLaMA (Large Language Model Meta AI) is a family of open-source large language models developed by Meta. To date, four major versions have been released; the latest is Llama 4.

Meta-Llama-3.1-8B-Instruct¶

Inference and performance testing for this model require one Enflame GCU.

Model Download¶

url: Meta-Llama-3.1-8B-Instruct
branch: main
commit id: 8c22764

Download the contents at the URL above into a local folder named Meta-Llama-3.1-8B-Instruct.

requirements¶

python3 -m pip install transformers==4.48.2

export ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1

Notes:

Environment: Python 3.10; any transformers version >= 4.42.3 works.

The environment variable ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 enables the torch-gcu automatic migration feature.
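
The version requirements above can be checked before anything is launched. The following is a minimal sketch rather than part of the original guide: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the parser only understands plain dotted numeric versions such as 4.48.2 (any suffix like `.dev0` is ignored).

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn '4.48.2' into (4, 48, 2); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment(minimum_transformers="4.42.3"):
    """Return (python_ok, transformers_ok) for the requirements above."""
    python_ok = sys.version_info[:2] == (3, 10)
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return python_ok, False
    return python_ok, meets_minimum(installed, minimum_transformers)


if __name__ == "__main__":
    print(check_environment())
```

The tuple comparison is deliberately simple; a project that already depends on the `packaging` library could use its version parser instead.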

Online Inference Example¶

Start the server

python3 -m sglang.launch_server --model-path [ path of Meta-Llama-3.1-8B-Instruct ] --host 0.0.0.0 --port 8089  --dp-size 1 --tp-size 1 --trust-remote-code

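Notes:

--port: any port on the local machine that is not already in use.

--context-length: optionally sets the model's maximum context length in tokens, which bounds the prompt plus the generated output.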
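
Because --port accepts any unused port, it can help to ask the operating system for one before launching the server. This is a minimal sketch using only the standard library; `find_free_port` and `is_port_free` are hypothetical helper names, not part of sglang.

```python
import socket


def find_free_port(host="0.0.0.0"):
    """Ask the OS for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def is_port_free(port, host="0.0.0.0"):
    """Return True if the port can be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    print("8089 free:", is_port_free(8089))
    print("suggested port:", find_free_port())
```

The check is only a snapshot: another process could still claim the port between the check and the server start, so a failed launch should simply be retried with a different port.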

Send a request from the client

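import requests

# Chat completions endpoint of the server started above (port 8089).
url = "http://localhost:8089/v1/chat/completions"
data = {
    "model": "[ path of Meta-Llama-3.1-8B-Instruct ]",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

# Send the request, fail loudly on an HTTP error, then print the JSON body.
response = requests.post(url, json=data)
response.raise_for_status()
print(response.json())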
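
The printed JSON is a nested structure, and usually only the generated text is of interest. The sketch below assumes the response follows the OpenAI-style chat completions layout (`choices[0].message.content`); `extract_reply` is a hypothetical helper, and the sample payload is illustrative only, not real server output.

```python
def extract_reply(payload):
    """Return the assistant text from a chat-completions style response."""
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {payload}")
    return choices[0]["message"]["content"]


# Illustrative payload only; a real server returns additional fields.
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Paris."}}
    ]
}
print(extract_reply(sample))
```

With the client above, the same helper would be called as `extract_reply(response.json())`.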

Performance Testing¶

As in the online example, start the sglang server first.

Start the server

python3 -m sglang.launch_server --model-path [ path of Meta-Llama-3.1-8B-Instruct ] --host 0.0.0.0 --port 8089  --dp-size 1 --tp-size 1 --trust-remote-code

Run the benchmark

python3 -m sglang.bench_serving --backend sglang --dataset-path [ path of dataset ] --random-range-ratio 1.0  --num-prompts 1 --random-input-len 2048 --random-output-len 1024 --host 0.0.0.0 --port 8089

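Notes:

Use --dataset-path to specify where the dataset is stored; otherwise it is read from /tmp/ by default.

Dataset download address: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json

Use --num-prompts, --random-input-len, --random-output-len, and related parameters to customize the test size and the input and output lengths. See the official documentation and the script comments for the full parameter list and usage.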
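
When several test sizes are compared, it can be convenient to generate the benchmark command instead of editing it by hand. The sketch below only reuses the flags shown in this section; `build_bench_command` is a hypothetical helper, and `/path/to/dataset.json` is a placeholder path.

```python
import shlex


def build_bench_command(dataset_path, num_prompts=1, input_len=2048,
                        output_len=1024, host="0.0.0.0", port=8089):
    """Assemble the bench_serving argument list used in this section."""
    return [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--dataset-path", str(dataset_path),
        "--random-range-ratio", "1.0",
        "--num-prompts", str(num_prompts),
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--host", host,
        "--port", str(port),
    ]


# Print one command per request count; run them one at a time.
for count in (1, 4, 16):
    command = build_bench_command("/path/to/dataset.json", num_prompts=count)
    print(shlex.join(command))
```

Each list can also be passed directly to `subprocess.run(command, check=True)` once the server is up.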

Accuracy Verification¶

Install lm-eval

python3.10 -m pip install lm_eval[all]

Start the server

python3 -m sglang.launch_server --model [ path of Meta-Llama-3.1-8B-Instruct ]  --host 0.0.0.0 --port 30000  --dp-size 1 --tp-size 1 --trust-remote-code

Run the accuracy test

lm_eval --model local-completions --model_args model=[ path of Meta-Llama-3.1-8B-Instruct ],tokenizer=[ path of Meta-Llama-3.1-8B-Instruct ],base_url=http://0.0.0.0:30000/v1/completions,max_retries=3 --tasks mmlu_llama  --trust_remote_code  --gen_kwargs temperature=0 --seed 0 --batch_size 1 --num_fewshot 5 --limit 1000

View the results

The test results are printed to the terminal:

|      Groups      |Version|    Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|------------------|------:|-------------|------|-----------|---|-----:|---|-----:|
|mmlu_llama        |      1|strict_match |      |exact_match|↑  |0.6662|±  |0.0038|
| - humanities     |      1|strict_match |      |exact_match|↑  |0.6170|±  |0.0068|
| - other          |      1|strict_match |      |exact_match|↑  |0.7441|±  |0.0075|
| - social sciences|      1|strict_match |      |exact_match|↑  |0.7631|±  |0.0075|
| - stem           |      0|strict_match |      |exact_match|↑  |0.5683|±  |0.0084|
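
The Stderr column gives a sense of how much the reported score could move between runs. As a rough sketch, assuming a normal approximation (an assumption, not something the tool states here), an approximate 95% interval is the value plus or minus 1.96 standard errors; `confidence_interval` is a hypothetical helper.

```python
def confidence_interval(value, stderr, z=1.96):
    """Return the (low, high) bounds of value +/- z * stderr."""
    margin = z * stderr
    return value - margin, value + margin


# Overall mmlu_llama row from the table above.
low, high = confidence_interval(0.6662, 0.0038)
print(f"mmlu_llama: {low:.4f} to {high:.4f}")
```

A reproduction whose score falls well outside this range is worth investigating before comparing other numbers.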

LLaMA-2-7B¶

Inference and performance testing for this model require one Enflame GCU.

Model Download¶

url: Llama-2-7b
branch: main
commit id: 299e68d8

Download the contents at the URL above into a local folder named Llama-2-7b-chat-hf.

requirements¶

python3 -m pip install transformers==4.48.2

export ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1

Notes:

Environment: Python 3.10; any transformers version >= 4.32.0.dev0 works.

The environment variable ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION=1 enables the torch-gcu automatic migration feature.
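
If the server is started from a Python script rather than an interactive shell, the variable has to be present in the child process environment. The following is a small sketch under the assumption that the value "1" is what switches the feature on, as in the export line above; `migration_enabled` and `env_with_migration` are hypothetical helpers.

```python
import os

MIGRATION_VAR = "ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION"


def migration_enabled(env=None):
    """Return True if the auto-migration variable is set to '1'."""
    env = os.environ if env is None else env
    return env.get(MIGRATION_VAR) == "1"


def env_with_migration(env=None):
    """Copy an environment mapping and switch auto migration on."""
    merged = dict(os.environ if env is None else env)
    merged[MIGRATION_VAR] = "1"
    return merged


if __name__ == "__main__":
    print("auto migration enabled:", migration_enabled())
```

The copied mapping can be handed to `subprocess.Popen(..., env=env_with_migration())` so the current shell is left unchanged.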

Online Inference Example¶

Start the server

python3 -m sglang.launch_server --model-path [ path of Llama-2-7b-chat-hf ] --host 0.0.0.0 --port 8089  --dp-size 1 --tp-size 1 --trust-remote-code

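Notes:

--port: any port on the local machine that is not already in use.

--context-length: optionally sets the model's maximum context length in tokens, which bounds the prompt plus the generated output.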
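
Loading the model takes a while, so a client script usually has to wait before sending its first request. The sketch below polls the TCP port with the standard library only; `wait_for_port` is a hypothetical helper, and an open port is treated as a rough readiness signal, which is an assumption (the server may accept connections slightly before it can serve requests).

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


if __name__ == "__main__":
    ready = wait_for_port("localhost", 8089, timeout=5.0)
    print("server reachable:", ready)
```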

Send a request from the client

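import requests

# Chat completions endpoint of the server started above (port 8089).
url = "http://localhost:8089/v1/chat/completions"
data = {
    "model": "[ path of Llama-2-7b-chat-hf ]",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

# Send the request, fail loudly on an HTTP error, then print the JSON body.
response = requests.post(url, json=data)
response.raise_for_status()
print(response.json())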
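
The request body can carry more than a single user message. As a hedged sketch, assuming the server accepts the common OpenAI-style fields `max_tokens` and `temperature` and a `system` role message, a small builder keeps the payload consistent across requests; `build_chat_payload` is a hypothetical helper.

```python
import json


def build_chat_payload(model, user_text, system_text=None,
                       max_tokens=64, temperature=0.0):
    """Build a chat-completions request body with an optional system prompt."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


payload = build_chat_payload(
    "[ path of Llama-2-7b-chat-hf ]",
    "What is the capital of France?",
    system_text="Answer in one short sentence.",
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary is passed as `json=payload` to `requests.post`, exactly like `data` in the client above.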

Performance Testing¶

As in the online example, start the sglang server first.

Start the server

python3 -m sglang.launch_server --model-path [ path of Llama-2-7b-chat-hf ] --host 0.0.0.0 --port 8089  --dp-size 1 --tp-size 1 --trust-remote-code

Run the benchmark

python3 -m sglang.bench_serving --backend sglang --dataset-path [ path of dataset ] --random-range-ratio 1.0  --num-prompts 1 --random-input-len 512 --random-output-len 240 --host 0.0.0.0 --port 8089

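Notes:

Use --dataset-path to specify where the dataset is stored; otherwise it is read from /tmp/ by default.

Dataset download address: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json

Use --num-prompts, --random-input-len, --random-output-len, and related parameters to customize the test size and the input and output lengths. See the official documentation and the script comments for the full parameter list and usage.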
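
A truncated or mislocated dataset file is a common reason for a benchmark run to fail early, so a quick sanity check before launching can save time. The sketch below assumes the ShareGPT file is a JSON list whose records carry a `conversations` list, and that records with fewer than two turns are not useful; both points are assumptions, and `count_usable_entries` is a hypothetical helper.

```python
import json
from pathlib import Path


def count_usable_entries(path, min_turns=2):
    """Count records that have at least min_turns conversation turns."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of conversation records")
    return sum(
        1 for record in records
        if isinstance(record, dict)
        and len(record.get("conversations", [])) >= min_turns
    )
```

Calling `count_usable_entries("[ path of dataset ]")` should return a number at least as large as the value passed to --num-prompts.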

Accuracy Verification¶

Install lm-eval

python3.10 -m pip install lm_eval[all]

Start the server

python3 -m sglang.launch_server --model [ path of Llama-2-7b-chat-hf ]  --host 0.0.0.0 --port 30000  --dp-size 1 --tp-size 1 --trust-remote-code

Run the accuracy test

lm_eval --model local-completions --model_args model=[ path of Llama-2-7b-chat-hf ],tokenizer=[ path of Llama-2-7b-chat-hf ],base_url=http://0.0.0.0:30000/v1/completions,max_retries=3 --tasks mmlu_llama  --trust_remote_code  --gen_kwargs temperature=0 --seed 0 --batch_size 1 --num_fewshot 5 --limit 1000

View the results

The test results are printed to the terminal:

|      Groups      |Version|    Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|------------------|------:|-------------|------|-----------|---|-----:|---|-----:|
|mmlu_llama        |      1|strict_match |      |exact_match|↑  |0.4389|±  |0.0041|
| - humanities     |      1|strict_match |      |exact_match|↑  |0.4089|±  |0.0070|
| - other          |      1|strict_match |      |exact_match|↑  |0.4976|±  |0.0088|
| - social sciences|      1|strict_match |      |exact_match|↑  |0.4946|±  |0.0089|
| - stem           |      0|strict_match |      |exact_match|↑  |0.3714|±  |0.0085|
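
To compare runs programmatically, the terminal table can be copied into a string and parsed. This is a minimal sketch that assumes the nine-column layout shown above; `parse_results_table` is a hypothetical helper and is not part of lm-eval.

```python
def parse_results_table(text):
    """Map each group name to its (value, stderr) pair."""
    results = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 9:
            continue
        try:
            value, stderr = float(cells[6]), float(cells[8])
        except ValueError:
            continue  # header and separator rows are not numeric
        results[cells[0].lstrip("- ").strip()] = (value, stderr)
    return results


table = """
|      Groups      |Version|    Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|------------------|------:|-------------|------|-----------|---|-----:|---|-----:|
|mmlu_llama        |      1|strict_match |      |exact_match|↑  |0.4389|±  |0.0041|
| - stem           |      0|strict_match |      |exact_match|↑  |0.3714|±  |0.0085|
"""
for group, (value, stderr) in parse_results_table(table).items():
    print(group, value, stderr)
```

Rows that do not have nine cells or a numeric Value column are skipped, so surrounding log lines in the pasted text are ignored.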