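1. Version History
Version | Changes | Date |
---|---|---|
v1.0 | Initial release | 11/30/2022 |
v1.1 | Format adjustments | 12/01/2022 |
v1.2 | Updated some formatting and content | 4/8/2024 |
v1.3 | Updated some formatting | 4/17/2024 |
v1.4 | Updated some content | 5/17/2024 |
v1.5 | Updated the TKE usage content | 5/22/2024 |
v1.6 | Updated the configuration file instructions | 5/28/2024 |
v1.7 | Updated formatting | 6/24/2024 |
v1.8 | Updated prerequisites | 7/9/2024 |
v1.9 | Updated formatting | 7/17/2024 |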
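2. Introduction
GCU Feature Discovery (GFD) is a component deployed on a Kubernetes cluster. It labels GCU nodes with attributes of their GCU devices, such as the GCU driver version and the GCU memory size. Most of these labels start with "enflame.com". Their main purpose is to make it easy to schedule later workloads onto specific nodes by label.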
3. GFD Deployment
3.1. Prerequisites
Docker or containerd installed (containerd for Kubernetes version >= 1.24)
Kubernetes cluster version higher than 1.9
GCU Driver
Enflame Container Toolkit
Enflame K8s Device Plugin
Node Feature Discovery
3.2. Package Contents
The GFD directory contains the following:
gcu-feature-discovery_<VERSION>/
├── all-in-one
├── build-image.sh
├── config
├── delete.sh
├── deploy.sh
├── docker
├── gcu-feature-discovery
├── README.md
├── show-labels.sh
└── yaml
all-in-one: installation package that combines k8s-device-plugin, GFD, and NFD in a single DaemonSet;
build-image.sh: image build script for gcu-feature-discovery;
config: directory containing the gcu-feature-discovery configuration file;
delete.sh: script that deletes the gcu-feature-discovery DaemonSet;
deploy.sh: script that deploys the gcu-feature-discovery DaemonSet;
docker: directory containing the gcu-feature-discovery Dockerfile;
gcu-feature-discovery: the gcu-feature-discovery binary;
README.md: a brief README for gcu-feature-discovery;
yaml: directory containing the gcu-feature-discovery YAML files.
3.3. Labels Provided
The labels that GFD provides start with enflame.com. The main labels are:
Label Name | Meaning | Example |
---|---|---|
enflame.com/gfd.timestamp | Timestamp of the GFD deployment (optional) | 2023-03-02-02-59-47 |
enflame.com/gcu.count | Number of GCUs | 8 |
enflame.com/gcu.driverVer | Driver version of the GCU | 1.0.1.3 |
enflame.com/gcu.machine | Machine type | NF5468M5 |
enflame.com/gcu.memory | Memory of the GCU in MB | 16384 |
enflame.com/gcu.model | Model of the GCU | T10 |
enflame.com/gcu.family | Family of the GCU | aaaa |
enflame.com/gcu.product | Product of the GCU | xxxxx |
If the node has VGCU devices, GFD provides the following main labels:
Label Name | Meaning | Example |
---|---|---|
enflame.com/gfd.timestamp | Timestamp of the GFD deployment (optional) | 2023-03-02-02-59-47 |
enflame.com/vgcu.present | Whether there is VGCU device (optional) | true |
enflame.com/vgcu.count | Number of VGCUs | 4 |
enflame.com/vgcu.driverVer | Driver version of the VGCU | 1.0.1.3 |
enflame.com/vgcu.machine | Machine type | NF5468M5 |
enflame.com/vgcu.memory | Memory of the VGCU in MB | 4096 |
enflame.com/vgcu.model | Model of the VGCU | T10 |
enflame.com/vgcu.family | Family of the VGCU | aaaa |
enflame.com/vgcu.product | Product of the VGCU | xxxxx |
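Label values are always strings, so a consumer typically converts them before comparing or filtering. The following is a minimal sketch, not part of GFD, of reading the labels listed in the tables above into a typed record. The names `GcuInfo` and `parse_gcu_labels` are hypothetical, and the input is assumed to be a plain mapping of label keys to values.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "enflame.com/"


@dataclass
class GcuInfo:
    count: int
    memory_mb: int
    driver_version: str
    model: str


def parse_gcu_labels(labels: Mapping[str, str], kind: str = "gcu") -> Optional[GcuInfo]:
    """Build a GcuInfo from node labels; kind is "gcu" or "vgcu"."""
    base = f"{PREFIX}{kind}."
    if base + "count" not in labels:
        return None  # the node carries no labels of this kind
    return GcuInfo(
        count=int(labels[base + "count"]),
        memory_mb=int(labels.get(base + "memory", "0")),
        driver_version=labels.get(base + "driverVer", ""),
        model=labels.get(base + "model", ""),
    )


# Example values taken from the GCU table.
example = {
    "enflame.com/gcu.count": "8",
    "enflame.com/gcu.driverVer": "1.0.1.3",
    "enflame.com/gcu.memory": "16384",
    "enflame.com/gcu.model": "T10",
}
info = parse_gcu_labels(example)
print(info)
```

Calling `parse_gcu_labels(example, "vgcu")` returns `None` here because the sample node has no `enflame.com/vgcu.*` labels.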
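3.4. Configuring and Deploying NFD
Note: for installing and using NFD, see the "Node Feature Discovery User Manual".
1) Dependencies
GFD relies on NFD to apply labels to nodes, so make sure NFD is installed before you install GFD;
GFD also depends on libefml.so. Before installing GFD, check that libefml exists on the host and is usable:
# ll /usr/lib/libefml.so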
lrwxrwxrwx 1 root root 50 May 17 16:56 /usr/lib/libefml.so -> \
/usr/local/efsmi/efsmi-<VERSION>/lib/libefml.so.1.0.0*
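2) Configuring the label namespace in NFD
The --extra-label-ns=xxx argument in the NFD YAML filters the labels that GFD provides. For example, to let labels that start with enflame.com appear on the node, set --extra-label-ns of nfd-master to --extra-label-ns=enflame.com.
To allow labels in both the enflame.com namespace and the tke.cloud.tencent.com namespace, separate the two namespaces with a comma, as follows: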
.................
          image: artifact.enflame.cn/enflame_docker_images/enflame/node-feature-discovery:v0.11.3
          name: nfd-master
          command:
            - "nfd-master"
          args:
            # - "--extra-label-ns=enflame.com"
            - "--extra-label-ns=enflame.com,tke.cloud.tencent.com"
        - env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
..................
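As an illustration of the namespace filter described above, the sketch below builds the comma-separated argument and models the filter as an exact match on the part of the label key before the first "/". This is a simplified model for explanation only, not NFD's actual implementation, and the function names are hypothetical.

```python
from typing import Iterable, List


def extra_label_ns_arg(namespaces: Iterable[str]) -> str:
    """Join namespaces into the nfd-master argument, comma separated."""
    return "--extra-label-ns=" + ",".join(namespaces)


def label_namespace(label_key: str) -> str:
    """Return the part of a label key before the first '/'."""
    return label_key.split("/", 1)[0]


def is_allowed(label_key: str, namespaces: List[str]) -> bool:
    """Simplified model: a label passes if its namespace is listed exactly."""
    return label_namespace(label_key) in namespaces


namespaces = ["enflame.com", "tke.cloud.tencent.com"]
print(extra_label_ns_arg(namespaces))
print(is_allowed("enflame.com/gcu.count", namespaces))
print(is_allowed("example.com/some-label", namespaces))
```

Under this model, a label such as `example.com/some-label` (a made-up key) is dropped because its namespace is not in the list.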
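3.5. GFD Configuration File
The default GFD configuration file, gfd.json, is in the config/ directory of the installation package and is copied into the image at build time. When GFD is deployed, if gfd.json exists in the /etc/topscloud directory on the host, GFD runs with /etc/topscloud/gfd.json as its configuration file. Otherwise, GFD syncs the initial configuration file from the image to the /etc/topscloud directory on the host and then runs with /etc/topscloud/gfd.json as its configuration file.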
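The lookup order described above can be sketched as follows. This is a minimal illustration of the described behavior, not GFD's source code. `resolve_config` is a hypothetical name, and temporary directories stand in for the image and for /etc/topscloud so the example can run anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_config(host_dir: Path, image_default: Path) -> Path:
    """Return the host config path, seeding it from the image copy if absent."""
    host_config = host_dir / "gfd.json"
    if not host_config.exists():
        host_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(image_default, host_config)
    return host_config


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    image_default = root / "image" / "gfd.json"      # stands in for the copy in the image
    image_default.parent.mkdir()
    image_default.write_text('{"labels": {}}')
    host_dir = root / "etc" / "topscloud"            # stands in for /etc/topscloud

    first = resolve_config(host_dir, image_default)  # host file absent: seeded from image
    seeded = first.read_text()

    first.write_text('{"labels": {"enflame.com/gcu.vendor": "1e36"}}')
    kept = resolve_config(host_dir, image_default).read_text()  # host file present: kept

print(seeded)
print(kept)
```

The second call leaves the edited host file untouched, which matches the rule that an existing /etc/topscloud/gfd.json takes precedence over the copy in the image.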
In the default configuration file config/gfd.json, the label values are empty, as shown below:
{
  "labels": {
    "enflame.com/gcu.family": "",
    "enflame.com/gcu.model": "",
    "enflame.com/gcu.product": "",
    "tke.cloud.tencent.com/gpu.family": "",
    "tke.cloud.tencent.com/gpu.model": "",
    "tke.cloud.tencent.com/gpu.product": ""
  }
}
To set label values manually, edit the values, or the keys, in gfd.json, for example:
{
  "labels": {
    "enflame.com/gcu.family": "",
    "enflame.com/gcu.model": "",
    "enflame.com/gcu.product": "",
    "enflame.com/gcu.vendor": "1e36",
    "tke.cloud.tencent.com/gpu.family": "",
    "tke.cloud.tencent.com/gpu.model": "",
    "tke.cloud.tencent.com/gpu.product": "ABCD"
  }
}
With the configuration above, GFD applies the label "tke.cloud.tencent.com/gpu.product": "ABCD" to the node.
If you do not need to set labels manually, keep the default values in gfd.json.
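One way to picture how manual values interact with discovered ones is the sketch below. It assumes that an empty string in gfd.json means "keep the discovered value" and that a non-empty string overrides or adds the label. That reading is inferred from the example above and is not confirmed GFD behavior; `apply_overrides` is a hypothetical name.

```python
import json
from typing import Dict


def apply_overrides(discovered: Dict[str, str], config_text: str) -> Dict[str, str]:
    """Merge non-empty values from a gfd.json-style document over discovered labels."""
    overrides = json.loads(config_text).get("labels", {})
    merged = dict(discovered)
    for key, value in overrides.items():
        if value != "":  # assumption: empty string leaves the discovered value alone
            merged[key] = value
    return merged


discovered = {
    "enflame.com/gcu.model": "T10",
    "tke.cloud.tencent.com/gpu.product": "zzzz-v1",
}
config_text = """
{
  "labels": {
    "enflame.com/gcu.model": "",
    "enflame.com/gcu.vendor": "1e36",
    "tke.cloud.tencent.com/gpu.product": "ABCD"
  }
}
"""
merged = apply_overrides(discovered, config_text)
print(merged)
```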
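Note:
topscloud manages the configuration files of all components in the /etc/topscloud directory. If gfd.json exists in that directory, you must edit /etc/topscloud/gfd.json; changes to config/gfd.json in the installation package have no effect. If /etc/topscloud/gfd.json does not exist, you can edit config/gfd.json in the installation package and rebuild the GFD image for the change to take effect. After you edit /etc/topscloud/gfd.json, there is no need to rebuild the image or reinstall GFD; the new configuration takes effect automatically within 1 minute (the default interval).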
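The automatic reload described in the note can be pictured as a periodic check of the file. The sketch below compares a content hash between checks. It is an assumption about how such a reload could work, not a description of GFD internals, and `ConfigWatcher` is a hypothetical name. A real loop would call `changed()` once per interval (60 seconds by default, per the note).

```python
import hashlib
import tempfile
from pathlib import Path


class ConfigWatcher:
    """Reports whether a file's content changed since the previous check."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self._last_digest = None

    def changed(self) -> bool:
        digest = hashlib.sha256(self.path.read_bytes()).hexdigest()
        if digest == self._last_digest:
            return False
        self._last_digest = digest
        return True


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "gfd.json"
    config.write_text('{"labels": {}}')
    watcher = ConfigWatcher(config)
    first = watcher.changed()   # the initial load counts as a change
    second = watcher.changed()  # nothing was edited since the last check
    config.write_text('{"labels": {"enflame.com/gcu.vendor": "1e36"}}')
    third = watcher.changed()   # the edit is picked up on the next check

print(first, second, third)
```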
3.6. Building the GFD Image
Run the build-image.sh script in the gcu-feature-discovery_<VERSION> installation package to build the GFD image in one step:
gcu-feature-discovery_<VERSION> # ./build-image.sh
1. Clear old image if exist
Untagged: artifact.enflame.cn/enflame_docker_images/enflame/enflame-device-plugin:latest
Deleted: sha256:9ed58f9e23e59f43e3b5ca5eb4d8692f15fed5f49091965012c69722c1155cc9
Deleted: sha256:c0b960fb962928ae79dbaf79e317c8094ead10f818a41295266abafb66f44eca
Deleted: sha256:b1bbc4b92deac54ce3e91267a5d54b829d5a409f1866d13e4eaee896a8871270
artifact.enflame.cn/enflame_docker_images/enflame/enflame-device-plugin:latest
2. Build image start...
image name:artifact.enflame.cn/enflame_docker_images/enflame/enflame-device-plugin, \
image version:latest
Sending build context to Docker daemon 296.1MB
Step 1/9 : FROM ubuntu:18.04
---> f9a80a55f492
Step 2/9 : WORKDIR .
---> Using cache
---> 93587095278b
Step 3/9 : ENV GRPC_GO_LOG_SEVERITY_LEVEL="INFO"
---> Using cache
---> 7a1ef5d89169
Step 4/9 : ENV ENFLAME_VISIBLE_DEVICES=all
---> Using cache
---> d5897e72ea44
Step 5/9 : COPY ./bin/enflame-device-plugin /usr/bin/
---> 32259c2523f4
Step 6/9 : COPY ./bin/gcu-feature-discovery /usr/bin/
---> a8fb520217eb
Step 7/9 : COPY ./bin/nfd-master /usr/bin/
---> d8be32e9ecfe
Step 8/9 : COPY ./bin/nfd-topology-updater /usr/bin/
---> 1c6a2ca00054
Step 9/9 : COPY ./bin/nfd-worker /usr/bin/
---> 2800478af73e
Successfully built 2800478af73e
Successfully tagged artifact.enflame.cn/enflame_docker_images/enflame/enflame-device-plugin:latest
build image success
3. save image to ./images
unpacking artifact.enflame.cn/enflame_docker_images/enflame/enflame-device-plugin:latest \
(sha256:97474d3fd81db90eacbb210b143142fcfa9169db85e3e5cc9e3d796d9f58eec9)...done
View the built image:
gcu-feature-discovery_<VERSION> # docker images | grep gcu
artifact.enflame.cn/enflame_docker_images/enflame/gcu-feature-discovery \
latest d7a2a66a130e 21 hours ago 65MB
3.7. Deploying GFD
Use deploy.sh to deploy GFD in one step:
gcu-feature-discovery_<VERSION> # ./deploy.sh
Deploy gcu-feature-discovery start...
daemonset.apps/gcu-feature-discovery created
3.8. Verifying GFD
Check that the pods are running normally:
gcu-feature-discovery_<VERSION> # kubectl get pod -A
NAMESPACE NAME \
READY STATUS RESTARTS AGE
kube-system etcd-sse-lg-112-32 \
1/1 Running 2 (58d ago) 64d
kube-system gcu-feature-discovery-cnm4h \
1/1 Running 0 19s
kube-system kube-apiserver-sse-lg-112-32 \
1/1 Running 2 (58d ago) 64d
kube-system kube-controller-manager-sse-lg-112-32 \
1/1 Running 2 (58d ago) 64d
kube-system kube-proxy-j2gbj \
1/1 Running 2 (58d ago) 64d
kube-system kube-scheduler-sse-lg-112-32 \
1/1 Running 2 (58d ago) 63d
kube-system nfd-fblrn \
2/2 Running 0 7m13s
.........
Run kubectl describe node and check that the node labels were updated:
.................
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
enflame.com/gcu.count=8
enflame.com/gcu.driverVer=1.0.4
enflame.com/gcu.family=AAAA
enflame.com/gcu.machine=X640-G40
enflame.com/gcu.memory=32768
enflame.com/gcu.model=XXXX
enflame.com/gcu.product=AAAA
enflame.com/gfd.latestLabeledTimestamp=2024-07-17-06-42-50
enflame.com/gfd.timestamp=2024-07-17-06-38-49
.................
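To check these labels programmatically, the key=value lines can be parsed into a dictionary. The sketch below is a minimal example that works on a few lines copied from the output above. The timestamp format string is inferred from the sample values, and the function names are hypothetical.

```python
from datetime import datetime
from typing import Dict, Iterable


def parse_labels(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'key=value' label lines, tolerating a leading 'Labels:' prefix."""
    labels = {}
    for line in lines:
        line = line.strip()
        if line.startswith("Labels:"):
            line = line[len("Labels:"):].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            labels[key] = value
    return labels


def parse_gfd_timestamp(value: str) -> datetime:
    """Assumed format, inferred from values such as 2024-07-17-06-38-49."""
    return datetime.strptime(value, "%Y-%m-%d-%H-%M-%S")


sample = """Labels: beta.kubernetes.io/arch=amd64
enflame.com/gcu.count=8
enflame.com/gfd.latestLabeledTimestamp=2024-07-17-06-42-50
enflame.com/gfd.timestamp=2024-07-17-06-38-49""".splitlines()

labels = parse_labels(sample)
elapsed = (
    parse_gfd_timestamp(labels["enflame.com/gfd.latestLabeledTimestamp"])
    - parse_gfd_timestamp(labels["enflame.com/gfd.timestamp"])
)
print(labels["enflame.com/gcu.count"], elapsed.total_seconds())
```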
3.9. Uninstalling GFD
Run delete.sh to uninstall GFD in one step:
gcu-feature-discovery_<VERSION> # ./delete.sh
Uninstall gcu-feature-discovery start...
daemonset.apps "gcu-feature-discovery" deleted
4. all-in-one Usage Example
4.1. Build and Deploy
The steps are as follows:
cd all-in-one
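# Build the image; edit the image name and path in docker/Dockerfile.ubuntu as needed
./build-image.sh
# Apply the YAML file; edit the image name and path in yaml/all-in-one.yaml as needed
./deploy.sh
# Results appear after about 60s; to adjust this, change the
# - "--sleep-interval=60s" argument of nfd-worker in all-in-one.yaml
./show-labels.sh
# show-labels output: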
.................
tencent.com/gfd.latestLabeledTimestamp=2024-05-27-07-11-13
tencent.com/gfd.timestamp=2024-05-27-07-04-12
tencent.com/gcu.count=1
tencent.com/gcu.driverVer=2.4.2024052401
tencent.com/gcu.family=ZZZZ
tencent.com/gcu.machine=MS-7C37
tencent.com/gcu.memory=16384
tencent.com/gcu.model=ZZZZC100
tencent.com/gcu.product=zzzz-v1
tke.cloud.tencent.com/gfd.latestLabeledTimestamp=2024-05-27-07-11-13
tke.cloud.tencent.com/gfd.timestamp=2024-05-27-07-04-12
tke.cloud.tencent.com/gpu.count=1
tke.cloud.tencent.com/gpu.driverVer=2.4.2024052401
tke.cloud.tencent.com/gpu.family=ZZZZ
tke.cloud.tencent.com/gpu.machine=MS-7C37
tke.cloud.tencent.com/gpu.memory=16384
tke.cloud.tencent.com/gpu.model=ZZZZC100
tke.cloud.tencent.com/gpu.product=zzzz-v1
................
4.2. Using the Labels
nodeSelector example
.....
spec:
  nodeSelector:
    tke.cloud.tencent.com/gpu.product: AAAA # AAAA is the GCU product model we want
.....
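The same selector can also be generated in code. The sketch below builds a minimal Pod manifest as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The pod name and image are placeholders, and `pod_with_node_selector` is a hypothetical helper.

```python
import json
from typing import Dict


def pod_with_node_selector(name: str, image: str, selector: Dict[str, str]) -> dict:
    """Return a minimal Pod manifest pinned to nodes that carry the given labels."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": dict(selector),
            "containers": [{"name": name, "image": image}],
        },
    }


manifest = pod_with_node_selector(
    "gcu-job",                      # placeholder pod name
    "example.com/my-image:latest",  # placeholder image
    {"tke.cloud.tencent.com/gpu.product": "AAAA"},
)
print(json.dumps(manifest, indent=2))
```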
5. FAQ
1) How do I change the default image path and name?
The default image path and name in build-image.sh is artifact.enflame.cn/enflame_docker_images/enflame/gcu-feature-discovery:latest, as shown below:
ORIGIN_NAME="gcu-feature-discovery"
VERSION="latest"
REPO="artifact.enflame.cn/enflame_docker_images/enflame"
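The three variables map onto the full image reference as sketched below. The sketch assumes the script joins them as REPO/ORIGIN_NAME:VERSION, which is consistent with the default reference quoted above but has not been checked against the script itself. `image_reference` is a hypothetical name.

```python
def image_reference(repo: str, name: str, version: str) -> str:
    """Compose a container image reference as <repo>/<name>:<version>."""
    return f"{repo}/{name}:{version}"


ORIGIN_NAME = "gcu-feature-discovery"
VERSION = "latest"
REPO = "artifact.enflame.cn/enflame_docker_images/enflame"

reference = image_reference(REPO, ORIGIN_NAME, VERSION)
print(reference)
```

Changing any of the three values changes the corresponding part of the reference. The image name in the deployment YAML then has to match.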
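2) What is the all-in-one package?
The all-in-one package puts k8s-device-plugin, gcu-feature-discovery, nfd-master, and nfd-worker into a single container image and uses only one DaemonSet. By default, topscloud also provides separate packages for k8s-device-plugin, gcu-feature-discovery, and node-feature-discovery to meet the needs of different users.
In all-in-one.yaml, the nodeSelector based on the device vendor ID must be disabled, as follows: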
# nodeSelector:
# feature.node.kubernetes.io/pci-1e36.present: "true"
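This is because GFD depends on NFD to report node labels: when NFD has not started, this label is not available, and the symptom is that the DaemonSet from the all-in-one YAML fails to start. In addition, the all-in-one DaemonSet does not need to be started on nodes without GCU cards.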
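The effect of that nodeSelector follows from how Kubernetes evaluates it: a pod is eligible for a node only if every key/value pair in the selector is present in the node's labels. The sketch below models that rule in a few lines; `node_matches` is a hypothetical helper, not a Kubernetes API.

```python
from typing import Dict


def node_matches(node_labels: Dict[str, str], node_selector: Dict[str, str]) -> bool:
    """True when every selector key is present on the node with the same value."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


selector = {"feature.node.kubernetes.io/pci-1e36.present": "true"}

before_nfd = {"kubernetes.io/os": "linux"}  # NFD has not labeled the node yet
after_nfd = {
    "kubernetes.io/os": "linux",
    "feature.node.kubernetes.io/pci-1e36.present": "true",
}

print(node_matches(before_nfd, selector))
print(node_matches(after_nfd, selector))
print(node_matches(before_nfd, {}))  # selector disabled: every node is eligible
```

In the all-in-one package NFD runs inside the same DaemonSet, so nothing can set the label before the pods are scheduled. This is why the selector is commented out there.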