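1. Version History

Version  Changes                                Date
v1.0     Initial version                        11/30/2022
v1.1     Formatting adjustments                 12/01/2022
v1.2     Updated some formatting and content    4/8/2024
v1.3     Updated some formatting                4/17/2024
v1.4     Updated some content                   5/17/2024
v1.5     Updated the TKE usage content          5/22/2024
v1.6     Updated the configuration file guide   5/28/2024
v1.7     Updated formatting                     6/24/2024
v1.8     Updated prerequisites                  7/9/2024
v1.9     Updated formatting                     7/17/2024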

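2. Introduction

GCU Feature Discovery (GFD) is a component deployed on a Kubernetes (k8s) cluster. It labels GCU nodes with attributes of their GCU devices, for example the GCU driver version installed on the node or the size of the GCU memory. Most of these labels begin with "enflame.com". Their main purpose is to let later task scheduling place workloads on specific nodes simply by matching labels.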

3. GFD Deployment

3.1. Prerequisites

  • Docker or containerd installed (containerd for k8s versions >= 1.24)

  • Kubernetes cluster version higher than 1.9

  • GCU Driver

  • Enflame Container Toolkit

  • Enflame K8s Device Plugin

  • Node Feature Discovery

3.2. Package Contents

The GFD package directory contains the following:

gcu-feature-discovery_<VERSION>/
├── all-in-one
├── build-image.sh
├── config
├── delete.sh
├── deploy.sh
├── docker
├── gcu-feature-discovery
├── README.md
├── show-labels.sh
└── yaml

  • all-in-one: package that combines k8s-device-plugin, GFD, and NFD in a single DaemonSet;

  • build-image.sh: script that builds the gcu-feature-discovery image;

  • config: directory containing the gcu-feature-discovery configuration file;

  • delete.sh: script that deletes the gcu-feature-discovery DaemonSet;

  • deploy.sh: script that deploys the gcu-feature-discovery DaemonSet;

  • docker: directory containing the gcu-feature-discovery Dockerfile;

  • gcu-feature-discovery: the gcu-feature-discovery binary;

  • README.md: a brief README for gcu-feature-discovery;

  • yaml: directory containing the gcu-feature-discovery YAML files.

3.3. Labels Provided

The labels provided by GFD begin with enflame.com. The main labels are:

Label Name                  Meaning                                     Example
enflame.com/gfd.timestamp   Timestamp of the GFD deployment (optional)  2023-03-02-02-59-47
enflame.com/gcu.count       Number of GCUs                              8
enflame.com/gcu.driverVer   Driver version of the GCU                   1.0.1.3
enflame.com/gcu.machine     Machine type                                NF5468M5
enflame.com/gcu.memory      Memory of the GCU in MB                     16384
enflame.com/gcu.model       Model of the GCU                            T10
enflame.com/gcu.family      Family of the GCU                           aaaa
enflame.com/gcu.product     Product of the GCU                          xxxxx

If the node has VGCU devices, the main labels provided by GFD are:

Label Name                   Meaning                                      Example
enflame.com/gfd.timestamp    Timestamp of the GFD deployment (optional)   2023-03-02-02-59-47
enflame.com/vgcu.present     Whether a VGCU device is present (optional)  true
enflame.com/vgcu.count       Number of VGCUs                              4
enflame.com/vgcu.driverVer   Driver version of the VGCU                   1.0.1.3
enflame.com/vgcu.machine     Machine type                                 NF5468M5
enflame.com/vgcu.memory      Memory of the VGCU in MB                     4096
enflame.com/vgcu.model       Model of the VGCU                            T10
enflame.com/vgcu.family      Family of the VGCU                           aaaa
enflame.com/vgcu.product     Product of the VGCU                          xxxxx

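3.4. Configuring and Deploying NFD

Note: for installing and using NFD, refer to the "Node Feature Discovery User Manual".

1) Dependencies

  • GFD relies on NFD to apply labels to nodes, so make sure NFD is installed before you install GFD;

  • GFD also depends on libefml.so. Before installing GFD, check that libefml exists on the host and is usable:

# ll /usr/lib/libefml.so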

lrwxrwxrwx 1 root root 50 May 17 16:56 /usr/lib/libefml.so -> \
        /usr/local/efsmi/efsmi-<VERSION>/lib/libefml.so.1.0.0*
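
The manual check above can also be scripted, for example as a pre-install step. The following is a minimal sketch, not part of the GFD package: check_lib is a hypothetical helper name, and it assumes a readlink that supports -f (GNU coreutils or BusyBox).

```shell
# Sketch: verify that a shared library path exists and, if it is a symlink,
# that the final link target is a readable regular file.
# check_lib is a hypothetical helper, not shipped with GFD.
check_lib() {
    local path="${1:-/usr/lib/libefml.so}"
    local target
    # readlink -f follows every level of symlink and prints the final path
    target="$(readlink -f -- "$path")" || {
        echo "cannot resolve: $path" >&2
        return 1
    }
    if [ -f "$target" ] && [ -r "$target" ]; then
        echo "ok: $path -> $target"
    else
        echo "missing or unreadable: $path" >&2
        return 1
    fi
}
```

Calling check_lib with no argument checks /usr/lib/libefml.so; a non-zero exit status means the library is missing or the symlink is dangling.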

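2) Configuring the label namespace in NFD

The --extra-label-ns=xxx argument in the NFD YAML controls which GFD labels are let through. For example, to allow labels that begin with enflame.com, set --extra-label-ns=enflame.com on nfd-master.
If you need labels from both the enflame.com namespace and the tke.cloud.tencent.com namespace, separate the two namespaces with a comma, as follows: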

.................
          image: artifact.enflame.cn/enflame_docker_images/enflame/node-feature-discovery:v0.11.3
          name: nfd-master
          command:
            - "nfd-master"
          args:
          #  - "--extra-label-ns=enflame.com"
            - "--extra-label-ns=enflame.com,tke.cloud.tencent.com"
        - env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
..................
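
Before applying the manifest, it can help to confirm which namespaces are actually enabled, since a commented-out argument is easy to overlook. The sketch below only inspects a local file with grep; extra_label_ns is a hypothetical helper name and nfd.yaml in the usage note is a placeholder path.

```shell
# Sketch: print the namespaces passed via --extra-label-ns in a manifest,
# one per line. Lines that are commented out with '#' are ignored.
# extra_label_ns is a hypothetical helper, not part of NFD or GFD.
extra_label_ns() {
    grep -v '^[[:space:]]*#' "$1" \
        | grep -o -- '--extra-label-ns=[^"[:space:]]*' \
        | cut -d= -f2 \
        | tr ',' '\n'
}
```

For example, extra_label_ns nfd.yaml run against the snippet above would list enflame.com and tke.cloud.tencent.com on separate lines.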

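3.5. GFD Configuration File

The default GFD configuration file, gfd.json, is in the config/ directory of the package and is copied into the image at build time. When gfd is deployed, if gfd.json exists in the host's /etc/topscloud directory, gfd runs with /etc/topscloud/gfd.json as its configuration file. Otherwise, gfd syncs the initial configuration file from the image to the host's /etc/topscloud directory and then runs with /etc/topscloud/gfd.json.

In the default config/gfd.json, the label values are empty: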

{
    "labels": {
      "enflame.com/gcu.family": "",
      "enflame.com/gcu.model": "",
      "enflame.com/gcu.product": "",
      "tke.cloud.tencent.com/gpu.family": "",
      "tke.cloud.tencent.com/gpu.model": "",
      "tke.cloud.tencent.com/gpu.product": ""
    }
}

When you need to set label values manually, edit the values (or the keys) in gfd.json, for example:

{
    "labels": {
      "enflame.com/gcu.family": "",
      "enflame.com/gcu.model": "",
      "enflame.com/gcu.product": "",
      "enflame.com/gcu.vendor": "1e36",
      "tke.cloud.tencent.com/gpu.family": "",
      "tke.cloud.tencent.com/gpu.model": "",
      "tke.cloud.tencent.com/gpu.product": "ABCD"
    }
}

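With the configuration above, gfd applies the label "tke.cloud.tencent.com/gpu.product": "ABCD" to the node.

If you do not need to set labels manually, keep the default values in gfd.json.

Note:

  • topscloud manages the configuration files of all components under /etc/topscloud. If gfd.json already exists in that directory, you must edit /etc/topscloud/gfd.json; changes to config/gfd.json in the package have no effect. If /etc/topscloud/gfd.json does not exist, you can edit config/gfd.json in the package and rebuild the gfd image for the change to take effect.

  • After editing /etc/topscloud/gfd.json, you do not need to rebuild the image or reinstall gfd. The new configuration takes effect automatically within 1 minute (the default interval).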
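
The lookup order described in this section can be summarized in a few lines of shell. This is only an illustration of the documented behavior, not GFD's actual code: resolve_gfd_config is a hypothetical name, and the second argument stands in for the copy of gfd.json that was placed in the image.

```shell
# Sketch of the documented lookup order; NOT GFD's real implementation.
#   $1: host config directory (defaults to /etc/topscloud)
#   $2: default gfd.json, e.g. the copy baked into the image (placeholder path)
resolve_gfd_config() {
    local host_dir="${1:-/etc/topscloud}"
    local default_cfg="${2:-./config/gfd.json}"
    local host_cfg="$host_dir/gfd.json"
    if [ ! -f "$host_cfg" ]; then
        # No host copy yet: seed it from the default file
        mkdir -p "$host_dir" && cp "$default_cfg" "$host_cfg" || return 1
    fi
    # In both cases the host copy is the file that ends up being used
    echo "$host_cfg"
}
```

The point to take away is that an existing host file is never overwritten, which is why edits to config/gfd.json have no effect once /etc/topscloud/gfd.json exists.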

3.6. Building the GFD Image

Run the build-image.sh script in the gcu-feature-discovery_<VERSION> package to build the GFD image in one step:

gcu-feature-discovery_<VERSION> # ./build-image.sh

1. Clear old image if exist
Untagged: artifact.enflame.cn/enflame_docker_images/enflame/enflame-device-plugin:latest
Deleted: sha256:9ed58f9e23e59f43e3b5ca5eb4d8692f15fed5f49091965012c69722c1155cc9
Deleted: sha256:c0b960fb962928ae79dbaf79e317c8094ead10f818a41295266abafb66f44eca
Deleted: sha256:b1bbc4b92deac54ce3e91267a5d54b829d5a409f1866d13e4eaee896a8871270
artifact.enflame.cn/enflame_docker_images/enflame/enflame-device-plugin:latest
2. Build image start...
image name:artifact.enflame.cn/enflame_docker_images/enflame/enflame-device-plugin, \
                                                  image version:latest
Sending build context to Docker daemon  296.1MB
Step 1/9 : FROM ubuntu:18.04
 ---> f9a80a55f492
Step 2/9 : WORKDIR .
 ---> Using cache
 ---> 93587095278b
Step 3/9 : ENV GRPC_GO_LOG_SEVERITY_LEVEL="INFO"
 ---> Using cache
 ---> 7a1ef5d89169
Step 4/9 : ENV ENFLAME_VISIBLE_DEVICES=all
 ---> Using cache
 ---> d5897e72ea44
Step 5/9 : COPY ./bin/enflame-device-plugin /usr/bin/
 ---> 32259c2523f4
Step 6/9 : COPY ./bin/gcu-feature-discovery /usr/bin/
 ---> a8fb520217eb
Step 7/9 : COPY ./bin/nfd-master /usr/bin/
 ---> d8be32e9ecfe
Step 8/9 : COPY ./bin/nfd-topology-updater /usr/bin/
 ---> 1c6a2ca00054
Step 9/9 : COPY ./bin/nfd-worker /usr/bin/
 ---> 2800478af73e
Successfully built 2800478af73e
Successfully tagged artifact.enflame.cn/enflame_docker_images/enflame/enflame-device-plugin:latest
build image success
3. save image to ./images
unpacking artifact.enflame.cn/enflame_docker_images/enflame/enflame-device-plugin:latest \
(sha256:97474d3fd81db90eacbb210b143142fcfa9169db85e3e5cc9e3d796d9f58eec9)...done

Check the image that was built:

gcu-feature-discovery_<VERSION> # docker images | grep gcu
artifact.enflame.cn/enflame_docker_images/enflame/gcu-feature-discovery         \
latest          d7a2a66a130e   21 hours ago        65MB

3.7. Deploying GFD

Use deploy.sh to deploy GFD in one step:

gcu-feature-discovery_<VERSION> # ./deploy.sh
Deploy gcu-feature-discovery start...
daemonset.apps/gcu-feature-discovery created

3.8. Verifying GFD

Check that the pod is running normally:

gcu-feature-discovery_<VERSION> # kubectl get pod -A
NAMESPACE     NAME                                       \
                                      READY   STATUS    RESTARTS      AGE
kube-system   etcd-sse-lg-112-32                         \
                                      1/1     Running   2 (58d ago)   64d
kube-system   gcu-feature-discovery-cnm4h                \
                                      1/1     Running   0             19s
kube-system   kube-apiserver-sse-lg-112-32               \
                                      1/1     Running   2 (58d ago)   64d
kube-system   kube-controller-manager-sse-lg-112-32      \
                                      1/1     Running   2 (58d ago)   64d
kube-system   kube-proxy-j2gbj                           \
                                      1/1     Running   2 (58d ago)   64d
kube-system   kube-scheduler-sse-lg-112-32               \
                                      1/1     Running   2 (58d ago)   63d
kube-system   nfd-fblrn                                  \
                                      2/2     Running   0             7m13s
.........

Run kubectl describe node to check that the node labels were updated:

.................
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    enflame.com/gcu.count=8
                    enflame.com/gcu.driverVer=1.0.4
                    enflame.com/gcu.family=AAAA
                    enflame.com/gcu.machine=X640-G40
                    enflame.com/gcu.memory=32768
                    enflame.com/gcu.model=XXXX
                    enflame.com/gcu.product=AAAA
                    enflame.com/gfd.latestLabeledTimestamp=2024-07-17-06-42-50
                    enflame.com/gfd.timestamp=2024-07-17-06-38-49
.................
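
On a node with many labels, it is convenient to show only the ones in a given namespace. The sketch below filters text read from standard input, so it works on kubectl describe node output or on any saved label listing; labels_in_ns is a hypothetical helper name.

```shell
# Sketch: read a label listing on stdin and print only the key=value
# labels whose key starts with the given namespace.
# labels_in_ns is a hypothetical helper, not part of GFD.
labels_in_ns() {
    # First pull out every "<namespace>/<name>=<value>" token, then keep
    # the tokens that begin with the literal prefix "<namespace>/"
    grep -oE '[A-Za-z0-9._-]+/[A-Za-z0-9._-]+=[^[:space:],]*' \
        | awk -v prefix="$1/" 'index($0, prefix) == 1'
}
```

Typical use would be kubectl describe node <node-name> | labels_in_ns enflame.com, where <node-name> is the node to inspect.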

3.9. Uninstalling GFD

Run delete.sh to uninstall GFD in one step:

gcu-feature-discovery_<VERSION> # ./delete.sh
Uninstall gcu-feature-discovery start...
daemonset.apps "gcu-feature-discovery" deleted

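4. all-in-one Usage Example

4.1. Build and Deploy

Steps:

cd all-in-one

# Build the image; adjust the image name and path in docker/Dockerfile.ubuntu as needed
./build-image.sh

# Apply the YAML file; adjust the image name and path in yaml/all-in-one.yaml as needed
./deploy.sh

# Results only appear after about 60s; to adjust this, change the
# - "--sleep-interval=60s" argument of nfd-worker in all-in-one.yaml
./show-labels.sh

# Output of show-labels: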
.................
tencent.com/gfd.latestLabeledTimestamp=2024-05-27-07-11-13
tencent.com/gfd.timestamp=2024-05-27-07-04-12
tencent.com/gcu.count=1
tencent.com/gcu.driverVer=2.4.2024052401
tencent.com/gcu.family=ZZZZ
tencent.com/gcu.machine=MS-7C37
tencent.com/gcu.memory=16384
tencent.com/gcu.model=ZZZZC100
tencent.com/gcu.product=zzzz-v1
tke.cloud.tencent.com/gfd.latestLabeledTimestamp=2024-05-27-07-11-13
tke.cloud.tencent.com/gfd.timestamp=2024-05-27-07-04-12
tke.cloud.tencent.com/gpu.count=1
tke.cloud.tencent.com/gpu.driverVer=2.4.2024052401
tke.cloud.tencent.com/gpu.family=ZZZZ
tke.cloud.tencent.com/gpu.machine=MS-7C37
tke.cloud.tencent.com/gpu.memory=16384
tke.cloud.tencent.com/gpu.model=ZZZZC100
tke.cloud.tencent.com/gpu.product=zzzz-v1
................
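
Because the labels only show up after the NFD sleep interval has elapsed, a script that checks them right after deploy.sh may need to wait. The following generic retry helper is one way to do that; it is a sketch, retry is a hypothetical name, and the check command in the usage note is only an example.

```shell
# Sketch: run a command repeatedly until it succeeds or attempts run out.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
# retry is a hypothetical helper, not part of the all-in-one package.
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For instance, retry 12 10 sh -c './show-labels.sh | grep -qF gcu.count' would poll every 10 seconds for up to about two minutes.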

4.2. Using the Labels

nodeSelector example

.....
spec:
  nodeSelector:
    tke.cloud.tencent.com/gpu.product: AAAA  # AAAA is the GCU product model we want
.....
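
To see the fragment above in a complete manifest, the sketch below prints a minimal Pod pinned to nodes that carry a given label. The pod name, container name, image, and command are hypothetical placeholders; only the nodeSelector part comes from this manual.

```shell
# Sketch: print a minimal Pod manifest with a nodeSelector for one label.
# Pod name, image, and command are placeholders for illustration only.
pod_for_label() {
    local key="$1" value="$2"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gcu-demo
spec:
  nodeSelector:
    ${key}: "${value}"
  containers:
  - name: demo
    image: ubuntu:18.04
    command: ["sleep", "3600"]
EOF
}
```

The value is quoted on purpose: nodeSelector values must be strings, and an unquoted value such as 8 (as in gcu.count) would be read as a number by the YAML parser. The output can be piped to kubectl apply -f - to create the pod.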

5. FAQ

1) How to change the default image path and name

The default image path and name in build-image.sh is artifact.enflame.cn/enflame_docker_images/enflame/gcu-feature-discovery:latest. To change it, edit the following variables:

ORIGIN_NAME="gcu-feature-discovery"
VERSION="latest"
REPO="artifact.enflame.cn/enflame_docker_images/enflame"
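
The sketch below shows how these three variables plausibly combine into the full image reference. build-image.sh itself is not reproduced in this manual, so the REPO/ORIGIN_NAME:VERSION join is an assumption that happens to match the default path quoted above; image_ref is a hypothetical helper name.

```shell
# Sketch: compose the full image reference from the three variables.
# Assumption: the script joins them as REPO/ORIGIN_NAME:VERSION.
ORIGIN_NAME="gcu-feature-discovery"
VERSION="latest"
REPO="artifact.enflame.cn/enflame_docker_images/enflame"

image_ref() {
    echo "${REPO}/${ORIGIN_NAME}:${VERSION}"
}
```

With the defaults, image_ref prints the path given above. Setting, say, REPO to a private registry path and VERSION to a release tag changes the reference accordingly, and the image name in the deployment YAML would then need to match.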

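2) What is the all-in-one package

all-in-one packs k8s-device-plugin, gcu-feature-discovery, nfd-master, and nfd-worker into a single container image and deploys them with a single DaemonSet. topscloud also provides separate packages for k8s-device-plugin, gcu-feature-discovery, and node-feature-discovery by default, to meet the needs of different users.

In all-in-one.yaml, the nodeSelector based on the device vendor ID must be disabled, as follows: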

      # nodeSelector:
      #   feature.node.kubernetes.io/pci-1e36.present: "true"

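This is because GFD relies on NFD to report node labels. When NFD has not started, this label cannot be obtained, and the result is that the DaemonSet in the all-in-one YAML fails to start. Note also that the all-in-one DaemonSet does not need to be started on nodes without GCU cards.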