1. Revision History

Version  Changes                          Date
v1.0     Initial release                  5/09/2022
v1.1     Added content                    8/11/2022
v1.2     Updated some content and format  4/8/2024
v1.3     Updated some formatting          4/17/2024

2. Introduction

Enflame kubeone is a k8s cluster deployment tool built as a customized derivative of sealer, released under the Apache-2.0 license. The current 1.0.x releases support use on the Enflame intranet only; external users should instead follow the sealer user documentation at http://sealer.cool/docs/getting-started/quick-start.html (copy the link into a browser).

enflame kubeone inherits all of sealer's strengths: like docker, it can package an entire cluster as an image, bundling the complete Enflame software stack and its dependencies into a single cluster image, enabling fast construction, one-click delivery, and reliable operation of Enflame training or inference clusters.

Unlike sealer, enflame kubeone focuses on scenarios such as Enflame software-stack packaging, private cloud, data centers, and offline deployment. kubeone packages and delivers an entire Enflame training or inference cluster as an image, providing an "out-of-the-box" application packaging technology.

3. Prerequisites

  • All nodes must run ubuntu; the centos/redhat families are not currently supported;

  • Make sure port 5000 is free on the node where kubeone runs; if it is occupied, stop the process using it;

  • Check whether /etc/systemd/system/docker.service exists; if it does, back it up and then delete it;

  • root privileges are required;

  • Deploy the same ssh public key to every node, with the matching private key at /root/.ssh/id_rsa on the machine running kubeone;

  • For a single-node pseudo-cluster, deploy both the public and the private key on the local machine;

  • Make sure the externally exposed IPs of all nodes can reach each other (ignore this for a single-node pseudo-cluster);
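The checks above can be scripted. Below is a minimal pre-flight sketch, assuming a bash shell; the helper name `check_port_free` is illustrative and not part of kubeone:

```shell
#!/usr/bin/env bash
# Pre-flight sketch for the prerequisites listed above (illustrative, not part of kubeone).

# Port 5000 on the kubeone node must be free.
check_port_free() {
  local port="$1"
  # bash /dev/tcp: the connect only succeeds if something is listening on the port.
  if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
    echo "port ${port}: in use"
    return 1
  fi
  echo "port ${port}: free"
}

check_port_free 5000

# /etc/systemd/system/docker.service should not exist; back it up before removing it.
if [ -f /etc/systemd/system/docker.service ]; then
  echo "found /etc/systemd/system/docker.service -- back it up and delete it"
fi

# The ssh private key is expected at /root/.ssh/id_rsa on the kubeone machine.
if [ ! -f /root/.ssh/id_rsa ]; then
  echo "warning: /root/.ssh/id_rsa not found"
fi
```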

4. Configuration and Commands

4.1. Installation Package

The current release package contains the following:

kubeone_x.x.x/
├── kubeone
├── kubeone.json.multi-nodes
├── kubeone.json.one-node
├── LICENSE
└── README.md

Where:

  • kubeone, the k8s cluster deployment tool;

  • kubeone.json.multi-nodes, the kubeone multi-node configuration template;

  • kubeone.json.one-node, the kubeone single-node configuration template;

  • LICENSE, the license file;

  • README.md, a brief README document;

4.2. kubeone.json

Enflame kubeone reads its configuration from kubeone.json by default. If no local configuration file is present, it deploys with the default settings: a single-node k8s cluster, currently using the default kubernetes image artifact.enflame.cn/enflame_docker_images/enflame/kubernetes-v1.20.0:1.0.1. The format of kubeone.json is as follows:

{
   "masters": ["IP1", "IP2", "IP3"],
   "nodes": ["IP4", "IP5"],
   "image": "kubernetes image repository path",
   "image_version": "kubernetes version",
   "cluster_name": "cluster name"
}

kubeone.json contains the following fields:

  • masters (master nodes): an array containing the IP addresses of all machines that take the master role;

  • nodes (worker nodes): likewise an array, containing the IP addresses of all machines that take the worker role;

  • image (kubernetes image): the kubernetes image path; defaults to Enflame's intranet-customized image at artifact.enflame.cn/enflame_docker_images/enflame/kubernetes-v1.20.0;

  • image_version (image version): a string giving the version of the image to install; currently defaults to 1.0.1;

  • cluster_name (cluster name): a name for the cluster to be deployed, e.g. "enflame";

The same IP may appear as both a master and a node, but in production each IP should take only one role, and a cluster should have at least 1 (or, for high availability, 3) master nodes.
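Before running kubeone start, the configuration can be sanity-checked. The sketch below assumes python3 is available (it ships with ubuntu); it writes a sample config so the snippet is self-contained, then validates the fields described above. The file name kubeone.json.sample is illustrative only; point CONFIG at your real kubeone.json.

```shell
# Write a sample config so the snippet is self-contained; point CONFIG at your real kubeone.json.
CONFIG="${CONFIG:-./kubeone.json.sample}"
cat > "$CONFIG" <<'EOF'
{
   "masters": ["10.12.0.1"],
   "nodes": ["10.12.0.2"],
   "image": "artifact.enflame.cn/enflame_docker_images/enflame/kubernetes-v1.20.0",
   "image_version": "1.0.1",
   "cluster_name": "enflame"
}
EOF

# Validate: well-formed JSON, all fields from section 4.2 present, at least one master.
python3 - "$CONFIG" <<'PY'
import json, sys

path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)  # raises on malformed JSON (e.g. a missing quote)

required = ["masters", "nodes", "image", "image_version", "cluster_name"]
missing = [k for k in required if k not in cfg]
if missing:
    sys.exit("missing fields: " + ", ".join(missing))
if not cfg["masters"]:
    sys.exit("at least one master IP is required")
print("%s looks valid (%d master(s), %d node(s))" % (path, len(cfg["masters"]), len(cfg["nodes"])))
PY
```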

4.3. Command Set

kubeone start

Run this command with no arguments. It reads the kubeone.json configuration file in the current working directory and automatically deploys the cluster as described there.

kubeone stop

Run this command with no arguments. It reads the kubeone.json configuration file in the current working directory and automatically cleans up the k8s environment described there.

kubeone pull

Pulls a k8s image, for example:

# kubeone pull artifact.enflame.cn/enflame_docker_images/enflame/kubernetes-v1.20.0:1.0.1

When the required image is not present locally, kubeone start pulls it automatically.

kubeone save

Exports a cluster image from the local store, for example:

# kubeone save kubernetes-1.20.0:1.0.1 -o k8s.tar

kubeone load

Imports a cluster image from a tar archive, for example:

# kubeone load -i k8s.tar

kubeone --help

For more commands, see kubeone --help or kubeone -h, as follows:

# kubeone --help
Usage:
  kubeone [command]

Available Commands:
  help        Help about any command
  images      list all cluster images
  load        load image from a tar file
  login       login image repository
  pull        pull cloud image to local
  push        push cloud image to registry
  rmi         remove local images by name
  save        save image to a tar file
  start       start a cluster with config
  stop        stop a cluster with config

Flags:
  -d, --debug           turn on debug mode
  -h, --help            help for kubeone

Use "kubeone [command] --help" for more information about a command.

5. Deployment Examples

5.1. Single-Node Deployment

1) Without kubeone.json

Running kubeone start directly in the package directory quickly installs a single-node k8s pseudo-cluster on the local machine, with the default version v1.20.0, for example:

# cd kubeone_1.0.1
# kubeone start

2) With kubeone.json

Copy kubeone.json.one-node from the kubeone package to kubeone.json:

# cp kubeone.json.one-node kubeone.json

Edit its contents to match your environment: customize the master IPs, node IPs, image, image version, and cluster name as needed, for example:

{
   "masters": ["10.12.0.1"],
   "nodes": ["10.12.0.1"],
   "image": "artifact.enflame.cn/enflame_docker_images/enflame/kubernetes-v1.20.0",
   "image_version": "1.0.1",
   "cluster_name": "enflame"
}

Then run:

# kubeone start

5.2. Multi-Node Deployment

1) Edit kubeone.json

Copy kubeone.json.multi-nodes from the kubeone package to kubeone.json:

# cp kubeone.json.multi-nodes kubeone.json

Edit its contents to match your environment: customize the master IPs, node IPs, image, image version, and cluster name as needed, for example:

{
   "masters": ["10.12.0.1", "10.12.0.2", "10.12.0.3"],
   "nodes": ["10.12.0.4", "10.12.0.5"],
   "image": "artifact.enflame.cn/enflame_docker_images/enflame/kubernetes-v1.20.0",
   "image_version": "1.0.1",
   "cluster_name": "enflame"
}

2) kubeone start

Once kubeone.json is configured, simply run kubeone start:

# kubeone start

6. Common Issues

1) Unable to pull the k8s image

In an external network environment, kubeone cannot pull the k8s image; the EGC platform must be used instead. Contact the CSE team for details.

2) Cleaning up an existing k8s installation

To clean up a k8s installation already present on the OS, use kubeone stop; this command cleans the environment directly.

3) How to deploy offline

On the Enflame intranet, use kubeone pull to fetch an existing k8s image, then kubeone save to export it as a tar archive. Copy the archive to the target machine via removable media, import it with kubeone load, and finally run kubeone start.
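The steps above can be sketched end to end. The following is a dry run that only prints each command; the `run` wrapper is illustrative (not a kubeone feature), and the image arguments follow the examples in section 4.3:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the offline deployment workflow (prints the commands instead of executing them).
run() { echo "+ $*"; }

# On a machine inside the Enflame intranet:
run kubeone pull artifact.enflame.cn/enflame_docker_images/enflame/kubernetes-v1.20.0:1.0.1
run kubeone save kubernetes-1.20.0:1.0.1 -o k8s.tar

# Copy k8s.tar to the target machine via removable media, then on the target machine:
run kubeone load -i k8s.tar
run kubeone start
```

Dropping the `run` prefix yields the real commands shown in section 4.3.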

4) How to install docker-ce

On ubuntu, install docker-ce as follows:

# apt install dkms -y
# curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(cat /etc/os-release | grep UBUNTU_CODENAME | cut -d '=' -f 2) stable"
# apt update
# apt install docker-ce -y
# systemctl enable docker
# systemctl restart docker
# docker --version

To remove an existing docker-ce installation before reinstalling, the following steps can be used as a reference:

# apt purge containerd.io docker-ce
# rm -rf /var/lib/containerd/
# apt install docker-ce containerd.io

5) How to handle the node(s) had taint error

In a single-node deployment, the node is both master and node. An error like node(s) had taint may appear, for example:

 Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  119s  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Resolution:

Enable scheduling on the master node by running the following command:

# kubectl taint nodes --all node-role.kubernetes.io/master-

6) How to handle the coredns CrashLoopBackOff error

After the k8s cluster is deployed, errors like coredns CrashLoopBackOff may appear, for example:

  # kubectl get pod -o wide -A

  NAMESPACE        NAME                                        READY  STATUS            RESTARTS  AGE  IP              NODE          NOMINATED NODE  READINESS GATES
  calico-system    calico-kube-controllers-5689d4dfdf-tzh5k    1/1    Running           0         58s  100.98.178.129  develop-host  <none>          <none>
  calico-system    calico-node-kzzcp                           1/1    Running           0         59s  10.8.52.16      develop-host  <none>          <none>
  calico-system    calico-typha-7949c4b4db-lj72x               1/1    Running           0         59s  10.8.52.16      develop-host  <none>          <none>
  kube-system      coredns-597c5579bc-7w4hx                    0/1    CrashLoopBackOff  2         66s  100.98.178.130  develop-host  <none>          <none>
  kube-system      coredns-597c5579bc-pkbk9                    0/1    CrashLoopBackOff  1         66s  100.98.178.131  develop-host  <none>          <none>
  kube-system      etcd-jeff-develop-host                      1/1    Running           0         84s  10.8.52.16      develop-host  <none>          <none>
  kube-system      kube-apiserver-jeff-develop-host            1/1    Running           0         84s  10.8.52.16      develop-host  <none>          <none>
  kube-system      kube-controller-manager-jeff-develop-host   1/1    Running           0         84s  10.8.52.16      develop-host  <none>          <none>
  kube-system      kube-proxy-d2g7q                            1/1    Running           0         66s  10.8.52.16      develop-host  <none>          <none>
  kube-system      kube-scheduler-jeff-develop-host            1/1    Running           0         83s  10.8.52.16      develop-host  <none>          <none>
  tigera-operator  tigera-operator-86c4fc874f-dsz4n            1/1    Running           0         66s  10.8.52.16      develop-host  <none>          <none>

Resolution:

Add suitable nameserver entries to /etc/resolv.conf, for example:

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.12.31.11
nameserver 172.16.11.11

Then restart coredns by deleting its pods, for example:

# kubectl delete po coredns-597c5579bc-7w4hx -n kube-system
pod "coredns-597c5579bc-7w4hx" deleted
# kubectl delete po coredns-597c5579bc-pkbk9 -n kube-system
pod "coredns-597c5579bc-pkbk9" deleted

7) How to handle the Failed to execute operation: File exists error

When deploying the k8s cluster fails with an error like Failed to execute operation: File exists, for example:

vm.swappiness = 0
net.ipv4.ip_forward = 1
/usr/sbin/ufw
Firewall stopped and disabled on system startup
Failed to execute operation: File exists
2021-08-15 15:16:15 [EROR] [sshcmd.go:82] exec command failed Process exited with status 1
2021-08-15 15:16:15 [EROR] [filesystem.go:187] exec init.sh failed exec command failed 10.8.52.16 cd /var/lib/sealer/data/my-cluster/rootfs && chmod +x scripts/* && cd scripts && sh init.sh
2021-08-15 15:16:15 [EROR] [run.go:55] mount rootfs failed mountRootfs failed

Resolution:

First run systemctl enable kubelet to see whether the error above reproduces,
then run find / -name "kubelet*" to locate the redundant kubelet file and delete it.

8) How to handle the Unit docker.service is masked. error

Running systemctl restart docker may fail with Unit docker.service is masked., for example:

# systemctl restart docker
Failed to restart docker.service: Unit docker.service is masked.

Resolution:

# systemctl unmask docker.service
# systemctl unmask docker.socket
# systemctl start docker.service

9) How to handle the The recommended driver is "systemd" warning

When a warning like The recommended driver is "systemd" appears, for example:

    [WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
    [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
    [ERROR Port-10250]: Port 10250 is in use

Resolution:

Add the following to /etc/docker/daemon.json:
"exec-opts": ["native.cgroupdriver=systemd"]

Then restart docker:

# systemctl restart docker
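Editing daemon.json by hand can clobber keys that are already there. The sketch below merges the setting in with python3 instead; DAEMON_JSON defaults to a local file purely for safe experimentation, while the real file is /etc/docker/daemon.json:

```shell
# Merge "exec-opts" into docker's daemon.json without discarding other keys.
# DAEMON_JSON defaults to ./daemon.json for illustration; use /etc/docker/daemon.json on a real host.
DAEMON_JSON="${DAEMON_JSON:-./daemon.json}"

python3 - "$DAEMON_JSON" <<'PY'
import json, os, sys

path = sys.argv[1]
cfg = {}
if os.path.exists(path):
    with open(path) as f:
        cfg = json.load(f)  # keep whatever is already configured

cfg["exec-opts"] = ["native.cgroupdriver=systemd"]

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
print("wrote", path)
PY
```

After updating the real /etc/docker/daemon.json, restart docker with systemctl restart docker so the new cgroup driver takes effect.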

10) How to handle the Port 10250 is in use error

When an error like Port 10250 is in use appears, for example:

Port 10250 is in use
error execution phase preflight: [preflight] Some fatal errors occurred:
    [ERROR Port-10250]: Port 10250 is in use

Resolution:

# kubeadm reset 

11) How to handle the overlayfs: maximum fs stacking depth exceeded error

If dmesg shows an error like overlayfs: maximum fs stacking depth exceeded, for example:

[ 2204.645961] overlayfs: maximum fs stacking depth exceeded

Resolution: locate the extra overlay mount with df, then unmount it:

# df
# umount overlay

12) How to handle the write /proc/self/attr/keycreate: invalid argument error

When an error like write /proc/self/attr/keycreate: invalid argument appears, for example:

 docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write /proc/self/attr/keycreate: invalid argument\"": unknown.

Resolution:

This is a known bug between runc and selinux; upgrading runc to a release newer than rc93 fixes it, for example:

# runc -v
# wget https://github.com/opencontainers/runc/releases/download/v1.0.1/runc.amd64
# mv /usr/bin/runc /usr/bin/runc.bak
# chmod +x runc.amd64 && cp runc.amd64 /usr/bin/runc