
Kubernetes (k8s) Troubleshooting Walkthrough

Environment

Linux:

CentOS Linux release 7.9.2009 (Core)

3.10.0-1160.el7.x86_64

k8s cluster:

v1.15.12

Docker (final version, after the downgrade described below):

Version: 18.09.0


The Error

While deploying a k8s cluster recently, I hit the following error:

[root@ken kube]#  kubeadm init --image-repository registry.aliyuncs.com/google_containers --kubernetes-version v1.15.12 --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.15.12
[preflight] Running pre-flight checks
	[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
	[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 20.10.9. Latest validated version: 18.09
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [ken kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.227.12]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [ken localhost] and IPs [192.168.227.12 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [ken localhost] and IPs [192.168.227.12 127.0.0.1 ::1]
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.

Unfortunately, an error has occurred:
	timed out waiting for the condition

This error is likely caused by:
	- The kubelet is not running
	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
	- 'systemctl status kubelet'
	- 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
	- 'docker ps -a | grep kube | grep -v pause'
	Once you have found the failing container, you can inspect its logs with:
	- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster

From the output above, a few things stand out:

1. Two warnings: one about the cgroup driver, one about the Docker version.

2. kubeadm claims the kubelet is not running, yet checking shows it is actually active:

[root@ken kube]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Thu 2021-10-14 09:46:34 CST; 6min ago
     Docs: https://kubernetes.io/docs/
 Main PID: 7609 (kubelet)
    Tasks: 17
   Memory: 29.8M
   CGroup: /system.slice/kubelet.service
           └─7609 /usr/bin/kubelet

Oct 14 09:51:39 ken kubelet[7609]: I1014 09:51:39.492281    7609 kubelet_node_status.go:286] Setting n...etach
Oct 14 09:51:49 ken kubelet[7609]: I1014 09:51:49.527401    7609 kubelet_node_status.go:286] Setting n...etach
Oct 14 09:51:59 ken kubelet[7609]: I1014 09:51:59.556323    7609 kubelet_node_status.go:286] Setting n...etach
Oct 14 09:52:09 ken kubelet[7609]: I1014 09:52:09.594535    7609 kubelet_node_status.go:286] Setting n...etach
Oct 14 09:52:19 ken kubelet[7609]: I1014 09:52:19.632376    7609 kubelet_node_status.go:286] Setting n...etach
Oct 14 09:52:29 ken kubelet[7609]: I1014 09:52:29.669662    7609 kubelet_node_status.go:286] Setting n...etach
Oct 14 09:52:39 ken kubelet[7609]: I1014 09:52:39.709010    7609 kubelet_node_status.go:286] Setting n...etach
Oct 14 09:52:49 ken kubelet[7609]: I1014 09:52:49.746458    7609 kubelet_node_status.go:286] Setting n...etach
Oct 14 09:52:59 ken kubelet[7609]: I1014 09:52:59.781364    7609 kubelet_node_status.go:286] Setting n...etach
Oct 14 09:53:09 ken kubelet[7609]: I1014 09:53:09.812214    7609 kubelet_node_status.go:286] Setting n...etach
Hint: Some lines were ellipsized, use -l to show in full.
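
Tip: systemctl ellipsizes long lines, as the hint above says; to follow the complete kubelet log in real time while kubeadm waits, you can tail the journal directly:

journalctl -u kubelet -f --no-pager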

Taking these one at a time, let's deal with the two warnings first.

1. The cgroup driver warning:

	[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/

Solution: reconfigure Docker's cgroup driver.

Create /etc/docker/daemon.json and add the following content:

vim /etc/docker/daemon.json

{
  "exec-opts": ["native.cgroupdriver=systemd"]
}


Restart Docker:

systemctl restart docker
systemctl status docker
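
To confirm the change took effect, check the driver Docker now reports:

docker info | grep -i 'cgroup driver'
# expected: Cgroup Driver: systemd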

Verify: that warning no longer appears on the next run:

[root@ken kube]# kubeadm init
I1014 09:58:01.523515   10640 version.go:248] remote version is much newer: v1.22.2; falling back to: stable-1.15
[init] Using Kubernetes version: v1.15.12
[preflight] Running pre-flight checks
	[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 20.10.9. Latest validated version: 18.09
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
^X^C


2. The Docker version warning: uninstall the current Docker and install the validated version.

[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 20.10.9. Latest validated version: 18.09

Find the matching Docker rpm on the Tsinghua mirror, download it, and upload it to the Linux host:

https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/centos/7.9/x86_64/stable/Packages/

[root@ken docker18.09]# rz

[root@ken docker18.09]# ls
docker-ce-18.09.0-3.el7.x86_64.rpm


Uninstall the existing Docker:

[root@ken docker18.09]# yum remove docker-ce -y
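
The install step itself is not shown above; a plausible sequence for the downloaded rpm (assuming its dependencies for this release, such as docker-ce-cli and containerd.io, are also available locally or via a configured repo):

yum localinstall docker-ce-18.09.0-3.el7.x86_64.rpm -y
systemctl enable docker
systemctl start docker
docker version --format '{{.Server.Version}}'   # should print 18.09.0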


With the steps above done, rerunning the init shows no more warnings:

[root@ken ~]#  kubeadm init --image-repository registry.aliyuncs.com/google_containers --kubernetes-version v1.15.12 --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.15.12
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [ken localhost] and IPs [192.168.227.12 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [ken localhost] and IPs [192.168.227.12 127.0.0.1 ::1]
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [ken kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.227.12]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused.


But a new error appeared:

[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused.

The fix was as follows:

[root@test2 ~]# vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
...
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --fail-swap-on=false"

[root@test2 ~]# systemctl daemon-reload
[root@test2 ~]# systemctl restart kubelet
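
To verify the kubelet's health endpoint directly, you can run by hand the same check kubeadm performs; a healthy kubelet answers ok:

curl -sSL http://localhost:10248/healthz
# expected output: ok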

Running the init command again, the kubelet just kept logging messages like these:


Oct 14 10:32:59 ken kubelet: I1014 10:32:59.655343   17321 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Oct 14 10:33:09 ken kubelet: I1014 10:33:09.665511   17321 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Oct 14 10:33:19 ken kubelet: I1014 10:33:19.678178   17321 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Oct 14 10:33:29 ken kubelet: I1014 10:33:29.689093   17321 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Oct 14 10:33:39 ken kubelet: I1014 10:33:39.699769   17321 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Oct 14 10:33:49 ken kubelet: I1014 10:33:49.710420   17321 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach

Putting it all together, the root cause was most likely that the required images could not be pulled due to network restrictions in mainland China. My fix: pull the images from the Aliyun mirror, then use docker tag to rename them to the names the cluster expects, as sketched below.
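
A minimal sketch of that pull-and-retag loop, assuming the Aliyun mirror registry.aliyuncs.com/google_containers carries the same image names and tags (the list matches the kubeadm config images list output shown later):

#!/bin/bash
# Pull each required image from the Aliyun mirror, then retag it as k8s.gcr.io/<name>
MIRROR=registry.aliyuncs.com/google_containers
for img in kube-apiserver:v1.15.12 kube-controller-manager:v1.15.12 \
           kube-scheduler:v1.15.12 kube-proxy:v1.15.12 \
           pause:3.1 etcd:3.3.10 coredns:1.3.1; do
  docker pull "$MIRROR/$img"
  docker tag  "$MIRROR/$img" "k8s.gcr.io/$img"
  docker rmi  "$MIRROR/$img"   # drop the mirror-side tag; keep only k8s.gcr.io/*
done

After retagging, docker image ls shows exactly the names the cluster expects: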

[root@ken1 kube1.15.tar]# docker image ls
REPOSITORY                           TAG        IMAGE ID       CREATED         SIZE
k8s.gcr.io/kube-apiserver            v1.15.12   c81971987f04   17 months ago   207MB
k8s.gcr.io/kube-controller-manager   v1.15.12   7b4d4985877a   17 months ago   159MB
k8s.gcr.io/kube-proxy                v1.15.12   00206e1127f2   17 months ago   82.5MB
k8s.gcr.io/kube-scheduler            v1.15.12   196d53938faa   17 months ago   81.2MB
k8s.gcr.io/coredns                   1.3.1      eb516548c180   2 years ago     40.3MB
k8s.gcr.io/etcd                      3.3.10     2c4adeb21b4f   2 years ago     258MB
k8s.gcr.io/pause                     3.1        da86e6ba6ca1   3 years ago     742kB
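
Incidentally, the working directory name (kube1.15.tar) suggests the images may have been carried over as a tarball. For an offline node, docker save / docker load is the standard route; a sketch with a hypothetical file name:

docker save $(docker image ls --format '{{.Repository}}:{{.Tag}}' | grep '^k8s.gcr.io/') -o kube1.15-images.tar
docker load -i kube1.15-images.tar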

One last run: the preflight first failed fatally on swap, and after swapoff -a the init finally went through:

[root@ken1 kube1.15.tar]# kubeadm init
I1014 11:16:53.725840    8354 version.go:248] remote version is much newer: v1.22.2; falling back to: stable-1.15
[init] Using Kubernetes version: v1.15.12
[preflight] Running pre-flight checks
	[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
	[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
[root@ken1 kube1.15.tar]# swapo
swapoff  swapon   
[root@ken1 kube1.15.tar]# swapoff -a
[root@ken1 kube1.15.tar]# kubeadm init
I1014 11:17:11.049271    8486 version.go:248] remote version is much newer: v1.22.2; falling back to: stable-1.15
[init] Using Kubernetes version: v1.15.12
[preflight] Running pre-flight checks
	[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [ken1 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.227.13]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [ken1 localhost] and IPs [192.168.227.13 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [ken1 localhost] and IPs [192.168.227.13 127.0.0.1 ::1]
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 14.502272 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.15" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node ken1 as control-plane by adding the label "node-role.kubernetes.io/master=''"
[mark-control-plane] Marking the node ken1 as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[bootstrap-token] Using token: y57al1.hfb51a0tb90dhz6y
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.227.13:6443 --token y57al1.hfb51a0tb90dhz6y \
    --discovery-token-ca-cert-hash sha256:cab57b733b32f9c9da2bb1e9fa00a0b9a1822aabb8774ba42189bc696b42d9fa 
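
One caveat worth noting: swapoff -a only disables swap until the next reboot. To keep kubeadm's swap check happy permanently, the swap entry in /etc/fstab must also be commented out; a sketch (GNU sed, assuming an uncommented swap line):

swapoff -a
# comment out any active swap line so swap stays off after reboot
sed -ri '/^[^#].*\sswap\s/s/^/#/' /etc/fstab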


After the init succeeded there was one more hiccup. I deployed the flannel network with the following command:

 kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

Once that completed, the node still showed NotReady:

[root@ken1 ~]# kubectl get nodes
NAME   STATUS     ROLES    AGE   VERSION
ken1   NotReady   master   20m   v1.15.1

Checking the /var/log/messages log turned up related errors.

So pull down the yaml file and inspect it:

wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

You can see the manifest needs to pull an image:

...
      hostNetwork: true
      priorityClassName: system-node-critical
      tolerations:
      - operator: Exists
        effect: NoSchedule
      serviceAccountName: flannel
      initContainers:
      - name: install-cni
        image: quay.io/coreos/flannel:v0.14.0
        command:
        - cp
        args:
        - -f
        - /etc/kube-flannel/cni-conf.json
        - /etc/cni/net.d/10-flannel.conflist
....

But docker image ls showed the download had never completed, so pull the image manually:

[root@ken1 ~]# docker pull quay.io/coreos/flannel:v0.14.0

Then apply the manifest once more:

 kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
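
Before re-checking the node, you can watch the flannel and CoreDNS pods come up (plain kubectl; listing all namespaces avoids guessing which namespace the manifest version uses):

kubectl get pods --all-namespaces -w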

The node finally returned to Ready, and the k8s cluster deployment was complete:

[root@ken1 ~]# kubectl get nodes
NAME   STATUS   ROLES    AGE   VERSION
ken1   Ready    master   23m   v1.15.1


Related Deployment Commands

The matching kubectl, kubeadm, and kubelet rpm packages can be downloaded from the Tsinghua mirror or from Aliyun:

https://mirrors.tuna.tsinghua.edu.cn/kubernetes/yum/repos/kubernetes-el7-x86_64/Packages/

[root@ken kube]# ls
kubeadm-1.15.1-0.x86_64.rpm  kubectl-1.15.1-0.x86_64.rpm  kubelet-1.15.1-0.x86_64.rpm
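
A likely way to install the three packages together (assuming their dependencies, e.g. kubernetes-cni and socat, are reachable from a configured yum repo):

yum localinstall kubeadm-1.15.1-0.x86_64.rpm kubectl-1.15.1-0.x86_64.rpm kubelet-1.15.1-0.x86_64.rpm -y
systemctl enable kubelet   # kubelet crash-loops until kubeadm init runs; that is expected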


You can list exactly which images the installed kubeadm expects, so the matching versions get pulled:

[root@ken kube]# kubeadm  config images list
I1014 09:44:18.227292    7376 version.go:248] remote version is much newer: v1.22.2; falling back to: stable-1.15
k8s.gcr.io/kube-apiserver:v1.15.12
k8s.gcr.io/kube-controller-manager:v1.15.12
k8s.gcr.io/kube-scheduler:v1.15.12
k8s.gcr.io/kube-proxy:v1.15.12
k8s.gcr.io/pause:3.1
k8s.gcr.io/etcd:3.3.10
k8s.gcr.io/coredns:1.3.1
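
Rather than retagging by hand, the same mirror can also be handed to kubeadm at pull time; this is what the --image-repository flag on the init command below does:

kubeadm config images pull --image-repository registry.aliyuncs.com/google_containers --kubernetes-version v1.15.12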

And when initializing the cluster, pass the matching version explicitly:

[root@ken kube]#  kubeadm init --image-repository registry.aliyuncs.com/google_containers --kubernetes-version v1.15.12 --pod-network-cidr=10.244.0.0/16

