#kubernetes 如何调试k8s集群启动失败的应用

调试是每个程序员都躲不过去的必修课

任何应用的开发过程,都不总会一帆风顺,那么,怎么调试,就是一个很重要的问题。

对于k8s集群,即使是按照文档一步步去部署一个很成熟的服务时,依然可能会出现各种各样的错误。

例如,最近在按照文档部署elk时,就出现过各种各样的问题。本文以此为例,示范如何进行调试,查找错误原因。

1. 查看pod列表

kubectl get pods -n <namespace>

示例

kubectl get pod -n k8s-logging

NAME READY STATUS RESTARTS AGE es-logging-es-default-0 0/1 Init:CrashLoopBackOff 7 14m

2. 查看pod的详细信息

可以用以下命令查看失败状态的pod的详细信息:

kubectl describe pod <pod name> -n <namespace>

该命令会输出pod的Events列表,可以看到该pod运行过程中的相关事件。

有时候,这个Events列表中就已经包含了详细的错误原因。

示例

kubectl describe pod es-logging-es-default-0 -n k8s-logging

Name: es-logging-es-default-0 Namespace: k8s-logging Priority: 0 Node: k8s-node-02/192.168.1.15 Start Time: Fri, 14 May 2021 02:35:49 +0000 Labels: common.k8s.elastic.co/type=elasticsearch controller-revision-hash=es-logging-es-default-7ffcbbf5 elasticsearch.k8s.elastic.co/cluster-name=es-logging elasticsearch.k8s.elastic.co/config-hash=1754400308 elasticsearch.k8s.elastic.co/http-scheme=https elasticsearch.k8s.elastic.co/node-data=true elasticsearch.k8s.elastic.co/node-ingest=true elasticsearch.k8s.elastic.co/node-master=true elasticsearch.k8s.elastic.co/node-ml=true elasticsearch.k8s.elastic.co/node-remote_cluster_client=true elasticsearch.k8s.elastic.co/node-transform=true elasticsearch.k8s.elastic.co/node-voting_only=false elasticsearch.k8s.elastic.co/statefulset-name=es-logging-es-default elasticsearch.k8s.elastic.co/version=7.12.1 statefulset.kubernetes.io/pod-name=es-logging-es-default-0 Annotations: co.elastic.logs/module: elasticsearch update.k8s.elastic.co/timestamp: 2021-05-14T02:35:52.442769569Z Status: Pending IP: 10.244.2.114 IPs: IP: 10.244.2.114 Controlled By: StatefulSet/es-logging-es-default Init Containers: elastic-internal-init-filesystem: Container ID: docker://ef9155f4c23f12cc95baf4eb56256497ef6495355c0c2a87adfd4c8973686855 Image: docker.elastic.co/elasticsearch/elasticsearch:7.12.1 Image ID: docker-pullable://docker.elastic.co/elasticsearch/elasticsearch@sha256:561bf27aa989803bfbac48ebd48e32daadb4215cf7940c599a62c13f225427fa Port: <none> Host Port: <none> Command: bash -c /mnt/elastic-internal/scripts/prepare-fs.sh State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Exit Code: 1 Started: Fri, 14 May 2021 02:56:55 +0000 Finished: Fri, 14 May 2021 02:56:55 +0000 Ready: False Restart Count: 9 Limits: cpu: 100m memory: 50Mi Requests: cpu: 100m memory: 50Mi Environment: POD_IP: (v1:status.podIP) POD_NAME: es-logging-es-default-0 (v1:metadata.name) NODE_NAME: (v1:spec.nodeName) NAMESPACE: k8s-logging (v1:metadata.namespace) HEADLESS_SERVICE_NAME: es-logging-es-default Mounts: /mnt/elastic-internal/downward-api from downward-api (ro) /mnt/elastic-internal/elasticsearch-bin-local from elastic-internal-elasticsearch-bin-local (rw) /mnt/elastic-internal/elasticsearch-config from elastic-internal-elasticsearch-config (ro) /mnt/elastic-internal/elasticsearch-config-local from elastic-internal-elasticsearch-config-local (rw) /mnt/elastic-internal/elasticsearch-plugins-local from elastic-internal-elasticsearch-plugins-local (rw) /mnt/elastic-internal/probe-user from elastic-internal-probe-user (ro) /mnt/elastic-internal/scripts from elastic-internal-scripts (ro) /mnt/elastic-internal/transport-certificates from elastic-internal-transport-certificates (ro) /mnt/elastic-internal/unicast-hosts from elastic-internal-unicast-hosts (ro) /mnt/elastic-internal/xpack-file-realm from elastic-internal-xpack-file-realm (ro) /usr/share/elasticsearch/config/http-certs from elastic-internal-http-certificates (ro) /usr/share/elasticsearch/config/transport-remote-certs/ from elastic-internal-remote-certificate-authorities (ro) /usr/share/elasticsearch/data from elasticsearch-data (rw) /usr/share/elasticsearch/logs from elasticsearch-logs (rw) Containers: elasticsearch: Container ID: Image: docker.elastic.co/elasticsearch/elasticsearch:7.12.1 Image ID: Ports: 9200/TCP, 9300/TCP Host Ports: 0/TCP, 0/TCP State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Limits: memory: 2Gi Requests: memory: 2Gi Readiness: exec [bash -c /mnt/elastic-internal/scripts/readiness-probe-script.sh] delay=10s timeout=5s period=5s #success=1 #failure=3 Environment: POD_IP: (v1:status.podIP) POD_NAME: es-logging-es-default-0 (v1:metadata.name) NODE_NAME: (v1:spec.nodeName) NAMESPACE: k8s-logging (v1:metadata.namespace) PROBE_PASSWORD_PATH: /mnt/elastic-internal/probe-user/elastic-internal-probe PROBE_USERNAME: elastic-internal-probe READINESS_PROBE_PROTOCOL: https HEADLESS_SERVICE_NAME: es-logging-es-default NSS_SDB_USE_CACHE: no Mounts: /mnt/elastic-internal/downward-api from downward-api (ro) /mnt/elastic-internal/elasticsearch-config from elastic-internal-elasticsearch-config (ro) /mnt/elastic-internal/probe-user from elastic-internal-probe-user (ro) /mnt/elastic-internal/scripts from elastic-internal-scripts (ro) /mnt/elastic-internal/unicast-hosts from elastic-internal-unicast-hosts (ro) /mnt/elastic-internal/xpack-file-realm from elastic-internal-xpack-file-realm (ro) /usr/share/elasticsearch/bin from elastic-internal-elasticsearch-bin-local (rw) /usr/share/elasticsearch/config from elastic-internal-elasticsearch-config-local (rw) /usr/share/elasticsearch/config/http-certs from elastic-internal-http-certificates (ro) /usr/share/elasticsearch/config/transport-certs from elastic-internal-transport-certificates (ro) /usr/share/elasticsearch/config/transport-remote-certs/ from elastic-internal-remote-certificate-authorities (ro) /usr/share/elasticsearch/data from elasticsearch-data (rw) /usr/share/elasticsearch/logs from elasticsearch-logs (rw) /usr/share/elasticsearch/plugins from elastic-internal-elasticsearch-plugins-local (rw) Conditions: Type Status Initialized False Ready False ContainersReady False PodScheduled True Volumes: elasticsearch-data: Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace) ClaimName: elasticsearch-data-es-logging-es-default-0 ReadOnly: false downward-api: Type: DownwardAPI (a volume populated by information about the pod) Items: metadata.labels -> labels elastic-internal-elasticsearch-bin-local: Type: EmptyDir (a temporary directory that shares a pod’s lifetime) Medium: SizeLimit: <unset> elastic-internal-elasticsearch-config: Type: Secret (a volume populated by a Secret) SecretName: es-logging-es-default-es-config Optional: false elastic-internal-elasticsearch-config-local: Type: EmptyDir (a temporary directory that shares a pod’s lifetime) Medium: SizeLimit: <unset> elastic-internal-elasticsearch-plugins-local: Type: EmptyDir (a temporary directory that shares a pod’s lifetime) Medium: SizeLimit: <unset> elastic-internal-http-certificates: Type: Secret (a volume populated by a Secret) SecretName: es-logging-es-http-certs-internal Optional: false elastic-internal-probe-user: Type: Secret (a volume populated by a Secret) SecretName: es-logging-es-internal-users Optional: false elastic-internal-remote-certificate-authorities: Type: Secret (a volume populated by a Secret) SecretName: es-logging-es-remote-ca Optional: false elastic-internal-scripts: Type: ConfigMap (a volume populated by a ConfigMap) Name: es-logging-es-scripts Optional: false elastic-internal-transport-certificates: Type: Secret (a volume populated by a Secret) SecretName: es-logging-es-default-es-transport-certs Optional: false elastic-internal-unicast-hosts: Type: ConfigMap (a volume populated by a ConfigMap) Name: es-logging-es-unicast-hosts Optional: false elastic-internal-xpack-file-realm: Type: Secret (a volume populated by a Secret) SecretName: es-logging-es-xpack-file-realm Optional: false elasticsearch-logs: Type: EmptyDir (a temporary directory that shares a pod’s lifetime) Medium: SizeLimit: <unset> QoS Class: Burstable Node-Selectors: <none> Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: Type Reason Age From Message


Warning FailedScheduling 24m default-scheduler 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims. Normal Scheduled 23m default-scheduler Successfully assigned k8s-logging/es-logging-es-default-0 to k8s-node-02 Normal Pulled 22m (x5 over 23m) kubelet Container image “docker.elastic.co/elasticsearch/elasticsearch:7.12.1” already present on machine Normal Created 22m (x5 over 23m) kubelet Created container elastic-internal-init-filesystem Normal Started 22m (x5 over 23m) kubelet Started container elastic-internal-init-filesystem Warning BackOff 3m57s (x92 over 23m) kubelet Back-off restarting failed container

从示例可以看到,该pod最后的错误警告是:

Warning  BackOff           3m57s (x92 over 23m)  kubelet            Back-off restarting failed container

很不幸,Events事件列表中,只包含的比较简单的信息:容器启动失败。

3. 查看pod日志

当事件列表中找不到详细错误时,需要查看pod的详细日志来定位:

kubectl logs <pod name> -n <namespace>

示例

kubectl logs -n k8s-logging es-logging-es-default-0

Error from server (BadRequest): container “elasticsearch” in pod “es-logging-es-default-0” is waiting to start: PodInitializing

表示该pod的默认container是elasticsearch,而它还没有初始化成功,所以没有运行日志。

说明出错的不是默认的container。

4. 查看对应container的日志

这时候需要查看特定contaienr的日志:

kubectl logs <pod name> -n <namespace> -c <container>

示例

从刚刚pod的详情里,找到Init Containers的列表。

示例中,Init Containers只有一个elastic-internal-init-filesystem,这个信息与Events列表中也是一致的。

k8s-debug-01-describe-pod

kubectl logs -c elastic-internal-init-filesystem -n k8s-logging es-logging-es-default-0

以下是输出日志

Starting init script Linking /mnt/elastic-internal/xpack-file-realm/users to /usr/share/elasticsearch/config/users Linking /mnt/elastic-internal/xpack-file-realm/roles.yml to /usr/share/elasticsearch/config/roles.yml Linking /mnt/elastic-internal/xpack-file-realm/users_roles to /usr/share/elasticsearch/config/users_roles Linking /mnt/elastic-internal/elasticsearch-config/elasticsearch.yml to /usr/share/elasticsearch/config/elasticsearch.yml Linking /mnt/elastic-internal/unicast-hosts/unicast_hosts.txt to /usr/share/elasticsearch/config/unicast_hosts.txt File linking duration: 0 sec. Copying /usr/share/elasticsearch/config/* to /mnt/elastic-internal/elasticsearch-config-local/ removed ‘/mnt/elastic-internal/elasticsearch-config-local/elasticsearch.yml’ ‘/usr/share/elasticsearch/config/elasticsearch.yml’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/elasticsearch.yml’ ‘/usr/share/elasticsearch/config/http-certs/..2021_05_14_02_35_50.501721750/ca.crt’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2021_05_14_02_35_50.501721750/ca.crt’ ‘/usr/share/elasticsearch/config/http-certs/..2021_05_14_02_35_50.501721750/tls.crt’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2021_05_14_02_35_50.501721750/tls.crt’ ‘/usr/share/elasticsearch/config/http-certs/..2021_05_14_02_35_50.501721750/tls.key’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2021_05_14_02_35_50.501721750/tls.key’ removed ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/ca.crt’ ‘/usr/share/elasticsearch/config/http-certs/ca.crt’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/ca.crt’ removed ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.crt’ ‘/usr/share/elasticsearch/config/http-certs/tls.crt’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.crt’ removed ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.key’ ‘/usr/share/elasticsearch/config/http-certs/tls.key’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.key’ removed ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/..data’ ‘/usr/share/elasticsearch/config/http-certs/..data’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/http-certs/..data’ ‘/usr/share/elasticsearch/config/jvm.options’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/jvm.options’ ‘/usr/share/elasticsearch/config/log4j2.file.properties’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/log4j2.file.properties’ ‘/usr/share/elasticsearch/config/log4j2.properties’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/log4j2.properties’ ‘/usr/share/elasticsearch/config/role_mapping.yml’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/role_mapping.yml’ removed ‘/mnt/elastic-internal/elasticsearch-config-local/roles.yml’ ‘/usr/share/elasticsearch/config/roles.yml’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/roles.yml’ ‘/usr/share/elasticsearch/config/transport-remote-certs/..2021_05_14_02_35_50.623420157/ca.crt’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..2021_05_14_02_35_50.623420157/ca.crt’ removed ‘/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/ca.crt’ ‘/usr/share/elasticsearch/config/transport-remote-certs/ca.crt’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/ca.crt’ removed ‘/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..data’ ‘/usr/share/elasticsearch/config/transport-remote-certs/..data’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..data’ removed ‘/mnt/elastic-internal/elasticsearch-config-local/unicast_hosts.txt’ ‘/usr/share/elasticsearch/config/unicast_hosts.txt’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/unicast_hosts.txt’ removed ‘/mnt/elastic-internal/elasticsearch-config-local/users’ ‘/usr/share/elasticsearch/config/users’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/users’ removed ‘/mnt/elastic-internal/elasticsearch-config-local/users_roles’ ‘/usr/share/elasticsearch/config/users_roles’ -> ‘/mnt/elastic-internal/elasticsearch-config-local/users_roles’ Empty dir /usr/share/elasticsearch/plugins Copying /usr/share/elasticsearch/bin/* to /mnt/elastic-internal/elasticsearch-bin-local/ ‘/usr/share/elasticsearch/bin/elasticsearch’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch’ ‘/usr/share/elasticsearch/bin/elasticsearch-certgen’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-certgen’ ‘/usr/share/elasticsearch/bin/elasticsearch-certutil’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-certutil’ ‘/usr/share/elasticsearch/bin/elasticsearch-cli’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-cli’ ‘/usr/share/elasticsearch/bin/elasticsearch-croneval’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-croneval’ ‘/usr/share/elasticsearch/bin/elasticsearch-env’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-env’ ‘/usr/share/elasticsearch/bin/elasticsearch-env-from-file’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-env-from-file’ ‘/usr/share/elasticsearch/bin/elasticsearch-keystore’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-keystore’ ‘/usr/share/elasticsearch/bin/elasticsearch-migrate’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-migrate’ ‘/usr/share/elasticsearch/bin/elasticsearch-node’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-node’ ‘/usr/share/elasticsearch/bin/elasticsearch-plugin’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-plugin’ ‘/usr/share/elasticsearch/bin/elasticsearch-saml-metadata’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-saml-metadata’ ‘/usr/share/elasticsearch/bin/elasticsearch-setup-passwords’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-setup-passwords’ ‘/usr/share/elasticsearch/bin/elasticsearch-shard’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-shard’ ‘/usr/share/elasticsearch/bin/elasticsearch-sql-cli’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-sql-cli’ ‘/usr/share/elasticsearch/bin/elasticsearch-sql-cli-7.12.1.jar’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-sql-cli-7.12.1.jar’ ‘/usr/share/elasticsearch/bin/elasticsearch-syskeygen’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-syskeygen’ ‘/usr/share/elasticsearch/bin/elasticsearch-users’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-users’ ‘/usr/share/elasticsearch/bin/x-pack-env’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/x-pack-env’ ‘/usr/share/elasticsearch/bin/x-pack-security-env’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/x-pack-security-env’ ‘/usr/share/elasticsearch/bin/x-pack-watcher-env’ -> ‘/mnt/elastic-internal/elasticsearch-bin-local/x-pack-watcher-env’ Files copy duration: 0 sec. chowning /usr/share/elasticsearch/data to elasticsearch:elasticsearch chown: changing ownership of ‘/usr/share/elasticsearch/data’: Operation not permitted failed to change ownership of ‘/usr/share/elasticsearch/data’ from 1024:users to elasticsearch:elasticsearch

在日志的最后,可以看到出错原因:

failed to change ownership of '/usr/share/elasticsearch/data' from 1024:users to elasticsearch:elasticsearch

k8s-debug-02-pod-container-log

根据对应的错误,查找原因即可。

更多调试方法

https://kubernetes.io/zh/docs/tasks/debug-application-cluster/debug-running-pod/