Kubernetes Pod health checks

Pod health-check mechanism

Kubernetes provides two kinds of probes for checking the health of a Pod:

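  • LivenessProbe:
    Determines whether the container is still alive. If the liveness probe finds the container unhealthy, the kubelet kills the container and the Pod's restart policy decides whether it is restarted. If a container does not define a liveness probe, the kubelet treats the probe as always succeeding.
  • ReadinessProbe:
    Determines whether the container has finished starting and can accept requests, that is, whether its Ready condition is True. If the readiness probe fails, Ready becomes False and the controller removes the Pod's endpoint from the Endpoints list of the matching Service. No requests are routed to the Pod until a later probe succeeds.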

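Each kind of probe supports three check methods:

  • ExecAction: runs a command inside the container to check whether the service is working. It suits complex checks or services without an HTTP endpoint. An exit code of 0 means the container is healthy.
  • HTTPGetAction: sends an HTTP request to check whether the service is working. A status code from 200 to 399 means the container is healthy.
  • TCPSocketAction: performs a TCP check against the container's IP and port. If a TCP connection can be established, the container is healthy.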

A probe produces one of three results:

  • Success: the container passed the check.
  • Failure: the container failed the check.
  • Unknown: the check could not be run, so no action is taken.

LivenessProbe configuration

Example 1: health check via exec

  • exec-liveness.yaml
    $ cat > exec-liveness.yaml <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        test: liveness
      name: liveness-exec
    spec:
      containers:
      - name: liveness
        image: busybox
        args:
        - /bin/sh
        - -c
        - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
        livenessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
          initialDelaySeconds: 5
          periodSeconds: 5
    EOF

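In this manifest the container gets a livenessProbe. The periodSeconds field tells the kubelet to run the check every 5 seconds, and the check itself is the command cat /tmp/healthy. The initialDelaySeconds field tells the kubelet to wait 5 seconds before the first check. If the command succeeds it returns 0 and the kubelet considers the container healthy. If it returns non-zero (for failureThreshold consecutive checks, 3 by default), the kubelet kills the container and the restart policy decides whether it is restarted.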

  • When the container starts, it runs the following command
    /bin/sh -c "touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600"

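For the first 30 seconds of the container's life the file /tmp/healthy exists, so during that time cat /tmp/healthy returns a success code. After 30 seconds, cat /tmp/healthy returns a failure code.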

  • Create the Pod

    $ kubectl create -f exec-liveness.yaml
    pod/liveness-exec created
  • Within the first 30 seconds, view the Pod events

    $ kubectl describe pod liveness-exec
    …………
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 360s
                     node.kubernetes.io/unreachable:NoExecute for 360s
    Events:
      Type    Reason     Age   From                   Message
      ----    ------     ----  ----                   -------
      Normal  Scheduled  23s   default-scheduler      Successfully assigned default/liveness-exec to 172.21.17.34
      Normal  Pulling    20s   kubelet, 172.21.17.34  Pulling image "busybox"
      Normal  Pulled     2s    kubelet, 172.21.17.34  Successfully pulled image "busybox"
      Normal  Created    2s    kubelet, 172.21.17.34  Created container liveness
      Normal  Started    1s    kubelet, 172.21.17.34  Started container liveness
  • After 35 seconds, view the Pod events again

    $ kubectl describe pod liveness-exec
    …………
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 360s
                     node.kubernetes.io/unreachable:NoExecute for 360s
    Events:
      Type     Reason     Age              From                   Message
      ----     ------     ----             ----                   -------
      Normal   Scheduled  58s              default-scheduler      Successfully assigned default/liveness-exec to 172.21.17.34
      Normal   Pulling    55s              kubelet, 172.21.17.34  Pulling image "busybox"
      Normal   Pulled     37s              kubelet, 172.21.17.34  Successfully pulled image "busybox"
      Normal   Created    37s              kubelet, 172.21.17.34  Created container liveness
      Normal   Started    36s              kubelet, 172.21.17.34  Started container liveness
      Warning  Unhealthy  0s (x2 over 5s)  kubelet, 172.21.17.34  Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
  • Wait another 30 seconds and confirm that the container has been restarted

    $ kubectl get pod liveness-exec
    NAME            READY   STATUS    RESTARTS   AGE
    liveness-exec   1/1     Running   1          115s

    $ kubectl describe pod liveness-exec
    ………………
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 360s
                     node.kubernetes.io/unreachable:NoExecute for 360s
    Events:
      Type     Reason     Age                 From                   Message
      ----     ------     ----                ----                   -------
      Normal   Scheduled  2m7s                default-scheduler      Successfully assigned default/liveness-exec to 172.21.17.34
      Warning  Unhealthy  64s (x3 over 74s)   kubelet, 172.21.17.34  Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
      Normal   Killing    64s                 kubelet, 172.21.17.34  Container liveness failed liveness probe, will be restarted
      Normal   Pulling    34s (x2 over 2m4s)  kubelet, 172.21.17.34  Pulling image "busybox"
      Normal   Pulled     25s (x2 over 106s)  kubelet, 172.21.17.34  Successfully pulled image "busybox"
      Normal   Created    25s (x2 over 106s)  kubelet, 172.21.17.34  Created container liveness
      Normal   Started    25s (x2 over 105s)  kubelet, 172.21.17.34  Started container liveness

Example 2: health check via HTTP

$ cat > http-liveness.yaml <<EOF
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: carlziess/liveness
    args:
    - /server
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 3
      periodSeconds: 3
EOF

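This creates a Pod in which periodSeconds tells the kubelet to probe every 3 seconds and initialDelaySeconds tells it to wait 3 seconds before the first probe. The probe sends an HTTP GET request to the service running in the container, asking for /healthz on port 8080. Any status code greater than or equal to 200 and less than 400 means success. Any other code means failure.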

  • Create the Pod

    $ kubectl apply -f http-liveness.yaml
    pod/liveness-http created
  • Verify

    $ kubectl describe pod liveness-http
    ………………
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 360s
                     node.kubernetes.io/unreachable:NoExecute for 360s
    Events:
      Type    Reason     Age                   From                   Message
      ----    ------     ----                  ----                   -------
      Normal  Scheduled  2m59s                 default-scheduler      Successfully assigned default/liveness-http to 172.21.17.34
      Normal  Pulled     119s (x3 over 2m46s)  kubelet, 172.21.17.34  Successfully pulled image "carlziess/liveness"
      Normal  Created    119s (x3 over 2m46s)  kubelet, 172.21.17.34  Created container liveness
      Normal  Started    118s (x3 over 2m45s)  kubelet, 172.21.17.34  Started container liveness

    $ kubectl get pod
    NAME            READY   STATUS    RESTARTS   AGE
    liveness-http   1/1     Running   0          26s
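  • The httpGet method has the following optional fields

    • host: host name to connect to, defaults to the Pod IP. The Host header can instead be set through the HTTP request headers.
    • scheme: protocol used to connect to the host, defaults to HTTP.
    • path: path to request on the HTTP server.
    • httpHeaders: custom HTTP request headers. HTTP allows repeated headers.
    • port: number or name of the container port to access.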
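As a rough illustration of how these fields fit together, the fragment below sets most of them on a single probe. It is only a sketch: the header values and host name are made up, and the port name assumes the container declares a port called liveness-port. The host field is left out on purpose, since setting a Host header through httpHeaders is usually the better choice.

```yaml
# Hypothetical container-level fragment; values are illustrative only.
livenessProbe:
  httpGet:
    scheme: HTTP            # HTTP (the default) or HTTPS
    path: /healthz          # path requested on the HTTP server
    port: liveness-port     # a port number, or the name of a declared containerPort (assumed here)
    httpHeaders:            # extra request headers; repeated names are allowed
    - name: Host
      value: health.example.internal   # assumed virtual-host name
    - name: X-Custom-Header
      value: Awesome
  initialDelaySeconds: 3
  periodSeconds: 3
```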

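Example 3: health check via TCP

The kubelet tries to open a socket to the container on the specified port. If the connection can be established, the container is considered healthy.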

$ cat > tcp-liveness-readiness.yaml <<EOF
---
apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
  - name: goproxy
    image: goproxy/goproxy
    ports:
    - containerPort: 8080
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
EOF

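A TCP check is configured much like an HTTP check. This example uses both kinds of probe. Five seconds after the container starts, the kubelet sends the first readiness probe, which connects to port 8080 of the container. If the probe succeeds, the Pod is marked ready. The kubelet then repeats the check every 10 seconds.
The configuration also contains a liveness probe. Fifteen seconds after the container starts, the kubelet sends the first liveness probe, again trying to connect to port 8080 of the container. If the liveness probe fails, the container is restarted.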

  • Create the Pod

    $ kubectl apply -f tcp-liveness-readiness.yaml
    pod/goproxy created
  • After 15 seconds, view the Pod events to verify the liveness probe

    $ kubectl describe pod goproxy
    ………………
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 360s
                     node.kubernetes.io/unreachable:NoExecute for 360s
    Events:
      Type    Reason     Age   From                    Message
      ----    ------     ----  ----                    -------
      Normal  Scheduled  26s   default-scheduler       Successfully assigned default/goproxy to 172.21.16.231
      Normal  Pulling    22s   kubelet, 172.21.16.231  Pulling image "goproxy/goproxy"

When a container exposes several ports, each port is usually given a name, so a probe can also refer to a port by its custom name:

ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080
livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port

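ReadinessProbe configuration

The readiness probe covers a slightly different case from the liveness probe. Sometimes an application is temporarily unable to accept requests. For example, the Pod may already be Running while the application inside the container has not finished starting. Without a readiness probe, Kubernetes assumes the Pod can handle requests, even though we know the program is not up yet and cannot serve users. A readiness probe keeps Kubernetes from routing requests to the Pod in that state.
Readiness and liveness probes can use the same check methods. They differ only in what happens to the Pod: a failed readiness probe removes the Pod's IP:Port from the matching Endpoints list, while a failed liveness probe kills the container and leaves the next step to the Pod's restart policy.
The readiness probe checks whether the container is ready. If it is not, Kubernetes does not forward traffic to the Pod.

Like the liveness probe, the readiness probe supports the exec, httpGet and tcpSocket methods. The configuration is the same, except that the field is named readinessProbe instead of livenessProbe.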

readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5

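The HTTP and TCP methods of the readiness probe are likewise configured in essentially the same way as for the liveness probe.

Example 4: ReadinessProbe example

This example compares the same workload without and with a readiness probe. It creates a Deployment (named biz-gateway in the manifests below) whose container runs a Java application listening on port 9093.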

  • Without a readinessProbe

    $ cat > k8s.yaml << EOF
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: biz-gateway
      labels:
        app: biz-gateway
      namespace:
    spec:
      ports:
      - port: 9093
        name: biz-gateway
      selector:
        app: biz-gateway
    ---
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: biz-gateway
      namespace:
    spec:
      replicas: 1
      template:
        metadata:
          labels:
            app: biz-gateway
        spec:
          containers:
          - name: biz-gateway
            image: docker.io/xxlaila/biz-gateway:dev-08c8a4e
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 9093
            env:
            - name: RUN_ENV
              value: dev
            - name: CONFIG_API_SERVER
              value: http://api.conf.xxlaila.cn
            - name: RUN_CLUSTER
              value: default
            - name: RUN_MODE
              value: AUTO
    EOF
  • Create the resources

    $ kubectl apply -f k8s.yaml
    service/biz-gateway created
    deployment.extensions/biz-gateway created
  • Wait a little after creation, then check the Pod status (remember to allow time for the image pull)

    $ kubectl get pods | grep "biz-gateway"
    NAME                          READY   STATUS    RESTARTS   AGE
    biz-gateway-95f6b677f-rnz22   1/1     Running   0          2m8s

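The Pod is Running, and its READY column shows 1/1, meaning its one container is reported ready. As far as Kubernetes is concerned, the Pod can receive requests as soon as the container is up. In reality the service cannot be reached yet, because the Java program has not finished starting. In this run it only became reachable about 2m8s after creation. Programs of this kind should therefore be given a readinessProbe.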

  • With a readinessProbe
    $ cat > k8s.yaml << EOF
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: biz-gateway
      labels:
        app: biz-gateway
      namespace:
    spec:
      ports:
      - protocol: TCP
        port: 9093
        name: biz-gateway
      selector:
        app: biz-gateway
    ---
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: biz-gateway
      namespace:
    spec:
      replicas: 1
      template:
        metadata:
          labels:
            app: biz-gateway
        spec:
          containers:
          - name: biz-gateway
            image: docker.io/xxlaila/biz-gateway:dev-08c8a4e
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 9093
            readinessProbe:
              tcpSocket:
                port: 9093
              initialDelaySeconds: 140
              periodSeconds: 10
            env:
            - name: RUN_ENV
              value: dev
            - name: CONFIG_API_SERVER
              value: http://api.conf.xxlaila.cn
            - name: RUN_CLUSTER
              value: default
            - name: RUN_MODE
              value: AUTO
    EOF

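In this manifest the readiness probe uses the tcpSocket method. Because the program listens on port 9093, the probe tries to open a connection to 9093. The first probe runs 140 seconds after the container starts, and later probes follow at 10-second intervals.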
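Because a failing readiness probe only keeps the Pod out of the Service endpoints and never restarts the container, a long fixed delay is not the only option. The fragment below is an alternative sketch, not the author's configuration: it starts probing early and lets the probe keep failing until the port opens. The 20-second delay is an assumed value.

```yaml
# Alternative sketch (assumed values): probe early, become Ready once the port accepts connections.
readinessProbe:
  tcpSocket:
    port: 9093
  initialDelaySeconds: 20   # assumption: skip only the first seconds of JVM startup
  periodSeconds: 10         # retry every 10s; a failure just leaves the Pod not ready
  timeoutSeconds: 1         # the default timeout, written out for clarity
```

With a fixed 140-second delay the Pod cannot become Ready any earlier, even if the application happens to start faster. Probing early removes that floor, at the cost of some Unhealthy warning events while the application is still starting.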

  • Create the resources

    $ kubectl apply -f ./
    service/biz-gateway created
    deployment.extensions/biz-gateway created
  • Verify

    # 60s after creation
    $ kubectl get pod -o wide
    NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
    biz-gateway-f69cc8678-qs8s7   0/1     Running   0          60s   172.30.56.6   172.21.17.40   <none>           <none>

    # After waiting a while longer
    $ kubectl get pod -o wide
    NAME                          READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
    biz-gateway-f69cc8678-qs8s7   1/1     Running   0          2m36s   172.30.56.6   172.21.17.40   <none>           <none>

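By 2m36s the Pod is fully up. At the first check the Pod was already Running, but the first probe had not yet run, so READY showed 0/1: the container was not ready. While it is not ready, the Endpoints of the Pod's Service stay empty, so no requests are routed to it.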

  • Check the Endpoints
    # First run
    $ kubectl get endpoints
    NAME          ENDPOINTS                                                AGE
    biz-gateway                                                            57s
    kubernetes    172.21.16.110:6443,172.21.17.30:6443,172.21.17.31:6443   13d

    # Run again after 2m36s
    $ kubectl get endpoints
    NAME          ENDPOINTS                                                AGE
    biz-gateway   172.30.56.6:9093                                         2m41s
    kubernetes    172.21.16.110:6443,172.21.17.30:6443,172.21.17.31:6443   13d

Configuring probe fields

Probes have a number of optional fields that give finer control over the behavior of liveness and readiness probes:

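  • initialDelaySeconds: how long to wait after the container starts before the first check, in seconds.
  • periodSeconds: interval between checks, in seconds. Defaults to 10.
  • timeoutSeconds: time after which a probe times out, in seconds. Defaults to 1.
  • successThreshold: minimum number of consecutive successes for the probe to count as successful again after a failure. Defaults to 1, must be 1 for liveness probes, minimum value 1.
  • failureThreshold: number of consecutive failed probes after which the check is considered failed. For a liveness probe the container is then restarted. For a readiness probe the Pod is marked not ready. Defaults to 3, minimum value 1.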
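To make the interaction of these fields concrete, here is a sketch of a liveness probe with every field written out. The values are assumptions chosen to keep the arithmetic simple, and the path and port are hypothetical.

```yaml
# Hypothetical fragment: all timing fields set explicitly.
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080              # assumed port
  initialDelaySeconds: 10   # wait 10s after the container starts before the first probe
  periodSeconds: 5          # probe every 5s
  timeoutSeconds: 2         # a probe with no answer within 2s counts as a failure
  successThreshold: 1       # must be 1 for liveness probes
  failureThreshold: 3       # restart only after 3 consecutive failures
```

With these values the third consecutive failure arrives about 10 to 15 seconds after the container becomes unhealthy: up to one period until the first failed probe, then two more periods. This is consistent with the "x3" count on the Unhealthy event in the exec example earlier, where the default failureThreshold of 3 applies.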

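A troubleshooting note from an earlier problem: when I installed Jenkins a while ago, the pod stayed in Running after creation, but after a while the web UI reported an error, as in the screenshot below.
[Image: Jenkins error screenshot]
I then checked the pod log (shown below) and the system log and found nothing wrong. I asked a friend, who suggested it might be the Pod health check. After I changed the Pod's health-check settings, the Jenkins server deployed fine.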

kubectl log $(kubectl get pods -n kube-ops | awk '{print $1}' | grep jenkins) -n kube-ops
log is DEPRECATED and will be removed in a future version. Use logs instead.
VM settings:
    Max. Heap Size: 3.00G
    Ergonomics Machine Class: server
    Using VM: OpenJDK 64-Bit Server VM

Running from: /usr/share/jenkins/jenkins.war
webroot: EnvVars.masterEnvVars.get("JENKINS_HOME")
2019-09-27 03:02:24.133+0000 [id=1] INFO org.eclipse.jetty.util.log.Log#initialized: Logging initialized @429ms to org.eclipse.jetty.util.log.JavaUtilLog
2019-09-27 03:02:24.247+0000 [id=1] INFO winstone.Logger#logInternal: Beginning extraction from war file
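The post does not record which probe settings were changed. One common adjustment for a slow-starting Java application such as Jenkins is to give the probes more time, and the fragment below is only a sketch under that assumption. The delays and thresholds are made-up values, and the /login path on port 8080 assumes a default Jenkins setup.

```yaml
# Hypothetical probe settings for a slow-starting Jenkins container (not the author's actual change).
livenessProbe:
  httpGet:
    path: /login            # assumed: reachable without authentication on a default Jenkins
    port: 8080              # assumed: default Jenkins HTTP port
  initialDelaySeconds: 120  # assumed: leave time for war extraction and plugin loading
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6       # tolerate about a minute of failed probes before a restart
readinessProbe:
  httpGet:
    path: /login
    port: 8080
  initialDelaySeconds: 60   # assumed value
  periodSeconds: 10
  timeoutSeconds: 5
```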

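Follow-up: health checks can be left out, but they should still be added before going to production, as Example 4 shows. In production, self-healing and rolling updates both depend on them. Suppose that during an upgrade a new server has not come up yet, and k8s already routes traffic to it and moves on to upgrade the next pod. External users would then get errors, which is embarrassing.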
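One way to picture the rolling-update case: a Deployment only counts a new Pod as available once its readiness probe passes, so the rollout strategy can be told never to take an old Pod away before its replacement is ready. The fragment below is a sketch with assumed values that reuses port 9093 and the image from the biz-gateway example. It is not a complete manifest (metadata, labels and selector are omitted).

```yaml
# Sketch of Deployment fields (assumed values) that make a rollout wait for readiness.
spec:
  replicas: 2
  minReadySeconds: 10       # a new Pod must stay Ready for 10s before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # start one extra Pod at a time
      maxUnavailable: 0     # never remove an old Pod until its replacement is available
  template:
    spec:
      containers:
      - name: biz-gateway
        image: docker.io/xxlaila/biz-gateway:dev-08c8a4e
        readinessProbe:
          tcpSocket:
            port: 9093
          periodSeconds: 10
```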
