K8s is not killing my Airflow webserver pod

I have Airflow running in a Kubernetes container.

The webserver ran into a DNS error (it could not translate my database's URL to an IP address) and the gunicorn workers died.

What troubles me is that K8s is not trying to kill the pod and start a new one in its place.

Pod log output:

    OperationalError: (psycopg2.OperationalError) could not translate host name "my.dbs.url" to address: Temporary failure in name resolution
    [2017-12-01 06:06:05 +0000] [2202] [INFO] Worker exiting (pid: 2202)
    [2017-12-01 06:06:05 +0000] [2186] [INFO] Worker exiting (pid: 2186)
    [2017-12-01 06:06:05 +0000] [2190] [INFO] Worker exiting (pid: 2190)
    [2017-12-01 06:06:05 +0000] [2194] [INFO] Worker exiting (pid: 2194)
    [2017-12-01 06:06:05 +0000] [2198] [INFO] Worker exiting (pid: 2198)
    [2017-12-01 06:06:06 +0000] [13] [INFO] Shutting down: Master
    [2017-12-01 06:06:06 +0000] [13] [INFO] Reason: Worker failed to boot.

The status in k8s is RUNNING, but when I open an exec shell in the k8s UI, I get the following output (gunicorn seems to be aware that it is dead):

    root@webserver-373771664-3h4v9:/# ps -Al
    F S UID  PID PPID C PRI NI ADDR     SZ WCHAN TTY  TIME     CMD
    4 S   0    1    0 0  80  0 -    107153 -     ?    00:06:42 /usr/local/bin/
    4 Z   0   13    1 0  80  0 -         0 -     ?    00:01:24 gunicorn: maste <defunct>
    4 S   0 2206    0 0  80  0 -      4987 -     ?    00:00:00 bash
    0 R   0 2224 2206 0  80  0 -      7486 -     ?    00:00:00 ps

Here is my deployment YAML:

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: webserver
      namespace: airflow
    spec:
      replicas: 1
      template:
        metadata:
          labels:
            app: airflow-webserver
        spec:
          volumes:
            - name: webserver-dags
              emptyDir: {}
          containers:
            - name: airflow-webserver
              image: my.custom.image:latest
              imagePullPolicy: Always
              resources:
                requests:
                  cpu: 100m
                limits:
                  cpu: 500m
              ports:
                - containerPort: 80
                  protocol: TCP
              env:
                - name: AIRFLOW_HOME
                  value: /var/lib/airflow
                - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
                  valueFrom:
                    secretKeyRef:
                      name: db1
                      key: sqlalchemy_conn
              volumeMounts:
                - mountPath: /var/lib/airflow/dags/
                  name: webserver-dags
              command: ["airflow"]
              args: ["webserver"]
            - name: docker-s3-to-backup
              image: my.custom.image:latest
              imagePullPolicy: Always
              resources:
                requests:
                  cpu: 50m
                limits:
                  cpu: 500m
              env:
                - name: ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws
                      key: access_key_id
                - name: SECRET_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws
                      key: secret_access_key
                - name: S3_PATH
                  value: s3://my-s3-bucket/dags/
                - name: DATA_PATH
                  value: /dags/
                - name: CRON_SCHEDULE
                  value: "*/5 * * * *"
              volumeMounts:
                - mountPath: /dags/
                  name: webserver-dags
    ---
    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: webserver
      namespace: airflow
    spec:
      scaleTargetRef:
        apiVersion: apps/v1beta1
        kind: Deployment
        name: webserver
      minReplicas: 2
      maxReplicas: 20
      targetCPUUtilizationPercentage: 75
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        name: webserver
      namespace: airflow
    spec:
      type: NodePort
      ports:
        - port: 80
      selector:
        app: airflow-webserver

So: when the main process dies in a container, the container exits, and the kubelet restarts the container on the same node / within the same pod. What happened here is in no way Kubernetes' fault; it is actually a problem with your container. The main process you start in the container (whether via CMD or via the entrypoint) needs to die for that to happen, and the process you started did not (one went into zombie mode but was never reaped, which is an example of yet another problem: zombie reaping). A liveness probe would help in this situation (as @sfgroups mentioned), since it would terminate the pod when the probe fails, but that treats the symptom rather than the root cause (which is not to say you shouldn't define probes in general as a good practice).
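One common way to address the zombie-reaping side of this is to run a minimal init system such as tini as PID 1, so defunct children are reaped and signals are forwarded to the real workload. This is a sketch, not something from the answer above, and it assumes `/usr/bin/tini` is installed in your image:

```yaml
# Hypothetical sketch: run tini as PID 1 so it reaps zombie children
# (like the defunct gunicorn master) and forwards signals to airflow.
# Assumes /usr/bin/tini exists inside my.custom.image.
containers:
  - name: airflow-webserver
    image: my.custom.image:latest
    command: ["/usr/bin/tini", "--"]
    args: ["airflow", "webserver"]
```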

You need to define readiness and liveness probes so Kubernetes can detect the pod's state.

As documented on this page: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-tcp-liveness-probe

    - containerPort: 8080
      readinessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
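Adapted to the deployment in the question, probes on the webserver container might look like the following. This is a sketch rather than part of the original answer; it assumes that a TCP check on containerPort 80 is an adequate health signal for the Airflow webserver:

```yaml
# Hypothetical sketch for the airflow-webserver container:
# the kubelet restarts the container when the liveness probe fails,
# and the Service stops routing traffic to it while readiness fails.
- name: airflow-webserver
  image: my.custom.image:latest
  ports:
    - containerPort: 80
      protocol: TCP
  readinessProbe:
    tcpSocket:
      port: 80
    initialDelaySeconds: 5
    periodSeconds: 10
  livenessProbe:
    tcpSocket:
      port: 80
    initialDelaySeconds: 15
    periodSeconds: 20
```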