kube-dns can't resolve 'kubernetes.default.svc.cluster.local'

After deploying a kubernetes cluster with kargo, I found that the kubedns pod isn't working properly:

$ kcsys get pods -o wide
NAME            READY     STATUS             RESTARTS   AGE       IP            NODE
dnsmasq-alv8k   1/1       Running            2          1d        10.233.86.2   kubemaster
dnsmasq-c9y52   1/1       Running            2          1d        10.233.82.2   kubeminion1
dnsmasq-sjouh   1/1       Running            2          1d        10.233.76.6   kubeminion2
kubedns-hxaj7   2/3       CrashLoopBackOff   339        22h       10.233.76.3   kubeminion2

PS: kcsys is an alias for kubectl --namespace=kube-system
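For reference, the alias is just the following (assuming a bash-style shell):

 alias kcsys='kubectl --namespace=kube-system'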

The logs of each container (kubedns, dnsmasq) seem fine, except for the healthz container, which shows the following:

 2017/03/01 07:24:32 Healthz probe error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
 error exit status 1
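That probe can be reproduced by hand to confirm the lookup itself fails; a minimal sketch, running the same command the healthz container uses (pod and container names taken from the listing above):

 # run the failing lookup inside the healthz container, against kubedns on 127.0.0.1
 kcsys exec kubedns-hxaj7 -c healthz -- nslookup kubernetes.default.svc.cluster.local 127.0.0.1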

UPDATE

Description of the kubedns rc:

apiVersion: v1
kind: ReplicationController
metadata:
  creationTimestamp: 2017-02-28T08:31:57Z
  generation: 1
  labels:
    k8s-app: kubedns
    kubernetes.io/cluster-service: "true"
    version: v19
  name: kubedns
  namespace: kube-system
  resourceVersion: "130982"
  selfLink: /api/v1/namespaces/kube-system/replicationcontrollers/kubedns
  uid: 5dc9f9f2-fd90-11e6-850d-005056a020b4
spec:
  replicas: 1
  selector:
    k8s-app: kubedns
    version: v19
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kubedns
        kubernetes.io/cluster-service: "true"
        version: v19
    spec:
      containers:
      - args:
        - --domain=cluster.local.
        - --dns-port=10053
        - --v=2
        image: gcr.io/google_containers/kubedns-amd64:1.9
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kubedns
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 170Mi
          requests:
            cpu: 70m
            memory: 70Mi
        terminationMessagePath: /dev/termination-log
      - args:
        - --log-facility=-
        - --cache-size=1000
        - --no-resolv
        - --server=127.0.0.1#10053
        image: gcr.io/google_containers/kube-dnsmasq-amd64:1.3
        imagePullPolicy: IfNotPresent
        name: dnsmasq
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 170Mi
          requests:
            cpu: 70m
            memory: 70Mi
        terminationMessagePath: /dev/termination-log
      - args:
        - -cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null && nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
        - -port=8080
        - -quiet
        image: gcr.io/google_containers/exechealthz-amd64:1.1
        imagePullPolicy: IfNotPresent
        name: healthz
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          limits:
            cpu: 10m
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 50Mi
        terminationMessagePath: /dev/termination-log
      dnsPolicy: Default
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  fullyLabeledReplicas: 1
  observedGeneration: 1
  replicas: 1

Description of the kubedns svc:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2017-02-28T08:31:58Z
  labels:
    k8s-app: kubedns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: kubedns
  name: kubedns
  namespace: kube-system
  resourceVersion: "10736"
  selfLink: /api/v1/namespaces/kube-system/services/kubedns
  uid: 5ed4dd78-fd90-11e6-850d-005056a020b4
spec:
  clusterIP: 10.233.0.3
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  selector:
    k8s-app: kubedns
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

I spotted some errors in the kubedns container:

 1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://10.233.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.233.0.1:443: i/o timeout
 1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: Get https://10.233.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.233.0.1:443: i/o timeout
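Those timeouts can be checked independently of DNS. A quick probe of the API service VIP, as a sketch (assuming curl is available on the node; a TLS or authorization error still proves TCP connectivity, while a hang reproduces the timeout):

 # run on kubeminion2, where the failing pod is scheduled
 curl -k --connect-timeout 10 https://10.233.0.1:443/api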

UPDATE 2

  1. iptables rules created by kube-proxy when creating a hostnames service with 3 pods (they can also be dumped with the commands after this list):

[screenshot]

  2. Flags of the controller-manager pod: [screenshot]

  3. Pods status:

[screenshot]
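For readers without the screenshots: the kube-proxy rules from point 1 can be dumped on any node like this (a sketch; KUBE-SERVICES is the chain kube-proxy conventionally writes its service VIP rules into):

 # NAT rules for the service VIPs, e.g. the apiserver at 10.233.0.1
 iptables -t nat -L KUBE-SERVICES -n | grep 10.233.0.1
 # or the full set, including the per-service KUBE-SVC-* chains
 iptables-save -t nat | grep -i kube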

You can have a look at the output of ps auxf | grep dockerd.
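That is:

 # look for --iptables=false in the daemon's command line
 ps auxf | grep dockerd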

Kargo is adding the iptables=false setting to the docker daemon. As far as I can tell, this causes connectivity problems between containers and the host network, because a connection to 10.233.0.1:443 has to follow the iptables rules that forward the request to one of the api-servers on the master nodes.

The other kubernetes services have their networking bound to the host, so you don't run into this problem with them.

I'm not sure if that is the root issue, but removing iptables=false from the docker daemon settings fixed every problem we were running into. It is not disabled by default, and it is not expected to be disabled when you use a network overlay like flannel.

Removing the iptables option for the docker daemon can be done in /etc/systemd/system/docker.service.d/docker-options.conf, which should currently look something like this:

[root@k8s-joy-g2eqd2 ~]# cat /etc/systemd/system/docker.service.d/docker-options.conf
[Service]
Environment="DOCKER_OPTS=--insecure-registry=10.233.0.0/18 --graph=/var/lib/docker --iptables=false"
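After removing the flag, the file would read:

 [Service]
 Environment="DOCKER_OPTS=--insecure-registry=10.233.0.0/18 --graph=/var/lib/docker"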

Once it's updated, you can run systemctl daemon-reload to register the change, then systemctl restart docker.
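In other words:

 systemctl daemon-reload
 systemctl restart docker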

This will let you test whether that fixes your issue. Once you have confirmed it as the fix, you can override the docker_options variable in your kargo deployment to exclude that flag:

docker_options: "--insecure-registry=10.233.0.0/18 --graph=/var/lib/docker"

Based on the errors you posted, kubedns cannot communicate with the API server:

 dial tcp 10.233.0.1:443: i/o timeout 

This can mean three things:


1. Your container network fabric is not configured properly

  • Look for errors in the logs of the network solution you are using
  • Make sure every docker daemon is using its own IP range
  • Verify that the container network does not overlap with the host network (see the checks below)
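A quick way to eyeball the last two points, assuming the default docker bridge network is in use:

 # on every node: which subnet did this docker daemon get?
 docker network inspect bridge -f '{{(index .IPAM.Config 0).Subnet}}'
 # compare against the host's routes - the ranges must not collide
 ip route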

2. There is a problem with kube-proxy, and network traffic is not forwarded to the API server when using the kubernetes internal service (10.233.0.1)

  • Check the kube-proxy logs on your nodes (kubeminion{1,2}) and update your question with any errors you find (see below for how to get at them)
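How you reach those logs depends on how kargo runs kube-proxy; two common variants (the pod name here is illustrative):

 # if kube-proxy runs as a pod on each node:
 kcsys get pods -o wide | grep kube-proxy
 kcsys logs kube-proxy-kubeminion1
 # if it runs under systemd instead:
 journalctl -u kube-proxy --no-pager | tail -n 50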

If you are also seeing authentication errors:

3. kube-controller-manager does not generate valid service account tokens

  • Check that the --service-account-private-key-file and --root-ca-file flags of kube-controller-manager are set to a valid key/cert, then restart the service

  • Delete the default-token-xxxx secret in the kube-system namespace and recreate the kube-dns deployment (sketched below)
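A sketch of that last step (replace xxxx with the real suffix; deleting the pod is enough here, because the ReplicationController recreates it with the fresh token):

 # find and delete the stale token secret
 kcsys get secrets | grep default-token
 kcsys delete secret default-token-xxxx
 # force kube-dns to restart and mount the newly generated token
 kcsys delete pod kubedns-hxaj7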