kube-dns cannot resolve 'kubernetes.default.svc.cluster.local'
After deploying a Kubernetes cluster with Kargo, I found that the kubedns pod is not working properly:
$ kcsys get pods -o wide
NAME            READY   STATUS             RESTARTS   AGE   IP            NODE
dnsmasq-alv8k   1/1     Running            2          1d    10.233.86.2   kubemaster
dnsmasq-c9y52   1/1     Running            2          1d    10.233.82.2   kubeminion1
dnsmasq-sjouh   1/1     Running            2          1d    10.233.76.6   kubeminion2
kubedns-hxaj7   2/3     CrashLoopBackOff   339        22h   10.233.76.3   kubeminion2
PS: kcsys is an alias for kubectl --namespace=kube-system
The logs of every container (kubedns, dnsmasq) look fine, except for the healthz container, which shows:
2017/03/01 07:24:32 Healthz probe error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local' error exit status 1
UPDATE
Description of the kubedns rc:
apiVersion: v1
kind: ReplicationController
metadata:
  creationTimestamp: 2017-02-28T08:31:57Z
  generation: 1
  labels:
    k8s-app: kubedns
    kubernetes.io/cluster-service: "true"
    version: v19
  name: kubedns
  namespace: kube-system
  resourceVersion: "130982"
  selfLink: /api/v1/namespaces/kube-system/replicationcontrollers/kubedns
  uid: 5dc9f9f2-fd90-11e6-850d-005056a020b4
spec:
  replicas: 1
  selector:
    k8s-app: kubedns
    version: v19
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kubedns
        kubernetes.io/cluster-service: "true"
        version: v19
    spec:
      containers:
      - args:
        - --domain=cluster.local.
        - --dns-port=10053
        - --v=2
        image: gcr.io/google_containers/kubedns-amd64:1.9
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kubedns
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 170Mi
          requests:
            cpu: 70m
            memory: 70Mi
        terminationMessagePath: /dev/termination-log
      - args:
        - --log-facility=-
        - --cache-size=1000
        - --no-resolv
        - --server=127.0.0.1#10053
        image: gcr.io/google_containers/kube-dnsmasq-amd64:1.3
        imagePullPolicy: IfNotPresent
        name: dnsmasq
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 170Mi
          requests:
            cpu: 70m
            memory: 70Mi
        terminationMessagePath: /dev/termination-log
      - args:
        - -cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null && nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
        - -port=8080
        - -quiet
        image: gcr.io/google_containers/exechealthz-amd64:1.1
        imagePullPolicy: IfNotPresent
        name: healthz
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          limits:
            cpu: 10m
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 50Mi
        terminationMessagePath: /dev/termination-log
      dnsPolicy: Default
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  fullyLabeledReplicas: 1
  observedGeneration: 1
  replicas: 1
Description of the kubedns svc:
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2017-02-28T08:31:58Z
  labels:
    k8s-app: kubedns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: kubedns
  name: kubedns
  namespace: kube-system
  resourceVersion: "10736"
  selfLink: /api/v1/namespaces/kube-system/services/kubedns
  uid: 5ed4dd78-fd90-11e6-850d-005056a020b4
spec:
  clusterIP: 10.233.0.3
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  selector:
    k8s-app: kubedns
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
I spotted some errors in the kubedns container:
1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://10.233.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.233.0.1:443: i/o timeout
1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: Get https://10.233.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.233.0.1:443: i/o timeout
UPDATE 2
- iptables rules created by kube-proxy when the hostnames service is created with 3 pods:
-
- Flags of the controller-manager pod:
-
- Pod status
You can take a look at the output of ps auxf | grep dockerd.
Kargo is adding the setting iptables=false to the docker daemon. As far as I can tell, this causes connectivity problems for containers on the host network, because connecting to 10.233.0.1:443 relies on iptables rules that forward the request to one of the master nodes' API servers.
The other kubernetes services have their network bound to the host, so you won't run into this issue with them.
I'm not sure whether this is the root problem, but removing iptables=false from the docker daemon settings fixed every issue we were running into. It is not disabled by default, and it is not expected to be disabled when using a network overlay like flannel.
The iptables option can be removed from the docker daemon via /etc/systemd/system/docker.service.d/docker-options.conf, which should look something like this:
[root@k8s-joy-g2eqd2 ~]# cat /etc/systemd/system/docker.service.d/docker-options.conf
[Service]
Environment="DOCKER_OPTS=--insecure-registry=10.233.0.0/18 --graph=/var/lib/docker --iptables=false"
Once the file is updated, you can run systemctl daemon-reload to register the change, followed by systemctl restart docker.
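As a concrete sketch, the edit amounts to stripping the flag out of DOCKER_OPTS. The snippet below does this on a temporary copy of the file shown above (the real path is /etc/systemd/system/docker.service.d/docker-options.conf and editing it in place requires root; GNU sed is assumed):

```shell
# Work on a temp copy of docker-options.conf; contents taken from the file shown above
conf=$(mktemp)
cat > "$conf" <<'EOF'
[Service]
Environment="DOCKER_OPTS=--insecure-registry=10.233.0.0/18 --graph=/var/lib/docker --iptables=false"
EOF

# Drop the --iptables=false flag, leaving the other options intact
sed -i 's/ --iptables=false//' "$conf"
cat "$conf"
```

After applying the same edit to the real file, the daemon-reload and docker restart described above make it take effect.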
This will let you test whether it resolves your problem. Once you confirm it is the fix, you can override the docker_options variable in your Kargo deployment to exclude that flag:
docker_options: "--insecure-registry=10.233.0.0/18 --graph=/var/lib/docker"
Based on the error you posted, kubedns cannot communicate with the API server:
dial tcp 10.233.0.1:443: i/o timeout
This can mean one of three things:
Your container network fabric is not configured properly
- Look for errors in the logs of the network solution you are using
- Make sure every docker daemon is using its own IP range
- Verify that the container network does not overlap with the host network
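For that last check, a minimal overlap test between two IPv4 CIDR ranges can be done in plain bash. The two example ranges below (10.233.0.0/18 for services, 10.233.64.0/18 for pods) are assumed Kargo defaults, not values read from your cluster; substitute the ranges from your own deployment:

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer
ip2int() { local IFS=.; set -- $1; echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 )); }

# Succeed (exit 0) if the two CIDR arguments overlap
cidrs_overlap() {
  local p1=${1#*/} p2=${2#*/}
  local m=$(( p1 < p2 ? p1 : p2 ))                       # compare on the shorter prefix
  local mask=$(( (0xFFFFFFFF << (32 - m)) & 0xFFFFFFFF ))
  [ $(( $(ip2int "${1%/*}") & mask )) -eq $(( $(ip2int "${2%/*}") & mask )) ]
}

# Assumed Kargo defaults: service network vs pod network
cidrs_overlap 10.233.0.0/18 10.233.64.0/18 && echo overlap || echo "no overlap"   # → no overlap
cidrs_overlap 10.233.0.0/18 10.233.0.0/16  && echo overlap || echo "no overlap"   # → overlap
```

Run it against your host network range as well; any "overlap" result for host vs container ranges points at this cause.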
There is a problem with kube-proxy, and network traffic is not forwarded to the API server when using the kubernetes internal service (10.233.0.1)
- Check the kube-proxy logs on your nodes (kubeminion{1,2}) and update your question with any errors you find there
If you are also seeing authentication errors:
kube-controller-manager does not generate valid service account tokens
- Check that the --service-account-private-key-file and --root-ca-file flags of kube-controller-manager are set to a valid key/cert, then restart the service
- Delete the default-token-xxxx secret in the kube-system namespace and recreate the kube-dns deployment
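Taken together, the token-reset steps above look roughly like this. The secret name suffix is cluster-specific (the xxxx stays a placeholder until you look it up), and the label selector is taken from the kubedns rc shown in the question, so treat these lines as a sketch rather than copy-paste commands:

```shell
# Find the actual secret name first (the xxxx suffix differs per cluster)
kubectl --namespace=kube-system get secrets | grep default-token

# Delete it so kube-controller-manager regenerates it with the fixed key/cert
kubectl --namespace=kube-system delete secret default-token-xxxx   # placeholder name

# Delete the kube-dns pod so the rc recreates it with the new token mounted
kubectl --namespace=kube-system delete pod -l k8s-app=kubedns
```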