Nodes cannot join a Swarm cluster

I have 3 virtual machines. They all have Docker 1.12 and run CentOS 7. All ports are open and the VMs can ping each other. I initialized the swarm with:

docker swarm init --advertise-addr 192.168.140.12 

docker info shows:

 Swarm: active
  NodeID: 0drcj2nku1mv8t16fxva48edxx
  Is Manager: true
  ClusterID: cchn0yzospwoe1h9f55d7omxx
  Managers: 1
  Nodes: 1

Now I try to join the nodes (the other VMs) to the swarm. I use the command that was suggested after initializing my manager:

 docker swarm join \
     --token SWMTKN-1-48ythur5k6ckkz90ttlprw37p9z3ldclws51qirw5wdyfmvevr-3sb2t66b2fj6e4dhmfo1vavxx \
     192.168.140.12:2377

But I get:

 Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node. 

docker info on the joining node shows:

 Swarm: pending
  NodeID:
  Error: rpc error: code = 1 desc = context canceled
  Is Manager: false
  Node Address: 192.168.140.14

On the swarm manager:

 # netstat -tulpn | grep docker
 tcp6       0      0 :::2377       :::*       LISTEN      1602/dockerd
 tcp6       0      0 :::7946       :::*       LISTEN      1602/dockerd
 tcp6       0      0 :::8080       :::*       LISTEN      3398/docker-proxy
 tcp6       0      0 :::32768      :::*       LISTEN      3199/docker-proxy
 tcp6       0      0 :::32769      :::*       LISTEN      3219/docker-proxy
 tcp6       0      0 :::32770      :::*       LISTEN      3341/docker-proxy
 tcp6       0      0 :::32771      :::*       LISTEN      3436/docker-proxy
 tcp6       0      0 :::2375       :::*       LISTEN      1602/dockerd
 udp6       0      0 :::7946       :::*                   1602/dockerd

How can I debug this problem, or did I forget an important step? Do the servers need SSH access to each other? Thanks.

Log on the joining node:

 Aug 8 09:50:24 localhost dockerd: time="2016-08-08T09:50:24.393432145-04:00" level=error msg="Handler for POST /v1.24/swarm/leave returned error: This node is not part of swarm"
 Aug 8 09:51:01 localhost su: (to root) worker1 on pts/1
 Aug 8 09:51:34 localhost dockerd: time="2016-08-08T09:51:34.384408514-04:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
 Aug 8 09:51:40 localhost su: (to root) worker1 on pts/1
 Aug 8 09:52:47 localhost dhclient[1277]: DHCPREQUEST on eno16777736 to 192.168.140.254 port 67 (xid=0x11f8fba8)
 Aug 8 09:52:47 localhost dhclient[1277]: DHCPACK from 192.168.140.254 (xid=0x11f8fba8)
 Aug 8 09:52:47 localhost NetworkManager[953]: <info> address 192.168.140.13
 Aug 8 09:52:47 localhost NetworkManager[953]: <info> plen 24 (255.255.255.0)
 Aug 8 09:52:47 localhost NetworkManager[953]: <info> gateway 192.168.140.2
 Aug 8 09:52:47 localhost NetworkManager[953]: <info> server identifier 192.168.140.254
 Aug 8 09:52:47 localhost NetworkManager[953]: <info> lease time 1800
 Aug 8 09:52:47 localhost NetworkManager[953]: <info> nameserver '192.168.140.2'
 Aug 8 09:52:47 localhost NetworkManager[953]: <info> domain name 'localdomain'
 Aug 8 09:52:47 localhost NetworkManager[953]: <info> (eno16777736): DHCPv4 state changed bound -> bound
 Aug 8 09:52:47 localhost dbus[878]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
 Aug 8 09:52:47 localhost dbus-daemon: dbus[878]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
 Aug 8 09:52:47 localhost systemd: Starting Network Manager Script Dispatcher Service...
 Aug 8 09:52:47 localhost dhclient[1277]: bound to 192.168.140.13 -- renewal in 713 seconds.
 Aug 8 09:52:47 localhost dbus[878]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
 Aug 8 09:52:47 localhost dbus-daemon: dbus[878]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
 Aug 8 09:52:47 localhost nm-dispatcher: Dispatching action 'dhcp4-change' for eno16777736
 Aug 8 09:52:47 localhost systemd: Started Network Manager Script Dispatcher Service.

Sometimes there is also this warning:

 level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled 

The hostname of all my VMs was localhost.localdomain. I changed the hostname in /etc/hosts on each server and rebooted. Now I can create my swarm and add nodes successfully.
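A minimal sketch for spotting this situation before joining. The helper name `is_generic_hostname` and the example name `worker1` are assumptions made for illustration; `hostnamectl set-hostname` is the standard systemd tool on CentOS 7 and is only shown in a comment, since the original fix was made by editing /etc/hosts.

```shell
# is_generic_hostname: hypothetical helper; succeeds when the given name is
# one of the default names that every freshly installed VM shares.
is_generic_hostname() {
    case "$1" in
        localhost|localhost.localdomain) return 0 ;;
        *) return 1 ;;
    esac
}

if is_generic_hostname "$(hostname)"; then
    echo "Hostname is still generic; give each node a distinct name, e.g.:"
    echo "  sudo hostnamectl set-hostname worker1   # worker1 is a placeholder"
fi
```

Running this on each VM shows which nodes still carry the shared default name.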

Maybe you are using an HTTP proxy.

You can use the following command to see what dockerd is doing.

 # strace -f -p "$(pidof dockerd)" 2>&1 | grep -v futex | grep -v epoll_wait | grep -v pselect

I had the same problem and solved it by synchronizing the date on each worker node with the date on the master node.

 pi@workernode$ sudo date --set="$(ssh username@masternode date)"

After this, try to join the worker node again; it should work.
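To check whether clock drift is plausibly the cause before changing anything, the two clocks can be compared first. This is only a sketch: `clock_skew` is a hypothetical helper, and `username@masternode` is the same placeholder used in the command above. A large skew may matter because the join involves TLS certificates issued by the manager, which carry validity times.

```shell
# clock_skew: hypothetical helper; prints the absolute difference in seconds
# between two Unix timestamps.
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo "$diff"
}

# Usage on the worker (requires SSH access to the master):
#   clock_skew "$(date +%s)" "$(ssh username@masternode date +%s)"
```

If the printed value is more than a few seconds, synchronizing the clocks (or running an NTP client on all nodes) is worth trying.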

If none of the other solutions worked, try disabling the firewall on the master server and see whether the join works.
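If disabling the firewall does help, a less drastic follow-up is to open only the ports swarm mode uses. The sketch below assumes firewalld, the CentOS 7 default; the helper name `swarm_firewall_cmds` is hypothetical, and it only prints the commands so they can be reviewed before being run as root.

```shell
# swarm_firewall_cmds: hypothetical helper that prints (does not execute)
# the firewall-cmd calls for the swarm mode ports:
#   2377/tcp  cluster management
#   7946/tcp and 7946/udp  communication among nodes
#   4789/udp  overlay network traffic
swarm_firewall_cmds() {
    for port in 2377/tcp 7946/tcp 7946/udp 4789/udp; do
        echo "firewall-cmd --permanent --add-port=$port"
    done
    echo "firewall-cmd --reload"
}

swarm_firewall_cmds
```

The printed commands would be run on every node, not just the manager.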

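As wenjianhn explained, make sure you have not configured an HTTP proxy for Docker on your worker nodes. Swarm nodes communicate over HTTP (default port 2377), so if an HTTP proxy is configured, it will be used even when the manager node is on your LAN.

Also, make sure no firewall is blocking communication on port 2377: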
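One way to check for a proxy, assuming Docker runs under systemd as it does on CentOS 7, is to inspect the environment passed to the daemon. `systemctl show --property=Environment docker` is a real command; the `has_proxy` filter around it is a hypothetical helper written for this sketch.

```shell
# has_proxy: hypothetical helper; succeeds if the text on stdin sets
# HTTP_PROXY or HTTPS_PROXY (upper or lower case).
has_proxy() {
    grep -Eiq 'https?_proxy='
}

# Usage on the worker node:
#   systemctl show --property=Environment docker | has_proxy \
#       && echo "dockerd is configured with a proxy"
```

If a proxy turns out to be set, removing it or excluding the manager's address via NO_PROXY in the same systemd drop-in may be enough, followed by a daemon restart.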

 user@workernode$ telnet ip-of-manager 2377 

If you cannot open a Telnet connection on port 2377, the port is being blocked by a firewall (either the worker node's or the manager's).
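Where telnet is not installed, the same TCP check can be sketched with bash's built-in /dev/tcp redirection. `port_open` is a hypothetical helper, `ip-of-manager` is the placeholder from the command above, and the 3-second timeout is an arbitrary choice. This only tests TCP, so it says nothing about the UDP ports swarm also uses.

```shell
# port_open: hypothetical helper; succeeds if a TCP connection to host $1 on
# port $2 can be opened within 3 seconds. Relies on bash's /dev/tcp feature
# and on timeout from coreutils.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage on the worker node:
#   port_open ip-of-manager 2377 && echo "2377 reachable" || echo "2377 blocked"
```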

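I had the same problem. To do this you would normally use docker-machine's generic driver, but I found that this driver does not work as it should... In practice docker-machine only worked for me with the virtualbox driver on a physical machine, not with VMs or the cloud drivers.

So I really don't know how to do it; as far as I searched, I did not find a solution for using Swarm with remote hosts.

After giving up on Swarm, Kubernetes might work, but the problem with Kubernetes is that it has to be used in the cloud or installed on vSphere, and I have no way of running vSphere on my ESX with 300+ VMs...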