Docker Engine fails on Azure Batch nodes

Scenario

I created a pool with several nodes (the base image is Ubuntu Server 16.04) and gave it the following start command:

/bin/bash -c 'set -o pipefail; export DEBIAN_FRONTEND=noninteractive ; sudo -E apt update ; sudo -E apt upgrade -y ; sudo -E apt-get install -y --no-install-recommends apt-transport-https curl software-properties-common ; curl -fsSL "https://sks-keyservers.net/pks/lookup?op=get&search=0xee6d536cf7dc86e2d7d56f59a178ac6c6238f52e" | sudo -E apt-key add - ; sudo -E apt-add-repository "deb https://packages.docker.com/1.13/apt/repo/ ubuntu-$(lsb_release -cs) main" ; sudo -E apt-get update ; sudo -E apt-get install -y docker-engine ; sudo usermod -a -G docker $USER ; sudo -E service docker start ; journalctl -xe; wait'

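The sole purpose of this command is to install Docker Engine. Note also that I removed the set -e option so that journalctl -xe would still run and capture the error shown below.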
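One way to keep fail-fast behaviour and still capture the journal is to wrap the risky step in a small helper. This is only a minimal sketch and not part of the original start command: run_with_diagnostics and the DIAG_CMD override are hypothetical names, and journalctl is assumed to be called with --no-pager so its output goes straight to the task's stderr.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command; if it fails, dump diagnostics and
# return the original exit status so the start task still reports failure.
# DIAG_CMD is an assumed override hook; it defaults to the journal dump.
run_with_diagnostics() {
    "$@"
    local status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with status $status: $*" >&2
        # Deliberately unquoted so the command string splits into words.
        ${DIAG_CMD:-journalctl -xe --no-pager} >&2 || true
    fi
    return "$status"
}

# Example use inside the start task:
#   run_with_diagnostics sudo -E service docker start || exit 1
```

Calling the helper on the left side of || exit 1 matters when set -e is active: in that position the shell does not abort inside the function, so the diagnostics still run before the task exits.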

Error

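When the pool described above is created, some nodes fail the start task. The behavior appears to be random: it is not always the same node that fails, and the other nodes start without any problem. The behavior does not depend on the node size (I tried D2_v3 and NC6).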

This is the output of journalctl -xe:

Oct 12 09:19:40 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: Listening on Docker Socket for the API.
-- Subject: Unit docker.socket has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.socket has finished starting up.
--
-- The start-up result is done.
Oct 12 09:19:40 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: Starting Docker Application Container Engine...
-- Subject: Unit docker.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has begun starting up.
Oct 12 09:19:40 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:40.605332263Z" level=info msg="libcontainerd: new containerd process, pid: 24492"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.608293321Z" level=info msg="[graphdriver] using prior storage driver: aufs"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626089049Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626378756Z" level=warning msg="Your kernel does not support swap memory limit"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626558660Z" level=warning msg="Your kernel does not support cgroup rt period"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626698864Z" level=warning msg="Your kernel does not support cgroup rt runtime"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626834867Z" level=warning msg="Your kernel does not support cgroup blkio weight"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626970070Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.627384080Z" level=info msg="Loading containers: start."
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.630900065Z" level=info msg="Firewalld running: false"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.661877309Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A kernel: IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.996853856Z" level=info msg="Loading containers: done."
Oct 12 09:19:42 7d8bb094c57c400582f6031d59f1630000000A kernel: aufs au_opts_verify:1585:dockerd[24490]: dirperm1 breaks the protection by the permission bits on the lower branch
Oct 12 09:19:45 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: docker.service: Main process exited, code=killed, status=11/SEGV
Oct 12 09:19:45 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has failed.
--
-- The result is failed.
Oct 12 09:19:45 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: docker.service: Unit entered failed state.
Oct 12 09:19:45 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: docker.service: Failed with result 'signal'.

It looks like something goes wrong while the network interface is being created, but I am not sure what, and especially not how to fix it.
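In a long journal it can help to isolate the lines where systemd reports how a unit's main process ended. In the output above that line reads code=killed, status=11/SEGV. The filter below is a rough sketch: the function name is hypothetical, and the pattern assumes the message format seen in this log.

```shell
#!/usr/bin/env bash

# Hypothetical filter: read journal text on stdin and print only systemd's
# reports that a unit's main process exited (pattern taken from the log above).
show_main_process_exits() {
    grep -E 'systemd\[[0-9]+\]: .*: Main process exited'
}

# Example use on a node:
#   journalctl -u docker.service --no-pager | show_main_process_exits
```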

Updated answer, 2017-10-18:

The latest Canonical UbuntuServer 16.04-LTS platform image has resolved this issue and works with Go / Docker again.

Original answer:

There is nothing wrong with your code. The Canonical UbuntuServer 16.04-LTS 201709190 platform image (which is currently also latest) has a problem with Go / Docker.

Until the problem is fixed, pin the image version for your deployment to 201708151 for the time being.
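As a sketch of pinning the version from the command line, assuming the Azure CLI is installed and already logged in to the Batch account: the pool ID, node count, and VM size below are placeholders, and the version string is copied verbatim from this answer, so check it against the image listing before relying on it.

```shell
#!/usr/bin/env bash

# Image reference in publisher:offer:sku:version form. The version is the
# one named in this answer; confirm the exact string first, for example with
#   az vm image list --publisher Canonical --offer UbuntuServer --sku 16.04-LTS --all
PUBLISHER="Canonical"
OFFER="UbuntuServer"
SKU="16.04-LTS"
VERSION="201708151"
IMAGE="${PUBLISHER}:${OFFER}:${SKU}:${VERSION}"

# Hypothetical wrapper; the pool ID is passed in, other values are placeholders.
create_pinned_pool() {
    az batch pool create \
        --id "$1" \
        --vm-size "Standard_D2_v3" \
        --target-dedicated-nodes 2 \
        --image "$IMAGE" \
        --node-agent-sku-id "batch.node.ubuntu 16.04"
}

# Example use:
#   create_pinned_pool my-docker-pool
```

Omitting the version part of the image reference makes the CLI fall back to latest, which is the behaviour to avoid while the newer image is broken.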

Also: if you are using Docker with Azure Batch, you should take a look at Batch Shipyard, which provides this functionality. (Full disclosure: I am a contributor to that code.)