Destroying a Marathon Docker container kills the Mesos slave

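We have a Mesos cluster and use Marathon to launch tasks in Docker containers on the Mesos slaves.

The system runs very well overall, but a very strange problem shows up from time to time: when we destroy or redeploy a task through Marathon, the mesos-slave process is killed as the target Docker container exits. This is the error log I get: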

Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465544 4094 docker.cpp:1592] Executor for container 'eadfb756-b653-42eb-977a-c16c78b1a7c5' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465736 4094 docker.cpp:1390] Destroying container 'eadfb756-b653-42eb-977a-c16c78b1a7c5'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465812 4094 docker.cpp:1494] Running docker stop on container 'eadfb756-b653-42eb-977a-c16c78b1a7c5'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.466089 4098 slave.cpp:3440] Executor 'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000 exited with status 0
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.466167 4098 slave.cpp:3544] Cleaning up executor 'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: F0229 19:31:51.470055 4098 slave.cpp:3570] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: *** Check failure stack trace: ***
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c2144dd google::LogMessage::Fail()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c21621c google::LogMessage::SendToLog()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.566812 4099 docker.cpp:1592] Executor for container 'e2d9c750-88b7-4247-b696-6589665d6a66' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c2140cc google::LogMessage::Flush()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569646 4099 docker.cpp:1390] Destroying container 'e2d9c750-88b7-4247-b696-6589665d6a66'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569757 4099 docker.cpp:1592] Executor for container 'f51c68b8-c64d-47ea-a629-8516dcc90dba' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569787 4099 docker.cpp:1390] Destroying container 'f51c68b8-c64d-47ea-a629-8516dcc90dba'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569818 4099 docker.cpp:1494] Running docker stop on container 'e2d9c750-88b7-4247-b696-6589665d6a66'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569849 4099 docker.cpp:1494] Running docker stop on container 'f51c68b8-c64d-47ea-a629-8516dcc90dba'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c216b19 google::LogMessageFatal::~LogMessageFatal()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3bc99f2e mesos::internal::slave::Slave::removeExecutor()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3bcaca60 mesos::internal::slave::Slave::executorTerminated()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c1c6541 process::ProcessManager::resume()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c1c683f process::internal::schedule()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3ad4a1e0 (unknown)
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3afa3df5 start_thread
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3a7b41ad __clone
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service: main process exited, code=killed, status=6/ABRT
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: Unit mesos-slave.service entered failed state.
Feb 29 19:32:11 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service holdoff time over, scheduling restart.
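
In a log this interleaved, the entry that matters is the one whose glog severity letter is F (fatal), since that is the line that aborts the process. As a rough sketch, the Python below pulls fatal entries out of journal lines shaped like the ones above. The regular expression and the helper name `fatal_entries` are my own illustration and assume the usual glog prefix (severity letter, MMDD, time, thread id, file:line); they are not part of any Mesos tooling.

```python
import re

# Assumed glog prefix: <severity><MMDD> <HH:MM:SS.uuuuuu> <thread> <file>:<line>] <message>
GLOG_RE = re.compile(
    r"(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)"
)


def fatal_entries(lines):
    """Return (file, line, message) for every glog entry with severity F."""
    found = []
    for raw in lines:
        match = GLOG_RE.search(raw)
        if match and match.group("severity") == "F":
            found.append(
                (match.group("file"), int(match.group("line")), match.group("message"))
            )
    return found


# Three lines copied from the log above: one info entry, the fatal entry,
# and a stack-trace line that carries no glog prefix at all.
sample = [
    "Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: "
    "I0229 19:31:51.466167 4098 slave.cpp:3544] Cleaning up executor "
    "'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' "
    "of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000",
    "Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: "
    "F0229 19:31:51.470055 4098 slave.cpp:3570] CHECK_SOME(os::touch(path)): "
    "Failed to open file: No such file or directory",
    "Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: "
    "@ 0x7f8c3c2144dd google::LogMessage::Fail()",
]

for source_file, source_line, message in fatal_entries(sample):
    print(f"{source_file}:{source_line}: {message}")
```

On the sample, only the slave.cpp:3570 entry is reported; the informational entry and the stack-trace frame are skipped.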

The task launched in the Docker container is an Akka application. The environment details for the whole system are:

OS:

CentOS Linux release 7.1.1503 (Core)

Kernel:

3.10.0-229.el7.x86_64

JDK on all machines:

 java version "1.7.0_91"
 OpenJDK Runtime Environment (rhel-2.6.2.1.el7_1-x86_64 u91-b00)
 OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)

Mesos:

 0.25, installed by yum from mesosphere repo 

Mesos-Master configuration:

 --zk=zk://zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster
 --port=5050
 --log_dir=/var/log/mesos
 --cluster=mesos-prod-cluster
 --hostname=<real hostname>
 --ip=<real ip>
 --quorum=3
 --registry_fetch_timeout=5mins
 --work_dir=/var/lib/mesos

Mesos-Slave configuration:

 --master=zk://zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster
 --log_dir=/var/log/mesos
 --attributes=env:prod
 --containerizers=docker,mesos
 --docker_remove_delay=2weeks
 --executor_registration_timeout=30mins
 --hostname=<real slave hostname>

Marathon info:

{
  "name": "marathon",
  "version": "0.11.1",
  "elected": true,
  "leader": "<leader_ip>:8080",
  "frameworkId": "8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000",
  "marathon_config": {
    "master": "zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster",
    "failover_timeout": 604800,
    "framework_name": "marathon",
    "ha": true,
    "checkpoint": true,
    "local_port_min": 10000,
    "local_port_max": 20000,
    "executor": "//cmd",
    "hostname": "<hostname>",
    "webui_url": null,
    "mesos_role": null,
    "task_launch_timeout": 600000,
    "reconciliation_initial_delay": 15000,
    "reconciliation_interval": 300000,
    "marathon_store_timeout": 2000,
    "mesos_user": "root",
    "leader_proxy_connection_timeout_ms": 5000,
    "leader_proxy_read_timeout_ms": 10000,
    "mesos_leader_ui_url": "http://<leader_ip>:5050/"
  },
  "zookeeper_config": {
    "zk": "zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/marathon-cluster",
    "zk_timeout": 10000,
    "zk_session_timeout": 1800000,
    "zk_max_versions": 25
  },
  "event_subscriber": {
    "type": "http_callback",
    "http_endpoints": null
  },
  "http_config": {
    "assets_path": null,
    "http_port": 8080,
    "https_port": 8443
  }
}

Docker version:

 Client:
  Version: 1.9.1
  API version: 1.21
  Go version: go1.4.2
  Git commit: a34a1d5
  Built: Fri Nov 20 13:25:01 UTC 2015
  OS/Arch: linux/amd64

 Server:
  Version: 1.9.1
  API version: 1.21
  Go version: go1.4.2
  Git commit: a34a1d5
  Built: Fri Nov 20 13:25:01 UTC 2015
  OS/Arch: linux/amd64

Docker info:

 Containers: 330
 Images: 509
 Server Version: 1.9.1
 Storage Driver: devicemapper
  Pool Name: docker-253:0-68977907-pool
  Pool Blocksize: 65.54 kB
  Base Device Size: 107.4 GB
  Backing Filesystem:
  Data file: /dev/loop0
  Metadata file: /dev/loop1
  Data Space Used: 23.68 GB
  Data Space Total: 107.4 GB
  Data Space Available: 27.51 GB
  Metadata Space Used: 63.75 MB
  Metadata Space Total: 2.147 GB
  Metadata Space Available: 2.084 GB
  Udev Sync Supported: true
  Deferred Removal Enabled: false
  Deferred Deletion Enabled: false
  Deferred Deleted Device Count: 0
  Data loop file: /var/lib/docker/devicemapper/devicemapper/data
  Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
  Library Version: 1.02.93-RHEL7 (2015-01-28)
 Execution Driver: native-0.2
 Logging Driver: json-file
 Kernel Version: 3.10.0-229.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 CPUs: 4
 Total Memory: 15.67 GiB
 Name: mesos-slave3.gz.yougola.com
 ID: QB4G:C2HK:CBPR:G5ID:6OCU:DFEC:USBP:ECLQ:FWOQ:ZGHS:JIU5:JNN4

Docker, Mesos-Master, Mesos-Slave, Marathon, and the other services are all managed by systemd.

That's strange, and unfortunate. It looks like this check is failing: https://github.com/apache/mesos/blob/0.25.0/src/slave/slave.cpp#L3570, because it cannot find the path for the executor sentinel file.
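
To illustrate the failure mode only: the log message "Failed to open file: No such file or directory" is what a touch reports when the directory that should contain the file no longer exists. The Python sketch below is an analogy, not the Mesos code; `touch`, `try_touch`, and the directory and file names are hypothetical, and it makes no claim about what removed the directory on your slave.

```python
import errno
import os
import tempfile


def touch(path):
    """Create the file if it is missing and refresh its timestamps."""
    with open(path, "a"):
        os.utime(path, None)


def try_touch(path):
    """Return None on success, or the errno name (e.g. 'ENOENT') on failure."""
    try:
        touch(path)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


with tempfile.TemporaryDirectory() as work_dir:
    # Hypothetical stand-ins for an executor run directory and its sentinel file.
    executor_dir = os.path.join(work_dir, "executor-run")
    sentinel = os.path.join(executor_dir, "sentinel")

    os.mkdir(executor_dir)
    result_present = try_touch(sentinel)   # directory exists, touch succeeds

    os.remove(sentinel)
    os.rmdir(executor_dir)
    result_missing = try_touch(sentinel)   # parent directory gone, touch fails

print("directory present:", result_present)
print("directory missing:", result_missing)
```

The second call fails with ENOENT even though nothing is wrong with the file name itself, which is the same error text the check at slave.cpp:3570 reports. A caller that treats this as fatal, as a CHECK does, aborts the whole process.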

Could you file a new JIRA at https://issues.apache.org/jira/browse/MESOS so that we can track and fix this?