Where can I find more explicit errors for container exit status codes?

I am running tasks through a Mesos stack that uses Docker containers.

Sometimes, some of the tasks fail.

Here are some of the associated TaskStatus messages and reasons:

    message: Container exited with status 1 - reason: REASON_COMMAND_EXECUTOR_FAILED
    message: Container exited with status 42 - reason: REASON_COMMAND_EXECUTOR_FAILED
    message: Container exited with status 137 - reason: REASON_COMMAND_EXECUTOR_FAILED

Is there a lookup table that maps the container exit status codes in TaskStatus messages to more explicit errors?

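A command task can fail for any of the reasons below, and it sets a corresponding exit code. For example, Docker 1.10 sets the following exit status codes (from the documentation and this answer):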

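The exit code from docker run gives information about why the container failed to run or why it exited. When docker run exits with a non-zero code, the exit codes follow the chroot standard, as shown below: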

125 if the error is with the Docker daemon itself:

    $ docker run --foo busybox; echo $?
    # flag provided but not defined: --foo
      See 'docker run --help'.
      125

126 if the contained command cannot be invoked:

    $ docker run busybox /etc; echo $?
    # docker: Error response from daemon: Container command '/etc' could not be invoked.
      126

127 if the contained command cannot be found:

    $ docker run busybox foo; echo $?
    # docker: Error response from daemon: Container command 'foo' not found or does not exist.
      127

Otherwise, the exit code of the contained command:

    $ docker run busybox /bin/sh -c 'exit 3'; echo $?
    # 3
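As a rough illustration of the rules quoted above, the following Python sketch maps a `docker run` exit code to one of those categories. The function name `classify_docker_run_exit` is made up for this example and is not part of Docker or Mesos. Note that the mapping is ambiguous in practice: a contained command can itself exit with 125, 126, or 127, so treat the result as a hint rather than a diagnosis.

```python
def classify_docker_run_exit(code: int) -> str:
    """Map a `docker run` exit code to the categories described in the Docker docs.

    Hypothetical helper for illustration only.
    """
    if code == 0:
        return "success"
    if code == 125:
        return "error with the Docker daemon itself"
    if code == 126:
        return "contained command could not be invoked"
    if code == 127:
        return "contained command could not be found"
    # Anything else is passed through from the command inside the container.
    return f"exit code {code} of the contained command"


for code in (125, 126, 127, 3):
    print(code, "->", classify_docker_run_exit(code))
```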

Another set of exit code conventions can be found here:

| Code  | Meaning                        | Example                 | Comments                                                                                                     |
|-------|--------------------------------|-------------------------|--------------------------------------------------------------------------------------------------------------|
| 1     | Catchall for general errors    | let "var1 = 1/0"        | Miscellaneous errors, such as "divide by zero" and other impermissible operations                            |
| 2     | Misuse of shell builtins       | empty_function() {}     | Missing keyword or command, or permission problem (and diff return code on a failed binary file comparison). |
| 126   | Command invoked cannot execute | /dev/null               | Permission problem or command is not an executable                                                           |
| 127   | "command not found"            | illegal_command         | Possible problem with $PATH or a typo                                                                        |
| 128   | Invalid argument to exit       | exit 3.14159            | exit takes only integer args in the range 0 - 255 (see first footnote)                                       |
| 128+n | Fatal error signal "n"         | kill -9 $PPID of script | $? returns 137 (128 + 9)                                                                                     |
| 130   | Script terminated by Control-C | Ctl-C                   | Control-C is fatal error signal 2, (130 = 128 + 2, see above)                                                |
| 255*  | Exit status out of range       | exit -1                 | exit takes only integer args in the range 0 - 255                                                            |
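The table can be turned into a small lookup. The sketch below is only an illustration: `describe_shell_exit` is a hypothetical helper, and it assumes a POSIX system where Python's standard `signal` module knows the signal names (elsewhere it falls back to the bare signal number). Codes that are not in the table are reported as application-defined.

```python
import signal


def describe_shell_exit(code: int) -> str:
    """Describe an exit code using the shell conventions from the table.

    Hypothetical helper for illustration only.
    """
    fixed = {
        0: "success",
        1: "catchall for general errors",
        2: "misuse of shell builtins",
        126: "command invoked cannot execute",
        127: "command not found",
        128: "invalid argument to exit",
        255: "exit status out of range",
    }
    if code in fixed:
        return fixed[code]
    if 128 < code < 255:
        # 128 + n means the process was terminated by signal n.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "application-defined exit code"


for code in (1, 42, 130, 137):
    print(code, "->", describe_shell_exit(code))
```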

Applied to your examples:

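  • 137: most likely out of memory. 128 + 9 = 137, where 9 is SIGKILL, so the process was killed by a signal; in a container this is commonly the result of exceeding its memory limit and being killed.
  • 1: the command exited with 1. This can be caused by invalid configuration, an internal application error, or invalid input.
  • 42

    The Answer to the Ultimate Question of Life, the Universe, and Everything. More practically, 42 is not a reserved code, so its meaning is whatever your command defines it to be.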
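To make the arithmetic concrete, here is a minimal Python sketch that pulls the exit status out of the three messages quoted in the question and checks whether it has the 128 + n form. The helper names `exit_status` and `signal_number` are invented for this example, and the message format is assumed to be exactly the one shown in the question.

```python
import re

# The three (message, reason) pairs quoted in the question.
UPDATES = [
    ("Container exited with status 1", "REASON_COMMAND_EXECUTOR_FAILED"),
    ("Container exited with status 42", "REASON_COMMAND_EXECUTOR_FAILED"),
    ("Container exited with status 137", "REASON_COMMAND_EXECUTOR_FAILED"),
]

STATUS_RE = re.compile(r"Container exited with status (\d+)")


def exit_status(message: str):
    """Return the integer exit status found in a message, or None."""
    match = STATUS_RE.search(message)
    return int(match.group(1)) if match else None


def signal_number(status: int):
    """Return n if status looks like 128 + n (killed by signal n), else None."""
    return status - 128 if 128 < status < 255 else None


for message, reason in UPDATES:
    status = exit_status(message)
    print(status, reason, "signal:", signal_number(status))
```

For the third message this yields status 137 and signal 9 (SIGKILL); the first two are ordinary exit codes set by the command itself.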

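If you need more information to interpret a status code, look at the message field in the Mesos TaskStatus update; for example, Mesos puts information about OOM there. The same information can be found in the Mesos logs. To debug why a command returned a non-zero code, inspect the files stored in the executor sandbox, especially stderr / stdout or any command-specific logs.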

You want the Reason enum in mesos.proto (copied below):

    enum Reason {
      // TODO(jieyu): The default value when a caller doesn't check for
      // presence is 0 and so ideally the 0 reason is not a valid one.
      // Since this is not used anywhere, consider removing this reason.
      REASON_COMMAND_EXECUTOR_FAILED = 0;

      REASON_CONTAINER_LAUNCH_FAILED = 21;
      REASON_CONTAINER_LIMITATION = 19;
      REASON_CONTAINER_LIMITATION_DISK = 20;
      REASON_CONTAINER_LIMITATION_MEMORY = 8;
      REASON_CONTAINER_PREEMPTED = 17;
      REASON_CONTAINER_UPDATE_FAILED = 22;
      REASON_EXECUTOR_REGISTRATION_TIMEOUT = 23;
      REASON_EXECUTOR_REREGISTRATION_TIMEOUT = 24;
      REASON_EXECUTOR_TERMINATED = 1;
      REASON_EXECUTOR_UNREGISTERED = 2;
      REASON_FRAMEWORK_REMOVED = 3;
      REASON_GC_ERROR = 4;
      REASON_INVALID_FRAMEWORKID = 5;
      REASON_INVALID_OFFERS = 6;
      REASON_IO_SWITCHBOARD_EXITED = 27;
      REASON_MASTER_DISCONNECTED = 7;
      REASON_RECONCILIATION = 9;
      REASON_RESOURCES_UNKNOWN = 18;
      REASON_SLAVE_DISCONNECTED = 10;
      REASON_SLAVE_REMOVED = 11;
      REASON_SLAVE_RESTARTED = 12;
      REASON_SLAVE_UNKNOWN = 13;
      REASON_TASK_CHECK_STATUS_UPDATED = 28;
      REASON_TASK_GROUP_INVALID = 25;
      REASON_TASK_GROUP_UNAUTHORIZED = 26;
      REASON_TASK_INVALID = 14;
      REASON_TASK_UNAUTHORIZED = 15;
      REASON_TASK_UNKNOWN = 16;
    }