Memory issues when running Keras in Docker on a Mac

Running a Keras training algorithm in a Docker machine on a Mac produces a variety of memory problems.

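  • The training algorithm works fine on the same machine outside Docker

  • Raising the Docker memory from 1 GB to 8 GB (the machine's limit) makes no difference

  • Max video memory: 128 MB

  • Different backends pulled from Docker images, TensorFlow (0.10.0, 0.11.0) and Theano, all show similar errors

  • The list of other Docker processes that might conflict, as shown by docker ps -a, is empty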
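One way to check whether the container actually sees the configured memory is to read /proc/meminfo from inside it. The following is a minimal sketch, assuming a Linux container (Docker on a Mac runs containers inside a Linux VM). The function names parse_meminfo and mem_total_gib are made up for illustration.

```python
import re


def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to sizes in kB."""
    info = {}
    for line in text.splitlines():
        # Lines look like "MemTotal:        8175204 kB"
        match = re.match(r"(\w+):\s+(\d+)", line)
        if match:
            info[match.group(1)] = int(match.group(2))
    return info


def mem_total_gib(text):
    """Total memory visible to the process, in GiB."""
    return parse_meminfo(text)["MemTotal"] / (1024.0 ** 2)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print("MemTotal: %.2f GiB" % mem_total_gib(f.read()))
    except IOError:
        # Assumption: only reached outside Linux, e.g. directly on macOS
        print("/proc/meminfo not available (not a Linux environment)")
```

Run it inside the container the same way as train.py. If the reported total is well below 8 GB, the memory setting has probably not reached the VM that hosts the containers.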

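The problem is that I get much lower performance when running the same training algorithm in Docker on the same machine. All of the errors point to memory management problems.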

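1) The initial error was MemoryError. It occurred when the training script was run during the container's docker build, and the process exited before training even started.

2) Now I get ResourceExhaustedError: OOM when allocating tensor with shape[64,64,254,254] when running docker run 058785edc11d python train.py --run once the container is built (the run seems to get further):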

    Training..
    Train on 385 samples, validate on 40 samples
    Epoch 1/1
        sample_weight=sample_weight)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1046, in fit
        callback_metrics=callback_metrics)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 784, in _fit_loop
        outs = f(ins_batch)
      File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 641, in __call__
        updated = session.run(self.outputs + self.updates, feed_dict=feed_dict)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
        run_metadata_ptr)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
        feed_dict_string, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
        target_list, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,64,254,254]
         [[Node: transpose_2 = Transpose[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Conv2D, transpose_2/perm)]]
    Caused by op u'transpose_2', defined at:
      File "train.py", line 138, in <module>
        run(extract=extract_mode, cont=continue_)
      File "train.py", line 79, in run
        model = m.get_model(n_outputs=num_categories, input_size=size)
      File "/tmp/model.py", line 24, in get_model
        conv.add(Convolution2D(64, 3, 3, activation='relu', input_shape=(3, input_size, input_size)))
      File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 110, in add
        layer.create_input_layer(batch_input_shape, input_dtype)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 341, in create_input_layer
        self(x)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 485, in __call__
        self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 543, in add_inbound_node
        Node.create_node(self, inbound_layers, node_indices, tensor_indices)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 148, in create_node
        output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
      File "/usr/local/lib/python2.7/dist-packages/keras/layers/convolutional.py", line 341, in call
        filter_shape=self.W_shape)
      File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 997, in conv2d
        x = tf.transpose(x, (0, 3, 1, 2))
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1051, in transpose
        ret = gen_array_ops.transpose(a, perm, name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2489, in transpose
        result = _op_def_lib.apply_op("Transpose", x=x, perm=perm, name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
        op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
        original_op=self._default_original_op, op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
        self._traceback = _extract_stack()

3) After removing the exited Docker containers and reducing the training batch size, I get std::bad_alloc:

    Training..
    Train on 404 samples, validate on 21 samples
    Epoch 1/1
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
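To see how much a smaller batch could be expected to help, the size of the convolution output from error 2 can be computed for several batch sizes. This is only a rough sketch. It assumes the first dimension of shape[64,64,254,254] is the batch size and that values are 4-byte floats (DT_FLOAT). The helper name conv_output_bytes is hypothetical.

```python
def conv_output_bytes(batch_size, channels=64, height=254, width=254,
                      bytes_per_value=4):
    """Bytes needed for one copy of a [batch, channels, height, width] tensor."""
    return batch_size * channels * height * width * bytes_per_value


# The size shrinks linearly with the batch size.
for batch_size in (64, 32, 16, 8):
    mib = conv_output_bytes(batch_size) / (1024.0 ** 2)
    print("batch %2d -> %.1f MiB per copy of the conv output" % (batch_size, mib))
```

Training typically holds several tensors of this size at once (the layer output, its transposed copy, and gradients), so the real footprint is a multiple of these figures.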

4) Another common error is Resource exhausted: OOM when allocating tensor with shape[25088,4096]

    $ docker run f825faab715c python train.py --run --continue
    libdc1394 error: Failed to initialize libdc1394
    Using TensorFlow backend.
    /tmp/data.py:134: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
      val = np.random.choice(dataset_indx, size=number_of_samples)
    /tmp/data.py:127: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
      train = np.random.choice(dataset_indx, size=number_of_samples)
    Loading data..
    Number of categories: 2
    Number of samples 425
    Building and Compiling model..
    W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[25088,4096]
    W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[4096,4096]
    W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[25088,4096]
         [[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
    E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[25088,4096]
         [[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
    Training..
    Train on 404 samples, validate on 21 samples
    Epoch 1/1
    Traceback (most recent call last):
      File "train.py", line 138, in <module>
        run(extract=extract_mode, cont=continue_)
      File "train.py", line 100, in run
        sample_weight=None)
      File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
        sample_weight=sample_weight)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1046, in fit
        callback_metrics=callback_metrics)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 784, in _fit_loop
        outs = f(ins_batch)
      File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 641, in __call__
        updated = session.run(self.outputs + self.updates, feed_dict=feed_dict)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
        run_metadata_ptr)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
        feed_dict_string, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
        target_list, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[25088,4096]
         [[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
    Caused by op u'gradients/MatMul_grad/MatMul_1', defined at:
      File "train.py", line 138, in <module>
        run(extract=extract_mode, cont=continue_)
      File "train.py", line 100, in run
        sample_weight=None)
      File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
        sample_weight=sample_weight)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1031, in fit
        self._make_train_function()
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 658, in _make_train_function
        training_updates = self.optimizer.get_updates(trainable_weights, self.constraints, self.total_loss)
      File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 307, in get_updates
        grads = self.get_gradients(loss, params)
      File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 48, in get_gradients
        grads = K.gradients(loss, params)
      File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 666, in gradients
        return tf.gradients(loss, variables)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 478, in gradients
        in_grads = _AsList(grad_fn(op, *out_grads))
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_grad.py", line 637, in _MatMulGrad
        math_ops.matmul(op.inputs[0], grad, transpose_a=True))
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1346, in matmul
        name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1271, in _mat_mul
        transpose_b=transpose_b, name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
        op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
        original_op=self._default_original_op, op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
        self._traceback = _extract_stack()
    ...which was originally created as op u'MatMul', defined at:
      File "train.py", line 138, in <module>
        run(extract=extract_mode, cont=continue_)
      File "train.py", line 79, in run
        model = m.get_model(n_outputs=num_categories, input_size=size)
      File "/tmp/model.py", line 70, in get_model
        conv.add(Dense(4096))
      File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 142, in add
        output_tensor = layer(self.outputs[0])
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 485, in __call__
        self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 543, in add_inbound_node
        Node.create_node(self, inbound_layers, node_indices, tensor_indices)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 148, in create_node
        output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
      File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 628, in call
        output = K.dot(x, self.W)
      File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 214, in dot
        out = tf.matmul(x, y)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1346, in matmul
        name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1271, in _mat_mul
        transpose_b=transpose_b, name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
        op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
        original_op=self._default_original_op, op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
        self._traceback = _extract_stack()

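Your training algorithm may simply need more than 8 GB of memory. I have run into this kind of problem before, and increasing the memory always solved it. Your error, ResourceExhaustedError: OOM when allocating tensor with shape[64,64,254,254], clearly indicates that you have run out of resources and that the application needs more memory to run.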
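As a rough check on how much memory is involved, the size of each tensor named in the error messages can be computed from its shape. This sketch assumes 4-byte values, since the nodes in the logs are DT_FLOAT. The helper name tensor_bytes and the labels in the dictionary are made up for illustration.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_value=4):
    """Bytes needed to hold one dense tensor of the given shape."""
    return reduce(mul, shape, 1) * bytes_per_value


# Shapes copied from the OOM messages in the question.
reported_shapes = {
    "conv output (transpose_2)": [64, 64, 254, 254],
    "dense gradient (MatMul_1)": [25088, 4096],
    "dense 4096x4096": [4096, 4096],
}

for label, shape in sorted(reported_shapes.items()):
    gib = tensor_bytes(shape) / (1024.0 ** 3)
    print("%-28s %-22s %.3f GiB" % (label, shape, gib))
```

Each of these is individually well below 8 GB, so the failing allocation is presumably one of many tensors alive at the same time. That is consistent with the process needing more total memory than the container can provide.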