Spark nodes communicating over the wrong IP address (Docker)

I have a Spark (DataStax Enterprise) cluster created with Docker, tied together using docker-compose. This is for local development purposes only.

The containers are on their own Docker network: 172.18.0.0/16. I'm on a Mac running Docker Toolbox, and I can reach the containers directly from my machine because I manually added a route to 172.18.0.0/16 via vboxnet0, the virtual network VirtualBox provides on the Mac.
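For reference, the manual route was added roughly like this (a sketch; the gateway is the docker-machine VM's address on vboxnet0):

```shell
# Route the Docker bridge network via the docker-machine VM, so the
# Mac can reach containers on 172.18.0.0/16 directly.
sudo route -n add -net 172.18.0.0/16 192.168.99.101
```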

My side of the vboxnet0 interface has IP 192.168.99.1; the docker-machine side has 192.168.99.101.

This all works fine: the master web UI comes up at 172.18.0.2:7080, and all the nodes show up correctly with their 172.x IP addresses (and continue to do so if I scale out, e.g. via docker-compose scale spark=5).

However, when I submit a job, e.g.:

    $SPARK_HOME/bin/spark-submit --master spark://172.18.0.2:7077 --class myapp.Main \
        ./target/scala-2.10/myapp-assembly-1.0.0-SNAPSHOT.jar

it is very slow (I assume due to retries), and I see errors like this until it eventually succeeds:

    16/09/16 13:01:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 5, 192.168.99.101): org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 locations. Most recent failure cause:
        at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
        at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
        at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
        at com.datastax.spark.connector.rdd.CassandraJoinRDD.compute(CassandraJoinRDD.scala:224)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: Failed to connect to /192.168.99.101:35306
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
    Caused by: java.net.ConnectException: Connection refused: /192.168.99.101:35306
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
        ... 1 more

However, I don't know why it is trying to reach those resources at the gateway address, 192.168.99.101.

I also see output like this, which again does not show the IP addresses I expect:

    16/09/16 13:01:36 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 4, 192.168.99.101, partition 1,PROCESS_LOCAL, 2316 bytes)
    16/09/16 13:01:36 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 5, 192.168.99.101, partition 0,NODE_LOCAL, 2089 bytes)
    16/09/16 13:01:36 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.99.101:39885 (size: 7.5 KB, free: 511.1 MB)
    16/09/16 13:01:51 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.99.101:35306 (size: 7.5 KB, free: 511.0 MB)

I tried setting SPARK_LOCAL_IP=192.168.99.1 on the Mac, and setting each node's own 172.x address in spark-env.sh on the nodes, but neither helped.
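For completeness, what I tried on the driver side looked roughly like this (a sketch; spark.driver.host is the standard Spark property for the address the driver advertises to the cluster, and using it here is a guess at a fix, not something I have confirmed works):

```shell
# On the Mac (driver side): advertise a routable address instead of
# whatever address Spark picks up on its own.
export SPARK_LOCAL_IP=192.168.99.1

$SPARK_HOME/bin/spark-submit \
  --master spark://172.18.0.2:7077 \
  --conf spark.driver.host=192.168.99.1 \
  --class myapp.Main \
  ./target/scala-2.10/myapp-assembly-1.0.0-SNAPSHOT.jar
```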

I can reach all the nodes directly from the Mac, and each node can route back to both the docker-machine VM (192.168.99.101) and the Mac (192.168.99.1). Every node can also reach every other node, by name and by IP address.

Am I right that the BlockManager appears to be using the wrong IP address? Is there a way to force it to use the correct one, rather than the gateway address it is somehow picking up?

Edit: just to add, I also tried setting the block manager port to a hard-coded value, on the theory that the apparently randomly assigned one would never match anything I could EXPOSE, but that did not appear to work either. I set -Dspark.blockManager.port=7005 in both SPARK_MASTER_OPTS and SPARK_WORKER_OPTS.
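The port pinning was done along these lines (a sketch of the spark-env.sh change; the port number 7005 is arbitrary and would still need to be exposed in the compose file):

```shell
# In spark-env.sh on master and workers: pin the block manager port
# so it can be exposed/mapped deterministically instead of being random.
export SPARK_MASTER_OPTS="-Dspark.blockManager.port=7005"
export SPARK_WORKER_OPTS="-Dspark.blockManager.port=7005"
```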

Edit 2: If I put Java, Spark, and my app JAR onto a fresh empty container (on the same 172.18/16 network) and launch via spark-submit from there (i.e. no traffic through the gateway, containers only), everything works fine. So it appears to be some issue with the submitting host's gateway IP when submitting from the other side of the gateway.
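The working in-network submission from Edit 2 was along these lines (a sketch; the network name spark_default and the image name my-spark-image are assumptions based on typical docker-compose defaults, not my exact setup):

```shell
# Run spark-submit from a throwaway container attached to the same
# 172.18.0.0/16 network, so no traffic crosses the vboxnet0 gateway.
docker run --rm --net=spark_default \
  -v "$PWD/target:/app" \
  my-spark-image \
  spark-submit --master spark://172.18.0.2:7077 \
    --class myapp.Main /app/scala-2.10/myapp-assembly-1.0.0-SNAPSHOT.jar
```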