Accessing HDFS from docker-hadoop-spark-workbench via Zeppelin

I have installed https://github.com/big-data-europe/docker-hadoop-spark-workbench

and then started it with docker-compose up. I browsed the various URLs mentioned in the git readme, and they all came up fine.

I then started a local Apache Zeppelin:

    ./bin/zeppelin.sh start

In the Zeppelin interpreter settings I navigated to the Spark interpreter and updated the master property to point to the local cluster installed with docker:

master: updated from local[*] to spark://localhost:8080

Then I ran the following code in a notebook:

    import org.apache.hadoop.fs.{FileSystem, Path}

    FileSystem.get(sc.hadoopConfiguration)
      .listStatus(new Path("hdfs:///"))
      .foreach(x => println(x.getPath))

and I get this exception in the Zeppelin logs:

    INFO [2017-12-15 18:06:35,704] ({pool-2-thread-2} Paragraph.java[jobRun]:362) - run paragraph 20171212-200101_1553252595 using null org.apache.zeppelin.interpreter.LazyOpenInterpreter@32d09a20
    WARN [2017-12-15 18:07:37,717] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2064) - Job 20171212-200101_1553252595 is finished, status: ERROR, exception: null, result: %text java.lang.NullPointerException
        at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
        at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
        at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:398)
        at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:387)
        at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:146)
        at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:843)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
        at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

How can I access HDFS from Zeppelin and from Java/Spark code?

The cause of the exception is that the sparkSession object in Zeppelin is null for some reason.

Reference: https://github.com/apache/zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java

    private SparkContext createSparkContext_2() {
      return (SparkContext) Utils.invokeMethod(sparkSession, "sparkContext");
    }

It is probably a configuration-related issue. Please cross-check the Zeppelin interpreter settings/configuration against the Spark cluster setup, and make sure Spark itself is working properly.
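One setting worth double-checking is the master URL itself: in a standalone Spark deployment, port 8080 is normally the master's web UI, while the master RPC endpoint that drivers connect to usually listens on 7077. A sketch of the Zeppelin Spark interpreter properties that typically work (the host and ports here are assumptions based on standalone defaults; verify them against your docker-compose port mappings):

    # Zeppelin Spark interpreter properties (assumed defaults; adjust to your setup)
    master    spark://localhost:7077    # master RPC port, not the 8080 web UI

After changing the property, restart the Spark interpreter from the Zeppelin interpreter settings page so the new master URL takes effect.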

Reference: https://zeppelin.apache.org/docs/latest/interpreter/spark.html

Hope this helps.