Writing ORC-format files from Spark

Posted by 社区管理员 on 2016/09/08 16:22

How can I write an RDD out to an ORC file in Spark? I know Spark SQL can do it, but is there any other way?
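(For reference, the Spark SQL route I mean is roughly the following; this sketch assumes Spark 1.5+ with spark-hive on the classpath and sc being the usual SparkContext:)

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Convert the RDD of case classes to a DataFrame and write it as ORC.
    ex.toDF().write.orc("/user/tmp/orc")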

What I'd like to know is: can this be done with the saveAsNewAPIHadoopFile API?

My current code is:

import org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat
import org.apache.hadoop.io.NullWritable

ex.map(x => (NullWritable.get(), x))
  .repartition(partition.toInt)
  .saveAsNewAPIHadoopFile(
    "/user/tmp/orc",
    classOf[NullWritable],
    classOf[Model],
    classOf[OrcNewOutputFormat])



case class Model(uid: String, province: String, city: String)



where ex is an RDD[Model].

Running it fails with:

Model cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
at org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat$OrcRecordWriter.write(OrcNewOutputFormat.java:37)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1113)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
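From the stack trace it looks like OrcNewOutputFormat.write casts each value to OrcSerde$OrcSerdeRow, i.e. it expects values that have already been serialized by OrcSerde, not raw Model objects. A minimal, untested sketch of what that might look like (it assumes the reflection ObjectInspector can handle a Scala case class like Model):

    import org.apache.hadoop.hive.ql.io.orc.{OrcNewOutputFormat, OrcSerde}
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory
    import org.apache.hadoop.io.{NullWritable, Writable}

    ex.mapPartitions { iter =>
      // Create the serde and inspector once per partition, on the executors
      // rather than the driver.
      val serde = new OrcSerde
      val inspector = ObjectInspectorFactory.getReflectionObjectInspector(
        classOf[Model], ObjectInspectorFactory.ObjectInspectorOptions.JAVA)
      // serialize() returns a Writable (internally an OrcSerdeRow), which is
      // what OrcNewOutputFormat.write expects as the value.
      iter.map(m => (NullWritable.get(), serde.serialize(m, inspector)))
    }.saveAsNewAPIHadoopFile(
      "/user/tmp/orc",
      classOf[NullWritable],
      classOf[Writable],
      classOf[OrcNewOutputFormat])

Is this the right direction, or is the DataFrame route the only supported way?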


