Apache Spark 0.9.1 Released: Cluster Computing Environment

Published April 19, 2014

Apache Spark 0.9.1 has been released. This is a maintenance release, consisting mainly of bug fixes, performance improvements, and stability improvements for YARN.


Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make it superior for certain workloads: Spark keeps distributed datasets in memory, which lets it serve interactive queries and also optimize iterative workloads.
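The idea can be sketched in plain Python (a toy illustration only, not the Spark API): the dataset is materialized in memory once, and every subsequent query or iterative pass reuses it instead of re-reading from disk.

```python
# Toy illustration (plain Python, NOT the Spark API) of an in-memory
# dataset: load once, then run many queries against the cached records
# instead of re-reading them from storage on every pass.
class CachedDataset:
    def __init__(self, records):
        self.records = list(records)  # materialized in memory once

    def filter(self, pred):
        # derived datasets also stay in memory
        return CachedDataset(r for r in self.records if pred(r))

    def count(self):
        return len(self.records)

data = CachedDataset(["spark core", "hadoop mapreduce", "spark streaming"])
# interactive queries and iterative passes reuse the same in-memory data
assert data.filter(lambda r: "spark" in r).count() == 2
```

In real Spark the equivalent is calling `cache()` on an RDD, after which repeated actions reuse the in-memory partitions.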


Improvements and bug fixes in Spark core

  • Fixed hash collision bug in external spilling [SPARK-1113]

  • Fixed conflict with Spark’s log4j for users relying on other logging backends [SPARK-1190]

  • Fixed GraphX missing from Spark assembly jar in Maven builds

  • Fixed silent failures due to map output status exceeding Akka frame size [SPARK-1244]

  • Removed Spark’s unnecessary direct dependency on ASM [SPARK-782]

  • Removed metrics-ganglia from default build due to LGPL license conflict [SPARK-1167]

  • Fixed bug in distribution tarball not containing spark assembly jar [SPARK-1184]

  • Fixed bug causing infinite NullPointerException failures due to a null in map output locations [SPARK-1124]

  • Fixed bugs in post-job cleanup of scheduler’s data structures

  • Added the ability to make distribution tarballs with Tachyon bundled in them. This eases the deployment of Spark with Tachyon.

  • Added support for HBase’s TableOutputFormat and other OutputFormats that extend Configurable

Stability improvements for Spark-on-YARN

Several bug fixes were made to YARN deployment mode:

  • Fixed bug in reading/writing files that the yarn user does not have permissions to but the submitting user does [SPARK-1051]

  • Fixed bug making Spark application stall when YARN registration fails [SPARK-1032]

  • Fixed race condition in getting HDFS delegation tokens in yarn-client mode [SPARK-1203]

  • Fixed bug in yarn-client mode not exiting properly [SPARK-1049]

  • Fixed regression bug in ADD_JAR environment variable not correctly adding custom jars [SPARK-1089]

Improvements to other deployment scenarios

  • Added support for C3 EC2 instance types in Spark’s EC2 scripts used to launch EC2 clusters

  • Fixed bug in jar URL validation in standalone mode.

Optimizations to MLLib

  • Optimized memory usage of ALS [MLLIB-25]

  • Optimized computation of YtY for implicit ALS [SPARK-1237]

  • Support for negative implicit input in ALS [MLLIB-22]

  • Setting of a random seed in ALS [SPARK-1238]

  • Faster construction of features with intercept [SPARK-1260]

  • Check for intercept and weight in GLM’s addIntercept [SPARK-1327]

Bug fixes and better API parity for PySpark

  • Fixed bug in Python de-pickling [SPARK-1135]

  • Fixed bug in serialization of strings longer than 64K [SPARK-1043]

  • Fixed bug that made jobs hang when base file is not available [SPARK-1025]

  • Added missing RDD operations to PySpark: top, zip, foldByKey, repartition, coalesce, getStorageLevel, setName, and toDebugString

Improvements to documentation

  • Streaming: Added documentation on running streaming application from spark-shell

  • YARN: Added documentation on running spark-shell in yarn-client mode with secured HDFS


Credits

  • Aaron Davidson - Bug fix in mergeCombiners

  • Aaron Kimball - Improvements to streaming programming guide

  • Andrew Ash - Bug fix in worker registration logging and improvements to docs

  • Andrew Or - Bug fixes in map output status size and hash collision in external spilling, and improvements to streaming programming guide

  • Andrew Tulloch - Minor updates to MLLib

  • Bijay Bisht - Fix for hadoop-client for Hadoop < 1.0.1 and for bug in Spark on Mesos + CDH4.5.0

  • Bouke van der Bijl - Bug fix in Python depickling

  • Bryn Keller - Support for HBase’s TableOutputFormat

  • Chen Chao - Bug fix in spark-shell script, and improvements to streaming programming guide

  • Christian Lundgren - Support for C3 EC2 instance type

  • Diana Carroll - Improvements to PySpark programming guide

  • Emtiaz Ahmed - UI bug fix

  • Frank Dai - Code cleanup for MLLib

  • Henry Saputra - Changes in use of Scala Option

  • jianghan - Bug fixes in Java examples

  • Josh Rosen - Bug fix in PySpark string serialization and exception handling

  • Jyotiska NK - Improvements to PySpark doc and examples

  • Kay Ousterhout - Multiple bug fixes in scheduler’s handling of task failures

  • Kousuke Saruta - Use of https to access github

  • Mark Grover - Bug fix in distribution tar.gz

  • Matei Zaharia - Bug fixes in handling of task failures due to NPE, and cleanup of scheduler data structures

  • Nan Zhu - Bug fixes in PySpark RDD.takeSample and in adding JARs using ADD_JAR, and improvements to docs

  • Nick Lanham - Added ability to make distribution tarballs with Tachyon

  • Patrick Wendell - Bug fixes in ASM shading, fixes for log4j initialization, removing Ganglia due to LGPL license, and other miscellaneous bug fixes

  • Prabin Banka - RDD.zip and other missing RDD operations in PySpark

  • Prashant Sharma - RDD.foldByKey in PySpark, and other PySpark doc improvements

  • Qiuzhuang - Bug fix in standalone worker

  • Raymond Liu - Changed working directory in ZookeeperPersistenceEngine

  • Reynold Xin - Improvements to docs and test infrastructure

  • Sandy Ryza - Multiple important YARN bug fixes and improvements

  • Sean Owen - Bug fixes and improvements for MLLib’s ALS

  • Shixiong Zhu - Fixed thread-unsafe use of SimpleDateFormat

  • shiyun.wxm - UI bug fix

  • Stevo Slavić - Bug fix in Windows run-example script

  • Tathagata Das - Improvements to streaming docs

  • Tom Graves - Bug fixes in YARN deployment modes

  • Xiangrui Meng - Improvements to ALS and GLM, and MLLib programming guide

Source: 开源中国社区 (OSChina) [http://www.oschina.net]; please credit when reprinting.



This release came out ten days ago, and the news is only being posted here now, haha. Could it be timed to coincide with the Spark summit CSDN is hosting today?