HBase offers both scalability and the economy of sharing the same infrastructure as Hadoop, but will its flaws hold it back? NoSQL experts square off.

HBase is modeled after Google BigTable and is part of the world's most popular big data processing platform, Apache Hadoop. But will this pedigree guarantee HBase a dominant role in the competitive and fast-growing NoSQL database market?

Michael Hausenblas of MapR argues that Hadoop's popularity and HBase's scalability and consistency ensure success. The growing HBase community will surpass other open-source movements and will overcome a few technical wrinkles that have yet to be worked out.

Jonathan Ellis of DataStax, the support provider behind open-source Cassandra, argues that HBase's flaws are too numerous and too intrinsic to Hadoop's HDFS architecture to overcome. These flaws will forever limit HBase's applicability to high-velocity workloads, he says.

Read what our two NoSQL experts have to say, and then weigh in with your opinion in the comments section below.


For The Motion

Michael Hausenblas
Chief Data Engineer EMEA, MapR Technologies

Integration With Hadoop Will Drive Adoption

The answer to the question is a crystal-clear "Yes, but…"

In order to appreciate this response, we need to step back a bit and understand the question in context. Both Martin Fowler, in 2011, and Mike Stonebraker, in 2005, took up the polyglot persistence argument that "one size does not fit all."

Hence, I'm going to interpret the "dominant" in the question not in the sense of the market-share measures applied to relational databases over the past 10 years, but along the lines of, "Will Apache HBase be used across a wider range of use cases and have a bigger community behind it than other NoSQL databases?"


This is a bold assertion given that there are more than 100 different NoSQL options to choose from, including MongoDB, Riak, Couchbase, Cassandra and many, many others. But in this big-data era, the trend is away from specialized information silos to large-scale processing of varied data, so even a popular solution such as MongoDB will be surpassed by HBase.

Why? MongoDB has well-documented scalability issues, and with the fast-growing adoption of Hadoop, the NoSQL solution that integrates directly with Hadoop has a marked advantage in scale and popularity. HBase has a huge and diverse community under its belt in all respects: users, developers, multiple commercial vendors and availability in the cloud, for example through Amazon Web Services (AWS).

Historically, HBase and Cassandra have a lot in common. HBase was created in 2007 at Powerset (later acquired by Microsoft), was initially part of Hadoop and then became a top-level Apache project. Cassandra originated at Facebook in 2007, was open sourced and then incubated at Apache, and is nowadays also a top-level project. Both HBase and Cassandra are wide-column key-value datastores that excel at ingesting and serving huge volumes of data while being horizontally scalable, robust and elastic.

There are philosophical differences in the architectures: Cassandra borrows many design elements from Amazon's Dynamo system, has an eventual consistency model and is write-optimized, while HBase is a Google BigTable clone with read-optimization and strong consistency. An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for its internal use.

From an application developer's point of view, HBase is preferable as it offers strong consistency, making life easier. One of the misconceptions about eventual consistency is that it improves write speed: given sustained write traffic, latency is affected and one ends up paying the "eventual consistency tax" without getting its benefits.

There are some technical limitations with almost all NoSQL solutions, like compactions affecting consistent low latency, inability to shard automatically, reliability issues and long recovery times for node outages. Here at MapR, we've created a "next version" of enterprise HBase that includes instant recovery, seamless sharding and high availability, and that gets rid of compactions. We brought it into GA under the label M7 in May 2013 and it's available in the cloud via AWS Elastic MapReduce.

Last but not least, HBase has -- through its legacy as a Hadoop contribution project -- a strong and solid integration into the entire Hadoop ecosystem, including Apache Hive and Apache Pig.

Summarizing, HBase will be the dominant NoSQL platform for use cases where fast and small-size updates and look-ups at scale are required. Recent innovations have also provided architectural advantages to eliminate compactions and provide truly decentralized co-ordination.

Michael Hausenblas is chief data engineer, EMEA, at MapR Technologies. His background is in large-scale data integration research and development, advocacy and standardization.

Against The Motion

Jonathan Ellis
Co-founder & CTO,

HBase Is Plagued By Too Many Flaws

NoSQL includes several specialties such as graph databases and document stores where HBase does not compete, but even within its category of partitioned row store, HBase lags behind the leaders. The technical shortcomings driving HBase's lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.


Engineering Problems

-- Operations are complex and failure prone. Deploying HBase involves configuring at a minimum a Zookeeper ensemble, primary HMaster, secondary HMaster, RegionServers, active NameNode, standby NameNode, HDFS quorum journal manager and DataNodes. Installation can be automated, but if it's too difficult to install without help, how are you going to troubleshoot it when something goes wrong during, for instance, RegionServer failover or a lower-level NameNode failure? HBase requires substantial expertise to even know what to monitor, and God help you if you need regular backups.

-- RegionServer failover takes 10 to 15 minutes. HBase partitions rows into regions, each managed by a RegionServer. The RegionServer is a single point of failure for its region; when it goes down, a new one must be selected and write-ahead logs must be replayed before writes or reads can be served again.


-- Developing against HBase is painful. HBase's API is clunky and Java centric. Non-Java clients are relegated to the second-class Thrift or REST gateways. Contrast that with the Cassandra Query Language, which offers developers a familiar, productive experience in all languages.

-- The HBase community is fragmented. The Apache mainline is widely understood to be unstable. Cloudera, Hortonworks, and advanced users maintain their own patch trees on top. Leadership is divided and there is no clear roadmap. Conversely, the open-source Cassandra community includes committers from DataStax, Netflix, Spotify, Blue Mountain Capital, and others working together without cliques or forks.
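The failover point above (one server per region, with a log replay before the region can serve again) can be pictured with a toy model. This is only an illustrative sketch in Python, not HBase code: the `Region` class, its fields and the `locate` helper are hypothetical names, and real recovery also involves detecting the failure and reassigning the region, which is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical stand-in for one region: a row-key range [start, end)."""
    start: str
    end: str
    wal: list = field(default_factory=list)       # write-ahead log, assumed durable
    memstore: dict = field(default_factory=dict)  # in-memory state, lost on a crash
    online: bool = True

    def owns(self, row_key: str) -> bool:
        return self.start <= row_key < self.end

    def put(self, row_key: str, value: str) -> None:
        if not self.online:
            raise RuntimeError("region unavailable: failover in progress")
        self.wal.append((row_key, value))  # log first, then apply in memory
        self.memstore[row_key] = value

    def get(self, row_key: str):
        if not self.online:
            raise RuntimeError("region unavailable: failover in progress")
        return self.memstore.get(row_key)

    def crash(self) -> None:
        """Simulate losing the serving process: memory is gone, the log remains."""
        self.memstore = {}
        self.online = False

    def recover(self) -> None:
        """Replay the log in order to rebuild state, then start serving again."""
        for row_key, value in self.wal:
            self.memstore[row_key] = value
        self.online = True


def locate(regions, row_key):
    """Return the single region responsible for a row key."""
    for region in regions:
        if region.owns(row_key):
            return region
    raise KeyError(row_key)


if __name__ == "__main__":
    # "{" sorts just after "z" in ASCII, so the second range covers "m" through "z...".
    regions = [Region("a", "m"), Region("m", "{")]
    locate(regions, "alice").put("alice", "v1")
    locate(regions, "zoe").put("zoe", "v2")
    regions[0].crash()
    print(locate(regions, "zoe").get("zoe"))      # the other region keeps serving
    regions[0].recover()
    print(locate(regions, "alice").get("alice"))  # available again after replay
```

In this model, every key in the crashed range is unreadable and unwritable between `crash()` and `recover()`, while keys in other regions are unaffected. The length of that window in a real cluster is what the 10-to-15-minute figure refers to.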

Overall, the engineering gap between HBase and other NoSQL platforms has increased since I've been observing the NoSQL ecosystem. When I first evaluated them, I would have put HBase six months behind Cassandra in engineering progress, but today that gap has widened to about two years.

Architectural Flaws

-- Master-oriented design makes HBase operationally inflexible. Routing all reads and writes through the RegionServer master means that active/active asynchronous replication across multiple datacenters is not possible for HBase, nor can you perform workload separation across different replicas in a cluster. By contrast, Cassandra's peer-to-peer replication allows seamless integration of Hadoop, Solr and Cassandra with no ETL while allowing you to opt in to lightweight transactions in the rare cases when you need linearizability.

-- Failover means downtime. Even one minute of downtime is simply not acceptable in many applications, and this is an intrinsic problem with HBase's design; each RegionServer is a single point of failure. A fully distributed design instead means that when one replica goes down, there is no need for special-case histrionics to recover; the system keeps functioning normally with the other replicas and can catch up the failed one later.
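The "keep functioning with the other replicas and catch up the failed one later" behavior can be sketched as a quorum write with stored hints. This is a simplified illustration in Python, not Cassandra's implementation: the `Replica` and `Coordinator` classes are hypothetical, and the sketch omits the timestamps a real system would need to avoid overwriting newer data with an older hint.

```python
class Replica:
    """Hypothetical replica holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value


class Coordinator:
    """Sends each write to every replica and succeeds once a majority acknowledges."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.quorum = len(replicas) // 2 + 1
        self.hints = []  # (replica, key, value) for writes a down replica missed

    def write(self, key, value):
        acks = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                acks += 1
            except ConnectionError:
                # Remember the missed write so it can be delivered later.
                self.hints.append((replica, key, value))
        if acks < self.quorum:
            # Simplification: writes already applied are not rolled back.
            raise RuntimeError("not enough replicas available")
        return acks

    def replay_hints(self):
        """Deliver stored writes to replicas that are back; keep the rest."""
        pending = []
        for replica, key, value in self.hints:
            try:
                replica.write(key, value)
            except ConnectionError:
                pending.append((replica, key, value))
        self.hints = pending


if __name__ == "__main__":
    a, b, c = Replica("a"), Replica("b"), Replica("c")
    coordinator = Coordinator([a, b, c])
    c.up = False
    print(coordinator.write("user:1", "v1"))  # number of acknowledgements
    c.up = True
    coordinator.replay_hints()
    print(c.data)
```

With three replicas and one down, the write still succeeds because two acknowledgements form a majority; the third replica receives the value when `replay_hints()` runs after it returns.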

