Big Data Debate: Will HBase Dominate NoSQL?

Posted by oschina on 2013/08/07 07:41 (11 paragraphs; translation completed 08-10)

HBase offers both scalability and the economy of sharing the same infrastructure as Hadoop, but will its flaws hold it back? NoSQL experts square off.

HBase is modeled after Google BigTable and is part of the world's most popular big data processing platform, Apache Hadoop. But will this pedigree guarantee HBase a dominant role in the competitive and fast-growing NoSQL database market?

Michael Hausenblas of MapR argues that Hadoop's popularity and HBase's scalability and consistency ensure success. The growing HBase community will surpass other open-source movements and will overcome a few technical wrinkles that have yet to be worked out.

Jonathan Ellis of DataStax, the support provider behind open-source Cassandra, argues that HBase flaws are too numerous and intrinsic to Hadoop's HDFS architecture to overcome. These flaws will forever limit HBase's applicability to high-velocity workloads, he says.

Read what our two NoSQL experts have to say, and then weigh in with your opinion in the comments section below.

For The Motion

Michael Hausenblas
Chief Data Engineer EMEA, MapR Technologies

Integration With Hadoop Will Drive Adoption


The answer to the question is a crystal-clear "Yes, but…"

To appreciate this response, we need to step back a bit and understand the question in context. Both Martin Fowler, in 2011, and Mike Stonebraker, in 2005, took up the polyglot persistence argument that "one size does not fit all."

Hence, I'm going to interpret the "dominant" in the question not in the sense of the market-share measures applied to relational databases over the past 10 years, but along the lines of, "Will Apache HBase be used across a wider range of use cases and have a bigger community behind it than other NoSQL databases?"

This is a bold assertion given that there are more than 100 different NoSQL options to choose from, including MongoDB, Riak, Couchbase, Cassandra and many, many others. But in this big-data era, the trend is away from specialized information silos to large-scale processing of varied data, so even a popular solution such as MongoDB will be surpassed by HBase.

Why? MongoDB has well-documented scalability issues, and with the fast-growing adoption of Hadoop, the NoSQL solution that integrates directly with Hadoop has a marked advantage in scale and popularity. HBase has a huge and diverse community behind it in every respect: users, developers, multiple commercial vendors and availability in the cloud, for example through Amazon Web Services (AWS).

Historically, HBase and Cassandra have a lot in common. HBase was created in 2007 at Powerset (later acquired by Microsoft); it was initially part of Hadoop and then became an Apache top-level project. Cassandra originated at Facebook in 2007, was open sourced and then incubated at Apache, and is nowadays also a top-level project. Both HBase and Cassandra are wide-column key-value datastores that excel at ingesting and serving huge volumes of data while being horizontally scalable, robust and elastic.

There are philosophical differences in the architectures: Cassandra borrows many design elements from Amazon's Dynamo system, has an eventual consistency model and is write-optimized, while HBase is a Google BigTable clone that is read-optimized and strongly consistent. An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for their internal use.

From an application developer's point of view, HBase is preferable as it offers strong consistency, making life easier. One of the misconceptions about eventual consistency is that it improves write speed: given sustained write traffic, latency is affected and one ends up paying the "eventual consistency tax" without getting its benefits.
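
The difference between strong and eventual consistency can be made concrete with a small sketch. The following Python toy model is neither HBase nor Cassandra code: the class `ReplicatedRegister`, the function `always_reads_latest` and the parameters `n`, `w` and `r` are hypothetical names chosen for illustration. It assumes a Dynamo-style scheme in which a write reaches `w` of `n` replicas and a read consults `r` replicas, and it checks by brute force whether a read can ever miss the latest write.

```python
from itertools import combinations


class ReplicatedRegister:
    """Toy model of a single value stored on n replicas (illustrative only)."""

    def __init__(self, n):
        # Each replica holds a (version, value) pair; version 0 means "never written".
        self.replicas = [(0, None)] * n
        self.version = 0

    def write(self, value, targets):
        """Apply a new version of the value to the replicas listed in `targets`."""
        self.version += 1
        for i in targets:
            self.replicas[i] = (self.version, value)

    def read(self, targets):
        """Return the value with the highest version among the replicas in `targets`."""
        newest = max((self.replicas[i] for i in targets), key=lambda pair: pair[0])
        return newest[1]


def always_reads_latest(n, w, r):
    """Try every choice of w written replicas and r consulted replicas."""
    for written in combinations(range(n), w):
        for consulted in combinations(range(n), r):
            reg = ReplicatedRegister(n)
            reg.write("old", range(n))  # every replica starts out holding "old"
            reg.write("new", written)   # only w replicas receive "new"
            if reg.read(consulted) != "new":
                return False            # a stale read is possible
    return True
```

With three replicas, `always_reads_latest(3, 2, 2)` holds because any two sets of two replicas overlap, whereas `always_reads_latest(3, 1, 1)` does not. In this model a read is guaranteed to see the latest write only when `r + w > n`; smaller values trade that guarantee for fewer replicas touched per operation.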

There are some technical limitations with almost all NoSQL solutions, like compactions affecting consistent low latency, inability to shard automatically, reliability issues and long recovery times for node outages. Here at MapR, we've created a "next version" of enterprise HBase that includes instant recovery, seamless sharding and high availability, and that gets rid of compactions. We brought it into GA under the label M7 in May 2013 and it's available in the cloud via AWS Elastic MapReduce.

Last but not least, HBase has -- through its legacy as a Hadoop contribution project -- a strong and solid integration into the entire Hadoop ecosystem, including Apache Hive and Apache Pig.

Summarizing, HBase will be the dominant NoSQL platform for use cases where fast and small-size updates and look-ups at scale are required. Recent innovations have also provided architectural advantages to eliminate compactions and provide truly decentralized co-ordination.

Michael Hausenblas is chief data engineer, EMEA, at MapR Technologies. His background is in large-scale data integration research and development, advocacy and standardization.

Against The Motion

Jonathan Ellis
Co-founder & CTO, DataStax

HBase Is Plagued By Too Many Flaws


NoSQL includes several specialties such as graph databases and document stores where HBase does not compete, but even within its category of partitioned row store, HBase lags behind the leaders. The technical shortcomings driving HBase's lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.

Engineering Problems

-- Operations are complex and failure prone. Deploying HBase involves configuring at a minimum a ZooKeeper ensemble, primary HMaster, secondary HMaster, RegionServers, active NameNode, standby NameNode, HDFS quorum journal manager and DataNodes. Installation can be automated, but if it's too difficult to install without help, how are you going to troubleshoot it when something goes wrong during, for instance, RegionServer failover or a lower-level NameNode failure? HBase requires substantial expertise to even know what to monitor, and God help you if you need regular backups.

-- RegionServer failover takes 10 to 15 minutes. HBase partitions rows into regions, each managed by a RegionServer. The RegionServer is a single point of failure for its region; when it goes down, a new one must be selected and write-ahead logs must be replayed before writes or reads can be served again.

-- Developing against HBase is painful. HBase's API is clunky and Java-centric. Non-Java clients are relegated to the second-class Thrift or REST gateways. Contrast that with the Cassandra Query Language, which offers developers a familiar, productive experience in all languages.

-- The HBase community is fragmented. The Apache mainline is widely understood to be unstable. Cloudera, Hortonworks, and advanced users maintain their own patch trees on top. Leadership is divided and there is no clear roadmap. Conversely, the open-source Cassandra community includes committers from DataStax, Netflix, Spotify, Blue Mountain Capital, and others working together without cliques or forks.
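
The region model behind the failover item in the list above can be sketched in a few lines. This is a simplified illustration, not HBase's client API: `RegionMap`, its methods and the server names are hypothetical, and the sketch assumes rows are split into contiguous key ranges with exactly one server responsible for each range.

```python
from bisect import bisect_right


class RegionMap:
    """Toy routing table: sorted key ranges, each owned by exactly one server."""

    def __init__(self, split_keys, servers):
        # split_keys are the sorted row keys at which a new region begins;
        # k split keys define k + 1 regions, so k + 1 servers are expected.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = list(split_keys)
        self.servers = list(servers)
        self.down = set()

    def region_for(self, row_key):
        """Return the index of the region whose key range contains row_key."""
        return bisect_right(self.split_keys, row_key)

    def mark_down(self, server):
        """Record that a server has failed."""
        self.down.add(server)

    def reassign(self, region, new_server):
        """Hand a region to another server (log replay is not modeled here)."""
        self.servers[region] = new_server

    def get_server(self, row_key):
        """Route a request; fail if the single owner of the region is down."""
        server = self.servers[self.region_for(row_key)]
        if server in self.down:
            raise RuntimeError("region unavailable until it is reassigned")
        return server
```

In this model, marking one server as down makes every row key in its range unreachable until `reassign` is called, while rows in other regions are unaffected. The window between the failure and the reassignment is the downtime the list item describes.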

Overall, the engineering gap between HBase and other NoSQL platforms has increased since I've been observing the NoSQL ecosystem. When I first evaluated them, I would have put HBase six months behind Cassandra in engineering progress, but today that lead has widened to about two years.

Architectural Flaws

-- Master-oriented design makes HBase operationally inflexible. Routing all reads and writes through the RegionServer master means that active/active asynchronous replication across multiple datacenters is not possible for HBase, nor can you perform workload separation across different replicas in a cluster. By contrast, Cassandra's peer-to-peer replication allows seamless integration of Hadoop, Solr and Cassandra with no ETL while allowing you to opt in to lightweight transactions in the rare cases when you need linearizability.

-- Failover means downtime. Even one minute of downtime is simply not acceptable in many applications, and this is an intrinsic problem with HBase's design; each RegionServer is a single point of failure. A fully distributed design instead means that when one replica goes down, there is no need for special-case histrionics to recover; the system keeps functioning normally with the other replicas and can catch up the failed one later.
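
The replica-based alternative described in the last item can be sketched the same way. Again this is a toy model rather than Cassandra's implementation: `ReplicaSet` and its methods are hypothetical names, and the catch-up step is a deliberate simplification in the spirit of replaying missed writes to a node once it returns.

```python
class ReplicaSet:
    """Toy model: every key is stored on all replicas; any live replica can serve."""

    def __init__(self, names):
        self.data = {name: {} for name in names}     # per-replica key/value store
        self.missed = {name: [] for name in names}   # writes a down replica missed
        self.up = set(names)

    def fail(self, name):
        """Take a replica offline."""
        self.up.discard(name)

    def write(self, key, value):
        """Write to all live replicas and remember the write for the others."""
        if not self.up:
            raise RuntimeError("no replica available")
        for name in self.data:
            if name in self.up:
                self.data[name][key] = value
            else:
                self.missed[name].append((key, value))

    def read(self, key):
        """Serve the read from the first live replica."""
        for name in self.data:
            if name in self.up:
                return self.data[name].get(key)
        raise RuntimeError("no replica available")

    def recover(self, name):
        """Replay the missed writes in order, then bring the replica back."""
        for key, value in self.missed[name]:
            self.data[name][key] = value
        self.missed[name].clear()
        self.up.add(name)
```

Here a failed replica causes no interruption: reads and writes continue against the remaining replicas, and `recover` brings the failed node up to date afterwards. The sketch ignores conflicting concurrent writes, which is where the consistency trade-offs debated in this article come in.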


Comments (25)

newlife867
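Michael Hausenblas: I have heard him speak and chatted with him a little, and my impression is that he oversells things.
From MapR's standpoint, of course he is going to push HBase hard,
because customizing Hadoop is MapR's core business.

Using and maintaining Hadoop and HBase is notoriously troublesome.
That is exactly why companies like MapR exist.

That said, Hadoop is basically the only open-source, widely used, fairly general-purpose big data processing tool at the moment.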
萌龙

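Quoting BreakJoa, who quoted 萌龙, who quoted BreakJoa, who quoted 唐阳, who quoted viney:

viney: Riak is without doubt my favorite, followed by Redis.
唐阳: I don't know how to use either of them.
BreakJoa: Redis seems to run on a single machine only, right?
萌龙: Are you sure you understand Redis?
BreakJoa: Not very well; I only know that it exists. I'm not sure whether my understanding is right: with the multi-machine master-slave mechanism, it seems that only one Redis machine is really serving, and for concurrent processing you apparently have to write your own hashing algorithm... This may be a misunderstanding on my part.

Redis indeed does not provide distributed storage as complete as HBase's, but you have to realize how much trouble it is to install and maintain HBase; once something goes wrong, it is a nightmare for the operations staff. Redis is designed to be lightweight and currently supports replication natively. Add Redis connection pooling and read/write splitting, and distributed storage and the like are no problem either. The key point is that it is lightweight and much easier to maintain.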
梦远寄从无

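Quoting 萌龙, who quoted BreakJoa, who quoted 唐阳, who quoted viney:

viney: Riak is without doubt my favorite, followed by Redis.
唐阳: I don't know how to use either of them.
BreakJoa: Redis seems to run on a single machine only, right?
萌龙: Are you sure you understand Redis?

Not very well; I only know that it exists. I'm not sure whether my understanding is right: with the multi-machine master-slave mechanism, it seems that only one Redis machine is really serving, and for concurrent processing you apparently have to write your own hashing algorithm... This may be a misunderstanding on my part.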
萌龙

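Quoting BreakJoa, who quoted 唐阳, who quoted viney:

viney: Riak is without doubt my favorite, followed by Redis.
唐阳: I don't know how to use either of them.
BreakJoa: Redis seems to run on a single machine only, right?

Are you sure you understand Redis?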
bewdx3
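It's August 2013 and people are still trying to resurrect Cassandra. For semi-structured data at the throughput of online Internet services, HBase really is the only solution, even though it is not a perfect one.
Cassandra fans, go ahead and deploy it; I'll be waiting to read the next "Why xx fled Cassandra" post.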
Hansoul
Architectural flaws are a fundamental weakness, and the same goes for Hadoop.
jackerx
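Engineering complexity really is a problem. It seems that the people working on HBase at all the big companies are Chinese, heh.
The architectural flaws are exactly what the HBase community is working to improve right now.
Hadoop will be optimized specifically for HBase.
It is widely used abroad, has a pure pedigree, and rides on Hadoop's coattails.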
梦远寄从无

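Quoting 唐阳, who quoted viney:

viney: Riak is without doubt my favorite, followed by Redis.
唐阳: I don't know how to use either of them.

Redis seems to run on a single machine only, right?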
FeiFan
I'm not optimistic about it.
技术揣摩
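I hear HBase is very strong at big data processing, but transaction support for complex business logic has always been a fundamental weakness of NoSQL. If Hadoop is very popular, then it is only natural for its standard companion to be popular too.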