Though there is some decent documentation, I found setting up Hive with an HBase back-end to be somewhat fiddly. Hopefully this guide will help you get started more quickly. This article presumes that you already have HBase set up; if not, see my HBase quickstart.

Note: these directions are for development. They don’t use HDFS, for example. For a full guide on production deployment, see the excellent CDH4 directions.

Linux

sudo apt-get install hive

# create directory that Hive stores data in by default
sudo mkdir -p /user/hive/warehouse
sudo chown -R myusername:myusername /user/hive/warehouse/

# copy the HBase and ZooKeeper JARs into Hive's lib (adjust versions to your install)
sudo cp /usr/share/hbase/hbase-0.92.1.jar /usr/lib/hive/lib
sudo cp /usr/share/hadoop-zookeeper/zookeeper-3.4.3.jar /usr/lib/hive/lib
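
Depending on your Hive build, you may also need to register those JARs and point Hive at ZooKeeper when starting the CLI. A minimal sketch, assuming a local single-node ZooKeeper:

# register the auxiliary JARs and the ZooKeeper quorum (localhost is an assumption)
hive --auxpath /usr/lib/hive/lib/hbase-0.92.1.jar,/usr/lib/hive/lib/zookeeper-3.4.3.jar \
    --hiveconf hbase.zookeeper.quorum=localhost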

OSX

brew install hive
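
As on Linux, Hive needs the HBase and ZooKeeper JARs on its classpath. A minimal sketch, assuming a typical Homebrew layout; the cellar paths and version globs here are assumptions, so adjust them to whatever brew actually installed:

# copy the HBase and ZooKeeper JARs into Hive's lib (paths are assumptions)
cp /usr/local/Cellar/hbase/*/libexec/hbase-*.jar /usr/local/Cellar/hive/*/libexec/lib/
cp /usr/local/Cellar/zookeeper/*/libexec/zookeeper-*.jar /usr/local/Cellar/hive/*/libexec/lib/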

Connect to HBase

Now you can fire up Hive with the hive command and create a table that's backed by HBase. For this example, my HBase table is called test and has a column family of integer values called values. Note that, because the table is declared EXTERNAL, dropping and creating it only affects Hive metadata; no actual changes are made in HBase.
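
If the test table doesn't exist in HBase yet, you can create it from the HBase shell first; a minimal sketch, with an illustrative row key and zeroed counts:

# create a 'test' table with a 'values' column family, then add one sample row
echo "create 'test', 'values'" | hbase shell
echo "put 'test', 'row1', 'values:comments', '0'" | hbase shell
echo "put 'test', 'row1', 'values:likes', '0'" | hbase shell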

DROP TABLE IF EXISTS test;

CREATE EXTERNAL TABLE
    test(key string, values map<string, int>)
STORED BY
    'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
    "hbase.columns.mapping" = ":key,values:"
    )
TBLPROPERTIES (
    "hbase.table.name" = "test"
    );

SELECT * FROM test;

>c4ca4-0000001-79879483-000000000124-000000000000000000000000000025607621 {'comments':0, 'likes':0}
>c4ca4-0000001-79879483-000000000124-000000000000000000000000000025607622 {'comments':0, 'likes':0}

Simple MapReduce Example

Given the raw data above, here is an example GROUP/SUM MapReduce job that sums up the various HBase columns in the values column family. It creates a view to handle blowing apart the HBase rowkey. You can then use an INSERT OVERWRITE statement to write the results back into HBase (see the sketch after the query).

CREATE VIEW
    test_view AS
SELECT
    substr(key, 0, 36) as org_date_asset_prefix,
    split(key, '-')[2] as inverse_date_str,
    values['comments'] as comments,
    values['likes'] as likes
FROM
    test;

SELECT
    org_date_asset_prefix,
    map(
      'comments', SUM(comments),
      'likes', SUM(likes)
    ) as stats
FROM
    test_view
GROUP BY
    org_date_asset_prefix;
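
To write those sums back into HBase, you can map a second Hive table onto an HBase table and INSERT OVERWRITE into it. A minimal sketch: the test_summary table (and the HBase table backing it) is hypothetical here, and stats is declared bigint since SUM returns bigint.

CREATE EXTERNAL TABLE
    test_summary(key string, stats map<string, bigint>)
STORED BY
    'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
    "hbase.columns.mapping" = ":key,stats:"
    )
TBLPROPERTIES (
    "hbase.table.name" = "test_summary"
    );

INSERT OVERWRITE TABLE test_summary
SELECT
    org_date_asset_prefix,
    map('comments', SUM(comments), 'likes', SUM(likes))
FROM
    test_view
GROUP BY
    org_date_asset_prefix;

Note that against HBase, OVERWRITE only overwrites matching row keys; it doesn't truncate the table first.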

Thrift API

If you want to connect to Hive via Thrift, you can start the Thrift service with hive --service hiveserver. Hiver is a nice little Python wrapper around the Thrift API.

import hiver

# connect to the Thrift service started above; fill in your own host and port
client = hiver.connect(host, port)
client.execute('SHOW TABLES')
rows = client.fetchAll()
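
From there you can run the aggregation from earlier through the same client. A short sketch; with hiveserver, each row comes back as a tab-delimited string:

# run the GROUP/SUM query over Thrift and unpack the rows
client.execute("""
    SELECT org_date_asset_prefix, SUM(comments), SUM(likes)
    FROM test_view
    GROUP BY org_date_asset_prefix
""")
for row in client.fetchAll():
    org_date_asset_prefix, comments, likes = row.split('\t')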
