在 Oracle 数据库中实现 MapReduce 已翻译 100%

看客-申君健 投递于 2014/09/13 09:05 (共 6 段, 翻译完成于 09-15)
阅读 4767
收藏 82
3
加载中

The Map-Reduce model has become a popular way for programmers to describe and implement parallel programs. These custom map-reduce programs are often used to process a large data set in parallel. This post shows how to implement Map-Reduce Programs within the Oracle database using Parallel Pipelined Table Functions and parallel operations.

The Theory

Pipelined Table Functions were introduced in Oracle 9i as a way of embedding procedural logic within a data flow. At a logical level, a Table Function is a function that can appear in the FROM clause and thus functions as a table returning a stream of rows. Table Functions can also take a stream of rows as an input. Since Pipelined Table Functions are embedded in the data flow they allow data to be 'streamed' to a SQL statement avoiding intermediate materialization in most cases. Additionally, Pipelined Table Functions can be parallelized.

已有 1 人翻译此段
我来翻译

To parallelize a Table Function the programmer specifies a key to repartition the input data. Table Functions can be implemented natively in PL/SQL, Java, and C. You can find more information and examples about Table Functions and the functionality mentioned above at the following URL:

http://download.oracle.com/docs/cd/B10501_01/appdev.920/a96624/08_subs.htm#19677

Pipelined Table Functions have been used by customers for several releases and are a core part of Oracle's extensibility infrastructure. Both external users and Oracle Development have used Table Functions as an efficient and easy way of extending the database kernel.

Examples of table functions being used within Oracle are the implementation of a number of features in Oracle Spatial and Oracle Warehouse Builder. Oracle Spatial usages include spatial joins and several spatial data mining operations. Oracle Warehouse Builder allows end users to leverage Table Functions to parallelize procedural logic in data flows such as the Match-Merge algorithm and other row-by-row processing algorithms.

已有 1 人翻译此段
我来翻译

Step-by-Step Example

All examples are available in plain text in this file: omr.sql.

To illustrate the usage of parallelism, and Pipelined Table Functions to write a Map-Reduce algorithm inside the Oracle database, we describe how to implement the canonical map-reduce example: a word count. For those unfamiliar with the example, the goal of word count is to return all distinct words within a set of documents as well as a count of how often this word occurs within this set of documents.

The procedural code in this word count example is implemented in PL/SQL but, as said before, Oracle allows you to pick your language of choice to implement said procedural logic.

已有 1 人翻译此段
我来翻译

Step 1 - Setting up the Environment

We will be looking at a set of documents, these documents can be either files outside of the database, or they can be stored as Secure Files/CLOB columns within the database. Within this table our documents are stored, effectively reflecting a file system.

In this case we are going to create an table within the database using the following definition:

CREATE TABLE documents (a CLOB)      
  LOB(a) STORE AS SECUREFILE(TABLESPACE sysaux);

Each row in this table corresponds to a single document. We populate this table with a very simple corpus resulting in 3 documents with the text shown here:

INSERT INTO documents VALUES ('abc def');      
INSERT INTO documents VALUES ('def ghi');        
INSERT INTO documents VALUES ('ghi jkl');        
commit;

The end result of both the map function and the reduce table function are going to live in a package, keeping the code nice and tidy. To show the steps to be taken we will take snippets from the overall package and show those in the section to follow. The actual package will contain a set of types, which are required for the code to work. All code was tested on Oracle Database 11g (11.1.0.6).

Download the full code here.

The following figures show the package being deployed.

已有 1 人翻译此段
我来翻译

Step 2 - Creating the Mapper and the Reducer

First we need to create a generic function to "map" (as in map-reduce) or tokenize a document. Note that the goal is not to show the best map function, but how this will work in principle in the database. This specific map function is very basic and better implementations may be found elsewhere. 

You can use the aggregation engine to get the results and only use the mapper. A query and a result would look like this:

The aggregation is done in SQL, no reducer required.

Of course, you could write your own aggregation Table Function to count the occurrences of words in a document. That is what you would do if you were writing the map-reduce program without leveraging the Oracle aggregation engine as we did before. This aggregation Table Function is the reducer of the map-reduce program unit.

The Table Function specifies that it's input be partitioned by word and could (to use the Oracle execution engine's sort) ask for the data to ordered or clustered by word. We show a sample count program in this post to complete the example.

已有 1 人翻译此段
我来翻译

Step 3 - In-Database Map-Reduce

After you have completed both the mapper and the reducer you are ready to do the full map-reduce in the database. Running a query using this Table Function will give us a parallel workload on external documents, doing what the typical map-reduce programs do.

Summary

Oracle Table Functions are a proven technology, used by many internal and external parties to extend Oracle Database 11g.

Oracle Table Functions are a robust scalable way to implement Map-Reduce within the Oracle database and leverage the scalability of the Oracle Parallel Execution framework. Using this in combination with SQL provides an efficient and simple mechanism for database developers to develop Map-Reduce functionality within the environment they understand and with the languages they know.

Download the code here: omr.sql. For the example, I ran this in OE (as you can see on the SQL screens). No special privileges required. 

已有 1 人翻译此段
我来翻译
本文中的所有译文仅用于学习和交流目的,转载请务必注明文章译者、出处、和本文链接。
我们的翻译工作遵照 CC 协议,如果我们的工作有侵犯到您的权益,请及时联系我们。
加载中

评论(4)

大连馋师
大连馋师
传统数据库系统中的表函数、临时表用的好,确实能解决很多问题!有意想不到的效果和成就感。
中山野鬼
中山野鬼
数据库的应用者是不需要关心,数据的实际组织和计算原理,而更多关注业务描述。map-reduce这种涉及到数据流和具体处理方法层面的思想,并不适合给数据库的应用者开发。开放的越多,逻辑越混乱,效果还越差。哈。
小小Summer
小小Summer
值得推荐的东西
看客-申君健
看客-申君健
翻译完才发现全文没实实在在讲技术,有点伤心呀!大家自己看代码吧!
table function也可译为表函数,但我用“table函数”!
返回顶部
顶部