Translated on 2013/01/26 11:32
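Implementing the pairs approach is straightforward. For each line passed in when the map function is called, we split on whitespace to create a String array. The next step is to construct two loops. The outer loop iterates over each word in the array, and the inner loop iterates over the "neighbors" of the current word. The number of iterations of the inner loop is dictated by the size of the "window" we use to capture the neighbors of the current word. At the bottom of each iteration of the inner loop, we emit a WordPair object (the current word on the left, the neighbor word on the right) as the key and a count of one as the value. Here is the code for the Pairs implementation: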
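import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// WordPair is the article's custom key type holding a word and one neighbor.
public class PairsOccurrenceMapper extends Mapper<LongWritable, Text, WordPair, IntWritable> {
    private WordPair wordPair = new WordPair();
    private IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Window size: how many words on each side count as neighbors (default 2).
        int neighbors = context.getConfiguration().getInt("neighbors", 2);
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length > 1) {
            for (int i = 0; i < tokens.length; i++) {
                wordPair.setWord(tokens[i]);

                // Clamp the window to the bounds of the line.
                int start = (i - neighbors < 0) ? 0 : i - neighbors;
                int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
                for (int j = start; j <= end; j++) {
                    if (j == i) continue; // a word is not its own neighbor
                    wordPair.setNeighbor(tokens[j]);
                    context.write(wordPair, ONE);
                }
            }
        }
    }
}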
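To see what the window logic produces without running a Hadoop job, here is a minimal sketch in plain Java. It assumes the same whitespace split and window clamping as the mapper above, but replaces WordPair and the Hadoop context with simple "word:neighbor" strings collected in a list. The names PairsSketch and emitPairs are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the pairs window logic (no Hadoop types involved).
final class PairsSketch {

    // Returns "word:neighbor" strings in the order a pairs mapper would emit them.
    static List<String> emitPairs(String line, int neighbors) {
        List<String> emitted = new ArrayList<>();
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2) {
            return emitted; // a single token has no neighbors
        }
        for (int i = 0; i < tokens.length; i++) {
            // Clamp the window to the bounds of the line.
            int start = Math.max(0, i - neighbors);
            int end = Math.min(tokens.length - 1, i + neighbors);
            for (int j = start; j <= end; j++) {
                if (j != i) {
                    emitted.add(tokens[i] + ":" + tokens[j]);
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Window of 1: each word pairs only with its immediate neighbors.
        System.out.println(emitPairs("the quick brown fox", 1));
    }
}
```

With a window of 1 the four-word line yields six pairs: [the:quick, quick:the, quick:brown, brown:quick, brown:fox, fox:brown]. Each of these would be written with a count of one in the real mapper.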
The Reducer for the Pairs implementation simply sums all of the counts for a given WordPair key:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class PairsReducer extends Reducer<WordPair, IntWritable, WordPair, IntWritable> {
    private IntWritable totalCount = new IntWritable();

    @Override
    protected void reduce(WordPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Add up every count emitted for this (word, neighbor) pair.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        totalCount.set(count);
        context.write(key, totalCount);
    }
}
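The combined effect of grouping by key and summing can be imitated in a few lines of plain Java. This is only a sketch under simplifying assumptions: pairs are represented as "word:neighbor" strings, every emitted pair carries a count of one, and a TreeMap stands in for the framework's grouping of identical keys. PairCountSketch and sumCounts are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of what grouping plus the pairs reducer achieve together.
final class PairCountSketch {

    // Adds one to the running total of a pair each time that pair was emitted.
    static Map<String, Integer> sumCounts(List<String> emittedPairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String pair : emittedPairs) {
            totals.merge(pair, 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // "a:b" was emitted twice, "b:a" once.
        System.out.println(sumCounts(List.of("a:b", "b:a", "a:b")));
    }
}
```

Running the sketch prints {a:b=2, b:a=1}, one total per distinct pair.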
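Implementing the stripes approach to co-occurrence is equally straightforward. The approach is the same, except that all of the "neighbor" words are collected in a hash map, with the neighbor word as the key and an integer count as the value. Only when all of the neighbors of a given word have been collected (at the bottom of the outer loop) are the word and its map emitted. Here is the code for the Stripes implementation: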
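import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StripesOccurrenceMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    private MapWritable occurrenceMap = new MapWritable();
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Window size: how many words on each side count as neighbors (default 2).
        int neighbors = context.getConfiguration().getInt("neighbors", 2);
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length > 1) {
            for (int i = 0; i < tokens.length; i++) {
                word.set(tokens[i]);
                occurrenceMap.clear(); // start a fresh stripe for this word

                int start = (i - neighbors < 0) ? 0 : i - neighbors;
                int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
                for (int j = start; j <= end; j++) {
                    if (j == i) continue; // a word is not its own neighbor
                    Text neighbor = new Text(tokens[j]);
                    if (occurrenceMap.containsKey(neighbor)) {
                        IntWritable count = (IntWritable) occurrenceMap.get(neighbor);
                        count.set(count.get() + 1);
                    } else {
                        occurrenceMap.put(neighbor, new IntWritable(1));
                    }
                }
                // Emit the word together with all of its neighbor counts.
                context.write(word, occurrenceMap);
            }
        }
    }
}

The Reducer for the Stripes approach is slightly more involved, because it has to iterate over a collection of maps and then, for each map, iterate over all of the entries in that map.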
public class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    private MapWritable incrementingMap = new MapWritable();

    @Override
    protected void reduce(Text key, Iterable<MapWritable> values, Context context) throws IOException, InterruptedException {
        incrementingMap.clear();
        for (MapWritable value : values) {
            addAll(value);
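The stripes idea can also be tried out without a cluster. The following is a minimal sketch in plain Java, assuming java.util maps in place of MapWritable. StripesSketch, stripeFor and mergeStripes are hypothetical names and are not taken from the article's code. It shows one way the per-word map could be built, and how several maps for the same word might be summed entry by entry, which is the kind of work the description of the stripes reducer calls for.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of building and merging stripes (no Hadoop types involved).
final class StripesSketch {

    // Builds the stripe (neighbor -> count) for the token at index i.
    static Map<String, Integer> stripeFor(String[] tokens, int i, int neighbors) {
        Map<String, Integer> stripe = new TreeMap<>();
        int start = Math.max(0, i - neighbors);
        int end = Math.min(tokens.length - 1, i + neighbors);
        for (int j = start; j <= end; j++) {
            if (j != i) {
                stripe.merge(tokens[j], 1, Integer::sum);
            }
        }
        return stripe;
    }

    // Element-wise sum of several stripes that belong to the same word.
    static Map<String, Integer> mergeStripes(List<Map<String, Integer>> stripes) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> entry : stripe.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Stripe for "b" in two different lines, then merged as a reducer would.
        Map<String, Integer> first = stripeFor("a b a c".split(" "), 1, 2);
        Map<String, Integer> second = stripeFor("d b a".split(" "), 1, 2);
        System.out.println(first);
        System.out.println(second);
        System.out.println(mergeStripes(List.of(first, second)));
    }
}
```

For the word "b" the first line gives {a=2, c=1}, the second gives {a=1, d=1}, and the merged stripe is {a=3, c=1, d=1}. A TreeMap is used here only so that the printed order is predictable.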
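Comparing the two approaches, we can see that the Pairs algorithm generates more key-value pairs than the Stripes algorithm. In addition, the Pairs algorithm records each individual co-occurrence event as its own record, whereas the Stripes algorithm gathers all of the co-occurrences for a given word into a single record. Both the Pairs and the Stripes implementations are well suited to using a Combiner. Because the results produced by both are commutative and associative [Translator's note: data that can be used with a combiner must satisfy the commutative and associative laws; I forget which document stated this], we can simply reuse each reducer as the Combiner. As mentioned earlier, co-occurrence matrices are applicable well beyond text processing, and they are an important tool to have at hand. Thank you for reading this far.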
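As a footnote to the Combiner remark above, the following sketch illustrates why summing tolerates being applied in stages. It assumes plain integer counts and two hypothetical mappers that each emitted a handful of ones for the same key. CombinerSketch and its reduce method are illustrative names, not Hadoop API.

```java
import java.util.List;

// Plain-Java sketch: summing partial sums gives the same total as summing everything at once.
final class CombinerSketch {

    // The same summing logic serves as both combiner and reducer.
    static int reduce(List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> fromMapperA = List.of(1, 1, 1);
        List<Integer> fromMapperB = List.of(1, 1);

        // Without a combiner: the reducer sees all five ones.
        int direct = reduce(List.of(1, 1, 1, 1, 1));

        // With a combiner: each mapper's output is pre-summed, then the reducer sums the partials.
        int combined = reduce(List.of(reduce(fromMapperA), reduce(fromMapperB)));

        System.out.println(direct + " " + combined);
    }
}
```

Both values come out as 5. Because addition is commutative and associative, the order and grouping of the partial sums do not change the result, which is the property the paragraph above relies on.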