mapreduce - Hadoop job with a single mapper and two different reducers
I have a big document corpus as input to a MapReduce job (old Hadoop API). In the mapper I can produce two kinds of output: one counting words and one producing MinHash signatures. What I need is to give the word counting output to one reducer class (which sums the occurrences) and to give the MinHash signatures to another reducer class (which performs calculations on the size of the buckets). The input is the same corpus of documents, and there is no need to process it twice. MultipleOutputs does not seem to be the solution, and I cannot find a way to feed the mapper's output to two different reduce classes.

In a nutshell, I need the following:
                     /--> wordcounting reducer    --> wordcount output
    input --> mapper
                     \--> minhash buckets reducer --> minhash output

Is there a way to use the same mapper, or should I split this into two jobs?
You can do it, but it involves some coding tricks (a Partitioner and a prefix convention). The idea is to have the mapper output each word prefixed with "w:" and each minhash prefixed with "m:", and then use a custom Partitioner to decide which partition (aka reducer) a record needs to go into.
Pseudo code for the main method:

    set the number of reducers to 2

Mapper:

    ... parse the word ...
    ... generate the minhash ...
    context.write("w:" + word, 1);
    context.write("m:" + minhash, 1);
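As a concrete sketch of the mapper in the old mapred API (the pseudo code's context.write belongs to the new API, so OutputCollector is used instead; computeMinhash is a hypothetical stand-in for your signature code):

    import java.io.IOException;
    import java.util.Collections;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TaggingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String doc = value.toString();

            // Emit one "w:"-prefixed record per word for the word counting branch.
            StringTokenizer tokenizer = new StringTokenizer(doc);
            while (tokenizer.hasMoreTokens()) {
                output.collect(new Text("w:" + tokenizer.nextToken()), ONE);
            }

            // Emit one "m:"-prefixed record per minhash signature.
            for (String minhash : computeMinhash(doc)) {
                output.collect(new Text("m:" + minhash), ONE);
            }
        }

        // Hypothetical helper: plug in your actual MinHash computation here.
        private Iterable<String> computeMinhash(String doc) {
            return Collections.<String>emptyList();
        }
    }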
if key starts "w:" { homecoming 0; } // reducer 1 if key starts "m:" { homecoming 1; } // reducer 2 combiner:
if key starts "w:" { iterate on values , sum; context.write(key, sum); return;} iterate , context.write of values reducer:
if key starts "w:" { iterate on values , sum; context.write(key, sum); return;} if key starts "m:" { perform min hash logic } in output part-0000 word counts , part-0001 min hash calculations.
In the output, part-00000 will contain the word counts and part-00001 the min hash calculations.

Unfortunately it is not possible to provide two different Reducer classes, but with the if statement and the prefix convention you can simulate it.
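Wiring it all together in an old-API driver might look like this (the class names are the hypothetical ones from the sketches above):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountMinhashDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountMinhashDriver.class);
            conf.setJobName("wordcount-and-minhash");

            conf.setMapperClass(TaggingMapper.class);
            conf.setCombinerClass(PrefixCombiner.class);
            conf.setPartitionerClass(PrefixPartitioner.class);
            conf.setReducerClass(PrefixReducer.class);
            conf.setNumReduceTasks(2); // one partition per output type

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }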
Also, having only 2 reducers might not be efficient from a performance point of view; in that case you can play with the Partitioner to allocate, say, the first N partitions to the word count.
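For instance, a variant of the partitioner along those lines, with a hypothetical 4-of-6 split (it assumes setNumReduceTasks is called with a value larger than WORD_PARTITIONS):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class WeightedPrefixPartitioner implements Partitioner<Text, IntWritable> {

        // Hypothetical split: with setNumReduceTasks(6), reducers 0-3 handle
        // word counts and reducers 4-5 handle minhash signatures.
        private static final int WORD_PARTITIONS = 4;

        public void configure(JobConf job) {
        }

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            int hash = key.toString().hashCode() & Integer.MAX_VALUE;
            if (key.toString().startsWith("w:")) {
                return hash % WORD_PARTITIONS;
            }
            return WORD_PARTITIONS + hash % (numPartitions - WORD_PARTITIONS);
        }
    }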
If you do not like the prefix idea, you would need to implement a secondary sort with a custom WritableComparable class for the key. That is worth the effort only in more sophisticated cases.
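For completeness, a minimal sketch of such a composite key, assuming a byte tag plus the actual key text (a full secondary-sort setup would also need a matching partitioner and grouping comparator, omitted here):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key replacing the string prefix convention.
    public class TaggedKey implements WritableComparable<TaggedKey> {

        public static final byte WORD = 0;
        public static final byte MINHASH = 1;

        private byte type;
        private Text key = new Text();

        public void set(byte type, String key) {
            this.type = type;
            this.key.set(key);
        }

        public byte getType() {
            return type;
        }

        public void write(DataOutput out) throws IOException {
            out.writeByte(type);
            key.write(out);
        }

        public void readFields(DataInput in) throws IOException {
            type = in.readByte();
            key.readFields(in);
        }

        // Sort by record type first, then by the key itself.
        public int compareTo(TaggedKey other) {
            int cmp = Byte.compare(type, other.type);
            return cmp != 0 ? cmp : key.compareTo(other.key);
        }

        @Override
        public int hashCode() {
            return 31 * type + key.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof TaggedKey && compareTo((TaggedKey) o) == 0;
        }
    }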
hadoop mapreduce