Tuesday, 15 January 2013

MapReduce: Hadoop job with a single mapper and two different reducers

I have a big document corpus as input to a MapReduce job (old Hadoop API). In the mapper I can produce two kinds of output: one counting words, the other producing minhash signatures. What I need is:

give the word-counting output to one reducer class (a reducer that sums the occurrences) and give the minhash signatures to another reducer class (one performing calculations on the size of the buckets).

The input is the same corpus of documents, and there is no need to process it twice. I think MultipleOutputs is not the solution, and I cannot find a way to send the mapper output to two different Reducer classes.

In a nutshell, I need the following:

                  / word-counting reducer --> wordcount output
input --> mapper
                  \ minhash buckets reducer --> minhash output

Is there a way to use the same mapper, or should I split this into two jobs?

You can do it, but it involves some coding tricks (a custom Partitioner and a prefix convention). The idea is to have the mapper output each word prefixed with "w:" and each minhash prefixed with "m:", and then use a custom Partitioner to decide which partition (i.e. which reducer) a key needs to go to.

Pseudo code for the main method:

set the number of reducers to 2 and register the custom partitioner on the job configuration

Mapper:

... parse word ...
... generate minhash ...
context.write("w:" + word, 1);
context.write("m:" + minhash, 1);
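Stripped of the Hadoop plumbing, the mapper logic above just emits every key with a type prefix. A minimal framework-free sketch of that logic (the `minhashOf` helper is a hypothetical stand-in for a real minhash signature computation):

```java
import java.util.ArrayList;
import java.util.List;

class PrefixMapperSketch {
    // Hypothetical stand-in for a real minhash signature computation.
    static String minhashOf(String doc) {
        return Integer.toHexString(doc.hashCode());
    }

    // Emits ("w:" + word, 1) for each word, plus one ("m:" + signature, 1)
    // per document; each pair is returned as a {key, value} array.
    static List<String[]> map(String doc) {
        List<String[]> out = new ArrayList<>();
        for (String word : doc.split("\\s+")) {
            out.add(new String[] { "w:" + word, "1" });
        }
        out.add(new String[] { "m:" + minhashOf(doc), "1" });
        return out;
    }
}
```

In the real job, the two `context.write` calls replace the list; the prefix is the only thing the downstream partitioner needs.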

Partitioner:

if key starts with "w:" { return 0; } // reducer 1
if key starts with "m:" { return 1; } // reducer 2
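The partition decision is a pure function of the key, so it can be sketched without Hadoop. This assumes exactly two reducers; the hash fallback for unexpected keys mirrors what a default hash partitioner would do:

```java
class PrefixPartitionerSketch {
    // Routes "w:" keys to partition 0 (word-count reducer) and
    // "m:" keys to partition 1 (minhash reducer).
    static int getPartition(String key, int numPartitions) {
        if (key.startsWith("w:")) return 0;
        if (key.startsWith("m:")) return 1;
        // Fallback for unexpected keys: hash, like a default hash partitioner.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

In a real job this body goes inside `getPartition` of a class extending the Partitioner of whichever API you use, and the job must be configured to use it.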

Combiner:

if key starts with "w:" { iterate over the values and sum; context.write(key, sum); return; }
otherwise: iterate and context.write each value unchanged
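The point of the combiner branch is that only the word-count keys may be pre-aggregated; minhash values must pass through untouched. A sketch of that logic as a plain function:

```java
import java.util.ArrayList;
import java.util.List;

class PrefixCombinerSketch {
    // Pre-sums counts for "w:" keys; passes values through for any other key.
    static List<Integer> combine(String key, List<Integer> values) {
        if (key.startsWith("w:")) {
            int sum = 0;
            for (int v : values) sum += v;
            List<Integer> out = new ArrayList<>();
            out.add(sum);
            return out;
        }
        return values; // minhash values must reach the reducer unchanged
    }
}
```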

Reducer:

if key starts with "w:" { iterate over the values and sum; context.write(key, sum); return; }
if key starts with "m:" { perform the minhash logic }
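The single reducer class dispatches on the prefix. A framework-free sketch of that dispatch, where the minhash branch stands in for the real bucket calculation (here it simply returns the bucket size, a hypothetical placeholder):

```java
import java.util.List;

class PrefixReducerSketch {
    // Word-count branch: sum the counts.
    // Minhash branch: placeholder bucket calculation (bucket size).
    static int reduce(String key, List<Integer> values) {
        if (key.startsWith("w:")) {
            int sum = 0;
            for (int v : values) sum += v;
            return sum;
        }
        if (key.startsWith("m:")) {
            return values.size(); // e.g. number of members in this signature's bucket
        }
        throw new IllegalArgumentException("unexpected key: " + key);
    }
}
```

Since the partitioner already separates the two key families, each reducer instance will in practice only ever take one of the two branches.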

In the output, part-00000 will contain the word counts and part-00001 the minhash calculations.

Unfortunately it is not possible to provide two different Reducer classes, but with the if and the prefix you can simulate it.

Also, having only 2 reducers might not be efficient from a performance point of view; in that case you could play with the Partitioner to allocate, say, the first N partitions to the word count.

If you do not like the prefix idea, then you would need to implement secondary sort with a custom WritableComparable class for the key. That is worth the effort only in more sophisticated cases.

hadoop mapreduce
