Breeding: Hadoop Map Whole File in Java -

Saturday, 15 February 2014

Hadoop Map Whole File in Java

I am trying to use Hadoop in Java with multiple input files. At the moment I have two files: a big one to process, and a smaller one that serves as a sort of index.

My problem is that I need to keep the whole index file unsplit while the big file is distributed across the mappers. Is there a way provided by the Hadoop API to do such a thing?

In case I have not expressed myself correctly, here is a link to an image that represents what I am trying to achieve: picture

Update:

Following the instructions provided by Santiago, I am now able to insert a file (or its URI, at least) from Amazon's S3 into the distributed cache like this:

job.addCacheFile(new Path("s3://mybucket/input/index.txt").toUri());

However, when the mapper tries to read it, a 'file not found' exception occurs, which seems odd to me. I have checked the S3 location and everything seems fine. I have used other S3 locations for the input and output files without problems.

The error (note the single slash after s3:):

FileNotFoundException: s3:/mybucket/input/index.txt (No such file or directory)

This is the code I use to read the file from the distributed cache:

URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader(cacheFile[0].toString()));
while ((line = br.readLine()) != null) {
    // do stuff
}

I am using Amazon's EMR, S3, and version 2.4.0 of Hadoop.
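The single slash in the error is consistent with the URI string being handed to java.io.File (via FileReader), which treats it as a local path and, on Unix-like systems, collapses the doubled slash. The sketch below reproduces that with plain JDK classes and derives the name under which a cached file would typically appear in the task's working directory. The class and method names (CacheFileName, localName) are hypothetical, and the assumption that the cached file is linked under its file name (or under the URI fragment, if one is given) should be checked against the Hadoop version in use.

```java
import java.io.File;
import java.net.URI;

class CacheFileName {
    // Hypothetical helper: the name a cached file is assumed to have locally.
    // Uses the URI fragment ("...#alias") if present, else the last path segment.
    static String localName(URI uri) {
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        URI uri = URI.create("s3://mybucket/input/index.txt");

        // java.io.File sees only a local path string, not an S3 location.
        // On Unix-like systems the "//" is normalized to "/".
        System.out.println(new File(uri.toString()).getPath());

        // Candidate local name to open instead of the full URI string.
        System.out.println(localName(uri));
    }
}
```

Under these assumptions, the mapper would open new File(localName(cacheFile[0])) rather than passing the whole URI string to FileReader.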

As mentioned above, add your index file to the distributed cache and access it in your mapper. Behind the scenes, the Hadoop framework ensures that the index file is sent to all the task trackers before any task is executed, so that it is available for processing. In this case the data is transferred only once and is available to all the tasks related to your job.

However, instead of adding the index file to the distributed cache in your code, make your driver class implement the Tool interface, override the run method, and launch it through ToolRunner. This gives you the flexibility of passing the index file to the distributed cache from the command prompt while submitting the job.

If you are using ToolRunner, you can add files to the distributed cache directly from the command line when you run the job. There is no need to copy the file to HDFS first. Use the -files option to add files (a comma-separated list with no spaces):

hadoop jar yourjarname.jar YourDriverClassName -files cachefile1,cachefile2,cachefile3,...

You can access the files in your mapper or reducer code as below:

File f1 = new File("cachefile1");
File f2 = new File("cachefile2");
File f3 = new File("cachefile3");
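Once the index file can be opened by its local name, it is usually read once per task (for example from the mapper's setup method) and kept in memory. The following is a minimal sketch of that loading step using only the JDK, so it can be tried outside a cluster. The class name IndexLoader and the tab-separated "key, then value" line format are assumptions made for illustration; the original post does not say how the index file is laid out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

class IndexLoader {
    // Reads the whole index file into memory.
    // Assumed (hypothetical) format: one "key<TAB>value" pair per line.
    static Map<String, String> load(Path indexFile) throws IOException {
        Map<String, String> index = new HashMap<>();
        try (BufferedReader br =
                 Files.newBufferedReader(indexFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that do not match the assumed format
                }
                index.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        // "cachefile1" matches the local name used in the snippet above.
        Path path = Paths.get(args.length > 0 ? args[0] : "cachefile1");
        Map<String, String> index = load(path);
        System.out.println(index.size() + " index entries loaded");
    }
}
```

In a mapper, the returned map would be stored in a field during setup and consulted from map for every record of the big file, so the index is parsed once per task rather than once per record.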

