Tuesday, 15 May 2012

scala - how to make saveAsTextFile NOT split output into multiple file? -

When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple part files. I'm just passing a parameter (the path) to it.

val year = sc.textFile("apat63_99.txt")
  .map(_.split(",")(1))
  .flatMap(_.split(","))
  .map((_, 1))
  .reduceByKey(_ + _)
  .map(_.swap)
year.saveAsTextFile("year")

Does the number of output files correspond to the number of reducers it uses? Does this mean the output is compressed?

I know I can combine the output together using bash, but is there an option to store the output in a single text file, without splitting? I looked at the API docs, but it doesn't say much about this.

The reason it saves it as multiple files is that the computation is distributed. If the output is small enough that you think it can fit on one machine, then you can end your program with

val arr = year.collect()

and then save the resulting array to a file. Another way would be to use a custom partitioner, partitionBy, and make it so everything goes to one partition, though that isn't advisable because you won't get any parallelization.
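As a sketch of the collect-then-write approach: the snippet below uses a plain array as a stand-in for what `year.collect()` would return (the object name, file name, and sample data are all hypothetical), and writes it to one local file with `java.io.PrintWriter`.

```scala
import java.io.PrintWriter

object SaveCollected {
  def main(args: Array[String]): Unit = {
    // Placeholder for `val arr = year.collect()` — (count, year) pairs.
    val pairs = Array((1, "1963"), (2, "1964"))
    val out = new PrintWriter("year.txt")
    try pairs.foreach { case (count, yr) => out.println(s"($count,$yr)") }
    finally out.close() // always release the file handle
  }
}
```

Note this only works when the whole result fits in the driver's memory, since collect() pulls every partition back to one JVM.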

If you require the file to be saved with saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(). This basically means do the computation, then coalesce down to one partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured this stuff out; you should take a look.
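A minimal end-to-end sketch of the coalesce approach, assuming a local-mode Spark setup (the app name, sample data, and output directory `year-single` are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleFileOutput {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("single-file")
    val sc = new SparkContext(conf)
    // Stand-in for the RDD built from apat63_99.txt in the question.
    val year = sc.parallelize(Seq(("1963", 1), ("1963", 1), ("1964", 1)))
      .reduceByKey(_ + _)
      .map(_.swap)
    // coalesce(1, shuffle = true) is what repartition(1) does under the hood:
    // one partition in, so exactly one part-00000 file comes out.
    year.coalesce(1, shuffle = true).saveAsTextFile("year-single")
    sc.stop()
  }
}
```

Keep in mind that coalescing to one partition funnels all the data through a single task, so this trades away parallelism for the single-file layout, just like the collect() route.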

scala apache-spark
