scala - How to make saveAsTextFile NOT split output into multiple files?
When using Scala in Spark, whenever I dump a result out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter (the path) to it.
val year = sc.textFile("apat63_99.txt")
  .map(_.split(",")(1))
  .flatMap(_.split(","))
  .map((_, 1))
  .reduceByKey(_ + _)
  .map(_.swap)

year.saveAsTextFile("year")

Does the number of output files correspond to the number of reducers it uses? Does this mean the output is compressed?
I know I can combine the output using bash, but is there an alternative to store the output in a single text file, without splitting? I looked at the API docs, but they don't say much about this.
The reason it saves the output as multiple files is that the computation is distributed. If the output is small enough that you think it can fit on one machine, then you can end your program with

val arr = year.collect()

and save the resulting array to a file. Another way would be to use a custom partitioner with partitionBy and make everything go to a single partition, though that isn't advisable because you won't get any parallelization.
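If you go the collect() route, a minimal sketch of writing the collected array out from the driver might look like this (the local path "year.txt" is just an example, and this assumes the result fits in driver memory):

import java.io.PrintWriter

val arr = year.collect()              // pulls all (count, year) pairs to the driver
val out = new PrintWriter("year.txt") // example local path, not from the original post
try arr.foreach { case (count, yr) => out.println(s"$count,$yr") }
finally out.close()

And here is a sketch of the single-partition partitioner idea, purely as an illustration; as noted, it gives up all parallelism:

import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._ // brings in pair-RDD methods like partitionBy on older Spark versions

// Sends every key to partition 0, so the output is a single part file.
class SinglePartitioner extends Partitioner {
  override def numPartitions: Int = 1
  override def getPartition(key: Any): Int = 0
}

year.partitionBy(new SinglePartitioner).saveAsTextFile("year-single")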
If you require the file to be saved with saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(). This means: do the computation, then coalesce to one partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this stuff out; you should take a look.
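Concretely, the coalesce approach looks like this (the output paths are just examples):

// Coalesce everything into one partition before writing,
// so saveAsTextFile produces a single part file.
year.coalesce(1, shuffle = true).saveAsTextFile("year-single")

// Equivalent: repartition(1) is just coalesce with shuffle = true.
year.repartition(1).saveAsTextFile("year-single2")

Note that even with one partition, saveAsTextFile still writes a directory containing a single part-00000 file rather than a bare text file.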
scala apache-spark