python - cogroup in PySpark
The tutorial suggests this:
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2)])
>>> sorted(x.cogroup(y).collect())
[('a', ([1], [2])), ('b', ([4], []))]
However, on running it I get the following output instead:
('a', (<pyspark.resultiterable.ResultIterable object at 0x1d8b190>, <pyspark.resultiterable.ResultIterable object at 0x1d8b150>))
('b', (<pyspark.resultiterable.ResultIterable object at 0x1d8b210>, <pyspark.resultiterable.ResultIterable object at 0x1d8b1d0>))
This has three levels of nesting. If I store the output in r and do this:
for i in r:
    for j in i[1]:
        print(list(j))
I get the cogrouped numbers I want. My questions are:

1) Why doesn't cogroup return plain numbers the way rightOuterJoin/leftOuterJoin etc. do in PySpark?
2) Why can't I replicate the tutorial's example in the PySpark shell?
Easy answer: because that is what cogroup is supposed to return:
/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a
 * tuple with the list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
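If you want the tutorial's list-style output in PySpark, a minimal sketch (reusing x and y from the example above) is to materialize each ResultIterable with list(), for instance via mapValues:

>>> sorted(x.cogroup(y).mapValues(lambda vw: (list(vw[0]), list(vw[1]))).collect())
[('a', ([1], [2])), ('b', ([4], []))]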
Joins in Spark are implemented with cogroup; join just breaks the Iterables from cogroup into tuples. Here is the implementation of join in Spark:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
  this.cogroup(other, partitioner).flatMapValues { case (vs, ws) =>
    for (v <- vs; w <- ws) yield (v, w)
  }
}
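For illustration, here is a rough PySpark analogue of that Scala code; join_via_cogroup is a hypothetical helper name, not part of the PySpark API:

# Sketch of inner-join semantics built on cogroup, mirroring the Scala
# implementation above; join_via_cogroup is an illustrative name only.
def join_via_cogroup(left, right):
    # cogroup yields (key, (values_from_left, values_from_right)); expand
    # every combination into a (v, w) pair, so keys missing on either side
    # (like 'b' in the example) produce no output.
    return left.cogroup(right).flatMapValues(
        lambda vw: [(v, w) for v in vw[0] for w in vw[1]])

With the example RDDs, sorted(join_via_cogroup(x, y).collect()) gives [('a', (1, 2))], the same as x.join(y).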
As for the slight difference in interpreter output (keep in mind the outputs are the same except that PySpark's Iterable doesn't show its contents), I can't be sure unless I see the tutorial. The tutorial may be showing the output more clearly even if that's not what actually appears. One more thing: I ran a similar script in the Scala shell, and it does show the contents of the output:
Array((a,(ArrayBuffer(1),ArrayBuffer(2))), (b,(ArrayBuffer(4),ArrayBuffer())))
python apache-spark