mapreduce - Best separator character for Hadoop files
If I'm writing CSV-style files out of a system to be consumed by Hadoop, what is the best column separator to use within the file? I have tried Ctrl-A, but it's a pain IMO because other programs don't display it well - e.g. you might view the file using vi, Notepad, a web browser, or Excel. Comma is a pain because the data might contain commas. I'm thinking of standardising on tab. Is there a best practice in this regard for Hadoop, or does it not matter? I have done a fair bit of searching and can't find much on this basic question.
There are tradeoffs for each; it depends what you care about.
Commas - if you care most about interoperability. Every tool works with CSV. Commas in the data are only a pain if the writing system doesn't escape properly, or the reading system doesn't respect the escaping. Hive handles escaping correctly, as far as I know.
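To illustrate the escaping point, here is a minimal sketch using Python's standard csv module (the rows are hypothetical): a field containing a comma is quoted on write, so it survives the round trip instead of splitting into two columns.

```python
import csv
import io

# Hypothetical rows; the name field deliberately contains a comma.
rows = [
    ["id", "name", "city"],
    ["1", "Doe, Jane", "Sydney"],
]

buf = io.StringIO()
# QUOTE_MINIMAL quotes a field only when it contains the delimiter,
# a quote character, or a newline - this is the "escaping" that a
# well-behaved writer must do.
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

text = buf.getvalue()

# A reader that respects the quoting recovers the embedded comma intact.
parsed = list(csv.reader(io.StringIO(text)))
assert parsed[1][1] == "Doe, Jane"
```

The failure mode described above is exactly what happens when the writer skips the quoting step or the reader splits naively on every comma.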
Tabs - if you care about interoperability and expect commas in your data but no tabs. You're less likely to have tabs in your data, but also less likely that a given tool supports TSV.
Ctrl+A - if you care mainly about Hadoop-ecosystem functionality. It has become the de-facto Hadoop standard, though Hadoop supports commas and tabs too. The upside is that you usually don't have to worry about escaping at all.
In the end, I think it's a toss-up, assuming you're escaping correctly (and you should be!). There's no single best practice. If you find yourself worrying a lot about this kind of thing, you might want to step up to a more serious serialization format like Avro, which is well supported in the Hadoop world.
hadoop mapreduce hive