Tuesday, 15 September 2015

bigdata - how is flume distributed?




I am working with Flume to ingest a ton of data into HDFS (about petabytes of data). I would like to know how Flume makes use of its distributed architecture. I have over 200 servers and have installed Flume on one of them, where the data comes in (aka the data source), with HDFS as the sink. (Hadoop is running via Serengeti on these servers.) I am not sure whether Flume distributes itself over the cluster or whether I have installed it incorrectly. I followed Apache's user guide for the Flume installation and this SO post:

How to install and configure Apache Flume?

http://flume.apache.org/flumeuserguide.html#setup

I am a newbie to Flume and am trying to understand more about it. Any help would be appreciated. Thanks!!

I'm not going to speak to Cloudera's specific recommendations, but instead to Apache Flume itself.

It's as distributed as you decide to distribute it. You decide on your own topology and implement it.

You should think of Flume as a durable pipe. Each agent has a source (you can choose from a number of them), a channel (again, a number to choose from) and a sink (again, several choices). It is pretty typical to use an Avro sink in one agent to connect to an Avro source in another.
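As a rough illustration of that source/channel/sink wiring, here is a minimal single-agent configuration sketch in Flume's properties format. The agent name (`a1`), component names, hostnames and ports are all placeholder assumptions, not anything from the question:

```properties
# Hypothetical agent "a1": Avro source -> memory channel -> Avro sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source: listens for events sent by an upstream agent's Avro sink.
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# Memory channel: fast, but events are lost if the agent dies.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Avro sink: forwards events to the next agent in the pipeline.
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = next-agent.example.com
a1.sinks.k1.port = 4141
a1.sinks.k1.channel = c1
```

Chaining agents this way (Avro sink on one machine pointing at an Avro source on another) is what makes a multi-machine Flume topology; there is no automatic cluster distribution beyond the topology you configure.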

Assume you are installing Flume to gather Apache webserver logs. The common architecture is to install Flume on each Apache webserver machine. You would use the Spooling Directory source for the Apache logs and a Syslog source for syslog. You would use a memory channel for speed, so as not to impact the server (at the cost of durability), and an Avro sink.
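A per-webserver agent along those lines might look like the following sketch. The agent name, spool directory, syslog port and collector hostname are placeholder assumptions:

```properties
# Hypothetical web-server agent "web": two sources feed one memory channel.
web.sources = apachelogs syslogin
web.channels = mem
web.sinks = k1

# Spooling Directory source: ingests completed log files dropped in a directory.
web.sources.apachelogs.type = spooldir
web.sources.apachelogs.spoolDir = /var/log/apache2/spool
web.sources.apachelogs.channels = mem

# Syslog UDP source: receives syslog messages on a local port.
web.sources.syslogin.type = syslogudp
web.sources.syslogin.host = 0.0.0.0
web.sources.syslogin.port = 5140
web.sources.syslogin.channels = mem

# Memory channel: chosen for speed at the cost of durability.
web.channels.mem.type = memory
web.channels.mem.capacity = 10000

# Avro sink: ships events off the webserver to a collector agent.
web.sinks.k1.type = avro
web.sinks.k1.hostname = collector1.example.com
web.sinks.k1.port = 4141
web.sinks.k1.channel = mem
```

Note that the Spooling Directory source expects immutable, closed files in the spool directory, which is why it pairs well with rotated Apache logs.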

Those Avro sinks would be connected, via Flume's load balancing, to two or more collectors. The collectors would have an Avro source, a file channel and whatever you want (Elasticsearch? HDFS?) as the sink. You may even add another tier of agents to handle the final output.
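The load balancing lives on the sending side as a sink group, and the collector is just another agent with a durable file channel in front of the terminal sink. A sketch of both halves, with all agent names, paths and hosts as placeholder assumptions:

```properties
# Sender side: a sink group load-balances across two collector Avro sinks.
web.sinks = k1 k2
web.sinkgroups = g1
web.sinkgroups.g1.sinks = k1 k2
web.sinkgroups.g1.processor.type = load_balance
web.sinkgroups.g1.processor.selector = round_robin
# k1 and k2 are Avro sinks pointing at collector1 / collector2 respectively.

# Hypothetical collector agent "coll": Avro source -> file channel -> HDFS sink.
coll.sources = av
coll.channels = fc
coll.sinks = hdfsout

coll.sources.av.type = avro
coll.sources.av.bind = 0.0.0.0
coll.sources.av.port = 4141
coll.sources.av.channels = fc

# File channel: events survive an agent restart, unlike the memory channel.
coll.channels.fc.type = file
coll.channels.fc.checkpointDir = /var/flume/checkpoint
coll.channels.fc.dataDirs = /var/flume/data

# HDFS sink: writes events into date-partitioned directories.
coll.sinks.hdfsout.type = hdfs
coll.sinks.hdfsout.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
coll.sinks.hdfsout.hdfs.fileType = DataStream
coll.sinks.hdfsout.channel = fc
```

Using a file channel at the collector tier trades some throughput for durability, which is usually the right call at the point where many webserver agents fan in.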

bigdata flume
