bigdata - How is Flume distributed?
I am using Flume to ingest a ton of data into HDFS (on the order of petabytes). Does anyone know how Flume makes use of a distributed architecture? I have around 200 servers and have installed Flume on just one of them to read the data (i.e. the data source) and sink it to HDFS (Hadoop is running on Serengeti on these servers). I am not sure whether Flume distributes itself over the cluster or whether I have installed it incorrectly. I followed Apache's user guide for Flume installation and this SO post:
How to install and configure Apache Flume?
http://flume.apache.org/flumeuserguide.html#setup
I'm a newbie to Flume and trying to understand more about it. Any help is appreciated. Thanks!!
I'm not going to speak to Cloudera's specific recommendations, but instead to Apache Flume itself.
It's as distributed as you decide to distribute it. You decide on your own topology and implement it.
You should think of Flume as a durable pipe. It has a source (you can choose from a number of them), a channel (again, several to choose from) and a sink (likewise, several to choose from). A pretty typical pattern is to use an Avro sink in one agent to connect to an Avro source in another.
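To make the source/channel/sink wiring concrete, here is a minimal single-agent configuration sketch in the style of the Flume user guide. The agent name `a1`, the port, and the use of a netcat source and logger sink are just illustrative choices for testing, not part of the question's setup:

```properties
# One Flume agent = one JVM. Every agent declares its sources,
# channels and sinks, then wires them together by name.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat source: listens on a TCP port, one event per line (handy for testing)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Memory channel: buffers events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Logger sink: writes events to the agent's own log
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

You would start it with something like `flume-ng agent --conf conf --conf-file example.conf --name a1`. The key point is that this whole file describes just one agent on one machine; distribution comes from running more agents and chaining them.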
Assume you are installing Flume to gather Apache webserver logs. The common architecture is to install Flume on each Apache webserver machine. You would use the Spooling Directory Source for the Apache logs and a Syslog Source for syslog. You would use a memory channel for speed, so as not to impact the server (at the cost of durability), and an Avro sink.
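A sketch of that per-webserver agent. The directory paths, hostname and ports here are assumptions for illustration; only the source/channel/sink types come from the description above:

```properties
web.sources = apache syslog
web.channels = mem
web.sinks = fwd

# Spooling Directory Source: reads completed log files dropped into a directory
web.sources.apache.type = spooldir
web.sources.apache.spoolDir = /var/log/apache2/spool
web.sources.apache.channels = mem

# Syslog TCP source for the machine's syslog traffic
web.sources.syslog.type = syslogtcp
web.sources.syslog.host = 0.0.0.0
web.sources.syslog.port = 5140
web.sources.syslog.channels = mem

# Memory channel: fast, but events in flight are lost if the agent dies
web.channels.mem.type = memory
web.channels.mem.capacity = 100000

# Avro sink: forwards events to the next tier (a collector agent)
web.sinks.fwd.type = avro
web.sinks.fwd.hostname = collector01.example.com
web.sinks.fwd.port = 4141
web.sinks.fwd.channel = mem
```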
That Avro sink would be connected, via Flume load balancing, to two or more collectors. The collectors would have an Avro source, a file channel and whatever you want as a sink (Elasticsearch? HDFS?). You may even add another tier of agents to handle the final output.
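A sketch of both sides of that tier: a load-balancing sink group on the webserver agent, and a collector agent writing to HDFS. Hostnames, ports and paths are placeholders I've invented for the example; the load_balance processor, file channel and HDFS sink types are standard Flume components:

```properties
# --- On the webserver agent: load-balance across two collectors ---
web.sinks = c1 c2
web.sinkgroups = g1
web.sinkgroups.g1.sinks = c1 c2
web.sinkgroups.g1.processor.type = load_balance
web.sinkgroups.g1.processor.selector = round_robin

web.sinks.c1.type = avro
web.sinks.c1.hostname = collector01.example.com
web.sinks.c1.port = 4141
web.sinks.c1.channel = mem

web.sinks.c2.type = avro
web.sinks.c2.hostname = collector02.example.com
web.sinks.c2.port = 4141
web.sinks.c2.channel = mem

# --- On each collector agent: Avro in, durable file channel, HDFS out ---
coll.sources = in
coll.channels = fc
coll.sinks = out

coll.sources.in.type = avro
coll.sources.in.bind = 0.0.0.0
coll.sources.in.port = 4141
coll.sources.in.channels = fc

# File channel: persists events to disk, survives agent restarts
coll.channels.fc.type = file
coll.channels.fc.checkpointDir = /var/lib/flume/checkpoint
coll.channels.fc.dataDirs = /var/lib/flume/data

coll.sinks.out.type = hdfs
coll.sinks.out.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
coll.sinks.out.hdfs.fileType = DataStream
coll.sinks.out.hdfs.useLocalTimeStamp = true
coll.sinks.out.channel = fc
```

Note the trade-off across the tiers: memory channels at the edge for speed, a file channel at the collectors for durability before the final write to HDFS.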