Saturday, 15 September 2012

machine learning - Apache Spark (MLlib) for real-time analytics

I have a few questions related to the use of Apache Spark for real-time analytics with Java. When the Spark application is submitted, the data stored in a Cassandra database are loaded and processed by a machine learning algorithm (a Support Vector Machine). Through Spark's streaming extension, whenever new data arrive they are persisted in the database, the existing dataset is re-trained, and the SVM algorithm is executed again. The output of this process is also stored in the database.
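For the Cassandra-to-RDD step of that pipeline, here is a minimal sketch assuming the DataStax spark-cassandra-connector Java API (a separate dependency, not part of Spark itself); the keyspace, table, and column names are made up for illustration:

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

    import com.datastax.spark.connector.japi.CassandraRow;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class CassandraLoadSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("cassandra-load-sketch")
                    .set("spark.cassandra.connection.host", "127.0.0.1");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Read each Cassandra row and map it to an MLlib LabeledPoint.
            // Keyspace, table, and column names here are hypothetical.
            JavaRDD<LabeledPoint> trainingData = javaFunctions(sc)
                    .cassandraTable("ml_keyspace", "training_samples")
                    .map((CassandraRow row) -> new LabeledPoint(
                            row.getDouble("label"),
                            Vectors.dense(row.getDouble("f1"), row.getDouble("f2"))));

            System.out.println("Loaded " + trainingData.count() + " training points");
            sc.stop();
        }
    }

The resulting RDD of LabeledPoint can then be handed to MLlib for training, as in the sketches further down.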

1. Apache Spark's MLlib provides an implementation of a linear support vector machine. If I need a non-linear SVM, should I implement my own algorithm, or can I use existing libraries such as libsvm or jkernelmachines? These implementations are not based on Spark's RDDs; is there a way to use them without re-implementing the algorithm from scratch on RDD collections? If not, it would be a huge effort if I wanted to test several algorithms.

2. Does MLlib provide out-of-the-box utilities for data scaling before running the SVM algorithm, as described in section 2.2 of http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf?

3. While a new dataset is streamed in, do I need to re-train on the whole dataset? Is there any way to just add the new data to the already trained model?

To answer the questions piecewise:

1. Spark provides the MLUtils class, which lets you load data in LIBSVM format into RDDs, so the data-loading part won't stop you from using another library. You could implement your own algorithm if you know what you're doing, although my recommendation would be to take an existing one, tweak its objective function, and see how it runs; Spark already provides the basic machinery for distributed stochastic gradient descent that you can build on (see the sketch after this list).

2. Not that I know of. Maybe someone else knows the answer.

3. What do you mean by re-training when the whole dataset is streamed?
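To make the first point concrete, here is a minimal sketch of loading LIBSVM-formatted data and training MLlib's linear SVM with SGD, using the RDD-based MLlib API; the file path and iteration count are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.classification.SVMModel;
    import org.apache.spark.mllib.classification.SVMWithSGD;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.util.MLUtils;

    public class SvmTrainingSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("svm-training-sketch");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Load LIBSVM-formatted data into an RDD of LabeledPoint;
            // the path is a placeholder for wherever the exported data lives.
            JavaRDD<LabeledPoint> data = MLUtils
                    .loadLibSVMFile(sc.sc(), "data/training_data.libsvm")
                    .toJavaRDD();

            // Train a linear SVM with distributed SGD; 100 iterations is arbitrary.
            SVMModel model = SVMWithSGD.train(data.rdd(), 100);

            // predict() returns the class after thresholding at 0.0 by default;
            // call model.clearThreshold() first to get raw margins instead.
            double prediction = model.predict(data.first().features());
            System.out.println("Prediction for the first point: " + prediction);

            sc.stop();
        }
    }

Because loadLibSVMFile reads the same format that libsvm consumes, you could also collect that data to the driver and feed it to an external single-machine library, at the cost of losing the distributed computation.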

From the docs:

.. except fitting occurs on each batch of data, so that the model continually updates to reflect the data from the stream.
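That quote describes MLlib's streaming linear regression (StreamingLinearRegressionWithSGD); MLlib has no streaming SVM, so take the following as a sketch of the continual-update pattern rather than a drop-in solution. It assumes labeled points arrive on a socket in the text format understood by LabeledPoint.parse, and that the number of features is known up front:

    import org.apache.spark.SparkConf;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StreamingTrainingSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("streaming-training-sketch");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            // Assumes each socket line is a labeled point in the text format
            // produced by LabeledPoint.toString, e.g. "(1.0,[0.5,0.2,0.1])".
            JavaDStream<LabeledPoint> trainingStream = jssc
                    .socketTextStream("localhost", 9999)
                    .map(LabeledPoint::parse);

            // Start from zero weights; three features is an assumption about the data.
            StreamingLinearRegressionWithSGD model = new StreamingLinearRegressionWithSGD()
                    .setInitialWeights(Vectors.zeros(3));

            // The model is refit on every micro-batch, so it keeps updating as
            // data stream in instead of requiring a manual full re-training pass.
            model.trainOn(trainingStream);

            jssc.start();
            jssc.awaitTermination();
        }
    }

For classification, StreamingLogisticRegressionWithSGD (Spark 1.3+) follows the same trainOn/predictOn pattern.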

machine-learning cassandra apache-spark
