Friday, 15 August 2014

hadoop - Which open-source recommendation system should I choose to deal with a big dataset?




I want to build a recommendation system, and the target is to deal with a big data set, more than 1 TB of data.

Each user has a huge number of items, while the number of users is small: thousands or tens of thousands.

I have searched Google and found that there is an open-source recommendation engine based on Hadoop, Mahout. I guess it may be able to deal with such big data, but I'm not sure.

I also found some engines written in C++, Python, and PHP, but I don't think script languages can deal with such big data, because memory can't hold the whole dataset.

Or am I wrong? Can you give me a recommendation?

Your question title is:

Which open-source recommendation system should I choose to deal with a big dataset?

and in the first line you say:

I want to build a recommendation system, and the target is to deal with a big data set, > 1 TB of data.

and then you ask for a recommendation as an answer.

To answer the second question first: in my experience of building recommender systems, I would advise you not to "build" a recommender system from the ground up if you can avoid it. Recommender systems are complex and can use a wide range of techniques to provide the user with a recommendation. My recommendation is that unless you are committed, and have a team of people with a range of experience and knowledge in recommender systems, statistics, and software engineering, you should implement an existing recommender system rather than build your own.

In terms of which open source recommender system you should choose, that is pretty hard to answer with great accuracy. Let me try to answer by breaking it down.

Consider the open source license, and its restrictions and requirements.

Consider the algorithm you want to use to create your recommendations.

Consider the environment you will be running the recommender system on.

I would recommend focusing more on the algorithm side, as it is the determining factor in which tool you can use, or whether you need to roll your own. Start by reading here http://www.ibm.com/developerworks/library/os-recommender1/ for a brief insight into the different approaches recommender systems use. In summary, the different approaches are:

Content based

Neighbourhood / collaborative filtering based

Constraint based

Graph-based

In your case, to keep things relatively straightforward, it sounds like you should consider a user-user collaborative filtering algorithm for this. The reasons being:

Neighbourhood collaborative filtering is quite intuitive to understand and can be relatively easy to implement.

With this method you can justify your recommendations to users in a basic way.

There is no requirement to build a model by training, and the processing of neighbours can be done "offline", to provide quick recommendations to the end user.

Storing the neighbours is quite memory efficient, which means it will improve scalability, and it sounds like you will need lots of that.
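To make the "offline" neighbour step concrete, here is a minimal sketch in plain Python. It is not how Mahout does it. The function names, the toy ratings, and the choice of cosine similarity are all assumptions for illustration. The point is that only the k best neighbours per user need to be stored, so the stored table grows with users × k rather than with the full ratings data.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_neighbour_table(ratings, k):
    """Offline step: keep only the k most similar users for each user."""
    table = {}
    for user, user_ratings in ratings.items():
        scored = ((cosine(user_ratings, other_ratings), other)
                  for other, other_ratings in ratings.items() if other != user)
        table[user] = heapq.nlargest(k, scored)  # list of (similarity, user)
    return table

# Hypothetical toy data: {user: {item: rating}}
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 5, "b": 3, "d": 2},
    "carol": {"a": 1, "b": 5, "d": 4},
}
neighbours = build_neighbour_table(ratings, k=1)
```

This naive version compares every pair of users, which is fine for thousands of users but is exactly the part you would want a distributed tool to handle once the ratings no longer fit on one machine.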

The user-based part of the suggestion is because it sounds like you have fewer users than items. In user-based nearest neighbourhood, the predicted rating of a new item for user u is calculated by looking at other users who have rated the item and are similar to user u. Because you have fewer users than items in the system, it will be faster to compute user-based collaborative filtering compared to item-based collaborative filtering.
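As a rough illustration of that prediction step, the sketch below uses one common formulation: start from user u's own mean rating and add the similarity-weighted average of how far each neighbour's rating of the item sits from that neighbour's mean. The `predict` function, the toy ratings, and the similarity values are made up for the example, and the similarities are assumed to have been computed elsewhere.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def predict(user, item, ratings, sims):
    """Predict `user`'s rating of `item` from similar users who rated it.

    ratings: {user: {item: rating}}
    sims:    {other_user: similarity to `user`} (assumed precomputed)
    """
    base = mean(ratings[user].values())
    num = 0.0
    den = 0.0
    for other, weight in sims.items():
        if item in ratings[other]:
            deviation = ratings[other][item] - mean(ratings[other].values())
            num += weight * deviation
            den += abs(weight)
    # With no usable neighbour, fall back to the user's own mean.
    return base if den == 0 else base + num / den

# Hypothetical toy data; user "u" has not rated item "c".
ratings = {
    "u": {"a": 4, "b": 2},
    "v": {"a": 5, "b": 4, "c": 3},
    "w": {"a": 1, "b": 2, "c": 3},
}
sims_to_u = {"v": 0.8, "w": 0.2}  # illustrative values only
prediction = predict("u", "c", ratings, sims_to_u)
```

Here "v" rates "c" one point below their own mean and "w" one point above, so the more similar user "v" pulls the prediction below u's mean of 3.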

Within user-based collaborative filtering you need to consider which rating normalisation (mean-centering vs z-score) you want to use, which similarity weight computation method (e.g. cosine vs Pearson's correlation vs other similarity measures) you want to use, the neighbourhood selection criteria (pre-filtering of neighbours, number of neighbours involved in the prediction), and any dimensionality reduction methods (SVD, SVD++) you want to implement (with a big dataset like yours you may want to consider DM).
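To show what the normalisation and similarity choices look like in practice, here is a small sketch using textbook definitions. The function names and the two example users are assumptions for illustration, not taken from any particular library.

```python
import math

def mean_center(r):
    """Subtract the user's mean rating from each of their ratings."""
    m = sum(r.values()) / len(r)
    return {i: v - m for i, v in r.items()}

def z_score(r):
    """Mean-center, then divide by the user's standard deviation."""
    m = sum(r.values()) / len(r)
    sd = math.sqrt(sum((v - m) ** 2 for v in r.values()) / len(r))
    return {i: (v - m) / sd if sd else 0.0 for i, v in r.items()}

def cosine(a, b):
    """Cosine similarity on raw ratings; sensitive to rating-scale offsets."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - ma) * (b[i] - mb) for i in common)
    va = sum((a[i] - ma) ** 2 for i in common)
    vb = sum((b[i] - mb) ** 2 for i in common)
    return cov / math.sqrt(va * vb) if va and vb else 0.0

# Two hypothetical users with the same taste but different rating habits.
generous = {"a": 5, "b": 4, "c": 3}
strict = {"a": 3, "b": 2, "c": 1}
cos_sim = cosine(generous, strict)
pear_sim = pearson(generous, strict)
```

Pearson treats these two users as perfectly correlated because it removes each user's mean first, while raw cosine comes out slightly below 1. That kind of difference is what the choice between the measures comes down to.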

So instead of looking for an open source tool that is able to process your data set, you should consider the algorithm selection first, then find a tool that has an implementation of that algorithm, and then assess whether it can process the volume involved in your dataset.

In saying all of that, if you do go down the user-based collaborative filtering route, I am confident that Apache Mahout will be able to solve your problem, and if not, it will help you understand the complexity involved in building your own (just look at the source code).

Please note that this advice only considers the algorithm choice. "Good" recommender systems are about much more than being able to process a big dataset. You need to think about accuracy, coverage, confidence, novelty, serendipity, diversity, robustness, privacy, risk, user trust, and scalability. You should also consider how you are going to perform experiments and evaluate your recommendations. Remember, if your recommendations are churning out rubbish and turning users off, there is no point in having a recommender system!
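Two of those criteria, accuracy and coverage, are easy to measure on held-out ratings. The sketch below is a minimal offline check. It assumes you have hidden some known ratings and collected your system's predictions for them, and the function names and numbers are hypothetical.

```python
import math

def rmse(predictions, actuals):
    """Root-mean-square error over held-out pairs that received a prediction."""
    errors = [(predictions[key] - actuals[key]) ** 2
              for key in actuals if key in predictions]
    if not errors:
        raise ValueError("no held-out pair has a prediction")
    return math.sqrt(sum(errors) / len(errors))

def coverage(predictions, actuals):
    """Fraction of held-out pairs for which a prediction could be made."""
    return sum(1 for key in actuals if key in predictions) / len(actuals)

# Hypothetical held-out ratings, keyed by (user, item).
actuals = {("u", "a"): 4, ("u", "b"): 2, ("v", "a"): 5}
# Hypothetical system output; it could not predict ("v", "a").
predictions = {("u", "a"): 3, ("u", "b"): 3}

error = rmse(predictions, actuals)
covered = coverage(predictions, actuals)
```

The two numbers are worth reading together: a system can reach a low error simply by declining to predict the hard cases, which shows up as low coverage.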

It is such a big area with lots to think about, and there is no one single tool that is going to help you with everything, so be prepared for a lot of reading and research, and for implementing lots of different open source tools to help you.

In saying that, start by looking at Apache Mahout. Going by the break-down of the 3 areas I said you should think about:

It has a commercial-friendly open-source license.

It has a great implementation of the algorithms you are going to need to use.

It can work on distributed environments (read: scalable).

Hope this helps, and good luck.

hadoop recommendation-engine
