Saturday, 15 January 2011

machine learning - Document similarity with documents using synonyms




I have a bunch of documents, some of which are re-creations of other documents with the text jumbled and some of the words replaced by synonyms. Below is one such example:

article 1 (original) : caught john snow in town making purchases @ kingslanding hardware store repair broken tractor. snow has farmed soybeans entire life, did father , fathers. asked him life on farm.

article 2 (duplicate) : obtained john snow in city in purchases create rising of hardware @ kingslanding repair broken motor tractor. snow have soya broad beans finish life have been treated, such father , fathers. asked him concerning life on agriculture company.

article 3 (duplicate) : took above john snow in city made purchases in warehouse of hardware of kingslanding repair broken tractor. snow has cultivated soybeans whole life, father , parents. asked him life in farm.

article 4 (duplicate) : caught myself compared john snow downtown making of purchases kingslanding store of material repair broken tractor. snow cultivated soya life whole, his/her father , fathers. questioned life farm.

I want a document similarity measure that ends up tagging all of these documents as members of the same group. Any suggestions, along with examples or tutorials, would be appreciated.

This seems like a textbook case of locality-sensitive hashing. Check out this thread
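To make the locality-sensitive hashing suggestion concrete, here is a minimal sketch of MinHash signatures with LSH banding, using only the Python standard library. Everything in it is an illustrative assumption rather than something taken from the question or the linked thread: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `lsh_candidate_pairs`), the choice of character 4-grams, and the parameters of 128 hash functions split into 32 bands. Character shingles are used because word-level shingles tend to break when words are swapped for synonyms, but this still measures lexical overlap only, so heavily reworded duplicates may need a lower threshold or a synonym-normalisation step first.

```python
import hashlib
import re
from collections import defaultdict


def shingles(text, k=4):
    """Return the set of character k-grams of the normalised text."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return {cleaned[i:i + k] for i in range(max(1, len(cleaned) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with a seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=128):
    """One minimum per seeded hash function; tokens must be non-empty."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Pairs of document ids that share at least one identical band."""
    rows = len(next(iter(signatures.values()))) // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


if __name__ == "__main__":
    # Shortened excerpts of the articles from the question.
    docs = {
        "article1": "snow has farmed soybeans entire life, did father , fathers.",
        "article3": "snow has cultivated soybeans whole life, father , parents.",
        "article4": "snow cultivated soya life whole, his/her father , fathers.",
    }
    sigs = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
    for a, b in sorted(lsh_candidate_pairs(sigs)):
        print(a, b, round(estimated_jaccard(sigs[a], sigs[b]), 2))
```

With `b` bands of `r` rows each, two documents whose shingle sets have Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s**r)**b`, so 32 bands of 4 rows puts the tipping point at roughly `s = 0.4`. Candidate pairs can then be merged into groups, for example with a union-find pass, after checking each pair against a threshold on the estimated similarity.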

machine-learning nlp scikit-learn stanford-nlp information-retrieval
