Thursday, 15 August 2013

php - proposed nlp algorithm for text tagging -



I am looking for an open-source tool that can help identify tags for user posts on social media, and identify on-topic/off-topic or spam comments on a post. After searching for an entire day, I could not find a suitable tool/library.

So here I have proposed my own algorithm for tagging user posts belonging to 7 categories (jobs, discussion, events, articles, services, buy/sell, talents).

Initially, when a user makes a post, he tags the post. Tags can be marketing, suggestion, entrepreneurship, MNC, etc. So consider that all posts have tags and a category they belong to.

Steps:

Perform POS (part-of-speech) tagging on the user post. Here, two things can be done:

Consider only nouns. Nouns may represent the tag of a post more intuitively, I guess.

Consider both nouns and adjectives. Here we can collect a large number of nouns and adjectives, and the frequency of such words can be used to identify the tag of a post.

For each user-defined tag, collect the POS words of posts belonging to that tag. For example, consider that a user assigned the tag marketing, and posts with that tag contain the POS words SEO and AdWords. Suppose 10 posts with the marketing tag contain SEO and AdWords 5 and 7 times respectively. The next time a post comes in that does not have a tag but contains the POS word SEO, then since SEO occurs the most in the marketing tag, we predict the marketing tag for that post.
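The tag-prediction step above can be sketched as follows (the question is tagged PHP, but the logic ports directly; this Python sketch assumes the tokens have already been POS-filtered down to nouns/adjectives, and the `train`/`predict` names and sample data are purely illustrative):

```python
from collections import Counter, defaultdict

# Per-tag frequency counts of the noun/adjective word forms seen in tagged posts.
tag_counts = defaultdict(Counter)

def train(tag, pos_words):
    """Record the POS-filtered words (nouns/adjectives) of a post with a known tag."""
    tag_counts[tag].update(w.lower() for w in pos_words)

def predict(pos_words):
    """Predict the tag whose training posts used this post's words most often."""
    best_tag, best_score = None, 0
    for tag, counts in tag_counts.items():
        score = sum(counts[w.lower()] for w in pos_words)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

# Hypothetical training data mirroring the example in the question.
for _ in range(5):
    train("marketing", ["seo"])
for _ in range(7):
    train("marketing", ["adwords"])
train("jobs", ["vacancy", "resume"])

print(predict(["seo", "campaign"]))  # -> marketing
```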

The next steps identify spam or off-topic comments on a post. Consider that a user posts in the job category, and the post contains the tag marketing. We check in the database the top 10-15 most frequent part-of-speech words (i.e. nouns and adjectives) for marketing.

In parallel, we POS-tag the comment. We check whether the POS words (nouns & adjectives) of the comment contain any of the top frequent words (we can consider 15-20 such POS words) belonging to marketing.

If no POS word in the comment matches any of the top frequent POS words for marketing, the comment can be said to be off-topic/spam.
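The comment check described above amounts to a set-overlap test against the tag's most frequent words. A minimal Python sketch, assuming the per-tag counts already exist and the comment has already been POS-filtered (the names and sample counts are illustrative):

```python
from collections import Counter

def top_pos_words(counts, n=15):
    """The n most frequent noun/adjective word forms recorded for a tag."""
    return {w for w, _ in counts.most_common(n)}

def is_off_topic(comment_pos_words, tag_word_counts, n=15):
    """Flag a comment when none of its nouns/adjectives appear among
    the tag's top-n frequent POS words."""
    frequent = top_pos_words(tag_word_counts, n)
    return not any(w.lower() in frequent for w in comment_pos_words)

# Hypothetical frequency table for the tag "marketing".
marketing = Counter({"seo": 7, "adwords": 5, "campaign": 3})

print(is_off_topic(["seo", "budget"], marketing))     # False: on-topic
print(is_off_topic(["pizza", "holiday"], marketing))  # True: off-topic
```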

Do you have any suggestions to make this algorithm more intuitive?

I guess SVM can help with the classification. Any suggestions on this?

Apart from that, what machine learning techniques can help here to learn a scheme for predicting tags and spam (off-topic) comments?

The main problem I see is feature modeling. While picking out only nouns helps cut down the feature space, this step is potentially important for the error rate. And do you care whether you are looking at market/N and not market/V?

Most mainstream text classification implementations using naive Bayesian classifiers ignore POS and count each distinct word form as an independent feature. (You could do brute-force stemming to cut down market, markets, and marketing to a single stem form and thus a single feature. This tends to work in English, but might not be adequate if you are working in a different language.)
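The naive Bayesian approach described here, with each distinct word form as an independent feature, can be sketched as a small multinomial classifier with add-one smoothing (a Python illustration; the class names and training data are hypothetical):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over raw word forms (no POS filtering, no stemming)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.class_counts = Counter()            # documents seen per class
        self.vocab = set()

    def train(self, label, words):
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, words):
        total_docs = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label in self.class_counts:
            lp = math.log(self.class_counts[label] / total_docs)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                # Add-one smoothing so unseen words don't zero out the class.
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes()
nb.train("marketing", ["seo", "adwords", "campaign"])
nb.train("jobs", ["vacancy", "resume", "interview"])
print(nb.classify(["seo", "campaign"]))  # -> marketing
```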

A compromise is to do POS filtering only when you train the classifier. Word forms that do not have a noun reading end up with a 0 score in the classifier, so you don't have to filter them out when you use the resulting classifier.

Empirically, SVM tends to achieve high accuracy, but it comes at the cost of complexity, both in implementation and in behavior. A naive Bayesian classifier has the distinct advantage that you can understand exactly how it arrived at a particular conclusion. (Well, at least those of us mortals who cannot claim to have the same grasp of the mathematics behind SVM.) Perhaps the way to proceed is to prototype with Bayes, iron out the kinks while learning how the scheme as a whole behaves, and maybe later consider switching to SVM once the other parts are stable?

The "spam" category is going to be harder than a well-defined content category. It is tempting to suggest that anything which doesn't fit any of the content categories is off-topic, but if you are going to use that verdict for automatic spam filtering, it will cause false positives, at least in the early stages. A possible alternative is to train classifiers for particular spam categories: one for medications, one for running shoes, etc.

php algorithm nlp data-mining stanford-nlp
