php - Proposed NLP algorithm for text tagging
I am looking for an open-source tool that can help identify tags for a user post on social media, and identify whether comments on a post are on-topic, off-topic or spam. After looking for an entire day, I could not find a suitable tool/library.
Here I have proposed my own algorithm for tagging user posts belonging to 7 categories (jobs, discussion, events, articles, services, buy/sell, talents).
Initially, when a user makes a post, he tags the post. Tags can be marketing, suggestion, entrepreneurship, MNC etc. So consider that posts have tags and a category they belong to.
Steps:
Perform POS (part of speech) tagging on the user post. Here two things can be done (a short sketch follows below):
Considering only nouns. Nouns may represent the tag of a post more intuitively, I guess.
Considering both nouns and adjectives. Here we can collect a big number of nouns and adjectives. The frequency of such words can be used to identify the tag of the post.
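Below is a minimal sketch of this step. It assumes Python with NLTK purely for illustration (the question itself is tagged php / stanford-nlp), so the tokenizer, tagger and tag set are assumptions rather than part of the proposal.

```python
# Sketch of step 1: POS-tag a post and keep only nouns, or nouns and adjectives.
# Assumes NLTK; requires a one-time nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger').
import nltk

def extract_pos_words(post, include_adjectives=True):
    tagged = nltk.pos_tag(nltk.word_tokenize(post))        # Penn Treebank tags: NN*, JJ*, ...
    keep = ('NN', 'JJ') if include_adjectives else ('NN',)
    return [word.lower() for word, tag in tagged if tag.startswith(keep)]

print(extract_pos_words("Our SEO and AdWords campaigns improved organic reach"))
# e.g. ['seo', 'adwords', 'campaigns', 'organic', 'reach']
```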
For each user-defined tag, collect the POS words of posts belonging to that particular tag. For example, consider that a user assigned the tag marketing, and the posts with that tag contain the POS words SEO and AdWords. Suppose the 10 posts with the marketing tag contain SEO and AdWords 5 and 7 times respectively. The next time a user post arrives without a tag but containing the POS word SEO, then since SEO occurs most often under the marketing tag, we predict the marketing tag for that post.
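A small sketch of this frequency-table idea, again in Python for illustration (standard library only; the counts mirror the SEO/AdWords example above):

```python
from collections import Counter, defaultdict

# One frequency table per user-defined tag, built from the POS words of already-tagged posts.
tag_word_counts = defaultdict(Counter)

def train(tag, pos_words):
    tag_word_counts[tag].update(pos_words)

def predict_tag(pos_words):
    # Score each known tag by how often the post's POS words occur in that tag's table,
    # and return the highest-scoring tag (or None if nothing matches at all).
    scores = {tag: sum(counts[w] for w in pos_words)
              for tag, counts in tag_word_counts.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None

# Mirroring the example above: marketing posts mention "seo" and "adwords" 5 and 7 times.
train("marketing", ["seo"] * 5 + ["adwords"] * 7)
train("jobs", ["vacancy", "resume", "interview"])

print(predict_tag(["seo", "campaign"]))   # -> 'marketing'
```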
The next step is to identify spam or off-topic comments on a post. Consider one user post in the jobs category; the post carries the tag marketing. Look up in the database the top 10-15 most frequent part-of-speech words (i.e. nouns and adjectives) for marketing.
In parallel, POS-tag the comment. Check whether the POS words (nouns & adjectives) of the comment contain any of the top frequent words (we can consider 15-20 such POS words) belonging to marketing.
If no POS word in the comment matches any of the frequent, top POS words for marketing, the comment can be said to be off-topic/spam.
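The comment check could then look something like the following sketch, where the Counter stands in for the per-tag frequency table from step 2 and the counts are illustrative:

```python
from collections import Counter

def is_off_topic(comment_pos_words, tag_counts, top_n=15):
    # tag_counts: noun/adjective frequencies stored for the post's tag (as built in step 2).
    frequent = {word for word, _ in tag_counts.most_common(top_n)}
    # Off-topic/spam if the comment shares none of the tag's most frequent POS words.
    return not (set(comment_pos_words) & frequent)

marketing_counts = Counter({"adwords": 7, "seo": 5, "campaign": 3})
print(is_off_topic(["seo", "ranking"], marketing_counts))      # False: on-topic
print(is_off_topic(["pills", "discount"], marketing_counts))   # True: flagged as off-topic/spam
```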
Do you have any suggestions to make this algorithm more intuitive?
I guess SVM can help with classification; any suggestions on this?
Apart from machine learning techniques, what else could help here to learn a scheme that predicts tags and spam (off-topic) comments?
The main problem I see is feature modeling. While picking out only nouns helps cut down the feature space, this step potentially matters for the error rate. And do you care whether you are looking at market/N and not market/V?
Most mainline text classification implementations use naive Bayesian classifiers that ignore POS and count each distinct word form as an independent feature. (You could brute-force stemming to cut market, markets, and marketing down to a single stem form and thus a single feature. This tends to work in English, but might not be adequate if you are working in a different language.)
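To make that concrete, here is a hedged sketch of such a baseline; scikit-learn, NLTK's Porter stemmer, and the tiny training set are all assumptions made only for illustration:

```python
# A plain bag-of-words naive Bayes baseline as described above: every distinct word form is
# a feature, POS is ignored, and a stemmer collapses market/markets/marketing into one stem.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def stem_tokenizer(text):
    return [stemmer.stem(tok) for tok in text.lower().split()]

posts = ["seo and adwords budget for the new campaign",
         "hiring a marketing manager send your resume",
         "workshop on entrepreneurship next friday"]
labels = ["marketing", "jobs", "events"]

model = make_pipeline(CountVectorizer(tokenizer=stem_tokenizer), MultinomialNB())
model.fit(posts, labels)
print(model.predict(["our adwords campaigns need a bigger budget"]))  # e.g. ['marketing']
```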
A compromise is POS filtering when you train the classifier. Word forms that do not have a noun reading end up with a zero score in the classifier, so you don't have to filter them out when you use the resulting classifier.
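One way to realise that compromise, sketched under the same assumptions (scikit-learn plus NLTK's tagger, with made-up training posts):

```python
# Keep only tokens with a noun reading while TRAINING, so other word forms never enter the
# vocabulary and effectively score zero at classification time.
# NLTK's tagger needs a one-time nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger').
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def nouns_only(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    return " ".join(word for word, tag in tagged if tag.startswith("NN"))

train_posts = ["seo and adwords budget for the new campaign",
               "hiring a marketing manager send your resume"]
train_labels = ["marketing", "jobs"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([nouns_only(p) for p in train_posts])   # noun features only
clf = MultinomialNB().fit(X, train_labels)

# At prediction time the raw post is used as-is: words outside the noun-only vocabulary are
# simply ignored by the vectorizer, which matches the "zero score" behaviour described above.
print(clf.predict(vectorizer.transform(["please review our adwords budget"])))  # e.g. ['marketing']
```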
Empirically, SVM tends to achieve high accuracy, but it comes at a cost in complexity, both in implementation and in behavior. A naive Bayesian classifier has the distinct advantage that you can understand exactly how it arrived at a particular conclusion. (Well, most of us mortals cannot claim the same grasp of the mathematics behind SVM.) So perhaps the way to proceed is to prototype with Bayes and iron out the kinks while learning how the scheme as a whole behaves, then maybe later consider switching to SVM once the other parts are stable?
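If the feature extraction and the classifier sit behind one pipeline, that Bayes-first route costs little; in the sketch below (scikit-learn assumed, toy data) swapping the final estimator is the only change:

```python
# Prototype with naive Bayes first, then swap in a linear SVM once the rest is stable:
# with a pipeline the switch is a one-line change, so the surrounding code does not move.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_model(use_svm=False):
    classifier = LinearSVC() if use_svm else MultinomialNB()
    return make_pipeline(TfidfVectorizer(), classifier)

posts = ["seo and adwords budget for the campaign",
         "hiring a marketing manager send your resume",
         "workshop on entrepreneurship next friday"]
labels = ["marketing", "jobs", "events"]

for use_svm in (False, True):
    model = build_model(use_svm).fit(posts, labels)
    print(type(model[-1]).__name__, model.predict(["adwords budget review"]))
```

Keeping both stages in one pipeline also means any later cross-validation or grid search covers the vectorizer and the classifier together.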
The "spam" category is going to be harder than a well-defined content category. It is tempting to suggest that anything which doesn't fit any of the content categories is off-topic, but if you are going to use that verdict for automatic spam filtering, it will cause false positives, at least in the early stages. A possible alternative is to train classifiers for particular spam categories -- one for medications, one for running shoes, etc.
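Such per-category spam classifiers might be wired up roughly as below; the spam and legitimate example texts are invented purely for illustration, and scikit-learn is again an assumption:

```python
# One binary classifier per spam category (medications, running shoes, ...), as suggested above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

spam_examples = {
    "medications": ["cheap pills no prescription", "buy meds online huge discount"],
    "running-shoes": ["replica running shoes free shipping", "cheap brand shoes sale"],
}
legit_examples = ["great point about adwords budgets", "is the workshop still on friday"]

spam_models = {}
for category, texts in spam_examples.items():
    X = texts + legit_examples
    y = ["spam"] * len(texts) + ["ok"] * len(legit_examples)
    spam_models[category] = make_pipeline(CountVectorizer(), MultinomialNB()).fit(X, y)

def spam_verdict(comment):
    # Flag the comment only if one of the dedicated spam classifiers claims it.
    hits = [c for c, m in spam_models.items() if m.predict([comment])[0] == "spam"]
    return hits or None

print(spam_verdict("discount pills no prescription needed"))  # e.g. ['medications']
```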
php algorithm nlp data-mining stanford-nlp