Crawling different sites with Nutch 1.8
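I am using Nutch 1.8 to crawl information from several sites that use different patterns for the same field. I am writing a plugin for each of the sites, but when I start Nutch, only the first plugin is matched against the sites; the others behave as if they do not exist.
If the first plugin does not match a site, how can I make Nutch skip to the next one, check it, and so on until it finds the right plugin for that site?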
It is not clear why you are getting this. Are you writing an HtmlParseFilter? You could exit the parse method early if the current document's URL does not match a given pattern, or alternatively pass metadata with the seeds and use it to determine which HtmlParseFilter implementation to use.
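To illustrate the first suggestion, here is a minimal, self-contained sketch of the URL guard idea. It deliberately does not use the Nutch API: `SiteRule` and `SiteRuleChain` are hypothetical stand-ins for your per-site filters and for the way the parse filters get called one after another, and the `site-a.example` / `site-b.example` patterns are made up. In a real plugin the `appliesTo` check would presumably sit at the top of your filter method, which would hand back the parse result it received untouched when the URL is not one of its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for one per-site parse filter; not a Nutch class.
final class SiteRule {
    final String name;
    private final Pattern urlPattern;

    SiteRule(String name, String urlRegex) {
        this.name = name;
        this.urlPattern = Pattern.compile(urlRegex);
    }

    // The guard: true only when this rule is responsible for the given URL.
    boolean appliesTo(String url) {
        return urlPattern.matcher(url).find();
    }
}

// Hypothetical stand-in for a chain in which every filter sees every document.
final class SiteRuleChain {
    private final List<SiteRule> rules = new ArrayList<>();

    SiteRuleChain add(SiteRule rule) {
        rules.add(rule);
        return this;
    }

    // Returns the names of the rules that actually handled the URL.
    List<String> handledBy(String url) {
        List<String> handled = new ArrayList<>();
        for (SiteRule rule : rules) {
            if (!rule.appliesTo(url)) {
                continue; // early exit: this rule leaves the document alone
            }
            handled.add(rule.name); // site-specific extraction would go here
        }
        return handled;
    }
}
```

With two rules registered, one for `site-a.example` and one for `site-b.example`, a URL on `site-b.example` is skipped by the first rule and picked up by the second, and a URL from any other host is handled by neither. The point of the design is that no rule needs to know about the others; each one only decides whether the document is its own.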
BTW, you'd get a more relevant audience by posting on the Nutch user list (see http://nutch.apache.org/mailing_lists.html).
nutch