Breeding: groovy - How to get the resource types from a webpage using JSoup?

Sunday, 15 September 2013

I am trying to create a web crawler in Groovy. I am looking to extract the resource types from a webpage. I need to check whether a particular webpage has any of the following resource types:

pdfs

jmp files

swf files

zip files

mp3 files

images

movie files

jsl files

I am working with crawler4j for crawling and JSoup for parsing. In general, I would like to know an approach for getting any resource type I may need in the future. I tried the following in BasicCrawler.groovy. It only tells me the content type of the page itself, i.e. text/html or text/xml. I need the types of all the resources on the page. Please correct me where I am going wrong:

    @Override
    void visit(Page page) {
        println "inside visit"
        int docid = page.getWebURL().getDocid()
        url = page.getWebURL().getURL()
        String domain = page.getWebURL().getDomain()
        String path = page.getWebURL().getPath()
        String subDomain = page.getWebURL().getSubDomain()
        parentUrl = page.getWebURL().getParentUrl()
        String anchor = page.getWebURL().getAnchor()
        println("docid: ${docid} ")
        println("url: ${url}  ")

        // Fetch the page again with JSoup and ask the server for the
        // content type of every element that carries an href attribute.
        Document doc = Jsoup.connect(url).get();
        Elements nextLinks = doc.body().select("[href]");
        for (Element link : nextLinks) {
            String contentType = new URL(link.attr("href")).openConnection().getContentType();
            println url + "***" + contentType
        }

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData()
            String text = htmlParseData.getText()
            String html = htmlParseData.getHtml()
            List<WebURL> links = htmlParseData.getOutgoingUrls()
        }
        println("finished crawling")

        def crawlObj = new Resource(url: url)
        if (!crawlObj.save(flush: true, failOnError: true)) {
            crawlObj.errors.each { println it }
        }
    }

After printing two doc IDs, it throws this error: ERROR crawler.WebCrawler - Exception while running the visit method. Message: 'unknown protocol: tel' at java.net.URL.<init>(URL.java:592)
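The message points at new URL(...): the page evidently contains a link whose href starts with tel:, and java.net.URL has no handler for that scheme. One way to guard against this is sketched below, assuming that only http and https links are worth fetching. The class name LinkFilter and the method toFetchableUri are hypothetical, and only JDK classes are used:

```java
import java.net.URI;
import java.util.Optional;

final class LinkFilter {
    private LinkFilter() {}

    /**
     * Resolves href against the URL of the page it was found on and returns
     * the absolute URI only when its scheme is http or https. Links such as
     * tel:, mailto: or javascript:, and values that are not valid URIs
     * (for example, ones containing raw spaces), yield Optional.empty().
     */
    static Optional<URI> toFetchableUri(String pageUrl, String href) {
        if (href == null || href.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            URI resolved = URI.create(pageUrl).resolve(href.trim());
            String scheme = resolved.getScheme();
            boolean isHttp = scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
            return isHttp ? Optional.of(resolved) : Optional.empty();
        } catch (IllegalArgumentException e) {
            // URI.create rejects syntactically invalid input
            return Optional.empty();
        }
    }
}
```

In the loop over links, the content-type lookup would then run only for links where toFetchableUri(url, link.attr("href")) returns a value. Because the href is resolved against the page URL, relative links also become absolute.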

You can check the URLs in the document and ask the server for the content type of each one. Here is a quick and dirty example:

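    Document doc = Jsoup.connect("http://yourpage").get();
    // Every element in the body that has an href attribute
    Elements elements = doc.body().select("[href]");
    for (Element element : elements) {
        // Opens a connection per link and reads the Content-Type header
        String contentType = new URL(element.attr("href")).openConnection().getContentType();
    }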
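getContentType() returns a MIME string such as application/pdf, sometimes with parameters appended, or null when the type is not known. A rough sketch of mapping that value onto the categories from the question follows. The class name ContentTypes and the category labels are made up for illustration, and JMP and JSL files are left out because no standard MIME type is assumed for them here:

```java
import java.util.Locale;

final class ContentTypes {
    private ContentTypes() {}

    /** Maps a Content-Type header value to a coarse resource category. */
    static String categorize(String contentType) {
        if (contentType == null) {
            return "unknown";
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);

        if (mime.startsWith("image/")) return "image";
        if (mime.startsWith("video/")) return "movie";
        switch (mime) {
            case "application/pdf":               return "pdf";
            case "application/zip":               return "zip";
            case "application/x-shockwave-flash": return "swf";
            case "audio/mpeg":                    return "mp3";
            default:                              return "other";
        }
    }
}
```

Servers do not always send accurate content types, so treat the result as a hint rather than a guarantee.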

For images, embedded elements, and so on, you should search for the src attribute instead.
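Whether a URL comes from href or src, asking the server costs one request per link. As a cheaper first pass, the type can often be guessed from the file extension in the URL path. This is a different technique from the content-type lookup above, and only a heuristic: URLs without an extension still need the server check. In the sketch below, the class name ResourceTypes, the category labels, and the extension lists for images and movies are illustrative assumptions, not something from the question:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ResourceTypes {
    // Extension -> category; the image and movie extensions are an assumed, partial list
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("pdf", "pdf");
        BY_EXTENSION.put("jmp", "jmp");
        BY_EXTENSION.put("jsl", "jsl");
        BY_EXTENSION.put("swf", "swf");
        BY_EXTENSION.put("zip", "zip");
        BY_EXTENSION.put("mp3", "mp3");
        for (String ext : new String[] {"png", "jpg", "jpeg", "gif"}) {
            BY_EXTENSION.put(ext, "image");
        }
        for (String ext : new String[] {"mp4", "avi", "mov"}) {
            BY_EXTENSION.put(ext, "movie");
        }
    }

    private ResourceTypes() {}

    /** Guesses a category from the extension of the last path segment. */
    static String classify(String url) {
        String path;
        try {
            path = URI.create(url).getPath(); // excludes query string and fragment
        } catch (IllegalArgumentException e) {
            return "unknown"; // not a syntactically valid URI
        }
        if (path == null) {
            return "unknown"; // opaque URIs such as tel: or mailto: have no path
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return "unknown"; // no extension present
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, "unknown");
    }
}
```

Links that come back as "unknown" can then be passed on to the slower content-type check.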

types groovy resources jsoup crawler4j
