Sunday, 15 September 2013

groovy - How to get the resource types from a webpage using JSoup?


I am trying to create a web crawler in Groovy. I want to extract the resource types from a webpage, and I need to check whether a particular webpage contains any of the following resource types:

PDFs

JMP files

SWF files

ZIP files

MP3 files

images

movie files

JSL files

I am working with crawler4j for crawling and JSoup for parsing. In general, I would like to know the approach for getting any resource type I may need in the future. I tried the following in BasicCrawler.groovy. It only tells me the content type of the page itself, i.e. text/html or text/xml. I need the types of all the resources on that page. Please correct me where I am going wrong:

    @Override
    void visit(Page page) {
        println "inside visit"
        int docid = page.getWebURL().getDocid()
        url = page.getWebURL().getURL()
        String domain = page.getWebURL().getDomain()
        String path = page.getWebURL().getPath()
        String subDomain = page.getWebURL().getSubDomain()
        parentUrl = page.getWebURL().getParentUrl()
        String anchor = page.getWebURL().getAnchor()

        println("docid: ${docid} ")
        println("url: ${url} ")

        Document doc = Jsoup.connect(url).get();
        Elements nextLinks = doc.body().select("[href]");
        for( Element link : nextLinks ) {
            String contentType = new URL(link.attr("href")).openConnection().getContentType();
            println url + "***" + contentType
        }

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData()
            String text = htmlParseData.getText()
            String html = htmlParseData.getHtml()
            List<WebURL> links = htmlParseData.getOutgoingUrls()
        }
        println("finished crawling")

        def crawlObj = new Resource(url : url)
        if (!crawlObj.save(flush: true, failOnError: true)) {
            crawlObj.errors.each { println it }
        }
    }

After printing 2 doc IDs, it throws an error: ERROR crawler.WebCrawler - Exception while running the visit method. Message: 'unknown protocol: tel' at java.net.URL.<init>(URL.java:592)

You could check all the URLs in the document and ask the server for the content type of each one. Here is a quick and dirty example:

    Document doc = Jsoup.connect("http://yourpage").get();
    Elements elements = doc.body().select("[href]");
    for (Element element : elements) {
        String contentType = new URL(element.attr("href")).openConnection().getContentType();
    }
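The 'unknown protocol: tel' error from the question comes from passing every href straight to new URL(...), which rejects schemes such as tel: and also fails on relative links that have no protocol at all. One possible way around this, sketched below with only the Java standard library, is to resolve each href against the page URL and keep only http and https results. The class name LinkFilter and the method toFetchableUri are made up for this illustration and are not part of jsoup or crawler4j.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper, not part of jsoup or crawler4j.
final class LinkFilter {

    // Resolves href against the URL of the page it was found on and returns
    // the result only when it is an http or https URI. Links such as
    // tel:, mailto: or javascript: yield an empty Optional.
    static Optional<URI> toFetchableUri(String pageUrl, String href) {
        if (href == null || href.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            URI resolved = URI.create(pageUrl).resolve(href.trim());
            String scheme = resolved.getScheme();
            if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                return Optional.of(resolved);
            }
            return Optional.empty();
        } catch (IllegalArgumentException e) {
            // The href is not a syntactically valid URI (for example it
            // contains an unescaped space); skip it instead of failing the crawl.
            return Optional.empty();
        }
    }
}
```

Inside the loop, the content type lookup would then run only when toFetchableUri(url, link.attr("href")) is present, using the returned URI's toURL(). jsoup can also resolve relative links itself through element.absUrl("href"), but the scheme check is still needed before opening a connection.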

For images, embedded elements and so on, you should search for the src attribute instead.
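Once a content type has been obtained for each href or src URL, it still has to be mapped to the resource types listed in the question. The sketch below is one way to do that in plain Java. The class name ResourceTypes, the method classify and the returned labels are made up for this illustration. The MIME types used are the common ones for PDF, Flash, ZIP and MP3. For JMP and JSL files the sketch assumes that servers do not report a distinctive content type, so it falls back to the file extension.

```java
import java.util.Locale;

// Hypothetical helper, not part of jsoup or crawler4j.
final class ResourceTypes {

    // Maps a Content-Type header value (may be null) and a URL path to one of
    // the labels: pdf, swf, zip, mp3, image, movie, jmp, jsl or other.
    static String classify(String contentType, String path) {
        String ct = contentType == null ? "" : contentType;
        int semi = ct.indexOf(';');
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String type = (semi >= 0 ? ct.substring(0, semi) : ct).trim().toLowerCase(Locale.ROOT);

        if (type.equals("application/pdf")) return "pdf";
        if (type.equals("application/x-shockwave-flash")) return "swf";
        if (type.equals("application/zip")) return "zip";
        if (type.equals("audio/mpeg")) return "mp3";
        if (type.startsWith("image/")) return "image";
        if (type.startsWith("video/")) return "movie";

        // Fallback: look at the file extension of the path.
        // Assumption: .jmp and .jsl files are only recognisable this way.
        String p = path == null ? "" : path.toLowerCase(Locale.ROOT);
        int dot = p.lastIndexOf('.');
        String ext = dot >= 0 ? p.substring(dot + 1) : "";
        switch (ext) {
            case "pdf": return "pdf";
            case "swf": return "swf";
            case "zip": return "zip";
            case "mp3": return "mp3";
            case "jmp": return "jmp";
            case "jsl": return "jsl";
            default: return "other";
        }
    }
}
```

The path argument would typically come from getPath() on the resolved link, so that query strings do not interfere with the extension check.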

