Tuesday, 15 March 2011

vb.net - HtmlAgilityPack not finding nodes from HttpWebRequest's returned HTML




I am a little new to HtmlAgilityPack. I want to use HttpWebRequest to return the HTML of a webpage and then parse that HTML with HtmlAgilityPack. I want to find all divs with a specific class and get the inner text of what is inside those divs. This is what I have so far. My GET request successfully returns the webpage HTML:

    Imports System.IO
    Imports System.Net

    Public Function mygetreq(ByVal myurl As String, ByRef thecookie As CookieContainer) As String
        Dim getreq As HttpWebRequest = DirectCast(HttpWebRequest.Create(myurl), HttpWebRequest)
        getreq.Method = "GET"
        getreq.KeepAlive = True
        getreq.CookieContainer = thecookie
        getreq.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"

        Dim getresponse As HttpWebResponse
        getresponse = DirectCast(getreq.GetResponse, HttpWebResponse)

        Dim getreqreader As New StreamReader(getresponse.GetResponseStream())
        Dim thepage = getreqreader.ReadToEnd

        'Clean up the streams and the response.
        getreqreader.Close()
        getresponse.Close()

        Return thepage
    End Function

This function returns the HTML. I then load the HTML into HtmlAgilityPack like this:

    'The HTML shows up in the RichTextBox
    RichTextBox1.Text = mygetreq("http://someurl.com", thecookie)

    Dim htmldoc = New HtmlAgilityPack.HtmlDocument()
    htmldoc.LoadHtml(RichTextBox1.Text)

    Dim htmlnodes As HtmlNodeCollection
    htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='someclass']")

    If htmlnodes IsNot Nothing Then
        For Each node In htmlnodes
            MessageBox.Show(node.InnerText())
        Next
    End If

The problem is that htmlnodes is coming back as Nothing (null), so the final If ... Then block never runs. It finds nothing, but I know for a fact that this div and class exist in the HTML page because I can see them in the HTML shown in RichTextBox1:

<div class="someclass"> inner text </div>

What is the problem here? Does htmldoc.LoadHtml not like the type of string that mygetreq returns as the page HTML?

Does this have anything to do with HTML entities? thepage contains literal < and > brackets. They are not entity-encoded.

I saw a post here (C#) that uses the HtmlWeb class, but I am not sure how I would set that up. Most of my code is already written with HttpWebRequest.
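For reference, a minimal sketch of the HtmlWeb approach might look like the following. It assumes the HtmlAgilityPack package is referenced and that the page can be fetched without cookies; the URL and class name are the placeholders used above, and the module name is hypothetical.

```vbnet
Imports System
Imports HtmlAgilityPack

Module HtmlWebSketch
    Sub Main()
        ' HtmlWeb downloads the page and parses it in one step.
        Dim web As New HtmlWeb()
        Dim doc As HtmlDocument = web.Load("http://someurl.com")

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim nodes As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//div[@class='someclass']")

        If nodes Is Nothing Then
            Console.WriteLine("No matching div elements found.")
            Return
        End If

        For Each node As HtmlNode In nodes
            Console.WriteLine(node.InnerText.Trim())
        Next
    End Sub
End Module
```

This sketch does not pass the CookieContainer used by mygetreq, so a page that depends on an existing session would still need the HttpWebRequest route.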

Thanks for reading and thanks for helping.

If you are willing to switch, you could use CsQuery, something along these lines:

    Dim q As New CQ(mygetreq("http://someurl.com", thecookie))
    For Each node In q("div.someclass")
        Console.WriteLine(node.InnerText)
    Next

You may want to add some error handling, but overall this should get you started.

You can add CsQuery to your project via NuGet:

    Install-Package CsQuery

And don't forget to add Imports CsQuery at the top of your code file.

This may not directly solve your problem, but it should make it easier to experiment with your data (via the Immediate window, for example).
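If you stay with HtmlAgilityPack, one thing worth ruling out while experimenting is that the div on the real page carries more than one class, in which case @class='someclass' would not match. This is only an assumption, since the actual page markup is not shown. The sketch below uses a hypothetical inline HTML string and a hypothetical module name to show an XPath that matches the class as one token among several.

```vbnet
Imports System
Imports HtmlAgilityPack

Module ClassMatchSketch
    Sub Main()
        ' Hypothetical markup: the div has a second class next to "someclass".
        Dim html As String = "<div class=""someclass other""> inner text </div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Pad the class attribute with spaces and look for " someclass " as a whole token.
        Dim xpath As String =
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]"

        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes(xpath)

        If nodes IsNot Nothing Then
            For Each node As HtmlNode In nodes
                Console.WriteLine(node.InnerText.Trim())
            Next
        Else
            Console.WriteLine("No matching div elements found.")
        End If
    End Sub
End Module
```

An exact comparison such as @class='someclass' only matches when the attribute value is exactly that string, so the token-based test is the more forgiving of the two.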

interesting read (performance comparison):

CsQuery Performance vs. Html Agility Pack and Fizzler

vb.net httpwebrequest html-agility-pack
