Sunday, 15 July 2012

java - Jsoup incorrect value children size -




Jsoup seems to count the number of children incorrectly:

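    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    Document document = Jsoup.parse(teststring);
    Element div = document.select("div").first();   // outermost div
    Elements divchildren = div.children();          // direct element children only
    System.out.println(divchildren.size());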

For example, if teststring is

<div><div><p>text1</p></div><p>text2</p></div>

or

<div><h1><p>text1</p></h1><p>text2</p></div>

then divchildren.size() returns 2, as expected.

But if teststring is

<div><p><p>text1</p></p><p>text2</p></div>

then divchildren.size() returns 4.

What am I doing wrong?

If you take a look at what the document holds after parsing

    String teststring = "<div><p><p>text1</p></p><p>text2</p></div>";

you will see

<html> <head></head> <body> <div> <p></p> <p>text1</p> <p></p> <p>text2</p> </div> </body> </html>

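As @rejesh pointed out, a p element can't contain other block-level elements such as another p, so jsoup repairs the invalid nesting by closing the outer p separately: the outer opening tag becomes one empty p element and the outer closing tag becomes another. So in this case

<p><p>text1</p></p>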

will become

<p></p><p>text1</p><p></p>

So the div

<div><p><p>text1</p></p><p>text2</p></div>

will be parsed as

<div> <p></p> <p>text1</p> <p></p> <p>text2</p> </div>

and you can see that there are 4 children (two empty p elements and two p elements with text).
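To make the counting above easier to follow, here is a rough sketch that models only this one repair rule with plain JDK classes. It is not how jsoup is implemented: the class name ParagraphRuleSketch and the method countRootChildren are made up for illustration, and the sketch assumes a single root element, tags without attributes, and that only a p sitting directly on top of the open-element stack gets closed by a new p.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy model of the nested-<p> repair rule; not jsoup code.
public class ParagraphRuleSketch {

    // Counts the direct element children of the outermost element of a fragment.
    static int countRootChildren(String html) {
        Deque<String> open = new ArrayDeque<>();   // stack of currently open elements
        int children = 0;
        Matcher m = Pattern.compile("<(/?)(\\w+)>").matcher(html);
        while (m.find()) {
            boolean isEndTag = !m.group(1).isEmpty();
            String name = m.group(2);
            if (!isEndTag) {
                // A new <p> implicitly closes a <p> that is still open.
                if (name.equals("p") && "p".equals(open.peek())) {
                    open.pop();
                }
                // Only the root is open, so the new element is a direct child of it.
                if (open.size() == 1) {
                    children++;
                }
                open.push(name);
            } else if (open.contains(name)) {
                // Normal end tag: close everything up to and including the match.
                while (!open.pop().equals(name)) {
                    // keep popping
                }
            } else if (name.equals("p") && open.size() == 1) {
                // Stray </p> with no open <p>: it turns into an empty <p></p> child.
                children++;
            }
        }
        return children;
    }

    public static void main(String[] args) {
        // Invalid nesting: two empty p elements are created, so this prints 4
        System.out.println(countRootChildren("<div><p><p>text1</p></p><p>text2</p></div>"));
        // Valid nesting: prints 2
        System.out.println(countRootChildren("<div><div><p>text1</p></div><p>text2</p></div>"));
    }
}
```

The first opening p is closed as soon as the second one starts, and the leftover closing tag has nothing to close, which is where the two extra empty elements come from.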

If you want to turn off this correcting mechanism, you can use the XML parser instead of the standard HTML parser:

    // Parser here is org.jsoup.parser.Parser
    String teststring = "<div><p><p>text1</p></p><p>text2</p></div>";
    Document document = Jsoup.parse(teststring, "", Parser.xmlParser());
    System.out.println(document);
    Element div = document.select("div").first();
    Elements divchildren = div.children();
    System.out.println(divchildren.size());

The last line will print 2.
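As a cross-check that does not depend on jsoup at all, the same string can be fed to the XML parser that ships with the JDK. This is only a sketch, and a different library from the one used above: the class name XmlChildCount and the helper countRootElementChildren are made up for illustration. An XML parser applies no HTML nesting rules, so the structure is kept exactly as written.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class using the JDK's DOM parser, not jsoup.
public class XmlChildCount {

    // Parses the markup as XML and counts the element children of the root.
    static int countRootElementChildren(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            int count = 0;
            NodeList nodes = root.getChildNodes();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Skip text nodes; only elements count as children here.
                if (nodes.item(i).getNodeType() == Node.ELEMENT_NODE) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String teststring = "<div><p><p>text1</p></p><p>text2</p></div>";
        // The nested p stays inside the outer p, so this prints 2
        System.out.println(countRootElementChildren(teststring));
    }
}
```

Keep in mind that a strict XML parser rejects input that is not well-formed, so this only works for markup whose tags are all properly closed.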

