Tuesday, 15 February 2011

Python - Web (weird) wrapped text to plain text string




I'm trying to convert wrapped text to a plain text string, with end-of-line characters and all. The wrapping is of a weird kind that I have never seen before. The text comes from the CDATA section of an XML file:

<font color="#bfffffff" size="12"></font><font color="#ff00ff00" size="12">my fellow muppets,<br><br>we sorry devilish intetions not going work out muppet brigade sorry guys not active ebough how ever extend arm players leave , bring together dynacorp. if of intrested drop me mail service , best of luck in future endevors. <br><br>o7 <br><br><br/></br></br></br></br></br></br></font><font color="#ff007fff" size="14">john milbroc<br/></font><font color="#bfffffff" size="14">--------------------------<br/></font><font color="#ff007fff" size="14">the muppet brigade ceo</font>

I've tried the following, though:

z = BeautifulSoup(string)
z.get_text()

However, BeautifulSoup does not seem to be doing anything. I'm rather new to Python, so sorry if this is a really easy problem.

I think maybe the BeautifulSoup module is broken, because when I run:

from bs4 import BeautifulSoup
html_doc = """ hi.<br><br>this message.<br><br> """
print(html_doc)
soup = BeautifulSoup(html_doc)
print(soup.text)

It prints:

hi.<br><br>this message.<br><br>
None

After messing around with other stuff, I found that if I do

soup.get_text()

instead of

soup.text

it will print the parsed text. Weird, but it worked. Thanks for the encouragement and for keeping me on the right track.
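
If you are not sure whether the BeautifulSoup install itself is at fault, one way to cross-check is to do the same conversion with only the standard library. The sketch below swaps in html.parser for BeautifulSoup; the names TextExtractor and html_to_text are made up for this example, and it assumes that each <br> should become one newline and that every other tag can simply be dropped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and turns each <br> into a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both arrive here; stray </br> end tags are ignored.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Plain text between tags is kept as-is.
        self.parts.append(data)


def html_to_text(markup):
    """Return the text content of markup, with <br> rendered as line breaks."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.parts)


print(html_to_text("hi.<br><br>this message.<br><br>"))
```

If this gives sensible output while BeautifulSoup does not, the problem is more likely in how BeautifulSoup is being called than in the markup.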

Why not parse the HTML using BeautifulSoup? For example:

html_doc = """ ## re-create the HTML text here """

Then parse it:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

You can then extract the text:

print(soup.text)
my fellow muppets,we sorry devilish intetions not going work out muppet brigade sorry guys not active ebough how ever extend arm players leave , bring together dynacorp. if of intrested drop me mail service , best of luck in future endevors. o7 john milbroc--------------------------the muppet brigade ceo
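
Note that the extracted text runs together where the <br> tags were, while the question asked for the line endings to be kept. A minimal sketch of one way to keep them, assuming bs4 is installed and that each <br> should map to a single newline: replace every <br> with "\n" before pulling out the text. The function name strip_tags_keep_breaks is hypothetical, and passing "html.parser" explicitly is a choice made here so the result does not depend on which other parsers happen to be installed.

```python
from bs4 import BeautifulSoup


def strip_tags_keep_breaks(markup):
    """Hypothetical helper: drop all tags but keep <br> as a newline."""
    soup = BeautifulSoup(markup, "html.parser")
    # Swap each <br> element for a newline string in the parse tree.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    # get_text() then concatenates the remaining text nodes in document order.
    return soup.get_text()


print(strip_tags_keep_breaks("hi.<br><br>this message.<br><br>"))
```

Run on the CDATA text from the question, this should leave the greeting, the body, and the signature on separate lines instead of running them together.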

