Friday, 15 January 2010

Python: How to use BeautifulSoup to deal with encoding issues?




This is my first time using BeautifulSoup.

Basically, I use BeautifulSoup to extract data. I am trying to build a table in CSV based on a web table. For illustration, a row of the table looks like this:

[<td>1</td>, <td> chief executives , senior officials</td>, <td>£120,830</td>,<td>-3.8</td>]

Now, the problem is that when I use .text.encode('utf8'), the output becomes:

('1', ' chief executives , senior officials', '\xc2\xa3120,830', '-3.8')

The figure £120,830 becomes \xc2\xa3120,830, and I have no idea what kind of encoding this is. Is there a way I can get the proper output £120,830 rather than this crazy encoding?

Alternatively, is there a way to write the crazy encoded thing \xc2\xa3120,830 as £120,830 in the CSV? Does anyone know how to deal with this kind of problem?

Another alternative is to remove the <td> tags and keep the content. How can I do that in Python? Is there an efficient way of getting rid of these tags? Any help is appreciated. Thanks.

That is how £ comes out when you encode it as UTF-8. If that's not what you want, why are you encoding it?

In more detail, UTF-8 encodes U+00A3 as the byte sequence 0xC2 0xA3 (two bytes), which Python displays in a string as '\xc2\xa3'.
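A minimal sketch of that round trip, assuming Python 3 (where encode() returns a bytes object rather than a str, unlike the Python 2 output shown in the question). The variable names are illustrative only:

```python
# Assumption: Python 3, where str is Unicode text and encode() returns bytes.
figure = "\u00a3120,830"            # the text '£120,830'

encoded = figure.encode("utf-8")    # U+00A3 becomes the two bytes 0xC2 0xA3
decoded = encoded.decode("utf-8")   # decoding restores the original text

print(encoded)                      # b'\xc2\xa3120,830'
print(decoded == figure)            # True
```

The escaped form is only how Python displays the bytes; the data itself is still the pound sign followed by the digits.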

If you want this in a file, and you want the file to be UTF-8 encoded, nothing is wrong, except maybe what you are using to look at the file.
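As a rough sketch of writing such a row to a UTF-8 encoded CSV file, assuming Python 3 and the standard csv module. The row below is a hypothetical stand-in for text already extracted from the <td> cells, and the output path is made up for the example:

```python
import csv
import os
import tempfile

# Hypothetical row: the cell text from the question, already extracted as str.
row = ["1", " chief executives , senior officials", "\u00a3120,830", "-3.8"]

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "table.csv")

# Let the file object do the encoding; the csv module expects newline="".
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(row)

# On disk, the pound sign is stored as the two bytes 0xC2 0xA3.
with open(path, "rb") as f:
    raw = f.read()

# Reading the file back as UTF-8 recovers the original text.
with open(path, encoding="utf-8", newline="") as f:
    read_back = next(csv.reader(f))

print(b"\xc2\xa3" in raw)   # True
print(read_back == row)     # True
```

If the pound sign still looks wrong after this, the likely culprit is a viewer that opens the file with a different encoding, not the file itself.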

python beautifulsoup
