Thursday, 15 March 2012

python - Scrapy linkextractors fail -




I am having a lot of trouble with Scrapy's link extractors. For example:

    scrapy shell "http://www.dachser.com/de/de/"
    # within the shell
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    SgmlLinkExtractor().extract_links(response)
    # yields: SGMLParseError: expected name token at '<!/iorangereddotmode'

Now, I only require a list of links, which is why I switched from SgmlLinkExtractor to the more basic HtmlParserLinkExtractor. It works for the URL above, but let's take another URL, and it fails:

    scrapy shell "http://www.yourfirm.de"
    # within the shell
    from scrapy.contrib.linkextractors.htmlparser import HtmlParserLinkExtractor
    HtmlParserLinkExtractor().extract_links(response)
    # yields: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)

What's going on here? I plan on extracting links from various websites, so a more foolproof way of extracting links would be much welcomed.

Update: Okay, I figured out that the ASCII error can be resolved on Windows by setting UTF-8 as the system default encoding, see here. Others fail though: scrapy shell "http://grunwald-wangen.de" causes UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 17: invalid start byte.
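Both decode errors can be reproduced without Scrapy. The following is a minimal sketch in Python 3, assuming the offending bytes come from a character such as "ü" (0xc3 is the first byte of its UTF-8 encoding, 0xfc is its Latin-1 encoding). The sample string and the try_decode helper are made up for illustration and are not taken from the sites above.

```python
# Hypothetical sample bytes, not taken from the sites in the question.
utf8_bytes = "B\u00fcro".encode("utf-8")      # contains the byte 0xc3
latin1_bytes = "B\u00fcro".encode("latin-1")  # contains the byte 0xfc


def try_decode(raw, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: %s" % exc


# UTF-8 bytes read with the ASCII codec: fails on 0xc3.
ascii_result = try_decode(utf8_bytes, "ascii")
# Latin-1 bytes read with the UTF-8 codec: fails on 0xfc.
utf8_result = try_decode(latin1_bytes, "utf-8")
# Latin-1 bytes read with the matching codec: succeeds.
latin1_result = try_decode(latin1_bytes, "latin-1")

print(ascii_result)
print(utf8_result)
print(latin1_result)
```

If this reading is right, changing the system default encoding to UTF-8 only helps for pages that really are UTF-8, which would explain why a page served in another encoding still fails.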

The HtmlParserLinkExtractor passes response.body to HTMLParser.

Altering the source code so that it receives response.body_as_unicode() instead fixes the issue. The doc states that Unicode is advised. I made a pull request on GitHub.
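The idea of decoding first and parsing afterwards can be sketched with the standard library alone. This is not Scrapy's code and not the patch from the pull request: it uses Python 3's html.parser, the LinkCollector class and extract_links function are hypothetical helpers, and the encoding is passed in by hand, whereas response.body_as_unicode() works it out from the response.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(body, encoding):
    """Decode the raw body first, then feed text (not bytes) to the parser."""
    # Assumption: the caller knows the page encoding. Undecodable bytes
    # are replaced instead of raising UnicodeDecodeError.
    text = body.decode(encoding, errors="replace")
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.links


# Made-up Latin-1 page with a non-ASCII character in the link text.
body = (
    '<html><body><a href="/kontakt">B\u00fcro</a> '
    '<a href="http://example.com/">x</a></body></html>'
).encode("latin-1")

links = extract_links(body, "latin-1")
print(links)
```

The parser only ever sees text here, so the decoding step is the single place where an encoding error can occur.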

As berendt stated in the comments, SgmlLinkExtractor seems to choke on malformed HTML.

python html-parsing scrapy
