python - Scrapy link extractors fail
I am running into a lot of failures with Scrapy's link extractors. For example:
    scrapy shell "http://www.dachser.com/de/de/"

    # within the shell
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    SgmlLinkExtractor().extract_links(response)
    # yields: SGMLParseError: expected name token at '<!/iorangereddotmode'

Now, I only require a list of links, which is why I switched from SgmlLinkExtractor to the more basic HtmlParserLinkExtractor. That works for the URL above, but take another URL and it fails:
    scrapy shell "http://www.yourfirm.de"

    # within the shell
    from scrapy.contrib.linkextractors.htmlparser import HtmlParserLinkExtractor
    HtmlParserLinkExtractor().extract_links(response)
    # yields: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)

What is going on here? I plan on extracting links from various websites, so a more foolproof way of extracting links would be very welcome.
Update: Okay, I figured out that the ASCII error can be resolved on Windows by setting UTF-8 as the system default encoding, see here. Other sites still fail, though:

    scrapy shell "http://grunwald-wangen.de"
    # yields: UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 17: invalid start byte
The HtmlParserLinkExtractor passes response.body, the raw bytes, to HTMLParser.
Altering the source code so that it receives response.body_as_unicode() instead fixes the issue. The docs state that feeding unicode is advised. I made a pull request on GitHub.
As berendt stated in the comments, SgmlLinkExtractor seems to choke on malformed HTML.
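To illustrate the idea behind the fix (decode the body to unicode first, then hand it to a lenient parser), here is a minimal standalone sketch. It uses only the Python 3 standard library, not Scrapy's own extractors. The names LinkCollector, decode_body, and extract_links are made up for this example, and the Latin-1 fallback is an assumption; in a real spider the encoding Scrapy detects for the response would be the better choice.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def decode_body(body, declared_encoding="utf-8"):
    """Decode raw bytes to text before parsing.

    Assumption: if the declared encoding fails, fall back to Latin-1,
    which maps every byte to a character and therefore never raises.
    """
    try:
        return body.decode(declared_encoding)
    except UnicodeDecodeError:
        return body.decode("latin-1")


def extract_links(body, base_url, declared_encoding="utf-8"):
    """Return all <a href> targets found in a raw HTML body."""
    parser = LinkCollector(base_url)
    parser.feed(decode_body(body, declared_encoding))
    parser.close()
    return parser.links


if __name__ == "__main__":
    # A byte 0xfc (Latin-1 "ü") plus a malformed "<!/..." declaration,
    # similar to the two failures described in the question.
    raw = b'<!/iorangereddotmode><a href="/gr\xfcn">one</a><a href="two.html">two</a>'
    print(extract_links(raw, "http://example.com/de/"))
```

The standard-library HTMLParser treats the odd "<!/..." construct as a bogus comment instead of raising, which is why it copes with markup that the SGML-based extractor rejects.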
python html-parsing scrapy