python - My SGML link extractor is not matching the regex in Scrapy
This is my code:

# Python 2 / Scrapy 0.22; amazonscraper is the asker's own Item class, defined elsewhere
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class myspider(CrawlSpider):
    name = "scraper"
    allowed_domains = ["amazon.com"]
    start_urls = ["http://www.amazon.com/kindle-ebooks/b?ie=utf8&node=154606011"]
    rules = [Rule(SgmlLinkExtractor(allow=('.*?/\gp/\product.*?')), callback='parse_items', follow=True)]

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        url = response.url
        item = amazonscraper()
        print 'inside'
        print sel.css('#btasintitle::text').extract()
        item["title"] = ''.join(sel.css('#btasintitle::text').extract())
        print '-----', item["title"]
        print response.url
        # Join the extracted text nodes, then strip all whitespace from each price
        item["digitalprice"] = ''.join(sel.css('.digitallistprice>.listprice::text').extract())
        item["digitalprice"] = re.sub('\s+', '', item["digitalprice"])
        item["listprice"] = ''.join(sel.css('.listprice::text').extract())
        item["listprice"] = re.sub('\s+', '', item["listprice"])
        item["kindleprice"] = ''.join(sel.css('.pricelarge::text').extract())
        item["kindleprice"] = re.sub('\s+', '', item["kindleprice"])
        if item["digitalprice"] != None and item["listprice"] != None and item["kindleprice"] != None:
            items.append(item)
            print items
        return items

I'm also getting URLs that do not match the regex. Why is that? I want to crawl the book links in the seed page.
As suggested in the comments, perhaps look at your regex.

Here's a rather lengthy (by the number of links; I skipped some of them) Scrapy shell session (run from France, so the response may not be the same in your part of the world), and it seems to be fetching quite a lot of product links:
paul@paul-satellite-r830:~$ scrapy shell "http://www.amazon.com/kindle-ebooks/b?ie=utf8&node=154606011" --set USER_AGENT="Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
2014-06-20 12:58:05+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
...
2014-06-20 12:58:06+0200 [default] INFO: Spider opened
2014-06-20 12:58:08+0200 [default] DEBUG: Crawled (200) <GET http://www.amazon.com/kindle-ebooks/b?ie=utf8&node=154606011> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f6ec6fb4310>
[s]   item       {}
[s]   request    <GET http://www.amazon.com/kindle-ebooks/b?ie=utf8&node=154606011>
[s]   response   <200 http://www.amazon.com/kindle-ebooks/b?ie=utf8&node=154606011>
[s]   sel        <Selector xpath=None data=u'<html>\n <head>\n <meta http-equ'>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x7f6ec6740590>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [2]: lx = SgmlLinkExtractor(allow=('.*?/\gp/\product.*?',))

In [3]: import pprint

In [4]: pprint.pprint([link.url for link in lx.extract_links(response)])
['http://www.amazon.com/gp/product/b00dbybnee/ref=gno_joinprmlogo/181-5939241-1829655',
 'http://www.amazon.com/gp/product/b00dbybnee/ref=nav_prime_join/181-5939241-1829655',
 'http://www.amazon.com/gp/product/b007hccnju/ref=topnav_storetab_kstore/181-5939241-1829655',
 'http://www.amazon.com/gp/product/b00fl3yl7o/ref=amb_link_410918762_2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1775973302&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-top-1&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00gl3mgti/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1826829602&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product-reviews/b00gl3mgti/ref=s9_al_bw_rs1/181-5939241-1829655?ie=utf8&pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1826829602&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-5&pf_rd_t=101&showviewpoints=1',
 'http://www.amazon.com/gp/product/b00hwi5op4/ref=s9_al_bw_g351_i2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1826829602&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product-reviews/b00hwi5op4/ref=s9_al_bw_rs2/181-5939241-1829655?ie=utf8&pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1826829602&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-5&pf_rd_t=101&showviewpoints=1',
 'http://www.amazon.com/gp/product/b009nf6z2k/ref=s9_al_bw_g351_i3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1826829602&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product-reviews/b009nf6z2k/ref=s9_al_bw_rs3/181-5939241-1829655?ie=utf8&pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1826829602&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-5&pf_rd_t=101&showviewpoints=1',
 ...
 'http://www.amazon.com/gp/product-reviews/b00dn7baug/ref=s9_hps_bw_rs3/181-5939241-1829655?ie=utf8&pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1819075922&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-12&pf_rd_t=101&showviewpoints=1',
 'http://www.amazon.com/gp/product/b00a7h2cfw/ref=s9_hps_bw_g351_i4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1819075922&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-12&pf_rd_t=101',
 'http://www.amazon.com/gp/product-reviews/b00a7h2cfw/ref=s9_hps_bw_rs4/181-5939241-1829655?ie=utf8&pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1819075922&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-12&pf_rd_t=101&showviewpoints=1',
 'http://www.amazon.com/gp/product/b00b52iqna/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1711163122&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00b52iqna/ref=s9_al_bw_g351_t1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1711163122&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00b52iqt4/ref=s9_al_bw_g351_i2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1711163122&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00b52iqt4/ref=s9_al_bw_g351_t2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1711163122&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00b52iqsa/ref=s9_al_bw_g351_i3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1711163122&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00b52iqsa/ref=s9_al_bw_g351_t3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1711163122&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00dgaltqa/ref=amb_link_409685542_1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1749675842&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00dgaltqa/ref=amb_link_409685542_3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1749675842&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00dgaltqa/ref=amb_link_409685542_4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1749675842&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00fl3yl6k/ref=amb_link_410240162_1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1752410382&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00fl3yl6k/ref=amb_link_410240162_3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1752410382&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00dzqe2y6/ref=amb_link_410240162_4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1752410382&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00c7xtoms/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1711175222&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-8&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00c7xtoms/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1711175222&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-8&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00juwygdq/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1814488482&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-9&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00juwygdq/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1814488482&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-9&pf_rd_t=101']

In [5]: lx = SgmlLinkExtractor(allow=('/gp/product/',))

In [6]: pprint.pprint([link.url for link in lx.extract_links(response)])
['http://www.amazon.com/gp/product/b00dbybnee/ref=gno_joinprmlogo/181-5939241-1829655',
 'http://www.amazon.com/gp/product/b00dbybnee/ref=nav_prime_join/181-5939241-1829655',
 ...
 'http://www.amazon.com/gp/product/b00juwygdq/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1814488482&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-9&pf_rd_t=101',
 'http://www.amazon.com/gp/product/b00juwygdq/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=atvpdkikx0der&pf_rd_p=1814488482&pf_rd_r=17jqxd2h3n2ez3m7cf1r&pf_rd_s=merchandised-search-right-9&pf_rd_t=101']

In [7]: len([link.url for link in lx.extract_links(response)])
Out[7]: 106

So that is 106 /gp/product/ links, compared to 185 with your regex.
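
The gap between the two counts is probably explained by the /gp/product-reviews/ links, which the looser pattern also accepts. Here is a minimal sketch using only the standard `re` module, assuming the `allow` patterns are applied to each URL with `re.search`-style matching (which is how the link extractor appears to behave in the session above). The sample URLs are shortened from that session, and `loose`, `strict` and `allowed` are names made up for this illustration. The stray backslashes from the question's pattern are dropped here: Python 2 treats `\g` and `\p` as plain letters, while recent Python 3 versions reject them as bad escapes.

```python
import re

# URLs shortened from the shell session (query strings dropped for readability)
urls = [
    "http://www.amazon.com/gp/product/b00gl3mgti/ref=s9_al_bw_g351_i1/181-5939241-1829655",
    "http://www.amazon.com/gp/product-reviews/b00gl3mgti/ref=s9_al_bw_rs1/181-5939241-1829655",
    "http://www.amazon.com/kindle-ebooks/b?ie=utf8&node=154606011",
]

# The question's pattern without the stray backslashes: no slash after "product"
loose = re.compile(r".*?/gp/product.*?")
# The stricter pattern from In [5]: the trailing slash excludes "product-reviews"
strict = re.compile(r"/gp/product/")


def allowed(pattern, candidates):
    """Return the candidates the pattern matches anywhere in the URL."""
    return [u for u in candidates if pattern.search(u)]


loose_hits = allowed(loose, urls)
strict_hits = allowed(strict, urls)

print(len(loose_hits), len(strict_hits))
```

With these three sample URLs, the loose pattern keeps both the product page and the product-reviews page, while the strict pattern keeps only the product page. The leading and trailing `.*?` add nothing under search-style matching, so `/gp/product/` alone is enough.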
python web-scraping scrapy