python - How to use the Rule class in scrapy
I am trying to use the Rule class to go to the next page in my crawler. Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from crawler.items import gdreview


class gdspider(CrawlSpider):
    name = "gd"
    allowed_domains = ["glassdoor.com"]
    start_urls = [
        "http://www.glassdoor.com/reviews/johnson-and-johnson-reviews-e364_p1.htm"
    ]

    rules = (
        # Extract next links and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]/a/@href',)), follow=True)
    )

    def parse(self, response):
        company_name = response.xpath('//*[@id="eihdrmodule"]/div[3]/div[2]/p/text()').extract()

        '''loop over every review in this page'''
        for sel in response.xpath('//*[@id="employerreviews"]/ol/li'):
            review = item()
            review['company_name'] = company_name
            review['id'] = str(sel.xpath('@id').extract()[0]).split('_')[1]  # sel.xpath('@id/text()').extract()
            review['body'] = sel.xpath('div/div[3]/div/div[2]/p/text()').extract()
            review['date'] = sel.xpath('div/div[1]/div/time/text()').extract()
            review['summary'] = sel.xpath('div/div[2]/div/div[2]/h2/tt/a/span/text()').extract()
            yield review
My question is about the rules section. In this rule, the extracted link doesn't contain the domain name. For example, it returns something like "/reviews/johnson-and-johnson-reviews-e364_p1.htm"
How can I make sure that my crawler appends the domain to the returned link?
Thanks
You can be sure, since this is the default behavior of link extractors in Scrapy (source code).
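To make that default behavior concrete, here is a minimal sketch of how a relative link is resolved against the URL of the page it was found on. It uses only the standard library's urljoin and is not Scrapy's own code; the page URL is the one from the question, and the relative href is a hypothetical value for the "next" link.

```python
from urllib.parse import urljoin

# URL of the page being crawled (taken from the question's start_urls).
page_url = "http://www.glassdoor.com/reviews/johnson-and-johnson-reviews-e364_p1.htm"

# Hypothetical relative href, as it might appear in the "next" link.
relative_href = "/reviews/johnson-and-johnson-reviews-e364_p2.htm"

# Resolving the relative href against the page URL yields an absolute URL.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
```

The scheme and domain come from the page URL, so the spider does not need to prepend them itself.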
Also, the restrict_xpaths argument should not point to the @href attribute. Instead, it should point either to a elements or to containers that have a elements as descendants. Plus, restrict_xpaths can be defined as a string.
In other words, replace:
restrict_xpaths=('//li[@class="next"]/a/@href',)
with:
restrict_xpaths='//li[@class="next"]/a'
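The reason the expression should select elements rather than attributes can be sketched with the standard library alone. This is only an illustration of the two-step idea (pick a region, then collect the links inside it), not Scrapy's implementation; the markup below is hypothetical and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, loosely modeled on the question's XPath.
html = """
<ul>
  <li class="prev"><a href="/reviews/page-1.htm">Previous</a></li>
  <li class="next"><a href="/reviews/page-3.htm">Next</a></li>
</ul>
"""

root = ET.fromstring(html)

# Step 1: restrict the search to a region of the document, here the <a>
# elements inside li.next. This is the role restrict_xpaths plays.
regions = root.findall(".//li[@class='next']/a")

# Step 2: collect the href attributes of the links found in those regions.
hrefs = [a.get("href") for a in regions]
print(hrefs)
```

If step 1 selected the attribute value itself, there would be no element left for step 2 to inspect, which is why the expression has to stop at the a element or one of its containers.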
Besides, you need to switch from SgmlLinkExtractor to LxmlLinkExtractor:
SGMLParser based link extractors are unmaintained and their usage is discouraged. It is recommended to migrate to LxmlLinkExtractor if you are still using SgmlLinkExtractor.
Personally, I usually use the LinkExtractor shortcut to LxmlLinkExtractor:
from scrapy.contrib.linkextractors import LinkExtractor
To summarize, this is what you would have in rules:
rules = [Rule(LinkExtractor(restrict_xpaths='//li[@class="next"]/a'), follow=True)]