python - How to use the Rule class in scrapy -


I am trying to use the Rule class to go to the next page in my crawler. Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from crawler.items import GdReview

class GdSpider(CrawlSpider):
    name = "gd"
    allowed_domains = ["glassdoor.com"]
    start_urls = [
        "http://www.glassdoor.com/reviews/johnson-and-johnson-reviews-e364_p1.htm"
    ]

    rules = (
        # Extract "next" links and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]/a/@href',)), follow=True),
    )

    def parse(self, response):
        company_name = response.xpath('//*[@id="eihdrmodule"]/div[3]/div[2]/p/text()').extract()

        # loop over every review on the page
        for sel in response.xpath('//*[@id="employerreviews"]/ol/li'):
            review = GdReview()
            review['company_name'] = company_name
            review['id'] = str(sel.xpath('@id').extract()[0]).split('_')[1]
            review['body'] = sel.xpath('div/div[3]/div/div[2]/p/text()').extract()
            review['date'] = sel.xpath('div/div[1]/div/time/text()').extract()
            review['summary'] = sel.xpath('div/div[2]/div/div[2]/h2/tt/a/span/text()').extract()
            yield review

My question is about the rules section. In the Rule, the extracted link doesn't contain the domain name. For example, it returns "/reviews/johnson-and-johnson-reviews-e364_p1.htm".

How can I make sure the crawler appends the domain to the returned link?

Thanks

You can be sure it does, since that is the default behavior of link extractors in Scrapy (see the source code): relative links are joined with the response URL, so the extracted links are always absolute.
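Here is a minimal sketch that shows this (the page URLs and the inline HTML snippet are made up for illustration): build an HtmlResponse by hand and run a link extractor over it; the relative href comes back as an absolute URL.

from scrapy.http import HtmlResponse
from scrapy.contrib.linkextractors import LinkExtractor

# Hypothetical page snippet containing a relative "next" link, served from a known URL
html = b'<html><body><li class="next"><a href="/reviews/page2.htm">Next</a></li></body></html>'
response = HtmlResponse(url="http://www.glassdoor.com/reviews/page1.htm",
                        body=html, encoding='utf-8')

links = LinkExtractor(restrict_xpaths='//li[@class="next"]/a').extract_links(response)
print([link.url for link in links])
# ['http://www.glassdoor.com/reviews/page2.htm']  <- the domain is prepended for you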

Also, the restrict_xpaths argument should not point to the @href attribute; instead it should point either to a elements or to containers having a elements as descendants. Plus, restrict_xpaths can be defined as a string.

In other words, replace:

restrict_xpaths=('//li[@class="next"]/a/@href',) 

with:

restrict_xpaths='//li[@class="next"]/a' 

Besides, you need to switch from SgmlLinkExtractor to LxmlLinkExtractor:

SGMLParser-based link extractors are unmaintained and their usage is discouraged. It is recommended to migrate to LxmlLinkExtractor if you are still using SgmlLinkExtractor.

Personally, I use the LinkExtractor shortcut to LxmlLinkExtractor:

from scrapy.contrib.linkextractors import LinkExtractor

To summarize, you would have this in the rules:

rules = [
    Rule(LinkExtractor(restrict_xpaths='//li[@class="next"]/a'), follow=True)
]
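For completeness, here is a sketch of the spider from the question with those changes applied (assuming the GdReview item from your project; note that CrawlSpider reserves parse() for its own logic, so the pagination callback is named parse_item, as the comment in your original code already suggests):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from crawler.items import GdReview

class GdSpider(CrawlSpider):
    name = "gd"
    allowed_domains = ["glassdoor.com"]
    start_urls = [
        "http://www.glassdoor.com/reviews/johnson-and-johnson-reviews-e364_p1.htm"
    ]

    rules = [
        # Follow the "next" pagination link and parse every followed page with parse_item
        Rule(LinkExtractor(restrict_xpaths='//li[@class="next"]/a'),
             callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        company_name = response.xpath('//*[@id="eihdrmodule"]/div[3]/div[2]/p/text()').extract()
        for sel in response.xpath('//*[@id="employerreviews"]/ol/li'):
            review = GdReview()
            review['company_name'] = company_name
            review['id'] = str(sel.xpath('@id').extract()[0]).split('_')[1]
            review['body'] = sel.xpath('div/div[3]/div/div[2]/p/text()').extract()
            review['date'] = sel.xpath('div/div[1]/div/time/text()').extract()
            review['summary'] = sel.xpath('div/div[2]/div/div[2]/h2/tt/a/span/text()').extract()
            yield review

Note that the start URL itself is handled by CrawlSpider's parse_start_url rather than the rule callback, so if you also want the reviews from the first page you can point parse_start_url at the same method.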
