python - How to use the Rule class in scrapy

- September 15, 2013

I am trying to use the Rule class to go to the next page in my crawler. Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from crawler.items import gdreview


class gdspider(CrawlSpider):
    name = "gd"
    allowed_domains = ["glassdoor.com"]
    start_urls = [
        "http://www.glassdoor.com/reviews/johnson-and-johnson-reviews-e364_p1.htm"
    ]

    rules = (
        # Extract next links and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]/a/@href',)), follow=True)
    )

    def parse(self, response):
        company_name = response.xpath('//*[@id="eihdrmodule"]/div[3]/div[2]/p/text()').extract()

        '''Loop over every review in this page'''
        for sel in response.xpath('//*[@id="employerreviews"]/ol/li'):
            review = item()
            review['company_name'] = company_name
            review['id'] = str(sel.xpath('@id').extract()[0]).split('_')[1]  # sel.xpath('@id/text()').extract()
            review['body'] = sel.xpath('div/div[3]/div/div[2]/p/text()').extract()
            review['date'] = sel.xpath('div/div[1]/div/time/text()').extract()
            review['summary'] = sel.xpath('div/div[2]/div/div[2]/h2/tt/a/span/text()').extract()

            yield review

My question is about the rules section. In this rule, the extracted link doesn't contain the domain name. For example, it returns something like "/reviews/johnson-and-johnson-reviews-e364_p1.htm".

How can I make sure that my crawler appends the domain to the returned link?

Thanks

You can be sure of it, since this is the default behavior of link extractors in Scrapy (see the source code).
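To illustrate what that default behavior amounts to, here is a minimal sketch using only the standard library. It is not Scrapy's own code, just the same kind of resolution of a relative href against the URL of the page it was found on. The page URL and the href are the ones from the question; the variable names are made up for the example.

```python
from urllib.parse import urljoin

# URL of the page being crawled (the start URL from the question).
page_url = "http://www.glassdoor.com/reviews/johnson-and-johnson-reviews-e364_p1.htm"

# Relative link as it appears in the page's href attribute.
relative_href = "/reviews/johnson-and-johnson-reviews-e364_p1.htm"

# Resolve the relative href against the page URL to get an absolute URL.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
```

The scheme and domain come from the page URL, so the crawler never has to append the domain by hand.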

Also, the restrict_xpaths argument should not point to the @href attribute. Instead, it should point either to a elements or to containers that have a elements as descendants. Plus, restrict_xpaths can be defined as a string.

In other words, replace:

restrict_xpaths=('//li[@class="next"]/a/@href',)

with:

restrict_xpaths='//li[@class="next"]/a'
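The difference between selecting the a elements and selecting their @href values can be sketched outside Scrapy with the standard library's ElementTree. This is only an analogy, not how Scrapy is implemented: the markup below is hypothetical and simplified (it is not Glassdoor's real HTML), and the point is that the path picks out the elements, after which the href is read from each selected element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup used only for this sketch.
PAGE = """
<ul>
  <li class="prev"><a href="/reviews/page-1.htm">Previous</a></li>
  <li class="next"><a href="/reviews/page-3.htm">Next</a></li>
</ul>
"""

root = ET.fromstring(PAGE)

# Select the <a> elements inside the "next" list item, not their attributes.
next_anchors = root.findall('.//li[@class="next"]/a')

# The href is then read from each selected element.
next_hrefs = [a.get("href") for a in next_anchors]
print(next_hrefs)
```

Only the link inside the li with class "next" is picked up; the "prev" link is ignored because it lies outside the selected region.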

Besides, you need to switch from SgmlLinkExtractor to LxmlLinkExtractor. Quoting the documentation:

SGMLParser based link extractors are unmaintained and their usage is discouraged. It is recommended to migrate to LxmlLinkExtractor if you are still using SgmlLinkExtractor.

Personally, I usually use the LinkExtractor shortcut to LxmlLinkExtractor:

from scrapy.contrib.linkextractors import LinkExtractor

To summarize, this is what you would have in rules:

rules = [
    Rule(LinkExtractor(restrict_xpaths='//li[@class="next"]/a'), follow=True)
]
