python - Scrapy: Attempts to extract data from selector list not right
I am trying to scrape football fixtures from a website, and my spider is not quite right: either the same fixture is repeated for every selector, or the hometeam and awayteam variables are huge arrays that contain all the home sides or all the away sides respectively. Either way, it should reflect the Home vs Away format.
This is my current attempt:
    class fixturespider(CrawlSpider):
        name = "fixturesspider"
        allowed_domains = ["www.bbc.co.uk"]
        start_urls = [
            "http://www.bbc.co.uk/sport/football/premier-league/fixtures"
        ]

        def parse(self, response):
            for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
                item = fixture()
                item['kickoff'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr[@class='preview']/td[3]/text()").extract()[0].strip())
                item['hometeam'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr/td[2]/p/span/a/text()").extract()[0].strip())
                item['awayteam'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr/td[2]/p/span/a/text()").extract()[1].strip())
                yield item
This returns the information below over and over, which is incorrect:
    2015-03-20 21:41:40+0000 [fixturesspider] DEBUG: Scraped from <200 http://www.bbc.co.uk/sport/football/premier-league/fixtures>
        {'awayteam': 'west brom', 'hometeam': 'man city', 'kickoff': '12:45'}
    2015-03-20 21:41:40+0000 [fixturesspider] DEBUG: Scraped from <200 http://www.bbc.co.uk/sport/football/premier-league/fixtures>
        {'awayteam': 'west brom', 'hometeam': 'man city', 'kickoff': '12:45'}
Could someone let me know where I'm going wrong?
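The problem is that the XPath expressions you are using in the loop are absolute: they start from the root element, but they should be relative to the current row that sel is pointing to. In other words, you need to search in the context of the current row. Because every absolute expression matches the whole page, extract()[0] always returns the value from the first row, which is why the same fixture is repeated.
Fixed version: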
    for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
        item = fixture()
        # Paths without a leading // are evaluated relative to the current row
        item['kickoff'] = str(sel.xpath("td[3]/text()").extract()[0].strip())
        item['hometeam'] = str(sel.xpath("td[2]/p/span/a/text()").extract()[0].strip())
        item['awayteam'] = str(sel.xpath("td[2]/p/span/a/text()").extract()[1].strip())
        yield item
This is the output I'm getting:
    {'awayteam': 'west brom', 'hometeam': 'man city', 'kickoff': '12:45'}
    {'awayteam': 'swansea', 'hometeam': 'aston villa', 'kickoff': '15:00'}
    {'awayteam': 'arsenal', 'hometeam': 'newcastle', 'kickoff': '15:00'}
    ...
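To see the difference between searching from the root and searching from the current row in isolation, here is a minimal sketch. It assumes a made-up, simplified table and uses the standard library's ElementTree in place of Scrapy selectors, purely so that it runs without any dependencies; the markup and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup loosely modelled on a fixtures table.
HTML = """
<html><body>
<table class="table-stats"><tbody>
  <tr class="preview"><td>Man City</td><td>West Brom</td><td>12:45</td></tr>
  <tr class="preview"><td>Aston Villa</td><td>Swansea</td><td>15:00</td></tr>
</tbody></table>
</body></html>
"""

root = ET.fromstring(HTML)
rows = root.findall(".//tr[@class='preview']")

# Searching from the document root inside the loop ignores the current row,
# so every iteration picks up the kickoff cell of the first row.
from_root = [root.findall(".//tr[@class='preview']/td[3]")[0].text for _ in rows]

# Searching relative to each row yields that row's own kickoff cell.
from_row = [row.find("td[3]").text for row in rows]

print(from_root)
print(from_row)
```

The first list repeats one value for every row, which is the same symptom as in the question; the second list has one distinct value per row.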
If you want to grab the match dates as well, you need to change the strategy: iterate over the dates (h2 elements with the table-header class) and, for each one, get the first following sibling table element:
    for date in response.xpath('//h2[@class="table-header"]'):
        # The first table-stats table after this heading holds that day's matches
        matches = date.xpath('.//following-sibling::table[@class="table-stats"][1]/tbody/tr[@class="preview"]')
        date = date.xpath('text()').extract()[0].strip()

        for match in matches:
            item = fixture()
            item['date'] = date
            item['kickoff'] = match.xpath("td[3]/text()").extract()[0].strip()
            item['hometeam'] = match.xpath("td[2]/p/span/a/text()").extract()[0].strip()
            item['awayteam'] = match.xpath("td[2]/p/span/a/text()").extract()[1].strip()
            yield item
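The idea of pairing each date heading with the table that follows it can also be sketched without Scrapy. ElementTree does not support the following-sibling axis, so this version walks the siblings in document order instead and remembers the most recent heading; it roughly mirrors the [1] predicate by using only the first table after each heading. The markup, the fixtures_by_date helper and the field names are assumptions made for illustration, not the real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each date heading is followed by its fixtures table.
HTML = """
<div>
  <h2 class="table-header">Saturday 21st March 2015</h2>
  <table class="table-stats"><tbody>
    <tr class="preview"><td>Man City v West Brom</td><td>12:45</td></tr>
    <tr class="preview"><td>Aston Villa v Swansea</td><td>15:00</td></tr>
  </tbody></table>
  <h2 class="table-header">Sunday 22nd March 2015</h2>
  <table class="table-stats"><tbody>
    <tr class="preview"><td>Team A v Team B</td><td>13:30</td></tr>
  </tbody></table>
</div>
"""


def fixtures_by_date(container):
    """Yield one dict per match row, tagged with the preceding date heading."""
    current_date = None
    for child in container:
        if child.tag == "h2" and child.get("class") == "table-header":
            current_date = child.text.strip()
        elif (child.tag == "table" and child.get("class") == "table-stats"
              and current_date is not None):
            for row in child.findall("tbody/tr[@class='preview']"):
                match, kickoff = (td.text.strip() for td in row.findall("td"))
                yield {"date": current_date, "match": match, "kickoff": kickoff}
            # Only the first table after a heading belongs to that heading.
            current_date = None


fixtures = list(fixtures_by_date(ET.fromstring(HTML)))
for fixture_row in fixtures:
    print(fixture_row)
```

In a real spider the XPath version is shorter, since Scrapy selectors support the full following-sibling axis; the walk above only makes the heading-to-table pairing explicit.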