python - Scrapy: Attempts to extract data from selector list not right


I am trying to scrape football fixtures from a website, and my spider is not quite right: either I get the same fixture repeated for all selectors, or the hometeam and awayteam variables are huge arrays that contain all the home sides or all the away sides respectively. Either way, it should reflect the home vs away format.

This is my current attempt:

    class FixtureSpider(CrawlSpider):
        name = "fixturesspider"
        allowed_domains = ["www.bbc.co.uk"]
        start_urls = [
            "http://www.bbc.co.uk/sport/football/premier-league/fixtures"
        ]

        def parse(self, response):
            for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
                item = fixture()
                item['kickoff'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr[@class='preview']/td[3]/text()").extract()[0].strip())
                item['hometeam'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr/td[2]/p/span/a/text()").extract()[0].strip())
                item['awayteam'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr/td[2]/p/span/a/text()").extract()[1].strip())
                yield item

This returns the information below, which is repeated and incorrect:

    2015-03-20 21:41:40+0000 [fixturesspider] DEBUG: Scraped from <200 http://www.bbc.co.uk/sport/football/premier-league/fixtures> {'awayteam': 'west brom', 'hometeam': 'man city', 'kickoff': '12:45'}
    2015-03-20 21:41:40+0000 [fixturesspider] DEBUG: Scraped from <200 http://www.bbc.co.uk/sport/football/premier-league/fixtures> {'awayteam': 'west brom', 'hometeam': 'man city', 'kickoff': '12:45'}

Could someone let me know where I'm going wrong?

The problem is that the XPath expressions you are using in the loop are absolute: they start from the root element, but they should be relative to the current row that sel is pointing to. In other words, you need to search in the context of the current row.

Fixed version:

    for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
        item = fixture()
        item['kickoff'] = str(sel.xpath("td[3]/text()").extract()[0].strip())
        item['hometeam'] = str(sel.xpath("td[2]/p/span/a/text()").extract()[0].strip())
        item['awayteam'] = str(sel.xpath("td[2]/p/span/a/text()").extract()[1].strip())
        yield item

This is the output I'm getting:

    {'awayteam': 'west brom', 'hometeam': 'man city', 'kickoff': '12:45'}
    {'awayteam': 'swansea', 'hometeam': 'aston villa', 'kickoff': '15:00'}
    {'awayteam': 'arsenal', 'hometeam': 'newcastle', 'kickoff': '15:00'}
    ...
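
The difference between searching from the document root and searching from the current row can be reproduced offline. The sketch below is only an illustration: it swaps Scrapy's selectors for the standard library's xml.etree.ElementTree so that it runs without Scrapy or a network connection, and the markup in SAMPLE is an invented, simplified stand-in for the fixtures table, not the real page.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup (an assumption, not the real BBC page).
SAMPLE = """
<div>
  <table class="table-stats">
    <tbody>
      <tr class="preview">
        <td>1</td>
        <td><p><span><a>Man City</a></span> v <span><a>West Brom</a></span></p></td>
        <td>12:45</td>
      </tr>
      <tr class="preview">
        <td>2</td>
        <td><p><span><a>Aston Villa</a></span> v <span><a>Swansea</a></span></p></td>
        <td>15:00</td>
      </tr>
    </tbody>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)
rows = root.findall(".//table[@class='table-stats']/tbody/tr[@class='preview']")

# Mistake: each iteration searches the whole document again, so index 0
# is always the value from the first row.
from_root = [root.findall(".//tr/td[3]")[0].text.strip() for _ in rows]

# Fix: search relative to the row currently being processed.
from_row = [row.find("td[3]").text.strip() for row in rows]
teams = [[a.text.strip() for a in row.findall("td[2]/p/span/a")] for row in rows]

print(from_root)
print(from_row)
print(teams)
```

The first list repeats the same kickoff time for every row, which mirrors the repeated fixture in the question; the second and third lists change per row because the lookup starts at the row itself.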

If you also want to grab the match dates, you need to change the strategy: iterate over the dates (h2 elements with the table-header class) and, for each one, get the first following sibling table element:

    for date in response.xpath('//h2[@class="table-header"]'):
        matches = date.xpath('.//following-sibling::table[@class="table-stats"][1]/tbody/tr[@class="preview"]')
        date = date.xpath('text()').extract()[0].strip()

        for match in matches:
            item = fixture()
            item['date'] = date
            item['kickoff'] = match.xpath("td[3]/text()").extract()[0].strip()
            item['hometeam'] = match.xpath("td[2]/p/span/a/text()").extract()[0].strip()
            item['awayteam'] = match.xpath("td[2]/p/span/a/text()").extract()[1].strip()
            yield item
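
The pairing of each date header with the table that follows it can also be tried out without Scrapy. The sketch below uses a different technique from the following-sibling axis above: xml.etree.ElementTree does not support that axis, so it walks the children in document order and remembers the most recent header. The function name fixtures_by_date, the PAGE markup, and the way fixtures are grouped under "Saturday" and "Sunday" are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample page (an assumption): headers and tables are siblings.
PAGE = """
<div>
  <h2 class="table-header"> Saturday </h2>
  <table class="table-stats"><tbody>
    <tr class="preview"><td>1</td><td><p><span><a>Man City</a></span> v <span><a>West Brom</a></span></p></td><td>12:45</td></tr>
    <tr class="preview"><td>2</td><td><p><span><a>Aston Villa</a></span> v <span><a>Swansea</a></span></p></td><td>15:00</td></tr>
  </tbody></table>
  <h2 class="table-header"> Sunday </h2>
  <table class="table-stats"><tbody>
    <tr class="preview"><td>3</td><td><p><span><a>Newcastle</a></span> v <span><a>Arsenal</a></span></p></td><td>15:00</td></tr>
  </tbody></table>
</div>
"""


def fixtures_by_date(container):
    """Yield one dict per fixture, tagged with the closest preceding date header."""
    current_date = None
    for child in container:  # direct children, in document order
        if child.tag == "h2" and child.get("class") == "table-header":
            current_date = child.text.strip()
        elif child.tag == "table" and child.get("class") == "table-stats":
            for row in child.findall("tbody/tr[@class='preview']"):
                teams = [a.text.strip() for a in row.findall("td[2]/p/span/a")]
                yield {
                    "date": current_date,
                    "kickoff": row.find("td[3]").text.strip(),
                    "hometeam": teams[0],
                    "awayteam": teams[1],
                }


fixtures = list(fixtures_by_date(ET.fromstring(PAGE)))
for fixture_row in fixtures:
    print(fixture_row)
```

One behavioural difference is worth noting: the XPath version takes only the first table after each header, whereas this single pass would attribute every table up to the next header to the same date.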
