Data extraction with Scrapy-II

In this post, I will discuss the subtle features of the Scrapy framework responsible for data extraction, including building a basic spider. But first, a few key points to remember:

Preliminaries

  • From the documentation, “Scrapy spiders can return the extracted data as Python dicts”; therefore, to populate the key-value pairs, the best option is to use a for loop or an item pipeline. See this SO answer by eLruLL for the same.
  • You can create as many custom parse functions as you want, but make sure that the default parse function is present in the spider class; otherwise, Scrapy will throw a “NotImplementedError”. See this SO answer by masnun.
  • As underlined in the previous post, make sure that you check the XPath in the Scrapy shell first, before executing the spider. You cannot always rely on the browser inspector or XPath plugins. Sometimes, following the tag hierarchy can also yield the data you are looking for, so keep your options open.
  • In XPath, ‘/’ selects only the direct children of a node (or the document root at the start of a path), while ‘//’ selects matching nodes at any depth below.
  • Use the strip() method to remove the whitespace surrounding the extracted text; note that it trims whitespace, not HTML tags. Both points are illustrated in the sketch after this list.
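Here is a minimal, self-contained sketch of those last two points, using the parsel library that powers Scrapy’s selectors (the sample HTML is made up for illustration):

 from parsel import Selector  # parsel is the selector library underneath Scrapy

 sel = Selector(text="<div><p> outer <span>inner</span></p></div>")

 # '/' walks direct children only; <span> is not a direct child of <div>,
 # so this absolute path matches nothing:
 print(sel.xpath("/html/body/div/span/text()").extract_first())  # None

 # '//' matches nodes at any depth:
 print(sel.xpath("//span/text()").extract_first())  # 'inner'

 # strip() trims the surrounding whitespace from the extracted string:
 print(sel.xpath("//p/text()").extract_first().strip())  # 'outer'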

In Scrapy, the items that you need to extract have to be defined in the Item class. For example, if I require the description and price, then I will define them in the Item class as

descrptn = Field()
price = Field()
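In full, a minimal items.py for the spider built below might look like this sketch (the class name EbayDealItem is my assumption; scrapy startproject generates a similar skeleton, and the four fields match the keys used later in the parse loop):

 # items.py: fields match the keys the spider below will fill in
 from scrapy import Item, Field

 class EbayDealItem(Item):
     descr = Field()         # item description
     price = Field()         # item price
     saletag = Field()       # the "Sold out" / "Almost gone" label
     saletagalert = Field()  # the alert variant of the sale tag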

Let’s create a spider which will extract the latest hot deals from eBay that are labelled either “Sold out” or “Almost gone”.

Crafting a Basic Spider

In the terminal, type

 scrapy startproject ebaySpiderTutorial

This will create a directory structure with the following contents:

 ebaySpiderTutorial/      # project's Python module; you'll import your code from here
     __init__.py
     items.py             # project items definition file
     pipelines.py         # project pipelines file
     settings.py          # project settings file
     spiders/             # a directory where you'll later put your spiders
         __init__.py

Next, I open a Scrapy shell on the website I want the spider to crawl. Although this step is not necessary, it is good practice to confirm that you have the correct XPath for each item you want to scrape. In the terminal, I type

 scrapy shell http://deals.ebay.com.my/

and then I test the XPath for the item description,

 descr = response.xpath("//div/div/div/a/span/text()").extract()

for the item price,

 price = response.xpath("//div/div/div//span[@class='price']/text()").extract()

for the item sale tag,

 saletag = response.xpath("//span[@class='saleTag']/text()").extract()

and for the item sale alert tag,

 saletagalert = response.xpath("//span[@class='saleTag alert']/text()").extract()
Remember from the preliminaries section that “Scrapy spiders can return the extracted data as Python dicts”; therefore, to populate the key-value pairs, the best option is to use a for loop or an item pipeline. This is shown in the following code:

 
 for d, p, st, stp in zip(descr, price, saletag, saletagalert):
     item = EbayDealItem()  # the Item subclass sketched in items.py above
     item['descr'] = d.strip()
     item['price'] = p.strip()
     item['saletag'] = st.strip()
     item['saletagalert'] = stp.strip()
     yield item
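Putting everything together, a minimal spider might look like the following sketch. The XPaths are the ones tested above; the module path, spider class name, and item class name are my assumptions:

 # spiders/ebay_spider.py: a sketch assembled from the snippets above
 import scrapy
 from ebaySpiderTutorial.items import EbayDealItem  # item class name assumed

 class EbaySpider(scrapy.Spider):
     name = 'ebay_spider'
     start_urls = ['http://deals.ebay.com.my/']

     def parse(self, response):
         descr = response.xpath("//div/div/div/a/span/text()").extract()
         price = response.xpath("//div/div/div//span[@class='price']/text()").extract()
         saletag = response.xpath("//span[@class='saleTag']/text()").extract()
         saletagalert = response.xpath("//span[@class='saleTag alert']/text()").extract()

         # zip the parallel lists so each yielded item holds one deal's fields
         for d, p, st, stp in zip(descr, price, saletag, saletagalert):
             item = EbayDealItem()
             item['descr'] = d.strip()
             item['price'] = p.strip()
             item['saletag'] = st.strip()
             item['saletagalert'] = stp.strip()
             yield item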

Now, execute the command

 scrapy crawl ebay_spider -o datadump.csv -t csv

Your scraped *.csv file should look similar to the one shown in Figure 1.

Figure 1. Scraped results.

Lessons Learnt

  • Scrapy spiders return extracted data as Python dictionaries, and the best way to extract the values from them is a for loop (or an item pipeline).
  • XPaths should always be tested in the Scrapy shell before crawling.
  • In XPath, appending ‘/text()’ extracts the text content of the selected node.

Code Improvements/Future Work

  • The current spider is limited to extracting only the items that carry the saleTag and saleTag alert fields; items lacking either field are not extracted. The solution is to use the XPath union (OR) operator ‘|’ (see the sketch after this list). I will leave this as future work.
  • I also want to work on data pipe-lining which would be the discussion topic for the next post.
  • Create and execute multiple spiders from the same crawler
  • Design the ability for the spider to be activated by a specific event trigger
  • Include regular expression (re) functionality for scraped data cleaning
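For the first point above, a single query with the union operator would return both tag variants in document order. This is an untested sketch against the live page:

 # combine both selectors with the XPath union operator '|'
 saletags = response.xpath(
     "//span[@class='saleTag']/text() | //span[@class='saleTag alert']/text()"
 ).extract()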

Code on GitHub
