In this post, I will discuss some subtle features of the Scrapy framework that are responsible for data extraction, and build a basic spider. But first, a few key points to remember:
- From the documentation, “Scrapy spiders can return the extracted data as Python dicts”. To separate the key-value pairs, a convenient option is a for loop or an item pipeline. See this SO answer by eLruLL for the same.
- You can create as many custom parse functions as you want, but make sure the default parse function exists in the spider class; otherwise Scrapy will throw a “NotImplementedError” for any request that has no explicit callback. See this SO answer by masnun.
- As underlined in the previous post, make sure you check the XPath in the Scrapy shell before executing the spider. You cannot always rely on the browser inspector or XPath plugins. Sometimes, following the tag hierarchy can also yield the data you are looking for, so keep your options open.
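- In XPath, ‘/’ selects only the direct children of the current node, while ‘//’ selects matching nodes at any depth below it (children, grandchildren, and so on).
- Use the strip() method to remove the leading and trailing whitespace (spaces, tabs, newlines) around the extracted text. It does not remove HTML tags; selecting ‘/text()’ in the XPath is what leaves the tags behind.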
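To make the ‘/’ versus ‘//’ distinction concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports a limited XPath subset. The markup below is invented for illustration, and Scrapy's own selectors accept full XPath, so treat this only as an analogy for how a single step and a descendant search behave.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this example (not taken from any real page).
SNIPPET = """
<div>
  <span>direct child</span>
  <p><span>nested grandchild</span></p>
</div>
"""

root = ET.fromstring(SNIPPET)

# 'span' is a single step (like '/'): only direct children of <div> match.
direct = [s.text for s in root.findall("span")]

# './/span' is a descendant search (like '//'): <span> at any depth matches.
anywhere = [s.text for s in root.findall(".//span")]

print(direct)    # ['direct child']
print(anywhere)  # ['direct child', 'nested grandchild']
```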
In Scrapy, the items you want to extract need to be defined in an Item class. For example, if I need the description and price, I define them in the Item class as
descrptn = Field()
price = Field()
Let’s create a spider that extracts the latest hot deals from eBay that are labelled either “Sold out” or “Almost gone”.
Crafting a Basic Spider
In the terminal, type
scrapy startproject ebaySpiderTutorial
This will create a directory structure with the following contents:
ebaySpiderTutorial/         # project's Python module, you'll import your code from here
    __init__.py
    items.py                # project items definition file
    pipelines.py            # project pipelines file
    settings.py             # project settings file
    spiders/                # a directory where you'll later put your spiders
        __init__.py
Next, I open a Scrapy shell session on the website I want the spider to crawl. This step is not strictly necessary, but it is good practice to confirm that you have the correct XPath for each item you want to scrape. In the terminal, I type
scrapy shell http://deals.ebay.com.my/
and then I test the XPath for the item description as
descr = response.xpath("//div/div/div/a/span/text()").extract()
the item price as
price = response.xpath("//div/div/div//span[@class='price']/text()").extract()
the item sale tag as
saletag = response.xpath("//span[@class='saleTag']/text()").extract()
and the item sale alert tag as
saletagalert = response.xpath("//span[@class='saleTag alert']/text()").extract()
Remember from the key points at the start that “Scrapy spiders can return the extracted data as Python dicts”, so a for loop or an item pipeline is a convenient way to separate the key-value pairs. This is shown in the following code:
for d, p, st, stp in zip(descr, price, saletag, saletagalert):
    item['descr'] = d.strip()
    item['price'] = p.strip()
    item['saletag'] = st.strip()
    item['saletagalert'] = stp.strip()
    yield item
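The loop above can be tried outside Scrapy with plain lists. The sketch below uses made-up sample values (not real eBay data) and a hypothetical helper name, build_items, and it creates a fresh dict on every iteration so that each yielded row is independent of the others. It also shows a caveat worth knowing: zip stops at the shortest list, so if one XPath matches fewer elements than the others, the extra rows are silently dropped.

```python
def build_items(descr, price, saletag, saletagalert):
    """Pair up the four extracted lists row by row (hypothetical helper)."""
    for d, p, st, stp in zip(descr, price, saletag, saletagalert):
        # A new dict per row; strip() trims the surrounding whitespace.
        yield {
            "descr": d.strip(),
            "price": p.strip(),
            "saletag": st.strip(),
            "saletagalert": stp.strip(),
        }

# Invented sample data standing in for the .extract() results.
descr = ["  Phone case \n", " USB cable ", " Desk lamp "]
price = [" RM 10.00 ", " RM 5.50 ", " RM 42.00 "]
saletag = [" 20% off ", " 35% off ", " 10% off "]
saletagalert = [" Almost gone ", " Sold out "]  # one entry short on purpose

items = list(build_items(descr, price, saletag, saletagalert))
print(len(items))         # 2, because zip stops at the shortest list
print(items[0]["descr"])  # Phone case
```

If the lists can differ in length on the real page, it is worth comparing their lengths in the Scrapy shell before trusting the paired rows.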
Now, execute the command
scrapy crawl ebay_spider -o datadump.csv -t csv
Your scraped data in the *.csv file should look similar to Figure 1.
Figure 1. Scraped results.
- Scrapy spiders can return extracted data as Python dictionaries, and a for loop is a simple way to pull the values out of them.
- XPath should always be tested in the Scrapy shell before crawling.
- In XPath, appending ‘/text()’ extracts the text content of the selected node.
Code Improvements/Future Work
- The current spider is limited to extracting only the items that carry the saleTag and saleTag alert fields. Items that lack these fields are not extracted. One possible solution is the XPath union operator ‘|’. I will leave this as future work.
- I also want to work on data pipelining, which will be the topic of the next post.
- Create and execute multiple spiders from the same crawler
- Design the ability for the spider to be activated by a specific event trigger
- Include regular expression (re) functionality for scraped data cleaning
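As a starting point for the last item, here is a small sketch of regex-based cleaning with Python's re module. The helper name clean_price and the sample strings are assumptions for illustration; the real price format on the page may differ, so the pattern would likely need adjusting.

```python
import re

# A digit, then any run of digits or commas, then optional decimals.
PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def clean_price(raw):
    """Return the first number found in `raw` as a float, or None if absent."""
    match = PRICE_RE.search(raw)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group().replace(",", ""))

# Invented sample strings, not real scraped values.
samples = ["  RM 1,299.00 \n", "RM25", "Sold out"]
cleaned = [clean_price(s) for s in samples]
print(cleaned)  # [1299.0, 25.0, None]
```

Returning None for text without a number keeps the cleaning step from raising on rows such as “Sold out”, which a later pipeline stage can then filter or flag.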