Data extraction with Scrapy - I

Disclaimer: The objective of this post is purely educational. No monetary benefits are associated with it.

Introduction

In this digital age, we are surrounded by data, and a majority of it is in an unstructured format. The Oxford dictionary defines unstructured as "without formal organization or structure". Websites are a rich source of this unstructured data, which can be mined and turned into useful insights. There are several good open-source web scraping frameworks, including Scrapy, Nutch and Heritrix. A good review of open-source crawlers can be found here, and see this related Quora answer.

Using Scrapy

I hope you have a working Scrapy installation on your machine; if not, you can go through my previous post. To derive the maximum learning pleasure out of Scrapy, you should know at least the basics of XPath and CSS. Refer to the tutorials by w3schools if you want to brush up your skills on either. The data present in websites is often buried under a lot of HTML markup, so you should know how to get the relevant data out.

Besides, there are easier options to aid you with determining the relevant XPath. In the Google Chrome browser, add the SelectorGadget and XPath Helper plugins. On Firefox, you can install the Firebug and FireFinder plugins.

We will use the SelectorGadget Chrome plugin for this guide, so install it from here.

1. Understanding Selectors

In the simplest terms, a selector is a formula for the item that needs to be extracted from a webpage. From the Scrapy documentation on selectors: "Scrapy comes with its own mechanism for extracting data. They're called selectors because they "select" certain parts of the HTML document specified either by XPath or CSS expressions. XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements."
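To make this concrete, here is a minimal sketch of both kinds of selectors in action, run against a made-up HTML fragment (the markup and values are invented for illustration):

from scrapy.selector import Selector

# A made-up HTML fragment standing in for a real page
html = '<div class="price"><span>RM 9.90</span></div>'
sel = Selector(text=html)

# An XPath expression and a CSS expression selecting the same text node
sel.xpath('//div[@class="price"]/span/text()').extract()  # [u'RM 9.90']
sel.css('div.price span::text').extract()                 # [u'RM 9.90']

Both expressions land on the same data; which one you use is largely a matter of taste and of how tangled the page markup is.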

2. Web data scraping walkthrough - getting the selectors

For this guide, we will scrape the 'daily deals' data from eBay Malaysia. We will start from the deals page, follow the links to each product, and scrape some data.

After opening the daily deals page, click on the SelectorGadget icon to activate the plugin, then click on any product name, as shown in Figure 1.

Figure 1. Using the SelectorGadget in Chrome to get the XPath code

You will notice that SelectorGadget has selected all the links on the page and highlighted them in yellow. Now, click on the XPath button shown in Figure 2.

Figure 2. XPath button

A dialog box will pop up (see Figure 3) containing the relevant CSS selector code.

Figure 3. CSS selector code

Copy the CSS selector code, then click on the XPath Helper plugin in Chrome and paste it into the textbox as shown in Figure 4. You should get 626 results, and they include everything: item header, item name, price, etc.

Figure 4. XPath results based on the CSS selector code

Now, you don't want all these results. Let's say you are only interested in the item description, the item cost and the item shipping type. In that case, you will have to play around with SelectorGadget to pick up the relevant CSS selector code. For example, the XPath code for the item description is

//*[contains(concat( " ", @class, " " ), concat( " ", "description", " " ))]//span

for the item cost

//*[contains(concat( " ", @class, " " ), concat( " ", "price", " " ))]

and for the item shipping type

//*[contains(concat( " ", @class, " " ), concat( " ", "free-postage", " " ))]

When you verify all of the XPath codes stated above in the XPath Helper plugin in Chrome, you should see each one give you 126-129 results. See Figure 5.

Figure 5. XPath results for required selectors
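If you want to sanity-check these expressions before pointing Scrapy at the live page, you can also run them against a small hand-written fragment with Scrapy's Selector. A minimal sketch, with markup invented to mimic a single deal entry:

from scrapy.selector import Selector

# Invented markup mimicking one deal entry (illustration only)
html = '''
<div class="item">
  <div class="description"><span>USB cable, 1 m</span></div>
  <div class="price">RM 9.90</div>
  <div class="free-postage">Free postage</div>
</div>
'''
sel = Selector(text=html)

# The contains(concat(...)) idiom matches a whole class token even when an
# element carries several classes; this is why SelectorGadget emits it
sel.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "description", " " ))]//span').extract()
sel.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "price", " " ))]').extract()
sel.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "free-postage", " " ))]').extract()

Each call returns exactly one match here; on the real deals page, each returned 126-129 results, as we saw above.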

So far, we have found the relevant XPath selector codes and tested them in the XPath Helper plugin, but this is not enough, because we have to ensure that the XPath codes work with a Scrapy object too. Now, wait a minute, you said "Scrapy object"! What the hell is that? Fret not. A Scrapy selector object is created when you launch the Scrapy shell with a webpage URL. This object contains the whole of the webpage's code and data. All you have to do is use the Scrapy shell to test your formula, i.e. the XPath code, and check whether you are getting the required results. If yes, "Bull's eye"; if not, go back and see what you missed.

See the documentation here on the Scrapy shell. As per the documentation, to launch the Scrapy shell, type the following at the command prompt (see Figure 6):

scrapy shell http://deals.ebay.com.my

Figure 6. Scrapy shell object
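Once the page has loaded, the shell drops you into a Python prompt with a few pre-built objects; the two we care about here are response (the downloaded page) and sel (a selector built on top of it). A quick sketch of poking around:

# Inside the Scrapy shell:
response.url                           # the URL that was fetched
sel.xpath('//title/text()').extract()  # the page title, as a quick smoke test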

The Scrapy shell selector object is by default named sel. So now, we test out the XPath for the item description as

 sel.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "description", " " ))]').extract()

Note: In the above code, ensure that you enclose the XPath in single quotes so as to escape the double quotes inside it. Also, use the extract() method to pull out the matched data as strings; without it, you get back a list of selector objects rather than the data itself. See Figure 7.

Figure 7. Testing XPath in the Scrapy shell
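To make the difference concrete, here is roughly what the shell gives you with and without extract(), using invented values for illustration:

# Without extract(): a list of Selector objects, not data
sel.xpath('//*[@class="price"]')
# [<Selector xpath='//*[@class="price"]' data=u'<div class="price">RM 9.90</div>'>]

# With extract(): the serialized HTML of each matched node
sel.xpath('//*[@class="price"]').extract()
# [u'<div class="price">RM 9.90</div>']

# Append /text() to strip the tags and keep just the enclosed text
sel.xpath('//*[@class="price"]/text()').extract()
# [u'RM 9.90']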

That will be all for now. In the next post, I will discuss writing the actual Scrapy code for data extraction.

Key takeaway points

  • A fair knowledge of XPath and CSS can help you a lot.
  • A decent idea of HTML tags is good to have.
  • Please read a website's policies on scraping before you scrape it.
  • Finally, trust in yourself, for nothing is impossible.

Thanks for reading. If you have any comments, please feel free to post them.

Cheers.
