“Scrapy is an open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”
This article will walk you through installing Scrapy on a Windows operating system.
First, ensure the following dependencies exist on your machine:
Step 1: Install Python version 2.7, as Scrapy only works with Python 2.7. To check the Python version, issue the command python --version
- You also need to add C:\Python27 and C:\Python27\Scripts to your Path environment variable.
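If you prefer to double-check step 1 from Python itself, here is a minimal sketch. It assumes the default install folder C:\Python27 mentioned above; the helper name missing_path_entries is my own and not part of any library.

```python
import os
import sys


def missing_path_entries(required, path_value, sep=os.pathsep):
    """Return the entries of `required` that are not listed in `path_value`."""
    # Windows paths are case-insensitive, so compare lower-cased values
    # and ignore trailing slashes or backslashes.
    present = {
        entry.strip().rstrip("\\/").lower()
        for entry in path_value.split(sep)
        if entry.strip()
    }
    return [r for r in required if r.rstrip("\\/").lower() not in present]


if __name__ == "__main__":
    # Print the version of the interpreter that is running this script.
    print("Python version: %d.%d.%d" % sys.version_info[:3])
    # Assumed default install locations; adjust if Python lives elsewhere.
    required = [r"C:\Python27", r"C:\Python27\Scripts"]
    missing = missing_path_entries(required, os.environ.get("PATH", ""))
    print("Missing from Path: %s" % (missing or "nothing"))
```

Run it with the same python command you plan to use for Scrapy, so that it reports on the right interpreter.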
Step 2: Install the Microsoft Visual C++ Compiler for Python 2.7 from here.
Step 3: Ensure that pip is installed. Pip comes preinstalled on Python version 2.7.9 and above. See this nice SO post on the same.
Step 4: Download pyOpenSSL from here into your C:\Python27 folder.
Right-click the .whl file and select Save link as to download the file to your Python folder.
To install pyOpenSSL, open cmd. Then change the directory using cd C:\Python27
Then use the following command: python -m pip install pyOpenSSL-16.0.0-py2.py3-none-any.whl
Once it is installed, you will see the following message at the end:
Successfully installed pyOpenSSL-16.0.0
Step 5: Install lxml. Scrapy depends on lxml, so this step cannot be skipped. Download lxml from here, choosing the latest version that supports Python 2.7, and save the file to your Python 2.7 folder, which is C:\Python27. To install lxml, open cmd and change to the Python folder as we did in the previous step. Then use the following command:
python -m pip install lxml-3.6.0-cp27-cp27m-win32.whl
Step 6: Install the relevant version of pywin32 for your computer’s OS from here, or open the command prompt and execute the command pip install pywin32. There is a good answer on SO covering this; see the answer by the user ‘Kanguros’.
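Before moving on to Scrapy itself, it can help to confirm that the dependencies from the steps above are importable. The snippet below is a rough sketch rather than an official check; check_modules is a hypothetical helper name, and the module list is my assumption about the import names (OpenSSL for pyOpenSSL, win32api for pywin32).

```python
import importlib


def check_modules(names):
    """Map each module name to True if it can be imported, else False."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


if __name__ == "__main__":
    # Assumed import names: OpenSSL comes from pyOpenSSL, win32api from pywin32.
    modules = ["pip", "OpenSSL", "lxml", "win32api"]
    for name, ok in sorted(check_modules(modules).items()):
        print("%-10s %s" % (name, "OK" if ok else "MISSING"))
```

Any module reported as MISSING points back to the step that installs it.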
2. Scrapy installation on a Windows OS
Open the command prompt and issue the command
pip install Scrapy
Once installed, you can check the version by issuing the command scrapy version. If all went well, you should see something like the output shown in fig 1.
Fig 1. Scrapy installation on Windows OS
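The same check can be done from Python, which is handy later when you want to confirm that your IDE is using the interpreter Scrapy was installed into. This is a small sketch; scrapy_version is a helper name I made up, and it simply reads the package's __version__ attribute.

```python
def scrapy_version():
    """Return the installed Scrapy version string, or None if it is not importable."""
    try:
        import scrapy
    except ImportError:
        return None
    return scrapy.__version__


if __name__ == "__main__":
    version = scrapy_version()
    if version is None:
        print("Scrapy is not installed for this interpreter")
    else:
        print("Scrapy %s" % version)
```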
One drawback of Scrapy for me was that I had to issue the scraping commands from the command prompt. I will now show you how to configure Scrapy to run from an IDE. I use PyCharm as my IDE for programming.
3. Scrapy configuration with PyCharm IDE
Create a Run/Debug configuration in PyCharm as illustrated here in this SO answer by user “Pullie”.
Once you have set up the Run/Debug configuration in PyCharm, there is no longer any need to go to the command prompt and issue the crawl command. This can be done from the IDE. See fig 2 below for a successful spider crawl.
Fig 2. Scrapy configuration and execution in Pycharm
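As an alternative to the configuration from the SO answer, you could also add a small runner script to your project and point PyCharm at it. The sketch below is my own variation, not the method from that answer: build_crawl_argv and run_spider are hypothetical helper names, and it assumes the script's working directory is the project folder that contains scrapy.cfg. It relies on scrapy.cmdline.execute, which is the function the scrapy command itself calls.

```python
import sys


def build_crawl_argv(spider_name, *extra_args):
    """Build the argument list equivalent to typing: scrapy crawl <spider_name> ..."""
    return ["scrapy", "crawl", spider_name] + list(extra_args)


def run_spider(spider_name, *extra_args):
    """Start the crawl in the current process, as the scrapy command would."""
    # Imported here so build_crawl_argv can be used without Scrapy installed.
    from scrapy import cmdline
    cmdline.execute(build_crawl_argv(spider_name, *extra_args))


if __name__ == "__main__" and len(sys.argv) > 1:
    # The spider name (and any extra options) come from the script parameters,
    # e.g. the Parameters field of the PyCharm Run/Debug configuration.
    run_spider(*sys.argv[1:])
```

With this in place, the Run/Debug configuration only needs the runner script as its script path and the spider name as its parameter.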
In the next post, I will focus on building a data pipeline with Scrapy. If you have any comments or suggestions, please let me know.