Contents:
.. toctree::
   :maxdepth: 2
In this tutorial, we'll assume that Grapy is already installed on your system. If that's not the case, see :ref:`intro-install`.
We are going to use the Open Directory Project (dmoz) as our example domain to scrape.
This tutorial will walk you through these tasks:
- Creating a new Grapy project
- Defining the Items you will extract
- Writing a :ref:`spider <topics-spiders>` to crawl a site and extract :ref:`Items <topics-items>`
- Writing an :ref:`Item Pipeline <topics-item-pipeline>` to store the extracted Items
Grapy is written in Python. If you're new to the language you might want to start by getting an idea of what the language is like, to get the most out of Grapy. If you're already familiar with other languages and want to learn Python quickly, we recommend Learn Python The Hard Way. If you're new to programming and want to start with Python, take a look at this list of Python resources for non-programmers.
Before you start crawling, you will have to set up a new Grapy project. Enter a directory where you'd like to store your code and then run:
mkdir tutorial
mkdir tutorial/spiders
touch tutorial/__init__.py
touch tutorial/items.py
touch tutorial/pipelines.py
touch tutorial/middlewares.py
touch tutorial/spiders/__init__.py
touch config.py
touch main.py
These are basically:
config.py
: the project configuration file

tutorial/
: the project's python module, you'll later import your code from here.

tutorial/items.py
: the project's items file.

tutorial/pipelines.py
: the project's pipelines file.

tutorial/middlewares.py
: the project's middlewares file.

tutorial/spiders/
: a directory where you'll later put your spiders.
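Laid out as a tree, the files created by the commands above look like this (nothing Grapy-specific is assumed here; it is simply the result of those commands):

.
├── config.py
├── main.py
└── tutorial/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    └── spiders/
        └── __init__.py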
Items are containers that will be loaded with the crawled data; they work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.
They are declared by subclassing :class:`grapy.core.Item` and listing its attributes in :attr:`grapy.core.Item._fields`, much like you would in an ORM (don't worry if you're not familiar with ORMs, you will see that this is an easy task).
We begin by modeling the item that we will use to hold the data obtained from dmoz.org. As we want to capture the title, link and description of the sites, we define fields for each of these three attributes. To do that, we edit items.py, found in the tutorial directory. Our Item class looks like this:
from grapy.core import Item


class DmozItem(Item):
    _fields = [
        {'name': 'title', 'type': 'str'},
        {'name': 'link', 'type': 'str'},
        {'name': 'desc', 'type': 'str'}
    ]
This may seem complicated at first, but defining the item allows you to use other handy components of Grapy that need to know what your item looks like.
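Although we'll do this properly from a spider later, you can already exercise the item from a Python shell to see the dict-like behaviour described above. The session below only repeats usage shown later in this tutorial; how Grapy reports an undeclared field is left out, since that detail depends on the library version:

>>> from tutorial.items import DmozItem
>>> item = DmozItem()
>>> item['title'] = 'Example title'   # 'title' is declared in _fields
>>> item['title']
'Example title'
>>> # Assigning a key that is not declared in _fields (for example a typo
>>> # such as item['tittle']) is exactly the case the field protection is
>>> # meant to catch.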
Spiders are user-written classes used to crawl information from a domain (or group of domains).
They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract :ref:`items <topics-items>`.
To create a Spider, you must subclass :class:`grapy.BaseSpider` and define the three main, mandatory attributes:

- :attr:`~grapy.BaseSpider.name`: identifies the Spider. It must be unique, that is, you can't set the same name for different Spiders.

- :meth:`~grapy.BaseSpider.start_request`: a method of the spider which is called when the spider starts, and which dispatches the initial :class:`grapy.Request` objects to crawl.

- :meth:`~grapy.BaseSpider.parse`: a method of the spider which is called with the downloaded :class:`grapy.Response` object of each start URL. The response is passed to the method as the first and only argument. This method is responsible for parsing the response data and returning the crawled data (as :class:`grapy.Item` objects) and more URLs to follow (as :class:`grapy.Request` objects).
This is the code for our first Spider; save it in a file named
dmoz_spider.py
under the tutorial/spiders
directory:
from grapy import BaseSpider, Request


class DmozSpider(BaseSpider):
    name = "dmoz"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    async def start_request(self, next):
        for url in self.start_urls:
            await next(Request(url))

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as fp:
            fp.write(response.content)
To put our spider to work, go to the project's top level directory and edit main.py:
import asyncio

from grapy import engine
from grapy.sched import Scheduler

from tutorial.spiders.dmoz_spider import DmozSpider

sched = Scheduler()
engine.set_sched(sched)
engine.set_spiders([DmozSpider()])

# engine.start() is a coroutine, so it has to be driven by an event loop;
# a bare `await` is not valid at the top level of a plain script.
loop = asyncio.get_event_loop()
loop.run_until_complete(engine.start())
then run:
python3 main.py
But more interestingly, as our parse method instructs, two files have been created: Books and Resources, with the content of both URLs.
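The filenames come straight from the parse method's response.url.split("/")[-2] expression; splitting one of the start URLs shows why:

>>> url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
>>> url.split("/")[-2]
'Books'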
Grapy processes every :class:`grapy.Request` that the Spider dispatches in start_request, and assigns the spider's parse method to them as their callback function.
These Requests are scheduled, then executed, and :class:`grapy.Response` objects are returned and then fed back to the spider, through the :meth:`~grapy.BaseSpider.parse` method.
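Because parse can also return more URLs to follow (as described earlier), a spider can keep the crawl going from inside parse. The snippet below is only a minimal sketch of that idea; the ChainSpider name and the hard-coded follow-up URL are made up for illustration and are not part of the tutorial project:

from grapy import BaseSpider, Request


class ChainSpider(BaseSpider):
    name = "chain"

    async def start_request(self, next):
        await next(Request("http://www.dmoz.org/Computers/Programming/Languages/Python/"))

    def parse(self, response):
        # Any Request objects returned here are fed back to the scheduler,
        # downloaded, and their responses passed to this same parse method.
        return [Request("http://www.dmoz.org/Computers/Programming/Languages/Python/Books/")]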
There are several ways to extract data from web pages. Grapy uses :attr:`~grapy.Response.soup` and :meth:`~grapy.Response.select`, which are based on BeautifulSoup.
Let's add this code to our spider:
from grapy import BaseSpider, Request


class DmozSpider(BaseSpider):
    name = "dmoz"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    async def start_request(self, next):
        for url in self.start_urls:
            await next(Request(url))

    def parse(self, response):
        for site in response.select('ul li'):
            elem = site.find('a')
            if elem:
                title = elem.get_text()
                link = elem.get('href')
                desc = site.get_text()
                print(title, link, desc)
Now try crawling the dmoz.org domain again and you'll see sites being printed in your output. Run:
python3 main.py
:class:`~grapy.core.Item` objects are custom Python dicts; you can access the values of their fields (attributes of the class we defined earlier) using the standard dict syntax, like:
>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
>>> item.title
'Example title'
Spiders are expected to return their crawled data inside :class:`~grapy.core.Item` objects. So, in order to return the data we've crawled so far, the final code for our Spider would be like this:
from grapy import BaseSpider, Request

from tutorial.items import DmozItem


class DmozSpider(BaseSpider):
    name = "dmoz"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    async def start_request(self, next):
        for url in self.start_urls:
            await next(Request(url))

    def parse(self, response):
        items = []
        for site in response.select('ul li'):
            elem = site.find('a')
            if elem:
                item = DmozItem()
                item['title'] = elem.get_text()
                item['link'] = elem.get('href')
                item['desc'] = site.get_text()
                items.append(item)
        return items
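The project layout also includes tutorial/pipelines.py for the :ref:`Item Pipeline <topics-item-pipeline>` mentioned at the start of this tutorial. Grapy's pipeline API is not covered here, so the class below is only a hypothetical sketch of what such a pipeline could do (append each DmozItem to a JSON-lines file); the class name, the process method and the way it would be registered with the engine are assumptions, not documented Grapy interfaces:

import json


class JsonWriterPipeline(object):
    """Hypothetical pipeline sketch: persist each crawled item as one JSON line."""

    def __init__(self, filename='items.jl'):
        self.filename = filename

    def process(self, item):
        # The item behaves like a dict, so its declared fields can be read directly.
        record = {'title': item['title'], 'link': item['link'], 'desc': item['desc']}
        with open(self.filename, 'a') as fp:
            fp.write(json.dumps(record) + '\n')
        return item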
This tutorial covers only the basics of Grapy, but there are a lot of other features not mentioned here.
The installation steps assume that you have the following things installed:
- Python 3.3 or later (the async def / await syntax used in this tutorial requires Python 3.5+)
- asyncio Python 3 async library
- aiohttp http client/server for asyncio
- BeautifulSoup HTML/XML parsing library
- pip or easy_install Python package managers
To install from source:

git clone https://github.com/Lupino/grapy.git
cd grapy
python3 setup.py install