Home
Wombat is a simple Ruby DSL to crawl webpages on top of the cool Mechanize and Nokogiri gems. It aims to be a higher-level abstraction for when you don't want to dig into the specifics of fetching a page and parsing it into your own data structure, which can be a decent amount of work, depending on what you need.
With Wombat, you specify a class that includes the module Wombat::Crawler and tells Wombat which information to retrieve and where to find it. Basically, you can name the properties however you want and specify how to find each one in the page. For example:
class MyCrawler
  include Wombat::Crawler

  base_url "http://domain_to_be_crawled.com"
  list_page "/path-to/page-with/the-data"

  some_data "css=div.elemClass .anchor"
  another_info "xpath=//my/xpath[@style='selector']"
end
Let's see what is going on here. First, you create your class and include the Wombat::Crawler module. Simple. Then you start naming the properties you want to retrieve, in a free format. The restrictions are the same as for naming any Ruby method, because each property is in fact a Ruby method call :)
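For instance, any name that would be a valid Ruby method name works as a property. A quick illustrative sketch (class name, URL and selectors are all made up):

class ExampleCrawler
  include Wombat::Crawler

  base_url "http://example.com"   # made-up URL
  list_page "/articles/1"

  page_title   "css=h1"          # any valid Ruby method name works as a property
  author_name  "css=.byline a"
  published_at "css=time"
end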
The lines that say base_url and list_page are reserved words that tell Wombat where to get that information from. You should include only the scheme (http), host and port in the base_url field. In list_page, you include only the path portion of the URL to be crawled, always starting with a forward slash.
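For example, if the full URL to crawl were http://example.com:8080/products/list (a made-up address), the split would look like this:

base_url "http://example.com:8080"   # scheme, host and port only
list_page "/products/list"           # path only, always starting with a forward slash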
To run this crawler, just instantiate it and call the crawl method:
my_cool_crawler = MyCrawler.new
my_cool_crawler.crawl
This will request and parse the page according to the filters you specified and return a hash with that same structure. We'll get into that detail later. Another important detail is that selectors can be specified as CSS or XPath. As you probably already realized, CSS selectors should start with css= and XPath selectors with xpath=.
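For the MyCrawler class above, the returned hash is keyed by the property names; the values shown here are just illustrative placeholders:

results = my_cool_crawler.crawl
# => {
#   "some_data"    => "text of the first element matching the CSS selector",
#   "another_info" => "text of the first element matching the XPath selector"
# }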
By default, properties will return the text of the first matching element for the provided selector.
Again, very important: Each property will return only the first matching element for that selector.
If you want to retrieve all the matching elements, use the option :list (not implemented yet; please see the List properties section below).
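To make the first-match behavior concrete, here is a hypothetical HTML fragment and what the some_data property from the first example would return against it:

# given this fragment somewhere in the crawled page:
#   <div class="elemClass">
#     <a class="anchor">First link</a>
#     <a class="anchor">Second link</a>
#   </div>
#
# some_data "css=div.elemClass .anchor"
# would yield "First link" -- the second anchor is ignored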
You can also specify a namespace that will be used to query the document along with the selector. This can be useful to parse XML files. Yes, Wombat can also crawl XML files. Yay! The syntax is:
class LastFmCrawler
  include Wombat::Crawler

  base_url "http://ws.audioscrobbler.com"
  list_page "/2.0/?method=geo.getevents&location=San%20Francisco&api_key=<YOUR_LASTFM_API_KEY>"
  format :xml

  for_each 'xpath=//event' do
    latitude "xpath=./venue/location/geo:point/geo:lat", :text, { 'geo' => 'http://www.w3.org/2003/01/geo/wgs84_pos' }
    longitude "xpath=./venue/location/geo:point/geo:long", :text, { 'geo' => 'http://www.w3.org/2003/01/geo/wgs84_pos' }
  end
end
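Since latitude and longitude live inside a for_each block, crawling this class should return one array entry per matched event, much like the GitHub example further below; the coordinates here are invented for illustration:

LastFmCrawler.new.crawl
# => {
#   "latitude"  => ["37.807775", "37.7689269", ...],
#   "longitude" => ["-122.272736", "-122.4256464", ...]
# }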
Note that we used the format option above. This is another special property that tells Wombat the type of document it is supposed to parse. It defaults to :html, so usually you won't need to specify it. If you want to parse XML, you can say so with format :xml. The only two formats supported so far are HTML and XML.
If you are going to specify a namespace, you also have to specify the type of property you are requesting (:text, :html or :list) as the second argument, before the namespace. The namespace must be a hash whose keys are the namespace names and whose values are the namespace URLs.
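For example, to ask for an element's markup rather than its plain text, you would pass :html as the second argument. The property name and selector below are made up, and the comment reflects our reading of the description above:

page_body "css=div.content", :html   # :html should return markup; :text (the default) returns plain text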
If you need to iterate over a list of nodes that match the same selector, you can use the for_each construct. It takes a selector and a block, and yields as many times as there are elements matching that selector in the page. Basically, this construct narrows the context down to the elements returned by the selector given to for_each and crawls each one of them. This one is a bit complicated, so a good example helps:
class GithubCrawler
  include Wombat::Crawler

  base_url "http://www.github.com"
  list_page "/explore"

  for_each "css=ol.ranked-repositories li" do
    repo 'css=h3'
    description 'css=p.description'
  end
end
puts GithubCrawler.new.crawl

# outputs =>
{
  "repo" => ["jairajs89 / Touchy.js", "mcavage / node-restify", "notlion / streetview-stereographic", "twitter / bootstrap", "stolksdorf / Parallaxjs"],
  "description" => ["A simple light-weight JavaScript library for dealing with touch events", "node.js REST framework specifically meant for web service APIs", "Shader Toy + Google Map + Panoramic Explorer", "HTML, CSS, and JS toolkit from Twitter", "a Library for Javascript that allows easy page parallaxing"]
}
Remember that, by default, properties will return only the first element that matches the given selector. So, for the example above, even if there are several h3 elements inside each li, only the first matching one will be returned. If you want to retrieve all the matching elements, use the option :list instead of the default :text (not implemented yet).