felipecsl edited this page Feb 15, 2012 · 41 revisions

Wombat

Getting Started

Wombat is a simple Ruby DSL for crawling web pages, built on top of the cool Mechanize and Nokogiri gems. It aims to be a higher-level abstraction for when you don't want to dig into the specifics of fetching a page and parsing it into your own data structure, which can be a decent amount of work depending on what you need.

With Wombat, you specify a class that includes the Wombat::Crawler module and declares which information to retrieve and where to find it. You can name the properties however you want and specify how each one should be located in the page. For example:

class MyCrawler
  include Wombat::Crawler

  base_url "http://domain_to_be_crawled.com"
  list_page "/path-to/page-with/the-data"
  
  some_data "css=div.elemClass .anchor"
  another_info "xpath=//my/xpath[@style='selector']"
end

Let's see what is going on here. First, you create your class and include the Wombat::Crawler module. Simple. Then you start naming the properties you want to retrieve, in free format. The naming restrictions are the same as for any Ruby method, because each property is in fact a Ruby method call :)

The base_url and list_page lines are reserved words that say where the information comes from. base_url should include only the scheme (http), host and port. list_page takes only the path portion of the URL to be crawled, always starting with a forward slash.
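To make the split concrete, here is a small sketch (using a hypothetical URL, not one from the examples) of how a full address divides into the two fields:

```ruby
require "uri"

# Splitting a full URL into the two parts Wombat expects.
# "example.com" here is a made-up host for illustration only.
url = URI.parse("http://example.com:8080/products/list")

base_url  = "#{url.scheme}://#{url.host}:#{url.port}"  # scheme, host and port
list_page = url.path                                   # path only, starts with "/"

puts base_url   # => "http://example.com:8080"
puts list_page  # => "/products/list"
```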

To run this crawler, just instantiate it and call the crawl method:

my_cool_crawler = MyCrawler.new
my_cool_crawler.crawl

This will request and parse the page according to the filters you specified and return a hash with that same structure; we'll get into the details later. Another important point is that selectors can be specified as CSS or XPath: as you have probably realized, CSS selectors start with css= and XPath selectors with xpath=. By default, a property returns the text of the first element matching the provided selector. Again, very important: each property returns only the first matching element for that selector.
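For illustration, a crawl with the MyCrawler class above would return a hash shaped like this (the values here are made up; the real ones depend on the page content):

```ruby
# Hypothetical result of MyCrawler.new.crawl -- the keys mirror the
# property names declared in the class, and each value is the text of
# the first element matching that property's selector:
result = {
  "some_data"    => "text of the first div.elemClass .anchor element",
  "another_info" => "text of the first node matching the XPath selector"
}

puts result.keys.inspect  # => ["some_data", "another_info"]
```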

If you want to retrieve all the matching elements, use the option :list (not implemented yet, please see the List properties section below).

Namespaces

You can also specify a namespace that will be used to query the document along with the selector. This can be useful for parsing XML files. Yes, Wombat can also crawl XML files. Yay! The syntax is:

class LastFmCrawler
  include Wombat::Crawler
  base_url "http://ws.audioscrobbler.com"
  list_page "/2.0/?method=geo.getevents&location=San%20Francisco&api_key=<YOUR_LASTFM_API_KEY>"
  format :xml

  for_each 'xpath=//event' do
    latitude "xpath=./venue/location/geo:point/geo:lat", :text, { 'geo' => 'http://www.w3.org/2003/01/geo/wgs84_pos' }
    longitude "xpath=./venue/location/geo:point/geo:long", :text, { 'geo' => 'http://www.w3.org/2003/01/geo/wgs84_pos' }
  end
end

Note the format option used above. This is another special property that tells Wombat the type of document to parse. It defaults to :html, so usually you won't need to specify it; to parse XML, say format :xml. Only two formats are supported so far: HTML and XML. If you specify a namespace, you must also give the type of property you are requesting (:text, :html or :list) as the second argument, before the namespace. The namespace must be a hash whose keys are the namespace names and whose values are the namespace URLs.

Iterators

If you need to iterate over a list of nodes that match the same selector, use the for_each construct. It takes a selector and a block, and yields once for each element found in the page for that selector. In other words, for_each narrows the context down to the elements returned by its selector and crawls each of them. This one is a bit complicated, so an example works best:

class GithubCrawler
  include Wombat::Crawler
  base_url "http://www.github.com"
  list_page "/explore"

  for_each "css=ol.ranked-repositories li" do
    repo 'css=h3'
    description 'css=p.description'
  end
end

puts GithubCrawler.new.crawl
# outputs => 

{
  "repo"=>["jairajs89 / Touchy.js", "mcavage / node-restify", "notlion / streetview-stereographic", "twitter / bootstrap", "stolksdorf / Parallaxjs"], 
  "description"=>["A simple light-weight JavaScript library for dealing with touch events", "node.js REST framework specifically meant for web service APIs", "Shader Toy + Google Map + Panoramic Explorer", "HTML, CSS, and JS toolkit from Twitter", "a Library for Javascript that allows easy page parallaxing"]
}

Remember that, by default, a property returns only the first element matching its selector. So in the example above, even if there are several h3 elements inside each li, only the first match is returned. If you want to retrieve all the matching elements, use the :list option instead of the default :text (not implemented yet).
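Note that, as the sample output shows, for_each produces one hash of parallel arrays rather than an array of hashes. If you want one record per repository, a minimal sketch of pairing the entries back up (using a subset of the sample data above):

```ruby
# Pairing the parallel "repo" and "description" arrays of a crawl result.
# The values below are copied from the sample output above:
result = {
  "repo"        => ["jairajs89 / Touchy.js", "mcavage / node-restify"],
  "description" => ["A simple light-weight JavaScript library for dealing with touch events",
                    "node.js REST framework specifically meant for web service APIs"]
}

# zip walks both arrays in lockstep, yielding [repo, description] pairs
pairs = result["repo"].zip(result["description"])
pairs.each { |repo, desc| puts "#{repo}: #{desc}" }
```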
