felipecsl edited this page Feb 15, 2012 · 41 revisions

Wombat


Getting Started

Wombat is a simple Ruby DSL for crawling web pages, built on top of the cool Mechanize and Nokogiri gems. It aims to be a higher-level abstraction for when you don't want to dig into the specifics of fetching a page and parsing it into your own data structure, which can be a decent amount of work depending on what you need.

With Wombat, you define a class that includes the Wombat::Crawler module and declares which information to retrieve and where to find it. You can name each property however you want and specify how to locate it in the page. For example:

class MyCrawler
  include Wombat::Crawler

  base_url "http://domain_to_be_crawled.com"
  list_page "/path-to/page-with/the-data"
  
  some_data "css=div.elemClass .anchor"
  another_info "xpath=//my/xpath[@style='selector']"
end

Let's see what is going on here. First, you create your class and include the Wombat::Crawler module. Simple. Then you name the properties you want to retrieve, in free form. The naming restrictions are the same as for a Ruby method, because each declaration is in fact a Ruby method call :)
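To make the "it is just a method call" point concrete, here is a minimal sketch of how a DSL can accept free-form property names. It is not Wombat's actual implementation: TinyDsl, SketchCrawler and properties are hypothetical names made up for this illustration, and the approach shown (method_missing at the class level) is just one common way to do it.

```ruby
# Hypothetical stand-in for a crawler DSL; NOT Wombat's real code.
module TinyDsl
  # When the module is included, add the class-level DSL methods.
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Property name => selector, in declaration order.
    def properties
      @properties ||= {}
    end

    # Any unknown class-level call with an argument, such as
    # `some_data "css=..."`, is recorded as a property.
    def method_missing(name, *args)
      return super if args.empty?

      properties[name] = args.first
    end
  end
end

class SketchCrawler
  include TinyDsl

  some_data "css=div.elemClass .anchor"
  another_info "xpath=//my/xpath[@style='selector']"
end

p SketchCrawler.properties
```

Because the names are resolved as ordinary method calls, anything that is not a valid Ruby method name cannot be used as a property name either.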

The base_url and list_page lines are reserved words that say where to get the information from. Include only the scheme (http), host and port in base_url. In list_page, include only the path portion of the URL to be crawled, always starting with a forward slash.
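The split between base_url and list_page can be pictured with Ruby's standard uri library. This is only a sketch of how the two parts relate to a full URL. How Wombat combines them internally is not shown here, and the host example.com and port 8080 are placeholder values chosen for the illustration.

```ruby
require "uri"

# Placeholder values: scheme, host and port go in base_url,
# the path (with a leading slash) goes in list_page.
base_url  = "http://example.com:8080"
list_page = "/path-to/page-with/the-data"

# Joining the two parts yields the URL of the page to crawl.
full_url = URI.join(base_url, list_page)
puts full_url

# Going the other way, a full URL decomposes into the same pieces.
puts full_url.scheme
puts full_url.host
puts full_url.port
puts full_url.path
```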

To run this crawler, just instantiate it and call the crawl method:

my_cool_crawler = MyCrawler.new
my_cool_crawler.crawl

This requests the page, parses it according to the filters you specified, and returns a hash with that same structure. We'll get into that detail later. Selectors can be specified as CSS or XPath: CSS selectors start with css= and XPath selectors start with xpath=. Very important: by default, each property returns the text of only the first element that matches its selector.
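The css= and xpath= prefixes can be thought of as a small tag in front of the real selector. The helper below, split_selector, is a hypothetical function written for this page and not part of Wombat. It only sketches how such a string can be separated into its type and its expression.

```ruby
# Hypothetical helper, not part of Wombat: separates the selector
# type (css or xpath) from the selector expression.
def split_selector(selector)
  # Split on the first "=" only, since XPath expressions may contain "=".
  type, expression = selector.split("=", 2)
  unless %w[css xpath].include?(type) && expression
    raise ArgumentError, "selector must start with css= or xpath=: #{selector.inspect}"
  end

  [type.to_sym, expression]
end

p split_selector("css=div.elemClass .anchor")
p split_selector("xpath=//my/xpath[@style='selector']")
```

Splitting on the first equals sign only matters for XPath, where attribute tests such as [@style='selector'] contain their own equals signs.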

If you want to retrieve all the matching elements, use the option :list.
