Felipe Lima edited this page Jun 6, 2013 · 41 revisions

Wombat


What is it?

Getting Started

Wombat is a simple Ruby DSL for scraping webpages, built on top of the cool Mechanize and Nokogiri gems. It aims to be a higher-level abstraction for when you don't want to dig into the specifics of fetching the page and parsing it into your own data structure, which can be a decent amount of work, depending on what you need.

With Wombat, you can simply call Wombat.crawl to get started. You can also define a class that includes the module Wombat::Crawler and tells Wombat which information to retrieve and where to find it. Basically, you can name stuff however you want and specify how to find that information in the page. For example:

class MyScraper
  include Wombat::Crawler

  base_url "http://domain_to_be_scraped.com"
  path "/path-to/page-with/the-data"
  
  some_data css: "div.elemClass .anchor"
  another_info xpath: "//my/xpath[@style='selector']"
end

Wombat.crawl do
  base_url "http://domain_to_be_scraped.com"
  path "/path-to/page-with/the-data"
  
  some_data css: "div.elemClass .anchor"
  another_info xpath: "//my/xpath[@style='selector']"
end

Both examples above are equivalent and produce the same results. You can either use the class-based approach (the first one) or simply call Wombat.crawl (the second one). Behind the scenes, Wombat.crawl is just a shortcut so you don't need to create a class if that doesn't make sense for you.

Let's see what is going on here. First you create your class and include the Wombat::Crawler module. Simple. Then you start naming the properties you want to retrieve, in a free format. The only restrictions are the ones that apply to naming a Ruby method, because each property is in fact a Ruby method call :)
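
To make the "method call" point concrete, here is a toy sketch of how a DSL can turn arbitrary names into properties. It is an illustration only and not Wombat's actual implementation: the TinyDsl class and its settings and properties readers are hypothetical names invented for this example.

```ruby
# Toy illustration only: NOT Wombat's real code.
# TinyDsl, #settings and #properties are hypothetical names.
class TinyDsl
  attr_reader :settings, :properties

  def initialize(&block)
    @settings = {}
    @properties = {}
    # Run the block with self set to this object, so bare names
    # inside the block are resolved as method calls on it.
    instance_eval(&block) if block
  end

  # "Reserved words" are just ordinary methods.
  def base_url(url)
    @settings[:base_url] = url
  end

  def path(value)
    @settings[:path] = value
  end

  # Any other name called with an argument is recorded as a property.
  def method_missing(name, *args)
    return super if args.empty?

    @properties[name] = args.first
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

dsl = TinyDsl.new do
  base_url "http://domain_to_be_scraped.com"
  path "/path-to/page-with/the-data"

  some_data css: "div.elemClass .anchor"
  another_info xpath: "//my/xpath[@style='selector']"
end

p dsl.properties
```

Because some_data and another_info are dispatched as method calls, a name that is not a valid Ruby method name cannot be used as a property.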

The base_url and path lines are reserved words that say where to get the information from. Include only the scheme (http), host and port in base_url. In path, include only the path portion of the URL to be scraped, always starting with a forward slash.
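
If you are starting from a full URL, the split described above can be sketched with Ruby's standard URI library. The helper name split_for_wombat and the example.com URL are assumptions made for this illustration; the helper is not part of Wombat, and it drops any query string or fragment.

```ruby
require "uri"

# Hypothetical helper, not part of Wombat: splits a full URL into
# a [base_url, path] pair in the shape described above.
def split_for_wombat(url)
  uri = URI.parse(url)

  # Scheme and host, plus the port only when it is not the default one.
  base = "#{uri.scheme}://#{uri.host}"
  base += ":#{uri.port}" unless uri.port == uri.default_port

  # The path must always start with a forward slash.
  path = uri.path.empty? ? "/" : uri.path

  [base, path]
end

base, path = split_for_wombat("http://example.com:8080/path-to/page-with/the-data")
puts base # http://example.com:8080
puts path # /path-to/page-with/the-data
```

The two values can then be passed to base_url and path respectively.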

To run this scraper, just instantiate it and call the crawl method:

my_cool_scraper = MyScraper.new
my_cool_scraper.crawl

This will request and parse the page according to the filters you specified and return a hash with that same structure. We'll get into that in detail later. Another important detail is that selectors can be specified as CSS or XPath. As in the examples above, CSS selectors are passed under the css: key and XPath expressions under the xpath: key. Very important: by default, each property returns the text of only the first element matching its selector.

If you want to retrieve all the matching elements, use the option :list.

As of version 2.2, you can also supply a pre-fetched Mechanize page, like this:

Wombat.crawl do
  m = Mechanize.new 
  mp = m.get 'http://www.google.com'

  page mp
  # ...
end

A page supplied this way takes precedence over any base_url and path attributes that are also specified in the crawler.
