GitHub - paulchai/x-ray: The next web scraper. See through the <html> noise.

var Xray = require('x-ray');
var x = Xray();

x('https://blog.ycombinator.com/', '.post', [{
  title: 'h1 a',
  link: '.article-title@href'
}])
  .paginate('.nav-previous a@href')
  .limit(3)
  .write('results.json')

Installation

npm install x-ray

Features

Flexible schema: Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.
Composable: The API is entirely composable, giving you great flexibility in how you scrape each page.
Pagination support: Paginate through websites, scraping each page. X-ray also supports a request delay and a pagination limit. Scraped pages can be streamed to a file, so if there's an error on one page, you won't lose what you've already scraped.
Crawler support: Start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages.
Responsible: X-ray has support for concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly.
Pluggable drivers: Swap in different scrapers depending on your needs. Currently supports HTTP and PhantomJS drivers. In the future, I'd like to see a Tor driver for requesting pages through the Tor network.
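
To make the breadth-first order mentioned under "Crawler support" concrete, here is a minimal sketch over a hypothetical in-memory link graph. The `links` object and `crawlOrder` function are illustrative names, not part of x-ray, and no real requests are made.

```javascript
// Hypothetical in-memory "site": each page lists the pages it links to.
const links = {
  '/': ['/a', '/b'],
  '/a': ['/a1'],
  '/b': ['/b1'],
  '/a1': [],
  '/b1': []
};

// Visit pages level by level from `start`, never visiting a page twice.
function crawlOrder(start, graph) {
  const seen = new Set([start]);
  const queue = [start];
  const order = [];
  while (queue.length > 0) {
    const page = queue.shift(); // oldest queued page first: breadth-first
    order.push(page);
    for (const next of graph[page] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return order;
}

console.log(crawlOrder('/', links)); // [ '/', '/a', '/b', '/a1', '/b1' ]
```

Both pages linked from the start page are visited before any page one level deeper, which is what makes the flow predictable.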

Selector API

xray(url, selector)(fn)

Scrape the url for the following selector, returning an object in the callback fn. The selector takes an enhanced jQuery-like string that is also able to select on attributes. The syntax for selecting on attributes is selector@attribute. If you do not supply an attribute, the default is selecting the innerText.

Here are a few examples:

Scrape a single tag

xray('http://google.com', 'title')(function(err, title) {
  console.log(title) // Google
})

Scrape a single class

xray('http://reddit.com', '.content')(fn)

Scrape an attribute

xray('http://techcrunch.com', 'img.logo@src')(fn)

Scrape innerHTML

xray('http://news.ycombinator.com', 'body@html')(fn)
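
One way to picture the selector@attribute syntax is as a string split into a CSS selector and an attribute name. The sketch below is an assumption about the idea, not x-ray's actual parser: `parseSelector` is a hypothetical helper, the `'text'` default is a stand-in for "innerText", and edge cases such as an `@` inside an attribute-value selector are ignored.

```javascript
// Hypothetical helper, not part of x-ray's public API: splits an
// x-ray style selector string into its CSS selector and attribute.
function parseSelector(input) {
  const at = input.lastIndexOf('@');
  if (at === -1) {
    // No attribute given: fall back to the element's text.
    return { selector: input.trim(), attribute: 'text' };
  }
  return {
    selector: input.slice(0, at).trim(),
    attribute: input.slice(at + 1).trim()
  };
}

console.log(parseSelector('img.logo@src')); // { selector: 'img.logo', attribute: 'src' }
console.log(parseSelector('title'));        // { selector: 'title', attribute: 'text' }
console.log(parseSelector('body@html'));    // { selector: 'body', attribute: 'html' }
```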

xray(url, scope, selector)

You can also supply a scope to each selector. In jQuery, this would look something like this: $(scope).find(selector).

xray(html, scope, selector)

Instead of a url, you can also supply raw HTML and all the same semantics apply.

var html = "<body><h2>Pear</h2></body>";
x(html, 'body', 'h2')(function(err, header) {
  header // => Pear
})

API

xray.driver(driver)

Specify a driver to make requests through. Available drivers include:

request - A simple driver built around request. Use this to set headers, cookies or http methods.
phantom - A high-level browser automation library. Use this to render pages, to interact with elements, or when elements are created dynamically using JavaScript (e.g. by Ajax calls).

xray.stream()

Returns a Readable Stream of the data. This makes it easy to build APIs around x-ray. Here's an example with Express:

var app = require('express')();
var x = require('x-ray')();

app.get('/', function(req, res) {
  var stream = x('http://google.com', 'title').stream();
  stream.pipe(res);
})

xray.write([path])

Stream the results to a path.

If no path is provided, then the behavior is the same as .stream().

xray.paginate(selector)

Select a url from a selector and visit that page.

xray.limit(n)

Limit the amount of pagination to n requests.
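
How .paginate() and .limit() interact can be sketched with a loop over hypothetical in-memory pages. The `pages` object and `paginate` function below are illustrative only, and the sketch assumes the starting page counts as one of the n requests.

```javascript
// Hypothetical pages keyed by URL; `next` plays the role of the link
// that a selector such as '.nav-previous a@href' would pick out.
const pages = {
  '/page/1': { titles: ['A', 'B'], next: '/page/2' },
  '/page/2': { titles: ['C'], next: '/page/3' },
  '/page/3': { titles: ['D'], next: '/page/4' },
  '/page/4': { titles: ['E'], next: null }
};

// Follow `next` links from `start`, visiting at most `limit` pages.
function paginate(start, limit) {
  const results = [];
  let url = start;
  let visited = 0;
  while (url && visited < limit) {
    const page = pages[url];
    if (!page) break; // broken link: stop early
    results.push(...page.titles);
    visited += 1;
    url = page.next;
  }
  return results;
}

console.log(paginate('/page/1', 3)); // [ 'A', 'B', 'C', 'D' ]
```

With a limit of 3, the fourth page is never requested even though a next link exists.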

xray.delay(from, [to])

Delay the next request between from and to milliseconds. If only from is specified, delay exactly from milliseconds.
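
The two forms of .delay() can be sketched as a small function that picks a wait time. `pickDelay` is a hypothetical helper, and treating the range as inclusive at both ends is an assumption rather than documented x-ray behavior.

```javascript
// Hypothetical helper: choose how long to wait before the next request.
// `random` is injectable so the result can be checked deterministically.
function pickDelay(from, to, random = Math.random) {
  if (to === undefined) return from; // only `from` given: fixed delay
  // Uniform integer in [from, to], assuming an inclusive range.
  return from + Math.floor(random() * (to - from + 1));
}

console.log(pickDelay(500));                // 500
console.log(pickDelay(500, 1000, () => 0)); // 500
console.log(pickDelay(500, 1000));          // some value between 500 and 1000
```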

xray.concurrency(n)

Set the request concurrency to n. Defaults to Infinity.

xray.throttle(n, ms)

Throttle the requests to n requests per ms milliseconds.
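
As a rough mental model of "n requests per ms milliseconds", the sketch below computes when each queued request could start under a simple fixed-window scheme. `throttledStarts` is a hypothetical function, and x-ray's actual throttling may space requests differently.

```javascript
// Hypothetical scheduler: earliest start offset (in ms) of each of
// `count` queued requests when at most `n` may start per `ms` window.
function throttledStarts(count, n, ms) {
  const starts = [];
  for (let i = 0; i < count; i++) {
    // Requests 0..n-1 start in window 0, n..2n-1 in window 1, and so on.
    starts.push(Math.floor(i / n) * ms);
  }
  return starts;
}

console.log(throttledStarts(5, 2, 1000)); // [ 0, 0, 1000, 1000, 2000 ]
```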

xray.timeout(ms)

Specify a timeout of ms milliseconds for each request.

Collections

X-ray also has support for selecting collections of tags. While x('ul', 'li') will only select the first list item in an unordered list, x('ul', ['li']) will select all of them.

Additionally, X-ray supports "collections of collections" allowing you to smartly select all list items in all lists with a command like this: x(['ul'], ['li']).
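
The difference between the three forms is mostly a difference in result shape. The sketch below models a page's lists as plain arrays to show those shapes; the data and function names are hypothetical, and no HTML is parsed.

```javascript
// Hypothetical stand-ins: one <ul>, and a page with two <ul> lists.
const singleList = ['apple', 'pear', 'plum'];
const manyLists = [['apple', 'pear'], ['leek']];

// Shape of x('ul', 'li'): a single string, the first list item.
function firstItem(list) {
  return list[0];
}

// Shape of x('ul', ['li']): an array with every item of the list.
function allItems(list) {
  return list.slice();
}

// Shape of x(['ul'], ['li']): one array of items per list.
function itemsPerList(lists) {
  return lists.map(allItems);
}

console.log(firstItem(singleList));   // apple
console.log(allItems(singleList));    // [ 'apple', 'pear', 'plum' ]
console.log(itemsPerList(manyLists)); // [ [ 'apple', 'pear' ], [ 'leek' ] ]
```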

Composition

X-ray becomes more powerful when you start composing instances together. Here are a few possibilities:

Crawling to another site

var Xray = require('x-ray');
var x = Xray();

x('http://google.com', {
  main: 'title',
  image: x('#gbar a@href', 'title'), // follow link to google images
})(function(err, obj) {
/*
  {
    main: 'Google',
    image: 'Google Images'
  }
*/
})

Scoping a selection

var Xray = require('x-ray');
var x = Xray();

x('http://mat.io', {
  title: 'title',
  items: x('.item', [{
    title: '.item-content h2',
    description: '.item-content section'
  }])
})(function(err, obj) {
/*
  {
    title: 'mat.io',
    items: [
      {
        title: 'The 100 Best Children\'s Books of All Time',
        description: 'Relive your childhood with TIME\'s list...'
      }
    ]
  }
*/
})

Filters

Filters can be specified when creating a new Xray instance. To apply filters to a value, append them to the selector using |.

var Xray = require('x-ray');
var x = Xray({
  filters: {
    trim: function (value) {
      return typeof value === 'string' ? value.trim() : value
    },
    reverse: function (value) {
      return typeof value === 'string' ? value.split('').reverse().join('') : value
    },
    slice: function (value, start , end) {
      return typeof value === 'string' ? value.slice(start, end) : value
    }
  }
});

x('http://mat.io', {
  title: 'title | trim | reverse | slice:0,2'
})(function(err, obj) {
/*
  {
    title: 'oi'
  }
*/
})
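
To show how a piped selector string might be turned into a chain of filter calls, here is a minimal sketch. `parseFilters` and `applyFilters` are hypothetical helpers, not x-ray internals. Converting arguments to numbers is an assumption, and the input string and the `0,2` arguments are chosen purely for illustration.

```javascript
// The same three filters as in the example above.
const filters = {
  trim: (value) => (typeof value === 'string' ? value.trim() : value),
  reverse: (value) =>
    typeof value === 'string' ? value.split('').reverse().join('') : value,
  slice: (value, start, end) =>
    typeof value === 'string' ? value.slice(start, end) : value
};

// Hypothetical parser: 'title | trim | slice:0,2' becomes
// { selector: 'title', steps: [{ name: 'trim', args: [] }, ...] }.
function parseFilters(input) {
  const [selector, ...rest] = input.split('|').map((part) => part.trim());
  const steps = rest.map((part) => {
    const [name, argList] = part.split(':');
    // Assumption: filter arguments are comma-separated numbers.
    const args = argList ? argList.split(',').map((a) => Number(a.trim())) : [];
    return { name: name.trim(), args };
  });
  return { selector, steps };
}

// Run each step left to right, feeding the previous result forward.
function applyFilters(value, steps) {
  return steps.reduce(
    (acc, step) => filters[step.name](acc, ...step.args),
    value
  );
}

const { selector, steps } = parseFilters('title | trim | reverse | slice:0,2');
console.log(selector);                          // title
console.log(applyFilters('  mat.io  ', steps)); // oi
```

The value is trimmed to 'mat.io', reversed to 'oi.tam', and then sliced to its first two characters.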

Examples

selector: simple string selector
collections: selects an object
arrays: selects an array
collections of collections: selects an array of objects
array of arrays: selects an array of arrays

In the Wild

Levered Returns: Uses x-ray to pull together financial data from various unstructured sources around the web.

Resources

Video: https://egghead.io/lessons/node-js-intro-to-web-scraping-with-node-and-x-ray

Backers

Support us with a monthly donation and help us continue our activities. [Become a backer]

License

MIT
