Merge pull request #41 from yujiosaka/resolve_url
Resolve url
yujiosaka authored Dec 16, 2017
2 parents cce69c2 + 7c88bf9 commit e7bad81
Showing 4 changed files with 111 additions and 28 deletions.
36 changes: 21 additions & 15 deletions README.md
@@ -142,13 +142,13 @@ HCCrawler.launch({
* `persistCache` <[boolean]> Whether to persist cache on closing or disconnecting from the browser, defaults to `false`.
* returns: <Promise<HCCrawler>> Promise which resolves to HCCrawler instance.

This method connects to an existing Chromium instance. The following options are passed to [puppeteer.connect()](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerconnectoptions).

```
browserWSEndpoint, ignoreHTTPSErrors
```

Also, the following options can be set as default values when [crawler.queue()](#crawlerqueueoptions) is executed.

```
url, allowedDomains, timeout, priority, delay, retryCount, retryDelay, jQuery, device, username, password, shouldRequest, evaluatePage, onSuccess, onError
```

@@ -165,13 +165,13 @@
* `persistCache` <[boolean]> Whether to persist cache on closing or disconnecting from the browser, defaults to `false`.
* returns: <Promise<HCCrawler>> Promise which resolves to HCCrawler instance.

This method launches a Headless Chrome/Chromium instance. The following options are passed to [puppeteer.launch()](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions).

```
ignoreHTTPSErrors, headless, executablePath, slowMo, args, handleSIGINT, handleSIGTERM, handleSIGHUP, timeout, dumpio, userDataDir, env, devtools
```

Also, the following options can be set as default values when [crawler.queue()](#crawlerqueueoptions) is executed.

```
url, allowedDomains, timeout, priority, delay, retryCount, retryDelay, jQuery, device, username, password, shouldRequest, evaluatePage, onSuccess, onError
```

@@ -198,12 +198,12 @@
* `jQuery` <[boolean]> Whether to automatically add the [jQuery](https://jquery.com) tag to the page, defaults to `true`.
* `device` <[String]> Device to emulate. Available devices are listed [here](https://github.com/GoogleChrome/puppeteer/blob/master/DeviceDescriptors.js).
* `username` <[String]> Username for basic authentication. Pass `null` if it's not necessary.
* `screenshot` <[Object]> Screenshot option, defaults to `null`. This option is passed to [Puppeteer's page.screenshot()](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagescreenshotoptions). Pass `null` or leave default to disable screenshot.
* `password` <[String]> Password for basic authentication. Pass `null` if it's not necessary.
* `userAgent` <[String]> User agent string to override in this page.
* `extraHeaders` <[Object]> An object containing additional headers to be sent with every request. All header values must be strings.
* `preRequest(options)` <[Function]> Function called before each request, e.g. to modify `options`. Return `false` to skip the request.
* `options` <[Object]> [crawler.queue()](#crawlerqueueoptions)'s options with default values.
* `evaluatePage()` <[Function]> Function to be evaluated in browsers. Return a serializable object; if it's not serializable, the result will be `undefined`.
* `onSuccess(response)` <[Function]> Function to be called when `evaluatePage()` succeeds.
* `response` <[Object]>
@@ -212,7 +212,7 @@
* `status` <[String]> Status code of the request.
* `url` <[String]> Last requested url.
* `headers` <[Object]> Response headers.
* `options` <[Object]> [crawler.queue()](#crawlerqueueoptions)'s options with default values.
* `result` <[Serializable]> The result resolved from `evaluatePage()` option.
* `screenshot` <[Buffer]> Buffer with the screenshot image, which is `null` when the `screenshot` option is not passed.
* `links` <[Array]> List of links found in the requested page.
@@ -221,7 +221,7 @@

> **Note**: `response.url` may be different from `options.url` especially when the requested url is redirected.

The following options are passed to [Puppeteer's page.goto()](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagegotourl-options).

```
timeout, waitUntil
```

@@ -231,7 +231,7 @@ The options can be either an object, an array, or a string.
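
To make those shapes concrete, here is a standalone sketch of typical queue arguments. The urls, the binary-asset pattern, and the hook bodies are illustrative assumptions, and the `queue()` call itself is left as a comment since it needs a launched crawler.

```javascript
// Hypothetical crawler.queue() arguments: a bare string and a full
// options object with the three hooks (all values are illustrative).
const simple = 'https://example.com/';

const detailed = {
  url: 'https://example.com/',
  jQuery: true,
  retryCount: 3,
  // Return false to skip a request, e.g. for binary assets.
  preRequest: options => !/\.(png|jpe?g|pdf)$/i.test(options.url),
  // Evaluated in the browser; must return a serializable object.
  evaluatePage: () => ({ title: document.title }),
  // Receives the response, result, links and screenshot.
  onSuccess: response => console.log(response.result.title),
};

// The options can be a string, an object, or an array mixing both:
// crawler.queue([simple, detailed]);
```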

#### crawler.setMaxRequest(maxRequest)

This method allows you to modify the `maxRequest` option you passed to [HCCrawler.connect()](#hccrawlerconnectoptions) or [HCCrawler.launch()](#hccrawlerlaunchoptions).

#### crawler.pause()

@@ -293,7 +293,7 @@

### class: SessionCache

`SessionCache` is the [HCCrawler.connect()](#hccrawlerconnectoptions)'s default `cache` option. By default, the crawler remembers already-requested urls in memory. Pass `null` to the option to disable it.

```js
const HCCrawler = require('headless-chrome-crawler');
HCCrawler.launch({ cache: null });
```

### class: RedisCache

Passing a `RedisCache` object to the [HCCrawler.connect()](#hccrawlerconnectoptions)'s `cache` option allows you to persist requested urls in Redis and prevents requesting the same urls again in a distributed environment. It also works well with the `persistCache` option set to `true`.

Its constructor options are passed to [NodeRedis's redis.createClient()](https://github.com/NodeRedis/node_redis#rediscreateclient).

```js
const HCCrawler = require('headless-chrome-crawler');
```

@@ -324,7 +324,7 @@

### class: BaseCache

You can create your own cache by extending the [BaseCache's interfaces](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/cache/base.js) and passing an instance to the [HCCrawler.connect()](#hccrawlerconnectoptions)'s `cache` option.

Here is an example of creating a file-based cache.

@@ -370,16 +370,22 @@

```js
HCCrawler.launch({ cache: new FsCache({ file: FILE }) });
// ...
```
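
Since the diff truncates the file-based example above, here is a self-contained sketch of the general shape such a cache takes. The method names below are illustrative assumptions, not the actual `BaseCache` interface (see `cache/base.js` for the real one).

```javascript
// Minimal in-memory cache sketch. A real implementation would extend
// BaseCache and persist `_requested` (e.g. to a file or Redis), but the
// idea is the same: remember requested urls and answer existence checks.
class MemoryCache {
  constructor() {
    this._requested = new Set();
  }
  exists(url) { // hypothetical method name
    return Promise.resolve(this._requested.has(url));
  }
  set(url) { // hypothetical method name
    this._requested.add(url);
    return Promise.resolve();
  }
  clear() {
    this._requested.clear();
    return Promise.resolve();
  }
}

const cache = new MemoryCache();
cache.set('https://example.com/')
  .then(() => cache.exists('https://example.com/'))
  .then(exists => console.log(exists)); // logs true
```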

## Tips

### Launch options

[HCCrawler.launch()](#hccrawlerlaunchoptions)'s options are passed to [puppeteer.launch()](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions). It may be useful to set the `headless` and `slowMo` options so that you can see what is going on.

```js
HCCrawler.launch({ headless: false, slowMo: 10 });
```

Also, the `args` option is passed to the browser instance. A list of Chromium flags can be found [here](http://peter.sh/experiments/chromium-command-line-switches/). Passing the `--disable-web-security` flag is useful for crawling: when the flag is set, links within iframes are collected as if they belonged to their parent frames; when it's not, the `src` attributes of the iframes are collected as links.

```js
HCCrawler.launch({ args: ['--disable-web-security'] });
```

### Enable debug logging

All requests and browser logs are logged via the [debug](https://github.com/visionmedia/debug) module under the `hccrawler` namespace.
19 changes: 11 additions & 8 deletions lib/crawler.js
@@ -1,6 +1,6 @@
const { pick, isEmpty, uniq } = require('lodash');
const devices = require('puppeteer/DeviceDescriptors');
const { resolveUrl, debugRequest, debugBrowser } = require('./helper');

const GOTO_OPTIONS = [
'timeout',
@@ -25,7 +25,7 @@ class Crawler {
Promise.all([
this._scrape(),
this._screenshot(),
this._collectLinks(response.url),
])
.then(([result, screenshot, links]) => ({
response,
@@ -161,23 +161,26 @@ class Crawler {
* @return {Promise} resolved after collecting links
* @private
*/
_collectLinks(baseUrl) {
const links = [];
return this._page.exposeFunction('pushToLinks', link => {
const _link = resolveUrl(link, baseUrl);
if (_link) links.push(_link);
})
.then(() => (
this._page.evaluate(() => {
function findLinks(document) {
document.querySelectorAll('a[href]')
.forEach(link => {
window.pushToLinks(link.href);
});
document.querySelectorAll('iframe,frame')
.forEach(frame => {
try {
findLinks(frame.contentDocument);
} catch (e) {
console.warn(e.message);
if (frame.src) window.pushToLinks(frame.src);
}
});
}
29 changes: 28 additions & 1 deletion lib/helper.js
@@ -1,5 +1,12 @@
const URL = require('url');
const crypto = require('crypto');
const {
omit,
isPlainObject,
trim,
startsWith,
includes,
} = require('lodash');
const debugRequest = require('debug')('hccrawler:request');
const debugBrowser = require('debug')('hccrawler:browser');

@@ -71,6 +78,26 @@
}, {});
}

/**
* Resolve valid url from anchor and iframe tags
* @param {String} url
* @param {String} baseUrl
* @return {String} resolved url
* @static
*/
static resolveUrl(url, baseUrl) {
url = trim(url);
if (!url) return null;
if (startsWith(url, '#')) return null;
const { protocol } = URL.parse(url);
if (includes(['http:', 'https:'], protocol)) {
return url.split('#')[0];
} else if (!protocol) {
return URL.resolve(baseUrl, url).split('#')[0];
}
return null;
}

/**
* Debug log for request events
* @param {string} msg
55 changes: 51 additions & 4 deletions test/helper.test.js
@@ -4,6 +4,7 @@ const {
jsonStableReplacer,
hash,
generateKey,
resolveUrl,
debugRequest,
debugBrowser,
} = require('../lib/helper');
@@ -51,24 +52,70 @@ describe('Helper', () => {
describe('Helper.jsonStableReplacer', () => {
it('sorts keys by order', () => {
const obj = { c: 8, b: [{ z: 6, y: 5, x: 4 }, 7], a: 3 };
const actual = JSON.stringify(obj, jsonStableReplacer);
const expected = '{"a":3,"b":[{"x":4,"y":5,"z":6},7],"c":8}';
assert.equal(actual, expected);
});
});
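
The sorted-key behavior asserted above can be reproduced with a minimal replacer. This sketch is an assumption about how `jsonStableReplacer` behaves, inferred from the test, not a copy of the library code:

```javascript
// Minimal stable replacer: return plain objects with keys re-inserted
// in sorted order so JSON.stringify emits them deterministically.
function stableReplacer(key, val) {
  if (val === null || typeof val !== 'object' || Array.isArray(val)) return val;
  return Object.keys(val).sort().reduce((sorted, k) => {
    sorted[k] = val[k];
    return sorted;
  }, {});
}

const obj = { c: 8, b: [{ z: 6, y: 5, x: 4 }, 7], a: 3 };
console.log(JSON.stringify(obj, stableReplacer));
// → {"a":3,"b":[{"x":4,"y":5,"z":6},7],"c":8}
```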

describe('Helper.resolveUrl', () => {
const baseUrl = 'https://github.com/yujiosaka/headless-chrome-crawler';

it('returns null when the argument is null', () => {
const actual = resolveUrl(null, baseUrl);
const expected = null;
assert.equal(actual, expected);
});

it('returns null when the argument starts with a hash', () => {
const actual = resolveUrl('#headless-chrome-crawler---', baseUrl);
const expected = null;
assert.equal(actual, expected);
});

it('returns null when the argument starts with javascript:', () => {
const actual = resolveUrl('javascript:void(0)', baseUrl); /* eslint no-script-url: 0 */
const expected = null;
assert.equal(actual, expected);
});

it('returns null when the argument starts with mailto:', () => {
const actual = resolveUrl('mailto:[email protected]', baseUrl);
const expected = null;
assert.equal(actual, expected);
});

it('returns full URL when the argument is an absolute URL', () => {
const actual = resolveUrl('https://github.com/yujiosaka/headless-chrome-crawler', baseUrl);
const expected = 'https://github.com/yujiosaka/headless-chrome-crawler';
assert.equal(actual, expected);
});

it('strips hash when the argument is an absolute URL with hash', () => {
const actual = resolveUrl('https://github.com/yujiosaka/headless-chrome-crawler#headless-chrome-crawler---', baseUrl);
const expected = 'https://github.com/yujiosaka/headless-chrome-crawler';
assert.equal(actual, expected);
});

it('resolves url when the argument is a relative URL', () => {
const actual = resolveUrl('headless-chrome-crawler/settings', baseUrl);
const expected = 'https://github.com/yujiosaka/headless-chrome-crawler/settings';
assert.equal(actual, expected);
});
});

describe('Helper.debugRequest', () => {
it('does not throw an error', () => {
assert.doesNotThrow(() => {
debugRequest('Start requesting https://github.com/yujiosaka/headless-chrome-crawler');
});
});
});

describe('Helper.debugBrowser', () => {
it('does not throw an error', () => {
assert.doesNotThrow(() => {
debugBrowser('Console log init https://github.com/yujiosaka/headless-chrome-crawler');
});
});
});
