Merge pull request #275 from yujiosaka/replace-newpage-by-custom-crawl
replace newpage by custom crawl
yujiosaka authored Jun 11, 2018
2 parents 4f5bac5 + dce9873 commit c16aa6d
Showing 14 changed files with 146 additions and 70 deletions.
36 changes: 24 additions & 12 deletions docs/API.md
@@ -22,7 +22,6 @@
* [crawler.queueSize()](#crawlerqueuesize)
* [crawler.pendingQueueSize()](#crawlerpendingqueuesize)
* [crawler.requestedCount()](#crawlerrequestedcount)
* [event: 'newpage'](#event-newpage)
* [event: 'requestdisallowed'](#event-requestdisallowed)
* [event: 'requeststarted'](#event-requeststarted)
* [event: 'requestskipped'](#event-requestskipped)
@@ -73,8 +72,11 @@ const HCCrawler = require('headless-chrome-crawler');
* `persistCache` <[boolean]> Whether to clear cache on closing or disconnecting from the Chromium instance, defaults to `false`.
* `preRequest(options)` <[Function]> Function called before each request, for example to modify the `options`. Return `false` to skip the request.
* `options` <[Object]> [crawler.queue()](#crawlerqueueoptions)'s options with default values.
* `onSuccess(response)` <[Function]> Function to be called when `evaluatePage()` succeeds.
* `response` <[Object]>
* `customCrawl(page, crawl)` <[Function]> Function to customize the crawled result, allowing access to [Puppeteer](https://github.com/GoogleChrome/puppeteer)'s raw API (see the sketch below).
* `page` <[Page]> [Puppeteer](https://github.com/GoogleChrome/puppeteer)'s page object.
* `crawl` <[Function]> Function that runs the default crawling logic, and resolves to the result passed to the `onSuccess` function.
* `onSuccess(result)` <[Function]> Function to be called when `evaluatePage()` succeeds.
* `result` <[Object]>
* `redirectChain` <[Array]<[Object]>> Redirect chain of requests.
* `url` <[string]> Requested url.
* `headers` <[Object]> Request headers.
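For quick reference, here is a minimal sketch of the new `customCrawl` flow described above. The `page.title()` call and the queued URL are illustrative placeholders, not part of this diff; see also the new examples/custom-crawl.js added below.

```js
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    // `page` is Puppeteer's page object; `crawl` runs the default crawling logic.
    customCrawl: async (page, crawl) => {
      const result = await crawl(); // resolves to the result passed to onSuccess
      result.title = await page.title(); // extend the result via Puppeteer's raw API
      return result; // the extended result reaches onSuccess
    },
    onSuccess: result => {
      console.log(`Got ${result.title} for ${result.options.url}.`);
    },
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```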
@@ -130,8 +132,24 @@ url, allowedDomains, deniedDomains, timeout, priority, depthPriority, delay, ret
* `persistCache` <[boolean]> Whether to clear cache on closing or disconnecting from the Chromium instance, defaults to `false`.
* `preRequest(options)` <[Function]> Function called before each request, for example to modify the `options`. Return `false` to skip the request.
* `options` <[Object]> [crawler.queue()](#crawlerqueueoptions)'s options with default values.
* `onSuccess(response)` <[Function]> Function to be called when `evaluatePage()` succeeds.
* `response` <[Object]>
* `customCrawl(page, crawl)` <[Function]> Function to customize the crawled result, allowing access to [Puppeteer](https://github.com/GoogleChrome/puppeteer)'s raw API (sketched above).
* `page` <[Page]> [Puppeteer](https://github.com/GoogleChrome/puppeteer)'s page object.
* `crawl` <[Function]> Function that runs the default crawling logic, and resolves to the result passed to the `onSuccess` function.
* `onSuccess(result)` <[Function]> Function to be called when `evaluatePage()` succeeds (a result-handling sketch follows this hunk).
* `result` <[Object]>
* `redirectChain` <[Array]<[Object]>> Redirect chain of requests.
* `url` <[string]> Requested url.
* `headers` <[Object]> Request headers.
* `cookies` <[Array]<[Object]>> List of cookies.
* `name` <[string]>
* `value` <[string]>
* `domain` <[string]>
* `path` <[string]>
* `expires` <[number]> Unix time in seconds.
* `httpOnly` <[boolean]>
* `secure` <[boolean]>
* `session` <[boolean]>
* `sameSite` <[string]> `"Strict"` or `"Lax"`.
* `response` <[Object]>
* `ok` <[boolean]> Whether the status code is in the range 200-299.
* `status` <[string]> Status code of the request.
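As an illustration of the fields listed above, an `onSuccess` callback might inspect the response status, cookies, and redirect chain. A sketch, with the filtering and logging purely illustrative:

```js
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    onSuccess: result => {
      if (!result.response.ok) return; // status outside the 200-299 range
      const sessionCookies = result.cookies.filter(cookie => cookie.session);
      console.log(`${result.options.url} returned ${result.response.status} ` +
        `with ${sessionCookies.length} session cookies ` +
        `and ${result.redirectChain.length} redirects.`);
    },
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```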
@@ -140,7 +158,7 @@ url, allowedDomains, deniedDomains, timeout, priority, depthPriority, delay, ret
* `options` <[Object]> [crawler.queue()](#crawlerqueueoptions)'s options with default values.
* `result` <[Serializable]> The result resolved from `evaluatePage()` option.
* `screenshot` <[Buffer]> Buffer with the screenshot image, which is `null` when `screenshot` option not passed.
* `links` <[Array]> List of links found in the requested page.
* `links` <[Array]<[string]>> List of links found in the requested page.
* `depth` <[number]> Depth of the followed links.
* `previousUrl` <[string]> The previous request's url. The value is `null` for the initial request.
* `onError(error)` <[Function]> Function to be called when request fails.
@@ -280,12 +298,6 @@ This method clears the cache when it's used.

* returns: <[number]> The count of total requests.

### event: 'newpage'

* `page` <[Page]>

Emitted when a [Puppeteer](https://github.com/GoogleChrome/puppeteer) page is opened.

### event: 'requestdisallowed'

* `options` <[Object]>
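Events like `requestdisallowed` can be observed with an ordinary listener, assuming the crawler behaves as a standard Node.js EventEmitter (which these event sections imply). The `obeyRobotsTxt` option and handler body below are illustrative:

```js
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({ obeyRobotsTxt: true });
  // Fires when a queued request is disallowed (e.g. by robots.txt).
  crawler.on('requestdisallowed', options => {
    console.log(`Disallowed: ${options.url}`);
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```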
17 changes: 11 additions & 6 deletions docs/CHANGELOG.md
@@ -11,6 +11,11 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

- Set `previousUrl` on the `onSuccess` argument.
- Set `options`, `depth` and `previousUrl` on errors.
- Support `customCrawl` for [HCCrawler.connect()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#hccrawlerconnectoptions) and [HCCrawler.launch()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#hccrawlerlaunchoptions)'s options.

### Changed

- Drop `newpage` event.

### Fixed

@@ -24,7 +29,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Support `cookies` for [crawler.queue()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#crawlerqueueoptions)'s options.
- Make `onSuccess` pass `cookies` in the response.

### changed
### Changed

- Update [Puppeteer](https://github.com/GoogleChrome/puppeteer) version to 1.4.0.

@@ -36,7 +41,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Emit `requestdisallowed` event.
- Make `onSuccess` pass `redirectChain` in the response.

### changed
### Changed

- Bump Node.js version up to 8.10.0.
- Update [Puppeteer](https://github.com/GoogleChrome/puppeteer) version to 1.3.0.
@@ -68,7 +73,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

## [1.3.4] - 2018-02-22

### changed
### Changed

- Drop `depthPriority` for [crawler.queue()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#crawlerqueueoptions)'s options.

@@ -79,7 +84,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Emit `newpage` event.
- Support `deniedDomains` and `depthPriority` for [crawler.queue()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#crawlerqueueoptions)'s options.

### changed
### Changed

- Allow `allowedDomains` option to accept a list of regular expressions.

@@ -106,7 +111,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Add [HCCrawler.defaultArgs()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#hccrawlerdefaultargs) method.
- Emit `requestretried` event.

### changed
### Changed

- Use the `cache` option not only for remembering already requested URLs but also as a request queue for distributed environments.
- Moved `onSuccess`, `onError` and `maxDepth` options from [HCCrawler.connect()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#hccrawlerconnectoptions) and [HCCrawler.launch()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#hccrawlerlaunchoptions) to [crawler.queue()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#crawlerqueueoptions).
@@ -117,7 +122,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Support `obeyRobotsTxt` for [crawler.queue()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#crawlerqueueoptions)'s options.
- Support `persist` for [RedisCache](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#rediscache)'s constructing options.

### changed
### Changed

- Make `cache` required for [HCCrawler.connect()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#hccrawlerconnectoptions) and [HCCrawler.launch()](https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md#hccrawlerlaunchoptions)'s options.
- Provide `skipDuplicates` to remember and skip duplicate URLs, instead of passing `null` to `cache` option.
8 changes: 4 additions & 4 deletions examples/conditional-screenshot.js
@@ -4,14 +4,14 @@ const PATH = './tmp/';

(async () => {
const crawler = await HCCrawler.launch({
onSuccess: (result => {
onSuccess: result => {
console.log(`Screenshot is saved as ${PATH}${result.options.saveAs} for ${result.options.url}.`);
}),
preRequest: (options => {
},
preRequest: options => {
if (!options.saveAs) return false; // Skip the request by returning false
options.screenshot = { path: `${PATH}${options.saveAs}` };
return true;
}),
},
});
await crawler.queue({ url: 'https://example.com/' });
// saveAs is a custom option for preRequest to conditionally modify options and skip requests
4 changes: 2 additions & 2 deletions examples/custom-cache.js
@@ -63,9 +63,9 @@ const cache = new FsCache({ file: FILE });
(async () => {
const crawler = await HCCrawler.launch({
maxConcurrency: 1,
onSuccess: (result => {
onSuccess: result => {
console.log(`Requested ${result.options.url}.`);
}),
},
cache,
});
await crawler.queue('https://example.com/');
29 changes: 29 additions & 0 deletions examples/custom-crawl.js
@@ -0,0 +1,29 @@
const HCCrawler = require('headless-chrome-crawler');

(async () => {
const crawler = await HCCrawler.launch({
customCrawl: async (page, crawl) => {
// You can access the page object before requests
await page.setRequestInterception(true);
page.on('request', request => {
if (request.url().endsWith('/')) {
request.continue();
} else {
request.abort();
}
});
// The result contains options, links, cookies, etc.
const result = await crawl();
// You can access the page object after requests
result.content = await page.content();
// You need to extend and return the crawled result
return result;
},
onSuccess: result => {
console.log(`Got ${result.content} for ${result.options.url}.`);
},
});
await crawler.queue('https://example.com/');
await crawler.onIdle();
await crawler.close();
})();
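Because every request whose URL does not end with `/` is aborted, sub-resources such as scripts and images are never fetched while the page documents themselves still load. The example also shows that `customCrawl` runs early enough to configure request interception before navigation starts.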
8 changes: 4 additions & 4 deletions examples/emulate-device.js
@@ -3,12 +3,12 @@ const HCCrawler = require('headless-chrome-crawler');
(async () => {
const crawler = await HCCrawler.launch({
url: 'https://example.com/',
evaluatePage: (() => ({
evaluatePage: () => ({
userAgent: window.navigator.userAgent,
})),
onSuccess: (result => {
console.log(`Emulated ${result.result.userAgent} for ${result.options.url}.`);
}),
onSuccess: result => {
console.log(`Emulated ${result.result.userAgent} for ${result.options.url}.`);
},
});
await crawler.queue({ device: 'Nexus 7' });
await crawler.queue({ userAgent: 'headless-chrome-crawler' }); // Only override userAgent
8 changes: 4 additions & 4 deletions examples/multiple-queue.js
@@ -2,12 +2,12 @@ const HCCrawler = require('headless-chrome-crawler');

(async () => {
const crawler = await HCCrawler.launch({
evaluatePage: (() => ({
evaluatePage: () => ({
title: $('title').text(),
})),
onSuccess: (result => {
console.log(`Got ${result.result.title} for ${result.options.url}.`);
}),
onSuccess: result => {
console.log(`Got ${result.result.title} for ${result.options.url}.`);
},
});
await crawler.queue('https://example.com/'); // Queue a request
await crawler.queue(['https://example.net/', { url: 'https://example.org/' }]); // Queue multiple requests in different styles
12 changes: 6 additions & 6 deletions examples/override-function.js
@@ -2,18 +2,18 @@ const HCCrawler = require('headless-chrome-crawler');

(async () => {
const crawler = await HCCrawler.launch({
evaluatePage: (() => {
evaluatePage: () => {
throw new Error("Global functions won't be called");
}),
onSuccess: (result => {
},
onSuccess: result => {
console.log(`Got ${result.result.title} for ${result.options.url}.`);
}),
},
});
await crawler.queue({
url: 'https://example.com/',
evaluatePage: (() => ({
evaluatePage: () => ({
title: $('title').text(),
})),
}),
});
await crawler.onIdle();
await crawler.close();
4 changes: 2 additions & 2 deletions examples/pause-resume.js
@@ -4,9 +4,9 @@ const HCCrawler = require('headless-chrome-crawler');
const crawler = await HCCrawler.launch({
maxConcurrency: 1,
maxRequest: 2,
onSuccess: (result => {
onSuccess: result => {
console.log(`Requested ${result.options.url}.`);
}),
},
});
await crawler.queue('https://example.com/');
await crawler.queue('https://example.net/');
4 changes: 2 additions & 2 deletions examples/priority-queue.js
@@ -3,9 +3,9 @@ const HCCrawler = require('headless-chrome-crawler');
(async () => {
const crawler = await HCCrawler.launch({
maxConcurrency: 1,
onSuccess: (result => {
onSuccess: result => {
console.log(`Requested ${result.options.url}.`);
}),
},
});
await crawler.queue({ url: 'https://example.com/', priority: 1 });
await crawler.queue({ url: 'https://example.net/', priority: 2 }); // This queue is requested before the previous queue
4 changes: 2 additions & 2 deletions examples/redis-cache.js
@@ -5,9 +5,9 @@ const cache = new RedisCache({ host: '127.0.0.1', port: 6379 });

function launch(persistCache) {
return HCCrawler.launch({
onSuccess: (result => {
onSuccess: result => {
console.log(`Requested ${result.options.url}.`);
}),
},
cache,
persistCache, // Cache won't be cleared when closing the crawler if set to true
});
15 changes: 11 additions & 4 deletions lib/crawler.js
@@ -32,14 +32,18 @@ class Crawler {
/**
* @param {!Puppeteer.Page} page
* @param {!Object} options
* @param {!number} depth
* @param {string} previousUrl
*/
constructor(page, options) {
constructor(page, options, depth, previousUrl) {
this._page = page;
this._options = options;
this._depth = depth;
this._previousUrl = previousUrl;
}

/**
* @return {!Promise}
* @return {!Promise<!Object>}
*/
async crawl() {
await this._prepare();
@@ -57,6 +61,9 @@
this._collectLinks(response.url),
]);
return {
options: this._options,
depth: this._depth,
previousUrl: this._previousUrl,
response: this._reduceResponse(response),
redirectChain: this._getRedirectChain(response),
result,
@@ -248,7 +255,7 @@
}

/**
* @return {!Promise}
* @return {!Promise<!Buffer|!String>}
* @private
*/
async _screenshot() {
@@ -266,7 +273,7 @@

/**
* @param {!string} baseUrl
* @return {!Promise}
* @return {!Promise<!Array<!string>>}
* @private
*/
async _collectLinks(baseUrl) {
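With these changes, `crawl()` resolves to a self-describing object (`options`, `depth`, `previousUrl`, `response`, `redirectChain`, `result` and so on), which is what the `crawl` function handed to `customCrawl` returns to user code.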