Node.js web scraper with the ability to crawl
npm i https://github.com/An0nymusA/ScraperX
import { ScraperX } from 'scraperx';
There are three types of content we can extract from an element:
selector@text // gets the text content of the element
selector@html // gets the outer HTML of the element
selector@attr // gets the value of the attribute named in place of "attr" (e.g. a@href)
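For example, a minimal sketch using made-up HTML (the return values shown are assumptions, not documented output):
const sample = ScraperX.html('<a href="/about" class="nav">About us</a>');
sample.find('a.nav@text'); // "About us"
sample.find('a.nav@html'); // '<a href="/about" class="nav">About us</a>'
sample.find('a.nav@href'); // "/about"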
ScraperX.html('...').crawl(linkSelector, scope, selectors, maxPages?);
// The content of linkSelector is used to navigate to the next page
// The scraper crawls until maxPages is reached, or until no valid next link is found
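A minimal sketch of crawling a paginated listing, assuming a hypothetical site and selectors, and assuming crawl() resolves asynchronously to the combined items:
const posts = await (await ScraperX.url('https://example.com/blog')).crawl(
    'a.next@href',                        // linkSelector: where the next-page URL comes from
    'article.post',                       // scope: one result item per matched element
    { title: 'h2@text', link: 'a@href' }, // selectors: fields extracted inside each scope
    5                                     // maxPages: stop after 5 pages at the latest
);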
// The following calls are equivalent ways of extracting the same value
ScraperX.html('<p>Lorem Ipsum</p>').find('p@text');
(await ScraperX.url('https://google.com')).find('p@text');
ScraperX.find('<p>Lorem Ipsum</p>', 'p@text'); // shorthand for ScraperX.html(...).find(...)
ScraperX.$find('https://google.com', 'p@text'); // shorthand for ScraperX.url(...).find(...)
// If the parent selector is "div", every child selector is scoped to it (e.g. "div p.title")
ScraperX.html('...').find('div', {
title: 'p.title@text',
subtitle: 'p.subtitle@text',
link: 'a@href',
outerHtml: '&@html', // '&' refers to the matched parent element itself
});
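Presumably, this yields one object per matched div; the exact shape below is an assumption, not documented above:
// [
//   { title: '...', subtitle: '...', link: '...', outerHtml: '<div>...</div>' },
//   ...
// ]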
// Global filters (shared by all scraper instances)
ScraperX.globalFilters = {
trim: (str) => str.trim(),
replace: (str, search, replace) => str.replace(search, replace || ''),
};
// Instance-specific filters (apply only to this scraper)
scraper.setFilters({
trim: (str) => str.trim(),
replace: (str, search, replace) => str.replace(search, replace || ''),
});
// A filter can be applied in two ways:
// "selector | filter" or "selector | filter:arg1,arg2,..."
const url = 'https://example.com';
const html = '<div><span id="a">Span Text </span></div>'; // Can be used instead of a URL for testing
const scraper = ScraperX.html(html);
scraper.find('span#a'); // "Span Text " (without an @type, text content is returned)
scraper.find('span#a | trim'); // "Span Text"
scraper.find('span#a | replace:Span,Div'); // "Div Text "
const urlScraper = await ScraperX.url(url);
// ...
By default, all null items (items where every field is blank) are filtered out of the results.
// This filtering can be disabled by setting:
ScraperX.filterNonNullValues = false;
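A hedged sketch of the difference, assuming blank items come back with empty fields once filtering is disabled:
const page = ScraperX.html('<div><p class="title">A</p></div><div></div>');
page.find('div', { title: 'p.title@text' }); // [{ title: 'A' }] (blank second div is dropped)
ScraperX.filterNonNullValues = false;
page.find('div', { title: 'p.title@text' }); // assumed: [{ title: 'A' }, { title: '' }] (blank item kept)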