COWS - Corion's Own Web Scraper
use 5.020;
use COWS 'scrape';

my $html = '...';
my $rules = [
    {
        name  => 'link',
        query => 'a@href',
        munge => ['absolute'],
    },
];

my $res = scrape( $html, $rules, { url => 'https://example.com/' } );
say "url: $_" for $res->{link}->@*;
scrape( $html, $rules, $options );
Each query is a hashref that accepts the following keys:
-
name
The name of this query. This will be used as a key in the resulting hashref.
-
query
The CSS selector or XPath query to search for.
-
fields
Queries that should be matched below this node. The results will get merged into a hashref. Duplicate names are not allowed (duh).
fields => [
    { name => 'foo', ... },
    { name => 'bar', ... },
    { name => 'baz', ... },
]
results in
{ foo => ..., bar => ..., baz => ... }
-
debug
Output progress while stepping through this query. This is convenient for finding why a specific query doesn't result in what you think it should.
-
single
Expect only a single item; the result will be the value of the query. If the query has a fields key, the result will be a hashref of the fields. If single is not given, the result will always be an arrayref.
-
index
Use the n-th node as result. The result will always be a hashref or scalar value. This could be done in XPath but sometimes it's easier to do it here.
-
discard
Do not use this intermediate value but replace it by the arrayref of its fields value.
-
html
Include the value of this node as an HTML string.
-
munge
Apply the functions in the listed order to the value.
-
tag
tag => 'foo=bar'
tag => 'baz:bat'
Add the key/value to the resulting hash. If the separator is =, the result will be a plain scalar:
foo => 'bar',
If the separator is :, there can be multiple tags and they are collected in an arrayref:
baz => [ 'bat' ],
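To show how several of the keys above might combine, here is a minimal sketch of a nested rule set. The key names (name, query, fields, single, munge) come from the list above. The selectors, the field names and the assumption that a@title works the same way as a@href in the synopsis are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical rule set: key names are taken from the list above,
# selectors and field names are invented for this example.
my $rules = [
    {
        name   => 'items',
        query  => 'li.item',
        fields => [
            # single => 1: expect one value per item instead of an arrayref
            { name => 'url',   query => 'a@href',  single => 1, munge => ['absolute'] },
            # assumes a@title selects the title attribute like a@href selects href
            { name => 'title', query => 'a@title', single => 1 },
        ],
    },
];

# Passed to scrape() as in the synopsis:
#   my $res = scrape( $html, $rules, { url => 'https://example.com/' } );
```

Going by the descriptions above, $res->{items} should then be an arrayref holding one hashref per matched li.item, each with a url and a title key. This has not been checked against the module, so treat the shape as an expectation rather than a guarantee.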