Interesting, but incomplete docs. #5

data-envoy · 2023-10-13T16:42:16Z

Hey, this looks really cool.

Would it be possible to add some example usage, and maybe explain a bit more about what a scenario is in this context?

Thank you

jelen07 · 2023-10-13T19:20:11Z

Hi @data-envoy, yeah @tg666 did a lot of good work here. The documentation is in progress and we are currently working on other interesting projects. Below I attach the work in progress. If you want, you can start a PR.

For better UI testing, I recommend using the CMP administration that Crawler integrates, see https://github.com/68publishers/consent-management-platform. It's better than sending requests via Postman etc.

Docs

Script Samples

The following scenarios visit the first 100 URLs found on the homepage after handling the consent modal. One accepts all, the other rejects all:
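
As a rough sketch of what the accept-all variant could look like, built only from the options and actions documented below. The selectors, the cookie name, the scene names and the URL are all hypothetical placeholders, and passing an empty "options" object to collectCookies is an assumption:

```json
{
  "options": {
    "maxRequests": 150,
    "session": {
      "transferredCookies": ["example-consent-cookie"]
    }
  },
  "scenes": {
    "startup": [
      { "action": "waitForSelector", "options": { "selector": "#example-consent-modal", "visible": true } },
      { "action": "click", "options": { "selector": "#example-accept-all" } },
      { "action": "collectCookies", "options": {} },
      { "action": "enqueueLinks", "options": { "strategy": "same-hostname", "scene": "subpage", "limit": 100 } }
    ],
    "subpage": [
      { "action": "collectCookies", "options": {} }
    ]
  },
  "entrypoint": {
    "url": "https://www.example.com/",
    "scene": "startup"
  }
}
```

The reject-all variant would presumably differ only in the selector passed to click. Here maxRequests acts as a loose safety cap, while limit on enqueueLinks is what restricts the crawl to 100 links.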

The following scenarios do the same, but open the settings modal and select only the specified category:
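
A possible shape for the starting scene of such a scenario, again as a sketch. Every selector is a hypothetical placeholder for whatever the consent widget on the target site uses, and a scene named "subpage" is assumed to be defined elsewhere under "scenes":

```json
[
  { "action": "waitForSelector", "options": { "selector": "#example-consent-modal", "visible": true } },
  { "action": "click", "options": { "selector": "#example-open-settings" } },
  { "action": "waitForSelector", "options": { "selector": "#example-settings-modal", "visible": true } },
  { "action": "click", "options": { "selector": "input[name='example-category-analytics']" } },
  { "action": "click", "options": { "selector": "#example-save-settings" } },
  { "action": "collectCookies", "options": {} },
  { "action": "enqueueLinks", "options": { "strategy": "same-hostname", "scene": "subpage", "limit": 100 } }
]
```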

Basic Config Structure

{
  "options": {},
  "scenes": {
    "<scene name>": [
      {
        "action": "<action name>",
        "options": {}
      }
    ]
  },
  "entrypoint": {
    "url": "<url>",
    "scene": "<scene name>"
  }
}

Options

None of the options are required, but it is recommended to set at least maxRequests to avoid crawling the entire website. Here is a list of options:

A dot (.) in an option name indicates a nested object.

Option	Required	Default	Description
maxConcurrency	no	automatic	Upper limit on concurrency. The crawler scales the pool automatically, but its estimate is not always right, so this option caps it to a level the server can handle.
maxRequests	no	-	Maximum number of requests. It can be slightly exceeded as already started requests are allowed to finish after reaching the limit.
maxRequestRetries	no	0	How many times a single request can be retried if it fails or returns an error status code.
viewport.width	no	?	Width of the viewport, default value is unknown.
viewport.height	no	?	Height of the viewport, default value is unknown.
session.maxPoolSize	no	1000	Maximum number of browser sessions.
session.maxSessionUsageCount	no	-	Maximum number of times a single browser session can be used.
session.transferredCookies	no	-	Array of cookie names to be transferred from the previous request.
waitUntil	no	networkidle0	One of the values "load", "domcontentloaded", "networkidle0", or "networkidle2". See https://pptr.dev/api/puppeteer.puppeteerlifecycleevent for details.
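
To make the dot notation concrete, here is an "options" object that sets every option from the table. The numbers and the cookie name are arbitrary illustrative values, not recommendations:

```json
{
  "options": {
    "maxConcurrency": 5,
    "maxRequests": 100,
    "maxRequestRetries": 2,
    "waitUntil": "networkidle2",
    "viewport": {
      "width": 1280,
      "height": 720
    },
    "session": {
      "maxPoolSize": 10,
      "maxSessionUsageCount": 50,
      "transferredCookies": ["example-consent-cookie"]
    }
  }
}
```

Note that viewport.width becomes "width" inside a "viewport" object, and likewise for the session.* options.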

Scenes

A script consists of individual "scenes". One is always the main scene that is executed first - defined in the entrypoint. Others are used for instructions on subsequent sub-pages, or one scene can call another, etc.

A scene is essentially an array of actions.

Entrypoint

The entrypoint has only two required attributes:

url → where to start
scene → the name of the scene to start with
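
A small sketch of how scenes and the entrypoint fit together: the entrypoint starts the "homepage" scene, which then hands every enqueued link over to the "article" scene. The scene names, the URL and the screenshot name are hypothetical:

```json
{
  "options": {
    "maxRequests": 20
  },
  "scenes": {
    "homepage": [
      { "action": "collectCookies", "options": {} },
      { "action": "enqueueLinks", "options": { "strategy": "same-hostname", "scene": "article", "limit": 10 } }
    ],
    "article": [
      { "action": "screenshot", "options": { "name": "Article page" } }
    ]
  },
  "entrypoint": {
    "url": "https://www.example.com/",
    "scene": "homepage"
  }
}
```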

Actions

click

click - clicks on an element

Option	Required	Default	Description
selector	yes	-	CSS selector or XPath expression
button	no	"left"	"left" or "right" - mouse button
clickCount	no	1	Number of times to click
delay	no	0	Delay after each click
xpath	no	false	Set to true if the selector is an XPath expression
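
For illustration, two click actions, one using a CSS selector and one using XPath. Both selectors are hypothetical:

```json
[
  { "action": "click", "options": { "selector": "#example-menu-toggle" } },
  { "action": "click", "options": { "selector": "//button[contains(., 'Accept')]", "xpath": true } }
]
```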

clickWithRedirect

clickWithRedirect - clicks on an element (typically a link) and waits for redirection

Option	Required	Default	Description
selector	yes	-	CSS selector or XPath expression
delay	no	0	Delay after click
xpath	no	false	Set to true if the selector is an XPath expression
waitUntil	no	"networkidle0"	One of the values "load", "domcontentloaded", "networkidle0", or "networkidle2". See https://pptr.dev/api/puppeteer.puppeteerlifecycleevent for details.
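
A sketch of following a link and waiting only for the DOM to be ready rather than for the network to go idle. The selector is hypothetical:

```json
{
  "action": "clickWithRedirect",
  "options": {
    "selector": "a.example-next-page",
    "waitUntil": "domcontentloaded"
  }
}
```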

collectCookies

collectCookies - collects cookies from the page, no options required

delay

delay - processing waits for the specified time before continuing

Option	Required	Default	Description
delay	yes	-	Delay in milliseconds - 1000 = 1s

enqueueLinks

enqueueLinks - finds links on the page and adds them to the request queue

Option	Required	Default	Description
strategy	yes	-	One of the values "all", "same-hostname", "same-domain", "same-origin", or "manual"
scene	yes	-	Name of the scene to be executed after the page is loaded
urls	yes, if strategy is manual	-	Array of URLs to enqueue; used to run specific URLs manually
selector	no	a	CSS selector used to find links. It can restrict the search to a specific element, for example by setting it to .my-element
exclude	no	[]	Array of regular expressions; a link that matches any of them is skipped
limit	no	-	Maximum number of links to be found
baseUrl	no	-	Base URL for relative links. Usually not necessary to set, as it resolves automatically from the original URL. However, when navigating to a page from a completely different website (e.g., from google), it needs to be set, otherwise relative links would resolve against http://google.com/
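
Two hedged examples, one automatic and one manual. The scene name "subpage" and the URLs are hypothetical, the selector reuses the .my-element example from the table, and writing the exclude patterns as plain strings without delimiters is an assumption about the expected format:

```json
[
  {
    "action": "enqueueLinks",
    "options": {
      "strategy": "same-hostname",
      "scene": "subpage",
      "selector": ".my-element",
      "exclude": ["\\.pdf$", "/logout"],
      "limit": 50
    }
  },
  {
    "action": "enqueueLinks",
    "options": {
      "strategy": "manual",
      "scene": "subpage",
      "urls": [
        "https://www.example.com/contact",
        "https://www.example.com/about"
      ]
    }
  }
]
```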

enqueueLinksByClicking

enqueueLinksByClicking - same purpose as enqueueLinks, but the links are added to the queue by clicking on them

Option	Required	Default	Description
scene	yes	-	Name of the scene to be executed after the page is loaded
selector	yes	-	CSS selector of the link to be clicked

focus

focus - focuses on the specified element

Option	Required	Default	Description
selector	yes	-	CSS selector of the element

hover

hover - hovers the mouse over the specified selector

Option	Required	Default	Description
selector	yes	-	CSS selector of the element

keyboardPress

keyboardPress - presses a specific key

Option	Required	Default	Description
key	yes	-	Key name, see https://pptr.dev/api/puppeteer.keyinput for available options

runScene

runScene - runs another scene (a "sub-scenario")

Option	Required	Default	Description
scene	yes	-	Name of the scene to run

runSceneConditionally

runSceneConditionally - runs a scene if a condition is met

Option	Required	Default	Description
condition	yes	-	Object defining the condition
thenRun	yes	-	Name of the scene to run if the condition is true
elseRun	no	-	Name of the scene to run if the condition is false

The condition is structured like an action: { "type": "<condition name>", "options": {} }.

Currently, only one condition is implemented, see the Conditions section.
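
Putting this together, a sketch of a "scenes" object that accepts the consent modal only when it is visible and then continues with a shared scene via runScene. The scene names and selectors are hypothetical:

```json
{
  "scenes": {
    "startup": [
      {
        "action": "runSceneConditionally",
        "options": {
          "condition": {
            "type": "isElementVisible",
            "options": { "selector": "#example-consent-modal" }
          },
          "thenRun": "acceptConsent",
          "elseRun": "collect"
        }
      }
    ],
    "acceptConsent": [
      { "action": "click", "options": { "selector": "#example-accept-all" } },
      { "action": "runScene", "options": { "scene": "collect" } }
    ],
    "collect": [
      { "action": "collectCookies", "options": {} }
    ]
  }
}
```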

screenshot

screenshot - takes a screenshot

Option	Required	Default	Description
name	yes	-	Name of the screenshot (not the file name, but the displayed name)

select

select - selects a value from a select box

Option	Required	Default	Description
selector	yes	-	CSS selector of the element
value	yes	-	Value to be selected

type

type - types a value into an input/textarea

Option	Required	Default	Description
selector	yes	-	CSS selector of the element
value	yes	-	Value to be typed
delay	no	-	Delay between typing each character
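
The form-related actions (focus, type, select, keyboardPress) might be combined like this to fill in and submit a search form. The selectors and values are hypothetical; "Enter" is one of the key names listed in the Puppeteer KeyInput reference:

```json
[
  { "action": "focus", "options": { "selector": "#example-search-input" } },
  { "action": "type", "options": { "selector": "#example-search-input", "value": "cookies" } },
  { "action": "select", "options": { "selector": "#example-language", "value": "en" } },
  { "action": "keyboardPress", "options": { "key": "Enter" } }
]
```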

waitForSelector

waitForSelector - waits until the specified element appears on the page. If it is not found before the timeout, processing of the page fails

Option	Required	Default	Description
selector	yes	-	CSS selector of the element
hidden	no	false	If set to true, waits until the element becomes invisible (display: none or visibility: hidden)
visible	no	false	If set to true, waits until the element becomes visible

TODO: the option for setting the timeout is missing from this description; by default the timeout is 30 seconds.
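
A typical use of hidden might be to wait for a modal to disappear after dismissing it, sketched here with hypothetical selectors:

```json
[
  { "action": "click", "options": { "selector": "#example-accept-all" } },
  { "action": "waitForSelector", "options": { "selector": "#example-consent-modal", "hidden": true } }
]
```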

Other supported actions that are not used in CMP:

collectData

collectData - collects data from the page. The options are an object whose keys are the names of the individual values; each key holds an object with the options for that value, e.g.

{
  "action": "collectData",
  "options": {
    "page_title": {
      /* ... */
    },
    "images": {
      /* ... */
    }
  }
}

For each key, the options are:

Option	Required	Default	Description
strategy	yes	-	One of the values "static", "selector.innerText", or "selector.attribute"
value	yes, if the strategy is static	-	Any value to be saved
selector	yes, if the strategy is selector.innerText or selector.attribute	-	CSS selector
attribute	yes, if the strategy is selector.attribute	-	Name of the HTML attribute
multiple	no	false	Specifies whether there are multiple values. If false, only the first matching element is used; if true, all matching elements are used and the value is stored as an array.
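
Filling in the placeholders from the skeleton above, one value per strategy. The key names, the static value and the selectors are only illustrative:

```json
{
  "action": "collectData",
  "options": {
    "source": {
      "strategy": "static",
      "value": "example-crawl"
    },
    "page_title": {
      "strategy": "selector.innerText",
      "selector": "h1"
    },
    "images": {
      "strategy": "selector.attribute",
      "selector": "img",
      "attribute": "src",
      "multiple": true
    }
  }
}
```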

setIdentity

setIdentity - sets an "identity" for the collected data. It should always be called before collectData

Option	Required	Default	Description
strategy	yes	-	One of the values "static", "selector.innerText", or "selector.attribute"
identity	yes, if the strategy is static	-	Any value (identity)
selector	yes, if the strategy is selector.innerText or selector.attribute	-	CSS selector
attribute	yes, if the strategy is selector.attribute	-	Name of the HTML attribute
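
A sketch of the two actions used together, taking the identity from the page's canonical link. Choosing the canonical URL as the identity is just an assumption for illustration, as are the selectors:

```json
[
  {
    "action": "setIdentity",
    "options": {
      "strategy": "selector.attribute",
      "selector": "link[rel='canonical']",
      "attribute": "href"
    }
  },
  {
    "action": "collectData",
    "options": {
      "page_title": { "strategy": "selector.innerText", "selector": "h1" }
    }
  }
]
```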

Conditions

isElementVisible

isElementVisible - returns true if the element exists on the page and is visible

Option	Required	Default	Description
selector	yes	-	CSS selector

data-envoy · 2023-10-16T12:01:52Z

Oh wow, that looks super helpful! It should be enough to try things out. Sorry, I don't have time right now to make a PR, but maybe in the future. Thank you

tg666 added the "documentation" label (Improvements or additions to documentation) · Oct 18, 2023
