Interesting, but incomplete docs. #5

Open
data-envoy opened this issue Oct 13, 2023 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@data-envoy

Hey, this looks really cool.

Would it be possible to add some example usage, and maybe explain a bit more about what a scenario is in this context?

Thank you
@jelen07

jelen07 commented Oct 13, 2023

Hi @data-envoy, yeah @tg666 did a lot of good work here. The documentation is in progress and we are currently working on other interesting projects. Below I attach the work in progress. If you want, you can start a PR.

For better UI testing, I recommend using the CMP administration that Crawler integrates, see https://github.com/68publishers/consent-management-platform. It's better than sending requests via Postman etc.


Docs

Script Samples

The following scenarios crawl the first 100 URLs found on the homepage after clicking through the consent modal; one accepts all, the other rejects all:

The following scenarios do the same, but open the settings modal and select only the specified category:
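A minimal sketch of what an accept-all script could look like, assembled only from the actions and options documented below. The URL, the scene names, and the #cookie-modal and #accept-all selectors are hypothetical placeholders and would need to match the site being crawled. Passing an empty options object to collectCookies is an assumption based on the basic config structure.

```json
{
  "options": {
    "maxRequests": 100
  },
  "scenes": {
    "homepage": [
      { "action": "waitForSelector", "options": { "selector": "#cookie-modal", "visible": true } },
      { "action": "click", "options": { "selector": "#accept-all" } },
      { "action": "collectCookies", "options": {} },
      { "action": "enqueueLinks", "options": { "strategy": "same-hostname", "scene": "subpage", "limit": 100 } }
    ],
    "subpage": [
      { "action": "collectCookies", "options": {} }
    ]
  },
  "entrypoint": {
    "url": "https://www.example.com/",
    "scene": "homepage"
  }
}
```

A reject-all variant would presumably differ only in the selector passed to click.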

Basic Config Structure

{
  "options": {},
  "scenes": {
    "<scene name>": [
      {
        "action": "<action name>",
        "options": {}
      }
    ]
  },
  "entrypoint": {
    "url": "<url>",
    "scene": "<scene name>"
  }
}

Options

None of the options are required, but it is recommended to set at least maxRequests to avoid crawling the entire website. Here is a list of options:

A dot (.) in an option name indicates a nested object, e.g. viewport.width is the width key inside the viewport object.

| Option | Required | Default | Description |
|---|---|---|---|
| maxConcurrency | no | automatic | Maximum concurrency. The crawler scales the pool automatically, but its estimate is not always right, so this option sets an upper bound to avoid overloading the server. |
| maxRequests | no | - | Maximum number of requests. It can be slightly exceeded, because requests that have already started are allowed to finish after the limit is reached. |
| maxRequestRetries | no | 0 | How many times a single request can be retried if it fails or returns an error status code. |
| viewport.width | no | ? | Width of the viewport; the default value is unknown. |
| viewport.height | no | ? | Height of the viewport; the default value is unknown. |
| session.maxPoolSize | no | 1000 | Maximum number of browser sessions. |
| session.maxSessionUsageCount | no | - | Maximum number of times a single browser session can be used. |
| session.transferredCookies | no | - | Array of cookie names to be transferred from the previous request. |
| waitUntil | no | networkidle0 | One of the values "load", "domcontentloaded", "networkidle0", or "networkidle2". See https://pptr.dev/api/puppeteer.puppeteerlifecycleevent for details. |
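To show how the dotted option names map onto nested objects, here is a sketch of an options object. All values are arbitrary examples, and the cookie name "consent" is a hypothetical placeholder.

```json
{
  "options": {
    "maxConcurrency": 5,
    "maxRequests": 100,
    "maxRequestRetries": 2,
    "viewport": {
      "width": 1280,
      "height": 720
    },
    "session": {
      "maxPoolSize": 50,
      "transferredCookies": ["consent"]
    },
    "waitUntil": "networkidle2"
  }
}
```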

Scenes

A script consists of individual "scenes". One is always the main scene that is executed first - defined in the entrypoint. Others are used for instructions on subsequent sub-pages, or one scene can call another, etc.

A scene is essentially an array of actions.

Entrypoint

The entrypoint has two attributes, both required:

  1. url → where to start
  2. scene → the name of the scene to start with
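A small sketch of how scenes and the entrypoint fit together: the entrypoint names the "main" scene, which in turn calls a second scene through runScene. The URL, scene names, and selector are hypothetical.

```json
{
  "options": {},
  "scenes": {
    "main": [
      { "action": "runScene", "options": { "scene": "dismissBanner" } },
      { "action": "collectCookies", "options": {} }
    ],
    "dismissBanner": [
      { "action": "click", "options": { "selector": "#accept-all" } }
    ]
  },
  "entrypoint": {
    "url": "https://www.example.com/",
    "scene": "main"
  }
}
```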

Actions

click

click - clicks on an element

| Option | Required | Default | Description |
|---|---|---|---|
| selector | yes | - | CSS selector or XPath |
| button | no | "left" | "left" or "right" - mouse button |
| clickCount | no | 1 | Number of times to click |
| delay | no | 0 | Delay after each click |
| xpath | no | false | Specifies if the selector is XPath |
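For illustration, two click actions as they might appear inside a scene: the first uses a CSS selector, the second an XPath expression with xpath set to true. Both selectors are hypothetical.

```json
[
  { "action": "click", "options": { "selector": "button.cookie-settings" } },
  { "action": "click", "options": { "selector": "//button[contains(., 'Accept')]", "xpath": true } }
]
```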

clickWithRedirect

clickWithRedirect - clicks on an element (typically a link) and waits for redirection

| Option | Required | Default | Description |
|---|---|---|---|
| selector | yes | - | CSS selector or XPath |
| delay | no | 0 | Delay after click |
| xpath | no | false | Specifies if the selector is XPath |
| waitUntil | no | "networkidle0" | One of the values "load", "domcontentloaded", "networkidle0", or "networkidle2". See https://pptr.dev/api/puppeteer.puppeteerlifecycleevent for details. |
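A sketch of a clickWithRedirect action that follows a link and treats the navigation as finished once the DOM has loaded. The selector a.next-page is a hypothetical placeholder.

```json
{
  "action": "clickWithRedirect",
  "options": {
    "selector": "a.next-page",
    "waitUntil": "domcontentloaded"
  }
}
```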

collectCookies

collectCookies - collects cookies from the page, no options required

delay

delay - processing waits for the specified time before continuing

| Option | Required | Default | Description |
|---|---|---|---|
| delay | yes | - | Delay in milliseconds - 1000 = 1s |

enqueueLinks

enqueueLinks - finds links on the page and adds them to the request queue

| Option | Required | Default | Description |
|---|---|---|---|
| strategy | yes | - | One of the values "all", "same-hostname", "same-domain", "same-origin", or "manual" |
| scene | yes | - | Name of the scene to be executed after the page is loaded |
| urls | yes, if strategy is manual | - | Array of URLs, simply used for manually running specific URLs |
| selector | no | a | CSS selector, can be used to find links only within a specific element by setting it to .my-element, for example |
| exclude | no | [] | Array of regular expressions, if a link matches any of them, it is skipped |
| limit | no | - | Maximum number of links to be found |
| baseUrl | no | - | Base URL for relative links. Usually not necessary to set, as it resolves automatically from the original URL. However, if we were to navigate to a page from a completely different website (e.g., from google), it needs to be set, otherwise it would resolve as http://google.com/ |
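Two enqueueLinks variants, sketched under the assumption that both would sit inside a scene: the first queues up to 20 links on the same hostname, the second uses the manual strategy with an explicit URL list. The scene name "subpage" and the URLs are hypothetical.

```json
[
  {
    "action": "enqueueLinks",
    "options": {
      "strategy": "same-hostname",
      "scene": "subpage",
      "limit": 20
    }
  },
  {
    "action": "enqueueLinks",
    "options": {
      "strategy": "manual",
      "scene": "subpage",
      "urls": [
        "https://www.example.com/about",
        "https://www.example.com/contact"
      ]
    }
  }
]
```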

enqueueLinksByClicking

enqueueLinksByClicking - same purpose as enqueueLinks, but the links are added to the queue by clicking on them

| Option | Required | Default | Description |
|---|---|---|---|
| scene | yes | - | Name of the scene to be executed after the page is loaded |
| selector | yes | - | CSS selector of the link to be clicked |
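A sketch of enqueueLinksByClicking, which may be useful when the target URLs are only reachable through a click handler. The scene name "detail" and the selector are hypothetical.

```json
{
  "action": "enqueueLinksByClicking",
  "options": {
    "scene": "detail",
    "selector": ".product-list a.detail"
  }
}
```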

focus

focus - focuses on the specified element

| Option | Required | Default | Description |
|---|---|---|---|
| selector | yes | - | CSS selector of the element |

hover

hover - hovers the mouse over the specified selector

| Option | Required | Default | Description |
|---|---|---|---|
| selector | yes | - | CSS selector of the element |

keyboardPress

keyboardPress - presses a specific key

| Option | Required | Default | Description |
|---|---|---|---|
| key | yes | - | Key name, see https://pptr.dev/api/puppeteer.keyinput for available options |

runScene

runScene - executes another scene (a "sub-scenario")

| Option | Required | Default | Description |
|---|---|---|---|
| scene | yes | - | Name of the scene to execute |

runSceneConditionally

runSceneConditionally - executes a scene if a condition is met

| Option | Required | Default | Description |
|---|---|---|---|
| condition | yes | - | Object defining the condition |
| thenRun | yes | - | Name of the scene to execute if the condition is true |
| elseRun | no | - | Name of the scene to execute if the condition is false |

The condition has a structure similar to an action: { "type": "<condition name>", "options": {} }.

Currently, only one condition is implemented, see the Conditions section.
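A sketch of runSceneConditionally combined with the isElementVisible condition from the Conditions section. It assumes that "type" takes the condition name; the selector and the scene names "acceptAll" and "noBanner" are hypothetical.

```json
{
  "action": "runSceneConditionally",
  "options": {
    "condition": {
      "type": "isElementVisible",
      "options": { "selector": "#cookie-modal" }
    },
    "thenRun": "acceptAll",
    "elseRun": "noBanner"
  }
}
```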

screenshot

screenshot - takes a screenshot

| Option | Required | Default | Description |
|---|---|---|---|
| name | yes | - | Name of the screenshot (not the file name, but the displayed name) |

select

select - selects a value from a select box

| Option | Required | Default | Description |
|---|---|---|---|
| selector | yes | - | CSS selector of the element |
| value | yes | - | Value to be selected |

type

type - types a value into an input/textarea

| Option | Required | Default | Description |
|---|---|---|---|
| selector | yes | - | CSS selector of the element |
| value | yes | - | Value to be typed |
| delay | no | - | Delay between typing each character |
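The form-related actions (focus, type, select, keyboardPress) could be chained roughly as follows to fill in and submit a search form. The selectors and values are hypothetical; "Enter" is one of the key names listed in the Puppeteer KeyInput reference.

```json
[
  { "action": "focus", "options": { "selector": "input[name='q']" } },
  { "action": "type", "options": { "selector": "input[name='q']", "value": "cookies" } },
  { "action": "select", "options": { "selector": "select[name='lang']", "value": "en" } },
  { "action": "keyboardPress", "options": { "key": "Enter" } }
]
```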

waitForSelector

waitForSelector - waits until the specified element appears on the page. If it is not found before the timeout, processing of the page fails

| Option | Required | Default | Description |
|---|---|---|---|
| selector | yes | - | CSS selector of the element |
| hidden | no | false | If set to true, waits until the element becomes invisible (display: none or visibility: hidden) |
| visible | no | false | If set to true, waits until the element becomes visible |

Todo: the option to set the timeout is missing here, so the default of 30 seconds applies.
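A sketch of waitForSelector used on both sides of a click: first wait for a consent modal to become visible, then wait for it to disappear after rejecting. The #cookie-modal and #reject-all selectors are hypothetical.

```json
[
  { "action": "waitForSelector", "options": { "selector": "#cookie-modal", "visible": true } },
  { "action": "click", "options": { "selector": "#reject-all" } },
  { "action": "waitForSelector", "options": { "selector": "#cookie-modal", "hidden": true } }
]
```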

Other supported actions that are not used in CMP:

collectData

collectData - collects data from the page. The options are an object whose keys are the names of the individual values; under each key is an object with that value's options, e.g.

{
  "action": "collectData",
  "options": {
    "page_title": {
      /* ... */
    },
    "images": {
      /* ... */
    }
  }
}

For each key, the options are:

| Option | Required | Default | Description |
|---|---|---|---|
| strategy | yes | - | One of the values "static", "selector.innerText", or "selector.attribute" |
| value | yes, if the strategy is static | - | Any value to be saved |
| selector | yes, if the strategy is selector.innerText or selector.attribute | - | CSS selector |
| attribute | yes, if the strategy is selector.attribute | - | Name of the HTML attribute |
| multiple | no | false | Specifies if there are multiple values. If false, only the first matching element is used; if true, all matching elements are used and the values are stored as an array. |
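Filling in the per-key options, a collectData action covering all three strategies might look like this sketch. The key names, the selectors, and the static value are hypothetical.

```json
{
  "action": "collectData",
  "options": {
    "source": {
      "strategy": "static",
      "value": "homepage"
    },
    "page_title": {
      "strategy": "selector.innerText",
      "selector": "h1"
    },
    "images": {
      "strategy": "selector.attribute",
      "selector": "img",
      "attribute": "src",
      "multiple": true
    }
  }
}
```

With multiple set to true, "images" would be stored as an array of src values, while "page_title" would hold the text of the first h1 only.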

setIdentity

setIdentity - sets an "identity" for the collected data. It is always called before calling collectData

| Option | Required | Default | Description |
|---|---|---|---|
| strategy | yes | - | One of the values "static", "selector.innerText", or "selector.attribute" |
| identity | yes, if the strategy is static | - | Any value (identity) |
| selector | yes, if the strategy is selector.innerText or selector.attribute | - | CSS selector |
| attribute | yes, if the strategy is selector.attribute | - | Name of the HTML attribute |
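A sketch of setIdentity placed before collectData, here taking the identity from the page's canonical link. The selectors and the "title" key are hypothetical, and using the canonical URL as the identity is only one possible choice.

```json
[
  {
    "action": "setIdentity",
    "options": {
      "strategy": "selector.attribute",
      "selector": "link[rel='canonical']",
      "attribute": "href"
    }
  },
  {
    "action": "collectData",
    "options": {
      "title": { "strategy": "selector.innerText", "selector": "h1" }
    }
  }
]
```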

Conditions

isElementVisible

isElementVisible - returns true if the element exists on the page and is visible

| Option | Required | Default | Description |
|---|---|---|---|
| selector | yes | - | CSS selector |

@data-envoy
Author

data-envoy commented Oct 16, 2023

Oh wow, that looks super helpful! It should be enough to try things out. Sorry, I don't have time right now to make a PR, but maybe in the future. Thank you

tg666 added the documentation label Oct 18, 2023