take a screenshot in process_spider_exception #309

Open
blacksteel1288 opened this issue Jul 28, 2024 · 1 comment

@blacksteel1288

Is there a way to take a screenshot for a process_spider_exception error?

I can't figure out how to access the page object in that middleware.

@elacuesta
Member

elacuesta commented Aug 1, 2024

Interesting question. I've encountered two limitations while trying to make this work:

  1. Scrapy allows process_spider_output methods in spider middlewares to be defined as coroutines, but it does not support async def process_spider_exception methods. This is why I'm using asyncio.create_task in my example below.
  2. We need to allow some time for the screenshot to be taken; otherwise the spider may run out of requests to make and be closed. Closing the spider closes the download handler, which closes the browser and causes a TargetClosedError: 'Page.screenshot: Target page, context or browser has been closed' exception. This can be handled by connecting to the spider_idle signal and raising DontCloseSpider until the screenshot has been taken.

Full example:

import asyncio
import logging

import scrapy
from playwright.async_api import Page
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.exceptions import DontCloseSpider


class HandleExceptionMiddleware:
    @classmethod
    def from_crawler(cls, crawler: Crawler):
        return cls(crawler)

    def __init__(self, crawler: Crawler) -> None:
        # Keep the spider alive until the screenshot has been taken
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)
        self.screenshot_taken = asyncio.Event()

    def spider_idle(self, spider):
        # Prevent the spider from closing (which would close the browser)
        # while the screenshot task is still pending
        if not self.screenshot_taken.is_set():
            raise DontCloseSpider()

    def process_spider_exception(self, response, exception, spider):
        logging.info("Caught exception: %s", exception.__class__)
        # The page object is available here because the request was sent
        # with playwright_include_page=True
        page: Page = response.meta["playwright_page"]
        # async def process_spider_exception is not supported, so schedule
        # the screenshot as a task on the running asyncio event loop instead
        asyncio.create_task(self.take_screenshot(page=page))
        return []

    async def take_screenshot(self, page: Page):
        await page.screenshot(path="example_exception.png", full_page=True)
        self.screenshot_taken.set()  # let spider_idle allow the spider to close
        await page.close()


class HandleExceptionSpider(scrapy.Spider):
    name = "exception"
    custom_settings = {
        "SPIDER_MIDDLEWARES": {HandleExceptionMiddleware: 100},
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta={"playwright": True, "playwright_include_page": True},
        )

    def parse(self, response, **kwargs):
        logging.info("Received response for %s", response.url)
        1 / 0  # deliberately raise ZeroDivisionError to trigger the middleware
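
The spider above can be run with scrapy runspider, or from a plain script via CrawlerProcess. A minimal runner sketch, assuming the middleware and spider live in the same file; passing TWISTED_REACTOR to CrawlerProcess as well is an assumption to make sure the asyncio reactor is installed before the crawl starts:

if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess

    # Assumption: installing the asyncio reactor explicitly at process
    # level, in addition to the spider's custom_settings
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        }
    )
    process.crawl(HandleExceptionSpider)
    process.start()  # blocks until the crawl finishes

When parse raises ZeroDivisionError, the middleware catches it, takes the screenshot, and only then lets the spider close, so example_exception.png should be written before the process exits.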
