Firefox does not work with proxy. #320

Open
bboyadao opened this issue Sep 20, 2024 · 3 comments
@bboyadao

I just created an example spider. Chromium works fine, but with the setup below Firefox raises NS_ERROR_PROXY_CONNECTION_REFUSED:

playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED

I debugged ScrapyPlaywrightDownloadHandler._maybe_launch_browser and inspected launch_options:

async def _maybe_launch_browser(self) -> None:
    async with self.browser_launch_lock:
        if not hasattr(self, "browser"):
            logger.info("Launching browser %s", self.browser_type.name)
            self.browser = await self.browser_type.launch(**self.config.launch_options)
            logger.info("Browser %s launched", self.browser_type.name)
            self.stats.inc_value("playwright/browser_count")
            self.browser.on("disconnected", self._browser_disconnected_callback)

I copied the same options into a standalone Playwright script to test, and it works:

example_spider.py

import scrapy
from rich import print


class ExampleSpider(scrapy.Spider):
    name = "ex"
    start_urls = ["https://httpbin.org/get"]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_BROWSER_TYPE": "firefox",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "timeout": 20 * 1000,
            "proxy": {
                "server": "127.0.0.1:8888",
                "username": "username",
                "password": "password",
            },
        },
    }
    
    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_detail,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_context_kwargs=dict(
                    java_script_enabled=True,
                    ignore_https_errors=True,
                ),
            )
        )
    
    async def parse_detail(self, response):
        print(f"Received response from {response.url}")
        yield {}

test_with_playwright.py

import asyncio

from playwright.async_api import async_playwright


async def run_playwright_with_proxy():
    kwargs = {
        "headless": False,
        "timeout": 20000,
        "proxy": {
            "server": "127.0.0.1:8888",
            "username": "username",
            "password": "password",
        },
    }
    
    async with async_playwright() as p:
        browser = await p.firefox.launch(**kwargs)
        page = await browser.new_page()
        await page.goto("https://httpbin.org/get")
        await asyncio.sleep(100)
        print("Page Title:", await page.title())
        await browser.close()


if __name__ == "__main__":
    asyncio.run(run_playwright_with_proxy())
@elacuesta
Member

I cannot reproduce with mitmproxy:

$ mitmproxy --proxyauth "user:pass"

[Screenshot at 2024-09-23 10-21-46]

Slightly adapted sample spider:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "ex"
    start_urls = ["https://httpbin.org/get"]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "firefox",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "timeout": 20 * 1000,
            "proxy": {
                "server": "127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            },
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_detail,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_context_kwargs=dict(
                    java_script_enabled=True,
                    ignore_https_errors=True,
                ),

            )
        )

    async def parse_detail(self, response):
        print(f"Received response from {response.url}")
        page = response.meta["playwright_page"]
        await page.close()
$ scrapy runspider proxy.py
(...)
2024-09-23 10:21:22 [scrapy.core.engine] INFO: Spider opened
2024-09-23 10:21:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-23 10:21:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-23 10:21:22 [scrapy-playwright] INFO: Starting download handler
2024-09-23 10:21:22 [scrapy-playwright] INFO: Starting download handler
2024-09-23 10:21:27 [scrapy-playwright] INFO: Launching browser firefox
2024-09-23 10:21:27 [scrapy-playwright] INFO: Browser firefox launched
2024-09-23 10:21:27 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>
2024-09-23 10:21:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']
Received response from https://httpbin.org/get
2024-09-23 10:21:29 [scrapy.core.engine] INFO: Closing spider (finished)
(...)

Which proxy are you using? Perhaps this is an interaction with that specific provider.

@bboyadao
Author


2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>
2024-09-23 10:21:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']
Received response from https://httpbin.org/get

I have some thoughts:

  • It looks like Scrapy got a 407 at first.
  • The next request was handled by Playwright.

In my case, Scrapy gets the 407 and then marks the request as a failure.

I use https://scrapoxy.io to manage proxies.
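
An aside on diagnosing the 407 (not from the original exchange): a 407 means the proxy rejected, or never received, the credentials. Independent of any browser, one can compute the exact Basic Proxy-Authorization value the proxy should accept and compare it against what the proxy logs as received. A minimal sketch, using the placeholder credentials from the spider above:

```python
import base64


def basic_proxy_auth_header(username: str, password: str) -> str:
    # HTTP Basic proxy auth: base64 of "user:pass", sent in the
    # Proxy-Authorization request header (RFC 7617).
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"


print(basic_proxy_auth_header("username", "password"))
# → Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```

If the proxy never logs this header arriving, the browser is failing before authentication rather than being rejected by it.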

@elacuesta
Member

  • It looks like Scrapy got a 407 at first.
  • The next request was handled by Playwright.

All requests were routed through Playwright, notice the "scrapy-playwright" logger name:

2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>

The provided spider works correctly with Scrapoxy. I started it as described in their docs and got the following logs. There is a failure downloading the response, but that is expected because I did not add an actual proxy provider in the Scrapoxy configuration.

2024-09-24 10:53:10 [scrapy.core.engine] INFO: Spider opened
2024-09-24 10:53:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-24 10:53:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-24 10:53:10 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:10 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:15 [scrapy-playwright] INFO: Launching browser firefox
2024-09-24 10:53:16 [scrapy-playwright] INFO: Browser firefox launched
2024-09-24 10:53:16 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Response: <557 https://httpbin.org/get>
2024-09-24 10:53:17 [scrapy.core.engine] DEBUG: Crawled (557) <GET https://httpbin.org/get> (referer: None) ['playwright']
2024-09-24 10:53:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <557 https://httpbin.org/get>: HTTP status code is not handled or not allowed
2024-09-24 10:53:17 [scrapy.core.engine] INFO: Closing spider (finished)

However, if I pass incorrect credentials I do get the reported message:

2024-09-24 10:53:37 [scrapy.core.engine] INFO: Spider opened
2024-09-24 10:53:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-24 10:53:37 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-24 10:53:37 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:37 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:42 [scrapy-playwright] INFO: Launching browser firefox
2024-09-24 10:53:42 [scrapy-playwright] INFO: Browser firefox launched
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:43 [scrapy.core.scraper] ERROR: Error downloading <GET https://httpbin.org/get>
Traceback (most recent call last):
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1999, in _inlineCallbacks
    result = context.run(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/python/failure.py", line 519, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1251, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 378, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 431, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 460, in _download_request_with_page
    response, download = await self._get_response_and_download(request, page, spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 560, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 8805, in goto
    await self._impl_obj.goto(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_page.py", line 524, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED
Call log:
navigating to "https://httpbin.org/get", waiting until "load"

2024-09-24 10:53:43 [scrapy.core.engine] INFO: Closing spider (finished)
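
One avenue not tried in this thread: scrapy-playwright also accepts proxy settings at the browser-context level via the PLAYWRIGHT_CONTEXTS setting (context kwargs are forwarded to Browser.new_context(), which takes a proxy argument), which can help isolate whether the failure is specific to launch-level proxy handling in Firefox. A sketch of the settings only, reusing the placeholder credentials from the report:

```python
# Sketch: same spider settings as above, but the proxy moves from
# PLAYWRIGHT_LAUNCH_OPTIONS into the default browser context.
custom_settings = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "PLAYWRIGHT_BROWSER_TYPE": "firefox",
    "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": False, "timeout": 20 * 1000},
    "PLAYWRIGHT_CONTEXTS": {
        "default": {
            "proxy": {
                "server": "127.0.0.1:8888",
                "username": "username",
                "password": "password",
            },
        },
    },
}
```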
