Page hangs on function instead of redirecting #147

lime-n opened this issue Dec 10, 2022 · 0 comments
lime-n commented Dec 10, 2022

I am attempting an SSO login to a website (which I have access to) via scrapy-playwright, and I find that my Playwright script hangs when I use wait_for_function. It recursively produces the same network requests in the reactor, all of which are logged to the console. Eventually all tasks are left pending -- example output:

....

task: <Task pending name='Task-88505' coro=<_make_request_logger.<locals>._log_request() running at /Users//tealium_playwright/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py:463> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[AsyncIOEventEmitter._emit_run.<locals>.callback() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/pyee/asyncio.py:65, ProtocolCallback.__init__.<locals>.cb() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py:168]>
2022-12-10 21:06:09 [asyncio] ERROR: Task was destroyed but it is pending!
task: <Task pending name='Task-88624' coro=<_make_request_logger.<locals>._log_request() running at /Users//tealium_playwright/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py:463> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[AsyncIOEventEmitter._emit_run.<locals>.callback() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/pyee/asyncio.py:65, ProtocolCallback.__init__.<locals>.cb() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py:168]>

I have attempted the following script:

import scrapy
from scrapy_playwright.page import PageMethod
from pathlib import Path
from urllib.parse import urlencode

class telSpider(scrapy.Spider):
    name = 'tel'
    start_urls = 'https://my.tealiumiq.com/login/sso/'

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Content-Type': 'application/json',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            self.start_urls,
            meta = dict(
                    playwright = True,
                    playwright_include_page = True,
                    playwright_page_methods = [
                        PageMethod('wait_for_selector', selector = '.bodyMain', state='attached'),
                        PageMethod('wait_for_function', """(function() {
                            const setValue = Object.getOwnPropertyDescriptor(
                                window.HTMLInputElement.prototype, "value").set;
                            const modifyInput = (name, value) => {
                                const input = document.getElementsByName(name)[0];
                                setValue.call(input, value);
                                input.dispatchEvent(new Event('input', { bubbles: true }));
                            };
                            modifyInput('email', "[email protected]");
                            document.querySelector("#submitBtn").click();
                            setTimeout(() => {
                                if (window.location.href.includes('https://okta.com/login/login.htm')) {
                                    console.log(window.location.href);
                                } else {
                                    console.log('not yet');
                                }
                            }, 5000);
                        }())""", timeout=0),
                        PageMethod("screenshot", path=Path(__file__).parent / "tealium1.png", full_page=True),
                        ]),
                callback = self.parse)
 
    def parse(self, response):
        print(response)

Email me for a working email to test. However, when I replace wait_for_function with evaluate and use the same script, I find that only the first statement runs and the click is never triggered; otherwise I would see red text under the input indicating the email is incorrect. Any idea why this might be happening?

P.S.
It works absolutely fine in the browser console.
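For context, a minimal sketch of how the two calls differ (the variable names, the placeholder address, and the Okta URL fragment are my assumptions, not tested against the real site): wait_for_function polls its expression until it returns a truthy value, so an IIFE that returns undefined can never resolve, whereas evaluate runs exactly once.

```python
# Hypothetical sketch: split the one-shot DOM work (evaluate) from the
# polled redirect check (wait_for_function).

# Run once via PageMethod("evaluate", fill_and_submit): set the email
# field through the native value setter, then click submit.
fill_and_submit = """
() => {
    const setValue = Object.getOwnPropertyDescriptor(
        window.HTMLInputElement.prototype, "value").set;
    const input = document.getElementsByName("email")[0];
    setValue.call(input, "user@example.com");  // placeholder address
    input.dispatchEvent(new Event("input", { bubbles: true }));
    document.querySelector("#submitBtn").click();
}
"""

# Poll via PageMethod("wait_for_function", redirect_predicate, timeout=30_000):
# the arrow function returns a boolean, so the wait can actually finish
# once the redirect lands.
redirect_predicate = "() => window.location.href.includes('okta.com')"
```

The two strings would replace the single wait_for_function PageMethod in the spider above, keeping the form interaction out of the polled expression.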

--
I eventually got it working by including multiple wait_for_timeout calls, which worked better than wait_for_function; however, I would be interested to know why the latter keeps the crawler looping inside the reactor with unfinished tasks.
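A rough sketch of the timeout-based workaround described above. Each tuple stands in for a `PageMethod(name, *args, **kwargs)` from `scrapy_playwright.page`; the step order and durations are my guesses, not the reporter's exact values.

```python
# Hypothetical sketch of the wait_for_timeout workaround: fixed pauses
# between steps instead of a polled predicate. Tuples are
# (method_name, positional_arg, keyword_args).
workaround_steps = [
    ("wait_for_selector", ".bodyMain", {"state": "attached"}),
    ("evaluate", "/* fill the email field and click #submitBtn */", {}),
    ("wait_for_timeout", 5000, {}),  # let the submit and redirect begin
    ("wait_for_timeout", 5000, {}),  # let the SSO login page render
    ("screenshot", None, {"path": "tealium1.png", "full_page": True}),
]
```

The trade-off is that fixed pauses always wait the full duration, whereas a truthy wait_for_function predicate would return as soon as the redirect lands.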

But then I get the following page indicating the CSRF token is invalid, so the cookies were not set up properly. What do you advise? When I attempted this with scrapy-splash, it redirected back to the original page (which it should not); it seems to be a matter of assigning cookies correctly, so your advice would be very helpful!
[screenshot: tealium3]
