Running different requests with different crawlers? #573
I'm trying to solve a situation where I want to make the initial request with a plain crawler (because it's an API or something), but continue with subsequent requests to detail pages with a BS4 crawler (because they're regular HTML pages). A specific example would be requesting an RSS feed, having a default handler with a feedparser instead of BS4 (i.e. just …).

Since the type of the crawler is set kinda globally for the whole program, I don't know how to do this. It would make more sense to be able to specify how the response gets parsed per handler or per request. I can imagine scrapers where I want to start with BS4, but then jump to Playwright for product detail pages or if BS4 fails to deliver.

What's the best approach to switch crawler types on the fly like this?
Replies: 2 comments 3 replies
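Well, the best approach to this would be to have separate `RequestQueue` instances for the separate crawlers and to add requests directly to the queue of the right crawler in your request handlers.
We are aware of this shortcoming though, and we'd like to enable running multiple interconnected crawlers in the future.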
Worth noting on this thread. Looks like the team have now implemented it.
Well, the best approach to this would be to have separate `RequestQueue` instances for the separate crawlers and to add requests directly to the queue of the right crawler in your request handlers. There are, however, some challenges:

- `RequestQueue.open()` will always resolve to the same unnamed queue, so each crawler needs its own named queue. This may or may not be a problem if you're running on Apify. Locally, you'll probably need to purge the named queues manually before each run.
- `await asyncio.gather(crawler_1.run(), crawler_2.run())` also won't work right off the bat - I assume that only one of your crawlers will have some start urls a…
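
The routing idea above can be sketched with the standard library alone. This is not Crawlee code: `asyncio.Queue` stands in for the two named request queues, and `run_crawler`, `feed_handler`, `detail_handler`, and the URLs are hypothetical names made up for illustration. The sentinel-based shutdown is likewise an assumption about one way to keep the second worker alive while its queue is still empty at startup; the reply above is cut off before it says how to handle that.

```python
import asyncio


async def run_crawler(queue, handler):
    """Drain `queue`, passing each URL to `handler`, until a None sentinel arrives."""
    while True:
        url = await queue.get()
        if url is None:  # sentinel: no more requests will be added
            break
        await handler(url)


async def main():
    # One queue per "crawler" (stand-ins for two separately named request queues).
    feed_queue = asyncio.Queue()
    detail_queue = asyncio.Queue()
    results = []

    async def feed_handler(url):
        # Pretend the feed was parsed and yielded two detail links (hypothetical).
        results.append(("feed", url))
        for i in (1, 2):
            # Route follow-up requests to the *other* crawler's queue.
            await detail_queue.put(f"{url}/item-{i}")

    async def detail_handler(url):
        results.append(("detail", url))

    # Only the first crawler has a start URL; the second starts with nothing.
    await feed_queue.put("https://example.com/rss")
    await feed_queue.put(None)

    async def run_feed_then_close():
        await run_crawler(feed_queue, feed_handler)
        # Tell the detail worker to stop only after the feed worker is done.
        await detail_queue.put(None)

    await asyncio.gather(
        run_feed_then_close(),
        run_crawler(detail_queue, detail_handler),
    )
    return results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In a real setup the two workers would be different crawler types, each constructed with its own named queue. The part that carries over is that a handler decides which queue a follow-up request goes to.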