An HTTP server that provides an API for scheduling Scrapy spiders and making requests with spiders.
- Lets you easily add an HTTP API to your existing Scrapy project.
- All Scrapy project components (e.g. middleware, pipelines, extensions) are supported out of the box.
- You simply run Scrapyrt in a Scrapy project directory, and it starts an HTTP server that lets you schedule your spiders and get spider output in JSON format.
- The project is not a replacement for Scrapyd, Scrapy Cloud, or other infrastructure for running long-running crawls.
- It is not suitable for long-running spiders; it works well for spiders that fetch one response from a website and return it.
To install Scrapyrt:
pip install scrapyrt
Now you can run Scrapyrt from within a Scrapy project by typing:
scrapyrt
in the Scrapy project directory.
Scrapyrt will look for the scrapy.cfg file to determine your project settings, and will raise an error if it cannot find one. Note that you need to have all your project requirements installed.
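Since Scrapyrt depends on scrapy.cfg, it can be handy to confirm the file is reachable before starting the server. The following is a minimal sketch only: `find_scrapy_cfg` is a hypothetical helper, and searching parent directories is an assumption of this sketch, not a description of how Scrapyrt itself locates the file.

```python
from pathlib import Path
from typing import Optional


def find_scrapy_cfg(start: str = ".") -> Optional[Path]:
    """Return the nearest scrapy.cfg at or above `start`, or None.

    Hypothetical helper: walking up the parent directories is an
    assumption of this sketch, not documented Scrapyrt behavior.
    """
    current = Path(start).resolve()
    # Check the starting directory first, then each ancestor in turn.
    for directory in (current, *current.parents):
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    cfg = find_scrapy_cfg()
    if cfg is None:
        print("No scrapy.cfg found; run scrapyrt from a Scrapy project directory.")
    else:
        print(f"Found project config: {cfg}")
```

Running this from the directory where you intend to start scrapyrt gives a quick indication of whether a project configuration is present.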
Scrapyrt supports the endpoint /crawl.json, which can be requested with two methods: GET and POST.
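For the POST method, a request can be prepared from Python with the standard library. This is a sketch under stated assumptions: the JSON body layout (a `spider_name` key plus a nested `request` object holding the `url`) is not described in this README and should be checked against the Scrapyrt documentation, `build_post_request` is a hypothetical helper, and the address is the one used in this README's curl examples.

```python
import json
import urllib.request

# Address used by the curl examples in this README.
CRAWL_ENDPOINT = "http://localhost:9080/crawl.json"


def build_post_request(spider_name: str, url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /crawl.json.

    Assumption: the body has the shape
    {"spider_name": ..., "request": {"url": ...}}.
    Verify this against the Scrapyrt documentation before relying on it.
    """
    payload = {"spider_name": spider_name, "request": {"url": url}}
    return urllib.request.Request(
        CRAWL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_post_request("toscrape-css", "http://quotes.toscrape.com/")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

With a Scrapyrt server running, the prepared request could be sent with `urllib.request.urlopen(req)` and the response body decoded with `json.load`.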
To run the sample toscrape-css spider from Quotesbot, parsing a page of famous quotes:
curl "http://localhost:9080/crawl.json?spider_name=toscrape-css&url=http://quotes.toscrape.com/"
To run the same spider, allowing only one request and parsing the url with the callback parse_foo:
curl "http://localhost:9080/crawl.json?spider_name=toscrape-css&url=http://quotes.toscrape.com/&callback=parse_foo&max_requests=1"
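The same GET requests can be issued from Python. The sketch below uses only the query parameters shown in the curl commands above (`spider_name`, `url`, `callback`, `max_requests`); the helper names `build_crawl_url` and `crawl` are hypothetical, and the structure of the returned JSON is not assumed.

```python
import json
import urllib.request
from typing import Any, Optional
from urllib.parse import urlencode

# Address used by the curl examples above.
BASE_URL = "http://localhost:9080/crawl.json"


def build_crawl_url(
    spider_name: str,
    url: str,
    callback: Optional[str] = None,
    max_requests: Optional[int] = None,
) -> str:
    """Build a GET URL for /crawl.json (hypothetical helper)."""
    params = {"spider_name": spider_name, "url": url}
    # Optional parameters are only added when explicitly provided.
    if callback is not None:
        params["callback"] = callback
    if max_requests is not None:
        params["max_requests"] = str(max_requests)
    # urlencode percent-encodes the target url so it is safe in a query string.
    return f"{BASE_URL}?{urlencode(params)}"


def crawl(spider_name: str, url: str, **options: Any) -> Any:
    """Send the GET request and return the decoded JSON response.

    Requires a running Scrapyrt server at BASE_URL.
    """
    with urllib.request.urlopen(build_crawl_url(spider_name, url, **options)) as response:
        return json.load(response)


if __name__ == "__main__":
    print(build_crawl_url("toscrape-css", "http://quotes.toscrape.com/",
                          callback="parse_foo", max_requests=1))
```

Unlike the curl commands, the generated URL percent-encodes the target address (for example `http%3A%2F%2F...`), which keeps the query string unambiguous if the target url itself contains `&` or `?`.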
Documentation is available on Read the Docs.
Open source support is provided here on GitHub. Please create a question issue (i.e. an issue with the "question" label).
Commercial support is also available from Zyte.
ScrapyRT is offered under the BSD 3-Clause license.
Development takes place on GitHub.