[WIP / Discuss] Scrapy Streaming docs #7
base: master
Conversation
> information from a domain (or a group of domains). It contains the all the logic and necessary information to
> extract the data from a website.
>
> We'll define a simple spider, that works as follows:
Please specify what you are creating here (with a title like "Example 1: GitHub Streaming Spider"). Also, I'm not sure it is a good idea to define a GitHub crawler; a site I would recommend for testing is https://www.dmoz.org
Hi,
I updated the quickstart example to a dmoz spider.
Current coverage is 88.21% (diff: 100%)

@@            master    #7   diff @@
====================================
  Files           11    11
  Lines          246   246
  Methods          0     0
  Messages        0     0
  Branches        0     0
====================================
  Hits           217   217
  Misses          29    29
  Partials        0     0
> ---------------
>
> If you are not familiar with Scrapy, we name Spider as an object that defines how scrapy should scrape
> information from a domain (or a group of domains). It contains the all the logic and necessary information to
Remove the duplicated "the" before "all the logic".
moving from scrapy/scrapy#1991
PR Overview
This is initial work on the Scrapy Streaming docs.
You can read it here: http://gsoc2016.readthedocs.io
I'd like to open the discussion about the communication protocol. It's pretty similar to the original protocol in my proposal, with some modifications. My idea is to open up the API design process, so I can get some feedback and modify this API before implementing it.
In my proposal, I suggested starting the implementation of this API on June 13, so it would be helpful to settle on a definitive API before that date.
Also, suggestions about new messages and new behavior are welcome 😄
Implementation
Adding some comments about the implementation:
Originally, I suggested implementing the communication channel between Scrapy and the external spider using Twisted's ProcessProtocol, and, as pointed out in the docs, each message ends with a line break (`\n`). I started an initial POC to get an idea of how this should work.
However, this implementation could run into problems with buffering, because the messages sent by `transport.write` and received by `outReceived` can be buffered by the system. Looking at @Preetwinder's POC, he uses https://github.com/Preetwinder/ScrapyStreaming/blob/master/linereceiverprocess.py#L53 to wrap the process and avoid these buffering issues.
Now I'm analyzing the best way to approach these possible problems with stdin/stdout buffering.
Since all messages must end with a line break, both implementations (the streaming core and the external spiders) could buffer the received data and process it only after receiving the line break (the end of the message).
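That buffer-until-newline idea could be sketched roughly like this (a hypothetical `MessageBuffer` helper for illustration, not part of the actual implementation or of the POC; the message payloads are placeholders):

```python
class MessageBuffer:
    """Accumulates raw chunks from a pipe and yields only complete,
    newline-terminated protocol messages."""

    def __init__(self):
        self._pending = b""

    def receive(self, data):
        # append the new chunk; anything after the last b"\n" is an
        # incomplete message and stays buffered for the next call
        self._pending += data
        *complete, self._pending = self._pending.split(b"\n")
        return complete
```

A message split across two chunks by the OS buffering would only be emitted once its trailing `\n` arrives, so partial reads are harmless.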
Also, a different implementation could make it easier. Using the LineReceiver in the streaming core could help while receiving data, but I'm still not sure about the best way to write to the process's stdin; unfortunately, `stdbuf` is not available on all platforms. As part of the communication protocol, the line break is defined as the end of the message. If external spider developers use this convention and only process the received data after the line break, this should be enough.
Do you have any comments about the implementation and these possible issues?