
[WIP / Discuss] Scrapy Streaming docs #7

Open · wants to merge 14 commits into master

Conversation

aron-bordin
Member

Moving from scrapy/scrapy#1991.

PR Overview

This is initial work on the Scrapy Streaming docs.
You can read it here: http://gsoc2016.readthedocs.io

I'd like to open the discussion about the communication protocol. It's pretty similar to the original protocol in my proposal, with some modifications. My idea is to open up the API design process, so I can get some feedback and modify this API before implementing it.

In my proposal, I suggested starting the implementation of this API on June 13, so it would be helpful to have a definitive API before that date.

Also, suggestions about new messages and new behaviors are welcome 😄

Implementation

Adding some comments about the implementation:

Originally, I suggested implementing the communication channel between Scrapy and the external spider using Twisted's ProcessProtocol; as noted in the docs, each message ends with a line break (\n).
I started an initial POC to get an idea of how this should work.
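To make the idea concrete, here is a minimal sketch of that channel; the external command and the message body are placeholders, not part of the protocol yet:

```python
from twisted.internet import protocol, reactor


class ExternalSpiderProtocol(protocol.ProcessProtocol):
    """Streaming-core side of the channel (sketch)."""

    def connectionMade(self):
        # write a message to the external spider's stdin;
        # every message must be terminated by a line break
        # (the message body here is only a placeholder)
        self.transport.write(b'{"type": "ready"}\n')

    def outReceived(self, data):
        # raw bytes from the external spider's stdout; note that this
        # chunk is NOT guaranteed to be exactly one complete message
        print('received:', data)


# 'external_spider.py' stands in for the external spider process
reactor.spawnProcess(ExternalSpiderProtocol(),
                     'python', ['python', 'external_spider.py'])
reactor.run()
```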

However, this implementation can run into buffering problems, because the messages sent by transport.write and received by outReceived may be buffered by the system.

Looking at @Preetwinder's POC, he uses https://github.com/Preetwinder/ScrapyStreaming/blob/master/linereceiverprocess.py#L53 to wrap the process and avoid these buffering issues.

I'm now analyzing the best way to approach these possible stdin/stdout buffering problems.

Since all messages must end with a line break, both implementations (the streaming core and the external spiders) could buffer the received data and only process it after receiving the line break (the end of the message); see the sketch below.
A different implementation could also make this easier: using the LineReceiver in the streaming core would help while receiving data. But I'm still not sure about the best way to write to the process's stdin; unfortunately, stdbuf is not available on all platforms.
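A minimal sketch of that buffering approach on the receiving side, accumulating stdout chunks and dispatching only complete lines (this is essentially what LineReceiver does internally; the message_received hook is illustrative):

```python
from twisted.internet import protocol


class LineBufferedProcessProtocol(protocol.ProcessProtocol):
    """Buffers stdout chunks and emits one callback per complete message."""

    delimiter = b'\n'

    def __init__(self):
        self._buffer = b''

    def outReceived(self, data):
        # a chunk may hold zero, one, or several messages; keep any
        # trailing partial message buffered until its '\n' arrives
        self._buffer += data
        while self.delimiter in self._buffer:
            line, self._buffer = self._buffer.split(self.delimiter, 1)
            self.message_received(line)

    def message_received(self, line):
        # illustrative hook: 'line' is one complete message
        print('message:', line)
```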

As part of the communication protocol, the line break is defined as the end of the message. If the external spider developer relies on this and only analyzes the received data after a line break, that should be enough.
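On the external spider's side, that could be as simple as the sketch below, reading stdin line by line and flushing stdout after each reply (the JSON body is an assumption for illustration, not the final message format):

```python
import json
import sys

for line in sys.stdin:
    # each line received on stdin is exactly one complete message
    message = json.loads(line)
    # ... handle the message, then answer with one '\n'-terminated line
    sys.stdout.write(json.dumps({'status': 'ok'}) + '\n')
    sys.stdout.flush()  # work around the stdout buffering issue noted above
```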

Do you have any comments about these implementations and the possible issues?

information from a domain (or a group of domains). It contains the all the logic and necessary information to
extract the data from a website.

We'll define a simple spider, that works as follows:
Member

please specify what you are creating here (with a title like Example 1: Github Streaming Spider). Also, I'm not sure it is a good idea to define a GitHub crawler; a site I would recommend for testing is https://www.dmoz.org

Member Author

Hi,
I updated the quickstart example to a dmoz spider.

codecov-io commented Jun 9, 2016

Current coverage is 88.21% (diff: 100%)

Merging #7 into master will not change coverage

@@             master         #7   diff @@
==========================================
  Files            11         11          
  Lines           246        246          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits            217        217          
  Misses           29         29          
  Partials          0          0          

Powered by Codecov. Last update dd41de4...49e9b2e

---------------

If you are not familiar with Scrapy, we name Spider as an object that defines how scrapy should scrape
information from a domain (or a group of domains). It contains the all the logic and necessary information to
Member

remove the the before all the logic

This was referenced Jul 20, 2016