
[WIP / Discuss] Scrapy Streaming docs #7

Open · wants to merge 14 commits into master

Conversation

aron-bordin
Member

Moving from scrapy/scrapy#1991.

PR Overview

This is initial work on the Scrapy Streaming docs.
You can read it here: http://gsoc2016.readthedocs.io

I'd like to open the discussion about the communication protocol. It's pretty similar to the original protocol in my proposal, with some modifications. My idea is to open up the API design process, so I can get some feedback and modify this API before implementing it.

In my proposal, I suggested starting the implementation of this API on June 13, so it would be helpful to have a definitive API before that date.

Also, suggestions about new messages and new behaviors are welcome 😄

Implementation

Adding some comments about the implementation:

Originally, I suggested implementing the communication channel between Scrapy and the external spider using Twisted's ProcessProtocol; as noted in the docs, each message ends with a line break (\n).
I started an initial POC to get an idea of how this should work.
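To make the idea concrete, here is a minimal sketch of that channel; the external command and the message body are placeholders, not part of the protocol yet:

```python
from twisted.internet import protocol, reactor


class ExternalSpiderProtocol(protocol.ProcessProtocol):
    """Streaming-core side of the channel (sketch)."""

    def connectionMade(self):
        # write a message to the external spider's stdin;
        # every message must be terminated by a line break
        # (the message body here is only a placeholder)
        self.transport.write(b'{"type": "ready"}\n')

    def outReceived(self, data):
        # raw bytes from the external spider's stdout; note that this
        # chunk is NOT guaranteed to be exactly one complete message
        print('received:', data)


# 'external_spider.py' stands in for the external spider process
reactor.spawnProcess(ExternalSpiderProtocol(),
                     'python', ['python', 'external_spider.py'])
reactor.run()
```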

However, this implementation can run into buffering problems, because the messages sent by transport.write and received by outReceived may be buffered by the system.

Looking at @Preetwinder's POC, he uses https://github.com/Preetwinder/ScrapyStreaming/blob/master/linereceiverprocess.py#L53 to wrap the process and avoid these buffering issues.

I'm now analyzing the best way to approach these possible stdin/stdout buffering problems.

Since all messages must end with a line break, both implementations (the streaming core and the external spiders) could buffer the received data and only process it after receiving the line break (the end of the message); see the sketch below.
A different implementation could also make this easier: using the LineReceiver in the streaming core would help while receiving data. But I'm still not sure about the best way to write to the process's stdin; unfortunately, stdbuf is not available on all platforms.
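A minimal sketch of that buffering approach on the receiving side, accumulating stdout chunks and dispatching only complete lines (this is essentially what LineReceiver does internally; the message_received hook is illustrative):

```python
from twisted.internet import protocol


class LineBufferedProcessProtocol(protocol.ProcessProtocol):
    """Buffers stdout chunks and emits one callback per complete message."""

    delimiter = b'\n'

    def __init__(self):
        self._buffer = b''

    def outReceived(self, data):
        # a chunk may hold zero, one, or several messages; keep any
        # trailing partial message buffered until its '\n' arrives
        self._buffer += data
        while self.delimiter in self._buffer:
            line, self._buffer = self._buffer.split(self.delimiter, 1)
            self.message_received(line)

    def message_received(self, line):
        # illustrative hook: 'line' is one complete message
        print('message:', line)
```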

As part of the communication protocol, the line break is defined as the end of the message. If the external spider developer relies on this and only analyzes the received data after a line break, that should be enough.
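On the external spider's side, that could be as simple as the sketch below, reading stdin line by line and flushing stdout after each reply (the JSON body is an assumption for illustration, not the final message format):

```python
import json
import sys

for line in sys.stdin:
    # each line received on stdin is exactly one complete message
    message = json.loads(line)
    # ... handle the message, then answer with one '\n'-terminated line
    sys.stdout.write(json.dumps({'status': 'ok'}) + '\n')
    sys.stdout.flush()  # work around the stdout buffering issue noted above
```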

Do you have any comments about these implementations and the possible issues?

information from a domain (or a group of domains). It contains the all the logic and necessary information to
extract the data from a website.

We'll define a simple spider, that works as follows:
Member

please specify what you are creating here (with a title like Example 1: Github Streaming Spider). Also, I'm not sure it is a good idea to define a GitHub crawler; a site I would recommend for testing is https://www.dmoz.org

Member Author

Hi,
I updated the quickstart example to a dmoz spider.

codecov-io commented Jun 9, 2016

Current coverage is 88.21% (diff: 100%)

Merging #7 into master will not change coverage

@@             master         #7   diff @@
==========================================
  Files            11         11          
  Lines           246        246          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits            217        217          
  Misses           29         29          
  Partials          0          0          

Powered by Codecov. Last update dd41de4...49e9b2e

---------------

If you are not familiar with Scrapy, we name Spider as an object that defines how scrapy should scrape
information from a domain (or a group of domains). It contains the all the logic and necessary information to
Member

remove the the before all the logic

This was referenced Jul 20, 2016