-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Worker-driven batch processing #399
base: master
Are you sure you want to change the base?
Changes from all commits
59effae
43af39f
dc2b5c5
129e5a0
ee4f3eb
1e66a77
e9a959e
50c79c7
f9bfcdd
2f38b2b
7ba3aa5
4f424e5
ffaf064
b252d74
130d600
7ba11b6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
{} | ||
{"!@#$": true} | ||
{"df": 1} | ||
{"df": "abc"} | ||
{"df": "abc1"} | ||
{"df": "absdfg"} | ||
{"df": "Some text here."} | ||
{"df": "abc", "fd": true} | ||
{"d": "abc"} | ||
{"df": "abc"} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# $1 has to be used as a workaround to pass empty string as a value for | ||
# -E, since "... -E ''" will treat the single quotes literally due to double | ||
# quotes around them. | ||
cmd="./stage.py -m s -E $1" | ||
cmd_batch2="./stage.py -b 2 -m s -E $1" | ||
cmd_batch100="./stage.py -b 100 -m s -E $1" | ||
|
||
# Various tests that should produce the same results. | ||
|
||
# Stage chains. | ||
cat inp | $cmd "" | $cmd "" > outp1 | ||
cat inp | $cmd_batch2 "" | $cmd_batch2 "" > outp2 | ||
cat inp | $cmd_batch100 "" | $cmd_batch100 "" > outp100 | ||
cat inp | $cmd "" | $cmd_batch2 "" > outp12 | ||
cat inp | $cmd_batch2 "" | $cmd "" > outp21 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,102 @@ | ||
#!/bin/env python | ||
""" | ||
DKB Dataflow stage XXX (StageName). | ||
|
||
Stage short description | ||
|
||
Authors: | ||
Author Name ([email protected]) | ||
""" | ||
|
||
import os | ||
import sys | ||
import traceback | ||
|
||
try: | ||
base_dir = os.path.dirname(__file__) | ||
dkb_dir = os.path.join(base_dir, os.pardir) | ||
sys.path.append(dkb_dir) | ||
import pyDKB | ||
from pyDKB.dataflow.stage import ProcessorStage | ||
from pyDKB.dataflow.communication.messages import JSONMessage | ||
from pyDKB.dataflow.exceptions import DataflowException | ||
from pyDKB.dataflow import messageType | ||
from pyDKB.common.types import logLevel | ||
except Exception, err: | ||
sys.stderr.write("(ERROR) Failed to import pyDKB library: %s\n" % err) | ||
sys.exit(1) | ||
|
||
|
||
def process(stage, messages): | ||
""" Single or batch message processing. | ||
|
||
This form of batch processing is pretty pointless in terms of efficiency: | ||
using it will replace, for example, ProcessorStage cycling over 100 | ||
messages with it cycling over 10 batches, and this stage cycling | ||
over 10 messages in each batch. But for testing and illustrative purposes | ||
it will do. | ||
""" | ||
if not isinstance(messages, list): | ||
messages = [messages] | ||
for message in messages: | ||
data = message.content() | ||
if not isinstance(data, dict): | ||
stage.log("Cannot process non-dict data: %s." % data, | ||
logLevel.WARN) | ||
continue | ||
# Processing machinery | ||
if 'df' in data and isinstance(data['df'], (str, unicode)): | ||
data['df'] = 'processed ' + data['df'] | ||
else: | ||
stage.log("Failed to process data %s, required field 'df' not" | ||
" found or contains non-str value." % data, | ||
logLevel.WARN) | ||
out_message = JSONMessage(data) | ||
stage.output(out_message) | ||
return True | ||
|
||
|
||
def main(args): | ||
""" Program body. """ | ||
stage = ProcessorStage() | ||
stage.set_input_message_type(messageType.JSON) | ||
stage.set_output_message_type(messageType.JSON) | ||
stage.set_default_arguments(bnc='') | ||
|
||
# Accept batch size from command line. | ||
# This is cheating because batch size is supposed to be set by | ||
# stage developer, not received from command line (so, | ||
# from supervisor). However, this is done in this illustrative | ||
# stage to simplify a process of comparing the results of | ||
# normal mode and batch mode with different batch sizes. | ||
stage.add_argument('-b', action='store', type=int, help='Batch size.', | ||
default=1, dest='bsize') | ||
|
||
stage.process = process | ||
|
||
exit_code = 0 | ||
exc_info = None | ||
try: | ||
stage.configure(args) | ||
stage.set_batch_size(stage.ARGS.bsize) | ||
stage.run() | ||
except (DataflowException, RuntimeError), err: | ||
if str(err): | ||
sys.stderr.write("(ERROR) %s\n" % err) | ||
exit_code = 2 | ||
except Exception: | ||
exc_info = sys.exc_info() | ||
exit_code = 3 | ||
finally: | ||
stage.stop() | ||
|
||
if exc_info: | ||
trace = traceback.format_exception(*exc_info) | ||
for line in trace: | ||
sys.stderr.write("(ERROR) %s" % line) | ||
|
||
exit(exit_code) | ||
|
||
|
||
if __name__ == '__main__': | ||
main(sys.argv[1:]) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,23 @@ class InputStream(Stream): | |
|
||
__iterator = None | ||
|
||
# Names of markers that the stream knows how to process. | ||
# Values are taken from config, markers with empty values are ignored. | ||
marker_names = ['eob'] | ||
|
||
def configure(self, config={}): | ||
""" Configure instance. """ | ||
super(InputStream, self).configure(config) | ||
self.markers = {} | ||
# If batch size is 1, meaning non-batch mode will be used, then | ||
# markers are unnecessary (and even can be a hindrance by forcing | ||
# usage of custom_readline() without need). | ||
if config.get('bsize', 1) > 1: | ||
Evildoor marked this conversation as resolved.
Show resolved
Hide resolved
|
||
for name in self.marker_names: | ||
value = config.get(name) | ||
if value: | ||
self.markers[name] = value | ||
|
||
def __iter__(self): | ||
""" Initialize iteration. """ | ||
self._reset_iterator() | ||
|
@@ -24,14 +41,14 @@ def __iter__(self): | |
def _reset_iterator(self): | ||
""" Reset inner iterator on a new file descriptor. """ | ||
fd = self.get_fd() | ||
if self.EOM == '\n': | ||
if self.EOM == '\n' and not self.markers: | ||
self.__iterator = iter(fd.readline, "") | ||
self.is_readable = self._fd_is_readable | ||
elif self.EOM == '': | ||
elif self.EOM == '' and not self.markers: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This condition should be left as it was:
But to avoid "silent disregard" for user configuration, in case of empty |
||
self.__iterator = iter(fd.read, "") | ||
self.is_readable = self._fd_is_readable | ||
else: | ||
self.__iterator = custom_readline(fd, self.EOM) | ||
self.__iterator = custom_readline(fd, self.EOM, self.markers) | ||
self.is_readable = self._gi_is_readable | ||
|
||
def reset(self, fd, close=True, force=False): | ||
|
@@ -139,9 +156,10 @@ def parse_message(self, message): | |
return False | ||
|
||
def get_message(self): | ||
""" Get next message from the input stream. | ||
""" Get next message or marker from the input stream. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This sounds wrong: Since all decent return values (like (And, I guess, all these tricks with Of course, the method can be renamed, but for now I am not sure if it should be. |
||
|
||
:returns: parsed next message, | ||
next marker, | ||
False -- parsing failed, | ||
None -- no messages left | ||
:rtype: pyDKB.dataflow.communication.messages.AbstractMessage, | ||
|
@@ -154,16 +172,22 @@ def get_message(self): | |
return result | ||
|
||
def next(self): | ||
""" Get next message from the input stream. | ||
""" Get next message or marker from the input stream. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This changes Strings are very.. questionable. Too standard -- so it can be anything. If iteration over But I would rather have iteration over The idea is as follows:
If all this stuff is configured or not can be encapsulated into choice of a proper There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I agree with this. If we will decide to keep this logic, then changing strings to constants sounds reasonable. Everything else here is ideological and will be discussed in email/meeting/etc. |
||
|
||
:returns: parsed next message, | ||
next marker, | ||
False -- parsing failed or unexpected end of stream occurred | ||
:rtype: pyDKB.dataflow.communication.messages.AbstractMessage, bool | ||
:rtype: pyDKB.dataflow.communication.messages.AbstractMessage, | ||
str, bool | ||
""" | ||
if not self.__iterator: | ||
self._reset_iterator() | ||
msg = self.__iterator.next() | ||
if not msg.endswith(self.EOM): | ||
# Check whether an expected marker was received. | ||
for key in self.markers: | ||
if msg == key: | ||
return key | ||
log_msg = msg[:10] + '<...>' * (len(msg) > 20) | ||
log_msg += msg[-min(len(msg) - 10, 10):] | ||
log_msg = log_msg.replace('\n', r'\n') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment to the whole set of changes in this function.
Expecting markers only at specific place in the flow is not very safe.
If the input stream was corrupted and part of it is missed, it may look like this:
Here's some illustration of how it looks like:
What's going on here:
<EOB>
,custom_readline()
will yield a piece of broken data with<EOB>
. No one will know it was<EOB>
-- but it's fine; data are unreadable anyway, stream is closed -- so whatever was read before, it will be processed;custom_readline()
won't return anything until it sees<EOM>
. But when it does -- the broken message,<EOB>
and the next proper message will stick together and will be taken as a single message; it won't be decoded (for it is not a properly formatted message) and will be thrown away. So the proper message will be ignored. And what's even worse -- until there's<EOM>
in the input,custom_readline()
will not return anything at all -- and the whole batch of already read messages will sit in memory and wait, not being processed till the next<EOB>
or the end of input.What I'd expect in this situation:
<EOB>
is detected;<EOB>
);<EOB>
is processed in a regular way.Returning "name of marker" instead of the marker itself may be ambiguous.
Whenever the function is called from, the context is aware of marker values and can compare returned value to them. On the other hand, sequence of
eob<EOF>
(with properEOF
== "STDIN is closed"), that originally would be processed as "unexpected end of stream" by theInputStream
, now will be taken as<EOB>
at the end of stream:(BTW, the extra empty line is an unexpected result)
Maybe it's not a big problem --
<EOB>
at the end of stream won't make worker to do something unexpected -- but it is just an example of possibility of incorrect interpretation of a plain string'eob'
as a control marker.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true, but the problem here is missed
<EOM>
, not<EOB>
. Even if we work inmaster
that knows nothing about other markers and batch processing, a message being corrupted in a way that erases<EOM>
will make it "stick" to the next message, and that next message would be ignored, even though it is correct.I presume that there is a typo and the second marker is supposed to be
<EOM>
as well, because I don't see any way in which missed<EOB>
can causecustom_readline
to hang until next<EOB>
is encountered. In such case, it's the continuation of the previous text and my answer above still applies. Otherwise - please, clarify this moment.This solves a problem of missing a marker due to it being caught between corrupted messages, but creates another one - what if a sequence of characters inside a (normal or corrupted) message matches some marker's value by accident?
All of this is a part of a bigger question "how to deal with errors in communication between worker and supervisor" - or, to be more precise, communication of markers (and losing a marker can be much more dangerous than losing a message). This problem is not unique to batch mode. For example, if supervisor sends a message but
<EOM>
gets lost/corrupted then we have a deadlock: supervisor expects<EOP>
and worker expects<EOM>
[1]. Which means that supervisor should have means of breaking such deadlocks, by checking worker's state or something else.All things considered, I'm not sure that looking for markers inside messages is a solution and that this question fits into this PR. By the way, code in this PR should actually deal with missed
<EOB>
correctly (given that supervisor continues to operate correctly after sending mangled<EOB>
):Normal work:
message
<BNC>
<EOB>
messages
+EOP
Error:
message
<BNC>
something which may or may not include <EOB> or its pieces somewhere
<BNC>
<EOB>
messages
+EOP
[1] This example can be simulated by starting a stage in terminal mode with standard
<EOM>
=\n
and forgetting to press ENTER after a message. One can look at the screen and realize their mistake, but hey, that's some advanced supervisor technique...Fair point, will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Evildoor
The problem is not in "missed
<EOM>
" -- some issues in the data transfer may happen, and it is not what we can control with the protocol specification and encoder/decoder implementation. My point is that in the current implementation of parser (how the presence of<EOB>
is checked) makes this kind of issue more harmful than it could be.Right, the correct message following the broken one will be thrown away. But in this case:
<data_broken+data_correct><EOM>
: there's no way to separate broken data from correct (while with<EOB>
between them it is possible);<EOM>
with only the broken one in memory, not the whole batch; in other words -- everything that can be processed, is processed without delay.No, it's
<EOB>
.Not the
custom_readline()
in particular, but the stage itself; the buffer of messages, for now, is a stage attribute, not thecustom_readline()
's.Here you are: the stage operating in the batch mode (even in worker-driven scenario) will not process anything until it:
<EOB>
;If supervisor sent
<EOB>
, keeping the stream open but having no more messages for processing for the next hour, the worker will sit and wait for the whole hour instead of processing the messages it already has.Just in case, reminder: we're talking about situation with the broken message (without
<EOM>
) prior to the<EOB>
, not the general situation.Ahha, why do you think
<EOM>
is a NULL ASCII character, not plain "eom"? ;)It's a well known issue: everything that has special meaning must be properly treated in the data (just as registered keyword "table" in SQL language, for example).
The simpliest way to get rid of this is to encode data with base64 and make sure none of these 64 ASCII symbols are used as special ones.
Right, this should be improved. Feel free to write it down somewhere (Google docs with the protocol description, Trello cards, your own notepad), add your ideas on how it can be solved to make sure you won't forget when decide to take care of it.
But here and now your main objective is not to detect and solve existing issues; it is to introduce new functionality without inviting new ones where it can be helped. Existing issue is not an excuse to add new ones saying "why bother now, if it's not working anyway; one day someone will take care of this, so let him/her take care of all at once".
I do insist.
Wrong: until there's
<EOM>
, it's not a message either. So I will wait for<EOM>
(for how long? a minute? an hour?) until decide that...Wrong: if worker saw
<EOM>
finally came, it's very likely that supervisor has a new pack of messages, in which case worker will have some messages and maybe even fill the batch, or -- as you suggested -- will see another<EOB>
.I don't think we should continue the discussion for it already took too much words, so here's my ultimate resolution on this question: I won't accept the PR until the code can recognize
<EOB>
in data stream independently of the marker's position.Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, will do.