Proposal change to support: stream processing, server-supported tree puts, PATCH, and replacement of objects in resources #80
Looks quite like what we'd originally thought through a while back. I like how succinct it makes the change docs. I think this would be a good chance to introduce the server-side computation of change feeds, i.e., repeated writes would only result in change feeds reflecting the diff. Regarding increased storage requirements, I suppose the reverse side could be omitted via config. I say go for it.
Thanks, @sanoel! @maverick-zhn @abalmos @awlayton thoughts?
You make a good point on the "config": I considered that the option to receive reverse-changes could be configurable when setting the watch itself over a websocket. I also hadn't considered your idea that this would also let us implement @abalmos's requested feature that a "do-nothing" write would not register a change (at least I think that was @abalmos's requested feature). I think I like that idea quite a bit: you could just write an entire document and let the server diff it to figure out the "proper" change feed. That scenario has come up several times, and each time it requires you to diff on the client. In the case where you didn't have a reverse-change feed, you needed the entire written thing to be in the change feed if you wanted to figure out what was overwritten by just looking back one rev. By adding reverse-changes, now we can accomplish the best of all worlds: you can figure out what was overwritten, AND the "change" feed could be server-certified as truly only including things that changed. The only thing you couldn't do is trigger updates to go out by repeatedly writing the same thing over and over, which I don't think is a legit use case.
@aultac, the proposal seems complete.
Disadvantages:
Considerations: Let us get started then. It is essential to reduce particular problems to architecture composability based on idempotent composition operators such as XOR. These idempotent composition operators include the associative and commutative properties, allowing a series of distributed computation models to be highly parallelizable. It will be fascinating to come up with an API that facilitates not only the development but also improves performance—it should give you low latency, high throughput, a higher level of concurrency, and scalability. Additionally, it is crucial to define an elastic topology that allows the System as a whole to scale according to the requests that are hitting the event sources. Many of these concepts underlie the Lambda architecture and have existed for many years. For the computation of your model, we can create a topology (Directed Acyclic Graph, DAG) that emulates Storm's architecture to accomplish a higher level of scalability, high throughput, and low latency for an unbounded stream of requests. Our design should be able to scale down or up according to the number of requests present at a particular time Tk. Some other core concepts that we need to consider are that the System should be able to overcome failures and ensure the completion of the computation.
Thanks, @maverick-zhn, excellent points. Seeing no major opposition, @tomohiroarakawa I think we're good for you to start on this if you're ready.
@aultac Sounds good!
I think there are major objections, but you know them already. Maybe one day I'll get a chance to read all of this.
Over the past several years, a few items around the change feed structure have cropped up that we should think about implementing. I'm making this issue to bring all those conversations back together into one thread so we can decide what we want to do. This is 10x longer than I thought it would be, but I think that's because the same solution solves a lot of otherwise unrelated goals.
First, a change document for a resource with linked children used to look something like this before we batched changes:
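The original example was not preserved here, but a hypothetical sketch of the old tree-style change document (field names and ids are assumptions, not the exact historical format) might look like:

```python
# Hypothetical sketch of a pre-batching change doc: the write to
# resources/child1 (taking it to rev 3) is reconstructed up the graph
# into its parent (which moved to rev 47), mirroring the original tree.
old_style_change = {
    "_id": "resources/parent",
    "_rev": 47,
    "type": "merge",
    "body": {
        "_rev": 47,
        "children": {
            "child1": {
                "_id": "resources/child1",
                "_rev": 3,
                "foo": "bar",
            },
        },
    },
}
```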
It was simply taking each of the changes up the graph and constructing them back into the original tree. In that case, `resources/child1` had the original write of `foo: "bar"`, and that gave it rev `3`; the parent resource then got that `rev: 3` when it updated to its own rev `47`.

When we batched changes together, so that one change may have overwritten another, we contemplated sticking with the tree structure and just merging all the batched changes together into one "net" change. We ultimately decided against keeping the big merged tree in favor of just putting each individual resource change into an array, which I like and agree with. In that case, the above change document now looks like this:
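The array-form example was also lost in this copy; a hypothetical sketch of the same change as an array, one entry per resource that actually changed (field names are assumptions):

```python
# Hypothetical sketch of the batched, array-style change document:
# each element describes one resource's own merge.
array_style_change = [
    {
        "resource_id": "resources/child1",
        "type": "merge",
        "body": {"foo": "bar", "_rev": 3},
    },
    {
        "resource_id": "resources/parent",
        "type": "merge",
        "body": {"children": {"child1": {"_rev": 3}}, "_rev": 47},
    },
]
```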
However, we did hit upon a rather elegant feature of the idempotent merges, namely that they can be composed into net changes by simply performing the same merge operation on the change document bodies that you would to the resources themselves. This lets you selectively compress a change stream, say for replication to an outside service, and it just feels cleaner in general. Simple composability is better than non-composability.
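As a rough sketch of that composability (the change bodies here are hypothetical), composing change bodies uses the same deep-merge you would apply to the resources themselves, and the result is itself a valid change body:

```python
def compose(a, b):
    """Idempotently merge change body b on top of a (b wins on conflict)."""
    out = dict(a)
    for key, val in b.items():
        if isinstance(out.get(key), dict) and isinstance(val, dict):
            out[key] = compose(out[key], val)  # merge nested objects
        else:
            out[key] = val  # scalar or new key: take the later value
    return out

# Two successive (hypothetical) change bodies collapse into one net change:
c1 = {"foo": {"a": 1}, "_rev": 12}
c2 = {"foo": {"b": 2}, "_rev": 13}
net = compose(c1, c2)
# net == {"foo": {"a": 1, "b": 2}, "_rev": 13}
```

Note that composing the net change with itself changes nothing, which is exactly the idempotence that makes selective compression of a change stream safe.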
The trouble was, any non-idempotent-merge sort of change (by which I mean the "delete" type of change doc) would break that feature. The "delete" type change always felt a bit hacky, but we just threw it in there at the time so delete worked. I would love to find a way to clean that up by eliminating the need to have 2 different types of change documents and just reduce everything to the elegant ordered stream of idempotent merges.
So that leads me to
Goal 1: eliminate the "delete" change doc so everything is a composable ordered stream of idempotent merge operations.
If you stick with me through the end of this manifesto, there is a single proposal that will address all the goals simultaneously. I'm just working my way up to it.
Goal 2: Fix broken stream processing from change feeds
If you want to stream process documents based solely on the change feed, you cannot do it today, and you should be able to. The problem is that the change only tells you what is now in the resource, not what used to be there before the change. For example, if your service is maintaining a running sum of data points, and you get a new change document showing that a `5` has been added to the list at key `foo`, you obviously need to add 5 to your running sum. However, you do not know if there used to be a `3` at that `foo` key before the write. If there was, you also need to subtract `3` from your sum. The only way to learn about the previous value of `3` currently is to either track the prior state of the resource yourself or require every write to also include the old `foo` key in the body (that was the last value of it). This is inefficient, and it seems like we should be able to do better. The simplest answer is that the information about what was overwritten should be available either in the change document itself, or as a separate feed of "stuff that was overwritten."
It turns out that if we can represent all changes as a net idempotent merge, then we can easily compute the net change required to revert the document from any _rev to a previous one. This net change would definitionally contain the data that was overwritten. If a stream processing service had access to the regular change and this `reverse-change`, it would have a very easily accessible way of knowing what was overwritten, whether this was a brand-new data point, what got deleted, etc. For example, if someone makes a change to a resource at rev 13 to get to rev 14, the `reverse-change` to get from _rev 14 back to _rev 13 contains exactly the old values that the write replaced.

That looks pretty clean to me, and from a coding perspective, you already have code looking for the `foo` key in the forward change in order to add it to your running sum in my example, so it would be a piece of cake to add an independent piece of code to also look for `foo` in the `reverse-change` and subtract whatever it finds there.

In our current change feed, if the forward change is a delete, the reverse-change is just a merge with the deleted data. If the forward change is a merge, the reverse change could be multiple deletes and possibly a merge. This creates a messiness: there would not be a clean 1-to-1 mapping between forward and reverse changes. This will be addressed in Goal 3.
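The worked example was lost in this copy; a minimal sketch of computing a reverse-change, assuming a hypothetical resource going from rev 13 to 14 (and using `None` as a stand-in delete marker, not the actual OADA representation):

```python
def reverse_change(before, change):
    """Compute the merge that undoes `change` once applied to `before`.
    Overwritten keys get their old values back; newly created keys are
    marked for deletion (None is a stand-in delete marker here)."""
    undo = {}
    for key, new_val in change.items():
        if key not in before:
            undo[key] = None  # stand-in for "delete this key"
        elif isinstance(new_val, dict) and isinstance(before[key], dict):
            undo[key] = reverse_change(before[key], new_val)
        else:
            undo[key] = before[key]  # restore the overwritten value
    return undo

before = {"_rev": 13, "sum": {"foo": 3}}            # hypothetical rev 13
change = {"_rev": 14, "sum": {"foo": 5, "bar": 1}}  # write taking it to rev 14
undo = reverse_change(before, change)
# undo == {"_rev": 13, "sum": {"foo": 3, "bar": None}}
```

The `undo` document is itself just another idempotent merge, which is what lets you walk the feed in either direction.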
There are a few downsides to `reverse-changes`:

- Computing them requires knowing the prior state of the resource, which means a pre-retrieval or server-side diff on every write.
- Storing them roughly doubles the storage requirements of the change feed.
- A delete has to carry all of the deleted content in its `reverse-change` for its deletion to be revertible.

I feel like some clever algorithmic magic can reasonably alleviate most of those burdens if they really become an optimization issue, but they do exist. Sounds like a good paper or two...
In my opinion, the fact that a reverse-change could both solve the stream processing problem and enable you to easily walk back-and-forth through a change feed like a doubly-linked-list is reason enough to consider adoption. It also has the benefit of being backwards compatible: adding the reverse-change doesn't break things that look for the forward change, it just enables you to fix the services for all those edge cases that the service authors (i.e. me) forgot about or didn't want to deal with at the time.
Goal 3: 1-to-1 mapping between forward and reverse-changes and PATCH
`json-patch` exists and people use it. I think we all agree it would be nice to enable it in OADA via the usual `PATCH` verb. However, the way json-patch represents changes as potentially a series of actions means that a single PATCH today could result in multiple change documents. It would be substantially better to have a single PATCH request map to a single change document.

One approach would be to get rid of our idempotent merge change documents and replace them with json-patch. Unfortunately, for the person writing code to do stream processing from the change feed, this would be a nightmare to handle. Right now you can just look for keys you care about in a single json document; with json-patch, you'd have a near-infinite number of possible ways that a change could be represented, and it would be extremely difficult to figure out all the ways that the key you care about could have changed. For that reason, I don't think this is a viable option at all.
Another approach would be to just solve the "2 types of changes" dilemma, since we already proposed above that all series of idempotent merge operations are composable into net changes. Get rid of the "delete" type and make it part of a "merge", and we're there. Bonus: we could probably utilize a json-patch library to figure out a single "net" idempotent merge document to represent whatever changes might have happened, and then just make PATCH forward to PUT internally, just like POST does.
To do this, we must add the ability to specify in a "merge" change that one or more keys should be deleted. We've rejected in the past the idea of using `null` at the key like ArangoDB does, because it complicates your ability to put an actual `null` at that key. However, putting an object there with a reserved key inside that means "delete me" would work, even if it doesn't quite look as clean as `null`: putting `{ "_delete": true }` under keys `a` and `b` clearly means "delete keys `a` and `b`" when you "merge" in this change. With that model, you could finally send a single request to delete multiple keys.
It also has the advantage that you can also specify things to write in the same document:
That quite clearly means "delete keys `a` and `b` and upsert the value `"hello"` into `c`".

Now we have a great 1-to-1 mapping between forward and reverse changes: given the resource's state before that change, the reverse change is just another single idempotent merge that restores the overwritten values of `a` and `b` and `_delete`s the newly created `c`.
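A minimal sketch of that forward/reverse pair, assuming a hypothetical before-state (`apply_change` here only handles plain delete markers, not the delete-then-write replace case discussed later):

```python
def apply_change(doc, change):
    """Apply an idempotent merge whose sub-objects may carry a
    _delete marker (plain delete-marker case only)."""
    out = dict(doc)
    for key, val in change.items():
        if isinstance(val, dict) and val.get("_delete"):
            out.pop(key, None)  # delete the key entirely
        elif isinstance(out.get(key), dict) and isinstance(val, dict):
            out[key] = apply_change(out[key], val)  # merge nested objects
        else:
            out[key] = val
    return out

before = {"a": 1, "b": 2}  # hypothetical prior state of the resource
forward = {"a": {"_delete": True}, "b": {"_delete": True}, "c": "hello"}
reverse = {"a": 1, "b": 2, "c": {"_delete": True}}

after = apply_change(before, forward)
assert apply_change(after, reverse) == before  # reverse undoes forward
```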
Any code listening to that change feed and receiving those 2 change docs could easily do whatever stream processing it wants, because all the info it needs is now there, and it is really easy to write code to home in on changes to a specific part of a resource.
The downside is that this would be the first "reserved" key we would be introducing into the JSON bodies that isn't at the top level (i.e., `_id`, `_rev`, `_type`, and `_meta` are already reserved at the top level). Which brings me to Goal 4.

Goal 4: enable single-request create-and-link for resources via tree-put
I think we all agree we'd like to enable some form of tree-put on the server side. The most basic thing to support is at least the ability to put to a link and have it create a resource and link it at the same time. The PUT body would look something like this:
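A sketch of one plausible shape for such a PUT body (the key `alink`, the `_rev` placeholder value, and the content are assumptions for illustration):

```python
# Hypothetical tree-put body: a versioned link with the new
# resource's content inline.
put_body = {
    "alink": {
        "_rev": 0,      # presence of _rev requests a versioned link
        "foo": "bar",   # content for the resource the server creates
    },
}
```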
If there was no resource linked at key `alink` in the parent document, the server would create a new resource with the body `{ foo: "bar" }` and create a versioned link to it under the parent at the key `alink`. Omit `_rev`, and it's a non-versioned link. Make 2 simultaneous requests, and the first one creates the resource while the second one just writes directly to the created resource as the second rev.

If this is what we'd like, then we are already extending the reserved `_` keys from the top level of a PUT body down to any level, and therefore adding one more reserved key of `_delete` would not actually be anything different from what we plan to do anyway.

Speaking of reducing the number of requests needed to do things in OADA, we also would all prefer to have some sort of "replace" or "set" functionality so we can make a particular key exactly equal to some value. Which brings up Goal 5.
Goal 5: Enable "replace" for keys in a resource
Suppose you have a resource with existing values at keys `a`, `b`, and `c`, and you would like to change the `a` key to `7`, delete the `b` key, and set the `c` key to be exactly `{ foo: "bar" }`. To accomplish this today, you would have to issue several separate requests, since a plain idempotent PUT cannot remove the existing contents of `c` before writing the new value.

To explain the next part of this proposal, consider how this would look up to this point with our newly proposed change document structure using `_delete`. This would result in 3 idempotent merge change docs. Mentally merge all those change documents together in your head: they compose into a single net change.
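A sketch of those three hypothetical change docs and their composition (a simplified compose rule: a sub-object carrying `_delete` replaces whatever was there before, matching the ordering rule discussed at the end of this proposal):

```python
def compose(a, b):
    """Compose idempotent merge change bodies, b applied after a."""
    out = dict(a)
    for key, val in b.items():
        if (isinstance(out.get(key), dict) and isinstance(val, dict)
                and not val.get("_delete")):
            out[key] = compose(out[key], val)  # merge nested objects
        else:
            out[key] = val  # scalars and _delete subtrees replace outright
    return out

# Hypothetical change docs: set a=7, delete b, replace c with {foo: "bar"}
c1 = {"a": 7}
c2 = {"b": {"_delete": True}}
c3 = {"c": {"_delete": True, "foo": "bar"}}
net = compose(compose(c1, c2), c3)
# net == {"a": 7, "b": {"_delete": True}, "c": {"_delete": True, "foo": "bar"}}
```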
Now, pretend you didn't know what the `_delete` under the `c` key means. You can easily reason that there are only 2 possibilities: either it means put `foo: "bar"` into the `c` key and then delete `c`, or it means first delete what's in `c` and then put `foo: "bar"` there. Since the first one would be a meaningless no-op (why bother writing to something that is destined for deletion), only the second one makes sense.

As a bonus, completely replacing an existing resource with new data would now be really easy:
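A sketch of what that whole-resource replace body might look like (content is a hypothetical example):

```python
# Hypothetical PUT body that replaces the entire resource contents:
replace_put = {
    "_delete": True,   # first delete the existing contents here
    "foo": "bar",      # then merge in the new data
}
```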
That means "first delete the stuff at this resource, then put `foo: "bar"` there instead."

Therefore, adding only the `_delete` concept, we are now able to represent any arbitrary net change by a simple, repeated merging of change documents together.

For completeness, there is an edge case where a sequence of change documents ends up putting in keys, then deleting them, then putting them back in. That's easily handled by just making sure you compose in order for overlapping keys, while you can compose in parallel for any non-overlapping keys. I.e., if the first change writes to `c`, and the second change deletes `c`
, the net document cannot have any of the deleted keys inside the final `{ _delete: true }` object. It still applies the same algorithm to compose changes as the actual writes to resources; it's just not quite as easy to visualize if someone hasn't considered it before.

Conclusion
Adding this single concept allows us to meet all the goals listed here. It is completely backward compatible with any systems doing PUTs. Given minimal training, I think just about anybody could understand what's going on here and recognize what the change document means at a glance. It retains the simplicity of watching for keys you care about among a stream of things you might not care about. It makes sure you can figure out what was possibly overwritten by any given write.
The primary ramification of enabling this would be that the change feeds would start containing `_delete` and stop containing `type: "delete"`. Therefore, anything trying to track deletes from a change feed would have to be updated. It also has ramifications in a potential slowdown of writes from doing pre-retrieval+diff, and it will double the storage requirements of the change feed.

With that said,
The final proposal
1: Add `_delete` as outlined above.

2: Include a `reverse` key in the change document at the same level as `body`.
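A sketch of a change document carrying both keys (the resource id and field names are assumptions; the revs and values match the description below):

```python
# Hypothetical change document with the proposed `reverse` key:
change_doc = {
    "resource_id": "resources/example",  # made-up id for illustration
    "type": "merge",
    "body":    {"_rev": 3, "a": "foo"},               # rev 2 -> 3: set a
    "reverse": {"_rev": 2, "a": {"_delete": True}},   # rev 3 -> 2: delete a
}
```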
That means: to get from rev `2` to rev `3`, make the `a` key equal `"foo"` (`a` was added new in rev 3). To get back from rev `3` to rev `2`, delete `a`.