The goal is to enable client-side consistent hashing (even if it just happens on the Explorer API), so that a user can GET or PUT an object directly to a node with a primary partition.
Combined (optionally) with using n_val=1, this enables more efficient reading and writing of large(r) objects to a Riak cluster without the additional inter-node network traffic (though at the cost of less redundancy, obviously).
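To make the n_val=1 part concrete, here is a minimal sketch of setting that property on a bucket via Riak's HTTP bucket-properties API. The host, port, and bucket names are placeholders; the request is built but not sent.

```python
import json
from urllib import request

def set_nval_request(host, port, bucket_type, bucket, n_val):
    """Build a PUT request that sets n_val in a bucket's properties
    via Riak's HTTP bucket-properties endpoint."""
    url = "http://%s:%d/types/%s/buckets/%s/props" % (host, port, bucket_type, bucket)
    body = json.dumps({"props": {"n_val": n_val}}).encode("utf-8")
    return request.Request(url, data=body, method="PUT",
                           headers={"Content-Type": "application/json"})

# Example: a (hypothetical) bucket holding large, non-critical blobs.
req = set_nval_request("127.0.0.1", 8098, "default", "big_objects", 1)
```

The caller would pass the request to `urllib.request.urlopen` (or any HTTP client) to apply it; remember that n_val=1 trades away the redundancy discussed above.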
The actual API endpoint should probably be:
a. either an extension of our /riak proxy resource (something like /riak/primary/clusters/$cluster/types/$type/buckets/$bucket/keys/$key), or
b. a replacement for the keys/$key resource under /explore, something like:
/explore/clusters/$cluster/bucket_types/$type/buckets/$bucket/primary/key
Whichever one is easier to implement.
History
This feature request is inspired by Damien Krotkine's RICON 2015 talk, Events storage and analysis with Riak at Booking.com (slides / video), as well as by frequent customer questions about doing consistent hashing on the client side. In the talk, Damien sets out the problem (the inter-node network being saturated at Booking.com's massive throughput levels) and details the team's solution -- see page 108 of the slide deck.
Specifically, they hacked/extended the Riak HTTP API to expose the general preflist/ring status object (page 111), which enabled them to do client-side consistent hashing. Combining the hashing capability with n_val=1, they were able to eliminate inter-node traffic and push the performance of their Riak cluster (for that one particular use case) well beyond its initial capabilities.
Note that Damien's approach is subtly different from the GET object preflist Riak API call introduced in Riak 2.1.2 (in riak_kv PR #1083) as a result of a mailing list answer by Charlie Voiselle. The 2.1 get-preflist call is made for a specific Riak object: the API calculates the consistent hash for that object and returns the list of nodes. To use it, however, a client needs an additional request for each read and write -- a first request to fetch the preflist for the object, and a second to actually read or write it on a specific node. Damien's approach, by contrast, lets a client fetch the general cluster preflist once (or poll for it periodically) and then perform consistent hashing itself, so it knows which node to read from or write to directly, without per-object preflist calls.
Relevance to Explorer
Implementing this capability will enable Riak Explorer to store large non-critical objects more efficiently in Riak itself. (The non-critical part here is important, because to reap the full benefits and eliminate all inter-node traffic, one would use the preflist with n_val=1, which in turn reduces the redundancy / data resilience for those objects).
This ability to store larger-than-usually-recommended objects directly in Riak will come in useful in several roadmap items:
It will serve as a foundation for storing aggregated Riak stats for various graphing and historical-statistics requirements.
It will support storing cached bucket and key lists in Riak objects instead of flat files (see issue #121, Store Cached Bucket and Key Lists in Riak objects (instead of flat files)), which is especially relevant when Explorer is installed as part of Riak (in /basho-patches) rather than as a standalone installation.
Recommended Approach
We have two options for implementing this capability in Riak Explorer.
Option 1 (minimum effort, Riak 2.1.2+ only) - Use existing "Get object preflist" Riak API
To start with, we can build on the existing Riak 2.1.2+ GET object preflist HTTP API call.
So, for each call to the "primary proxy" API endpoint, we would do:
Get the preflist for that object, which returns the list of nodes that own its primary partitions.
Pick one of those primary nodes, and make a read/write request to that node directly.
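The two steps above can be sketched as follows. The preflist URL matches the Riak 2.1.2+ endpoint; the JSON response shape (a "preflist" array of partition/node/primary entries) is an assumption based on that API, so verify it against a live cluster before relying on it.

```python
import json

def preflist_url(host, port, bucket_type, bucket, key):
    # Riak 2.1.2+ "get object preflist" endpoint for a specific object.
    return "http://%s:%d/types/%s/buckets/%s/keys/%s/preflist" % (
        host, port, bucket_type, bucket, key)

def pick_primary(preflist_body):
    """Step 2 helper: pick the node of the first primary partition from a
    preflist response body, assumed to look like
    {"preflist": [{"partition": ..., "node": "riak@host", "primary": true}, ...]}."""
    for entry in json.loads(preflist_body)["preflist"]:
        if entry.get("primary"):
            return entry["node"]
    raise ValueError("no primary partition in preflist")
```

The caller would GET `preflist_url(...)`, feed the body to `pick_primary`, and then issue the actual read or write against the returned node's own HTTP endpoint.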
The benefit of this approach is that it requires the least amount of API-side code to implement.
The drawback is that it only works on nodes running Riak 2.1.2 and later, not on the Riak 2.0 LTS series. It also requires an additional call (to fetch the object's preflist) for every read and write operation.
Option 2 (higher performance, works on 2.0 LTS) - Implement a "Get Cluster Preflist" API endpoint
If we end up using this "primary proxy" API endpoint frequently, it would make sense to optimize it using the same approach as Damien's team. (In addition, taking this approach would let Riak users take advantage of this functionality in their own app code).
Specifically:
Implement a Cluster Preflist API endpoint (possibly using riak_core_ring_manager:get_my_ring()?).
(Optionally) Poll and cache the cluster preflist above in the Explorer API app, to account for changes in node status.
For each GET/PUT request to the "primary proxy" endpoint, Explorer would perform the same consistent hash used by Riak Core on the specified object, then consult the cached preflist, and be able to determine which node to make the request to. (See also the Erlang snippet from the mailing list).
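The hashing step above can be sketched in Python. riak_core hashes the Erlang external-term encoding of the {Bucket, Key} pair with SHA-1 and maps the 160-bit result onto a ring of ring_size equal partitions; the snippet below reproduces that scheme, including a hand-rolled minimal term_to_binary for a pair of binaries. This covers default-type buckets only (typed buckets hash a nested {Type, Bucket} tuple) and is illustrative: verify its output against a live ring before relying on it.

```python
import hashlib
import struct

RING_TOP = 2 ** 160  # SHA-1 hash space, as used by riak_core

def term_to_binary_pair(bucket, key):
    """Minimal Erlang external-term encoding of {<<Bucket>>, <<Key>>}:
    version byte 131, SMALL_TUPLE_EXT (104) with arity 2,
    then two BINARY_EXT (109) entries with 4-byte big-endian lengths."""
    def binary_ext(b):
        return b"\x6d" + struct.pack(">I", len(b)) + b
    return b"\x83\x68\x02" + binary_ext(bucket) + binary_ext(key)

def partition_index(bucket, key, ring_size):
    """Map a bucket/key pair to the index of the partition that owns it:
    SHA-1 of the encoded pair, then the next partition boundary at or
    above that hash (partition i owns the range ((i-1)*inc, i*inc])."""
    h = int.from_bytes(hashlib.sha1(term_to_binary_pair(bucket, key)).digest(), "big")
    inc = RING_TOP // ring_size
    return (inc * (h // inc + 1)) % RING_TOP
```

With the cached cluster preflist in hand, the Explorer API would look up which node owns the returned partition index and send the GET/PUT straight to it.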