
Implement a proxy API endpoint to read/write objects from primary partitions only #120

Open
dmitrizagidulin opened this issue Dec 17, 2015 · 0 comments

The goal is to enable client-side consistent hashing (even if it just happens on the Explorer API), so that a user can GET or PUT an object directly to a node with a primary partition.
Optionally combined with n_val=1, this enables more efficient reading and writing of larger objects to a Riak cluster without the additional inter-node network traffic (though at the obvious cost of reduced redundancy).

The actual API endpoint should probably be:

a. either an extension of our /riak proxy resource (something like /riak/primary/clusters/$cluster/types/$type/buckets/$bucket/keys/$key), or

b. a replacement for the keys/$key resource under /explore, something like:
/explore/clusters/$cluster/bucket_types/$type/buckets/$bucket/primary/key

Whichever one is easier to implement.

History

This feature request is inspired by Damien Krotkine's RICON 2015 talk, Events storage and analysis with Riak at Booking.com (slides / video), as well as by frequent customer questions about doing consistent hashing on the client side. In the talk, Damien set out the problem (the inter-node network being saturated at Booking.com's massive throughput levels) and detailed the team's solution -- see page 108 of Damien's RICON 2015 slide deck.

Specifically, they hacked/extended the Riak HTTP API to expose the general preflist/ring status object (page 111), which enabled them to do client-side consistent hashing. Combining the hashing capability with n_val=1, they were able to eliminate inter-node traffic and push the performance of their Riak cluster (for that one particular use case) way beyond its initial capabilities.

Note that Damien's approach is subtly different from the GET object preflist Riak API call introduced in Riak 2.1.2 (in riak_kv PR #1083) as a result of a mailing list answer by Charlie Voiselle. The 2.1 get-preflist call is made for a specific Riak object -- the API calculates the consistent hash for that object and just returns the list of nodes. Using it therefore requires an additional request for each object read and write: a first request to fetch the preflist for that object, and a second request to actually read or write it on a specific node. Damien's approach, by contrast, lets a client fetch the general cluster preflist once (or poll for it periodically) and then perform the consistent hashing itself, client-side, so it knows which node to read from or write to directly, without a per-object preflist call.

Relevance to Explorer

Implementing this capability will enable Riak Explorer to store large non-critical objects more efficiently in Riak itself. (The non-critical part here is important: to reap the full benefits and eliminate all inter-node traffic, one would use the preflist with n_val=1, which in turn reduces the redundancy / data resilience of those objects.)

This ability to store larger-than-usually-recommended objects directly in Riak will be useful for several roadmap items.

Recommended Approach

We have two options for implementing this capability in Riak Explorer.

Option 1 (minimum effort, Riak 2.1.2+ only) - Use existing "Get object preflist" Riak API

To start with, we can build on the existing Riak 2.1.2+ GET object preflist HTTP API call.
So, for each call to the "primary proxy" API endpoint, we would do:

  1. Get the preflist for that object (which returns the list of nodes that own primary partitions):

    GET /riak/clusters/localdev/types/default/buckets/users/keys/user123/preflist
    {
      "preflist": [
        {
          "node": "[email protected]",
          "partition": 29,
          "primary": true
        },
        {
          "node": "[email protected]",
          "partition": 30,
          "primary": true
        },
        {
          "node": "[email protected]",
          "partition": 31,
          "primary": true
        }
      ]
    }
    
    
  2. Pick one of those primary nodes, and make a read/write request to that node directly.

The benefit of this approach is that it requires the least amount of API-side code to implement.

The drawback is that it only works on nodes running Riak 2.1.2 and later, not on the Riak 2.0 LTS series. It also requires an additional call (to fetch the object's preflist) for every read and write operation.

Option 2 (higher performance, works on 2.0 LTS) - Implement a "Get Cluster Preflist" API endpoint

If we end up using this "primary proxy" API endpoint frequently, it would make sense to optimize it using the same approach as Damien's team. (In addition, taking this approach would let Riak users take advantage of this functionality in their own app code).

Specifically:

  1. Implement a Cluster Preflist API endpoint (possibly using riak_core_ring_manager:get_my_ring() ?). For example:

    curl http://$host:9000/explore/clusters/$cluster/preflist
    {
      "0"                                              : "riak1@...",
      "5708990770823839524233143877797980545530986496" : "riak2@...",
      "11417981541647679048466287755595961091061972992": "riak3@...",
      "17126972312471518572699431633393941636592959488": "riak4@...",
      ...etc
    }
    
    
  2. (Optionally) Poll and cache the cluster preflist above in the Explorer API app, to account for changes in node status.

  3. For each GET/PUT request to the "primary proxy" endpoint, Explorer would perform the same consistent hash used by Riak Core on the specified object, then consult the cached preflist, and be able to determine which node to make the request to. (See also the Erlang snippet from the mailing list).
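The hash-and-lookup in steps 1-3 could be sketched as below, with loud caveats: the ring shape mirrors the hypothetical cluster preflist response above (partition index mapped to node name), the node names are made up, and the hash function is a plain SHA-1 stand-in rather than Riak Core's actual chash (which hashes the Erlang binary term {Bucket, Key}), so it is not wire-compatible with a real cluster -- it only demonstrates the lookup mechanics:

```python
import bisect
import hashlib

# Riak's ring key space is the SHA-1 output space: 2^160 values.
RING_SIZE = 2 ** 160

def chash_key(bucket, key):
    # Illustrative stand-in for riak_core's consistent hash; real Riak
    # hashes the Erlang term {Bucket, Key}, so this is NOT compatible
    # with an actual cluster's ring.
    digest = hashlib.sha1(bucket.encode() + b"/" + key.encode()).digest()
    return int.from_bytes(digest, "big")

def owning_node(cluster_preflist, bucket, key):
    # cluster_preflist: {partition_index (int): node_name}, i.e. the
    # cluster preflist above with its string keys parsed to integers.
    indexes = sorted(cluster_preflist)
    h = chash_key(bucket, key)
    # The first partition index >= the hash owns the key; hashes past
    # the last index wrap around to the first partition.
    pos = bisect.bisect_left(indexes, h) % len(indexes)
    return cluster_preflist[indexes[pos]]

# Hypothetical four-node ring with evenly spaced partition indexes.
ring = {
    0: "[email protected]",
    RING_SIZE // 4: "[email protected]",
    RING_SIZE // 2: "[email protected]",
    3 * RING_SIZE // 4: "[email protected]",
}
node = owning_node(ring, "users", "user123")
```

Because the lookup is pure and local, Explorer (or any client) can route every GET/PUT directly to the owning node using only the cached ring, with no per-object preflist request.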
