Add Rack awareness support #967
Conversation
Thank you for this. I've taken a quick look, and I want to make sure I understand this correctly.
Is my understanding correct?
1. Yes, exactly. It should be used responsibly, like the other options. I tested it across different ring-size/node-count/location combinations and it works quite well, but for example it won't optimize transfers between the old and new ring. The aim was not to modify the original claiming algorithm until one or more locations have been set.
The challenge is how much value there is without any guarantee. Perhaps it would be nice if there were a CLI command that would allow you to report on how location-safe the cluster plan is. Previously, some consideration had been given to resurrecting claim v3 to implement things like location awareness. The claim v3 algorithm was a radical change, treating claim as an optimisation problem. With claim v3, if you weren't happy with the optimised plan, you could reject it and roll the dice again, giving some control to the operator. Location awareness is on the to-do list for riak-core-lite, so perhaps that team may have some thoughts on this. I will add a link to the PR into the slack channel as well, to try and canvas some broader feedback.
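The kind of location-safety report suggested above could be sketched along these lines (a minimal Python illustration, not actual Riak tooling; the ring layout, `n_val`, and function name are all assumptions for the example):

```python
def location_diversity_report(owner_locations, n_val=3):
    """For each partition, count the distinct locations covered by the
    window of n_val consecutive partitions starting there (roughly the
    preference list), wrapping around the end of the ring."""
    ring_size = len(owner_locations)
    return [(i, len({owner_locations[(i + j) % ring_size] for j in range(n_val)}))
            for i in range(ring_size)]

# A ring of 8 partitions owned by nodes spread over three locations:
locs = ["loc1", "loc2", "loc3", "loc1", "loc2", "loc3", "loc1", "loc2"]
unsafe = [i for i, distinct in location_diversity_report(locs) if distinct < 3]
print(unsafe)  # → [6, 7]: the windows that wrap are not location-safe
```

This also illustrates why there is no guarantee: with 8 partitions spread over 3 locations, the windows crossing the wrap-around point cannot all span 3 distinct locations, whatever the claim algorithm does.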
Cool, thanks. There is something I'd like to clarify, with an example:
The first and last partitions are on the same node, so that could be a problem; but either way it is a problem, because the ring wraps around and makes them adjacent.
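The wrap-around point is easy to miss when eyeballing a plan: the ring is circular, so the last partition is adjacent to the first. A small sketch of that check in Python (the data and function name are hypothetical):

```python
def adjacent_same_location(owner_locations):
    """Return pairs of adjacent partition indexes that share a location,
    including the wrap-around pair of last and first partitions."""
    n = len(owner_locations)
    return [(i, (i + 1) % n) for i in range(n)
            if owner_locations[i] == owner_locations[(i + 1) % n]]

# Looks fine when read left to right, but partitions 3 and 0 are
# adjacent on the ring and both sit in loc1:
print(adjacent_same_location(["loc1", "loc2", "loc3", "loc1"]))  # → [(3, 0)]
```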
If you have a proper node-count/location setup, it will work.
I don't know if you're seeing the CI failures, but there are dialyzer issues.
The next stage is riak_test testing. There is a test group defined for core tests. If you've not used riak_test before, there are setup instructions. There hasn't been any positive feedback via slack etc. with regards to pushing for this change. There is interest in rack awareness with stricter guarantees; however, I think any complete solution is going to be such a radical change as to be unrealistic, compared to this relatively simple enhancement. There is no firm process for getting something like this approved at the moment. I can see positive aspects of the change, but would want a broader consensus behind it. @martincox do you have an opinion?
Sorry, dialyzer issues fixed.
I put riak_core in _checkouts and ran the tests.
Before I write more tests, it would be nice to know whether to continue with this solution or not.
I think it sounds like a good starting point, even with the caveat over a lack of strict guarantees. Potentially it could be developed further in the future, if anyone can invest more time into it. Makes sense to me to adopt it as is; it looks to provide some improvement. I'd be happy to see it merged. 👍
@systream just a note on timescales. I'm going to proceed with release 3.0.4 this week without this change, as I need something fixed ASAP, but once we have a riak_test we're satisfied with, I will set off the process for producing a Riak 3.0.5 which includes this rack awareness PR. I have a few other things on for the next few days, but I might have some time towards the end of next week to help with a riak_test test if required. |
Okay, cool, thanks! I have already started writing new tests in riak_test, but unfortunately I had urgent tasks to finish first.
@martinsumner I wrote tests in riak_test. I'm not totally sure whether they're appropriate; could you check them? Thanks.
I will have a look later in the week. Thanks. |
I think I'm fine with this from a test and code perspective now. Sorry for the delay, I've been distracted with other things. @systream - for others to use it, a write-up would be helpful (just some markdown in docs/). Is this something you could do? @martincox - anything to add? |
Sure, I will do the docs.
What's the next step? Is there anything I need to do?
Sorry, it is on me now. I will start work on finalising a release next week. Apologies for the delay. |
Sorry, I don't want to rush you; I just got the feeling that I might need to do something.
@systream - sorry, but some late suggestions for the docs. I think the instructions need to be extended to make it clear that locations need to be allocated to nodes sequentially in order for the claim to have the desired effect. That is to say, an allocation that starts node1 -> loc1 and continues in sequence should provide a diverse spread across locations, but a mapping that starts node1 -> loc1 and then assigns locations out of sequence will not. Also, it would be useful to have a script to check a planned, but uncommitted, cluster change.
It doesn't (or shouldn't) matter which node is where, as long as the numbers of distinct locations are good. |
Perhaps I did something wrong, but I've just checked it, and not allocating nodes in sequence caused an issue.
Whereas a sequential allocation had no such violations. I'm going to have a play around to see if I can see how this might not be working for me. |
Hang on ... I think I might have prematurely checked before transfers were complete! Sorry, my mistake. That was the case. |
This PR introduces the concept of location (you could also call it a site or availability zone). When claiming a new ring, the list of nodes is ordered taking into consideration the location of the individual nodes, so that adjacent nodes are preferably from different locations.
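That ordering can be pictured as a round-robin interleave over the location groups. The sketch below is a simplified Python illustration, not the Erlang claim code in riak_core; the function name and node data are invented for the example:

```python
def interleave_by_location(node_locations):
    """Order nodes so that consecutive entries come from different
    locations where possible, cycling round-robin over the locations."""
    groups = {}  # insertion-ordered in Python 3.7+
    for node, loc in node_locations:
        groups.setdefault(loc, []).append(node)
    queues = list(groups.values())
    ordered = []
    while any(queues):
        for q in queues:
            if q:
                ordered.append(q.pop(0))
    return ordered

nodes = [("node1", "loc1"), ("node2", "loc1"),
         ("node3", "loc2"), ("node4", "loc2")]
print(interleave_by_location(nodes))  # → ['node1', 'node3', 'node2', 'node4']
```

With this ordering, consecutive claims alternate between loc1 and loc2, so adjacent partitions land in different locations whenever the node counts per location allow it.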