Back propagation in MCTS tree #47

Blackmoor · 2018-03-19T16:59:12Z

Many games allow the same game state to be reached by different sets of moves, or the same set of moves performed in a different order.

Because the current MCTS implements its 'tree' as a 'set', the next state may already be in the 'tree' if it was reached earlier via another sequence of moves. The current back propagation code updates the states we traversed to get here, but it can leave the 'tree' in an inconsistent state: other branches that lead to the same state now have a child whose updated values have not been back-propagated to them.
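To make the inconsistency concrete, here is a minimal sketch. It is not the project's MCTS code: the `key` and `backpropagate` helpers and the toy rule that a position is just the set of moves played are assumptions made for illustration.

```python
from collections import defaultdict

Ns = defaultdict(int)   # visit count per state key (hypothetical table)
Nsa = defaultdict(int)  # visit count per (state key, move) edge

def key(moves):
    # Toy assumption: move order does not matter, so two different
    # move orders collapse to the same entry in the 'set'.
    return "".join(sorted(moves))

def backpropagate(path):
    # Update only the edges and states on the path that was actually walked.
    for i, move in enumerate(path):
        Nsa[(key(path[:i]), move)] += 1
        Ns[key(path[:i + 1])] += 1

backpropagate(["x", "y"])  # reaches state "xy" through "x"
backpropagate(["y", "x"])  # reaches the same state "xy" through "y"

print(Ns["xy"], Nsa[("x", "y")], Nsa[("y", "x")])
```

The shared state ends up with more visits than either edge leading into it, so each parent holds statistics that lag behind its child.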

An additional problem is that it is not always safe to assume that identical game states reached via different moves are equivalent (take 'go' for example), so to be a generic game-playing engine we probably have to treat them as different states and implement the MCTS as a 'tree' and not as a 'set' (a Python dict keyed on the game state).

sharpobject · 2018-05-31T21:04:49Z

I thought about this a bit and it seems like Blackmoor is correct.

Assume three states A, B, C and moves a, b where game_step(A,a)=C and game_step(B,b)=C. When you run search(C) a bunch of times, you get some sequence of values v_n=[v_1, v_2, v_3, v_4, v_5 ...], each of which is the raw nn output for some state resulting from some action sequence starting with the action at C that has the highest upper confidence bound. So if the sequence of actions selected from C begins [d,e,d,d,e], then v_2 and v_5 come from E=game_step(C,e) and v_1, v_3, and v_4 come from D=game_step(C,d). As written, the calls to search(C) coming from A and from B are interleaved, so if, for example, A makes the 1st, 3rd, and 4th calls and B makes the 2nd and 5th, we get all the values from D into Q(A,a) and all the values from E into Q(B,b), rather than each of them being an average of a prefix of v_n as intended.
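The same scenario with numbers, as a small sketch. The values and the interleaving of callers are made up for illustration, and the sign flip between players is ignored.

```python
values = [0.9, -0.5, 0.8, 0.7, -0.6]   # made-up v_1..v_5 emitted by search(C)
actions = ["d", "e", "d", "d", "e"]    # action chosen at C on each call
callers = ["A", "B", "A", "A", "B"]    # hypothetical: which parent made each call

def mean(xs):
    return sum(xs) / len(xs)

# What each parent actually receives when calls are interleaved.
observed = {p: [v for v, c in zip(values, callers) if c == p] for p in "AB"}
# What a tree-based search would give: the first n values of the sequence.
intended = {p: values[:len(observed[p])] for p in "AB"}

for p in "AB":
    print(p, "observed", mean(observed[p]), "intended", mean(intended[p]))
```

With this interleaving, A's estimate is built only from values below D and B's only from values below E, and neither matches the prefix average.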

I think this is fixable and there is a valid optimization to salvage here:

each node should cache the sequence of values that search(s) emits, Ws[s]
replace search(next_s) with search(next_s, Nsa[(s,a)])
if search(s, idx) is called with idx < len(Ws[s]) just return Ws[s][idx]

This causes the values used at all steps of the search to be the same as the values used in the tree version, but saves some nn evaluations.
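A rough sketch of the three steps above, under some assumptions: `expensive_search` is a placeholder for the real recursive search and network evaluation, `visit` and `Wsa` are hypothetical helpers, and the sign flip between players is left out.

```python
from collections import defaultdict

Ws = defaultdict(list)    # state -> values emitted by search(state), in order
Nsa = defaultdict(int)    # (state, action) -> visit count
Wsa = defaultdict(float)  # (state, action) -> sum of backed-up values
fresh_calls = 0           # how often the expensive path actually ran

def expensive_search(s):
    # Placeholder for the real search body; returns an arbitrary
    # deterministic value so the example is reproducible.
    global fresh_calls
    fresh_calls += 1
    return 1.0 / (len(Ws[s]) + 1)

def search(s, idx):
    if idx < len(Ws[s]):
        return Ws[s][idx]          # replay a value this state already emitted
    assert idx == len(Ws[s])       # an edge count never runs ahead of the cache
    v = expensive_search(s)
    Ws[s].append(v)
    return v

def visit(parent, action, child):
    # search(next_s) becomes search(next_s, Nsa[(s, a)])
    v = search(child, Nsa[(parent, action)])
    Nsa[(parent, action)] += 1
    Wsa[(parent, action)] += v
    return v

# Same interleaving as in the A, B, C example: A calls three times, B twice.
for parent, action in [("A", "a"), ("B", "b"), ("A", "a"), ("A", "a"), ("B", "b")]:
    visit(parent, action, "C")

print(fresh_calls, Ws["C"], Wsa[("B", "b")])
```

Both parents now accumulate a prefix of Ws["C"], and the expensive path runs only as many times as the busiest parent visited the child.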

(This optimization relies more strongly on the assumption that game_step is deterministic than any of the existing code, so games with randomness like Splendor have some extra bookkeeping to do here, but this should work as I wrote it here for abstract games like Chess or Othello)

To avoid the second concern, games with rules about repetitions and the like would have to include enough of that information in their string representations to avoid this sort of state aliasing. In the extreme case, including the entire move history would work and make this MCTS behave like a tree-based one.
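As an illustration of the options for the state key, here is a small sketch. All function names are hypothetical, and the toy rule that a position is the set of moves played is an assumption, not how any real game in the project encodes its board.

```python
def play(moves):
    # Toy rule: the position is just the set of moves played so far.
    return tuple(sorted(moves))

def position_key(board):
    # Position only: different histories alias to the same key.
    return "".join(board)

def repetition_key(board, times_seen):
    # Position plus the rule-relevant history (here, a repetition count).
    return f"{position_key(board)}|rep={times_seen}"

def history_key(moves):
    # Extreme case: the full move history, so the 'set' behaves like a tree.
    return ",".join(moves)

path1, path2 = ["x", "y"], ["y", "x"]
print(position_key(play(path1)) == position_key(play(path2)))  # aliased
print(history_key(path1) == history_key(path2))                # kept apart
print(repetition_key(play(path1), 1) == repetition_key(play(path1), 2))
```

The more history goes into the key, the fewer transpositions are shared, so a game would want to include only what its rules actually depend on.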

sharpobject · 2018-05-31T21:26:01Z

After applying my proposed optimization someone might reasonably object, "Why should we make decisions based on a Q(B,b) that uses an average of the first 2 evaluations of search(C) when the first 3 evaluations are available?"

There is some useful information that I am still failing to take advantage of, but it seems annoying to do the bookkeeping required to take advantage of it.
