This is an algorithm to play all games of Mastermind on a GPU. All possible games are played at once, in parallel, arranging per-game work and data effectively for the GPU. By doing so, we can also compute the next best guess for (often large) groups of games just once, and we can further gather work across games into units that make the best utilization of GPU resources. All game state is kept on-device, with minimal copying back to the host each round and reduced synchronization overhead. Games are batched to improve scheduling and occupancy.
This document also outlines a CUDA implementation of the algorithm and discusses details and tradeoffs. This implementation represents a significant speedup vs previous serial and GPU versions, allowing larger games to be played in reasonable times.
Prior methods are adaptations of game-at-a-time approaches and use the GPU to accelerate the core scoring loop, or attempt to play a single game at a time on the device. These methods have limited effectiveness, with high CPU and host-to-device overhead, and make poor use of GPU resources per-game.
I've provided an optimized CUDA implementation, a simple reference CPU version, and a more optimized CPU version for comparison. Results are provided for many games up to 8 pins or 15 colors.
Multiple strategies are implemented: Knuth [1], Expected Size [2], Entropy [2], and Most Parts [3].
For a 7-pin, 7-color game using Knuth's algorithm, my previous best game-at-a-time CUDA implementation executed in 24.3s; the current algorithm executes in 14.2s.
The core of most interesting Mastermind algorithms comes from the method described by Knuth in [1]. It involves scoring all codewords against every remaining possible solution, grouping the possibilities into subsets by score, and choosing the codeword which minimizes the size of the largest subset.

Algorithms which play a single game at a time will play the next guess, reduce the set of possible solutions based on the score received, compute a new guess from that reduced set, and repeat until the secret is found.
A reasonable first approach to solving Mastermind on the GPU is to move the most expensive portion of the CPU-based algorithm into a GPU kernel, accelerating that portion, and then move more pieces as warranted. This can work quite well, especially since finding the next guess is so extremely expensive and amenable to parallelism.
However, these approaches hit practical limits of memory bandwidth, both on-device and host-to-device, and of work scheduling, since they are unable to group work effectively across games.
When the same initial guess is played for all games (each with an unknown but unique secret), the resulting scores partition the games into disjoint regions with a common property: all games within a region have the same reduced set of possible solutions. Regions can be identified by the ordered sequence of scores which formed them, since every game in a region has received exactly the same scores so far.
Because every such region is disjoint from all others at a given depth, they may all be computed in parallel without synchronization or regard to order. Regions of similar size, or with similar other properties, may be grouped and dispatched to the GPU in any way that best exploits the nature of the device at hand.
Regions are given an id which is simply this ordered sequence of scores.
The algorithm, at a high level:

1. Identify each game by its secret, and set $G = \{g_0, g_1, \ldots, g_n\}$
2. Set the next guess for each game $N_g$ to the precomputed starting guess
3. Set the region id for every game $R_g = ()$
4. While there are games in $G$:
    1. Score each game's next guess $N_g$ and append that score to $R_g$
    2. Remove any game just won from $G$
    3. For each unique, un-won region $R$:
        1. Using $PS = R$, compute a new guess $c$ for $R$
        2. Update $N_g = c$ for each game in $R$
This algorithm has a few important properties:
- Order in each step doesn't matter. For example, in step 4.i we can score games and build region ids in any order.
- Each step is parallelizable, with no shared state between games or regions within each step.
- We can order and group work by region size in a variety of different ways depending on how we want to use GPU resources. Other properties than region size may also be useful, though region size is a good way to pack the $n^2$ work onto the GPU.
- Almost all game state can be held on the GPU, which minimizes copies between host and device memory, and minimizes synchronization barriers.
Note that any algorithm can be applied in step 4.iii.a, such as those from Knuth [1], Ville [2], etc.
A simple, serial, CPU-only, C++ implementation of this algorithm is given in solver_cpu_reference.inl. There are no optimizations, and it should be fairly readable and easily proved correct, if a little slow.
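To make the structure concrete, here is a minimal C++ sketch of that loop. The types and helpers (playAllGames, scoreOf, bestGuessFor, the winning score value) are hypothetical stand-ins, not the names used in solver_cpu_reference.inl:

```cpp
#include <cstdint>
#include <map>
#include <vector>

using Codeword = uint32_t;
using Score = uint8_t;
using RegionID = std::vector<Score>;  // ordered sequence of scores received so far

Score scoreOf(Codeword guess, Codeword secret);          // assumed scoring helper
Codeword bestGuessFor(const std::vector<Codeword>& ps);  // e.g., Knuth's minimax over PS

void playAllGames(const std::vector<Codeword>& allCodewords, Codeword initialGuess,
                  Score winningScore) {
  struct Game {
    Codeword secret;
    Codeword nextGuess;
    RegionID region;
  };

  // One game per possible secret, all starting with the same precomputed guess.
  std::vector<Game> games;
  for (Codeword secret : allCodewords) games.push_back({secret, initialGuess, {}});

  while (!games.empty()) {
    // Steps 4.i and 4.ii: score every game's next guess, then drop games just won.
    std::vector<Game> stillPlaying;
    for (Game& g : games) {
      Score s = scoreOf(g.nextGuess, g.secret);
      g.region.push_back(s);
      if (s != winningScore) stillPlaying.push_back(g);
    }
    games.swap(stillPlaying);

    // Step 4.iii: group games by region id; each region shares one new guess.
    std::map<RegionID, std::vector<size_t>> regions;
    for (size_t i = 0; i < games.size(); ++i) regions[games[i].region].push_back(i);

    for (auto& [id, members] : regions) {
      std::vector<Codeword> ps;  // PS = the secrets of the games in this region
      for (size_t i : members) ps.push_back(games[i].secret);
      Codeword c = bestGuessFor(ps);
      for (size_t i : members) games[i].nextGuess = c;
    }
  }
}
```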
This is broken down into two phases. In Phase 1 we play a turn for all active games and organize the resulting regions, and in Phase 2 we find new guesses to play on the next turn.
Note: this has been tailored for compute capability 8.6, CUDA 11.7, running on an NVIDIA GeForce RTX 3070. Other GPUs might have capabilities which would suggest different tuning.
Every game gets a stable index, which is the same as the index of its winning secret in the list of all codewords.
There is a sequence of next moves per turn, one entry per game, and a single device vector holding the region id for every game.
The next move for each game is played and scored in parallel for all active games. There is no ordering requirement. Region ids are updated with the new score.
Next, a list of start offsets and lengths for the regions is built: reduce_by_key gets us run lengths, then an exclusive_scan builds start offsets. The reduction returns the count of regions to the CPU, which is a second synchronization barrier.
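As an illustration, this step maps naturally onto Thrust. The following is a sketch only, with hypothetical names, assuming the per-game region ids are already sorted so that each region's games are adjacent:

```cpp
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

// regionIDs: one id per active game, sorted so equal ids are adjacent.
// uniqueIDs, lengths, and startOffsets must be pre-sized to hold one entry per region.
int buildRegionRuns(const thrust::device_vector<unsigned long long>& regionIDs,
                    thrust::device_vector<unsigned long long>& uniqueIDs,
                    thrust::device_vector<int>& lengths,
                    thrust::device_vector<int>& startOffsets) {
  // Run-length encode the sorted ids: one (id, count) pair per region.
  auto ends = thrust::reduce_by_key(regionIDs.begin(), regionIDs.end(),
                                    thrust::constant_iterator<int>(1),
                                    uniqueIDs.begin(), lengths.begin());
  int regionCount = static_cast<int>(ends.second - lengths.begin());

  // An exclusive scan of the lengths gives each region's start offset.
  thrust::exclusive_scan(lengths.begin(), lengths.begin() + regionCount,
                         startOffsets.begin());

  return regionCount;  // reading this on the host is the synchronization point
}
```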
If the case equivalence optimization is enabled, discussed below, we build the zero and free sets for every active
region with buildZerosAndFrees
. The actual code is somewhat complicated, with a reduction fused with a few transforms
to build the zero set, a simpler transform to build the free set, and a zipped transform to combine the sets and adjust
free based on zero. The sets are represented by bit fields, each bit set representing a color present.
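For illustration, the bit-field representation amounts to something like the following. This is an assumed layout, not the actual functors, which are fused into the transforms described above:

```cpp
#include <cstdint>

// One bit per color: bit (c - 1) is set when color c is present in the set.
// 16 bits covers games of up to 16 colors.
using ColorSet = uint16_t;

__host__ __device__ constexpr ColorSet withColor(ColorSet s, int color) {
  return s | static_cast<ColorSet>(1u << (color - 1));
}

__host__ __device__ constexpr bool hasColor(ColorSet s, int color) {
  return ((s >> (color - 1)) & 1u) != 0;
}

// Combining the sets, e.g. removing every zero color from the free set, is then
// just a couple of bitwise ops per region.
__host__ __device__ constexpr ColorSet removeColors(ColorSet s, ColorSet remove) {
  return s & static_cast<ColorSet>(~remove);
}
```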
Next,
Finally, some pre-optimization is done to reduce indirection when finding new guesses. Region ids are translated to game indexes and codewords for secrets.
This ends Phase 1. We now have a sorted, coalesced vector of regions, offsets and lengths into it, and other optimized data prepared to help later. We have the number of regions and the lengths of the regions, in order, on the CPU.
This sounds like a lot of work, but it is insignificant compared to the time spent in Phase 2. As an example, for a large game like 7p7c, the largest round spends 0.0075s in Phase 1 and 4.0350s in Phase 2. There are a number of obvious optimization opportunities in the current version of the code, but given the trivial time spent vs. the extra complexity I have chosen to ignore all of them.
Next, we search for a new guess for every active region, and record those guesses in the next moves vector.
We consider regions of size 1 and 2 "tiny", and these are all handled at once by a single kernel. This kernel, nextGuessTiny, is a single block of 128 threads, with each thread handling a tiny region in parallel. The threads stride through these regions to look up codewords in parallel and update the next moves vector.
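A sketch of what such a kernel could look like, with assumed data layout and names (not the actual nextGuessTiny): for a size-1 region the one remaining codeword must be the secret, and for a size-2 region either member is an optimal guess.

```cuda
#include <cstdint>

// Launched as nextGuessTinySketch<<<1, 128>>>(...). regionStarts/regionLengths index
// into the coalesced, per-region list of game indexes; secrets holds each game's
// secret codeword; nextMoves is indexed by game.
__global__ void nextGuessTinySketch(int tinyRegionCount, const int* regionStarts,
                                    const int* regionLengths, const int* gameIndexes,
                                    const uint32_t* secrets, uint32_t* nextMoves) {
  // A single block of threads strides over all tiny regions.
  for (int r = threadIdx.x; r < tinyRegionCount; r += blockDim.x) {
    int start = regionStarts[r];
    int len = regionLengths[r];  // 1 or 2 for tiny regions

    // Guess the first remaining possibility: for size 1 this wins immediately,
    // for size 2 it either wins or identifies the other codeword for next turn.
    uint32_t guess = secrets[gameIndexes[start]];
    for (int i = 0; i < len; ++i) {
      nextMoves[gameIndexes[start + i]] = guess;
    }
  }
}
```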
All other regions are considered "big" and handled the same way. This includes seemingly tiny regions like size 3, and regions no larger than the number of possible scores for a game (e.g., 14 for 4p games). I implemented a raft of different optimizations and custom kernels for each of these cases and tested them all, and much like all of Phase 1 the gains are insignificant vs. the work done on larger regions. Again, I've left them all out in favor of simplicity.
If the case equivalence optimization is enabled, discussed below, the big regions are sorted by their Zero/Free colors. This allows us to generate an ACr for multiple regions at once and share it well. A fixed-size ACr buffer is used to keep memory bounded, and the buffer is filled eagerly.
All big regions are handled in chunks of 256 at a time. A simple kernel, nextGuessForRegions, is launched which uses the region size to decide which kernels to use to search for the best next guess. Only the regions which have an ACr generated are processed; once those are done we build more ACr and continue with the next group of big regions until they're all done.
The main kernel, subsettingAlgosKernel, is launched when no other optimization can be performed, and we must consider every codeword in AC (or ACr). Each thread considers a single codeword from that set, scoring it against every possible secret in the region and measuring how good a guess it would be. A second kernel, reduceBestGuess, is used to reduce these to the single best guess. This is written to the next moves vector for every game in the region.
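Conceptually, each thread's work for Knuth's strategy looks something like the sketch below. The names are hypothetical, scoreOf is assumed to return a packed score index, and the real kernel keeps its counters in shared memory and is templated over strategy and counter type:

```cuda
#include <cstdint>

constexpr int kMaxScores = 45;  // packed score slots for an 8-pin game

__device__ uint32_t scoreOf(uint32_t guess, uint32_t secret);  // assumed helper

// One thread evaluates one candidate codeword against every possible secret in
// the region and returns the size of the largest subset it would produce.
__device__ uint32_t largestSubsetFor(uint32_t candidate, const uint32_t* regionSecrets,
                                     int regionSize) {
  uint32_t counts[kMaxScores] = {};  // the real kernel holds these in shared memory
  for (int i = 0; i < regionSize; ++i) {
    counts[scoreOf(candidate, regionSecrets[i])]++;
  }
  uint32_t worst = 0;
  for (int s = 0; s < kMaxScores; ++s) {
    if (counts[s] > worst) worst = counts[s];
  }
  return worst;  // Knuth's strategy prefers the candidate minimizing this value
}
```

A reduction over all candidates' values, keeping the smallest, then yields the guess played for the region.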
If the region size is less than or equal to the total number of possible scores then we have the opportunity to perform
the fully discriminating
optimization discussed below. The fullyDiscriminatingOpt
kernel is launched for these regions with just 32 threads.
Each thread considers a single codeword from the region's remaining possibilities; if no fully discriminating codeword is found, the region falls back to subsettingAlgosKernel.
The chunks of big regions are stacked up into the same CUDA stream so that a fixed temporary space can be used for the
storage between subsettingAlgosKernel
and reduceBestGuess
.
Once all kernels are launched we synchronize for the last time this round, waiting for all next guesses to be computed.
We now have next guesses updated for all active games in the next moves vector, ready to be played next round.
There are various well-known algorithms for playing Mastermind, and all the interesting ones are centered around computing scores of one codeword vs every element of $PS$ and grouping the members of $PS$ into subsets by score. The maximum number of scores, per Ville (2013) [2], is $\frac{p(p+3)}{2}$ for a game with $p$ pins: 14 for 4-pin games and 44 for 8-pin games.
That seemingly small number of counters quickly becomes large. Significant performance improvement comes from storing these counters in shared GPU memory, even though the counters don't need to be shared between threads. Such memory is typically limited to kilobytes per streaming multiprocessor (SM), and on an NVIDIA RTX 3070 the default effective shared memory size per thread block is 47KiB. With 32-bit counters, 45 subsets yields an upper limit of 267 threads. Rounded down to the warp size of 32, that's a practical limit of 256 concurrent threads per SM, which is not enough to keep each processor busy each cycle, resulting in suboptimal occupancy and thus poor GPU utilization. The result is blocks which are ready to run, but cannot be scheduled due solely to a lack of shared memory.
We need a counter for each possible score value, large enough to count up to, at most, the size of the region being searched. Most regions are far smaller than 65,536, and many are smaller than 256, so 16-bit or even 8-bit counters suffice. Specializing the compute kernels with a properly sized counter type allows 2x or 4x the number of concurrent threads, and yields a significant improvement for these games.
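A sketch of how that dispatch might look, with hypothetical names; the point is simply that a region of at most 255 possibilities never needs more than 8-bit counters, and at most 65,535 never more than 16-bit:

```cuda
#include <cstdint>

// Assumed kernel launcher, templated on the subset-counter type; smaller counters
// mean less shared memory per thread and therefore more resident threads per SM.
template <typename SubsetSizeT>
void launchSubsettingKernel(int largestRegionSize /*, ...other launch args... */);

void launchForChunk(int largestRegionSize) {
  if (largestRegionSize <= UINT8_MAX) {
    launchSubsettingKernel<uint8_t>(largestRegionSize);
  } else if (largestRegionSize <= UINT16_MAX) {
    launchSubsettingKernel<uint16_t>(largestRegionSize);
  } else {
    launchSubsettingKernel<uint32_t>(largestRegionSize);
  }
}
```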
We can pack counters for subset sizes tightly and index them directly with the score by using the packed scores described in Packed Indices for Mastermind Scores. These score values cost a little more to compute, but the overhead is minor. And these unusual score values never escape the GPU; they are only used to index into the subset size counters, so it's fine that they're completely different from the scores used elsewhere.
The Most Parts algorithm [2] doesn't even need to compute subset sizes, only record whether a subset is used, and thus a single bit per score will do for all sizes of $PS$.
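For example, with packed scores as indexes, the whole tally for one candidate can be a single 64-bit word (a sketch; scoreOf is an assumed helper returning a packed score index below 64):

```cuda
#include <cstdint>

__device__ uint32_t scoreOf(uint32_t guess, uint32_t secret);  // assumed helper

// Most Parts only needs to know whether a subset is non-empty, so set one bit per
// score seen and count the bits at the end.
__device__ int countParts(uint32_t candidate, const uint32_t* regionSecrets,
                          int regionSize) {
  uint64_t used = 0;  // bit s is set when some secret in the region scores s
  for (int i = 0; i < regionSize; ++i) {
    used |= 1ull << scoreOf(candidate, regionSecrets[i]);
  }
  return __popcll(used);  // the number of non-empty subsets
}
```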
There is a good optimization due to Ville (2013) [2]: if a codeword produces one subset for every member of $PS$, i.e., every remaining possibility receives a distinct score, then that codeword fully discriminates the region and the score it receives will identify the secret exactly.
A separate kernel, fullyDiscriminatingOpt
, is launched for such regions which is optimized for small sizes. If it
identifies a fully discriminating codeword, it is used as the next guess for the region, otherwise the larger subsetting
kernel is launched to perform a complete search.
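The check itself is cheap: a candidate is fully discriminating exactly when no two members of the region share a score against it, which can be detected with the same one-bit-per-score trick (a sketch; scoreOf is an assumed packed-score helper):

```cuda
#include <cstdint>

__device__ uint32_t scoreOf(uint32_t guess, uint32_t secret);  // assumed helper

// True when 'candidate' places every secret in the region into its own subset,
// i.e., no two members of PS receive the same score against it.
__device__ bool isFullyDiscriminating(uint32_t candidate, const uint32_t* regionSecrets,
                                      int regionSize) {
  uint64_t used = 0;
  for (int i = 0; i < regionSize; ++i) {
    uint64_t bit = 1ull << scoreOf(candidate, regionSecrets[i]);
    if (used & bit) return false;  // two secrets collide in the same subset
    used |= bit;
  }
  return true;
}
```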
A previous CPU implementation also looked for fully discriminating codewords in the full subsetting kernel, performing a simple reduction to select the lexically first such codeword and play it. This saved significant CPU time each round. However, in the current GPU implementation, playing all games at once, the overhead to record and reduce this is strictly slower than simply playing the result of the normal reduction. Multiple approaches were tried, but in every case the extra overhead of the comparisons or the extra memory defeated the purpose, so it has been left out of the current implementation.
Another good optimization due to Ville (2013) [2] is to exploit case equivalence in codewords based on colors not yet played (free colors) and colors which cannot possibly be part of the solution (zero colors). This can lead to a significant reduction in the number of codewords which need to be considered when searching for a new guess, producing a reduced set ACr for each region.
Currently, this optimization is only applied when the region size is above the same threshold used by SolverCPUFaster: 256.
Also note that this is applied after the fully discriminating optimization described above. But, because it's not clear ahead of time whether the FD opt will apply, the reduced ACr is still built for these regions.
The work done to both compute the Zero and Free sets, and to build every necessary
While each
The pre-computed
The current implementation takes a fairly straightforward approach to completing big regions with their
It's also possible, if not likely, that this optimization isn't necessary for
Note that while pre-computed
Possible
7p7c | Time (s) | Scores |
---|---|---|
Original | 3.5283 | 2,594,858,890,338 |
CE Opt | 2.6222 | 1,688,549,605,473 |
Speedup | 1.35x | 1.54x |
8p8c | Time (s) | Scores |
---|---|---|
Original | 1,974 | 1,346,760,512,102,540 |
CE Opt | 955 | 701,717,829,399,382 |
Speedup | 2.07x | 1.92x |
And here are log scale plots of the scores performed in all games to-date, with and without this optimization.
Recall that the optimization is only enabled for regions above the size threshold.
The scoring function used is based on the hand-vectorized version in codeword.inl. The first portion, computing the exact matches, works essentially the same way and ends in a popcount.
The second part, computing all hits based on color counts, changed a bit. GPUs provide data types which pack 8-bit integers into vectors of 4 values, and automatically turn common operations on them like addition and minimum into vector ops. Minimums of the pair of 16 color counts are taken with four such vector ops, then the minimums are reduced with a series of a few vector additions.
This ends up being very efficient and, like the CPU version, far faster than any attempts to memoize results for reuse later.
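To make the vector-op description concrete, the color-count portion could be expressed with CUDA's per-byte SIMD intrinsics roughly as below. This is a sketch for a 16-color game with counts packed into a uint4, not the actual kernel code:

```cuda
// Sum of min(countsA[c], countsB[c]) over 16 colors, each count stored in one byte.
// This is the "all hits" (black + white) portion of the score.
__device__ int allHits16(uint4 countsA, uint4 countsB) {
  // Per-byte minimum of the two packed count vectors: four SIMD ops.
  unsigned m0 = __vminu4(countsA.x, countsB.x);
  unsigned m1 = __vminu4(countsA.y, countsB.y);
  unsigned m2 = __vminu4(countsA.z, countsB.z);
  unsigned m3 = __vminu4(countsA.w, countsB.w);

  // Reduce with per-byte adds; per-color counts are at most 8, so bytes can't overflow.
  unsigned s = __vadd4(__vadd4(m0, m1), __vadd4(m2, m3));

  // Horizontal sum of the four remaining byte lanes.
  return (s & 0xff) + ((s >> 8) & 0xff) + ((s >> 16) & 0xff) + (s >> 24);
}
```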
A codeword up to 8 pins is encoded in a single 32-bit value, with each pin up to 15 colors represented by 4 bits. Color counts are pre-computed when forming the codewords and are encoded with 8 bits per count. For a 15 color game we need 120 bits, so it's encoded into a 128-bit value. However, half the games we wish to play have 8 or fewer colors, wasting half of the color count memory.
The current implementation specializes the Codeword
type based on game size, and uses 64-bit values for packed color
counts on smaller game sizes. This is a significant win in terms of memory not only because of the obvious savings of 8
bytes per codeword, but also due to alignment requirements on the type which resulted in much more waste. This yields
better coalescing of global memory requests, increased useful data with each request, etc.
Finally, halving the packed color counts also halves the number of vector ops needed to compute all hits in the scoring function, for an extra time savings.
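A sketch of that kind of specialization is below; the names and layout are illustrative only, and the real Codeword type carries more than is shown here:

```cpp
#include <cstdint>
#include <type_traits>

// Pick the smallest packed color-count representation that fits the game:
// 8 bits per color count, so up to 8 colors fits in 64 bits, more needs 128.
template <uint8_t PIN_COUNT, uint8_t COLOR_COUNT>
class Codeword {
 public:
  using ColorCountsT =
      std::conditional_t<(COLOR_COUNT <= 8), uint64_t, unsigned __int128>;

  uint32_t pins;             // up to 8 pins at 4 bits per pin
  ColorCountsT colorCounts;  // 8-bit count per color, pre-computed
};

// On a typical 64-bit ABI the small variant is 16 bytes and the large one 32, due to
// both the extra counts and the 16-byte alignment of the 128-bit type.
static_assert(sizeof(Codeword<4, 6>) < sizeof(Codeword<8, 15>), "smaller games pack tighter");
```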
There are two interesting results from this algorithm. First, the number of guesses needed to win a game configuration, average and maximum, and second the tree of guesses played and scores.
All of this is captured in the
There are other interesting stats as a byproduct of the implementation, e.g. number of scoring operations, time spent, region counts at each level, etc. These are accumulated in device memory and extracted afterward as well.
Footnotes
1. D.E. Knuth. The Computer as Master Mind. Journal of Recreational Mathematics, 9(1):1–6, 1976.
2. Geoffroy Ville. An Optimal Mastermind (4,7) Strategy and More Results in the Expected Case, March 2013. arXiv:1305.1010 [cs.GT]. https://arxiv.org/abs/1305.1010
3. Barteld Kooi. Yet Another Mastermind Strategy. International Computer Games Association Journal, 28(1):13–20, 2005. https://www.researchgate.net/publication/30485793_Yet_another_Mastermind_strategy
4. My previous implementation, which played games serially, can be found on the game_at_a_time branch.