Build a custom Tic-Tac-Toe game using Reinforcement Learning techniques (6x6 is used for illustration and explanation purposes)
Compared to 3x3 Tic-Tac-Toe, the 6x6 board is sparse: it has only 14 winning positions for the players (6 rows, 6 columns, and 2 diagonals). The winning positions are depicted as follows.
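As a quick sanity check, the 14 lines can be enumerated programmatically. The sketch below uses flat board indices 0-35 and illustrative names; it is not taken from the repository's code.

```python
# Minimal sketch: enumerate the winning lines of an n x n Tic-Tac-Toe board
# as lists of flat indices. For n = 6: 6 rows + 6 columns + 2 diagonals = 14 lines.
def winning_lines(n=6):
    lines = []
    for r in range(n):                                        # rows
        lines.append([r * n + c for c in range(n)])
    for c in range(n):                                        # columns
        lines.append([r * n + c for r in range(n)])
    lines.append([i * n + i for i in range(n)])               # main diagonal
    lines.append([i * n + (n - 1 - i) for i in range(n)])     # anti-diagonal
    return lines

print(len(winning_lines(6)))  # 14
```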
The state of the board is represented in string format. For example, a random state of the board is represented as xeexeeeeoeeoxeeoeeoeeexeexeeeeeeeeeo, where
- ‘e’ means empty
- ‘x’ means RL player
- ‘o’ means opponent
As in a normal Tic-Tac-Toe game, the board is initialized as all empty (‘eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee’) and one of the players starts the game. Players play alternately until either of them wins or the game ends in a draw. At each turn, the player to move places his/her mark in any of the empty slots.
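A minimal sketch of this string representation, using flat indices 0-35; the helper names are illustrative assumptions rather than the repository's actual functions.

```python
# Illustrative sketch of the string-based board (not the repository's exact code).
DIM = 6
EMPTY_BOARD = 'e' * (DIM * DIM)          # 'eee...e', 36 characters

def empty_slots(board):
    """Flat indices of the slots that are still empty."""
    return [i for i, ch in enumerate(board) if ch == 'e']

def make_move(board, position, mark):
    """Place 'x' (RL player) or 'o' (opponent) at a flat index 0..35."""
    assert board[position] == 'e', "slot already occupied"
    return board[:position] + mark + board[position + 1:]

board = make_move(EMPTY_BOARD, 0, 'x')   # RL player takes the top-left corner
print(board[:DIM])                        # 'xeeeee' -> first row of the board
```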
The objective is to minimize the opponent’s wins. This objective is chosen because of the high probability of draws (ties); the right measure of learning is the ability to reduce the number of times the opponent wins.
The framework consists of an RL agent, a Random player, and a Judge.
The Random player is the opponent of the RL agent. As the name suggests, it plays randomly by choosing one of the available empty slots as its move.
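A minimal sketch of such a player; the class and method names are assumptions, not necessarily those used in the code.

```python
import random

class RandomPlayer:
    """Opponent that picks one of the empty slots uniformly at random."""
    def __init__(self, mark='o'):
        self.mark = mark

    def move(self, board):
        empty = [i for i, ch in enumerate(board) if ch == 'e']
        return random.choice(empty)
```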
The Judge has the following responsibilities.
- Decide which player starts the game (gives the first chance to each player in alternate games)
- Regulate the game (whose turn it is, as per the Tic-Tac-Toe rules, alternating again)
- Evaluate the winner (the Judge also intelligently declares a draw, even before all the empty slots are exhausted, if it detects that neither player can possibly win)
- Notify both players about the end of the game and who the winner is
- Keep the statistics for evaluating the performance
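A hedged sketch of the winner/early-draw evaluation described above; the function and variable names are assumptions, and the actual logic lives in the Judge class.

```python
def evaluate(board, lines):
    """Return 'x' or 'o' if that player completed a line, 'draw' if no line can
    still be completed by either player, otherwise None (the game continues)."""
    some_line_still_winnable = False
    for line in lines:                              # lines from winning_lines() above
        marks = {board[i] for i in line}
        if marks == {'x'} or marks == {'o'}:
            return marks.pop()                      # one player owns the whole line
        if 'x' not in marks or 'o' not in marks:
            some_line_still_winnable = True         # only one player's marks (or empties)
    return None if some_line_still_winnable else 'draw'
```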
The reward scheme for the RL agent is as follows:
- Win: 1
- Draw: -1
- Loss: -10
- Any action leading to none of the above results in a reward of 0
After each episode, the RL agent updates the value function using Monte Carlo updates. A TD method is unnecessary here, as no state is repeated within an episode.
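A minimal sketch of such an update, assuming a tabular value function keyed by board strings; the learning rate, discount factor, and function names are illustrative assumptions, not values taken from the code.

```python
REWARDS = {'win': 1, 'draw': -1, 'loss': -10}   # reward scheme described above
ALPHA = 0.1                                     # learning rate (assumed value)
GAMMA = 0.9                                     # discount factor (assumed value)

def monte_carlo_update(values, episode_states, outcome):
    """values: dict mapping board strings to estimated values.
    episode_states: the board strings the agent saw, in order of play."""
    G = REWARDS[outcome]                        # only the terminal reward is non-zero
    for state in reversed(episode_states):      # later states are closer to the outcome
        old = values.get(state, 0.0)
        values[state] = old + ALPHA * (G - old) # move the estimate toward the return
        G *= GAMMA                              # discount as we walk back in time
    return values
```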
- Make use of the symmetry of the Tic-Tac-Toe board: one game has 5 symmetric alternatives, which correspond to the 6 board transformations described below (see the Transform Dictionary section).
- Learn from the opponent’s moves: on top of the agent’s own moves leading to a win, loss, or draw, we can also learn from the opponent’s moves. Refer to the function update_values in the code for the details.
Thus, every episode leads to 12x learning, i.e., the value function of the states involved in those learning paths is updated 12 times (6 symmetric transformations x 2, from both the agent’s and the opponent’s moves).
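A sketch of how the six symmetric views of a board string might be generated (the original plus 3 rotations and 2 mirror flips); the actual mapping is defined in the Transform Dictionary section of the code.

```python
import numpy as np

DIM = 6

def transformations(board):
    """Six symmetric views of the board: original, 3 rotations, 2 mirror flips."""
    grid = np.array(list(board)).reshape(DIM, DIM)
    views = [grid,
             np.rot90(grid, 1), np.rot90(grid, 2), np.rot90(grid, 3),
             np.fliplr(grid), np.flipud(grid)]
    return [''.join(v.flatten()) for v in views]

# Each of the 6 views can receive the same Monte Carlo update, and the same is done
# from the opponent's perspective, giving the 6 x 2 = 12 updates per episode.
```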
- Section# Dependencies: Lists the required libraries and the package version requirements
- Section# Change the dimension of the board here: the same code can be used for any dimension of Tic-Tac-Toe
- Section# Transform Dictionary: Function to transform the Tic-Tac-Toe board into its 6 symmetric transformations
- Section# Import the previous learning: Load the previously saved learning (the values are saved at the end of a run)
- Section# RL Agent – RL Agent class
- Section# Random Player – Random Player class
- Section# Judge – Judge class
- Section# Check performance – Statistics and plots
- Section# Saving values to pickle file – Save the learned values for the next run
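A minimal sketch of the save/load round trip via pickle; the file name here is an assumption.

```python
import os
import pickle

VALUES_FILE = 'value_function.pkl'   # illustrative file name

def save_values(values):
    with open(VALUES_FILE, 'wb') as f:
        pickle.dump(values, f)

def load_values():
    """Resume from the previously saved value function, or start fresh."""
    if os.path.exists(VALUES_FILE):
        with open(VALUES_FILE, 'rb') as f:
            return pickle.load(f)
    return {}
```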
Performance through 10,000 games (6x6 board).
For comparison, performance through 10,000 games for the 3x3 board.
The same algorithm can be used for any board size. Learning for smaller-dimension games is faster, whereas 6x6 takes many more steps to learn properly because of its sparse nature.