This is a tic-tac-toe game built using Q-learning, a reinforcement learning algorithm.
After training for 100,000 episodes against a computer opponent that made random moves, the agent won close to 80% of its games. The average reward was approximately 0.7, and the maximum reward exceeded 0.9.
Q-learning algorithm:
Q-learning is a model-free, value-based, off-policy reinforcement learning (RL) algorithm
based on the Bellman equation. It uses a Q-table to store Q-values for state-action pairs,
which represent the expected future rewards for taking specific actions in specific states.
Steps in Q-Learning Algorithm:
1: Initialize the Q-values for all state-action pairs arbitrarily (often to zero).
2: Observe the current state s.
3: Select an action a based on the current policy (e.g., ε-greedy).
4: Perform the action a and observe the reward r and the next state s'.
5: Update the Q-value using the Q-learning equation.
6: Set the current state s to the next state s'.
7: Repeat steps 2-6 until the termination condition is met.
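A minimal sketch of these steps in Python (the `env` object with `reset`, `step`, and `legal_actions` is an assumed interface for illustration, not this project's actual classes in src/):

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration rate
q_table = defaultdict(float)             # step 1: all Q(s, a) start at 0

def choose_action(state, actions):
    # step 3: epsilon-greedy -- explore with probability EPSILON, else exploit
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def run_episode(env):
    state = env.reset()                              # step 2: observe current state
    done = False
    while not done:                                  # step 7: loop until terminal
        action = choose_action(state, env.legal_actions(state))
        next_state, reward, done = env.step(action)  # step 4: act, observe r and s'
        # step 5: Q-learning update (see the Q-value update equation below)
        best_next = max((q_table[(next_state, a)]
                         for a in env.legal_actions(next_state)), default=0.0)
        q_table[(state, action)] += ALPHA * (
            reward + GAMMA * best_next - q_table[(state, action)])
        state = next_state                           # step 6: s <- s'
```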
Q-table:
The Q-table is implemented as a list and is saved as a CSV file in the following format:
state, action, q_value
"[[0, 0, 0], [0, 0, 0], [0, 0, 0]]", 1, 0.0
Q-value Update Equation:
The Q-values are updated using the standard Q-learning update equation:
Q_new(s, a) = Q(s, a) + α * (r + γ * max(Q(s', a')) - Q(s, a))
Where:
- Q(s, a): Current Q-value for the state-action pair (s, a)
- α: Learning rate (alpha)
- γ: Discount factor (gamma)
- max(Q(s', a')): Maximum Q-value over the actions a' available in the next state s'
- s: Current state
- a: Action taken in the current state
- r: Reward received after taking action a
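In code, the update is a single line; the commented example plugs in the training settings used below (α = 0.1, γ = 0.9):

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    # Q_new(s, a) = Q(s, a) + α * (r + γ * max(Q(s', a')) - Q(s, a))
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# A winning terminal move from an unvisited state (Q(s, a) = 0, r = 1,
# and no next state, so max(Q(s', a')) = 0):
# q_update(0.0, 1, 0.0) == 0.0 + 0.1 * (1 + 0.9 * 0.0 - 0.0) == 0.1
```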
Reinforcement Learning (RL) Environment:
- Action: Choose a move from 1 to 9, one of the nine board cells.
- State: Board configuration represented as a string, e.g., "[[0, 0, 0], [0, 0, 0], [0, 0, 0]]".
- Reward:
  - Win: +1
  - Tie: +0.5
  - Loss: -1
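As an illustration of this encoding (the helper names and the 0/1/2 cell values are assumptions for the sketch, not necessarily the conventions used in src/):

```python
def board_to_state(board):
    # 3x3 list of lists (0 = empty, 1 = X, 2 = O) -> state string
    return str(board)

def action_to_cell(action):
    # move 1..9 -> (row, col), numbering cells left-to-right, top-to-bottom
    return (action - 1) // 3, (action - 1) % 3

board = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
row, col = action_to_cell(5)   # move 5 -> center cell (1, 1)
board[row][col] = 1
print(board_to_state(board))   # "[[0, 0, 0], [0, 1, 0], [0, 0, 0]]"
```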
Links:
- Introductions to the Q-learning algorithm:
https://towardsdatascience.com/an-ai-agent-learns-to-play-tic-tac-toe-part-3-training-a-q-learning-rl-agent-2871cef2faf0
https://medium.com/@ardra4/tic-tac-toe-using-q-learning-a-reinforcement-learning-approach-d606cfdd64a3
https://medium.com/@kaneel.senevirathne/teaching-agents-to-play-tic-tac-toe-using-reinforcement-learning-7a9d4d6ee9b3
https://www.datacamp.com/tutorial/introduction-q-learning-beginner-tutorial?dc_referrer=https%3A%2F%2Fwww.google.com%2F
https://towardsdatascience.com/reinforcement-learning-implement-tictactoe-189582bea542
├── src # code for the tic-tac-toe environment
│ ├── board.py # board representation
│ ├── game.py # game logic
│ ├── move.py # move handling
│ ├── player.py # player logic
│ └── rl.py # Q-learning algorithm
├── training # code for training
│ ├── training_results # training result files
│ │ └── plan1 # result files for training plan1
│ ├── training.py # training script
│ ├── result_analysis.py # analysis of the training results
│ ├── training_result.csv # training result file
│ └── q_table.csv # Q-table file generated after 100,000 episodes
├── main.py # Run the app
├── .gitignore
├── requirements.txt
└── README.md
- Clone this project:
git clone https://github.com/taka-rl/tic-tac-toe_q_learning.git
- If you would just like to play Tic-Tac-Toe, see "Play Tic-Tac-Toe" below.
- If you would like to train the agent, run the following command to install the required libraries:
On Windows: python -m pip install -r requirements.txt
On macOS: pip3 install -r requirements.txt
If you would simply like to play Tic-Tac-Toe, run main.py.
Choose a game mode between 1 and 6.
- When you are ready, run training.py.
After the training, the q_table.csv and training_result.csv files are saved in the training folder.
q_table.csv follows the "state, action, q_value" format shown above.
Training environment: During training, the agent plays against a computer that makes random moves.
Expectation: The average reward increases as the number of episodes grows.
Result: As expected, the average reward increased, and the number of wins also rose.
The parameter settings in config.py are as follows:

```python
CONFIGURATIONS = [
    Config(learning_rate=0.1, discount_factor=0.9, epsilon=0.1, num_episodes=1000, identifier="training1_1"),
    Config(learning_rate=0.1, discount_factor=0.9, epsilon=0.1, num_episodes=10000, identifier="training1_2"),
    Config(learning_rate=0.1, discount_factor=0.9, epsilon=0.1, num_episodes=100000, identifier="training1_3"),
]
```
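`Config` is presumably a small record of hyperparameters; a minimal sketch of a compatible definition (the field comments are informed guesses, not taken from config.py):

```python
from dataclasses import dataclass

@dataclass
class Config:
    learning_rate: float     # α in the update equation
    discount_factor: float   # γ in the update equation
    epsilon: float           # exploration rate for ε-greedy selection
    num_episodes: int        # number of training episodes
    identifier: str          # label for this training run
```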
The average reward was calculated every 100 games.
The plots show the average reward for 1,000, 10,000, and 100,000 episodes.
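A sketch of computing those per-100-episode averages from a per-episode log (assuming training_result.csv has one row per episode with a reward column; the actual analysis lives in result_analysis.py):

```python
import csv

def average_rewards(path, window=100):
    # Mean reward over consecutive 100-episode windows
    with open(path, newline="") as f:
        rewards = [float(row["reward"])
                   for row in csv.DictReader(f, skipinitialspace=True)]
    return [sum(rewards[i:i + window]) / window
            for i in range(0, len(rewards) - window + 1, window)]
```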
Win/Loss/Tie:
Over the course of training, the number of wins increased gradually.