* todo
+ already done
Missing Functions
* include klopfen
+ include spritzen
+ when may the Suchsau be played?
  * when the called suit is being searched
  * when it has already been searched
  * in the last trick
* davonlaufen: do the teams get set?
Things I added that I don't like
+ Reward Shaping: trick points are added to the reward (see the sketch after this list)
* Action Selection: hard-coded solos and Wenz (when the player has a lot of trumps)
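
A minimal sketch of what that shaping looks like, assuming a hypothetical
per-trick reward hook (names are illustrative, not the repo's actual code):

    # hypothetical reward shaping: a dense per-trick signal is added on top
    # of the sparse end-of-game reward (which is exactly why it feels hacky)
    def shaped_reward(trick_points, game_over, game_reward, beta=0.1):
        reward = beta * trick_points      # dense shaping term
        if game_over:
            reward += game_reward         # the true objective
        return reward
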
Questions
* why does PPO need 2 policies? in practice only one of them ever seems to be used (see sketch below)
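
For reference, the usual answer: policy_old is a frozen snapshot used to
collect rollouts, and the clipped objective compares the updated policy
against it. A minimal PyTorch sketch (the log_prob interface is an
assumption, not the repo's API):

    import torch

    def ppo_loss(policy, policy_old, states, actions, advantages, eps=0.2):
        logp_new = policy.log_prob(states, actions)
        with torch.no_grad():
            logp_old = policy_old.log_prob(states, actions)  # frozen copy
        ratio = torch.exp(logp_new - logp_old)  # always 1 without policy_old
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        return -torch.min(ratio * advantages, clipped * advantages).mean()

Many implementations avoid keeping a second network by storing the old log
probabilities during the rollout instead.
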
* Bug fixing
  + evaluate a single game: states, actions, ...
  + overfit on a single game
* Input state enhancement
  + current trick should be encoded with an LSTM (see sketch below)
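
A sketch of such a trick encoder (sizes are assumptions; one layer, as
noted under MISC below, should be enough):

    import torch.nn as nn

    class TrickEncoder(nn.Module):
        # encodes the cards of the current trick as a short sequence
        def __init__(self, card_dim=32, hidden_dim=64):
            super().__init__()
            self.lstm = nn.LSTM(card_dim, hidden_dim, num_layers=1, batch_first=True)

        def forward(self, trick_cards):   # (batch, up to 3 cards, card_dim)
            _, (h_n, _) = self.lstm(trick_cards)
            return h_n[-1]                # (batch, hidden_dim) trick summary
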
* Performance
  + multiprocessing to parallelize played games ---> only possible on CPU, thus no performance gain
  - implement play_n_games (see sketch below)
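
A runnable sketch of parallel CPU playouts (play_one_game is a stub
standing in for a full Schafkopf playout, not the repo's function):

    import multiprocessing as mp
    import random

    def play_one_game(policy_weights, seed):
        # stub: a real version would simulate one game with the given policy
        rng = random.Random(seed)
        return {"seed": seed, "reward": rng.choice([-1, 1])}

    def play_n_games(policy_weights, n_games, n_workers=4):
        # workers play games independently on CPU; the learner can then
        # update from the pooled trajectories (ideally on GPU, see below)
        with mp.Pool(n_workers) as pool:
            return pool.starmap(play_one_game, [(policy_weights, s) for s in range(n_games)])

    if __name__ == "__main__":
        print(len(play_n_games(None, n_games=8)))
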
* Prettify
  + change project structure
  * add/rework comments
* When nothing else works
  * include MCTS
  * check out CFR
* MISC
  + hard-code more Wenz ----> did that
  + player_plays_card: add the probability of playing that card and show it in the game printout
  + the loss shown in TensorBoard currently depends on the learning rate; this should be changed
  * add batchnorm (faster training?) --> not good for RL
  + use GRU instead of LSTM
  * current trick encoding: one LSTM layer should be enough
  * new metrics to see that the learner makes progress
    * number of times a player plays a card lower than the one currently winning the trick although he could take it
  + play against a rule-based bot
  * improve the rule-based bot
* Next
  * MCTS
  * Soft Actor-Critic
  * try to implement a gym env (see sketch below)
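
A skeleton of what such an env could look like (old gym API; the
observation/action sizes are assumptions, not the actual state encoding):

    import gym
    import numpy as np
    from gym import spaces

    class SchafkopfEnv(gym.Env):
        def __init__(self):
            self.action_space = spaces.Discrete(32)   # one action per card
            self.observation_space = spaces.Box(0.0, 1.0, shape=(256,), dtype=np.float32)

        def reset(self):
            self.state = np.zeros(256, dtype=np.float32)
            return self.state

        def step(self, action):
            # a real implementation would advance the game and score the trick
            reward, done = 0.0, True
            return self.state, reward, done, {}
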
* Parameter Changes
  * K = 8 instead of 16
  * entropy loss coefficient back to 0.001
  * increase the batch size further (see the config sketch below)
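
For context, where these knobs sit in a typical PPO setup (values other
than the three above are illustrative):

    # hypothetical PPO config; K_epochs = optimization epochs per update,
    # entropy_coef = weight of the entropy bonus in the loss
    config = {
        "K_epochs": 8,           # was 16
        "entropy_coef": 0.001,
        "batch_size": 8192,      # "increase further"; number is made up
    }
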
* What did I learn
  - Using a GPU does not always help in RL: with small networks and small batches the CPU is faster. The best setup would be parallel playouts on the CPU and updates on the GPU.
  - increasing the batch size and update_games helps the agent learn better (better optima and more stable, but slower)
  - Handling of illegal moves (no consensus on what works best; see the sketch after this list):
    - masking of illegal moves
    - learning not to make illegal moves by punishing them with a high negative reward (sometimes with the allowed actions provided as input)
    - learning the value function (instead of the policy) and only asking for the value of valid state/action pairs
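
A sketch of the first option, masking, in PyTorch (the 32-card action
space is an assumption):

    import torch

    def masked_action_distribution(logits, legal_mask):
        # illegal actions get -inf logits, i.e. zero probability after
        # softmax, so the agent can only ever sample legal cards
        masked = logits.masked_fill(~legal_mask, float("-inf"))
        return torch.distributions.Categorical(logits=masked)

    logits = torch.randn(32)
    legal_mask = torch.zeros(32, dtype=torch.bool)
    legal_mask[[3, 7, 15]] = True                 # the playable cards
    card = masked_action_distribution(logits, legal_mask).sample()
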