* todo
+ already done
Missing Functions
* include klopfen
+ include spritzen
+ when may the Suchsau be played?
  * when the called suit is being searched
  * when it has already been searched
  * in the last trick
* davonlaufen: do the teams get set?
Things I added that I don't like
+ Reward Shaping: trick points are added to the reward (see the sketch after this list)
* Action Selection: hard-coded solos and Wenz (when the player has a lot of trumps)
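
A minimal sketch of what that shaping looks like, assuming a hypothetical
per-trick reward hook (names are illustrative, not the repo's actual code):

    # hypothetical reward shaping: a dense per-trick signal is added on top
    # of the sparse end-of-game reward (which is exactly why it feels hacky)
    def shaped_reward(trick_points, game_over, game_reward, beta=0.1):
        reward = beta * trick_points      # dense shaping term
        if game_over:
            reward += game_reward         # the true objective
        return reward
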
Questions
* why does PPO need 2 policies? in practice only one of them ever seems to be used (see sketch below)
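
For reference, the usual answer: policy_old is a frozen snapshot used to
collect rollouts, and the clipped objective compares the updated policy
against it. A minimal PyTorch sketch (the log_prob interface is an
assumption, not the repo's API):

    import torch

    def ppo_loss(policy, policy_old, states, actions, advantages, eps=0.2):
        logp_new = policy.log_prob(states, actions)
        with torch.no_grad():
            logp_old = policy_old.log_prob(states, actions)  # frozen copy
        ratio = torch.exp(logp_new - logp_old)  # always 1 without policy_old
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        return -torch.min(ratio * advantages, clipped * advantages).mean()

Many implementations avoid keeping a second network by storing the old log
probabilities during the rollout instead.
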
* Bug fixing
  + evaluate a single game: states, actions, ...
  + overfit on a single game
* Input state enhancement
  + current trick should be encoded with an LSTM (see sketch below)
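
A sketch of such a trick encoder (sizes are assumptions; one layer, as
noted under MISC below, should be enough):

    import torch.nn as nn

    class TrickEncoder(nn.Module):
        # encodes the cards of the current trick as a short sequence
        def __init__(self, card_dim=32, hidden_dim=64):
            super().__init__()
            self.lstm = nn.LSTM(card_dim, hidden_dim, num_layers=1, batch_first=True)

        def forward(self, trick_cards):   # (batch, up to 3 cards, card_dim)
            _, (h_n, _) = self.lstm(trick_cards)
            return h_n[-1]                # (batch, hidden_dim) trick summary
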
* Performance
  + multiprocessing to parallelize played games ---> only possible on CPU, thus no performance gain
  - implement play_n_games (see sketch below)
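
A runnable sketch of parallel CPU playouts (play_one_game is a stub
standing in for a full Schafkopf playout, not the repo's function):

    import multiprocessing as mp
    import random

    def play_one_game(policy_weights, seed):
        # stub: a real version would simulate one game with the given policy
        rng = random.Random(seed)
        return {"seed": seed, "reward": rng.choice([-1, 1])}

    def play_n_games(policy_weights, n_games, n_workers=4):
        # workers play games independently on CPU; the learner can then
        # update from the pooled trajectories (ideally on GPU, see below)
        with mp.Pool(n_workers) as pool:
            return pool.starmap(play_one_game, [(policy_weights, s) for s in range(n_games)])

    if __name__ == "__main__":
        print(len(play_n_games(None, n_games=8)))
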
* Prettify
  + change project structure
  * add/rework comments
* When nothing else works
  * include MCTS
  * check out CFR
* MISC
  + hard-code more Wenz ----> did that
  + player_plays_card: add the probability of playing that card and show it in the game printout
  + the loss shown in TensorBoard currently depends on the learning rate; this should be changed
  * add batchnorm (faster training?) --> not good for RL
  + use GRU instead of LSTM
  * current trick encoding: one LSTM layer should be enough
  * new metrics to see that the learner makes progress
    * number of times a player plays a card lower than the one currently winning the trick although he could take it
  + play against a rule-based bot
  * improve the rule-based bot
* Next
  * MCTS
  * Soft Actor-Critic
  * try to implement a gym env (see sketch below)
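
A skeleton of what such an env could look like (old gym API; the
observation/action sizes are assumptions, not the actual state encoding):

    import gym
    import numpy as np
    from gym import spaces

    class SchafkopfEnv(gym.Env):
        def __init__(self):
            self.action_space = spaces.Discrete(32)   # one action per card
            self.observation_space = spaces.Box(0.0, 1.0, shape=(256,), dtype=np.float32)

        def reset(self):
            self.state = np.zeros(256, dtype=np.float32)
            return self.state

        def step(self, action):
            # a real implementation would advance the game and score the trick
            reward, done = 0.0, True
            return self.state, reward, done, {}
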
* Parameter Changes
  * K = 8 instead of 16
  * entropy loss coefficient back to 0.001
  * increase the batch size further (see the config sketch below)
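
For context, where these knobs sit in a typical PPO setup (values other
than the three above are illustrative):

    # hypothetical PPO config; K_epochs = optimization epochs per update,
    # entropy_coef = weight of the entropy bonus in the loss
    config = {
        "K_epochs": 8,           # was 16
        "entropy_coef": 0.001,
        "batch_size": 8192,      # "increase further"; number is made up
    }
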
* What did I learn
  - Using a GPU does not always help in RL: with small networks and small batches the CPU is faster. The best setup would be parallel playouts on the CPU and updates on the GPU.
  - increasing the batch size and update_games helps the agent learn better (better optima and more stable, but slower)
  - Handling of illegal moves (no consensus on what works best; see the sketch after this list):
    - masking of illegal moves
    - learning not to make illegal moves by punishing them with a high negative reward (sometimes with the allowed actions provided as input)
    - learning the value function (instead of the policy) and only asking for the value of valid state/action pairs
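
A sketch of the first option, masking, in PyTorch (the 32-card action
space is an assumption):

    import torch

    def masked_action_distribution(logits, legal_mask):
        # illegal actions get -inf logits, i.e. zero probability after
        # softmax, so the agent can only ever sample legal cards
        masked = logits.masked_fill(~legal_mask, float("-inf"))
        return torch.distributions.Categorical(logits=masked)

    logits = torch.randn(32)
    legal_mask = torch.zeros(32, dtype=torch.bool)
    legal_mask[[3, 7, 15]] = True                 # the playable cards
    card = masked_action_distribution(logits, legal_mask).sample()
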