Suboptimal policy #12

xli4217 · 2018-11-15T22:27:43Z

I'm trying SQL on a simple manipulator reaching task. The agent quickly learns to get to the vicinity of the goal, but then the learning curve plateaus and the agent never quite gets to the goal. Some of my hyperparameters are:

policy learning rate 0.0005
Q learning rate 0.001
reward scale 20
alpha 1.0

Is there something I can do to improve this? Thanks.

haarnoja · 2018-11-16T17:17:43Z

SQL learns maximum entropy policies, so the optimal policy under that objective is stochastic. You can try, for example, annealing the temperature to zero, or shaping the reward function so that the reward is much larger in the vicinity of the goal.
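
A minimal sketch of both suggestions, in case it helps. The function names, the linear schedule, the Gaussian bonus, and all default values below are illustrative assumptions, not part of the SQL codebase; how the temperature (alpha) is actually passed into the training loop depends on the implementation you are using.

```python
import math


def annealed_alpha(step, total_steps, alpha_start=1.0, alpha_end=0.0):
    """Linearly anneal the temperature from alpha_start to alpha_end.

    Hypothetical helper: the schedule shape and defaults are assumptions.
    If the soft value computation divides by alpha, a small positive
    alpha_end is likely safer than exactly zero.
    """
    # Clamp progress to [0, 1] so steps past total_steps stay at alpha_end.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)


def shaped_reward(distance, goal_radius=0.05, bonus=10.0):
    """Negative distance to the goal plus a bonus that peaks at the goal.

    Hypothetical shaping term: goal_radius and bonus are made-up values
    that would need tuning for the actual task.
    """
    # The Gaussian term is close to zero far from the goal and rises
    # sharply once the distance is on the order of goal_radius.
    return -distance + bonus * math.exp(-(distance / goal_radius) ** 2)
```

For example, `annealed_alpha(step, total_steps)` could be evaluated once per training iteration and used in place of the fixed alpha of 1.0, and `shaped_reward(distance)` could replace a plain distance-based reward, where `distance` is the end-effector distance to the goal.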
