SimpleAgent 600K dataset
Samples collected from four SimpleAgents playing against each other. The dataset contains 600 episodes (~600K samples) in the training set and 100 episodes (~100K samples) in the validation set. There are two versions: one with rewards calculated using a discount factor of 0.99, and one with no discounting (discount factor of 1). In the cleaned version, samples are removed when three consecutive actions and four consecutive observations did not change.
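
To illustrate the difference between the two reward versions, here is a minimal sketch of a discounted return computed backwards over one episode. It assumes each episode provides a list of per-step rewards; the function name discounted_returns and that input layout are hypothetical and are not taken from the dataset's own code, which may compute rewards differently.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return the discounted sum of future rewards for every step of an episode.

    Hypothetical helper: the per-step reward list is an assumed input format.
    gamma=0.99 corresponds to the discounted version, gamma=1.0 to no discount.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode from the last step to the first, accumulating the return.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    episode_rewards = [0.0, 0.0, 1.0]  # toy episode with a single terminal reward
    print(discounted_returns(episode_rewards, 0.99))
    print(discounted_returns(episode_rewards, 1.0))
```

With a discount factor of 1, a single terminal reward is copied unchanged to every earlier step; with 0.99, it shrinks by a factor of 0.99 for each step further from the end of the episode.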