This article is a tutorial for running the single-machine PPO example of Google Research Football on Ubuntu.
For details on the Google Research Football environment, see its GitHub repository.
Based on testing, it is recommended to build the Google Research Football environment with Python 3.7 and the GPU build of TensorFlow 1.15; a few additional dependencies are also needed:
conda create -n gf-ppo python=3.7 tensorflow-gpu=1.15.*
conda activate gf-ppo
python3 -m pip install dm-sonnet==1.*
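To confirm that the GPU build of TensorFlow is actually picked up inside the new environment, a quick check such as the following can be run (a small sketch, assuming the gf-ppo env is active):
import tensorflow as tf

print(tf.__version__)              # expected to start with 1.15
print(tf.test.is_gpu_available())  # True if the GPU build and drivers are working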
Some system packages are also required; OpenAI Baselines and Google Research Football each need the following setup:
sudo apt-get install git cmake build-essential libgl1-mesa-dev libsdl2-dev \
libsdl2-image-dev libsdl2-ttf-dev libsdl2-gfx-dev libboost-all-dev \
libdirectfb-dev libst-dev mesa-utils xvfb x11vnc libsdl-sge-dev python3-pip
python3 -m pip install --upgrade pip setuptools psutil
git clone https://github.com/google-research/football.git
To install the environment right away, run:
cd football
python3 -m pip install .
This may take some time, since it compiles the C++ game engine in the background.
Because of this compilation step, changes made to the local source after installation are not picked up by the installed environment until it is recompiled. For that reason, you may want to postpone this installation step and install only after the code modifications described below are done.
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .
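As a quick sanity check that Baselines ended up in the same environment as TensorFlow, an import test like the one below can be run (purely illustrative, not part of the official setup):
import tensorflow as tf
from baselines import logger
from baselines.ppo2 import ppo2

print(tf.__version__)   # should be 1.15.x
print(logger.__file__)  # points into the editable baselines checkout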
In ~/football/gfootball/examples/run_ppo2.py, add the statement:
flags.DEFINE_string('dir', "~/log", 'Path to openai baselines log.')
and, after the line
tf.Session(config=config).__enter__()
add:
logger.configure(dir=FLAGS.dir, format_strs=['stdout','log','csv','tensorboard'])
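To see what this configuration does on its own, here is a minimal, self-contained sketch of the Baselines logger; the directory /tmp/gf_ppo_log and the metric name are arbitrary examples:
from baselines import logger

# Write logs to stdout, a plain-text log, a CSV file and TensorBoard event files.
logger.configure(dir='/tmp/gf_ppo_log', format_strs=['stdout', 'log', 'csv', 'tensorboard'])
logger.record_tabular('example_metric', 1.0)
logger.dump_tabular()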
After that, the Google Research Football environment can be installed with the following commands (if it was installed earlier, it needs to be reinstalled here):
cd football
python3 -m pip install .
Alternatively, you can modify ~/baselines/baselines/ppo2/model.py; the fragments marked '# my code' below are the additions:
# my code
from baselines import logger    # import added near the top of model.py
self.CLIPRANGE = CLIPRANGE = tf.placeholder(tf.float32, [])
# my code
with tf.name_scope('env'):
    tf.summary.histogram('return', self.R)

loss = pg_loss - entropy * ent_coef + vf_loss * vf_coef
# my code: collect scalar summaries for the PPO loss terms (in Model.__init__)
self.count = 0
with tf.name_scope('loss'):
    tf.summary.scalar('policy_loss', pg_loss)
    tf.summary.scalar('value_loss', vf_loss)
    tf.summary.scalar('policy_entropy', entropy)
    tf.summary.scalar('approxkl', approxkl)
    tf.summary.scalar('clipfrac', clipfrac)
self.merged = tf.summary.merge_all()
self.writer = tf.summary.FileWriter(logger.get_dir(), sess.graph)

if states is not None:
    td_map[self.train_model.S] = states
    td_map[self.train_model.M] = masks
# my code: write the merged summaries every 5 training updates (in Model.train)
self.count += 1
if self.count % 5 == 0:
    summary = self.sess.run(self.merged, td_map)
    self.writer.add_summary(summary, self.count)
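For reference, the summary pattern used above reduces to the following self-contained TF 1.x sketch (the path /tmp/tb_demo and the dummy placeholder are only for illustration):
import tensorflow as tf

x = tf.placeholder(tf.float32, [])
with tf.name_scope('loss'):
    tf.summary.scalar('policy_loss', x)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('/tmp/tb_demo', sess.graph)
    for step in range(10):
        summary = sess.run(merged, feed_dict={x: float(step)})
        writer.add_summary(summary, step)
    writer.close()
The event files written this way can then be viewed by pointing TensorBoard at that directory.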
Then set the log path in ~/football/gfootball/examples/run_ppo2.py:
# my code
flags.DEFINE_string('dir', "~/log", 'Path to openai baselines log.')

tf.Session(config=config).__enter__()  # existing line in run_ppo2.py
# my code
logger.configure(dir=FLAGS.dir)
and reinstall the environment afterwards.
The script exposes a number of flags that can be set directly (a note on how the default batch size works out follows the list):
flags.DEFINE_string('level', 'academy_empty_goal_close',
'Defines type of problem being solved')
flags.DEFINE_enum('state', 'extracted_stacked', ['extracted',
'extracted_stacked'],
'Observation to be used for training.')
flags.DEFINE_enum('reward_experiment', 'scoring',
['scoring', 'scoring,checkpoints'],
'Reward to be used for training.')
flags.DEFINE_enum('policy', 'cnn', ['cnn', 'lstm', 'mlp', 'impala_cnn',
'gfootball_impala_cnn'],
'Policy architecture')
flags.DEFINE_integer('num_timesteps', int(2e6),
'Number of timesteps to run for.')
flags.DEFINE_integer('num_envs', 8,
'Number of environments to run in parallel.')
flags.DEFINE_integer('nsteps', 128, 'Number of environment steps per epoch; '
'batch size is nsteps * nenv')
flags.DEFINE_integer('noptepochs', 4, 'Number of updates per epoch.')
flags.DEFINE_integer('nminibatches', 8,
'Number of minibatches to split one epoch to.')
flags.DEFINE_integer('save_interval', 100,
'How frequently checkpoints are saved.')
flags.DEFINE_integer('seed', 0, 'Random seed.')
flags.DEFINE_float('lr', 0.00008, 'Learning rate')
flags.DEFINE_float('ent_coef', 0.01, 'Entropy coeficient')
flags.DEFINE_float('gamma', 0.993, 'Discount factor')
flags.DEFINE_float('cliprange', 0.27, 'Clip range')
flags.DEFINE_float('max_grad_norm', 0.5, 'Max gradient norm (clipping)')
flags.DEFINE_bool('render', False, 'If True, environment rendering is enabled.')
flags.DEFINE_bool('dump_full_episodes', True,
'If True, trace is dumped after every episode.')
flags.DEFINE_bool('dump_scores', False,
'If True, sampled traces after scoring are dumped.')
flags.DEFINE_string('load_path', None, 'Path to load initial checkpoint from.')
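As a side note on how the defaults above combine, following the "batch size is nsteps * nenv" hint in the flag description and the usual PPO2 minibatch split (the exact split is an assumption about the Baselines implementation):
# Illustrative arithmetic for the default flag values above
num_envs = 8
nsteps = 128
nminibatches = 8

batch_size = num_envs * nsteps               # 1024 transitions collected per update
minibatch_size = batch_size // nminibatches  # 128 transitions per SGD minibatch
print(batch_size, minibatch_size)            # -> 1024 128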
Beyond these directly settable flags, a few other parameters are also important to configure:
One is the logdir argument of env = football_env.create_environment(). It defaults to logger.get_dir(), but at the point where this code runs the TensorFlow session has not been started yet and the logger path has not been changed, so it is recommended either to hard-code a fixed path or to add a flag:
flags.DEFINE_string('logdir', "dumps/", 'Path to load replay.')
and modify create_single_football_env() accordingly:
def create_single_football_env(iprocess):
  """Creates gfootball environment."""
  env = football_env.create_environment(
      env_name=FLAGS.level, stacked=('stacked' in FLAGS.state),
      rewards=FLAGS.reward_experiment,
      # my code
      logdir=FLAGS.logdir,
      # my code end
      write_goal_dumps=FLAGS.dump_scores and (iprocess == 0),
      write_full_episode_dumps=FLAGS.dump_full_episodes and (iprocess == 0),
      render=FLAGS.render and (iprocess == 0),
      dump_frequency=50 if FLAGS.render and iprocess == 0 else 0)
  env = monitor.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(),
                                                               str(iprocess)))
  return env
The other is the dump_frequency argument of create_single_football_env(). It controls how many episodes pass between saved dumps, and when rendering is enabled it also controls the rendering frequency: one episode is rendered every dump_frequency episodes (the display stays frozen the rest of the time).
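To get a feel for these arguments outside the training script, a single environment can also be created and stepped by hand. The sketch below mirrors the call above with fixed values (the level, reward and logdir values are arbitrary examples):
import gfootball.env as football_env

env = football_env.create_environment(
    env_name='academy_empty_goal_close', stacked=True,
    rewards='scoring',
    logdir='dumps/',
    write_goal_dumps=False,
    write_full_episode_dumps=False,
    render=False,
    dump_frequency=0)

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print(obs.shape, reward, done)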
You can run the example directly with
python3 -m gfootball.examples.run_ppo2 --level=academy_empty_goal_close
and append other flags to the command as needed.
Alternatively, run the example through a shell script such as:
#!/bin/bash
python3 -u -m gfootball.examples.run_ppo2 \
--level 5_vs_5 \
--reward_experiment scoring,checkpoints \
--policy impala_cnn \
--cliprange 0.08 \
--gamma 0.993 \
--ent_coef 0.003 \
--num_timesteps 50000000 \
--max_grad_norm 0.64 \
--lr 0.000343 \
--num_envs 16 \
--noptepochs 2 \
--nminibatches 8 \
--nsteps 512 \
--dir '~/ppo_log/5_vs_5' \
--logdir 'dumps/5_vs_5' \
--dump_full_episodes True \
"$@"
(Different scenarios are selected via the level flag.)
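If you are unsure which level names exist, one way to list them is to inspect the gfootball.scenarios package (this assumes each scenario ships as a module in that package, which is how the repository is laid out):
import pkgutil
import gfootball.scenarios as scenarios

# Each module name here can be passed via --level
for mod in pkgutil.iter_modules(scenarios.__path__):
    print(mod.name)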