Hybrid Proximal Policy Optimization (HPPO) for WetEnv
This project implements a Hybrid Proximal Policy Optimization (HPPO) algorithm for reinforcement learning in a custom environment called WetEnv. The HPPO algorithm is designed to handle both discrete and continuous action spaces.
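The exact observation and action spaces are defined in `env/wet_rl_env.py`; the snippet below is only a rough sketch of what a hybrid (discrete + continuous) action space looks like, assuming a gymnasium-style API with placeholder sizes and bounds.

```python
# Illustrative only: the real spaces are defined in env/wet_rl_env.py.
# Assumes a gymnasium-style API; the sizes and bounds below are placeholders.
import numpy as np
from gymnasium import spaces

hybrid_action_space = spaces.Tuple((
    spaces.Discrete(5),                                           # discrete choice (e.g. which option)
    spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32),  # continuous parameter (e.g. a quantity)
))

action = hybrid_action_space.sample()  # -> (discrete_index, continuous_vector)
```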
- `env/wet_rl_env.py`: Custom environment implementation (WetEnv)
- `hppo/hppo.py`: HPPO algorithm implementation, including the PPO_Hybrid, ActorCritic_Hybrid, and PPOBuffer classes
- `hppoTrainer.py`: Trainer class for managing the training process
- `util/`: Utility functions and configurations
You can modify the hyperparameters in the `hppoTrainer.py` file or pass them as command-line arguments.
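For illustration only, a hypothetical argparse setup of this kind could expose the key hyperparameters; the actual flag names and defaults are whatever `hppoTrainer.py` defines.

```python
# Hypothetical example only: the real flag names and defaults are whatever
# hppoTrainer.py defines.
import argparse

parser = argparse.ArgumentParser(description="Train HPPO on WetEnv")
parser.add_argument("--lr-actor", type=float, default=3e-4)      # actor learning rate
parser.add_argument("--lr-critic", type=float, default=1e-3)     # critic learning rate
parser.add_argument("--lr-std", type=float, default=1e-4)        # std learning rate
parser.add_argument("--gamma", type=float, default=0.99)         # discount factor
parser.add_argument("--gae-lambda", type=float, default=0.95)    # GAE lambda
parser.add_argument("--clip-eps", type=float, default=0.2)       # clipping epsilon
parser.add_argument("--target-kl", type=float, default=0.02)     # target KL divergence
parser.add_argument("--entropy-coef", type=float, default=0.01)  # entropy coefficient
args = parser.parse_args()
```

A run might then look like `python hppoTrainer.py --gamma 0.99 --clip-eps 0.2` (again, hypothetical flags).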
- WetEnv: Custom environment with hybrid action space (discrete and continuous).
- PPO_Hybrid: Implementation of the Hybrid Proximal Policy Optimization algorithm.
- ActorCritic_Hybrid: Neural network architecture for the actor-critic model (a minimal sketch follows this list).
- PPOBuffer: On-policy rollout buffer for storing and sampling transitions.
- Trainer: Manages the training process, including episode collection, agent updates, and logging.
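The actual architecture lives in the ActorCritic_Hybrid class; the following is only a minimal sketch of a hybrid actor-critic, assuming a shared torso that feeds a discrete policy head (logits), a continuous policy head (Gaussian mean with a learned log-std), and a value head.

```python
# Minimal sketch of a hybrid actor-critic; the repository's ActorCritic_Hybrid
# may differ in layer sizes, activations, and how the std is parameterized.
import torch
import torch.nn as nn

class HybridActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_discrete: int, cont_dim: int, hidden: int = 64):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.discrete_head = nn.Linear(hidden, n_discrete)  # logits over discrete actions
        self.cont_mean = nn.Linear(hidden, cont_dim)        # mean of the Gaussian continuous action
        self.log_std = nn.Parameter(torch.zeros(cont_dim))  # state-independent log std
        self.value_head = nn.Linear(hidden, 1)              # state value V(s)

    def forward(self, obs: torch.Tensor):
        h = self.torso(obs)
        logits = self.discrete_head(h)
        mean = self.cont_mean(h)
        std = self.log_std.exp().expand_as(mean)
        value = self.value_head(h).squeeze(-1)
        return logits, (mean, std), value
```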
Key hyperparameters include:
- Learning rates (actor, critic, std)
- Discount factor (gamma)
- GAE lambda
- Clipping epsilon
- Target KL divergence
- Entropy coefficient
Refer to the `hppoTrainer.py` file for a complete list of hyperparameters and their default values. The sketch below shows where these quantities typically enter a PPO-style update.
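As a schematic illustration (not the exact code in `hppo/hppo.py`): gamma and the GAE lambda shape the advantages, the clipping epsilon bounds the policy ratio, the entropy coefficient weights the exploration bonus, and the target KL gates early stopping.

```python
# Schematic only: shows where the listed hyperparameters typically appear in a
# PPO-style update. The exact implementation in hppo/hppo.py may differ.
import torch

def gae_advantages(rewards, values, last_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation (termination masking omitted for brevity)."""
    advantages, next_adv, next_value = [], 0.0, last_value
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v                # TD residual (discount factor gamma)
        next_adv = delta + gamma * gae_lambda * next_adv  # GAE lambda smoothing
        next_value = v
        advantages.append(next_adv)
    return list(reversed(advantages))

def ppo_policy_loss(logp_new, logp_old, advantages, entropy,
                    clip_eps=0.2, entropy_coef=0.01, target_kl=0.02):
    ratio = torch.exp(logp_new - logp_old)                        # importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # clipping epsilon
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    loss = policy_loss - entropy_coef * entropy.mean()            # entropy coefficient
    approx_kl = (logp_old - logp_new).mean()                      # stop updating if KL grows too large
    stop_early = approx_kl.item() > 1.5 * target_kl
    return loss, stop_early
```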
Results
After training, the results will be saved in the `log/` directory, including:
- Total reward history (`.npy` file)
- Total reward plot (`.png` file)
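A rough sketch of how such outputs might be produced; the actual filenames and layout under `log/` are whatever the Trainer writes.

```python
# Illustrative sketch: the real filenames and directory layout under log/ are
# decided by the Trainer, so treat the names below as placeholders.
import os
import numpy as np
import matplotlib.pyplot as plt

def save_reward_history(total_rewards, log_dir="log"):
    os.makedirs(log_dir, exist_ok=True)
    history = np.asarray(total_rewards, dtype=np.float64)
    np.save(os.path.join(log_dir, "total_reward.npy"), history)  # reward history (.npy)
    plt.figure()
    plt.plot(history)
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.savefig(os.path.join(log_dir, "total_reward.png"))       # reward plot (.png)
    plt.close()
```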
- Hyperparameter tuning comparisons and learning-curve optimization
- Evaluate function design 1: compute directly from the current version's input, and compare against the next version's last opt plan
- Evaluate function design 2: train on the current version's input with randomly generated initial states, run inference on the next version's input, and compare against the version after next's last opt plan to check the model's generalization
- Comparison with the Gurobi result for a single version, with Gantt chart visualization
- Comparison with Gurobi results across many versions, with comparison plots but no Gantt visualization
- Curve of computation time versus episode
- Put PM and machine-down status into the observation space
- Constraints referenced from the Gurobi PPT (one by one)
- Step update / reward: the acid concentration update must not exceed the upper limit
- Step update / reward: if the acid concentration update would exceed the upper limit, decide whether to change the acid in the next step or immediately, or use the reward to drive full utilization of the acid concentration
- Step update / action mask: if the acid concentration exceeds the 3402 upper limit, 3402 cannot be processed in the current period
- Step update / action mask: if the acid lifetime exceeds the 3402 upper limit, 3402 cannot be processed in the current period
- Action mask / reward: the acid may only be changed once the acid concentration or acid lifetime reaches its upper limit; otherwise changing the acid is not allowed (see the mask sketch after these notes)
- Reward function
- If there is no initial acid concentration, assign a concentration of 300
- Acid change (acid alter): there is no data for the acid concentration increase
- Gurobi's acid concentration upper limit shows 5500 or 5000, which is inconsistent with the data the model reads in
- The move quantity is not needed as an action; modify the decoder and adjust the constraint on the first 6 hours
- Balance: when computing the average, the Gurobi model also counts PM and down machines; these should be excluded
- Global step balance: the RL version was changed to step balance within the same time window
- Balance computation: should normalized variance be used?
- The objective function's parameter design should be further tuned based on the results
- The buffer size cannot be larger than 64, because the buffer size must be smaller than the ptr size
- Keep consistent with Gurobi's setting `self.period_list = [1, 24]`; otherwise data reading will raise errors
- No need for a continuous action; the move quantity is set directly to its maximum
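The action-mask notes above could be realized roughly as follows; the limit values, action indices, and names here are placeholders, not values taken from the repository.

```python
# Rough sketch of the action-mask notes above. The limit values, action indices,
# and names are placeholders, not values taken from the repository.
import numpy as np

ACID_CONC_LIMIT = 5500.0  # placeholder; the notes above mention 5500 or 5000 in Gurobi
ACID_LIFE_LIMIT = 100.0   # placeholder acid-lifetime limit

def build_action_mask(acid_conc, acid_life, n_actions, process_3402_idx, change_acid_idx):
    """Boolean mask over discrete actions for the current period."""
    mask = np.ones(n_actions, dtype=bool)
    limit_reached = acid_conc >= ACID_CONC_LIMIT or acid_life >= ACID_LIFE_LIMIT
    if limit_reached:
        mask[process_3402_idx] = False  # 3402 cannot be processed this period
    else:
        mask[change_acid_idx] = False   # acid may only be changed once a limit is reached
    return mask
```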
- https://github.com/ray-project/ray
- https://docs.ray.io/en/latest/ray-overview/index.html
- https://docs.ray.io/en/latest/rllib/index.html
- Ray RLlib developer documentation summary - Zhihu (zhihu.com)
- Metro1998/hppo-in-traffic-signal-control (github.com)
- https://github.com/openai/spinningup/tree/master/spinup/algos/pytorch/ppo
- https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/algo/ppo.py
- Deep reinforcement learning tuning tips, with D3QN, TD3, PPO, and SAC as examples - article by Zeng Yiyan - Zhihu https://zhuanlan.zhihu.com/p/345353294