Introduction
Stable Baselines3 (SB3) is a very popular deep reinforcement learning toolkit. It lets you build and evaluate RL algorithms quickly, provides pre-trained agents, and supports features such as saving models and recording videos, which makes it a very powerful library. Stable Baselines3 can help us implement various RL algorithms quickly, or serve as a baseline against which to compare new algorithms.
Install it with the following command:
```bash
pip install stable-baselines3
```
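To check that the installation works, here is a minimal sketch that trains PPO on CartPole, saves the model, and reloads it (assuming SB3 >= 2.0, which uses Gymnasium; the file name is arbitrary):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Train PPO on CartPole with the default hyperparameters
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=50_000)

# Save, then reload the trained agent (the file name is arbitrary)
model.save("ppo_cartpole")
model = PPO.load("ppo_cartpole")

# Roll out one episode with the trained policy
env = gym.make("CartPole-v1")
obs, info = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```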
References
Official documentation: Stable-Baselines3 Docs - Reliable Reinforcement Learning Implementations
Official tutorial: Stable Baselines3 RL tutorial
A YouTuber's SB3 tutorial videos, which help you get started quickly and also cover custom environments:
Introduction to Stable Baseline 3: solving the LunarLander problem with SB3 (note that his test code has a bug; see the comments under the video);
Saving and Loading Models: how to save and load models, and how to log metrics with TensorBoard.
Reinforcement Learning Tips and Tricks
This section collects advice on training RL agents (how to choose an algorithm, and so on). I only give a brief summary here; for details, see the official documentation page Reinforcement Learning Tips and Tricks, or the video RL in practice: tips & tricks and practical session with stable-baselines3.
General advice
Good results in RL are generally dependent on finding appropriate hyperparameters. Recent algorithms (PPO, SAC, TD3) normally require little hyperparameter tuning, however, don't expect the default ones to work on any environment. In short: be prepared to tune the hyperparameters.
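For reference, hyperparameters are simply passed to the algorithm's constructor. The sketch below overrides a few PPO defaults; the specific values are only illustrative and not tuned for any particular task:

```python
from stable_baselines3 import PPO

# Override a few PPO defaults; these values are only illustrative
model = PPO(
    "MlpPolicy",
    "Pendulum-v1",
    learning_rate=3e-4,
    n_steps=1024,
    batch_size=64,
    gamma=0.99,
    gae_lambda=0.95,
    verbose=1,
)
model.learn(total_timesteps=100_000)
```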
When applying RL to a custom problem, you should always normalize the input to the agent (e.g. using VecNormalize for PPO/A2C) and look at common preprocessing done on other environments (e.g. for Atari, frame-stack, …). Please refer to the Tips and Tricks when creating a custom environment paragraph below for more advice related to custom environments. In short: normalizing the input is very important.
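As a minimal sketch of the normalization advice, a vectorized environment can be wrapped with VecNormalize before training PPO (the environment ID and settings below are just placeholders):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# Normalize observations and rewards with running statistics
env = make_vec_env("Pendulum-v1", n_envs=4)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# The normalization statistics must be saved alongside the model
model.save("ppo_pendulum")
env.save("vec_normalize.pkl")
```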
As a general advice, to obtain better performances, you should augment the budget of the agent (number of training timesteps). In short: the agent needs enough training samples.
How to evaluate an RL algorithm
Because most algorithms use exploration noise during training, you need a separate test environment to evaluate the performance of your agent at a given time. It is recommended to periodically evaluate your agent for n test episodes (n is usually between 5 and 20) and average the reward per episode to have a good estimate. (That is, turn exploration off when testing.)
As some policies are stochastic by default (e.g. A2C or PPO), you should also try to set deterministic=True when calling the .predict() method; this frequently leads to better performance. In short: for stochastic policies, pass deterministic=True at prediction time.
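SB3 provides an evaluate_policy helper that does exactly this. A minimal sketch, assuming SB3 >= 2.0 with Gymnasium:

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

model = PPO("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=20_000)

# Use a separate environment for evaluation, with deterministic actions
eval_env = gym.make("CartPole-v1")
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```

For periodic evaluation during training, the EvalCallback class serves the same purpose.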
Advice for custom environments
Some basic advice:
- always normalize your observation space when you can, i.e., when you know the boundaries
- normalize your action space and make it symmetric when continuous (cf. potential issue below). A good practice is to rescale your actions to lie in [-1, 1]; this does not limit you, as you can easily rescale the action inside the environment
- start with a shaped reward (i.e. an informative reward) and a simplified version of your problem
- debug with random actions to check that your environment works and follows the gym interface (a random-action sanity check is included in the sketch at the end of this section)
Two important things to keep in mind when creating a custom environment are to avoid breaking the Markov assumption and to properly handle termination due to a timeout (maximum number of steps in an episode). For instance, if there is some time delay between action and observation (e.g. due to wifi communication), you should give a history of observations as input.
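One simple way to provide such a history is to stack the most recent observations with the VecFrameStack wrapper; a sketch, with a placeholder environment ID and stack size:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack

# Give the agent the last 4 observations instead of only the current one
env = make_vec_env("CartPole-v1", n_envs=1)
env = VecFrameStack(env, n_stack=4)

model = PPO("MlpPolicy", env, verbose=1)
```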
We can use check_env to validate the environment:
```python
from stable_baselines3.common.env_checker import check_env

env = CustomEnv(arg1, ...)
# It will check your custom environment and output additional warnings if needed
check_env(env)
```
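Putting the advice above together, the sketch below shows what a hypothetical minimal custom environment might look like (the class name, dynamics, and reward are made up for illustration; it assumes the Gymnasium-style API used by SB3 >= 2.0), including a random-action sanity check:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3.common.env_checker import check_env


class ToyEnv(gym.Env):
    """Made-up toy task: drive a 1-D state toward zero."""

    def __init__(self, max_steps: int = 200):
        super().__init__()
        self.max_steps = max_steps
        # Symmetric, normalized continuous action space in [-1, 1]
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        # Observation space with known boundaries, also normalized
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(-1.0, 1.0, size=(1,)).astype(np.float32)
        self.steps = 0
        return self.state, {}

    def step(self, action):
        self.state = np.clip(self.state + 0.1 * action, -1.0, 1.0).astype(np.float32)
        self.steps += 1
        # Shaped (informative) reward: the closer to zero, the better
        reward = -float(abs(self.state[0]))
        terminated = bool(abs(self.state[0]) < 0.05)  # task solved
        truncated = self.steps >= self.max_steps      # timeout, handled separately
        return self.state, reward, terminated, truncated, {}


if __name__ == "__main__":
    env = ToyEnv()
    check_env(env)  # verify spaces, dtypes, and the reset/step signatures

    # Debug with random actions before training anything
    obs, info = env.reset()
    for _ in range(100):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        if terminated or truncated:
            obs, info = env.reset()
```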