
Journal of University of Chinese Academy of Sciences



Model-based explorer-learner joint optimization via uncertainty augmentation

XIAO Shixiang1, HUANG Wenzhen2, JIAO Jianbin1   

  1 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;
    2 School of Information Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2023-12-04  Revised: 2024-09-05  Published: 2024-12-23
  • Corresponding author: E-mail: jiaojb@ucas.ac.cn
  • Funding: Supported by a Chinese Academy of Sciences project (XDA27010300)


Abstract: In existing model-based reinforcement learning methods, a single policy is used to interact with both the real environment and the environment model, which makes it difficult for the agent to balance the efficiency of exploring the environment against the stability of policy updates. To address this issue, this paper proposes Model-based Explorer-Learner joint Optimization via Uncertainty Augmentation (MELO-UA). MELO-UA jointly optimizes a pair of policies: an explorer policy that interacts with the real environment and a learner policy that interacts with the environment model. When optimizing the explorer, an intrinsic bonus based on model uncertainty is introduced to improve the efficiency of exploring the real environment; when optimizing the learner, model uncertainty is used as a constraint to ensure the stability of policy optimization. Experimental results on multiple continuous control tasks show that the proposed method has clear advantages in asymptotic performance and sample efficiency compared with state-of-the-art methods.

Key words: deep reinforcement learning, model-based reinforcement learning, sample efficiency, exploration in reinforcement learning, uncertainty, model errors
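The mechanism summarized in the abstract can be illustrated with a minimal sketch: the same model-uncertainty estimate (here taken as disagreement within an ensemble of learned dynamics models) is added as an intrinsic bonus to the explorer's real-environment reward, and used to constrain which model-generated transitions the learner trains on. This is only an illustrative sketch under assumed interfaces, not the authors' implementation; the ensemble objects and their predict method, the coefficient beta, the threshold u_max, and all function names are hypothetical.

```python
import numpy as np

# Illustrative sketch of the explorer-learner split with a shared uncertainty signal.
# All names (ensemble, predict, beta, u_max) are assumptions for illustration only.

def model_uncertainty(ensemble, state, action):
    """Disagreement among an ensemble of learned dynamics models,
    used as a proxy for epistemic model uncertainty."""
    predictions = np.stack([m.predict(state, action) for m in ensemble])
    return predictions.std(axis=0).mean()

def explorer_reward(env_reward, uncertainty, beta=1.0):
    """Explorer: real-environment reward augmented with an intrinsic bonus
    that encourages visiting regions where the model is uncertain."""
    return env_reward + beta * uncertainty

def learner_batch(model_rollouts, ensemble, u_max=0.5):
    """Learner: keep only model-generated transitions whose uncertainty is
    below a threshold, so policy updates rely on trustworthy model data."""
    kept = []
    for (s, a, r, s_next) in model_rollouts:
        if model_uncertainty(ensemble, s, a) <= u_max:
            kept.append((s, a, r, s_next))
    return kept
```

In this reading, a single scalar uncertainty signal plays two roles: it is rewarded when the explorer gathers real data, and it gates (constrains) the model rollouts used to update the learner.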
