
Journal of University of Chinese Academy of Sciences



Model-based explorer-learner joint optimization via uncertainty augmentation

XIAO Shixiang1, HUANG Wenzhen2, JIAO Jianbin1   

  1 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;
    2 School of Information Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2023-12-04  Revised: 2024-09-05  Published: 2024-12-23
  • Corresponding author: E-mail: jiaojb@ucas.ac.cn
  • Funding: Supported by a Chinese Academy of Sciences project (XDA27010300)


Abstract: In existing model-based reinforcement learning methods, a single policy is used to interact with both the real environment and the environment model, which makes it difficult for the agent to balance the efficiency of exploring the environment against the stability of policy updates. To address this issue, this paper proposes Model-based Explorer-Learner joint Optimization via Uncertainty Augmentation (MELO-UA). MELO-UA jointly optimizes a pair of policies: an explorer policy that interacts with the real environment and a learner policy that interacts with the environment model. When optimizing the explorer, an intrinsic bonus based on model uncertainty is introduced to improve the efficiency of exploring the real environment; when optimizing the learner, model uncertainty is used as a constraint to ensure the stability of policy optimization. Experimental results on multiple continuous control tasks show that the proposed method has clear advantages in asymptotic performance and sample efficiency compared with state-of-the-art methods.

Key words: deep reinforcement learning, model-based reinforcement learning, sample efficiency, exploration in reinforcement learning, uncertainty, model errors
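The mechanism summarized in the abstract can be illustrated with a minimal sketch: the same model-uncertainty estimate (here taken as disagreement within an ensemble of learned dynamics models) is added as an intrinsic bonus to the explorer's real-environment reward, and used to constrain which model-generated transitions the learner trains on. This is only an illustrative sketch under assumed interfaces, not the authors' implementation; the ensemble objects and their predict method, the coefficient beta, the threshold u_max, and all function names are hypothetical.

```python
import numpy as np

# Illustrative sketch of the explorer-learner split with a shared uncertainty signal.
# All names (ensemble, predict, beta, u_max) are assumptions for illustration only.

def model_uncertainty(ensemble, state, action):
    """Disagreement among an ensemble of learned dynamics models,
    used as a proxy for epistemic model uncertainty."""
    predictions = np.stack([m.predict(state, action) for m in ensemble])
    return predictions.std(axis=0).mean()

def explorer_reward(env_reward, uncertainty, beta=1.0):
    """Explorer: real-environment reward augmented with an intrinsic bonus
    that encourages visiting regions where the model is uncertain."""
    return env_reward + beta * uncertainty

def learner_batch(model_rollouts, ensemble, u_max=0.5):
    """Learner: keep only model-generated transitions whose uncertainty is
    below a threshold, so policy updates rely on trustworthy model data."""
    kept = []
    for (s, a, r, s_next) in model_rollouts:
        if model_uncertainty(ensemble, s, a) <= u_max:
            kept.append((s, a, r, s_next))
    return kept
```

In this reading, a single scalar uncertainty signal plays two roles: it is rewarded when the explorer gathers real data, and it gates (constrains) the model rollouts used to update the learner.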
