[1] Schrittwieser J, Antonoglou I, Hubert T, et al. Mastering Atari, Go, chess and shogi by planning with a learned model[J]. Nature, 2020, 588(7839): 604-609. DOI: 10.1038/s41586-020-03051-4.
[2] Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676): 354-359. DOI: 10.1038/nature24270.
[3] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms[EB/OL]. arXiv:1707.06347. (2017-07-20)[2024-07-23]. http://arxiv.org/abs/1707.06347v2.
[4] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533. DOI: 10.1038/nature14236.
[5] Janner M, Fu J, Zhang M, et al. When to trust your model: Model-based policy optimization[EB/OL]. arXiv:1906.08253. (2019-06-19)[2024-07-23]. http://arxiv.org/abs/1906.08253v3.
[6] Nagabandi A, Kahn G, Fearing R S, et al. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning[C]//2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, QLD, Australia. IEEE, 2018: 7559-7566. DOI: 10.1109/ICRA.2018.8463189.
[7] Higuera J C G, Meger D, Dudek G. Synthesizing neural network controllers with probabilistic model-based reinforcement learning[C]//2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018: 2538-2544. DOI: 10.1109/IROS.2018.8594018.
[8] Chua K, Calandra R, McAllister R, et al. Deep reinforcement learning in a handful of trials using probabilistic dynamics models[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. December 3-8, 2018, Montréal, Canada. ACM, 2018: 4759-4770. DOI: 10.5555/3327345.3327385.
[9] Feinberg V, Wan A, Stoica I, et al. Model-based value estimation for efficient model-free reinforcement learning[EB/OL]. arXiv:1803.00101. (2018-02-28)[2024-07-23]. http://arxiv.org/abs/1803.00101v1.
[10] Sutton R S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming[M]//Machine Learning Proceedings 1990. Amsterdam: Elsevier, 1990: 216-224. DOI: 10.1016/b978-1-55860-141-3.50030-4.
[11] Luo Y, Xu H, Li Y, et al. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees[C/OL]//International Conference on Learning Representations. 2019. (2018-12-21)[2024-07-23]. https://openreview.net/forum?id=BJe1E2R5KX.
[12] Buckman J, Hafner D, Tucker G, et al. Sample-efficient reinforcement learning with stochastic ensemble value expansion[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. December 3-8, 2018, Montréal, Canada. ACM, 2018: 8234-8244. DOI: 10.5555/3327757.3327916.
[13] Deisenroth M P, Rasmussen C E. PILCO: A model-based and data-efficient approach to policy search[C]//Proceedings of the 28th International Conference on Machine Learning. 28 June 2011, Bellevue, Washington, USA. ACM, 2011: 465-472. DOI: 10.5555/3104482.3104541.
[14] Heess N, Wayne G, Silver D, et al. Learning continuous control policies by stochastic value gradients[EB/OL]. arXiv:1510.09142. (2015-10-30)[2024-07-23]. http://arxiv.org/abs/1510.09142v1.
[15] Hafner D, Lillicrap T, Ba J, et al. Dream to control: Learning behaviors by latent imagination[EB/OL]. arXiv:1912.01603. (2019-12-03)[2024-07-24]. http://arxiv.org/abs/1912.01603v3.
[16] Hafner D, Pasukonis J, Ba J, et al. Mastering diverse domains through world models[EB/OL]. arXiv:2301.04104. (2023-01-10)[2024-07-23]. http://arxiv.org/abs/2301.04104v2.
[17] Hafner D, Lillicrap T, Fischer I, et al. Learning latent dynamics for planning from pixels[EB/OL]. arXiv:1811.04551. (2018-11-12)[2024-07-23]. http://arxiv.org/abs/1811.04551v5.
[18] Kurutach T, Clavera I, Duan Y, et al. Model-ensemble trust-region policy optimization[EB/OL]. arXiv:1802.10592. (2018-02-28)[2024-07-23]. http://arxiv.org/abs/1802.10592v2.
[19] He W, Jiang Z. A comprehensive survey on uncertainty quantification for deep learning[EB/OL]. arXiv:2302.13425. (2023-02-26)[2024-07-23]. https://arxiv.org/abs/2302.13425.
[20] Psaros A F, Meng X H, Zou Z R, et al. Uncertainty quantification in scientific machine learning: Methods, metrics, and comparisons[J]. Journal of Computational Physics, 2023, 477: 111902. DOI: 10.1016/j.jcp.2022.111902.
[21] Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles[EB/OL]. arXiv:1612.01474. (2016-12-05)[2024-07-23]. https://arxiv.org/abs/1612.01474.
[22] Wang Z, Jusup M, Shi L, et al. Exploiting a cognitive bias promotes cooperation in social dilemma experiments[J]. Nature Communications, 2018, 9(1): 2954. DOI: 10.1038/s41467-018-05259-5.
[23] Lu C, Ball P, Parker-Holder J, et al. Revisiting design choices in offline model-based reinforcement learning[EB/OL]. arXiv:2110.04135. (2021-10-08)[2024-07-23]. https://arxiv.org/abs/2110.04135.
[24] Pan F, He J, Tu D, et al. Trust the model when it is confident: Masked model-based actor-critic[EB/OL]. arXiv:2010.04893. (2020-10-10)[2024-07-23]. https://arxiv.org/abs/2010.04893.
[25] Yu T, Thomas G, Yu L, et al. MOPO: Model-based offline policy optimization[EB/OL]. arXiv:2005.13239. (2020-05-27)[2024-07-23]. https://arxiv.org/abs/2005.13239.
[26] Bechtle S, Lin Y X, Rai A, et al. Curious iLQR: Resolving uncertainty in model-based RL[EB/OL]. arXiv:1904.06786. (2019-04-15)[2024-07-23]. http://arxiv.org/abs/1904.06786v2.
[27] Hao J Y, Yang T P, Tang H Y, et al. Exploration in deep reinforcement learning: From single-agent to multiagent domain[J]. IEEE Transactions on Neural Networks and Learning Systems, 2023. DOI: 10.1109/TNNLS.2023.3236361.
[28] Abbasi-Yadkori Y, Pál D, Szepesvári C. Improved algorithms for linear stochastic bandits[J]. Advances in Neural Information Processing Systems, 2011, 24. DOI: 10.5555/2986459.2986717.
[29] Abbasi-Yadkori Y, Pál D, Szepesvári C. Improved algorithms for linear stochastic bandits[C]//Proceedings of the 24th International Conference on Neural Information Processing Systems. December 12-15, 2011, Granada, Spain. ACM, 2011: 2312-2320. DOI: 10.5555/2986459.2986717.
[30] Bai C J, Wang L X, Han L, et al. Principled exploration via optimistic bootstrapping and backward induction[EB/OL]. arXiv:2105.06022. (2021-05-13)[2024-07-23]. http://arxiv.org/abs/2105.06022v2.
[31] Zheng Y, Liu Y, Xie X F, et al. Automatic web testing using curiosity-driven reinforcement learning[C]//2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). Madrid, Spain. IEEE, 2021: 423-435. DOI: 10.1109/ICSE43902.2021.00048.
[32] Ciosek K, Vuong Q, Loftin R, et al. Better exploration with optimistic actor-critic[EB/OL]. arXiv:1910.12807. (2019-10-28)[2024-07-23]. http://arxiv.org/abs/1910.12807v1.
[33] Haarnoja T, Zhou A, Abbeel P, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor[EB/OL]. arXiv:1801.01290. (2018-01-04)[2024-07-23]. http://arxiv.org/abs/1801.01290v2.
[34] Burda Y, Edwards H, Storkey A, et al. Exploration by random network distillation[EB/OL]. arXiv:1810.12894. (2018-10-30)[2024-07-23]. http://arxiv.org/abs/1810.12894v1.
[35] Liu J Y, Wang Z, Zheng Y, et al. OVD-Explorer: Optimism should not be the sole pursuit of exploration in noisy environments[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(12): 13954-13962. DOI: 10.1609/aaai.v38i12.29303.
[36] Bai C J, Wang L X, Han L, et al. Dynamic bottleneck for robust self-supervised exploration[EB/OL]. arXiv:2110.10735. (2021-10-20)[2024-07-23]. http://arxiv.org/abs/2110.10735v2.
[37] Oudeyer P Y, Kaplan F, Hafner V V. Intrinsic motivation systems for autonomous mental development[J]. IEEE Transactions on Evolutionary Computation, 2007, 11(2): 265-286. DOI: 10.1109/TEVC.2006.890271.
[38] Pathak D, Agrawal P, Efros A A, et al. Curiosity-driven exploration by self-supervised prediction[C/OL]//2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2017: 488-489. DOI: 10.1109/CVPRW.2017.70.
[39] Pathak D, Gandhi D, Gupta A. Self-supervised exploration via disagreement[EB/OL]. arXiv:1906.04161. (2019-06-10)[2024-07-23]. https://arxiv.org/abs/1906.04161.
[40] Yuan Y F, Hao J Y, Ni F, et al. EUCLID: Towards efficient unsupervised reinforcement learning with multi-choice dynamics model[EB/OL]. arXiv:2210.00498. (2022-10-02)[2024-07-23]. http://arxiv.org/abs/2210.00498v2.
[41] Haarnoja T, Tang H R, Abbeel P, et al. Reinforcement learning with deep energy-based policies[EB/OL]. arXiv:1702.08165. (2017-02-27)[2024-07-23]. http://arxiv.org/abs/1702.08165v2.
[42] Pineda L, Amos B, Zhang A, et al. MBRL-Lib: A modular library for model-based reinforcement learning[EB/OL]. arXiv:2104.10159. (2021-04-20)[2024-07-23]. http://arxiv.org/abs/2104.10159v1.