Journal of Control, Vol. 17, No. 1, Spring 2023 | JoC 2023, 17(1): 61-76


Mehrabi O, Fakharian A, Siahi M, Ramezani A. Neural Least Square Policy Iteration learning with Critic-only architecture. JoC 2023; 17(1): 61-76
URL: http://joc.kntu.ac.ir/article-1-964-en.html
1- Islamic Azad University
2- Tarbiat Modares University
Abstract:   (2104 Views)
Intelligent control of real-world problems based on reinforcement learning often requires decision-making in a large or continuous state-action space. Because the number of adjustable parameters in discrete reinforcement learning grows directly with the cardinality of the state-action space, such problems suffer from the curse of dimensionality, slow learning, and low efficiency. Continuous reinforcement learning methods that overcome these problems have therefore attracted considerable research interest. In this paper, a novel Neural Reinforcement Learning (NRL) scheme is proposed. The presented method is model-free and independent of a learning rate; it combines Least Squares Policy Iteration (LSPI) with Radial Basis Functions (RBF) as a function approximator, and we call it Neural Least Squares Policy Iteration (NLSPI). By using the basis functions defined by the RBF neural network structure, the method resolves the challenge of hand-designing state-action basis functions in LSPI. To validate the presented method, the performance of the proposed algorithm on two control problems has been compared with other methods. The overall results show the superiority of our method in learning a pseudo-optimal policy.
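The paper's own NLSPI implementation is not reproduced on this page. As a rough illustration of the underlying machinery, the Python sketch below implements standard LSPI (Lagoudakis and Parr [12]) with Gaussian RBF state features, using the common block-coding construction for discrete actions; all function names, the ridge regularizer, and the feature layout are illustrative assumptions, not the authors' code.

import numpy as np

# Illustrative sketch of LSPI with Gaussian RBF features (not the authors' NLSPI code).
# State-action features: the state's RBF activations placed in the block that
# corresponds to the chosen discrete action.

def rbf_features(state, centers, width):
    """Gaussian RBF activations of a state w.r.t. fixed centers (k, dim)."""
    d2 = np.sum((centers - state) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))

def phi(state, action, centers, width, n_actions):
    """Block-structured state-action feature vector of length k * n_actions."""
    k = centers.shape[0]
    out = np.zeros(k * n_actions)
    out[action * k:(action + 1) * k] = rbf_features(state, centers, width)
    return out

def greedy(state, w, centers, width, n_actions):
    """Policy improvement: pick the action maximizing the approximate Q-value."""
    q = [phi(state, a, centers, width, n_actions) @ w for a in range(n_actions)]
    return int(np.argmax(q))

def lstdq(samples, w_old, centers, width, n_actions, gamma=0.99):
    """Policy evaluation via LSTD-Q: solve A w = b from a fixed batch of samples."""
    k = centers.shape[0] * n_actions
    A = 1e-6 * np.eye(k)          # small ridge term keeps A invertible
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a, centers, width, n_actions)
        a_next = greedy(s_next, w_old, centers, width, n_actions)
        f_next = phi(s_next, a_next, centers, width, n_actions)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def lspi(samples, centers, width, n_actions, n_iter=20, tol=1e-4):
    """Alternate LSTD-Q evaluation and greedy improvement until the weights settle."""
    w = np.zeros(centers.shape[0] * n_actions)
    for _ in range(n_iter):
        w_new = lstdq(samples, w, centers, width, n_actions)
        if np.linalg.norm(w_new - w) < tol:
            break
        w = w_new
    return w

Given a batch of (state, action, reward, next state) transitions collected from, for example, an inverted pendulum simulator, lspi(samples, centers, width, n_actions) returns the weight vector of the approximate Q-function, and greedy then extracts the learned policy; note that no learning rate appears anywhere, which is the property the abstract highlights.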
Full-Text [PDF 1596 kb]   (607 Downloads)    
Type of Article: Research paper | Subject: Special
Received: 2022/12/23 | Accepted: 2023/06/06 | ePublished ahead of print: 2023/06/10 | Published: 2023/06/22

References
[1] Sutton, R. S., and Barto, A. G., Reinforcement Learning: An Introduction, Second Edition, MIT Press, Cambridge, Massachusetts, 2017.
[2] Derhami, V., Alamiyan, F., and Dowlatshahi, M. B., Reinforcement Learning, Yazd University Press, 2017.
[3] Derhami, V., and Mehrabi, O., "Action value function approximation based on radial basis function network for reinforcement learning," Journal of Control, Vol. 5, No. 1, pp. 50-63, 2011.
[4] Liu, Y. J., Tang, L., Tong, S., Chen, C. P., and Li, D. J., "Reinforcement learning design-based adaptive tracking control with less learning parameters for nonlinear discrete-time MIMO systems," IEEE Transactions on Neural Networks and Learning Systems, Vol. 26, No. 1, pp. 165-176, 2015. [DOI:10.1109/TNNLS.2014.2360724]
[5] Derhami, V., Majd, V. J., and Ahmadabadi, M. N., "Fuzzy Sarsa Learning and the Proof of Existence of Its Stationary Points," Asian Journal of Control, pp. 535-549, 2008. [DOI:10.1002/asjc.54]
[6] Ghorbani, F., Derhami, V., and Afsharchi, M., "Fuzzy Least Square Policy Iteration and Its Mathematical Analysis," International Journal of Fuzzy Systems, pp. 1-14, 2016. [DOI:10.1007/s40815-016-0270-1]
[7] Barakat, A., Bianchi, P., and Lehmann, J., "Analysis of a Target-Based Actor-Critic Algorithm with Linear Function Approximation," CoRR, abs/2106.07472, 2021.
[8] Zaki, M., Mohan, A., Gopalan, A., and Mannor, S., "Actor-Critic Based Improper Reinforcement Learning," arXiv, 2022.
[9] Allahverdy, D., Fakharian, A., and Menhaj, M. B., "Back-Stepping Integral Sliding Mode Control with Iterative Learning Control Algorithm for Quadrotor UAVs," Journal of Electrical Engineering & Technology, Vol. 14, pp. 2539-2547, 2019. [DOI:10.1007/s42835-019-00257-z]
[10] Sheikhlar, A., and Fakharian, A., "Online Policy Iteration-Based Tracking Control of Four Wheeled Omni-Directional Robots," Journal of Dynamic Systems, Measurement, and Control, Vol. 140, No. 8, 2018. [DOI:10.1115/1.4039287]
[11] Jia, Y., and Zhou, X. Y., "Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms," arXiv, 2021. [DOI:10.2139/ssrn.3969101]
[12] Lagoudakis, M. G., and Parr, R., "Least-squares policy iteration," Journal of Machine Learning Research, Vol. 4, pp. 1107-1149, 2003.
[13] Hwang, K. S., Tan, S. W., and Tsai, M. C., "Reinforcement Learning to Adaptive Control of Nonlinear Systems," IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol. 33, No. 3, pp. 514-521, 2003. [DOI:10.1109/TSMCB.2003.811112]
[14] Buşoniu, L., et al., "Online least-squares policy iteration for reinforcement learning control," American Control Conference (ACC), 2010. [DOI:10.1109/ACC.2010.5530856]
[15] Buşoniu, L., Lazaric, A., Ghavamzadeh, M., Munos, R., Babuška, R., and De Schutter, B., "Least-squares methods for policy iteration," in Wiering, M., and van Otterlo, M. (Eds.), Reinforcement Learning: State-of-the-Art, Adaptation, Learning, and Optimization, Vol. 12, Springer, Heidelberg, Germany, pp. 75-109, 2012. [DOI:10.1007/978-3-642-27645-3_3]
[16] Xu, X., Hu, D., and Lu, X., "Kernel-based least squares policy iteration for reinforcement learning," IEEE Transactions on Neural Networks, Vol. 18, No. 4, pp. 973-992, 2007. [DOI:10.1109/TNN.2007.899161]
[17] Yahyaa, S., and Manderick, B., "Knowledge gradient for online reinforcement learning," in Duval, B., van den Herik, J., Loiseau, S., and Filipe, J. (Eds.), Agents and Artificial Intelligence, ICAART 2014, LNCS, Vol. 8946, Springer, Cham, pp. 103-118, 2014. [DOI:10.1007/978-3-319-25210-0_7]
[18] Jakab, H. S., and Csató, L., "Sparse approximations to value functions in reinforcement learning," in Koprinkova-Hristova, P., Mladenov, V., and Kasabov, N. K. (Eds.), Artificial Neural Networks, Springer, Cham, pp. 295-314, 2015. [DOI:10.1007/978-3-319-09903-3_14]
[19] Cui, Y., Matsubara, T., and Sugimoto, K., "Kernel dynamic policy programming: Applicable reinforcement learning to robot systems with high dimensional states," Neural Networks, Vol. 94, pp. 13-23, 2017. [DOI:10.1016/j.neunet.2017.06.007]
[20] Ruan, A., Shi, A., Qin, L., Xu, S., and Zhao, Y., "A Reinforcement Learning-Based Markov-Decision Process (MDP) Implementation for SRAM FPGAs," IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 67, No. 10, pp. 2124-2128, 2020. [DOI:10.1109/TCSII.2019.2943958]
[21] Howard, R. A., Dynamic Programming and Markov Processes, MIT Press, Cambridge, Massachusetts, 1960.
[22] Perkins, T. J., and Precup, D., "A convergent form of approximate policy iteration," Proc. Int. Conf. on Neural Information Processing Systems, pp. 1595-1602, 2002.
[23] Koller, D., and Parr, R., "Policy iteration for factored MDPs," Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 326-334, 2000.
[24] Hartman, E., Keeler, J. D., and Kowalski, J. M., "Layered neural networks with Gaussian hidden units as universal approximations," Neural Computation, Vol. 2, No. 2, pp. 210-215, 1990. [DOI:10.1162/neco.1990.2.2.210]
[25] Kretchmar, R. M., and Anderson, C. W., "Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning," Proc. International Conference on Neural Networks (ICNN'97), Houston, TX, USA, Vol. 2, pp. 834-837, 1997.
[26] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P., "Benchmarking deep reinforcement learning for continuous control," arXiv preprint arXiv:1604.06778, 2016.
[27] Varga, B., Kulcsár, B., and Chehreghani, M. H., "Deep Q-learning: A robust control approach," International Journal of Robust and Nonlinear Control, Vol. 33, No. 1, pp. 526-544, 2023. [DOI:10.1002/rnc.6457]
[28] Xu, X., Zuo, L., and Huang, Z., "Reinforcement learning algorithms with function approximation: Recent advances and applications," Information Sciences, Vol. 261, pp. 1-31, 2014. [DOI:10.1016/j.ins.2013.08.037]
[29] Second Annual Reinforcement Learning Competition, http://rl-competition.org.
[30] Barreto, A. M. S., and Anderson, C. W., "Restricted gradient-descent algorithm for value-function approximation in reinforcement learning," Artificial Intelligence, Vol. 172, Nos. 4-5, pp. 454-482, 2008. [DOI:10.1016/j.artint.2007.08.001]
[31] Snehal, N., Pooja, K., Sonam, W., Wagh, S. R., and Singh, N. M., "Control of an Acrobot system using reinforcement learning with probabilistic policy search," Australian & New Zealand Control Conference, pp. 68-73, 2021. [DOI:10.1109/ANZCC53563.2021.9628194]
[32] Lim, H.-K., Kim, J.-B., Ullah, I., Heo, J.-S., and Han, Y.-H., "Federated Reinforcement Learning Acceleration Method for Precise Control of Multiple Devices," IEEE Access, Vol. 9, pp. 76296-76306, 2021. [DOI:10.1109/ACCESS.2021.3083087]
[33] Främling, K., "Light-weight reinforcement learning with function approximation for real-life control tasks," Proc. 5th International Conference on Informatics in Control, Automation and Robotics, Funchal, Madeira, Portugal, pp. 127-134, 2008.



Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
