Deep Q-learning policy optimization method for enhancing generalization in autonomous vehicle control

Date

2025

Publisher

National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"

Abstract

The development of autonomous vehicle control policies based on deep reinforcement learning is a principal technical problem for cyber-physical systems, fundamentally constrained by the high dimensionality of state spaces, inherent algorithmic instability, and a pervasive risk of policy over-specialization that severely limits generalization to real-world scenarios. The object of this investigation is the iterative process of forming a robust control policy within a simulated environment, while the subject focuses on the influence of specialized reward structures and initial training conditions on policy convergence and generalization capability. The study's aim is to develop and empirically evaluate a deep Q-learning policy optimization method that utilizes dynamic initial conditions to mitigate over-specialization and achieve stable, globally optimal adaptive control. The developed method formalizes two optimization criteria. First, the adaptive reward function serves as the safety and convergence criterion, defined hierarchically with major penalties for collision, intermediate incentives for passing checkpoints, and a continuous minor penalty for elapsed time to drive efficiency. Second, the mechanism of dynamic initial conditions acts as the policy generalization criterion, designed to inject necessary stochasticity into the state distribution. The agent is modeled as a vehicle equipped with an eight-sensor system providing 360-degree coverage, making decisions from a discrete action space of seven options. Its ten-dimensional state vector integrates normalized sensor distance readings with normalized dynamic characteristics, including speed and angular error. Empirical testing confirmed the policy's vulnerability under baseline fixed-start conditions, where the agent demonstrated over-specialization and stagnated at a traveled distance of approximately 960 conventional units after 40,000 episodes.
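The hierarchical reward and the ten-dimensional state vector described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the constant values (`COLLISION_PENALTY`, `CHECKPOINT_REWARD`, `TIME_PENALTY`) and normalization bounds (`max_range`, `max_speed`) are assumptions, since the abstract does not publish them; only the structure (major collision penalty, intermediate checkpoint incentive, continuous minor time penalty, eight normalized sensor readings plus normalized speed and angular error) follows the text.

```python
import numpy as np

# Illustrative constants -- the abstract specifies only the hierarchy
# (major / intermediate / minor), not the actual magnitudes.
COLLISION_PENALTY = -100.0   # major penalty: collision terminates the episode
CHECKPOINT_REWARD = 10.0     # intermediate incentive: checkpoint passed
TIME_PENALTY = -0.1          # continuous minor penalty per step (elapsed time)

def step_reward(collided: bool, passed_checkpoint: bool) -> float:
    """Hierarchical reward: safety dominates, progress is rewarded,
    and every step carries a small time cost to drive efficiency."""
    if collided:
        return COLLISION_PENALTY
    reward = TIME_PENALTY
    if passed_checkpoint:
        reward += CHECKPOINT_REWARD
    return reward

def build_state(sensor_distances, speed, angular_error,
                max_range=100.0, max_speed=30.0):
    """Ten-dimensional state vector: 8 normalized sensor distance
    readings plus normalized speed and angular error."""
    s = np.asarray(sensor_distances, dtype=np.float32) / max_range
    dyn = np.array([speed / max_speed, angular_error / np.pi],
                   dtype=np.float32)
    return np.concatenate([s, dyn])   # shape (10,)
```

A DQN over this state would map the `(10,)` vector to Q-values for the seven discrete actions; the sketch above covers only the reward and state construction the abstract describes.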
The subsequent application of the dynamic initial conditions criterion successfully addressed this failure. By forcing the agent to rely on its generalized state mapping instead of trajectory memory, this approach overcame the learning plateau, enabling the agent to achieve full, collision-free track traversal between 53,000 and 54,000 episodes. Final optimization, driven by the elapsed-time penalty, reduced the total track completion time by nearly half. This verification confirms the method's value in producing robust, stable, and efficient control policies suitable for integration into autonomous transport cyber-physical systems.
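The dynamic-initial-conditions mechanism can be sketched as randomized episode resets: instead of always spawning at a fixed start pose (which lets the agent memorize one trajectory), each episode begins at a random pose along the track. This is a hedged sketch under assumptions; the function name, the checkpoint-based spawn list, and the heading-jitter range are illustrative, not taken from the paper.

```python
import random

def sample_initial_condition(checkpoints, heading_jitter_rad=0.2, rng=random):
    """Pick a random start pose for the episode: a nominal (x, y, heading)
    from the track's checkpoint list, with the heading slightly perturbed,
    so the agent cannot rely on trajectory memory from a fixed start."""
    x, y, heading = rng.choice(checkpoints)
    heading += rng.uniform(-heading_jitter_rad, heading_jitter_rad)
    return x, y, heading

# A fixed-start baseline (the over-specializing setup from the abstract)
# would simply return checkpoints[0] every episode instead.
```

Injecting this stochasticity into the initial state distribution is what forces the Q-network to learn a generalized state-to-action mapping over the whole track rather than a single memorized route.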

Keywords

deep Q-learning, autonomous vehicle, policy generalization, reward function, dynamic initial conditions, cyber-physical systems

Bibliographic description

Drahan, M. Deep Q-learning policy optimization method for enhancing generalization in autonomous vehicle control / Andrii Pysarenko, Mykhailo Drahan // Information, Computing and Intelligent systems. – 2025. – No. 7. – P. 96-109. – Bibliogr.: 15 ref.
