In developing and testing autonomous vehicles, simulators play an essential role in providing controlled environments for experimentation. To address specific research requirements, a custom driving simulator was developed using Pygame, a cross-platform set of Python modules designed for writing video games. The Pygame library was used to define the basic entities within our simulation environment, which operates on Pygame's grid system at its core. The simulated driving environment was structured to conform to OpenAI Gym's specifications, making the simulator compatible with a range of reinforcement-learning algorithms. The custom simulator¹ offers a blend of simplicity and functionality that meets the specific needs of our research.

The simulator was designed with two primary modes of operation: a render mode and a headless mode. Figure 1 depicts the rendered display as shown to participants in behavioural experiments. The render mode was developed for visualising the interactions between the autonomous vehicle and the pedestrian. This visual representation provides a user-friendly interface where participants can visually discern the road, the car, and their avatar as a pedestrian. Open-source assets¹ were employed in this mode to ensure an aesthetically pleasing environment, facilitating participant immersion in the simulation. Conversely, the headless mode operates without any graphical display, offering a significant advantage during the training phase of the autonomous vehicle. In this mode, the environment was essentially reduced to four primary rectangles representing the pedestrian, the car, and the vertical and horizontal stretches of the road.

The vehicle's observations within this environment were straightforward yet effective. The vehicle perceived bounding rectangles, or 'Rects', for both itself and the pedestrian. These Rects were defined within Pygame's internal grid system, ensuring precision and consistency in the vehicle's perception. Additionally, the vehicle was provided with a clear goal, indicating the coordinates it should aim for from its starting position. The simulated actions were defined as a move in one of the four cardinal or four intercardinal directions at a speed of 1, 2, or 3 grid squares (pixels) per computational time step, or a no-move action, giving 25 actions in total (8 directions at 3 speeds, plus the no-move action).
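The action set described above can be enumerated in a few lines. This is only a sketch: the ordering of actions, the choice of index 0 for the no-move action, and the convention that a diagonal move at speed `k` shifts both coordinates by `k` are assumptions made for illustration, not details taken from the simulator.

```python
from itertools import product

# Unit steps (dx, dy) for the four cardinal and four intercardinal directions.
# Screen coordinates are assumed, so negative dy means "up".
DIRECTIONS = [
    (0, -1), (1, -1), (1, 0), (1, 1),
    (0, 1), (-1, 1), (-1, 0), (-1, -1),
]
SPEEDS = (1, 2, 3)  # grid squares (pixels) per time step


def build_actions():
    """Return all displacements; index 0 is the no-move action (assumed ordering)."""
    actions = [(0, 0)]
    for (dx, dy), speed in product(DIRECTIONS, SPEEDS):
        # Assumption: a diagonal move at speed k shifts both axes by k.
        actions.append((dx * speed, dy * speed))
    return actions


ACTIONS = build_actions()
print(len(ACTIONS))
```

Each entry is a displacement that could be added to the top-left corner of the vehicle's Rect at every time step.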

Two further modelling features of the simulator are important. If the vehicle collides with a pedestrian, the episode ends immediately and a large negative reward is allocated to that state. In addition, agents received an additional positive reward when they reached their designated goal coordinates. The DQN loss function was computed over actions, with a lower cost for actions considered best. Because the action space is discrete and rewards are negative, an infinite-valued loss term could in principle arise in the event of a pedestrian collision. The model avoided this issue because a collision-avoiding action was always available: pedestrians were limited to a move speed of 2, while the vehicle could move at a speed of 3. Combined with either the shaped or constant reward, there was therefore always an action better than colliding with a pedestrian, to which the policy could converge. When the task was successfully accomplished, the reward given was positive and matched the absolute value of the smallest possible reward the agent could obtain while still reaching the goal, meaning that a spatially optimal trajectory received a reward of 0.

Additionally, the simulator considered only two sets of starting points and goals, representing driving directions from right to left and vice versa. The maximum episode duration was set to the distance to the goal plus one time step. This constraint ensured that the agent-controlled vehicle could reach its destination by moving at the slowest pace in a direct path. Because rewards were negative and the episode ended on successful completion, agents were motivated to identify the most efficient routes: terminating an episode sooner reduced the accrual of negative rewards.

During the preliminary stages of model training and testing, the DQN models struggled to develop an effective driving policy. The models tended to achieve only slight improvements in reward without demonstrating any genuine driving behaviour. A common tactic adopted by the models was to drive off-screen, thereby immediately ending the accumulation of penalties. Rather than further refining the reward structure to address this challenge, which could obscure the desired behavioural outcomes, a technique known as action-masking was implemented [15]. Action-masking restricts an agent's choices by allowing it to execute only valid actions in a given state. To support this constraint, the simulator was extended with built-in action-masking functionality, so that every state presented to the agent was accompanied by a corresponding action mask. Actions that would result in the vehicle exiting the screen were deemed invalid and were masked out, preventing the agent from selecting them. This approach aimed to guide the DQN models towards more realistic and effective driving behaviours without overly complicating the reward structure.
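A minimal sketch of the masking idea follows. It assumes a Rect given as `(x, y, w, h)` and actions given as `(dx, dy)` displacements; the function names are hypothetical, and the code does not reproduce the modified SB3 DQN used in this work. An action is treated as valid only if the moved rectangle stays entirely inside the 640 × 480 grid, and the greedy choice is restricted to valid actions.

```python
SCREEN_W, SCREEN_H = 640, 480  # simulation grid size in pixels


def action_mask(rect, actions, width=SCREEN_W, height=SCREEN_H):
    """Return a list of booleans: True where the action keeps the rect on screen."""
    x, y, w, h = rect
    mask = []
    for dx, dy in actions:
        nx, ny = x + dx, y + dy
        on_screen = 0 <= nx and nx + w <= width and 0 <= ny and ny + h <= height
        mask.append(on_screen)
    return mask


def masked_greedy_action(q_values, mask):
    """Pick the highest-valued action among those the mask allows."""
    valid = [i for i, ok in enumerate(mask) if ok]
    if not valid:
        raise ValueError("the mask excludes every action")
    return max(valid, key=lambda i: q_values[i])


# Hypothetical example: a 10x10 vehicle in the top-left corner of the screen.
actions = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
mask = action_mask((0, 0, 10, 10), actions)
best = masked_greedy_action([0.0, 10.0, 1.0, 9.0, 2.0], mask)
```

In the example, the two highest-valued actions would move the vehicle off-screen, so the masked choice falls back to the best remaining valid action.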

## Reward Design and RL Training

The agent’s goal is to find the best set of actions \(A\) given the states \(S\) visited, referred to as the optimal policy \(\pi^{*}(s)\), where the policy function \(\pi\) returns an action \(a \in A\) to be taken in the state \(s \in S\) of the environment of interest. A trajectory \(\tau^{\pi} = \left\{\left(s, a = \pi(s)\right)_{t}\right\}\ \forall t \in \left\{1, 2, \cdots, T\right\}\) is the set of state-action pairs returned by policy \(\pi\) for a given terminated instance of environment interaction over time \(T\). The total expected return from a trajectory is:

$$R_{\tau^{\pi}} = \sum_{t=0}^{T} r\left(s_{t}\right), \quad s_{t} \in \tau^{\pi}, \tag{1}$$

which is general to the theory of RL. For this work, however, we also consider a shaped reward, based on the inter-action vector, decomposed into reward for relevant behaviours. The reward function is determined by three terms, providing rewards for speed \({r}_{\nu }\), proximity to destination \({r}_{\delta }\), and rate of heading change \({r}_{{\Delta }}\). These are summed and scaled to give us the full reward for a given state \(r\left(s\right)\):

$$r\left(s_{t}\right) = r_{\nu} + r_{\delta} + r_{\Delta} - 3 = \frac{\nu_{t}}{\nu_{max}} + \left(1 - \frac{\delta_{t}}{\delta_{max}}\right) + \left(1 - \frac{\Delta_{t}}{2\pi}\right) - 3, \tag{2}$$

where \(\nu_{t}\) is the speed of the vehicle at time \(t\), and \(\Delta_{t}\) is the amount of heading change (steering) at time \(t\); both are determined by the previous action \(a_{t-1}\), a dependence on the previous state-action pair that Markovian dynamics allows. The term \(\delta_{t}\) is the distance from the vehicle agent's current state \(s_{t}\) to the agent's given destination. Each of the three terms is in the interval \(\left(0,1\right]\), with \(1\) being the optimal value, and we subtract \(3\) from the reward to ensure that it is never positive, making an expected reward of \(0\) the theoretically optimal solution.

All three of these quantities are required to be within the range \(\left[0,1\right]\) to avoid violating constraints for optimisation that we discuss further below, and we restrict this further to \(\left(0,1\right]\) to avoid numerical errors in computation with variables that are exactly \(0\). As such, they are normalised by their maximums, \(\nu_{max}\), \(\delta_{max}\) and \(2\pi\), in order of appearance. The term \(\Delta_{t}\) is an angle in radians, so \(\Delta_{t} \in \left[0, 2\pi\right)\). The other two maximums are set by environment variables, which are effectively hyperparameters that could be explored; for the purposes of this study they are set to fixed values, as the best environment hyperparameters are not of particular interest. Also, to reflect the task accurately, short distances from the goal and smooth or slight steering should be rewarded more, so both \(\delta_{t}\) and \(\Delta_{t}\) are negated: the task requires minimising these quantities whilst maximising reward.
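As a worked illustration of Eq. 2, the sketch below computes the three normalised terms and their shifted sum. The function and argument names are hypothetical, and the values used for the maximum speed and maximum distance in the example are placeholders rather than the settings used in the study.

```python
import math


def shaped_reward(speed, distance, heading_change, v_max, d_max):
    """Eq. 2: sum of three terms, each at most 1, shifted down by 3."""
    r_speed = speed / v_max                              # faster is better
    r_distance = 1.0 - distance / d_max                  # closer to goal is better
    r_heading = 1.0 - heading_change / (2.0 * math.pi)   # smoother steering is better
    return r_speed + r_distance + r_heading - 3.0


# Placeholder maxima: v_max = 3 px/step, d_max = 800 px (assumed values).
best_case = shaped_reward(3, 0.0, 0.0, v_max=3, d_max=800)
mid_case = shaped_reward(1.5, 400.0, math.pi, v_max=3, d_max=800)
```

With every term at its optimum of 1 the reward is 0, and any shortfall in speed, proximity, or steering smoothness makes it negative.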

The reward structure, as described in Eq. 2, was designed to capture three critical aspects of the vehicle's behaviour: its speed, its proximity to the goal, and the smoothness of its steering. Each of these aspects plays a distinct role in determining the overall performance and safety of the vehicle. In essence, while distance serves as a foundational metric, speed and steering encapsulate both basic and advanced driving behaviours. The dual nature of the speed term, where it can belong to either category, highlights its versatility in capturing different facets of driving. The reward structure, therefore, provides a comprehensive evaluation of the vehicle's performance, balancing both essential and subtle driving behaviours.

In our approach, utility functions are transforms of the reward function that directly influence the system, serving as an alternative to traditional reward models. Utility takes the place of a reward model in exerting this direct influence, but it is not itself a reward model: we shift the reward function to match the individual without changing what the task reward is. By leveraging utility, we aim to discern implicit preferences derived from judgments, thereby facilitating both implicit preference selection and explicit task discovery in tandem.

We apply a utility transform \({U}_{\theta }\) to each term, which takes the form of a parameterised exponent:

$$U_{\theta}\left(r\left(s_{t}\right)\right) = r_{\nu}^{\theta_{\nu}} + r_{\delta}^{\theta_{\delta}} + r_{\Delta}^{\theta_{\Delta}} - 3 = \left(\frac{\nu_{t}}{\nu_{max}}\right)^{\theta_{\nu}} - 1 + \left(1 - \frac{\delta_{t}}{\delta_{max}}\right)^{\theta_{\delta}} - 1 + \left(1 - \frac{\Delta_{t}}{2\pi}\right)^{\theta_{\Delta}} - 1. \tag{3}$$

The advantage of parameterising each term is that it enables feedback to be elicited for each term individually, allowing each term to be non-linear over a trajectory while still decomposing into a linear sum at the policy-execution level. As we do not want to yield a complex-valued reward, which would arise if a negative quantity were raised to a non-integer power, the negative domain shift is also decomposed and applied to each term individually, importantly after applying the parameterised exponent. Figure 2 shows the variation the utility parameters provide and indicates why the construction of the reward requires such care to remain within the range \(\left[0,1\right]\), ensuring a symmetric and bounded variation.
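The order of operations in Eq. 3 (exponent first, shift second, one term at a time) can be sketched as follows. The function name and arguments are hypothetical. The exponents are passed in directly, and the sketch makes no assumption about how a Likert-derived parameter is converted into an exponent; an exponent of 1 on every term reproduces the linear reward of Eq. 2.

```python
def utility(r_speed, r_distance, r_heading, exp_speed=1.0, exp_distance=1.0, exp_heading=1.0):
    """Eq. 3: raise each normalised term to its exponent, then shift each by -1."""
    terms = (r_speed, r_distance, r_heading)
    exponents = (exp_speed, exp_distance, exp_heading)
    total = 0.0
    for term, exponent in zip(terms, exponents):
        if not 0.0 < term <= 1.0:
            # Keeping the base in (0, 1] guarantees a real, bounded result.
            raise ValueError("each reward term must lie in (0, 1]")
        total += term ** exponent - 1.0  # exponent first, shift afterwards
    return total


linear = utility(0.5, 0.5, 0.5)                    # same as Eq. 2 for these terms
steeper = utility(0.5, 1.0, 1.0, exp_speed=2.0)    # exponent > 1 lowers the term
flatter = utility(0.25, 1.0, 1.0, exp_speed=0.5)   # exponent < 1 raises the term
```

Because the base of every power lies in the interval from 0 (exclusive) to 1 (inclusive), each shifted term stays between −1 and 0 for any positive exponent, and a term at its optimum of 1 is unaffected by the exponent.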

At policy execution, we can still return a sum of three scalar terms depending only on the previous state, and therefore violate neither the assumption of Markovian dynamics nor that of sub-game optimal dynamics for learning. Hence the nonlinearity is applied to expressions in the bounded positive interval \(\left(0,1\right]\) (Fig. 2), before other scaling is applied. As a result, the utility function cannot violate these assumptions, as it remains within this bound, respecting the linear optimality by preserving monotonicity and only altering the asymptotic nature of the function. This utility function therefore acts as a filter over the reward and can be optimised directly but statically, only learning a policy for one fixed set of parameters. When the utility is higher than the base linear reward, rewards increase across all states with the effect compounding as the linear reward increases, effectively reinforcing the current policy’s behaviours. The converse is true for low ratings, which encourage exploration of different behaviours as the policy reconverges. The linear form of the utility, equivalent to the standard shaped reward, is used as a pre-trained model from which to train each utility variant model for a further fixed number of episodes.

For the training of agents, the framework of choice was Stable Baselines3 (SB3) [16], a popular library that offers a variety of reinforcement-learning algorithms. A notable limitation of SB3, however, is its lack of support for action-masked DQN, a crucial feature for our simulator. Thus, adaptations to the standard SB3 DQN implementation were needed¹. We used a simulation grid with dimensions of 640 × 480 pixels. The training process was initiated with two distinct DQN models. The primary model was trained using our specially designed reward-shaping mechanism, while another model was trained with a constant reward for each time step. When the vehicle deviated from the road in the simulation, the fixed reward was marginally decreased. The magnitude of these fixed rewards was deliberately chosen to align with the rewards from our shaped function, facilitating a more straightforward comparison between the two models. The primary purpose of the baseline shaped model, where all parameters were set to zero, was to serve as a foundational model. From this baseline, variant models were trained, each for a predetermined number of episodes. The variant models were obtained by combining the three utility-parameterised reward terms with the five parameter levels available to each term, giving \(5^{3} = 125\) parameter sets for the experiments.

The first objective an agent must learn is the path to the goal, and Fig. 2 shows how this takes much longer when the reward is not shaped to guide the agent's progress in that sparse-reward regime. To learn that heading to the goal improves the return, the agent must, on its way there, avoid a pedestrian whose noisy movement runs perpendicular to the vehicle's heading. Again, the shaped reward helps with this learning; however, a straight-line path has a high probability of intersecting with the pedestrian in both cases. There is then a trade-off between an efficient route with high reward and the possibility of hitting the pedestrian. Homing in on early successes too strongly can lead to failures and can cause the agent to change its policy to find safer behaviours.

## Behavioural Experimental Design

The feedback collection experiment was conducted in a controlled environment, within the behavioural science lab in the Psychology Department at the University of Warwick. There was a total of 124 participants in the experiment. Experiments took up to 30 minutes, and participants received course credit for participating. All participants provided informed consent, and the research was approved by the Department of Psychology Research Ethics committee.

Participants were invited to the lab, where they were seated in front of a PC and presented with an instruction sheet. Each session accommodated approximately 12 participants. Upon arrival, participants were given a brief overview of the study's objectives and what was expected of them during the session; before commencing, they were also asked to give signed consent for their data to be collected and anonymised. Following this introduction, the trained models were presented. Participants provided responses judging different aspects of the AV behaviour. These responses were obtained with three different initialisation seeds, ensuring variability in the scenarios and interactions presented to the participants. In total, each participant contributed 200 feedback data points: 150 were direct evaluations related to the utility parameters, while the remaining 50 were judgments about the likelihood of the agent being an AV.

Each participant went through a structured process comprising five blocks of trials, with each block containing 10 individual trials before a reset to the linear model configuration for the next block. During each trial, participants were prompted to provide feedback through the three safety ratings. Additionally, after every trial, participants were asked to judge whether they believed the agent's behaviour was more characteristic of an AV or a human driver. To elicit feedback from human participants, an interactive experiment² was set up using PsychoPy [17]. In this experiment, participants took on the role of a pedestrian within a simulated crossing environment, as depicted in Fig. 1. Their task was to navigate this environment while interacting with the various models. Integrating the Pygame environment into a PsychoPy experiment enabled participants to experience the agent trained with a given reward function. Participants then provided safety ratings on each of the three aspects considered in the reward design, in a seamless loop. This setup allowed the parameters for subsequent models to be adjusted based on feedback from the previous trial.

To ensure consistency and estimate inherent uncertainty, participants were reintroduced to the linear model after every 10 trials, restarting the process of selecting subsequent parameters through safety ratings. This approach provided a consistent reference point for participants throughout the experiment and helped to pinpoint any potentially insincere feedback. The 7th trial in every set of 10 featured a random non-shaped reward trajectory. These trajectories were either from researcher-controlled vehicles or derived from the fixed-reward model. Additionally, after each trial, participants were asked to judge whether the vehicle was human-controlled or operated by an AV. By collecting judgments on human-like behaviour, we aimed to discern whether trajectories deemed "expert" were indeed superior based on participant evaluations. To account for another potential bias, we introduced variability in vehicle colours. With a palette of eight colours, the vehicle's hue was randomly selected each time the simulator was run, and this colour data was stored alongside other trial information. This measure was implemented to identify and account for any colour-based biases in judgments, such as the perception that red cars, often associated with sportiness, might be inherently riskier.

Feedback was gathered from the human participants by asking them to evaluate specific interactive trajectories. Their judgments were recorded on a 7-point Likert scale ranging from −3 (very unsafe) to +3 (very safe). A 5-point scale would not have had enough intervals to include one for 0, the linear utility parameter, so a 7-point scale was used to keep the scale symmetric around 0. After gathering this feedback, participants were then presented with a new interactive trajectory. This trajectory was produced by a utility-adjusted reward function that matched their initial judgment, as described in Eq. 3. This relationship allowed the integer Likert-scale points (LP) to be mapped directly to the parameterised exponents in Eq. 3, according to:

$$\theta_{i} = \begin{cases} \operatorname{sgn}\left(LP_{i}\right) \cdot 2, & \left|LP_{i}\right| \ge 1.8 \\ \operatorname{sgn}\left(LP_{i}\right) \cdot 1, & \left|LP_{i}\right| \ge 0.6 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

At \(\theta_{i} = 0\), this equation reverts to the non-parameterised reward function, as shown in Eq. 2. The structure of Eq. 4 ensures that higher ratings correspond to increased utility, especially when the utility mirrors the form of \(U_{\theta}\). When we consider a three-term reward/utility function combined with the reduced seven-point Likert scale (see below), we obtain \(125\) (or \(5^{3}\)) distinct parameter sets. The five parameter values per term are derived from the Likert scale's range of \(-3\) to \(3\), which was divided into five distinct intervals: \(\left[-3, -1.8\right]\), \(\left(-1.8, -0.6\right]\), \(\left(-0.6, 0.6\right)\), \(\left[0.6, 1.8\right)\), and \(\left[1.8, 3\right]\). Each of these intervals corresponds to an integer parameter ranging from \(-2\) to \(2\) in consecutive order. The resulting function retains its symmetry and exhibits convexity for positive LP values and concavity for negative LP values, as depicted in Fig. 2. This approach provides a systematic way of relating human judgments to specific utility parameters.
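The mapping in Eq. 4 and the resulting count of parameter sets can be checked with a short sketch. The function and variable names are hypothetical; the thresholds are those given in Eq. 4.

```python
from itertools import product


def likert_to_theta(lp):
    """Eq. 4: map a Likert point in [-3, 3] to an integer parameter in {-2, ..., 2}."""
    magnitude = abs(lp)
    sign = (lp > 0) - (lp < 0)  # sgn(lp) as -1, 0 or 1
    if magnitude >= 1.8:
        return sign * 2
    if magnitude >= 0.6:
        return sign * 1
    return 0


# The seven integer Likert points collapse onto five parameter values.
mapped = [likert_to_theta(lp) for lp in range(-3, 4)]

# One parameter per reward term (speed, distance, heading): 5 ** 3 combinations.
PARAMETER_SETS = list(product(range(-2, 3), repeat=3))
print(mapped, len(PARAMETER_SETS))
```

Ratings of ±2 and ±3 both map to ±2, which is how the seven response options reduce to five parameter levels per term.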

The participant thus effectively picked the subsequent model from the pre-trained set with which they then interacted. By analysing these selection data, we aim to identify the utility that aligns most closely with their perception of a superior-performing agent. While individual preferences will naturally vary, often significantly, any consistent trends observed in aggregate data can validate our parameter choices. We have pre-emptively trained models for all potential parameter sets. This enables us to present participants with agents that appear to adjust in real-time based on their feedback, without the need for extensive re-training in real time. During the behavioural experiments, participants were exposed to policies that reflect adjustments in parameters, guided by their LP judgments.

Participants' judgments were not based on passive observation of a scenario. Instead, the participant actively controlled the pedestrian in the environment, and so they based their evaluations on direct interactions with the agent rather than just observing a predetermined trajectory. This active engagement can lead to unique instances, revealing a broader spectrum of the policies' learned behaviours, even when utility parameters are similar or identical.

[2] www.github.com/Rik-Fox/paavi