Research Article

The Journal of the Acoustical Society of Korea. 31 March 2025. 153-159
https://doi.org/10.7776/ASK.2025.44.2.153

I. Introduction

First-Person Shooter (FPS) games are action games controlled from the player’s perspective. These games require players to anticipate the opponents’ behavior as quickly as possible and shoot them before being attacked. This process needs quick reactions and decisions for success. A player’s Reaction Time (RT) is particularly crucial, as faster responses increase their chances of defeating opponents and achieving victory.

One major factor influencing RT is the level of gaming proficiency. The more experienced players become, the better they can strategize and control in-game actions to efficiently eliminate opponents. For example, Toth et al.[1] showed that among three groups of participants classified by gaming skill level, the high-skilled gamer group tended to have a lower ‘Time To Destroy’ (TTD) than the other groups. TTD refers to the time from when a target appears until it is shot and destroyed. The authors mentioned that the difference in TTD between the high-skilled gamer group and other groups can be attributed to advanced sensory-motor integration and optimized decision-making skills. Their extensive gaming experience likely improved sensory efficiency, allowing them to process audio-visual cues, predict target movements, and execute precise actions more quickly. Furthermore, deliberate practice and familiarity with complex in-game scenarios likely enabled them to minimize reaction delays and improve overall task execution.

However, improving gaming proficiency requires a considerable investment of time and effort. As an alternative, methods that optimize information processing by accounting for human perception have been proposed to reduce RT, with auditory optimization standing out as a particularly promising approach. In FPS games, players need to determine the location of opponents quickly using visual and audio cues. Among the various auditory optimization approaches, methods based on Head-Related Transfer Functions (HRTFs) are particularly prominent. HRTFs replicate spectral characteristics received at a listener’s eardrums, providing more plausible sound localization than conventional audio panning.[2] When tailored to match the ear and head shape of an individual, these HRTFs are called personalized HRTFs (pHRTFs).[3] Compared to general HRTFs generated from dummy-head microphones, pHRTFs can achieve superior localization performance. However, the practical use of pHRTFs in gaming is limited by the extensive time and resources required for customization.[4] To overcome these challenges, researchers are exploring simpler and more efficient techniques.

For example, Poirier-Quinot and Katz[4] employed a personalized auditory profile-matching approach to evaluate its impact on player performance in a Virtual Reality (VR) based FPS game. In this experiment, participants evaluated the localization quality of a three-dimensional (3D) stimulus using various auditory profiles and selected the best- and worst-matching profiles for themselves. The study found a significant improvement in RTs when participants played with their best-matching profile. This suggests that selecting the closest-matching auditory profile can reduce RT and has the potential to enhance gaming performance.

Kim et al.[5] proposed an Interaural Time Difference (ITD) optimization method based on 3D scans of individual users’ head shapes. Their approach involves capturing head-related anthropometric data such as head width and depth using a mobile sensor and then identifying the HRTF profile most similar to the participant’s characteristics from existing HRTF databases. To evaluate the effect of the proposed method, the participants performed a sound position detection task in an FPS-like environment. The evaluation results showed that the ITD optimization method reduced RTs more than dummy-head-based HRTFs. These findings imply that the ITD optimization method could enhance gaming performance by shortening RT.

However, the interplay between auditory optimization and gaming proficiency in reducing RTs remains insufficiently understood. In particular, it is unclear how different auditory optimization methods influence the magnitude of RT improvements, and whether the effectiveness of these methods varies with players' gaming skill. These gaps may hinder the design of advanced and more personalized in-game audio systems. To address these issues, the present study investigates the interaction between multiple auditory optimization techniques and gaming proficiency in relation to players’ RTs.

II. Method

In this study, we investigated the interaction between gaming proficiency and auditory optimization in reducing RTs. We conducted a sound position detection task under multiple auditory optimization conditions in a simulated FPS gaming environment.[5] After completing the task, we assessed participants’ gaming skills. This section presents an overview of the experimental paradigm and the questionnaire employed.

2.1 Experimental environment

In this experiment, participants performed a sound position detection task in a virtual environment designed to resemble an FPS game. We developed the FPS-style environment in Unity (version 2022.3.20f) and integrated it with Steam Audio (SA) (version 4.5.3), which is a spatial audio plugin customized for game productions. As shown in Fig. 1, participants controlled a character in a custom-designed virtual room and indicated the direction of the sound stimuli by moving within this space.

https://cdn.apub.kr/journalsite/sites/ask/2025-044-02/N0660440208/images/ASK_44_02_08_F1.jpg
Fig. 1.

(Color available online) Side view of the experimental environment. The green circle represents the horizontal layer, and the purple circle represents the upper layer (+30° layer above the horizontal layer). The red circles serve as visual references, enabling participants to distinguish between the horizontal (orange arrow) and elevated directions (cyan arrow). Sound stimuli were placed at 30° intervals on each layer. The participants controlled a centrally positioned character to indicate the perceived locations of these sound stimuli.

The stimuli were placed around the player and designed to arrive from both horizontal and elevated directions. All sound directions were processed using three HRTFs, with time delays among them remaining under 1 ms: MIT-KEMAR,[7] the default HRTF in SA (hereinafter SA HRTF),[6] and the ITD optimization method.[5] The SA HRTF and MIT-KEMAR were selected for their widespread use in binaural rendering for spatial audio production. The detailed algorithms of the ITD optimization method are described in Reference [5].

This experiment used two stimuli: a footstep sound and a pink burst noise. The footstep sound was a recording of a man in boots walking on leaves, with fast and slow steps alternating. The pink burst noise had a duration of 250 ms and was repeated at 250 ms intervals. All sounds were played continuously and were normalized to -12 dB FS. The experiment used a MacBook Pro M3, an RME Babyface Pro FS audio interface, and Beyerdynamic DT 770 Pro 80 Ω headphones. The headphones were used without frequency equalization to match real-world gaming scenarios. The sound stimuli were played through the headphones at an average level of 68 dB SPL. The participants were not allowed to adjust the volume during the experiment.
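The burst timing and level normalization described above can be sketched as follows. This is only an illustration under stated assumptions, not the stimulus-generation code used in the study: the text does not say whether -12 dB FS refers to peak or RMS level (peak is assumed here), the 48 kHz sample rate is hypothetical, and the placeholder signal is white Gaussian noise rather than pink noise.

```python
import random

def burst_train(samples, sample_rate, on_ms=250, off_ms=250):
    """Gate a signal into bursts: on_ms of signal followed by off_ms of silence."""
    on_n = int(sample_rate * on_ms / 1000)
    off_n = int(sample_rate * off_ms / 1000)
    period = on_n + off_n
    return [x if (i % period) < on_n else 0.0 for i, x in enumerate(samples)]

def normalize_peak(samples, target_dbfs=-12.0):
    """Scale so the absolute peak sits at target_dbfs (0 dB FS = amplitude 1.0).

    Assumption: peak normalization; the source does not specify peak vs. RMS.
    """
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return list(samples)
    gain = 10 ** (target_dbfs / 20) / peak
    return [x * gain for x in samples]

# Placeholder input: 1 s of white noise (NOT pink), hypothetical 48 kHz rate.
rng = random.Random(0)
sample_rate = 48000
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]
stimulus = normalize_peak(burst_train(noise, sample_rate))
```

With these settings, one second of signal contains two 250 ms bursts separated by 250 ms of silence, and the largest sample magnitude is 10^(-12/20), about 0.251 of full scale.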

2.2 Subject and experiment procedure

Sixteen individuals (5 females) aged 21 to 32 years (mean 24.8 years) participated in the experiment. None of the participants reported any auditory impairment.

In the experiment, the participants performed a ‘sound position detection task in an FPS environment,’ which examines the important role of audio cues in FPS games. They were asked to identify where a target sound was played and then shoot in that direction based on auditory information. The setup was designed to replicate a real FPS game environment as closely as possible. To minimize potential biases from environmental cues, visual cues were limited to indicating elevation differences only, thereby ensuring that the participants could not rely on visual information to guide their aiming direction.

Each target under the same conditions was presented four times, resulting in a total of 336 trials per participant (7 horizontal sound positions × 2 position layers × 2 sound stimuli × 3 HRTF conditions × 4 repetitions). Fig. 1 illustrates 12 auditory stimulus positions per layer. In the experiment, stimuli at five lateral positions (excluding the front and back) were alternated between the left and right sides and presented in a randomized order.
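The trial count above can be checked by enumerating the factorial design. The sketch below is illustrative rather than the experiment's actual trial scheduler: the azimuth values, the sign convention for left and right, and the rule that lateral positions switch side on every other repetition are assumptions, since the text only states that lateral positions alternated between sides in randomized order.

```python
import itertools
import random

# Factor levels taken from the trial-count formula in the text.
BASE_AZIMUTHS_DEG = [0, 30, 60, 90, 120, 150, 180]  # front, five lateral, back
LAYERS_DEG = [0, 30]                                 # horizontal and +30 degree layer
STIMULI = ["footstep", "pink_burst"]
HRTFS = ["MIT-KEMAR", "SA HRTF", "ITD Opt."]
REPETITIONS = 4

def build_trials(seed=0):
    """Return a shuffled list of trial dicts covering the full factorial design."""
    rng = random.Random(seed)
    trials = []
    for az, el, stim, hrtf in itertools.product(
        BASE_AZIMUTHS_DEG, LAYERS_DEG, STIMULI, HRTFS
    ):
        for rep in range(REPETITIONS):
            # Assumption: lateral positions switch side on every other repetition;
            # front (0) and back (180) have no left/right counterpart.
            lateral = az not in (0, 180)
            signed_az = -az if (lateral and rep % 2 == 1) else az
            trials.append(
                {"azimuth_deg": signed_az, "elevation_deg": el,
                 "stimulus": stim, "hrtf": hrtf, "repetition": rep}
            )
    rng.shuffle(trials)  # interleave HRTFs, stimuli, and positions at random
    return trials

trials = build_trials()
print(len(trials))  # 7 * 2 * 2 * 3 * 4 = 336
```

Each HRTF condition receives 112 of the 336 trials, and the 7 positions on 2 layers give the 14 position levels that appear in the ANOVA degrees of freedom.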

Participants received no auditory or visual feedback, and the three HRTFs were randomly switched throughout the experiment to prevent learning or familiarization effects from repeated tasks. Apart from the HRTF, the experimental environment was identical across all conditions, including the ITD optimization condition.

After the experiment, we collected the Gaming Skills Questionnaire (GSQ)[8] from each participant. The GSQ is a self-assessment tool designed to measure participants’ perceived gaming skills and experiences across various video game genres. The questionnaire asks about two aspects of each game genre: frequency of play and self-assessed skill level. The participants rated how often they engage with a specific genre (e.g., daily, weekly) and their perceived skill level (e.g., novice to expert) using a 6-point scale for each dimension. Each genre-specific score is calculated as the sum of these two ratings, and the Total Gaming Skill (TGS) score is derived by summing all the genre-specific scores.
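The scoring rule described above amounts to a sum of sums, sketched below. The genre labels and ratings are made up for illustration, and the numeric coding of the 6-point scale is left to the caller because it is not specified here.

```python
def genre_score(frequency, skill):
    """Genre-specific score: frequency rating plus self-assessed skill rating."""
    return frequency + skill

def total_gaming_skill(ratings):
    """TGS: sum of the genre-specific scores over all genres.

    ratings maps a genre name to a (frequency, skill) pair of 6-point ratings.
    """
    return sum(genre_score(freq, skill) for freq, skill in ratings.values())

# Hypothetical responses for three made-up genre labels.
example_ratings = {"fps": (5, 4), "sports": (2, 3), "puzzle": (1, 1)}
tgs = total_gaming_skill(example_ratings)
print(tgs)  # (5 + 4) + (2 + 3) + (1 + 1) = 16
```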

III. Results

In this experiment, we measured two key indices: error distance and RT. The error distance was defined as the difference between the actual position of the sound stimulus and the direction indicated by the participant. The RT was defined as the duration from the start of the trial to the participant's response. In each trial, the participants were required to search for the target by changing their in-game direction and then shoot in that direction.
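One way to compute the two indices is sketched below. The text does not give the exact formula for error distance, so the great-circle angle between the target and response directions is an assumption, as are the coordinate convention (azimuth 0° at the front, elevation 0° on the horizontal layer) and the function names.

```python
import math

def direction_vector(azimuth_deg, elevation_deg):
    """Unit vector for a direction given as azimuth and elevation in degrees."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))

def angular_error_deg(target, response):
    """Angle in degrees between two (azimuth, elevation) directions."""
    t = direction_vector(*target)
    r = direction_vector(*response)
    dot = sum(a * b for a, b in zip(t, r))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))

def reaction_time_s(trial_start_s, response_s):
    """RT: duration from trial start to the participant's response."""
    return response_s - trial_start_s

print(round(angular_error_deg((350, 0), (10, 0)), 3))  # wraps around the front: 20.0
print(reaction_time_s(12.0, 19.5))                      # 7.5
```

Working with unit vectors avoids the wrap-around problem at 0°/360° that a plain subtraction of azimuths would have.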

A two-way ANOVA was conducted on the error distance to examine whether the participants’ responses varied as a function of the HRTF condition and stimulus position. The analysis found no significant main effect of the HRTF condition [F(2, 5264) = 0.9947, p = 0.3699], whereas the effect of stimulus position was significant [F(13, 5264) = 65.3353, p < 0.001]. Next, a three-way repeated measures ANOVA was performed on the RT to investigate the effects of the HRTF condition, type of sound stimulus, and stimulus position. Significant main effects were observed for position [F(13, 195) = 3.1741, p < 0.001], HRTF condition [F(2, 30) = 10.9397, p < 0.001], and sound stimulus [F(1, 15) = 4.6309, p < 0.05]. The mean RTs were 7.51 s for MIT-KEMAR, 7.37 s for SA HRTF, and 6.91 s for the ITD optimization method. Post-hoc multiple comparisons revealed significant differences between the ITD optimization method and both MIT-KEMAR and SA HRTF (adjusted p < 0.05). These RTs are longer than typical RTs, which usually denote the time a participant takes to react to a stimulus and respond, because the RT in this task spans the whole interval from the start of the trial until the participant has searched for the target and shot.

Because RTs differed significantly between HRTF conditions, the following subsection examines how individual gaming proficiency influences this difference.

3.1 TGS and its relationship with the HRTF conditions

To further investigate how individual gaming proficiencies influence RT under different HRTF conditions, a Linear Mixed-effects Model (LMM)[9] was employed. This model analyzed fixed effects, such as the TGS scores and the HRTF conditions, while accounting for inter-subject variability as random effects.

The model is specified as follows:

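$$
RT_{ij} = \beta_0 + \beta_1\,\mathrm{TGS}_i + \beta_2\,\mathrm{SAHRTF}_{ij} + \beta_3\,\mathrm{ITDOpt}_{ij} + \beta_4\,(\mathrm{TGS}_i \times \mathrm{SAHRTF}_{ij}) + \beta_5\,(\mathrm{TGS}_i \times \mathrm{ITDOpt}_{ij}) + u_i + \varepsilon_{ij}, \tag{1}
$$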

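where $RT_{ij}$ is the RT for subject $i$ in trial $j$, and SAHRTF and ITDOpt are indicator variables equal to 1 when the trial used SA HRTF or the ITD optimization method (ITD Opt.), respectively, and 0 otherwise. $\beta_0$ is the intercept, representing the mean RT for the reference HRTF condition (MIT-KEMAR) at a TGS of zero. $\beta_1$ reflects the fixed effect of TGS, while $\beta_2$ and $\beta_3$ represent the effects of SA HRTF and the ITD optimization method compared to MIT-KEMAR. $\beta_4$ and $\beta_5$ capture interaction effects between TGS and either SA HRTF or the ITD optimization method. $u_i$ is the random intercept for each subject, and $\varepsilon_{ij}$ is the residual error.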
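To make the dummy coding in Eq. (1) concrete, the sketch below builds the fixed-effect design row for one trial and evaluates the model's prediction. It is a minimal illustration, not the fitting procedure: the function names are hypothetical, and the coefficient values in the usage example are arbitrary placeholders rather than estimates from this study.

```python
HRTF_CONDITIONS = ("MIT-KEMAR", "SA HRTF", "ITD Opt.")

def design_row(tgs, hrtf):
    """Fixed-effect regressors of Eq. (1), with MIT-KEMAR as the reference level."""
    if hrtf not in HRTF_CONDITIONS:
        raise ValueError(f"unknown HRTF condition: {hrtf!r}")
    sa = 1.0 if hrtf == "SA HRTF" else 0.0
    itd = 1.0 if hrtf == "ITD Opt." else 0.0
    # Order matches beta_0 ... beta_5 in Eq. (1).
    return [1.0, float(tgs), sa, itd, tgs * sa, tgs * itd]

def predict_rt(betas, tgs, hrtf, u_i=0.0):
    """Predicted RT: fixed effects plus the subject's random intercept u_i."""
    return sum(b * x for b, x in zip(betas, design_row(tgs, hrtf))) + u_i

# Arbitrary placeholder coefficients (NOT the fitted values).
placeholder_betas = [10.0, -0.1, -0.5, -2.0, 0.0, 0.05]
rt_ref = predict_rt(placeholder_betas, tgs=20, hrtf="MIT-KEMAR")
rt_itd = predict_rt(placeholder_betas, tgs=20, hrtf="ITD Opt.")
print(rt_ref, rt_itd)
```

For the reference condition both indicators are zero, so only the intercept and the TGS slope contribute; for the other conditions the matching main effect and interaction term are switched on.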

The LMM was fitted using Restricted Maximum Likelihood (REML),[9,10] and the results are summarized in Table 1.

Table 1.

Fixed effects of the linear mixed-effects model.

| Effect | Estimate | Std. error | t-value | p-value |
|---|---|---|---|---|
| (Intercept) (β0) | 9.713 | 6.269 | 1.549 | 0.121 |
| TGS (β1) | -0.084 | 0.235 | -0.357 | 0.721 |
| HRTF: SA HRTF (β2) | -0.060 | 0.568 | -0.106 | 0.916 |
| HRTF: ITD Opt. (β3) | -2.694 | 0.571 | -4.715 | < 0.001 |
| TGS × HRTF: SA HRTF (β4) | -0.002 | 0.021 | -0.087 | 0.931 |
| TGS × HRTF: ITD Opt. (β5) | 0.082 | 0.021 | 3.804 | < 0.001 |

The results indicate that, for a user with a TGS (gaming proficiency) score of zero, the ITD optimization method changes RT by an estimated -2.694 s relative to MIT-KEMAR (β3, p < 0.001). Furthermore, the interaction between ITD optimization and TGS (β5 = 0.082, p < 0.001) suggests that each one-point increase in TGS reduces the magnitude of this -2.694 s effect by 0.082 s.
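As a worked reading of these two estimates, the ITD optimization effect relative to MIT-KEMAR at a given TGS can be written out as below. The TGS values of 10 and 20 are illustrative choices, not scores reported in this study, and the break-even point is a linear extrapolation of the fitted model rather than an observed result.

```latex
\begin{align*}
\Delta_{\mathrm{ITD}}(\mathrm{TGS}) &= \beta_3 + \beta_5\,\mathrm{TGS} = -2.694 + 0.082\,\mathrm{TGS}\ \text{(s)},\\
\Delta_{\mathrm{ITD}}(10) &= -2.694 + 0.820 = -1.874\ \text{s},\\
\Delta_{\mathrm{ITD}}(20) &= -2.694 + 1.640 = -1.054\ \text{s},\\
\Delta_{\mathrm{ITD}}(\mathrm{TGS}) = 0 &\iff \mathrm{TGS} = 2.694 / 0.082 \approx 32.9.
\end{align*}
```

Under the model, the estimated RT advantage therefore shrinks linearly with TGS and would vanish only near a TGS of about 33.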

The other HRTF condition, SA HRTF (β2), and its interaction with TGS (β4) did not significantly affect RT. These findings highlight the effectiveness of the ITD optimization method, particularly for users with lower TGS.

IV. Discussion

This study used a sound position detection task in an FPS environment to observe the effect of auditory optimization. Our results did not indicate significant differences between HRTF conditions in the positional accuracy of the participants’ responses. However, they demonstrated a significant reduction in RT with the proposed ITD optimization method, without depending on visual cues.

Specifically, we found that higher TGS values, which indicate greater gaming proficiency, were associated with smaller reductions in RT. In other words, the benefit of the ITD optimization method on RT improvement was more pronounced for users with lower TGS values, or those less experienced at gaming. Given that novice players typically exhibit slower RTs in gaming environments,[1] our ITD optimization method could be especially beneficial for this group by improving their sound localization speed on the horizontal plane.

In FPS games, players need to make instantaneous decisions to eliminate opponents as quickly as possible. Novice gamers often lag behind experienced players in their ability to infer opponents’ positions using sensory efficiency and prior experience.[1] These novice players face particular challenges in high-pressure situations where they must process multiple information streams simultaneously, including both auditory and visual cues for opponent locations.

Under these conditions, they need to quickly determine which sensory input to prioritize. In this case, they may rely more on auditory cues because they have less experience and fewer strategies for interpreting complex visual scenes. Shelton and Kumar[11] showed that the human sensory systems process auditory information more rapidly than visual information. This helps explain why novices may focus more on auditory cues, which are simpler to interpret and can influence decision-making more quickly. Consequently, in high-pressure gaming situations, novices may depend on auditory cues for initial orientation and immediate responses.

The ITD optimization method can provide more individualized cues for horizontal sound localization. This enhancement particularly benefits the novices, who may rely more on auditory information than others. As a result, they may reduce their RTs more effectively when using our approach compared to the other HRTF methods.

Several limitations should be noted in this discussion. The extent to which the ITD optimization method benefits the novices may depend on factors we did not examine, such as how quickly different individuals learn to navigate and adapt to the simulated acoustic environments. Moreover, these findings may not fully generalize to other conditions where visual cues carry greater weight. While we examined novices’ initial auditory processing, further research is needed to understand how players transition from auditory to visual cue reliance as their skills develop. Such studies would provide a more comprehensive understanding of how auditory optimization strategies interact with different stages of player experience.

V. Conclusions

This study investigated how different auditory optimization methods interact with gaming proficiency to reduce players’ RTs in FPS games. We compared three HRTF conditions: MIT-KEMAR, SA HRTF, and the ITD optimization method. The ITD optimization method achieved greater RT reductions than the other methods, with participants who were less experienced at gaming showing the greatest improvement. These findings underscore the value of personalized horizontal sound cues in helping less-experienced players quickly locate sound sources for successful gameplay. While our results point to a promising avenue for more inclusive and adaptive game audio, further research is needed to validate these benefits across different player populations.

Acknowledgements

This research was supported by the National Research Foundation of Korea (Project No. RS-2024-01240813).

References

[1] A. J. Toth, N. Ramsbottom, C. Constantin, A. Milliet, and M. J. Campbell, "The effect of expertise, training and neurostimulation on sensory-motor skill in e-sports," Comput. Human Behav. 121, 106782 (2021). doi:10.1016/j.chb.2021.106782

[2] C. I. Cheng and G. H. Wakefield, "Introduction to head-related transfer functions (HRTFs): Representations of HRTFs in time, frequency, and space," J. Audio Eng. Soc. 49, 231-249 (2001).

[3] K. Sunder, "Binaural audio engineering," in 3D Audio, edited by J. Paterson and H. Lee (Routledge, New York, 2021). doi:10.4324/9780429491214-7

[4] D. Poirier-Quinot and B. F. Katz, "Assessing the impact of head-related transfer function individualization on task performance: Case of a virtual reality shooter game," J. Audio Eng. Soc. 68, 248-260 (2020). doi:10.17743/jaes.2020.0004

[5] S. Kim, R. Sato, P. Koh, K. Lee, and S. Kim, "Investigating the role of customized interaural time differences on first-person shooter gaming performance," AES 157th Convention, paper no. 157 (2024).

[6] Valve Corporation, Steam Audio Settings - Steam Audio Unity Integration, https://valvesoftware.github.io/steam-audio/doc/unity/settings, (Last viewed January 12, 2025).

[7] W. G. Gardner and K. D. Martin, "HRTF measurements of a KEMAR," J. Acoust. Soc. Am. 97, 3907-3908 (1995). doi:10.1121/1.412407

[8] T. Zioga, C. Nega, P. Roussos, and P. Kourtesis, "Validation of the gaming skills questionnaire in adolescence: effects of gaming skills on cognitive and affective functioning," Eur. J. Investig. Health. Psychol. Educ. 14, 722-752 (2024). doi:10.3390/ejihpe14030048, PMID 38534909, PMCID PMC10969436

[9] J. C. Pinheiro and D. M. Bates, Mixed-Effects Models in S and S-PLUS (Springer, Berlin, 2000), pp. 3-56. doi:10.1007/978-1-4419-0318-1

[10] H. D. Patterson and R. Thompson, "Recovery of interblock information when block sizes are unequal," Biometrika 58, 545-554 (1971). doi:10.1093/biomet/58.3.545

[11] J. Shelton and G. P. Kumar, "Comparison between auditory and visual simple reaction times," Neurosci. Med. 1, 30-32 (2010). doi:10.4236/nm.2010.11004