Detecting driver fatigue using heart rate variability: A systematic review



Introduction
Driver fatigue is a major concern for road safety, and it accounts for 10-30 % of all fatal crashes (Hallvig et al., 2014; Philip and Åkerstedt, 2006; Zwahlen et al., 2016). Therefore, driver fatigue detection systems could potentially reduce fatigue related road fatalities and severe injuries. In the European Union, driver monitoring systems will become mandatory for newly produced vehicles (European Parliament and Council, 2019) and will become part of the Euro NCAP safety assessment (Euro NCAP, 2017).
Fatigue is a complex phenomenon caused by multiple factors, and there is no consensus in the literature on the definition of fatigue and its relationship to sleepiness (Weinbeer et al., 2018). These terms have often been used synonymously in the literature. In this review, we will not develop the definition of fatigue or sleepiness and distinguish between them. Instead, we will break them down into common causation factors for fatigue in road driving (Fig. 1), and both terms will be used with the intention to follow the original literature cited in the review. It has been suggested that driver fatigue has both sleep related and task related causes (May and Baldwin, 2009). As shown in Fig. 1, sleep related fatigue is influenced by the circadian rhythm of sleepiness as well as the sleep homeostat, which depends on sleep duration and time awake since the last sleep episode. Task related fatigue is influenced by the driving itself and depends on time on task and the mental task load. Both underload and overload can contribute to fatigue, and the influence on driver performance and countermeasures may vary accordingly (Williamson et al., 2011). It is worth noting that studies focusing on driver sleepiness can include not only sleep related factors but also task related factors, and sleepiness measures such as subjective sleepiness rating could be influenced by task related factors as well (Åkerstedt et al., 2014).
Current fatigue detection systems are typically based on assessments of either driving performance such as speed and steering, facial features such as head pose, eye closure, and eye gaze, or physiological measurements such as electroencephalography (EEG), electrocardiography (ECG) and electromyography (EMG). Most of the current commercially available systems are based on driving performance and facial features detected by cameras (Chowdhury et al., 2018). However, those methods will be challenged by the increasing application of vehicle automation systems. SAE International defines six levels of driving automation from Level 0 (no automation) to Level 5 (full automation) (SAE, 2016). Many currently produced cars are equipped with Level 1 and 2 automation systems. In this case, speed and steering could be controlled by the vehicle through lane keeping and adaptive cruise control functions, and that information can then no longer be used as a measure of the driver's performance (Gonçalves and Bengler, 2015). Prototypes of Level 3 and 4 vehicles are on trial in demonstration sites, in which case the driver will no longer be responsible for monitoring the environment when automation is active. When reaching Level 3 and above in the future, facial features including eye gaze, eyelid closure and head positioning may not be available as indicators of driver fatigue (Wörle et al., 2019). At the same time, fatigue could become more frequent under automated driving if the driver does not have active task engagement (Ahlström et al., 2021; Körber et al., 2015; Schömig et al., 2015; Vogelpohl et al., 2019). Physiological measurements of fatigue could become a potential solution to this challenge.
A recent review investigated the performance of driver sleepiness detection methods using physiological signals (Watling et al., 2021). They concluded that progress is needed to reach sufficient specificity and sensitivity and that using multiple physiological signals resulted in improved assessment. However, many physiological sensors are not favorable for daily usage due to the obtrusive measurement setups that require attachment of gel electrodes and wiring (Lohani et al., 2019). Heart rate variability (HRV) has drawn particular interest due to its relationship with fatigue and ease of measure in real life (Lohani et al., 2019). HRV is the fluctuation of time between adjacent heart beats. The variation of heart rate (HR) is generated by heart-brain interaction through the sympathetic and parasympathetic branches of the autonomic nervous system. HRV reflects the response of cardiac autonomic nerves to inputs from baro-, chemo-, nasopharyngeal and other receptors, as well as central autonomic commands that are associated with stress, physical activity, arousal, sleep, etc. (Silvani et al., 2016). Several sleep laboratory studies show that HRV can be a good indicator for vigilance state measured by reaction speed to visual stimulus under total sleep deprivation (Chua et al., 2012) and partial sleep deprivation (Henelius et al., 2014;Kaida et al., 2007). HRV has also been shown to be associated with cognitive task demand and time on task effects (Hidalgo-Muñoz et al., 2018;Luque-Casado et al., 2016). With the development of unobtrusive sensing techniques, HR and HRV could be measured through wearable sensors (Sikander and Anwar, 2019;Zheng et al., 2014) or vehicle integrated sensors (Leonhardt et al., 2018;Pinto et al., 2017) in daily driving scenarios. 
Several studies have approached the relationship between driver fatigue and HRV parameters by building fatigue classifiers based on HRV features (Abtahi et al., 2018;Buendia et al., 2019;Fujiwara et al., 2019;Kundinger et al., 2020a;Lenis et al., 2016;Li and Chung, 2013;Mahachandra et al., 2012;Patel et al., 2011;Persson et al., 2021;Zeng et al., 2020).
Although many studies have reported HRV as a driver fatigue indicator, there is not yet a consensus on how HRV changes during the development of driver fatigue. Several reviews on driver monitoring systems have included solutions based on HR (Sahayadhas et al., 2012; Sikander and Anwar, 2019; Watling et al., 2021), but the relation between HRV and fatigue has not been summarized. This review aims to summarize and analyze the literature on 1) how HRV features change under fatigue, 2) the performance of HRV based fatigue detection systems, and 3) the potential for HRV to be used as an indicator of driver fatigue in real life settings. We conducted a systematic review of studies that have explored the relationships between HRV and driver fatigue and that have developed HRV based driver fatigue detection systems.

Search methods
Three databases deemed most relevant for the research topic were searched in this systematic review, i.e., PubMed, Scopus, and Web of Science (Web of Science Core Collection). The search was conducted in July 2021 and there was no limit to the starting date. The terms used in the search were '(heart rate OR hr OR hrv) AND (sleep* OR drows* OR fatigue) AND driver'. The terms were searched for in the fields of title, abstract and keywords. Metadata (title, author list, journal, volume, etc.) of the articles from the search results together with the abstracts were downloaded and imported to Rayyan (Ouzzani et al., 2016) for screening and selection.

Eligibility criteria
We included only original research journal articles written in English. As we were aiming to investigate solely the relationship between HRV and driver fatigue, the included studies should report the relation between driver fatigue and HRV explicitly. Studies that mix HR or HRV together with other measurements were excluded from the review. Studies that were not conducted with a car driving task, e.g., airplane, ship, or train driving, as well as race car driving, were also excluded. Since the focus was on mental fatigue, studies that targeted physical fatigue were also excluded. The selection process is shown in Fig. 2. In total, 977 records were found in the three databases and 633 records remained after duplicate removal, of which 384 journal articles in English were kept for screening. After reading through the title and abstract, 348 articles were excluded, and 36 articles were left for full text assessment. Following full text investigation, 18 articles were removed, leaving 18 articles for review. Among the 18 removed articles, six articles were removed due to mixing HRV with other measurements, four studies were not performed with car driving tasks, two did not include HR or HRV, two did not have a reference fatigue measure, two used HR or HRV directly as a valid fatigue indicator, one demonstrated the development of a HR based fatigue detection system without evaluation, and one studied how a HR based fatigue detection system correlated to driver behavior. The eligibility criteria were decided by all authors together. Search, screening, and final article selection were performed by the first author (K.L).
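The counts reported for the selection process can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text; the variable names and the shortened exclusion labels are illustrative and not taken from the original selection protocol.

```python
# Counts taken from the selection process described in the text.
records_found = 977
after_duplicate_removal = 633
english_journal_articles = 384
excluded_on_title_abstract = 348

# Articles that went on to full text assessment.
full_text_assessed = english_journal_articles - excluded_on_title_abstract

# Reasons for exclusion at the full text stage (labels shortened for illustration).
full_text_exclusions = {
    "mixed HRV with other measurements": 6,
    "not a car driving task": 4,
    "no HR or HRV": 2,
    "no reference fatigue measure": 2,
    "HR or HRV used directly as fatigue indicator": 2,
    "system development without evaluation": 1,
    "correlation of HR based system with driver behavior": 1,
}
excluded_full_text = sum(full_text_exclusions.values())
included = full_text_assessed - excluded_full_text

print(full_text_assessed, excluded_full_text, included)  # prints: 36 18 18
```

The seven exclusion reasons sum to the 18 removed articles, and the remaining 18 match the number of articles included in the review.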

Data extraction
The information extracted from the selected publications included demographics, driving tasks, measurement methods, classification approach, and detection performance and results. For demographics information, number of participants and the age of the participants were extracted. For the driving task, we extracted the type of driving task (simulator or on-road), the duration of each driving session, how many driving sessions each participant performed, and any manipulation method to introduce fatigue. Regarding measurements, we extracted the method for the HR measurement and how the reference level of driver fatigue was measured. For studies that aimed to build classifiers, the validation methods and detection performance were extracted. For studies that reported HRV under fatigued conditions compared to alert conditions, we extracted the HRV features that differed between conditions and the direction of the changes. We focused on standard HRV features included in the Task Force guidelines for HRV measurements (Malik et al., 1996), which have been widely used in this field and in the included studies. Included HRV features and their short descriptions can be found in Table 1, and more detailed definition information can be found in (Malik et al., 1996; Shaffer and Ginsberg, 2017).
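As an illustration of what the standard time domain features measure, the following minimal sketch computes a few of them from a list of NN intervals. It assumes the intervals are given in milliseconds and have already been cleaned of ectopic beats and artifacts; the function name and dictionary keys are hypothetical, and the feature selection (mean NN, SDNN, RMSSD, pNN50) is an example rather than a reproduction of Table 1.

```python
import math
from statistics import mean, stdev


def time_domain_hrv(nn_ms):
    """Return common time domain HRV features from NN intervals in milliseconds.

    Assumes artifact-free NN intervals; no beat correction is performed here.
    """
    if len(nn_ms) < 3:
        raise ValueError("need at least three NN intervals")
    # Successive differences between adjacent NN intervals.
    diffs = [b - a for a, b in zip(nn_ms, nn_ms[1:])]
    mean_nn = mean(nn_ms)
    return {
        "mean_nn_ms": mean_nn,
        # Mean HR approximated from the mean NN interval.
        "hr_bpm": 60000.0 / mean_nn,
        # SDNN: sample standard deviation of all NN intervals.
        "sdnn_ms": stdev(nn_ms),
        # RMSSD: root mean square of successive differences.
        "rmssd_ms": math.sqrt(mean([d * d for d in diffs])),
        # pNN50: percentage of successive differences larger than 50 ms.
        "pnn50_pct": 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs),
    }


# Hypothetical five-beat example, far shorter than any real analysis window.
features = time_domain_hrv([800, 810, 790, 850, 800])
```

In practice, such features are computed over windows of several minutes, and the window length itself affects the values obtained.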

Results and discussion
Details of the reviewed studies are listed in Table 2. A substantial variation in the study implementation and results can be found across all studies. In total, 11 of the reviewed studies demonstrated differences in HRV features between fatigued states and alert states (Table 2). In studies that examined the HR level (or mean NN interval), reduced HR was found when drivers were fatigued, with only one exception where no significant change was found (Egelund, 1982). However, when it comes to the other time and frequency domain HRV features, the changes are not consistent across all studies. There were 11 reviewed studies that had developed HRV based fatigue detection systems and the reported detection performance ranged from 44 % to 100 % in accuracy. The difference in outcomes could be the result of differences in several aspects of the study designs including experiment setups, fatigue definition, and validation methods. The differences in study design also make it difficult to compare the results quantitatively across all studies. In the following discussion we will highlight several key elements in the study design and their potential influences on the outcomes.

Study population
The sample sizes of reviewed studies ranged from 2 to 86. Half of the studies had relatively small study samples with fewer than 10 participants. There were six studies that included >30 participants in their experiments (Buendia et al., 2019; Fujiwara et al., 2019; Kundinger et al., 2020a, 2020b; Persson et al., 2021; Vicente et al., 2016).
Age is known to influence the majority of linear and nonlinear HRV indices for both long term (Voss et al., 2009) and short term (Voss et al., 2012) measurements. Among the reviewed articles, five studies included participants with a wide age range (Buendia et al., 2019; Kundinger et al., 2020a, 2020b; Patel et al., 2011; Persson et al., 2021). In addition, one study approached the influence of age by separating participants into two age groups (Kundinger et al., 2020b). That study suggests that a model developed with a specific age group is not well suited for another age group.
Sex is another demographic factor potentially associated with HRV. Sex differences in HRV measures have been reported by many studies (Koenig and Thayer, 2016). Among the reviewed articles, (Zeng et al., 2020) investigated sex differences for 13 HRV measures in both alert and fatigued states. In that study, more measures in the fatigued state showed significant differences between sexes than in the alert state, and male drivers had more measures with significant differences between fatigued and alert states than female drivers.

Driving task
About two thirds of the reviewed studies were performed in a simulator, and six studies used a real road driving task (Buendia et al., 2019; Egelund, 1982; Jung et al., 2014; Persson et al., 2021; Salvati et al., 2021; Wang et al., 2019). The study by (Vicente et al., 2016) used data from both simulator and on-road driving. Two reviewed studies had an automated driving task with simulated SAE Level 2 driving (Kundinger et al., 2020a, 2020b), whereas the remaining studies were performed with manual driving.
3.1.2.1. Fatigue manipulation. The reviewed studies have taken different approaches to introduce fatigue to the subjects. Circadian rhythms and sleep homeostasis are two main contributors to sleep related fatigue (Franken and Dijk, 2009). Six studies manipulated fatigue by letting the participants perform driving tasks at different times of the day (Buendia et al., 2019; Kundinger et al., 2020a, 2020b; Lee et al., 2019; Murugan et al., 2020; Persson et al., 2021). (Kundinger et al., 2020a) and (Vicente et al., 2016) introduced partial or complete sleep deprivation before the driving session. When it comes to task related fatigue, under-stimulated and prolonged driving is known to introduce higher risk of driver fatigue (Williamson et al., 2011). Several studies opted to use monotonous driving tasks to speed up the development of fatigue (Fujiwara et al., 2019; Kundinger et al., 2020a, 2020b; Lenis et al., 2016; Murugan et al., 2020). The duration of the driving task varied from 10 min up to several hours. For studies with continuous and prolonged driving tasks, the time-on-task also became an important factor for fatigue development.

Measurements
3.1.3.1. HR measurement method. Several types of HR measurement devices have been used in the studies included in the review. Conventional ECG with gel electrodes was used by the majority (Buendia et al., 2019; Egelund, 1982; Fujiwara et al., 2019; Lenis et al., 2016; Murugan et al., 2020; Patel et al., 2011; Persson et al., 2021; Vicente et al., 2016; Wang et al., 2019; Zeng et al., 2020). Wearable ECG-based HR chest straps, used by (Khamis et al., 2016) and (Lee et al., 2019), are another solution that offers better usability than ECG with gel electrodes. (Jung et al., 2014) used integrated ECG electrodes on the steering wheel. Photoplethysmography (PPG) based solutions can be even easier to use since they can be integrated in wrist bands (Kundinger et al., 2020a, 2020b; Lee et al., 2015) or the steering wheel (Rahim et al., 2015). This advantage may enable pervasive usage in daily driving scenarios for HRV based monitoring. However, PPG based solutions are more sensitive to motion artifacts compared to ECG based devices. Two studies (Kundinger et al., 2020a; Lee et al., 2015) compared the wrist band PPG to ECG for fatigue detection and show that PPG-based HR can be used for this application but with reduced detection performance compared to ECG. Salvati et al. (2021) used a microphone sensor integrated in the seat cover for HR detection.

3.1.3.2. Reference fatigue measure. Different approaches were taken to measure the fatigue level as the ground truth. Some studies provided insufficient information about the definition of fatigued state used. This makes it difficult to interpret results and compare results across different studies. Observer rating was the most used method for reference fatigue (Kundinger et al., 2020a, 2020b; Lee et al., 2019; Lenis et al., 2016; Murugan et al., 2020; Rahim et al., 2015; Vicente et al., 2016; Zeng et al., 2020). However, low inter-rater agreement has been found for observer ratings of fatigue (Ahlstrom et al., 2015). Subjective ratings were also used in many studies. The Karolinska Sleepiness Scale (KSS) is a well validated scale for subjective rating (Kaida et al., 2006) and was used by several studies (Buendia et al., 2019; Khamis et al., 2016; Kundinger et al., 2020b; Persson et al., 2021; Salvati et al., 2021). Some studies used self-defined scales, e.g., (Wang et al., 2019) used a 4-level scale that has not been evaluated, and the cut off level for classification was not reported. Another approach was to define fatigue based on percentage of eyelid closure over the pupil over time (PERCLOS) (Li and Chung, 2013). PERCLOS has been used extensively as a measure of fatigue but the relationship with subjective sleepiness is not straightforward (Sommer and Golz, 2010). (Fujiwara et al., 2019) used EEG signals to find the N1 sleep stage onset defined by alpha wave attenuation. One study did not use a reference measure for fatigue but used the driving distance or time-on-task as the reference (Egelund, 1982).

Notes to Table 2:
- Fields for HRV response and fatigue detection system are left empty when not investigated by the study.
- For descriptions of driving scenarios, original phrasing from the reviewed article is used.
- For HRV response, '+' and '-' stand for a higher and lower value in the fatigued state compared with the alert state, respectively, and 'n.s.' for no significant change.
between 1 min and 5 min. Few studies used only one sample (a short driving session or a short window picked from the driving session) for each participant under each condition (fatigued/alert).

Learning and validation.
Most of the reviewed detection systems were built with supervised learning methods, where each sample (usually containing data measured within a certain time window) used for the model training was labeled with the fatigue condition. (Fujiwara et al., 2019) and (Wang et al., 2019) applied semi-supervised and unsupervised learning with anomaly detection approaches instead. In these cases, the models were first built with the data under alert conditions or the entire dataset with the majority being alert conditions, and then the models were used to detect anomalies in data, which were identified as the fatigued conditions. Validation methods can have a significant impact on the performance measure (Saeb et al., 2017). It should be taken into consideration that HRV measures have inter-individual differences in both resting level and response to different stimulations (Nunan et al., 2010; Ohyama et al., 2007). Data samples from the same driving session and participants are likely to be highly correlated. To evaluate real life performance for new users, data from the same driving session and the same participant must not appear in both the training and test sets. Some reviewed studies have applied leave one subject out (LOSO) cross validation where such separation was achieved (Kundinger et al., 2020a; Persson et al., 2021; Vicente et al., 2016). Most of the remaining studies used 10-fold cross validation or hold out validation without arrangement for participant separation in training and testing, which may bias the results and exaggerate detection performance in relation to future use in real life driving scenarios.
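The subject-wise separation behind LOSO cross validation can be sketched in a few lines. The example below is a generic illustration and not the implementation used by any of the reviewed studies; the function name and the driver identifiers are hypothetical.

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject.

    subject_ids[i] is the participant that sample (time window) i belongs to.
    """
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    for held_out in sorted(by_subject):
        # All windows of the held-out participant form the test set.
        test_idx = by_subject[held_out]
        # Windows from every other participant form the training set.
        train_idx = sorted(
            i for sid, idxs in by_subject.items() if sid != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical example: six windows recorded from three drivers.
subjects = ["d1", "d1", "d2", "d2", "d3", "d3"]
folds = list(leave_one_subject_out(subjects))
```

With a plain k-fold split over shuffled windows, highly correlated windows from the same driver can end up on both sides of the split, which is the source of the optimistic bias described above.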
To deal with the inter-individual differences, some reviewed studies have applied personalization methods. (Vicente et al., 2016) and (Persson et al., 2021) used personalized baselines to create a personalized feature set that accounts for the basal level of personal HRV measures. For models developed using the anomaly detection approach (Fujiwara et al., 2019;Wang et al., 2019), each person has their own model based on data from themselves.
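One simple way to express features relative to a personal baseline is to standardize each feature against the same driver's alert recordings. The sketch below uses a plain z-score and a fixed threshold as a stand-in; it is not the personalization method of (Vicente et al., 2016) or (Persson et al., 2021), nor the anomaly detection model of (Fujiwara et al., 2019) or (Wang et al., 2019). The function names, feature names, and the threshold of 3.0 are assumptions made for illustration.

```python
from statistics import mean, stdev


def personalize(features, baseline):
    """Z-score each feature against the same driver's alert baseline windows.

    features: {name: value} for the current window.
    baseline: {name: [values from alert windows of the same driver]}.
    """
    z_scores = {}
    for name, value in features.items():
        ref = baseline[name]
        sd = stdev(ref)
        # A constant baseline gives no scale information; report no deviation.
        z_scores[name] = (value - mean(ref)) / sd if sd > 0 else 0.0
    return z_scores


def is_anomalous(z_scores, threshold=3.0):
    """Flag a window whose features deviate strongly from the personal baseline."""
    return any(abs(z) > threshold for z in z_scores.values())


# Hypothetical baseline from three alert windows of one driver.
baseline = {"sdnn_ms": [40, 50, 60], "hr_bpm": [70, 72, 74]}
window = {"sdnn_ms": 85, "hr_bpm": 66}
flagged = is_anomalous(personalize(window, baseline))
```

A deviation from the personal baseline is not specific to fatigue, so a scheme like this still needs a validated reference measure to establish what the flagged windows represent.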

HRV response for driver fatigue
Not all reviewed studies reported how HRV variables were related to fatigue, i.e., the direction of change when going from alert to fatigued. In total, 11 studies reported the difference in HR or HRV between fatigued and alert states. Among all measures, the LF/HF and HR level (or mean NN interval) were most investigated (Table 2). Decreased HR (increased mean NN interval) in the fatigued state was reported by five out of six reviewed studies that investigated HR change, while the remaining study did not find a significant change. For other time and frequency domain HRV parameters, contradictory results can be found where both increased and decreased values have been reported (Table 2). Several reviewed studies (Li and Chung, 2013;Patel et al., 2011;Rahim et al., 2015;Vicente et al., 2016) and other studies (Awais et al., 2014;Byeon et al., 2006) have considered LF/HF to be an important indicator of fatigue, as a reflection of the balance between parasympathetic and sympathetic nerve activity. However, the changes of LF/HF were not consistent across all reviewed studies. The inconsistency can be caused by different experiment setups, including the driving task, cause of fatigue and level of fatigue. Small study samples could also limit the generalizability of some studies.
It has been hypothesized that fatigue activates the parasympathetic nervous system, which leads to higher levels of HF, whereas when the sleep demand is counteracted by subjects fighting to stay awake this will lead to sympathetic activation that increases LF (Vicente et al., 2016). Therefore, for real road driving, drivers might have higher intention to fight against sleepiness than in the simulator studies, which leads to higher sympathetic activation. However, the physiological base of such an assumption is questionable. The HRV LF power is reflecting a mixture of sympathetic and parasympathetic activities together with other factors and the LF power is thus not directly correlated to sympathetic nerve activity (Moak et al., 2007;Piccirillo et al., 2009). The physiological base for LF/HF is indistinct and to interpret LF/HF as the balance between the parasympathetic and sympathetic nerve activity has been challenged (Billman, 2013).
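For readers less familiar with the frequency domain measures, the sketch below shows one way the LF/HF ratio can be computed: the NN series is resampled to an even time grid, the mean is removed, and spectral power is summed in the LF (0.04-0.15 Hz) and HF (0.15-0.4 Hz) bands defined in the Task Force guidelines (Malik et al., 1996). The 4 Hz resampling rate, the linear interpolation, and the plain periodogram are simplifying assumptions; the reviewed studies may have used other spectral estimators. The code illustrates only the calculation and says nothing about the physiological interpretation of the ratio.

```python
import cmath
import math

LF_BAND = (0.04, 0.15)  # Hz, Task Force definition
HF_BAND = (0.15, 0.40)  # Hz, Task Force definition


def lf_hf_ratio(nn_s, fs=4.0):
    """Estimate LF/HF from NN intervals in seconds with a plain periodogram."""
    # Beat times: each NN interval is placed at the time of the beat ending it.
    times, t = [], 0.0
    for nn in nn_s:
        t += nn
        times.append(t)
    # Resample the NN series to an even grid by linear interpolation.
    n = int((times[-1] - times[0]) * fs)
    series, j = [], 0
    for i in range(n):
        g = times[0] + i / fs
        while times[j + 1] < g:
            j += 1
        w = (g - times[j]) / (times[j + 1] - times[j])
        series.append(nn_s[j] + w * (nn_s[j + 1] - nn_s[j]))
    # Remove the mean so the DC component does not leak into the bands.
    avg = sum(series) / n
    series = [x - avg for x in series]

    def band_power(lo, hi):
        total = 0.0
        for k in range(1, n // 2):
            if lo <= k * fs / n < hi:
                coef = sum(
                    x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(series)
                )
                total += abs(coef) ** 2
        return total

    lf, hf = band_power(*LF_BAND), band_power(*HF_BAND)
    return lf / hf if hf > 0 else float("inf")
```

Because the ratio is a quotient of two band powers, it rises both when LF power increases and when HF power decreases, which is one reason why its interpretation as a balance measure is debated.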
Differences in the cause of fatigue can also be the reason for different HRV responses in relation to fatigue. In sleep research, several studies have reported increased HRV measured as SDNN (Kaida et al., 2007) in sleepy subjects. Increased VLF and LF power (Chua et al., 2012; Henelius et al., 2014) have been associated with decreased vigilance caused by total or partial sleep deprivation. In those studies, the effects of both the sleep homeostat and the circadian rhythm are involved. While falling asleep, reduced HRV has been observed (Shinar et al., 2006). For task related factors, studies have shown that HRV changes reflect the cognitive task demand (Luque-Casado et al., 2016) or the time on task effect. For the time on task effect, both increased (Matuz et al., 2021) and decreased (Luque-Casado et al., 2016; Melo et al., 2017) HRV has been reported. The difference could be caused by different task demands and engagement (Pendleton et al., 2016).
Even when studying the same type of fatigue, the level of fatigue is another factor that needs to be considered. In the study by (Henelius et al., 2014), a strong correlation between HRV spectral power and the psychomotor vigilance performance was only found for high levels of sleepiness under sleep deprivation, but not for the slight vigilance decrement of an ordinary day. Hence, studies that target low levels of fatigue may have different results than those targeting high levels of fatigue.

Performance of fatigue detection
The performance measure of HRV based driver fatigue detection systems varied from poor to perfect across the reviewed studies. Very high accuracy (i.e., both high sensitivity and specificity) was found in some simulator studies without subject-wise separation in learning and testing. Due to the differences in study design, those performance measures have different meanings in practice and cannot be compared directly by the numbers. Several design aspects can have a significant impact on the study outcome. The factors that were discussed in the previous section, including how fatigue was introduced and measured, can also affect the performance of a fatigue detection system. Another key factor is whether data from the same participant were prevented from appearing in both the training and testing sets. The difference in detection performance measures between LOSO cross validation and k-fold cross validation was shown by (Kundinger et al., 2020a) and (Persson et al., 2021). Having real road scenarios rather than simulator, having a broader population coverage (e.g., sex, age, fitness, etc.), and using less accurate HR sensors can also lead to lower performance measures (Persson et al., 2021).
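The relation between accuracy, sensitivity and specificity can be made concrete with a small calculation. The sketch below uses a hypothetical label vector in which fatigued windows are rare; the function name and the data are illustrative and do not come from any reviewed study.

```python
def detection_metrics(y_true, y_pred):
    """Return sensitivity, specificity and accuracy for labels 1 = fatigued, 0 = alert."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        # Share of fatigued windows that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of alert windows that were correctly left unflagged.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical imbalanced data: 10 fatigued and 90 alert windows,
# with a detector that never raises an alarm.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
metrics = detection_metrics(y_true, y_pred)
```

In this constructed case the accuracy is 0.9 although no fatigued window is detected (sensitivity 0.0, specificity 1.0), which illustrates why accuracy figures are hard to compare across studies with different class balances.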

Use of HRV based detection in real life
A majority of the reviewed studies were conducted in a simulator environment, and few were conducted with controlled real road scenarios. Findings from those studies may face many challenges in real life scenarios due to a variety of contextual factors that can alter HRV. HR and HRV have been shown to reflect cognitive workload (Luque-Casado et al., 2016). In real life scenarios, varying complexity of the driving context and involvement of secondary tasks will influence HR and HRV as well as fatigue development. Changes in emotional states and stress, food intake, and change in environmental factors (e.g., altitude, temperature) can also introduce variation in HR and HRV (Appelhans and Luecken, 2006;Boos et al., 2017;Castaldo et al., 2015;Sollers et al., 2002). Whether those stimulations can overshadow the HRV changes due to fatigue needs to be investigated in real life driving scenarios. HRV based monitoring can also be challenging in people with certain medical conditions and medications whose HRV regulation is affected.
Having personalized algorithms could be a key element for an accurate HRV based fatigue detection system. Current studies involve measurements for each participant during one or two days only. However, the personal basal HRV level can change over time, and good strategies for personalization need to be investigated. At the same time, those personalization methods should also consider local regulations regarding personal data usage in practice.

Future perspectives
The development of driver fatigue is a complex process influenced by multiple factors, and so is the physiological regulation of HRV. With this review we are not able to conclude that there are solid relationships between HRV and driver fatigue. How HRV is related to different causes of fatigue and its relation to driving performance under different types of fatigue needs to be further investigated. It will be helpful for future studies to have a transparent reporting on factors that influence fatigue and on reference measures for fatigue.
There is a need to find a good reference measure for fatigue that can reflect deterioration of driving performance and safety outcomes. Some reviewed studies have used fatigue measurements that have not been validated. Finding a valid and reliable ground truth measure of fatigue is challenging. There are drawbacks with both the subjective and objective physiological measures of fatigue and sleepiness. The drivers might not be fully aware of or might not acknowledge their signs of fatigue. They can also experience difficulties in reporting the correct level on subjective rating scales. Objective physiological measurements often do not correspond fully to the level of fatigue or sleepiness experienced by the subject. At the same time, test procedures in newly developed regulations such as the European Union General Safety Regulation could be used as a base for future study design (European Parliament and Council, 2019).
A recent on-road experiment with a relatively large population did not achieve a satisfactory result for fatigue detection with direct usage of HRV (Persson et al., 2021). Future studies could consider alternative personalization strategies, introduce time dependent modeling, and combine HRV with other information to improve the performance of HRV based assessment.

Conclusions
HRV has the potential to be a valuable marker for detecting driver fatigue. However, substantial progress is still required before HRV-based driver fatigue detection can be deployed in real life driving. Reviewed articles show that reduced HR is associated with fatigued driving states. However, when it comes to other HRV measures, the direction of change is not consistent. We believe the inconsistency could be introduced by the differences in causal factors and reference measurement for fatigue that were implemented in different studies. There is a need for more concrete knowledge about how HRV changes with different levels and causes of fatigue and their relation to driver performance. The performance of HRV based fatigue detection systems shows a wide range of accuracy, and the results are difficult to compare across all studies due to differences in the experiment setups. Reduced detection performance can be found in studies with large on-road experiments and subject-independent modeling. Using alternative personalization strategies, time dependent modeling, and utilizing other types of information could potentially contribute to more accurate detection in the future. Current findings from simulator and controlled on-road studies need to be further validated with real life driving studies.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.