1. Introduction
Heart rate is one of the most important vital signs observed to determine a person’s health. This is because heart rate directly indicates cardiovascular diseases (CVDs) such as heart disease, arrhythmia, stroke, and high blood pressure. CVD is a noncommunicable condition that can lead to sudden death due to issues with the heart and blood vessels [1,2]. In addition, heart rate can also be an indicator of physical and emotional stress. The precious information provided by the heart rate makes it important to monitor. There are some tools for heart rate monitoring, such as stethoscopes, electrocardiograms, and pulse oximeters. However, those methods often involve uncomfortable physical contact with sensors, presenting challenges in continuous monitoring and real-time data collection [3,4,5].
The first non-contact heart rate monitoring techniques using optical sensors were invented by Hertzman in 1937, developing with the advancement of science and technology in medicine, known as photoplethysmography (PPG) [6]. PPG allows the measurement of the heart rate by placing a sensor directly on the skin, typically at areas like the fingertip or earlobe, to detect changes in the light absorbed or reflected by the skin due to blood volume pulsatile. With the development of light source technology and photo detectors in the 1960s, more accurate and portable PPG devices were feasible. With further advances in electronics and signal processing in the 1970s, applications of PPG became more feasible in a clinical environment, as devices could now be smaller and more integrated.
In the early 2000s, researchers began exploring the potential of detecting PPG signals remotely using regular digital cameras which later provided the roots for remote photoplethysmography (rPPG). The main idea of rPPG is that skin color changes from the circulating blood flow were to be recorded and analyzed, mainly on facial regions. Initial attempts demonstrated that small changes in skin color due to blood flow could indeed be captured using webcams and camcorders, paving the way for remote, non-contact measurements. By using this technique, heart rate monitoring can be carried out remotely and faster. One of the rPPG technique implication is mass screening for life insurance underwriting. By using this approach, the assessment of prospective policyholders is much faster and cheaper [7,8,9].
Different algorithms have been developed to perform heart rate calculations using the rPPG technique. The first algorithm developed is called Green’s algorithm, presented by Verkruysse et al. [10] in 2008. This algorithm highlights the green channel to find changes in blood volume within cutaneous blood vessels, which can later be used to monitor heart rate. The reason for choosing the green channel is related to the favorable absorption characteristics of hemoglobin. However, it is susceptible to individual differences in skin pigmentation, which might result in changing the obtained signal because of the various amounts of light absorbed and reflected.
In 2010, Poh et al. [11] developed a method for calculating heart rate named Independent Component Analysis (ICA). This method is a statistical method used to separate overlapping signals into independent components. The use of ICA in rPPG is very important because it helps improve the accuracy of noninvasive heart rate monitoring via video, making it very useful in more convenient and easy health monitoring. However, ICA has a major weakness—the assumption that the source signals that want to be separated are completely independent of each other. In practice, especially in the rPPG technique, the signal is related to the heartbeat, and the noise is not always completely independent.
Therefore, the next method is proposed, namely chrominance-based (CHROM) rPPG signal extraction proposed by De Haan and Jeanne [12] in 2013. CHROM does not completely rely on the assumption of independence. Instead, CHROM utilizes color relationships in the video to separate the heartbeat signal from noise so that it is more robust to situations where the signal is not completely independent. Although more stable than ICA, CHROM can still be affected by drastic changes in lighting because these changes affect the chrominance values used. Therefore, this method was refined by research conducted by W. Wang et al. [13] in 2017, who proposed the algorithm plane orthogonal to skin (POS). Due to the linear combination of color signals in the POS representation, it is invariant to the changes in the lighting conditions and, correspondingly, reduces the intensity variations within an improved signal stability.
Recently, with the emergence of deep learning, many studies on noninvasive heart rate measurement are shifted towards deep learning models such as convolutional neural network (CNN) [14,15,16,17]. However, CNNs often work as black boxes, making it difficult to understand how the inputs lead to the outputs. The lack of interpretability may raise concerns about accountability and reliability in many applications, especially in healthcare. On the other hand, rPPG has an advantage over CNN in terms of interpretability since the theoretical background of rPPG is rooted in optics and physiology, not to mention that the rPPG mechanisms are straightforward, making it easier for practitioners to understand how the measure correlates with physiological signal [18].
Although progress has been made in rPPG for contactless heart rate measurement, the current methods still struggle with noise and other external influences, including lighting conditions and the skin pigmentation of the individual. Such challenges hint that the current methods need a more wholesome approach by incorporating machine learning integration and multiple data types to enhance robustness and accuracy. In the event of these gaps, our research proposes incorporating integration machine learning and a multimodal approach: 1 min facial video data along with physical characteristics, such as age, gender, height, weight, and body mass index (BMI), to enhance heart rate estimation. Two machine learning algorithms, namely support vector regression (SVR) and random forest, will be utilized to model the data input to predict the heart rate. Ablation experiments will be employed by removing or replacing each modality to investigate the impact on the model prediction. Our work will contribute to the integration of robust heart rate measurement for more personalized estimation, considering variability between individuals.
2. Methodology
2.1. Hypothesis
A study published by Quer et al. [19] utilized wearable fitness trackers to investigate age, body mass index, and sex-related differences in resting heart rate among 92,457 adults. It was concluded that the association of age and body mass index with RHR were statistically significant but small. Overall, the RHR tends to decrease with increasing age and was slightly higher with increased BMI. Similarly, two studies conducted by Ehrenwald et al. [20] and Gonzales et al. [21] studied the temporal changes in RHR. It has been observed that in general, the RHR decreases as age increases and women have a higher RHR compared to men. Furthermore, changes in BMI also affect the RHR; an increase in body mass results in an increased heart rate. Consequently, based on this study, we hypothesized that the inclusion of the rPPG technique and physical attributes, such as age, gender, weight, height, and BMI, may enhance the measurement system in its performance and reliability aspects.
2.2. Remote Photoplethysmography
Remote photoplethysmography (rPPG) is an innovative technique that allows measuring the heart rate noninvasively and without direct contact with the skin. The basic principle involves using light emitted onto the surface of the skin and detecting changes in the intensity of light reflected or absorbed by the skin during the blood circulation cycle [22]. The resulting signal, known as blood volume pulse, describes changes in blood vessel volume over time and allows the measurement of heart rate. The skin color measured by the camera is a combination of the intensity and spectrum of the light source, the intrinsic skin color, and the sensitivity of the camera’s color channels. Temporal changes in skin color are proportional to the magnitude of lighting intensity that would be influenced by specular variations caused by movement and subtle color changes due to blood circulation. In rPPG, two components contribute to the observed color variations on the skin surface: specular and diffuse reflections.
Specular reflection refers to light reflection due to a smooth surface, such as the skin. This is the result of the direct reflection of light from the skin without any scattering. For many cases, it may be the most dominant component in observed color variations and overshadows other components. However, this does not contain information regarding heart rate, so it needs to be rejected or minimized in the rPPG technique. On the other hand, diffuse reflection may be said to deal with skin tissue absorption and light scattering. This is influenced by factors such as the hemoglobin and melanin content of the skin [23]. Diffuse reflection contains information related to changes in blood volume in the vessels around the skin. This component is of interest in rPPG because it can be used to extract signals related to changes in blood volume [24]. Figure 1 shows an illustration of the use of light emitted into the skin in the rPPG technique.
To extract heartbeat signals from observed color variations, the rPPG method focuses on isolating and analyzing diffuse reflection components while minimizing the impact of specular reflections. For this purpose, the Plane-Orthogonal-to-Skin (POS) approach put out by [13] is the methodology employed in this study. The main idea of the POS is that the video sequence captures a visible skin region, where color variations correspond to heartbeat-related blood flow changes. The POS algorithm aims to account for variations in skin tone by projecting the RGB signal onto a plane orthogonal to the temporally normalized skin tone direction when given a 15 s facial video where each frame’s RGB values are averaged to obtain a single-color intensity value per frame, representing the overall color change over time. This results in a temporal array of and where represents frame index. The array is then projected using a matrix , designed to minimize the influence of skin tone on pulse extraction. This guarantees that there are as many beat components as possible in the final rPPG signal. After that, the signal’s component will be separated to calculate heart rate representation. The algorithm of POS is formulated in Algorithm 1 as follows:
Algorithm 1: Plane orthogonal to skin |
Assumption: Video duration 15 s; Frame rate 30 fps; total frame 450 frames |
Input: Video contains frame |
Output: rPPG signal |
1: = zeros (l, N), l = 48 (30 fps camera) |
2: for do: |
3: |
4: if then |
5: |
6: |
7: # The first row of |
8: # The second row of |
9: |
10: |
11: end if |
12: end for |
13: return |
This algorithm has the benefit of being less constrictive regarding the degree of distortion shown in the video and more resilient against noise. It is appropriate for a range of applications since it can function well in both fixed and moving scenarios.
2.3. Signal Filtering
In the actual world, noise frequently taints signals. There are many different types of noise, such as environmental interference, jitter in data acquisition, and electromagnetic interference. To make the signal being measured or analyzed cleaner and more precise, noise can be reduced or eliminated using the filter process. Butterworth bandpass filter, which is thought to enhance the quality of rPPG signals, is one of the most popular filters [25,26]. According to Steiglitz [27], Butterworth filters are made to have a flat frequency response within the passband frequency range, which permits all frequencies to go through without distortion. The transfer function, a polynomial ratio of output to input, is a way the filter process response to input signals may be described. The mathematical expression for the transfer function is as follows:
where the transfer function is the transfer characteristics, and is a complex variable function. The parameter indicates the filter order, whereas signifies the cutoff frequency. In this study, a Butterworth filter will use the cutoff parameters in the range 0.8–2.0 Hz and the order used is 4. The reason for choosing this interval is because the average resting human heart rate is in the interval 48–120 bpm [28]. A fourth-order Butterworth filter is chosen for several reasons. First, higher-order filters provide steeper roll-offs, which enhance the filter’s ability to sharply distinguish between the passband and the stopband.
2.4. Fast Fourier Transform
After being filtered, the rPPG signals may now be used to obtain crucial information for heart rate computations. The Fourier transform may be used to analyze frequency by applying it to the rPPG signal. Signals in the time domain will be converted into the frequency domain using this technique. Consider a discrete time signal of length . Discrete Fourier transform (DFT) from can be defined as follows:
where is the frequency spectrum, is time signal of sample, and is the signal length. However, operations are needed for classical DFT, which can be quite slow for long signal lengths. This issue is resolved by the fast Fourier transform (FFT), which divides the DFT into smaller DFTs that can be computed more quickly. The Cooley–Tukey algorithm is the most well-known FFT algorithm. The FFT is split into two smaller pieces by this algorithm, which then processes each piece recursively [29].
2.5. Support Vector Machine
The support vector machine is one of the most powerful models of machine learning widely used for both classification and regression. SVMs which are used for regression tasks are called support vector regression (SVR). It has based its idea on finding the optimal decision boundaries or a hyperplane with maximum margin [30]. The main optimization problem of the hyperplane for regression involves the minimization of the sum of slack variables, both positive and negative, allowing deviations within a margin from the true target value. Let be the total number of data points in the training set; then, the formula is as follows:
The defines the weight vector that sets the hyperplane separating the classes in a higher-dimensional space. Therefore, is the squared norm of the weight vector, which represents the margin of separation between classes. Variable is a regularization parameter that controls the trade-off between maximizing margin and minimizing errors. Two slack variables and are introduced to handle deviations above and below the margin. The Figure 2 below illustrates the hyperplane problem of the SVR and the role of each variable.
However, Equation (3) is complicated to solve. Therefore, the dual form will be formulated using the Lagrange function and Karush–Kuhn–Tucker (KKT) condition. For regression problems, the dual of the hyperplane optimization method can be obtained using similar techniques. Therefore, the dual of regression hyperplane optimization is given as follows:
where is associated with the upper constraint and is associated with the lower constraint . Variable is a kernel function. There are several kernel functions such as linear, polynomial, and RBF (Radial Basis Function) kernel [31].
The SVR model can be implemented for multimodal heart rate prediction by incorporating signal features derived from photoplethysmography alongside the physical information, such as age and body mass index, to improve accuracy. The SVR will build the best model based on the optimization of the hyperplane problem during the training process. The best model represented in mathematical function will be used to predict the heart rate based on the input. To enhance machine learning performance, the hyperparameters of the SVR need to be tuned. One of the most important hyperparameters to be tuned is the kernel function. The kernel function converts data into the right format, hence allowing SVRs to identify the best boundaries. Another hyperparameter to be tuned is which is the balance between attempting to fit the training data with a minimal error and maximizing the margin of the decision boundary; a higher value of will make the considerations of correct classification more important. γ is a factor associated with the influence of training data on the decision boundary, whereas ϵ is a margin within which no penalty is incurred in the training loss function [32].
2.6. Random Forest Regression
The random forest regression is an ensemble machine learning model that relies on the bagging approach: building multiple base models in parallel, and the final prediction is performed with an average vote, ensuring variance reduction by building models using several bootstrap datasets [33]. It consists of several fixed-size base models, which are usually classification and regression trees (CARTs). It builds binary trees, where each internal node splits the data into two child nodes based on a split criterion. The goal is to create a tree structure that maximizes the accuracy of predictions for new data points.
Given the training data , where is a label. Let be the set of data at a node, be the set of candidate feature and threshold value , and and be the data in the left and right partitions of the threshold, defined as the following equation:
where is the number of data points at the node, and , are the numbers of data points in the left and right partitions, respectively. Therefore, selecting the feature and threshold that produces the purest partition at a node can be expressed as the following optimization problem:
For regression tasks, the input is divided into several partitions called . Each region contains a subset of the data points (observations) that fall within it. The goal is to find optimal partitions that minimize the error in each region. The node m represents a region with observations, and the mean of the target values within each region is . The common criterion to minimize is the mean squared error as shown in the following equations.
Next, using the steps above, the random forest regression algorithm builds multiple decision trees by bootstrapping and aggregating the results. The algorithm to perform random forest regression is given in Algorithm 2 as follows:
Algorithm 2: Random forest regression |
Input: training dataset, n: number of trees; d: tree depth; m: number of features at each node |
Output: prediction of Random Forest model |
1: for i = 1, 2, …, n do: |
2: Bootstrap sampling of the dataset Di |
3: Build decision tree: |
4: At each node, randomly sample m features |
5: Find the best split using the m subset |
6: Split the node and repeat until stopping criteria (max depth or minimumsamples) are met. |
7: Save the decision tree Ti |
8: end for |
9: For each tree Ti, get the predicted value from all trees |
10: return Average of the predictions from each tree |
The random forest model can be implemented for multimodal heart rate prediction by incorporating signal features derived from photoplethysmography alongside the physical information, such as age and body mass index, to improve accuracy. Multiple decision trees will be built to model the data, and the final prediction will be calculated by averaging the result of each tree.
For random forests, there are three main hyperparameters that will be tuned to improve the model’s performance. Firstly, the number of trees which controls how many trees are involved in the ensemble. Secondly, max depth is the setting that controls the depth of each tree, and this must be performed to avoid overfitting. The final hyperparameter, minimum samples per split denotes the minimal number of samples that are necessary to split an internal node, which prevents the trees from growing too complex [34]. The illustration of the random forest regression can be seen in Figure 3.
2.7. Data Collection
In this experiment, a dataset with the facial video recordings of 60 individuals was utilized. Each video’s length was standardized to 1 min, where the resolution and frame rate were set to be uniform at , and 30 fps, respectively. This resolution and frame rate are used to balance the performance and computational efficiency, leading to an accurate prediction but still feasible for real-time implementation. The data collection was conducted at the same place and with fixed lighting and subjects had to take off any facial accessories, such as eyeglasses. Together with video recording, there was manual measurement using an oximeter to gather the actual value as the ground truth. The height in cm, weight in kg, age in years, and gender are also recorded in the dataset. To ensure privacy, we anonymized the subjects by masking facial features such as eyes, nose, and mouth to prevent unauthorized recognition. We also introduced variable body mass index (BMI) into the dataset. Figure 4 depict the characteristics of the dataset used in this study.
From Figure 4a, it can be concluded that from the 60 individuals, there are 45 males and 15 females in the dataset. Figure 4b shows the variations in heart rate between the males and females. It also shows that females generally present a higher heart rate compared to males, since their median rate is just above 75 beats per minute, while the median for males is just about 70 beats per minute. The interquartile range (IQR), which represents the middle 50% of the dataset, is wider for females, ranging approximately between 70 and 85 bpm. This indicates that there is a greater spread or deviation in the pulse rates of females. In contrast, the IQR for males varies between 65 and 80 beats per minute; this means that male heart rates are more huddled around the lower end of the spectrum. Figure 4c–f indicate that our dataset has a good diversity of heart rate, height, weight, and BMI, making this ideal for training machine learning.
2.8. Region of Interest Selection
Region of interest (RoI) selection in the rPPG technique involves selecting a specific area on a person’s skin or face from which the rPPG signal will be extracted. The face was chosen as the RoI in the rPPG algorithm because it has several advantages over other body parts. First, the face has a high concentration of capillaries, making it easier to detect changes in blood flow. Second, faces have a low level of reflection, making it easier to separate PPG signals from environmental noise. Third, faces are easy to access and reach, making them suitable for real-time video capture applications [26,35]. For this purpose, we utilized Dlib facial landmarks that combine Histogram of Oriented Gradients (HOG) as the feature extraction and SVM as the key points classifier [36]. The RoI used in this research is the facial skin without eyes, nose, and lips. For this purpose, the landmarks (36, 39, 40, and 41) for the left eye, (42, 45, 46, and 47) for the right eye, (32, 33, 34, 35, and 36) for the nose, and (49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, and 60) for the lips will be filled as black as shown in Figure 5.
2.9. Feature Selection
In machine learning, signal feature selection is the process of identifying and selecting the most relevant features included in the signal data that contribute to an improvement in the performance of machine learning models. After several surveys in the literature [27,37,38,39], we decided to utilize signal features as follows:
Maximum and minimum values of the signal: They are the basic features of a signal representing the maximum and minimum value of signal amplitude, respectively. From those values, a new feature called peak to peak, which is the difference between the maximum and minimum values of the signal, can be generated. Given that represents the signal amplitude at time frame , the maximum, minimum, and peak-to-peak values of the signal are represented as follows:
Skewness and Kurtosis: Skewness is a measure of the asymmetry of the signal’s amplitude distribution around the mean, while kurtosis is the existence of outliers in the signal’s distribution. Given that represents the signal amplitude at time frame , and are the mean and standard deviation of the signal, respectively, skewness and kurtosis of the signal are represented as follows:
Entropy: It measures the complexity or randomness in the signal based on the probability distribution of its values. It is often used to assess the predictability of a signal. Given that represents the signal amplitude at time frame , to calculate entropy, the signal needs to be discretized over a period of 30, matching the fps assumption used in this study. This process results in where . Each covers a specific range that can take. The entropy can be calculated as follows:
represents the probability that falls within the range specified by . Higher entropy indicates a more complex or less predictable signal, while lower entropy suggests a more regular, predictable pattern.
Zero crossings: Zero crossings are a count of how many times the signal crosses the horizontal axis—that is, changes from positive to negative or vice versa. This feature also gives some insight into the frequency content of the signal. Given that represents the signal amplitude at time frame , zero crossings can be formulated as follows:
Spectral centroid and spectral bandwidth: The spectral centroid is that point, considering the signal in the frequency domain, where the signal can be said to balance domain while spectral bandwidth is a measure of the spread or width of the frequency spectrum around the spectral centroid. They provide a measure of the amount the energy that is concentrated or dispersed across frequencies. Given that represents the frequency domain of the signal obtained by implementing into DFT in Formula (2), spectral centroid and spectral bandwidth can be formulated as follows:
Dominant frequency and total power: Dominant frequency is the frequency at which, in the Fourier spectrum of the signal, its magnitude is the largest, whereas total power is a notion of the overall energy, computed by summing over squared amplitudes. Given that is the representation of time-domain signal , dominant frequency and total power can be calculated as follows:
3. Experiments
3.1. DeepFace for Gender and Age Prediction
In this study, it is necessary to utilize a sensor to automatically predict the physical properties. Therefore, DeepFace, a popular open-source framework for facial analysis, is utilized. DeepFace, which internally uses deep learning models pre-trained on very large datasets, offers a very high degree of accuracy for various facial attribute predictions, including age estimation and gender classification. Currently, the library supports several backend models, such as VGG-Face, Google FaceNet, and OpenFace. The VGG-Face backend provided by DeepFace was used in our implementation to analyze the first frame of every video regarding predicted age and gender due to its balance of speed and accuracy.
3.2. Experiment with Random Forest and Support Vector Regression for Multimodal Heart Rate Calculation
In this section, we outline the workflow utilized for our experiment, which encompasses a series of systematic steps designed to ensure a comprehensive evaluation of the machine learning models for heart rate estimation. The process begins with RoI extraction. Following this, feature engineering is employed to extract meaningful features from the signal data, including critical metrics such as the maximum, minimum, and peak-to-peak values; skewness; and kurtosis, which are vital for accurate model performance. All the features collected from facial videos will be concatenated with physical features such as age, gender, weight, height, and BMI.
Next, we conduct model selection, opting for support vector regression (SVR) and random forest regression by tuning the hyperparameter. This tuning phase is executed using a grid search method, where various hyperparameters are evaluated to find the optimal configurations for each model. Hyperparameters for each model are given as follows:
Hyperparameters for SVR: Kernel {Linear, RBF}, and ;
Hyperparameters for Random Forest Regression: number of trees , max depth , and min samples per split .
All the processes are implemented using the scikit-learn library, which is a well-known and open-source tool for data analysis and machine learning implementation provided by Python.
Subsequently, we apply 5-fold cross-validation (CV), which is a statistical technique used to assess the performance of the tuned models on unseen data. This iterative process works by splitting the dataset into 5 folds, followed by training process using 4 folds as the training set and the remaining fold as the validation set. We repeat the training process until every fold becomes a training set. The result of the model can be obtained by averaging the results across all the folds [40]. Figure 6 below illustrates model evaluation using 5-fold CV.
Finally, we evaluate the models using performance metrics such as mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) to quantify their predictive accuracy. Some commonly used techniques for measuring model performance in this study are MAE, RMSE, and MAPE, which can be formulated as follows:
where is the actual value and is the predicted value. For each metric, the lower the value, the better the performance of the model being tested. A detailed diagram illustrating this workflow is presented in Figure 7 to provide a visual representation of the steps taken throughout the experimental process.
3.3. Performance Results
This study explored machine learning integration and multimodal approach for heart rate measurement by combining two types of data: signal features obtained from the facial video as listed in Equation (8) until Equation (13) were given the name “Video” and physical attributes, such as age, gender, height, and weight, were given the name “Physical”. Two distinct machine learning models, namely support vector regression and random forest regression, are utilized. We systematically conducted ablation study by removing or altering specific video features and physical attributes to assess their impact on the MAE, RMSE, and MAPE. The focus was on understanding how each feature contributes to the model’s overall accuracy in predicting the target variable. Table 1 shows the performance of all the ablation experiments conducted in this study. The columns Video and Physical use checkmarks (✓) and crosses (✗) to indicate whether the corresponding input type is included in each method. A checkmark means the input is included, while a cross means it is not. Generally, the random forest models perform better than the SVR with the best-performing method obtained when using the multimodal approach with the lowest MAE, RMSE, and MAPE of 3.057, 10.532, and 4.2%, respectively. In contrast, the best SVR is obtained when using physical information only with the MAE, RMSE, and MAPE of 3.251, 12.797, and 4.9%, respectively.
To gain more valuable insights from the random forest model, we checked predictive performance and interpretability using feature importances as shown in Figure 8 Feature importance provides a quick sense of how each feature is useful for the decisions of the model and, therefore, helps identify the most influential variables driving the predictions. This visualization is useful for making informed decisions about which features to retain, prioritize, or analyze further. By focusing on the most influential features, we can not only improve model performance but also uncover actionable insights for the problem domain.
In addition, Figure 9 shows Shapley additive explanation (SHAP) values, which completely explain how each feature contributes toward the output of the model for specific predictions. The values to the right, or more positive SHAP values, increase the prediction, while the values to the left, or negative SHAP values, decrease the prediction. Features that are further away from zero have a greater influence on a particular prediction. The SHAP values tell not only the direction or magnitude of each feature’s effect but also consider the dependencies among the features, thus giving rich insight into the behavior of the model.
For deeper analysis, Figure 10 shows a partial dependence plot (PDP), which provides insights into the effect of various features on the predictions of the random forest model on the multimodal data. Each subplot shows the partial dependence of a single feature, revealing how changes in its values impact the model’s output. The -axis, representing partial dependence, indicates the direction and strength of each feature’s influence on the prediction. Higher values suggest a positive effect, while lower values suggest a negative one. Meanwhile, the -axis shows the range of feature values, which are often normalized to facilitate comparison across different features.
For certain features like zero crossings, peak to peak, and BMI, the PDP curves display notable variations across specific ranges, suggesting that these features significantly impact the model’s predictions in those areas. In contrast, features such as age, weight, height, skewness, and spectral bandwidth exhibit relatively flat lines, indicating that changes in these values have minimal effect on the prediction. The plot also shows the behavior of binary or categorical variables, like gender, where the flat line across 0 and 1 values suggests a limited impact on the outcome.
Features with sharp increases or decreases, like zero crossings, may indicate threshold effects, where a small change within a certain range can substantially affect predictions. This pattern suggests that the model may have learned non-linear relationships with these features, capturing complex interactions. Overall, this PDP grid highlights features such as zero crossings, peak to peak, and BMI which play a more prominent role in driving predictions, whereas others have less influence on the model’s decision-making process.
3.4. Comparison with State-of-the-Art Methods
To evaluate the performance of the proposed method, we compared the result of our work to four different state-of-the-art methods explained in the introduction. The comparison is carried out by applying the state-of-the-art methods to the dataset used in this study and evaluated using the same metrics. It aims to compare how the proposed methodology measures the heart rate much more accurately than the state-of-the-art. It is easy to see from Figure 11 that our work considerably outperforms the existing state-of-the-art methods of remote photoplethysmography in heart rate measurement, evident from the lowest MAE, RMSE, and MAPE of 3.06, 10.53, and 4.02%, respectively. The random forest with multimodal approach outperforms the random forest without multimodal by 30% in terms of error, indicating the multimodal approach improves the remote heart rate measurement. Furthermore, the integration of random forest model and multimodal approach outperforms the base POS with FFT by 72%, indicating that our approach is more reliable.
4. Discussion
In this study, we investigated the performance of machine learning integration and a multimodal approach for heart rate measurement by using the combination of the rPPG blood volume pulsatile signal extracted from the facial video and physical attributes such as age, gender, height, weight, and BMI. By combining these diverse data sources, we aim to improve heart rate estimation accuracy in non-contact scenarios, which is beneficial for telemedicine, fitness monitoring, and mass health screening.
This study explored machine learning integration and multimodal approach for heart rate measurement by combining two types of data: facial video as listed in Equation (8) until Equation (13) was given the name “Video” and physical attributes, such as age, gender, height, and weight, was given the name “Physical”. Two distinct machine learning models, namely support vector regression and random forest regression, were utilized. We systematically conducted an ablation study by removing or altering specific video features and physical attributes to assess their impact on the MAE, RMSE, and MAPE. The focus was on understanding how each feature contributes to the model’s overall accuracy in predicting the target variable. Figure 8 shows the performance of all the ablation experiments conducted in this study. Generally, the random forest models perform better than the SVR with the best-performing method obtained when using the multimodal approach with the lowest MAE, RMSE, and MAPE of 3.057, 10.532, and 4.2%, respectively. In contrast, the best SVR is obtained when using physical information only with the MAE, RMSE, and MAPE of 3.251, 12.797, and 4.9%, respectively.
Table 1 shows that the random forest regression with a multimodal approach is the best method which gives the lowest error with an MAE, RMSE, and MAPE of 3.057, 10.532, and 4.2%, respectively. This indicates the effectiveness of combining the two different data types in the prediction of heart rate. The models that did not include physical data or relied totally on video data had a higher error rate, underpinning the importance of physical attributes in enhancing the accuracy of the prediction.
In contrast, the SVR models perform comparably but are not better than the RF model that leverages both video and physical attributes. Interestingly, the SVR model with only physical data, Index 5, yielded a relatively low MAE of 3.251, while further addition of either video or physical features degraded its performance. This suggests that while the SVR is successful with physical data, it is sensitive to the integration of video features, possibly due to the feature complexity or noise contributed by the multiple data types, which is the main drawback of SVR as mentioned by Farahmand, Desa, and Nilashi [41].
From these findings, it becomes clear that a multimodal approach, especially one that incorporates video and physical information, could greatly affect the performance of the model in estimating heart rate. The combined-data RF model’s superior performance strengthens the promise of multimodal machine learning methods for improvement in the accuracy of remote heart rate measurement systems. This has led to the conclusion that multiple sources of data contribute to the overall better performance for applications, such as rPPG, when compared with single-modality methods.
These conclusions are further supported by the feature importance and SHAP value analyses of the best-performing RF scenario shown in Figure 9. First, it shows zero crossings as the most critical feature, with its importance score way higher than the other features, valued at 0.3979; hence, it contributed most to the model’s predictions. Physical features such as BMI and dominant frequency also contribute to key roles but at a lower degree compared to the others. The SHAP value plot in Figure 9 offers more granularity on this, pointing out that zero crossings tend to have a consistent, high positive impact on the model output, while the impact of features such as BMI and peak to peak varies depending on their value. The domination of zero crossings as the most important feature in this study is due to its ability to measure the frequency of estimation, which is correlated with the heart rate.
Coupled with the discussion of feature importance, this suggests that the better performance of the RF model is indeed derived from its ability to exploit both video-derived features such as zero crossings and physical attributes like BMI. Allowing it to capture the complex interaction between these multimodal features tends to yield a low error rate in heart rate prediction. The SHAP analysis points out how these key features weigh in the output from the model, further reinforcing the fact that the conclusion must be about the inclusion of both video and physical data for the best performance. These results align with our hypothesis that physical attributes such as BMI influence the heart rate prediction [19,20,21].
The comparison in Figure 11 shows that the combination of random forest regression, POS algorithm, and multimodal data, which includes both video and physical attributes, achieves the lowest error rates among all the methods evaluated, with an MAE of 3.057, RMSE of 10.532, and MAPE of 4.2%. These results significantly outperform the other state-of-the-art methods, such as Green + FFT, ICA + FFT, and CHROM + FFT, which show considerably higher error rates. For instance, the Green + FFT method exhibits an MAE of 15.217 and a MAPE of 18.5%, highlighting the vast improvement achieved by integrating multimodal data with the random forest model.
5. Limitations and Future Works
Despite the promising results, there are some limitations in this study. First, the dataset used in this work is relatively small because no open-source site provides facial videos along with physical attributes information. Due to limited access to a large multimodal dataset, there is a risk that the model may be overfitting to the dataset used in this study. Future work should prioritize the collection of a larger, more diverse dataset to improve the model’s generalizability and robustness across different populations.
Furthermore, an inaccurate physical attribute feed to the system such as the input of false weight or height may influence the result because measuring heart rate is based on false information. The limitation of this study is one that should be resolved in the future by collecting more multimodal datasets and aiming to develop a model that automatically predicts physical attributes to avoid manual inputting with the intent of avoiding feeding false information to the measurement system. Lastly, the rPPG signals extracted from the video are sensitive to lighting and environmental conditions, which could affect the accuracy of heart rate estimation in non-controlled settings. Variations in light sources, shadows, and even skin tone may introduce noise in the video data. Future work could involve improving robustness to such factors by implementing data normalization techniques or training the model with data captured under diverse lighting conditions.
6. Conclusions
In this study, we aim to make noninvasive heart rate measurements more robust and reliable. We propose a machine learning integration into multimodal data including facial video, age, gender, weight, height, and BMI. The plane orthogonal to skin algorithm is utilized to extract blood volume pulsatile signal from the RGB facial video. Region of interest for video recording covers facial areas without eyes, nose, and lips. Signal filtering is implemented to denoise the obtained signal followed by signal extraction. Several features are selected, such as the maximum, minimum, and peak-to-peak value; skewness; kurtosis; zero crossings; spectral centroid; spectral bandwidth; dominant frequency; total power; and entropy value of the signal. Two machine learning models, support vector regression (SVR) and random forest regression are compared for this task. The experimental results demonstrate that incorporating a multimodal approach enhances model performance, with the random forest regression model achieving the best result surpassing the state-of-the-art methods. Furthermore, our method improves the robustness of non-contact heart rate measurement.
Author Contributions
Conceptualization, R.A.B., M.R.S.R. and A.B.; methodology, R.A.B. and M.R.S.R.; software, M.R.S.R.; validation, A.B., M.R.S.R. and R.A.B.; formal analysis, R.A.B.; investigation, R.A.B.; resources, R.A.B.; data curation, M.R.S.R.; writing—original draft preparation, R.A.B. and M.R.S.R.; writing—review and editing, R.A.B. and M.R.S.R.; visualization, M.R.S.R.; supervision, A.B.; project administration, A.B.; funding acquisition, R.A.B. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by PUTI Q2 grant batch 3 with grant number NKB-1402/UN2.RST/HKP.05.00/2022.
Institutional Review Board Statement
Ethical review and approval were waived for this study due to all participants providing informed consent and the assurance that their data would remain protected.
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
The data presented in this study are available upon request from the corresponding author due to privacy and ethical restrictions, as the data involve human subjects.
Acknowledgments
This work is supported by Seleris Asia Pacific Technology Pte. Ltd., Data Science Center Universitas Indonesia (DSCUI), and Laboratory of Bioinformatics and Advanced Computing Department of Mathematics Universitas Indonesia.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Yehezky, H.; Bustamam, A.; Hermawan, H. Enhancing ECG Signal Data through Denoising Features with Transformer Generative Adversarial Networks for Model Classication 1D-CNN. Res. Sq. 2023. [Google Scholar] [CrossRef]
- Atmadja, H.Y.N.; Bustamam, A. Hermawan Generated Tabular Data with Multi-GANs for Arrhythmia Classification Based on DNN Models. In Proceedings of the 2022 International Conference of Science and Information Technology in Smart Administration, ICSINTESA 2022, IEEE, Denpasar, Indonesia, 10–12 November 2022; pp. 69–74. [Google Scholar]
- Shyam, A.; Ravichandran, V.; Preejith, S.P.; Joseph, J.; Sivaprakasam, M. PPGnet: Deep Network for Device Independent Heart Rate Estimationfrom Photoplethysmogram. In Proceedings of the 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019. [Google Scholar]
- Schiweck, C.; Piette, D.; Berckmans, D.; Claes, S.; Vrieze, E. Heart Rate and High Frequency Heart Rate Variability during Stress as Biomarker for Clinical Depression. A Systematic Review. Psychol. Med. 2019, 49, 200–211. [Google Scholar] [CrossRef] [PubMed]
- Qiao, D.; Ayesha, A.H.; Zulkernine, F.; Jaffar, N.; Masroor, R. ReViSe: Remote Vital Signs Measurement Using Smartphone Camera. IEEE Access 2022, 10, 131656–131670. [Google Scholar] [CrossRef]
- Hertzman, A.B. Observations on the Finger Volume Pulse Recorded Photoelectrically. Am. J. Physiol. 1937, 119, 334–335. [Google Scholar]
- Soldatos, J. Big Data and Artificial Intelligence in Digital Finance; Kyriazis, D., Ed.; Springer International Publishing: Cham, Switzerland, 2022. [Google Scholar]
- Upadhyay, B.R.; Baral, A.B.; Torlak, M. Vital Sign Detection via Angular and Range Measurements with MmWave MIMO Radars: Algorithms and Trials. IEEE Access 2022, 10, 106017–106032. [Google Scholar] [CrossRef]
- Pnevmatikakis, A.; Kanavos, S.; Perikleous, A.; Kyriazakos, S. Risk Assessment for Personalized Health Insurance Products. In Big Data and Artificial Intelligence in Digital Finance; Springer: Cham, Switzerland, 2022; pp. 281–294. [Google Scholar] [CrossRef]
- Verkruysse, W.; Svaasand, L.O.; Stuart Nelson, J. Remote Plethysmographic Imaging Using Ambient Light. Opt. Express 2008, 16, 2878–2886. [Google Scholar] [CrossRef]
- Poh, M.-Z.; Mcduff, D.J.; Picard, R.W. Non-Contact, Automated Cardiac Pulse Measurements Using Video Imaging and Blind Source Separation. References and Links. Opt. Express 2010, 18, 10762–10774. [Google Scholar] [CrossRef]
- De Haan, G.; Jeanne, V. Robust Pulse Rate from Chrominance-Based RPPG. IEEE Trans. Biomed. Eng. 2013, 60, 2878–2886. [Google Scholar] [CrossRef]
- Wang, W.; Den Brinker, A.C.; Stuijk, S.; De Haan, G. Algorithmic Principles of Remote PPG. IEEE Trans. Biomed. Eng. 2017, 64, 1479–1491. [Google Scholar] [CrossRef]
- Huang, X.; Lv, J. Resilient Remote Heart Rate Measurement from Partially Masked Faces. In Proceedings of the International Joint Conference on Neural Networks, IEEE, Gold Coast, Australia, 18–23 June 2023. [Google Scholar]
- Niu, X.; Zhao, X.; Han, H.; Das, A.; Dantcheva, A.; Shan, S.; Chen, X. Robust Remote Heart Rate Estimation from Face Utilizing-Temporal Attention. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019. [Google Scholar]
- Kodithuwakkuge, J.; Sekhar, V.; Lamahewa, T.A. Blind Symbol Rate Estimation Using Convolutional Neural Networks for PSK Modulated Signals. In Proceedings of the 2021 15th International Conference on Signal Processing and Communication Systems, ICSPCS 2021—Proceedings, IEEE, Sydney, Australia, 13–15 December 2021. [Google Scholar]
- Reiss, A.; Indlekofer, I.; Schmidt, P.; Van Laerhoven, K. Deep PPG: Large-Scale Heart Rate Estimation with Convolutional Neural Networks. Sensors 2019, 19, 3079. [Google Scholar] [CrossRef]
- Kim, K.B.; Baek, H.J. Photoplethysmography in Wearable Devices: A Comprehensive Review of Technological Advances, Current Challenges, and Future Directions. Electronics 2023, 12, 2923. [Google Scholar] [CrossRef]
- Quer, G.; Gouda, P.; Galarnyk, M.; Topol, E.J.; Steinhubl, S.R. Inter- and Intraindividual Variability in Daily Resting Heart Rate and Its Associations with Age, Sex, Sleep, BMI, and Time of Year: Retrospective, Longitudinal Cohort Study of 92,457 Adults. PLoS ONE 2020, 15, e0227709. [Google Scholar] [CrossRef]
- Ehrenwald, M.; Wasserman, A.; Shenhar-Tsarfaty, S.; Zeltser, D.; Friedensohn, L.; Shapira, I.; Berliner, S.; Rogowski, O. Exercise Capacity and Body Mass Index—Important Predictors of Change in Resting Heart Rate. BMC Cardiovasc. Disord. 2019, 19, 307. [Google Scholar] [CrossRef]
- Gonzales, T.I.; Jeon, J.Y.; Lindsay, T.; Westgate, K.; Perez-Pozuelo, I.; Hollidge, S.; Wijndaele, K.; Rennie, K.; Forouhi, N.; Griffin, S.; et al. Resting Heart Rate Is a Population-Level Biomarker of Cardiorespiratory Fitness: The Fenland Study. PLoS ONE 2023, 18, e0285272. [Google Scholar] [CrossRef] [PubMed]
- Wu, T.; Schmitt, H.J.; Blazek, V. Photoplethysmography Imaging: A New Noninvasive and Noncontact Method for Mapping of the Dermal Perfusion Changes. Opt. Tech. Instrum. Meas. Blood Compos. Struct. Dyn. 2000, 4163, 62. [Google Scholar] [CrossRef]
- Stricker, R.; Müller, S.; Gross, H.-M. Non-Contact Video-Based Pulse Rate Measurement on a Mobile Service Robot. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication, Edinburgh, UK, 25–29 August 2014. [Google Scholar]
- De Haan, G.; Van Leest, A. Improved Motion Robustness of Remote-PPG by Using the Blood Volume Pulse Signature. Physiol. Meas. 2014, 35, 1913. [Google Scholar] [CrossRef] [PubMed]
- Kim, J.K.; Mok Ahn, J. Design of an Optimal Digital IIR Filter for Heart Rate Variability by Photoplethysmogram. Int. J. Eng. Res. Technol. 2018, 11, 2009–2021. [Google Scholar]
- Tarvainen, M.P.; Ranta-aho, P.O.; Karjalainen, P.A. An Advanced Detrending Method with Application to HRV Analysis. IEEE Trans. Biomed. Eng. 2002, 49, 172–175. [Google Scholar] [CrossRef]
- Steiglitz, K. Digital Signal Processing; Courier Dover Publications: Mineola, NY, USA, 2020. [Google Scholar]
- Guler, S.; Golparvar, A.; Ozturk, O.; Dogan, H.; Kaya Yapici, M. Optimal Digital Filter Selection for Remote Photoplethysmography (RPPG) Signal Conditioning. Biomed. Phys. Eng. Express 2023, 9, 027001. [Google Scholar] [CrossRef]
- Burden, R.L. Numerical Analysis; Brooks/Cole: Pacific Grove, CA, USA, 2011. [Google Scholar]
- Bustamam, A.; Sarwinda, D.; Abdillah, B.; Kaloka, T.Q. Detecting Lesion Characteristics of Diabetic Retinopathy Using Machine Learning and Computer Vision. J. Electron. Syst. 2020, 10, 23–33. [Google Scholar] [CrossRef]
- Octaria, E.A.; Siswantining, T.; Bustamam, A.; Sarwinda, D. Kernel PCA and SVM-RFE Based Feature Selection for Classification of Dengue Microarray Dataset. AIP Conf. Proc. 2020, 2264, 030004. [Google Scholar]
- Açıkkar, M.; Altunkol, Y. A Novel Hybrid PSO- and GS-Based Hyperparameter Optimization Algorithm for Support Vector Regression. Neural Comput. Appl. 2023, 35, 19961–19977. [Google Scholar] [CrossRef]
- Bustamam, A.; Musti, M.I.S.; Hartomo, S.; Aprilia, S.; Tampubolon, P.P.; Lestari, D. Performance of Rotation Forest Ensemble Classifier and Feature Extractor in Predicting Protein Interactions Using Amino Acid Sequences. BMC Genom. 2019, 20, 950. [Google Scholar] [CrossRef]
- Testas, A. Hyperparameter Tuning with Scikit-Learn and PySpark. In Distributed Machine Learning with PySpark; Apress: Berkeley, CA, USA, 2023; pp. 417–440. [Google Scholar] [CrossRef]
- Kim, D.; Lee, K.; Sohn, C.B. Assessment of ROI Selection for Facial Video-Based RPPG. Sensors 2021, 21, 7923. [Google Scholar] [CrossRef]
- Sarwinda, D.; Siswantining, T.; Bustamam, A. Classification of Diabetic Retinopathy Stages Using Histogram of Oriented Gradients and Shallow Learning. In Proceedings of the 2018 International Conference on Computer, Control, Informatics and its Applications (IC3INA), IEEE, Tangerang, Indonesia, 1–2 November 2018. [Google Scholar]
- Oppenheim, A.V.; Willsky, A.S.; Nawab, S.H.; Ding, J.J. Signals and Systems; Prentice Hall: New York, NY, USA, 1997; Volume 2. [Google Scholar]
- Singh, A.K.; Krishnan, S. ECG Signal Feature Extraction Trends in Methods and Applications. BioMed. Eng. OnLine 2023, 22, 22. [Google Scholar] [CrossRef] [PubMed]
- Krishnan, S.; Athavale, Y. Trends in Biomedical Signal Feature Extraction. Biomed. Signal Process. Control 2018, 43, 41–63. [Google Scholar] [CrossRef]
- Anguita, D.; Ridella, S.; Rivieccio, F. K-Fold Generalization Capability Assessment for Support Vector Classifiers. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; Volume 2, pp. 855–858. [Google Scholar] [CrossRef]
- Farahmand, M.; Desa, M.I.; Nilashi, M. A Combined Data Envelopment Analysis and Support Vector Regression for Efficiency Evaluation of Large Decision Making Units. Int. J. Eng. Technol. 2014, 6, 2310–2321. [Google Scholar]
Figure 1. Illustration of remote photoplethysmography.
Figure 1. Illustration of remote photoplethysmography.
Figure 2. Hyperplane problem of the SVR.
Figure 2. Hyperplane problem of the SVR.
Figure 3. Random Forest Regression model.
Figure 3. Random Forest Regression model.
Figure 4. Dataset used in this study. (a) Gender distribution; (b) Heart rate distribution by gender; (c) Heart rate distribution by category; (d) Height vs. Heart Rate by Gender; (e) Weight vs. heart rate by gender; (f) BMI vs. heart rate by gender.
Figure 4. Dataset used in this study. (a) Gender distribution; (b) Heart rate distribution by gender; (c) Heart rate distribution by category; (d) Height vs. Heart Rate by Gender; (e) Weight vs. heart rate by gender; (f) BMI vs. heart rate by gender.
Figure 5. Region of interest used in this study.
Figure 5. Region of interest used in this study.
Figure 6. 5-fold cross-validation.
Figure 6. 5-fold cross-validation.
Figure 7. Workflow of the proposed method.
Figure 7. Workflow of the proposed method.
Figure 8. Feature importance.
Figure 8. Feature importance.
Figure 9. Shapley additive explanations.
Figure 9. Shapley additive explanations.
Figure 10. Partial dependence plot for random forest model.
Figure 10. Partial dependence plot for random forest model.
Figure 11. Comparison with state-of-the-art methods.
Figure 11. Comparison with state-of-the-art methods.
Table 1. Performance of proposed methods.
Table 1. Performance of proposed methods.
Index | Method | Video | Physical | MAE | RMSE | MAPE | Hyperparameter Pairs |
---|---|---|---|---|---|---|---|
1 | RF | ✓ | ✗ | ||||
2 | RF | ✗ | ✓ | % | |||
3 | RF | ✓ | ✓ | 3.057 | |||
4 | SVM | ✓ | ✗ | 7.601 | 18.624 | % | |
5 | SVM | ✗ | ✓ | 3.251 | 12.797 | % | |
6 | SVM | ✓ | ✓ | 7.496 | 18.787 | % |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).