This study presents a novel approach for detecting heart disease from audio signals. The schematic in Fig. 1 illustrates the proposed technique as a block diagram. The methodology comprises a series of sequential stages: data acquisition, data augmentation, data pre-processing, feature extraction, feature normalization, model selection, model implementation, and result prediction. This study aims to enhance the reliability of the comparative analyses conducted in previous studies8,9, and consistency in the experimental setup and data collection methods is maintained throughout. Feature extraction combines MFCCs with eight other main feature extraction methods to extract the most significant attributes from the dataset. Our methodology uses a combination of ML and DL models to tackle the multi-class problem of detecting heart disease. The investigation draws on the PASCAL Classifying Heart Sounds Challenge dataset and the 2016 PhysioNet/Computing in Cardiology (CinC) Challenge dataset. Ventricular Septal Defect (VSD), Atrial Septal Defect (ASD), Patent Ductus Arteriosus (PDA), murmur, and extrasystole are among the disorders covered by these databases. Compared with previous research, the proposed methodology detects disease faster and more precisely. It incorporates five machine learning models, namely Random Forest (RF), K-Nearest Neighbour (KNN), Decision Tree (DT), Extreme Gradient Boosting (XGB), and Multilayer Perceptron (MLP), and two deep learning models, a Deep Neural Network (DNN) and a 1D Convolutional Neural Network (CNN1D). Model evaluation considered accuracy, precision, recall, and the F1-score, and a confusion matrix was generated to comprehensively analyze each model’s performance against established benchmarks.
Dataset selection and noise induction
The dataset selected for this work is the Classifying Heart Sounds PASCAL Challenge (CHSPC) dataset. It comprises heart sound recordings from 400 individuals, split evenly between 200 with typical cardiac function and 200 with atypical cardiac function. The patients were recruited from four different clinical sites, each contributing an almost equal number of subjects. According to32, the dataset includes up to three recordings per subject, each lasting approximately 10 seconds and obtained from distinct chest positions. The recordings are WAV files captured with an electronic stethoscope. Each recording is accompanied by annotations pinpointing where the heart sounds can be heard and classifying them as normal or abnormal; the annotations were produced by experienced professional cardiologists. The CHSPC dataset has been used in machine learning competitions aimed at classifying heart sounds as normal or pathological, with the goal of creating algorithms intelligent enough to analyze and classify heart sound recordings independently. It remains a valuable resource for researchers and machine learning practitioners building algorithms that identify cardiac diseases from heart sound recordings.
The contest consisted of two rounds. The first round assessed the participants’ segmentation algorithms, while the second evaluated how accurately an algorithm could categorize heart sounds as “normal,” “murmur,” “extra heart sound,” or “artifact” in a laboratory setting. To assess the efficacy of the novel methodology, only the outcomes from the first part of the experiment, encompassing both datasets, were considered. The algorithm’s robustness was tested on two datasets containing both clean and noisy cardiac sounds; the DigiScope collection contains the more clearly audible heart sounds. Dataset A consists of 175 audio signals, each belonging to one of four categories: “normal,” “murmur,” “extra heart sound,” or “artifact.” The distribution of classes in Dataset A is depicted in Fig. 2a. Dataset B comprises 655 heart-sound audio signals in three categories: “normal,” “murmur,” and “extrasystole.” The distribution of classes in Dataset B is depicted in Fig. 2b.
The two datasets were merged to increase the complexity of the task. The final dataset comprises 832 audio signals. Figure 3 illustrates the audio signals visually, and Table 2 describes the diseases represented in the dataset.
Using Dataset A and Dataset B establishes a uniform benchmark, facilitating the comparison of algorithms and enabling researchers to replicate and extend prior work. Moreover, the diverse pathological heart sounds within these datasets make them highly valuable for developing diagnostic tools for cardiovascular conditions. The primary objective of this study was to investigate the detection of heart disease through the analysis of sound signals contaminated with noise. The study employed an existing publicly accessible dataset and generated a novel dataset by merging original and noisy heart disease sound signals. The new dataset facilitates further investigation and allows researchers to derive more significant findings from the data.
Noise induction and audio data augmentation
This research employed data augmentation to enhance the generality and complexity of the dataset. Audio data augmentation transforms existing audio data into new variations, expanding the training set and exposing machine learning models to a more diverse range of inputs, thereby improving generalization. Typical transformations include changing the pitch or tempo, adding noise or other sound effects, adjusting the volume or balance, and time-stretching or time-shifting. These techniques apply to any audio material, such as music, sound effects, and speech, and are especially useful in applications where high accuracy and robustness demand extensive and diverse training data, as in voice recognition, speaker verification, and music classification systems.
The original dataset contained 832 audio samples. Pitch and tempo changes and the addition of noise were used to augment the data. The augmented dataset includes a total of 2882 audio signals: 1538 “normal,” 746 “murmur,” 320 “artifact,” 176 “extrasystole,” and 102 “extra heart sound” signals. Figure 4 shows the distribution of the final dataset used in the analysis.
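A minimal sketch of these augmentation transforms, assuming librosa for the pitch and tempo changes and numpy for noise injection (the shift, stretch, and noise magnitudes here are illustrative, not the authors’ exact settings):

```python
import numpy as np
import librosa

def augment(y, sr):
    """Return pitch-shifted, time-stretched, and noise-injected variants of a signal."""
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # raise pitch by 2 semitones
    stretched = librosa.effects.time_stretch(y, rate=1.1)       # speed tempo up by 10%
    noisy = y + 0.005 * np.random.randn(len(y))                 # additive Gaussian noise
    return pitched, stretched, noisy

# "heartbeat.wav" is a hypothetical file name
y, sr = librosa.load("heartbeat.wav", sr=44100)
variants = augment(y, sr)
```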
Data pre-processing
Pre-processing is a crucial step in ensuring optimal machine learning model performance. The audio data underwent several preprocessing stages before being integrated into the training phase. The first stage converts the audio into a format a computer can process, enabling the extraction of essential values in subsequent steps.
Sampling rate
A sample is a discrete measurement of audio data, exemplified by a brief fragment of audio, and the sample rate describes the frequency at which samples are collected. The sample (frame) rate used in our study was 44100 Hz. Equation (1) gives the total number of frames in an audio file as the product of the sampling (frame) rate and the file’s duration in seconds9.
$$\begin{aligned} \text{total frames} = \text{sampling rate} \times \text{time} \end{aligned}$$
(1)
If the signal labeled “file1” is an audio signal spanning 9 seconds, Equation (2) gives its total number of frames.
$$\begin{aligned} file1 = 44100 \times 9 = 396{,}900 \end{aligned}$$
(2)
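In code, the frame count of Eq. (1) for this example is a one-line check:

```python
sampling_rate = 44100            # Hz, the frame rate used in this study
duration_s = 9                   # length of "file1" in seconds
total_frames = sampling_rate * duration_s
print(total_frames)              # 396900
```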
Data framing
Data framing ensures a uniform sampling (frame) rate across all audio files. Sound processing typically begins with extracting pertinent acoustic features, followed by decision-making processes encompassing information acquisition, categorization, and integration. The data derived from the audio signal is then converted into a representation in an alternative domain, namely the frequency domain. Representing audio data effectively requires a sufficiently high sampling rate and, consequently, a large number of data points; each sample indicates the magnitude of the audio waveform at a particular moment in time. Figure 5 displays a mel-spectrogram of a synthetic audio file, showing how the “loudness” of the signal changes over time at various frequencies. The horizontal axis represents time over the 9-second clip, while the vertical axis displays frequencies from 0 to 8 kHz; the amplitude of the sound wave is encoded in the purple hues.
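A mel-spectrogram like the one in Fig. 5 can be produced with librosa; the following sketch assumes a hypothetical input file and the axis ranges described above:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# "heartbeat.wav" is a hypothetical file name
y, sr = librosa.load("heartbeat.wav", sr=44100)

S = librosa.feature.melspectrogram(y=y, sr=sr, fmax=8000)  # power mel-spectrogram up to 8 kHz
S_db = librosa.power_to_db(S, ref=np.max)                  # convert power to decibels

librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel", fmax=8000)
plt.colorbar(format="%+2.0f dB")
plt.title("Mel-spectrogram")
plt.show()
```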
Data normalization and encoding
Data normalization refers to scaling numerical data to a standardized scale or range, mitigating the influence of scale differences on subsequent analysis and processing; normalization techniques are widely used in machine learning, statistics, and data mining. In this study, normalization was conducted using the standard scaler method, a technique prevalent in machine learning. During pre-processing, each feature is standardized by subtracting its mean and dividing by its standard deviation, so that every feature in the resulting dataset has a mean of zero and a standard deviation of one. The standard scaler is particularly advantageous when feature scales vary across the dataset, since such discrepancies can impair the efficacy of many machine learning methods; after scaling, the features share a uniform scale, making them directly comparable. Equation (3) standardizes the feature set of the dataset.
$$\begin{aligned} X' = \frac{X - X_{mean}}{X_{std}} \end{aligned}$$
(3)
Let \(X\) represent the original feature, \(X_{mean}\) its mean, \(X_{std}\) its standard deviation, and \(X'\) its standardized version. In addition, categorical variables are converted into numerical representations using the one-hot encoding feature transformation technique.
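Both transformations map onto standard scikit-learn utilities; a minimal sketch with placeholder arrays (scikit-learn >= 1.2 is assumed for the sparse_output argument):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

X = np.random.rand(2882, 288)                  # placeholder feature matrix (samples x features)
X_scaled = StandardScaler().fit_transform(X)   # each column now has mean 0, std 1

labels = np.array([["normal"], ["murmur"], ["artifact"]])            # placeholder class labels
y_onehot = OneHotEncoder(sparse_output=False).fit_transform(labels)  # one one-hot row per sample
```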
Feature extraction
Each sound wave exhibits multiple characteristics, but we must focus on the specific aspects relevant to the task at hand. The first stage of this analysis used Mel Frequency Cepstral Coefficients (MFCC). The process of extracting MFCC features is depicted in Fig. 6, and each step is elucidated below.
Audio Preparation step involves applying preprocessing techniques to the audio stream to eliminate background noise and non-speech or silent intervals.
Framing follows preprocessing: the signal is divided into shorter frames, typically 20 to 30 milliseconds long, usually with some overlap between consecutive frames. This captures the temporal variations of the signal and enhances temporal resolution.
Windowing mitigates spectral leakage and emphasizes the essential information within each frame by applying a window function, such as the Hamming window. A window size of 25 milliseconds, equivalent to 400 samples at a 16 kHz sampling rate, is a common choice for the short segments over which the MFCCs are computed.
The Fourier transform, specifically the short-time Fourier transform (STFT), converts the windowed frames into the frequency domain, yielding a set of complex-valued spectra. For a 25-millisecond window of 400 samples, 512 or 1024 FFT points are used.
Mel-frequency wrapping is a technique developed to approximate the non-linear frequency response of the human ear. It relies on a perceptual frequency scale known as the Mel scale: a filter bank of triangular filters maps the amplitude of each spectrum onto the Mel scale, with narrower filter spacing at lower frequencies and wider spacing at higher frequencies.
Logarithmic compression is employed to compress the dynamic range and accentuate the distinctions among the filter-bank coefficients by taking the logarithm of the magnitude values in each Mel filter-bank.
The Discrete Cosine Transform (DCT) converts the log-Mel filter-bank coefficients to the cepstral domain. Typically, only the lowest-order coefficients are preserved, as they effectively capture the fundamental characteristics of the signal.
The delta and delta-delta features are obtained by computing the first and second derivatives of the MFCCs. These features capture the temporal progression of the MFCCs and can provide further insight into the dynamics of the signal.
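The whole pipeline above is available through librosa; a minimal sketch using the window and FFT sizes just mentioned (a 400-sample Hamming window with 512 FFT points, corresponding to 25 ms at a 16 kHz sampling rate; the file name and hop length are illustrative):

```python
import numpy as np
import librosa

# Load at 16 kHz so that the 25 ms window equals 400 samples ("heartbeat.wav" is hypothetical)
y, sr = librosa.load("heartbeat.wav", sr=16000)

# Steps 2-7: framing, Hamming windowing, STFT, Mel filter bank, log compression, DCT
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,           # keep only the lowest-order cepstral coefficients
    n_fft=512,           # FFT points for the 400-sample window
    win_length=400,      # 25 ms window
    hop_length=160,      # 10 ms hop, giving overlapping frames (illustrative)
    window="hamming",
)

# Step 8: temporal dynamics via first and second derivatives
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])   # shape (39, n_frames)
```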
In addition to MFCC, this study incorporated eight further feature extraction methods and developed a feature ensembler. The techniques applied are the zero-crossing rate, spectral roll-off, spectral centroid, spectral contrast, spectral bandwidth, chroma STFT, root mean square (RMS), and mel-spectrogram; they are presented in Table 3.
Zero Crossing Rate: the rate at which the sign of a signal changes, which can indicate how noisy or clean the sound is. A higher zero crossing rate is associated with more high-frequency content or noise, while a lower rate indicates a smoother, less noisy signal34.
Spectral Roll-off: the frequency below which a given fraction of the total spectral energy is located; it helps define the spectral profile of a sound. A small spectral roll-off means most of the energy is concentrated at low frequencies, while a larger value implies a more even distribution over the audible spectrum35.
Spectral Centroid: the “brightness” of an audio signal can be determined by calculating the spectral centroid, the spectrum’s mathematical center of mass. A high spectral centroid value indicates that the audio is treble- or high-frequency-heavy, whereas a low value indicates that it is bass- or low-frequency-heavy36.
Spectral Contrast: measures the difference in amplitude between peaks and valleys in the spectrum and is used to determine the prominence of various spectral peaks. A higher contrast value suggests sharper spectral peaks, implying more distinct sound components36.
Spectral Bandwidth: measures the width of the spectral content and the frequency spread of the audio signal. A low value indicates that the frequencies are concentrated in a narrow band, while a high value denotes a wide distribution of frequencies37.
Chroma STFT: the chroma Short-Time Fourier Transform (STFT) represents the harmonic content of the audio. Extracting pitch-class information from audio frames enables the study of tonal qualities and musical notes37.
Root Mean Square (RMS): measures the root-mean-square amplitude of an audio signal, conveying its overall energy level; a greater RMS value indicates louder audio38.
Mel-Spectrogram: a representation of the audio signal in the mel-frequency domain. By converting the linear frequency scale into a perceptually appropriate mel-frequency scale, it facilitates human-like audio analysis34.
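Each of the descriptors above has a librosa counterpart. The sketch below computes them for one signal and pools each coefficient across frames into a mean and standard deviation; the pooling choice and resulting vector length are illustrative, not the exact recipe behind the 288-element ensemble described next:

```python
import numpy as np
import librosa

def spectral_features(y, sr):
    """Compute the eight descriptors and pool each coefficient across frames."""
    feats = [
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.rms(y=y),
        librosa.feature.melspectrogram(y=y, sr=sr),
    ]
    # each feature matrix is (n_coefficients, n_frames); pool to per-coefficient mean and std
    return np.hstack([np.hstack([f.mean(axis=1), f.std(axis=1)]) for f in feats])
```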
Each feature value is computed by aggregating the numbers obtained within each frame, taking their average along with their respective standard deviations. A spectral energy map on the Mel scale was generated by applying the windowed Fourier transform to the heart sound recordings; the logarithms of the Mel-scale power values were then passed through an independent cosine transformation, and the resulting MFCCs represent the intensities of the emerging spectrum. The present study introduces a novel feature ensembler that combines features derived from the various extraction methods; the resulting data frame contains 288 elements extracted from each audio file.

In our approach, we first used the Standard Scaler normalization technique to normalize the features in the dataset. Normalization ensures that the various features are on a similar scale and that no single feature dominates the others during modeling, which enhances the model’s stability and performance. After normalization, we converted the features into a numpy array, a data structure well suited to efficient numerical computation and analysis, and then reshaped the data to meet the input requirements of our chosen machine learning models, ensuring compatibility and consistency when the data is fed to them.
Finally, we separated the data into training and testing sets. The training set is used to train the classification models so they can learn and predict; the test set, unseen during training, assesses their performance and generalization to new data. This methodical approach, from normalization through data conversion, reshaping, and splitting, lays the groundwork for successful application of the classification models to our dataset and ensures that they are appropriately trained and rigorously evaluated, contributing to the overall quality and reliability of our results.
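Taken together, the normalization, array conversion, reshaping, and splitting steps correspond to a short numpy/scikit-learn pipeline; the sketch below uses placeholder names (feature_rows, y) and an illustrative 80/20 split, neither of which is specified in the text:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# feature_rows: placeholder list of 288-element feature vectors; y: placeholder labels
X = np.asarray(feature_rows, dtype=np.float32)   # convert to a numpy array
X = StandardScaler().fit_transform(X)            # comparable scales across all features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# the deep models expect a channel axis; classical models use the 2-D arrays directly
X_train_3d = X_train.reshape(len(X_train), -1, 1)
X_test_3d = X_test.reshape(len(X_test), -1, 1)
```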
Classification models and parameter settings
The study employed five machine learning models, specifically Random Forest (RF), K-Nearest Neighbour (KNN), Decision Tree (DT), Extreme Gradient Boosting (XGB), and Multilayer Perceptron (MLP), along with two deep learning models, a Deep Neural Network (DNN) and a 1D Convolutional Neural Network (CNN1D). Each model is described individually, with detailed information about its fine-tuned parameter settings.
Machine learning models
This section describes the machine learning models used in the experiments.
Random Forest Model: an ensemble learning model commonly employed for classification and regression. The ensemble consists of a collection of decision trees, each constructed from a randomly selected subset of features and training data. RF is widely recognized for its high accuracy, robustness to noise and anomalies, and ability to handle high-dimensional data, and it has been applied successfully in industries such as banking, medicine, and bioinformatics. The method generates multiple decision trees during training and then takes the majority vote of the trees’ predicted classes for classification (or the average prediction for regression); it can also estimate feature importance. The experimental settings for the RF model were: a maximum depth of 8, a maximum of 5 features considered for splitting at each node, a minimum of 5 samples required to split an internal node, and 500 estimators (decision trees) in the ensemble.
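The reported RF configuration maps directly onto scikit-learn; a minimal sketch (the training arrays are those produced by the split described earlier):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    max_depth=8,           # maximum depth of each tree
    max_features=5,        # features considered at each split
    min_samples_split=5,   # minimum samples needed to split an internal node
    n_estimators=500,      # number of decision trees in the ensemble
)
rf.fit(X_train, y_train)   # arrays from the train/test split above
```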
Decision Tree Model: a supervised learning algorithm commonly employed in machine learning for classification problems. It works by recursively partitioning the dataset into smaller subsets based on a predetermined set of features, continuing until the data can be readily classified. The tree structure is built by iteratively splitting the dataset on the feature values that best distinguish between classes, yielding decision nodes, which hold the split conditions, and leaf nodes, which store the class labels. Decision trees are popular for their ease of comprehension, interpretability, and ability to accommodate both categorical and numerical data. However, they tend to overfit the training data, compromising their generalization; ensemble methods such as random forests are therefore frequently used to improve performance. The parameters of the DT model were set as criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, and min_weight_fraction_leaf=0.0 throughout the experiments.
Extreme Gradient Boosting Model: XGBoost, a widely adopted gradient boosting technique, holds prominence in machine learning for its application in classification and regression tasks. The XGBoost algorithm is based on the gradient boosting framework, which involves iteratively adding models to an ensemble. Each added model aims to improve the overall performance of the ensemble by reducing the errors made by the previous models. XGBoost differs from previous gradient boosting techniques by incorporating the ability to handle missing values in the input data and employing a more regularized model formulation to mitigate the issue of overfitting. The acceleration of model training is achieved by using parallel processing techniques and implementing a more efficient optimization approach. Due to its inherent attributes, XGBoost has gained significant popularity and proven to be a highly efficient machine learning technique, particularly suitable for tasks involving the analysis of extensive datasets and complex feature spaces. The default parameters of the XGB model were maintained throughout the experiments.
Multilayer Perceptron Model: consists of several layers of interconnected nodes, or neurons: an input layer, several hidden layers, and an output layer. Each neuron receives signals from the neurons in the preceding layer, computes a weighted sum of them, applies a non-linear activation function, and transmits the result to the next layer. The inter-neuron connection weights are typically learned through backpropagation. MLPs have demonstrated exceptional performance in supervised learning problems such as classification and regression. The parameters of the MLP model were kept at their default values in the experiments.
K-Nearest Neighbours Model: a fundamental machine learning algorithm for classification and regression. It identifies the K labeled data points nearest to a new, unlabeled point and uses their class (or average value) to generate a prediction, under the assumption that similar data points tend to share similar labels or values. The choice of K controls the flexibility of the decision boundary. K-NN is simple to understand and implement, but its computational cost can grow significantly on large datasets, and its performance is sensitive to feature scaling; it is nevertheless versatile and effective at capturing local patterns in the data. The parameters of the KNN were set as n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, and n_jobs=None.
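For completeness, the remaining four models can be instantiated with the stated settings (scikit-learn for DT, KNN, and MLP; the xgboost package for XGB; MLP and XGB keep library defaults, as in the experiments):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

dt = DecisionTreeClassifier(criterion="gini", splitter="best", max_depth=None,
                            min_samples_split=2, min_samples_leaf=1,
                            min_weight_fraction_leaf=0.0)
knn = KNeighborsClassifier(n_neighbors=5, weights="uniform", algorithm="auto",
                           leaf_size=30, p=2, metric="minkowski")
mlp = MLPClassifier()   # default parameters, as in the experiments
xgb = XGBClassifier()   # default parameters, as in the experiments
```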
Deep learning models
In recent years, the widespread adoption of deep learning and machine learning models has transformed many fields, providing unprecedented answers to complex challenges. Convolutional Neural Networks (CNNs) and other deep learning architectures are used in areas ranging from computer vision and natural language processing to medical diagnostics and financial prediction. Researchers have harnessed these sophisticated models to extract subtle patterns and representations from large datasets, enabling breakthroughs across a wide range of applications39,40. As the demand for intelligent systems grows, the exploration and refinement of these models remain at the forefront of research and innovation. This work applies advanced deep learning techniques, with a focus on CNNs, to a problem that highlights the versatility and efficacy of modern neural network architectures.
The two deep learning models utilized in this study are the DNN and the CNN1D. Each is described separately, with detailed information about its fine-tuned parameter settings.
One-dimensional Convolutional Neural Network: The present study utilized a 1D-CNN. The Conv1D architecture is widely used in deep learning for processing sequence data with a single dimension, encompassing data types such as time series, audio signals, and text. In a Conv1D network, each convolutional layer learns a set of filters that are convolved with the input signal to discern patterns or features, and a non-linear activation function such as the Rectified Linear Unit (ReLU) is applied at the end of each convolutional layer to give the model its non-linearity. The convolutional layers are typically followed by one or more fully connected layers responsible for the classification or regression task. The Conv1D network trains its parameters via backpropagation, adjusting the network’s weights and biases to minimize a loss function that quantifies the discrepancy between the predicted and observed output.
The CNN model under examination comprises three convolutional layers, three max-pooling layers, two dropout layers, and two fully connected layers. Dropout is a regularization technique used in deep learning to address overfitting, which occurs when a model becomes excessively complex during training on a limited dataset and fits the noise rather than the underlying pattern, making extrapolation to unfamiliar data challenging. The model was trained with 571,525 parameters. Training neural networks commonly relies on an optimizer together with a loss function such as categorical cross-entropy. Optimizers iteratively adjust the weights and biases of a neural network during training, seeking the values that minimize the deviation between predicted and actual output (the loss function); stochastic gradient descent, Adam, and RMSProp are among the many optimization methods available. Categorical cross-entropy is a popular loss function for multiclass classification: it quantifies the discrepancy between the true probability distribution over classes and the estimated distribution, and minimizing it maximizes the probability of correctly classifying an instance. During training, the optimizer and the loss function together drive the iterative updates that enhance the network’s predictive capabilities. The present study employed the Adam optimizer with categorical cross-entropy as the loss function, a batch size of 64, and 90 training epochs.
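A hedged Keras sketch of this architecture follows; the filter counts, kernel sizes, dropout rates, and dense widths are illustrative placeholders (the paper reports only the layer counts and the 571,525-parameter total, which this sketch does not attempt to reproduce exactly):

```python
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(288, 1)),             # 288 ensembled features, one channel
    layers.Conv1D(64, 3, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 3, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 3, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Dropout(0.3),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(5, activation="softmax"),    # five heart-sound classes
])
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
cnn.fit(X_train_3d, y_train, batch_size=64, epochs=90)  # y_train one-hot encoded
```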
Deep Neural Network: Deep Neural Networks (DNNs) are artificial neural networks widely utilized for their extensive layers of computational capacity. Each layer acquires a progressively more intricate representation of the data by building upon the preceding layers. The input layer is the layer closest to the input data, the output layer is the one closest to the output, and the hidden layers are the intermediate layers between them. DNNs have demonstrated their ability to effectively address complex problems such as image classification, natural language processing (NLP), and speech recognition. They are trained on large datasets with algorithms that adjust the model’s weights and biases, and are assessed using a loss function. The architecture here incorporated seven dense layers; no dropout layers were employed, so that a generalized model could be compared with a more complex one. All other experimental conditions remain consistent with those of the CNN model.
The DNN model is structured sequentially, in a typical classification architecture with densely connected layers. Following an initial layer whose dimensionality is aligned with the input data, successive layers gradually decrease the number of neurons. The first dense layer contains 1000 neurons with ReLU activation for non-linearity and holds weights and biases totaling 289,000 parameters. The subsequent dense layers consist of 750, 500, 250, 100, 50, and 5 neurons, respectively; the final 5-neuron dense layer is the output layer for multiclass classification.
The architecture contains a total of 1,570,905 trainable parameters, which are fine-tuned throughout training to maximize the model’s capacity to identify links and patterns in the input data. Because the number of neurons decreases from layer to layer, the model extracts hierarchical features and can capture complex patterns in the dataset. Although the output layer’s activation function is not specified, multiclass classification tasks frequently employ softmax to generate probability distributions over the classes. This DNN model is well-suited to classification tasks, striking a good balance between avoiding excessive complexity and retaining the ability to detect intricate patterns in the input data.
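Because the dense-layer widths are given, the architecture can be reconstructed exactly: with the 288-element feature vector as input, the first layer’s \(288 \times 1000\) weights plus 1000 biases yield the quoted 289,000 parameters, and the per-layer totals sum to 1,570,905. A Keras sketch follows (ReLU on all hidden layers and softmax on the output are assumptions consistent with, but not fully fixed by, the text):

```python
from tensorflow.keras import layers, models

dnn = models.Sequential([
    layers.Input(shape=(288,)),               # 288 ensembled features
    layers.Dense(1000, activation="relu"),    # 288*1000 + 1000 = 289,000 parameters
    layers.Dense(750, activation="relu"),
    layers.Dense(500, activation="relu"),
    layers.Dense(250, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(50, activation="relu"),
    layers.Dense(5, activation="softmax"),    # assumed output activation
])
dnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
dnn.summary()                                 # Total params: 1,570,905
```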