Peer-reviewed article
Abstract: This paper proposes empirical mode decomposition (EMD) as an alternative method for decomposing the log-magnitude spectrum of a speech signal into its harmonic, envelope and noise components. The harmonic-to-noise ratio (HNR) is an acoustic measure that summarizes the degree of disturbance in the speech signal and is therefore used to evaluate the overall quality of disordered voices produced by dysphonic speakers. Most approaches to HNR estimation involve isolating individual speech cycles, or pseudo-harmonics in the speech spectrum or rahmonics in the cepstrum; however, this isolation cannot be carried out reliably in speech produced by severely hoarse speakers and may result in inaccurate HNR estimates. The EMD-based approach used in this study incorporates a procedure that automatically estimates the thresholds used by the clustering algorithm, without knowledge of the fundamental frequency. The frequency range of the harmonic and noise components is divided into ten equally spaced intervals, and the HNRs within each interval are used as independent variables to summarize the amount of perceived hoarseness. The proposed method is evaluated on a corpus comprising 251 normophonic and dysphonic speakers. Multiple correlation analysis carried out on the HNRs from the different frequency bands shows that multi-band analysis based on EMD yields a statistically significantly higher correlation between predicted scores and scores of perceived hoarseness than full-band analysis. Principal component analysis is carried out on the HNR measures obtained in the ten frequency bands. More than 97% of the total variance is explained by the first two principal components, PC1 and PC2. Experimental results show that the first principal component can be interpreted as the degree of severity of hoarseness, whereas the second principal component indicates whether the voice is high-pitched or low-pitched. The first two principal components are shown to yield high predictability of hoarseness scores.
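The multi-band HNR and principal component steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the EMD-based decomposition and the threshold clustering are not reproduced, and the harmonic and noise components are assumed to be already available as linear power spectra on a common, uniformly spaced frequency grid. The function names `band_hnr_db` and `pca_explained_variance` are hypothetical, and the per-band HNR is assumed to be the ratio of summed harmonic power to summed noise power, expressed in dB.

```python
import numpy as np


def band_hnr_db(harmonic_power, noise_power, n_bands=10):
    """Per-band HNR in dB (assumed definition: summed power ratio per band).

    Both inputs are 1-D linear power spectra on the same uniformly spaced
    frequency grid, so splitting the bins into n_bands near-equal groups
    approximates equally spaced frequency intervals.
    """
    harmonic_power = np.asarray(harmonic_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    if harmonic_power.shape != noise_power.shape:
        raise ValueError("harmonic and noise spectra must have the same shape")

    eps = np.finfo(float).tiny  # avoids log(0) and division by zero
    h_bands = np.array_split(harmonic_power, n_bands)
    n_bands_split = np.array_split(noise_power, n_bands)
    return np.array([
        10.0 * np.log10((h.sum() + eps) / (n.sum() + eps))
        for h, n in zip(h_bands, n_bands_split)
    ])


def pca_explained_variance(X):
    """PCA via SVD on a (speakers x bands) matrix of band HNRs.

    Returns the explained-variance ratio per component, the component
    scores for each speaker, and the component directions (rows).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each band across speakers
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    total = var.sum()
    ratio = var / total if total > 0 else np.zeros_like(var)
    scores = Xc @ vt.T  # columns 0 and 1 play the role of PC1 and PC2
    return ratio, scores, vt
```

In such a setup, each speaker would contribute one row of ten band HNRs, and `ratio[:2].sum()` would give the share of total variance captured by the first two components, the quantity the abstract reports as exceeding 97% for the studied corpus.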