Trade-off Between Speech Quality and Intelligibility in DTLN-Based Noise Suppression
Meyti Apriyani, Elok Hamdana | Pages: 261-272

Abstract— Noise suppression is essential in real-time speech communication, yet common evaluation metrics often capture different aspects of performance. This paper investigates the trade-off between perceptual quality and intelligibility in speech enhanced by the Dual-Signal Transformation LSTM Network (DTLN). A dataset of 1,360 noisy mixtures was created from English and Indonesian speech combined with environmental noise at multiple SNR levels. Objective evaluation was conducted using Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Mean Square Error (MSE). Results indicate a weak linear association between PESQ and STOI (Pearson r = -0.265) but a strong monotonic association (Spearman ρ = 0.92), suggesting a nonlinear and condition-dependent relationship: samples with higher intelligibility tend to rank higher in perceived quality, yet the mapping is not well captured by a single linear trend. Correlations involving MSE were negligible (PESQ–MSE: r = -0.008; STOI–MSE: r = 0.074), confirming its limited perceptual relevance. These findings demonstrate that perceptual quality and intelligibility are not interchangeable, and that relying solely on MSE is insufficient. The study recommends intelligibility-aware objectives and multi-metric evaluation strategies to balance comfort and clarity in practical applications such as telemedicine and online learning.
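
To illustrate how a linear (Pearson) and a rank-based (Spearman) correlation can diverge when two metrics are related monotonically but not linearly, here is a minimal sketch. It is not the authors' evaluation code and does not reproduce the reported values: the `stoi` and `pesq` arrays are synthetic, the power-law mapping between them is a hypothetical choice made only for illustration, and the helper names `pearson`, `rank`, and `spearman` are assumptions.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: strength of the linear association."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def rank(values):
    """1-based ranks; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))

# Synthetic scores (NOT data from the paper): PESQ rises monotonically
# with STOI, but along a strongly curved, hypothetical mapping.
stoi = np.linspace(0.5, 0.99, 50)
pesq = 1.0 + 3.5 * stoi ** 12

print("Pearson r  :", round(pearson(stoi, pesq), 3))
print("Spearman rho:", round(spearman(stoi, pesq), 3))
```

Because the synthetic mapping is strictly increasing, the two rankings agree exactly and the Spearman coefficient is at its maximum, while the curvature pulls the Pearson coefficient below it. Real PESQ and STOI scores are noisier and condition-dependent, so this sketch only shows why the two coefficients need not agree.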


DOI: https://doi.org/10.5455/jjee.204-1758702326