Abstract: As an important application of speech signal processing, speech enhancement aims to reduce the influence of background noise on speech signals. However, effectively separating the target speech in highly nonstationary noise environments remains a challenging problem. Speech enhancement based on nonnegative matrix factorization (NMF), which models the spectral subspaces of speech and noise using nonnegative basis matrices, is currently an advanced and effective technique for suppressing nonstationary noise. First, this paper introduces the theory of NMF in detail, including the NMF model, the definition of the cost functions, and the commonly used multiplicative update rules. Then, the basic principle of NMF-based speech enhancement is reviewed in detail, including the specific processes of the training and enhancement stages, and experiments are carried out. In addition, an NMF-based speech reconstruction experiment is used to verify the ability of the speech basis matrix to model speech spectra. Finally, the shortcomings of traditional NMF-based algorithms are summarized, and a number of existing NMF-based algorithms are briefly reviewed in terms of their innovations, advantages, and disadvantages. Moreover, several typical methods are analyzed and compared. This paper presents the continuous development of NMF-based speech enhancement methods from a historical perspective.
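The two-stage scheme summarized above (learn nonnegative speech and noise bases in training, then keep them fixed and estimate only the activations of the noisy spectrum in enhancement) can be illustrated with a minimal sketch. It makes several assumptions that are not stated in the abstract: the generalized Kullback-Leibler divergence is chosen as the cost function, the standard Lee-Seung multiplicative updates are used, and the speech estimate is formed with a Wiener-like gain. The function names `kl_divergence`, `nmf_kl`, and `enhance` are hypothetical and chosen only for illustration.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def kl_divergence(V, V_hat):
    """Generalized KL divergence D(V || V_hat), one common NMF cost (assumed here)."""
    V = V + EPS
    V_hat = V_hat + EPS
    return float(np.sum(V * np.log(V / V_hat) - V + V_hat))


def nmf_kl(V, rank, n_iter=200, W=None, update_W=True, seed=0):
    """Factorize a nonnegative matrix V (freq x frames) as V ~ W @ H.

    If W is supplied, `rank` is ignored; with update_W=False only H is updated.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + EPS if W is None else W.copy()
    H = rng.random((W.shape[1], n_frames)) + EPS
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Multiplicative update: H <- H * (W^T (V / WH)) / (W^T 1)
        H *= (W.T @ (V / (W @ H + EPS))) / (W.T @ ones + EPS)
        if update_W:
            # Multiplicative update: W <- W * ((V / WH) H^T) / (1 H^T)
            W *= ((V / (W @ H + EPS)) @ H.T) / (ones @ H.T + EPS)
    return W, H


def enhance(V_noisy, W_speech, W_noise, n_iter=200):
    """Enhancement stage: bases stay fixed, only the activations are estimated."""
    W = np.hstack([W_speech, W_noise])
    _, H = nmf_kl(V_noisy, W.shape[1], n_iter=n_iter, W=W, update_W=False)
    k = W_speech.shape[1]
    S_hat = W_speech @ H[:k]   # speech part of the model
    N_hat = W_noise @ H[k:]    # noise part of the model
    gain = S_hat / (S_hat + N_hat + EPS)  # Wiener-like gain (an assumption)
    return gain * V_noisy
```

In this sketch the training stage would call `nmf_kl` once on clean-speech magnitude spectra and once on noise magnitude spectra to obtain `W_speech` and `W_noise`; `enhance` then returns an estimated speech magnitude spectrum that can be combined with the noisy phase for resynthesis.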
Bao Changchun, Bai Zhigang. Speech Enhancement Based on Nonnegative Matrix Factorization: An Overview [J]. Journal of Signal Processing, 2020, 36(6): 791-803.