Multi-Scale Attention Neural Network for Acoustic Echo Cancellation. (arXiv:2106.00010v1 [cs.SD])

Acoustic Echo Cancellation (AEC) plays a key role in speech interaction by
suppressing the echo that reaches the microphone through acoustic
reverberation of the loudspeaker signal. Because the performance of a linear
adaptive filter (AF) degrades severely under nonlinear distortion, background
noise, and microphone clipping in real scenarios, deep learning has been
employed for AEC for its strong nonlinear modelling ability. In this paper, we
construct an end-to-end multi-scale attention neural network for AEC.
A temporal convolution is first used to transform the waveform into a
spectrogram. The spectrograms of the far-end reference and the near-end mixture
are concatenated and fed to a temporal convolution network (TCN) with stacked
dilated convolution layers. An attention mechanism is applied across the
representations from different layers to adaptively extract relevant features
by referring to the previous hidden state of the encoder long short-term memory
(LSTM) unit. The representations are combined by weighted averaging and fed to
the encoder LSTM for near-end speech estimation. Experiments show the superiority of
our method in terms of the echo return loss enhancement (ERLE) for single-talk
periods and the perceptual evaluation of speech quality (PESQ) score for
double-talk periods in background noise and nonlinear distortion scenarios.

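The attention step over the TCN layers can be illustrated with a minimal sketch. It is not the paper's implementation: the additive (tanh) scoring form, the parameter names W_f, W_h and v, and all dimensions are assumptions chosen for illustration. It shows only the idea stated in the abstract, that per-layer representations are scored against the previous encoder LSTM hidden state and then averaged with the resulting weights.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_attention(layer_feats, h_prev, W_f, W_h, v):
    """Weighted average of per-layer features, scored against h_prev.

    layer_feats: one feature vector (length D) per TCN layer, current frame
    h_prev:      previous hidden state of the encoder LSTM (length H)
    W_f, W_h, v: hypothetical additive-attention parameters (assumed form)
    """
    h_proj = matvec(W_h, h_prev)
    scores = []
    for feat in layer_feats:
        f_proj = matvec(W_f, feat)
        hidden = [math.tanh(a + b) for a, b in zip(f_proj, h_proj)]
        scores.append(dot(v, hidden))
    weights = softmax(scores)  # one weight per layer, summing to 1
    dim = len(layer_feats[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, layer_feats))
        for d in range(dim)
    ]
    return context, weights


# Toy dimensions (assumed): 4 layers, feature size 8, hidden size 16.
rng = random.Random(0)
n_layers, D, H, A = 4, 8, 16, 12


def rand_vec(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


feats = [rand_vec(D) for _ in range(n_layers)]
h_prev = rand_vec(H)
W_f = [rand_vec(D) for _ in range(A)]
W_h = [rand_vec(H) for _ in range(A)]
v = rand_vec(A)

context, weights = multi_scale_attention(feats, h_prev, W_f, W_h, v)
```

In this sketch, `context` would be the input to the encoder LSTM at the current frame, and the updated hidden state would be used to score the layers at the next frame.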

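For reference, ERLE is commonly defined as the ratio, in decibels, of the microphone signal power to the residual power after cancellation, measured during far-end single-talk. The sketch below uses that common definition. The abstract does not give the exact formula or averaging window used in the experiments, so the function name and the toy signals are assumptions.

```python
import math


def erle_db(mic, residual):
    """Echo return loss enhancement in dB over one segment.

    mic:      microphone samples (echo only, far-end single-talk)
    residual: samples at the canceller output for the same segment
    """
    mic_power = sum(x * x for x in mic) / len(mic)
    res_power = sum(x * x for x in residual) / len(residual)
    return 10.0 * math.log10(mic_power / res_power)


# Toy example: the residual is the echo attenuated to one tenth of its
# amplitude, i.e. one hundredth of its power, which gives about 20 dB.
mic = [math.sin(0.1 * n) for n in range(1000)]
residual = [0.1 * x for x in mic]
erle = erle_db(mic, residual)
```

A higher ERLE means more echo was removed. The measure is meaningful only when no near-end speech is present, which is why double-talk periods are scored with PESQ instead.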