WANG Jie1,2, GUAN Yuansheng1, HU Wenlin2*
(1 School of Electronics and Communication Engineering, Guangzhou University, Guangzhou 510006, Guangdong, China; 2 National Engineering Laboratory for Digital Construction and Evaluation of Urban Rail Transit, China Railway Design Corporation, Tianjin 300308, China)
Abstract:
Existing audio inpainting methods suffer from several problems: the repairable segment is short, the methods are limited to highly repetitive music audio, and spectrogram-based processing introduces distortion in the inverse transform. To address these problems, a new generative adversarial network for long speech inpainting is proposed. The network takes raw speech signals as both input and output, which avoids the limitations of spectrogram-based models. First, a context codec is used as the generator to make better use of the available content surrounding the time-domain gap in the signal. Second, a speech feature extraction module is added to the discriminator; by learning pitch and phoneme features from the content before and after the gap, it improves training efficiency and generation quality. Objective and subjective evaluations against several existing algorithms show that the proposed network achieves outstanding speech inpainting performance, with generated gap lengths of up to 256 ms. Furthermore, by varying the audio length, an extended version of the model can stably repair speech gaps of up to 500 ms.
Keywords:
audio inpainting; generative adversarial network; context codec; speech feature extraction