
Journal of Shaanxi Normal University (Natural Science Edition)

Special Topic: Ultrasonic Testing and Audio Engineering

Audio inpainting method based on a generative adversarial network

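WANG Jie1,2, GUAN Yuansheng1, HU Wenlin2*

(1 School of Electronics and Communication Engineering, Guangzhou University, Guangzhou 510006, Guangdong, China; 2 National Engineering Laboratory for Digital Construction and Evaluation of Urban Rail Transit, China Railway Design Corporation, Tianjin 300308, China)

HU Wenlin, male, senior engineer; research interests: acoustic measurement and noise control technology. E-mail: huwenlin@crdc.com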

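Abstract:

Audio inpainting currently suffers from short repairable segment lengths, repair targets limited to highly repetitive audio, and inverse-transform distortion introduced by spectrogram-based processing. To address these problems, a new generative adversarial network for long speech inpainting is proposed. The network takes raw speech as both its input and output signal, which avoids the limitations of conventional spectrogram-based methods. First, a context codec (encoder-decoder) is used as the generator to make better use of the available content surrounding the time-domain gap in the signal. Second, a speech feature extraction module is added to the discriminator; by learning the pitch and phoneme features of the preceding and following context, it improves training efficiency and generation quality. Comparison with several existing algorithms shows that the proposed generative adversarial network achieves good speech inpainting performance and can repair gaps of up to 256 ms. Furthermore, when the audio is lengthened by changing its speed, the new model can stably repair speech gaps of up to 500 ms in the extended speech.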

Keywords:

audio inpainting; generative adversarial network; context codec; speech feature extraction

Received:

2021-11-14

CLC number:

TB51+8

Document code:

Article ID:

1672-4291(2022)06-0039-10

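Funding:

Open Project of the National Engineering Laboratory for Digital Construction and Evaluation of Urban Rail Transit (2021JZ02); National Natural Science Foundation of China (11974086); Guangzhou University Research Project (YJ2021008); Guangzhou Science and Technology Plan Project (201904010468)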

DOI:

10.15983/j.cnki.jsnu.2022213

Speech gap inpainting with generative adversarial network

WANG Jie1,2, GUAN Yuansheng1, HU Wenlin2*

(1 School of Electronics and Communication Engineering, Guangzhou University, Guangzhou 510006, Guangdong, China; 2 National Engineering Laboratory for Digital Construction and Evaluation of Urban Rail Transit, China Railway Design Corporation, Tianjin 300308, China)

Abstract:

Audio inpainting currently faces three problems: repairable segments are short, repair targets are limited to highly repetitive audio such as music, and spectrogram-based processing introduces inverse-transform distortion. To address them, a new generative adversarial network for long speech inpainting is proposed. The network takes raw speech signals as both input and output, which avoids the limitations of spectrogram-based models. First, a context codec is used as the generator to make better use of the available content surrounding the time-domain gap. Second, a speech feature extraction module is added to the discriminator; by learning the pitch and phoneme features of the preceding and following context, it improves training efficiency and generation quality. Objective and subjective evaluations against several existing algorithms show that the proposed network achieves good speech inpainting performance and can repair gaps of up to 256 ms. Furthermore, when the audio is lengthened by changing its speed, the new model can stably repair speech gaps of up to 500 ms in the extended speech.
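
The time-domain setup described in the abstract can be illustrated with a minimal sketch: a gap is cut out of a raw waveform, leaving the preceding and following context that a context codec would consume. The abstract does not state a sampling rate; the 16 kHz value, the test tone, and the helper name `mask_gap` are assumptions made for illustration only, and no generator or discriminator is implemented here.

```python
import math


def mask_gap(signal, sample_rate, gap_start_s, gap_ms):
    """Split a waveform around a gap and return a zero-filled copy.

    Hypothetical helper, not taken from the paper. Returns
    (context_before, gap_target, context_after, masked_signal).
    """
    gap_len = round(sample_rate * gap_ms / 1000)   # gap length in samples
    start = round(sample_rate * gap_start_s)       # first missing sample
    end = start + gap_len
    if start < 0 or end > len(signal):
        raise ValueError("gap does not fit inside the signal")
    before = signal[:start]      # context preceding the gap
    target = signal[start:end]   # samples an inpainting model would predict
    after = signal[end:]         # context following the gap
    masked = before + [0.0] * gap_len + after
    return before, target, after, masked


# Assumed test signal: 1 s of a 220 Hz tone sampled at an assumed 16 kHz.
SAMPLE_RATE = 16000
signal = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

before, target, after, masked = mask_gap(signal, SAMPLE_RATE, 0.4, 256)
print(len(target))  # 4096 samples for a 256 ms gap at the assumed 16 kHz
```

Under the same assumed rate, a 500 ms gap would span 8000 samples, so the longer gap roughly doubles the amount of waveform that has to be generated from the surrounding context.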

Keywords:

audio inpainting; generative adversarial network; context codec; speech feature extraction