A Deep Dive Into Neural Synchrony Evaluation for Audio-visual Translation

Abstract

This paper presents a comprehensive analysis of the neural audio-visual synchrony evaluation tool SyncNet. We assess how well SyncNet evaluations agree with human perception, and whether SyncNet can serve as a reliable metric for evaluating audio-visual lip-synchrony in generation tasks with no ground-truth reference audio-video pair. We further examine the underlying elements in audio and video that critically affect synchrony, using interpretable explanations of SyncNet predictions, and analyse its susceptibility by introducing adversarial noise into those elements. SyncNet has been used in numerous papers on visually-grounded text-to-speech for scenarios such as dubbing. We focus in particular on its suitability for this scenario, which features many local asynchronies, a setting SyncNet was not designed for.
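To make the metric concrete, below is a minimal illustrative sketch (not the paper's code) of how a SyncNet-style synchrony score is commonly computed: embed short audio and video windows, measure their distance over a range of temporal offsets, and report the best offset together with a confidence value (median distance minus minimum distance across offsets). The embedding inputs here are stand-ins for the outputs of the real pretrained audio and lip encoders.

```python
import numpy as np

def sync_offset_and_confidence(audio_emb: np.ndarray,
                               video_emb: np.ndarray,
                               max_offset: int = 15):
    """audio_emb, video_emb: (T, D) per-frame embeddings, assumed L2-normalised.

    Returns (best_offset, confidence): the shift (in frames) that minimises
    the mean audio-video distance, and how far that minimum lies below the
    median distance across all tested offsets (higher = more confident).
    """
    offsets = list(range(-max_offset, max_offset + 1))
    mean_dists = []
    for k in offsets:
        # Shift the audio stream by k frames and compare the overlapping region.
        if k >= 0:
            a, v = audio_emb[k:], video_emb[:len(video_emb) - k]
        else:
            a, v = audio_emb[:k], video_emb[-k:]
        n = min(len(a), len(v))
        mean_dists.append(np.linalg.norm(a[:n] - v[:n], axis=1).mean())
    mean_dists = np.array(mean_dists)
    best = int(np.argmin(mean_dists))
    confidence = float(np.median(mean_dists) - mean_dists[best])
    return offsets[best], confidence

# Toy usage with random vectors in place of real SyncNet features:
# identical streams should yield offset 0 and a clearly positive confidence.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 512))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(sync_offset_and_confidence(emb, emb))
```

Note that this global offset-and-confidence formulation is exactly why local asynchronies, as in dubbing, are challenging: a single score averaged over the clip can mask short misaligned segments.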

Publication
In the 24th ACM International Conference on Multimodal Interaction, ACM ICMI 2022
Debjoy Saha
B.Tech Student

B.Tech student interested in Multimodal Machine Learning and Speech, Language and Image Processing