This paper presents a comprehensive analysis of SyncNet, a neural tool for evaluating audio-visual synchrony. We assess how well SyncNet scores agree with human perception, and whether SyncNet can serve as a reliable metric for audio-visual lip synchrony in generation tasks that lack a ground-truth reference audio-video pair. We further investigate which elements of the audio and video critically affect synchrony, using interpretable explanations of SyncNet predictions, and analyse its susceptibility to adversarial noise introduced into those elements. SyncNet has been used in numerous papers on visually grounded text-to-speech, for scenarios such as dubbing. We focus in particular on its suitability for this scenario, which features many local asynchronies, a condition SyncNet was not designed to handle.