This paper presents a comprehensive analysis of SyncNet, a neural tool for evaluating audio-visual synchrony. We assess how well SyncNet scores agree with human perception, and whether SyncNet can serve as a reliable metric for audio-visual lip synchrony in generation tasks that lack a ground-truth reference audio-video pair. We further investigate which elements of the audio and video critically affect synchrony, using interpretable explanations of SyncNet predictions, and analyse its susceptibility to adversarial noise introduced into those elements. SyncNet has been used in numerous papers on visually grounded text-to-speech, for scenarios such as dubbing. We focus in particular on its suitability for this scenario, which features many local asynchronies, a condition SyncNet was not designed to handle.