
Discriminative Latent Semantic Graph for Video Captioning

Lookup NU author(s): Yang Bai, Dr Yu Guan


Full text for this publication is not currently held within this repository. Alternative links are provided below where available.


© 2021 ACM. Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video. Existing generative models such as encoder-decoder frameworks cannot explicitly explore object-level interactions and frame-level information in complex spatio-temporal data to generate semantically rich captions. Our main contribution is to identify three key problems in a joint framework for future video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional Graph that can fuse spatio-temporal information into latent object proposals. 2) Visual Knowledge: Latent Proposal Aggregation is proposed to dynamically extract visual words with higher semantic levels. 3) Sentence Validation: a novel Discriminative Language Validator is proposed to verify generated captions so that key semantic concepts can be effectively preserved. Our experiments on two public datasets (MSVD and MSR-VTT) show significant improvements over state-of-the-art approaches on all metrics, especially BLEU-4 and CIDEr. Our code is available at

Publication metadata

Author(s): Bai Y, Wang J, Long Y, Hu B, Song Y, Pagnucco M, Guan Y

Publication type: Conference Proceedings (inc. Abstract)

Publication status: Published

Conference Name: Proceedings of the 29th ACM International Conference on Multimedia (MM '21)

Year of Conference: 2021

Pages: 3556-3564

Online publication date: 17/10/2021

Acceptance date: 02/04/2018

Publisher: ACM


DOI: 10.1145/3474085.3475519


ISBN: 9781450386517