Youtube2Text: Natural Language Description Generation for Diverse Activities in Video
Professor Kate Saenko
University of Massachusetts Lowell
Friday, November 1st, 2013
Abstract: Many core tasks in artificial intelligence require joint modeling of images and natural language. The past few years have seen increasing recognition of the problem, with research on connecting words and names to pictures, describing static images in natural language, and visual grounding of natural-language instructions for robotics. I will discuss recent work on generating natural language descriptions of short but extremely diverse YouTube video clips, a setting where limited prior work exists.

Despite a recent push towards large-scale object recognition, activity recognition in video remains limited to narrow domains and small vocabularies of actions. In this work, we tackle the challenge of recognizing and describing activities "in the wild". We present a solution that takes a short video clip and outputs a brief sentence summing up the main activity in the video, namely the actor, the action, and its object. Unlike previous work, our approach handles out-of-domain actions: if it cannot find an accurate prediction from a pre-trained model, it falls back to a less specific answer that is still plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors, actions, and objects; we also use a web-scale language model to "fill in" novel verbs, i.e., verbs that do not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate that it generates short sentence descriptions of video clips better than baseline approaches.
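The two ingredients sketched in the abstract, backing off up a semantic hierarchy when a visual classifier is unsure, and re-ranking (actor, action, object) triples with a language prior, can be illustrated as follows. This is a minimal, hypothetical sketch: the hierarchy, the prior values, the confidence boost per backoff step, and the log-linear combination are all illustrative assumptions, not the talk's actual model.

```python
import math

# Toy hypernym hierarchy (child -> parent); a stand-in for one learned from data.
HYPERNYMS = {"chihuahua": "dog", "dog": "animal"}

def generalize(term, confidence, threshold=0.5):
    """Back off to a more general term while classifier confidence is low.

    The +0.2 boost per step is an illustrative assumption: a more general
    term is taken to be more plausible than its specific child.
    """
    while confidence < threshold and term in HYPERNYMS:
        term = HYPERNYMS[term]
        confidence += 0.2
    return term

# Toy (actor, action, object) priors, standing in for web-scale n-gram counts.
LM_PRIOR = {
    ("person", "cut", "onion"): 0.40,
    ("person", "play", "onion"): 0.001,
    ("dog", "cut", "onion"): 1e-6,
}

def score(triple, visual_conf, smoothing=1e-9):
    """Log-linear combination of visual confidence and language prior."""
    return math.log(visual_conf) + math.log(LM_PRIOR.get(triple, smoothing))

def best_triple(candidates):
    """Pick the best triple; candidates are ((actor, action, object), conf) pairs."""
    return max(candidates, key=lambda c: score(c[0], c[1]))[0]
```

For example, a weak "chihuahua" detection backs off to a safer hypernym, and a visually confident but implausible triple like ("person", "play", "onion") loses to a slightly less confident but far more likely ("person", "cut", "onion").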
Professor Saenko has been an Assistant Professor in the Computer Science Department at UML since fall 2012. Before that, she was a Postdoctoral Researcher at the International Computer Science Institute, a Visiting Scholar at UC Berkeley EECS, and a Visiting Postdoctoral Fellow in the School of Engineering and Applied Science at Harvard University. Before that, she was a PhD student at MIT. Her research interests are in applications of machine learning to image and language understanding, multimodal perception for autonomous systems, and adaptive intelligent human-computer interfaces.