Humans understand events in the world contextually, performing what’s called multimodal reasoning across time to make inferences about the past, present, and future. Given text and a picture that appear benign when considered separately — e.g., “Look how many people love you” and an image of a barren desert — people recognize that these elements take on potentially hurtful connotations when they’re paired or juxtaposed.
Even the best AI systems struggle in this area. But there’s been progress, most recently from a team at the Allen Institute for Artificial Intelligence and the University of Washington’s Paul G. Allen School of Computer Science & Engineering. In a preprint paper published this month, the researchers detail Multimodal Neural Script Knowledge Models (Merlot), a system that learns to match images in videos with words and follow events globally over time by watching millions of YouTube videos with transcribed speech. It does all of this in an unsupervised fashion, meaning the videos haven’t been labeled or categorized — forcing the model to learn from the videos’ inherent structure.
Learning from videos
Our capacity for commonsense reasoning is shaped by how we experience causes and effects. Teaching machines this type of “script understanding” is a significant challenge, in part because of the amount of data it requires. For example, even a single photo of people dining at a restaurant can imply a wealth of information, like the fact that the people had to agree where to go, meet up, and enter the restaurant before sitting down.
Merlot attempts to internalize these concepts by watching YouTube videos. Lots of YouTube videos. Drawing on a dataset of 6 million videos, the researchers trained the model to match individual frames with a contextualized representation of the video transcripts, divided into segments. The dataset contained instructional videos, lifestyle vlogs of everyday events, and YouTube’s auto-suggested videos for popular topics like “science” and “home improvement,” each explicitly selected to encourage the model to learn about a broad range of objects, actions, and scenes.
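The frame-to-transcript matching described above can be illustrated with a toy contrastive objective: each frame embedding should be more similar to its own transcript segment than to any other segment in the batch. This is a hypothetical, heavily simplified sketch for intuition only — the function name, the cosine-similarity setup, and the temperature value are illustrative assumptions, not Merlot’s actual training code.

```python
import math

def contrastive_matching_loss(frame_embs, text_embs, temperature=0.07):
    """Toy frame-transcript contrastive loss (illustrative, NOT Merlot's
    real objective): frame i is treated as correctly paired with transcript
    segment i, and every other segment in the batch is a negative."""
    def normalize(v):
        m = math.sqrt(sum(x * x for x in v))
        return [x / m for x in v]

    frames = [normalize(v) for v in frame_embs]
    texts = [normalize(v) for v in text_embs]

    loss = 0.0
    for i, f in enumerate(frames):
        # Cosine similarity of this frame to every transcript segment.
        logits = [sum(a * b for a, b in zip(f, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        mx = max(logits)
        log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
        # Cross-entropy: penalize low probability on the correct pairing.
        loss += log_z - logits[i]
    return loss / len(frames)

# Aligned frame/text pairs yield a much lower loss than shuffled pairs.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = contrastive_matching_loss(frames, frames)
shuffled = contrastive_matching_loss(frames, [frames[1], frames[2], frames[0]])
```

Here `aligned` comes out near zero while `shuffled` is large, which is the signal that drives the representations of matching frames and transcript segments together.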
The goal was to teach Merlot to contextualize the frame-level representations over time and over spoken words so it could reorder scrambled video frames and make sense of “noisy” transcripts — including those with erroneously lowercase text, missing punctuation, and filler words like “umm,” “hmm,” and “yeah.” The researchers largely accomplished this. They reported that in a series of qualitative and quantitative tests, Merlot had a strong “out-of-the-box” understanding of everyday events and situations, enabling it to take a scrambled sequence of events from a video and order the frames to match the captions in a coherent narrative, like people riding a carousel.
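The frame-reordering task can likewise be sketched with a toy decoder. Assume (hypothetically — this is not the paper’s method) that a model emits, for every pair of frames, an estimated probability that one precedes the other; a simple Borda-count-style heuristic then recovers a global order from those pairwise judgments:

```python
def reorder_frames(pairwise_before):
    """Toy temporal-ordering decoder (illustrative assumption, not
    Merlot's actual decoding). pairwise_before[i][j] is a model's
    estimated probability that frame i occurs before frame j.
    Frames are ranked by their total 'comes-before' score."""
    n = len(pairwise_before)
    scores = [sum(pairwise_before[i]) for i in range(n)]
    # Highest total score = earliest frame in the recovered narrative.
    return sorted(range(n), key=lambda i: -scores[i])

# Suppose the model believes the true order is frame 2, then 0, then 1.
probs = [
    [0.5, 0.9, 0.1],  # frame 0: before 1, after 2
    [0.1, 0.5, 0.1],  # frame 1: after both
    [0.9, 0.9, 0.5],  # frame 2: before both
]
order = reorder_frames(probs)  # → [2, 0, 1]
```

Scoring pairwise judgments like this is robust to a few inconsistent predictions, which matters when the underlying estimates come from a noisy learned model.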
Merlot is only the latest work on video understanding in the AI research community. In 2019, researchers at the Georgia Institute of Technology and the University of Alberta created a system that could automatically generate commentary for “let’s play” videos of video games. More recently, researchers at Microsoft published a preprint paper describing a system that could determine whether statements about video clips were true by learning from visual and textual clues. And Facebook has trained a computer vision system that can automatically learn audio, textual, and visual representations from publicly available Facebook videos.
Above: Merlot can understand the order of events in videos, as demonstrated here.