How to train your robot: U of T research provides new approaches

U of T postdoctoral researcher Makarand Tapaswi and PhD student Paul Vicol have used movies as a proxy for the real world in training how robots should behave (photos by Ryan Perez)

If your friend is sad, you can say something to help cheer them up. If you ask your co-worker to make coffee, they know the steps to complete this task.

But how do artificially intelligent robots, or AIs, learn to behave in the same way humans do?

University of Toronto researchers are presenting new approaches to socially intelligent AI at the Computer Vision and Pattern Recognition (CVPR) conference, the premier annual computer vision event, this week in Salt Lake City, Utah.

How do we train a robot to behave?

In their paper, Paul Vicol, a PhD student in computer science, Makarand Tapaswi, a post-doctoral researcher, Lluis Castrejón, a master's graduate of U of T computer science who is now a PhD student at the University of Montreal Institute for Learning Algorithms, and Sanja Fidler, an assistant professor at U of T Mississauga's department of mathematical and computational sciences and tri-campus graduate department of computer science, have amassed a dataset of annotated video clips from more than 50 films.

"MovieGraphs is a step towards the next generation of cognitive agents that can reason about how people feel and about the motivations for their behaviours," says Vicol. "Our goal is to enable machines to behave appropriately in social situations. Our graphs capture a lot of high-level properties of human situations that haven't been explored in prior work."

Their dataset focuses on films in the drama, romance, and comedy genres, like Forrest Gump and Titanic, and follows characters over time. The researchers excluded superhero films like Thor because they aren't very representative of the human experience.

"The idea was to use movies as a proxy for the real world," says Vicol.

Each clip, he says, is associated with a graph that captures rich detail about what's happening in the clip: which characters are present, their relationships, interactions between each other along with the reasons for why they're interacting, and their emotions.

Vicol explains that the dataset shows, for example, not only that two people are arguing, but what they're arguing about, and the reasons why they're arguing, which come from both visual cues and dialogue. The team created their own tool for enabling annotation, which was done by a single annotator for each film.

"All the clips in a movie are annotated consecutively, and the entire graph associated with each clip is created by one person, which gives us coherent structure in each graph, and between graphs over time," he says.
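As a rough illustration, one such clip graph could be sketched as a plain data structure. The field names and values below are invented for this example and are not the dataset's actual schema:

```python
# Hypothetical sketch of a MovieGraphs-style clip graph: characters,
# their relationships, interactions (with reasons), and emotions.
# All names and values are illustrative only.
clip_graph = {
    "characters": ["Forrest", "Jenny"],
    "relationships": [("Forrest", "friend", "Jenny")],
    "interactions": [
        {
            "who": ("Jenny", "Forrest"),
            "type": "encourages",
            "reason": "bullies are chasing Forrest",
        }
    ],
    "emotions": {"Forrest": "scared", "Jenny": "worried"},
}

# Consecutive clips from one movie would form a sequence of such
# graphs, giving coherent structure over time.
print(clip_graph["interactions"][0]["type"])  # → encourages
```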

With their dataset of more than 7,500 clips, the researchers introduce three tasks, explains Vicol. The first is video retrieval, based on the fact that the graphs are grounded in the videos.

"So if you search by using a graph that says Forrest Gump is arguing with someone else, and that the emotions of the characters are sad and angry, then you can find the clip," he says.
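A toy version of this retrieval task can be sketched as exact matching of query constraints against clip graphs; the actual system uses learned models, and every field name below is illustrative:

```python
def matches(query, graph):
    """Return True if every constraint in a query graph is satisfied
    by a clip graph (a toy exact-match retrieval sketch)."""
    chars_ok = set(query.get("characters", [])) <= set(graph["characters"])
    emotions_ok = all(graph["emotions"].get(c) == e
                      for c, e in query.get("emotions", {}).items())
    interactions_ok = all(any(i["type"] == t for i in graph["interactions"])
                          for t in query.get("interaction_types", []))
    return chars_ok and emotions_ok and interactions_ok

# Two made-up clip graphs.
clips = [
    {"characters": ["Forrest", "Lt. Dan"],
     "emotions": {"Forrest": "sad", "Lt. Dan": "angry"},
     "interactions": [{"type": "argues with"}]},
    {"characters": ["Forrest", "Jenny"],
     "emotions": {"Forrest": "happy", "Jenny": "happy"},
     "interactions": [{"type": "dances with"}]},
]

# Query: Forrest is sad and arguing with someone.
query = {"characters": ["Forrest"],
         "emotions": {"Forrest": "sad"},
         "interaction_types": ["argues with"]}

hits = [i for i, g in enumerate(clips) if matches(query, g)]
print(hits)  # → [0]
```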

The second is interaction ordering, which refers to determining the most plausible order of character interactions. For example, he explains, if a character were to give another character a present, the person receiving the gift would say "thank you."

"You wouldn't usually say 'thank you,' and then receive a present. It's one way to benchmark whether we're capturing the semantics of interactions."
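The intuition behind interaction ordering can be illustrated with a toy scoring scheme: assign each ordered pair of interactions a plausibility score and pick the highest-scoring sequence. The scores below are made up; the paper's benchmark uses learned models rather than a hand-built table:

```python
from itertools import permutations

# Made-up plausibility scores for ordered pairs of interactions,
# standing in for what a learned model would provide.
pair_score = {
    ("gives present", "thanks"): 0.9,
    ("thanks", "gives present"): 0.1,
}

def order_score(seq):
    # Product of pairwise scores over consecutive interactions;
    # unseen pairs default to a neutral 0.5.
    score = 1.0
    for a, b in zip(seq, seq[1:]):
        score *= pair_score.get((a, b), 0.5)
    return score

# The most plausible ordering: give the present first, then thanks.
best = max(permutations(["thanks", "gives present"]), key=order_score)
print(best)  # → ('gives present', 'thanks')
```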

Their final task is reason prediction based on the social context.

"If we focus on one interaction, can we determine the motivation behind that interaction and why it occurred? So that's basically trying to predict, when somebody yells at somebody else, the actual phrase that would explain why," he says.

Tapaswi says the end goal is to learn behaviour. 

"Imagine for example in one clip, the machine basically embodies Jenny [from the film Forrest Gump]. What is an appropriate action for Jenny? In one scene, it's to encourage Forrest to run away from bullies. So we're trying to get machines to learn appropriate behaviour."

"Appropriate in the sense that movies allow, of course."

How does a robot learn household tasks?

The VirtualHome project, led by Massachusetts Institute of Technology Assistant Professor Antonio Torralba and U of T's Fidler, is training a virtual human agent using natural language and a virtual home, so the robot can learn not only through language, but by seeing, explains U of T computer science master's student Jiaman Li, a contributing author with U of T computer science PhD student Wilson Tingwu Wang.

MovieGraphs was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and VirtualHome is supported in part by the NSERC COmputing Hardware for Emerging Intelligent Sensing Applications (COHESA) Network.

This article was first published on