Katerina Fragkiadaki from Google Research gave a talk titled "Diverse Visual Imaginations for Video Prediction" today.


We humans constantly need to predict what is about to happen (e.g., how objects are about to move, how our actions will affect their motion, where to fixate our attention) in order to see and act in the world while compensating for the lags of our mechanical parts. For our machines to learn similar predictions conditioned on visual input from their cameras, dealing with the multimodality of the future is key: multiple future outcomes are plausible at any moment. In this talk, I will present my recent work on representations that combine non-parametric memories with parametric deformations to imagine diverse object motion directly from video pixels.


Katerina Fragkiadaki is a Postdoctoral Researcher in the Machine Perception Team at Google Research. Until a few months ago she was a Postdoctoral Researcher in the Computer Vision Laboratory at UC Berkeley. She received her Ph.D. from the University of Pennsylvania in 2013 and her BA in Electrical and Computer Engineering from the National Technical University of Athens. She enjoys thinking about how to build machines that understand the stories that videos narrate and that embody this understanding for acting in the world.


Every picture tells a story. As the saying "panta rhei", attributed to Heraclitus, puts it:

Everything flows

Visual Imaginations include

  • To see: predict where the target will move
  • To act: predict how the world changes under our actions
  • To generalize: predict object appearance under new viewpoints

Previous work includes Kalman filters (temporal smoothing of detection responses) and hard-coded physical constraints and social interaction patterns.
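To make the Kalman filter baseline concrete, here is a minimal sketch of what "temporal smoothing of detection responses" could look like for a single 1-D target position. It is an illustration only, not code from the talk: the constant-velocity motion model, the function name `kalman_smooth`, and the noise parameters `process_var` and `meas_var` are all assumptions chosen for the example.

```python
def kalman_smooth(detections, dt=1.0, process_var=1e-2, meas_var=1.0):
    """Filter noisy 1-D positions with a constant-velocity Kalman filter.

    Hypothetical helper for illustration. Returns a pair:
    (filtered positions, predicted position one step ahead).
    """
    if not detections:
        raise ValueError("need at least one detection")

    # State: position p and velocity v. Covariance P = [[p00, p01], [p10, p11]].
    p, v = float(detections[0]), 0.0
    p00, p01, p10, p11 = meas_var, 0.0, 0.0, 1.0
    filtered = [p]

    for z in detections[1:]:
        # Predict: x <- F x, P <- F P F^T + Q, with F = [[1, dt], [0, 1]]
        # and Q = process_var * I (an assumed, simplified noise model).
        p = p + v * dt
        n00 = p00 + dt * (p01 + p10) + dt * dt * p11 + process_var
        n01 = p01 + dt * p11
        n10 = p10 + dt * p11
        n11 = p11 + process_var

        # Update with a position-only measurement, H = [1, 0].
        s = n00 + meas_var           # innovation variance
        k0, k1 = n00 / s, n10 / s    # Kalman gain
        y = z - p                    # innovation (detection minus prediction)
        p += k0 * y
        v += k1 * y

        # P <- (I - K H) P
        p00 = (1.0 - k0) * n00
        p01 = (1.0 - k0) * n01
        p10 = n10 - k1 * n00
        p11 = n11 - k1 * n01

        filtered.append(p)

    return filtered, p + v * dt


if __name__ == "__main__":
    noisy = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0, 5.9, 7.2]
    smoothed, next_pos = kalman_smooth(noisy)
    print(smoothed)
    print(next_pos)
```

Note that such a filter commits to a single predicted future (one mean trajectory), which is the limitation the talk's emphasis on multimodal futures is aimed at.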

Predicting (Action-Conditioned) Dynamics

Policy Learning