Facebook AI Research & New York University
One of the most famous jokes in the computer vision field is that Dr. LeCun’s “controversial” paper “Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers” was rejected by CVPR but was then accepted by ICML’12. 🙂
Dr. LeCun published the reviews from the CVPR reviewers online, along with his letter to Serge Belongie, a CVPR PC chair. Though the reviews are harsh, all the reviewers argue their points responsibly. (Every coin has two sides, eh?)
Now CVPR loves deep learning, loves faces (as always), loves 3D (DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time), and even loves light fields (Accurate Depth Map Estimation From a Lenslet Light Field Camera).
One notable recent work by LeCun is The Loss Surfaces of Multilayer Networks.
Deep learning methods have had a profound impact on a number of areas in recent years, including natural image understanding and speech recognition. Other areas seem on the verge of being similarly impacted, notably natural language processing, biomedical image analysis, and the analysis of sequential signals in a variety of application domains. But deep learning systems, as they exist today, have many limitations.
First, they lack mechanisms for reasoning, search, and inference. Complex and/or ambiguous inputs require deliberate reasoning to arrive at a consistent interpretation. Producing structured outputs, such as a long text, or a label map for image segmentation, requires sophisticated search and inference algorithms to satisfy complex sets of constraints. One approach to this problem is to marry deep learning with structured prediction (an idea first presented at CVPR 1997). While several deep learning systems augmented with structured prediction modules trained end to end have been proposed for OCR, body pose estimation, and semantic segmentation, new concepts are needed for tasks that require more complex reasoning.
Second, they lack short-term memory. Many tasks in natural language understanding, such as question-answering, require a way to temporarily store isolated facts. Correctly interpreting events in a video and being able to answer questions about it require remembering abstract representations of what happens in the video. Deep learning systems, including recurrent nets, are notoriously inefficient at storing temporary memories. This has led researchers to propose neural net systems augmented with separate memory modules, such as LSTM, Memory Networks, Neural Turing Machines, and Stack-Augmented RNN. While these proposals are interesting, new ideas are needed.
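To make the idea of a separate memory module a little more concrete, here is a minimal sketch of a soft, content-based memory read. It is only loosely in the spirit of the memory-augmented models named above and is not the mechanism of any specific one. The function names (`softmax`, `read_memory`), the dot-product similarity, and the toy vectors are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read_memory(query, memory):
    """Hypothetical soft read: weight every stored slot by its similarity to the query.

    query  -- list of floats
    memory -- list of slots, each a list of floats with the same length as query
    Returns (weights, readout), where readout is the weighted average of the slots.
    """
    # Dot-product similarity between the query and each slot (an assumed choice).
    scores = [sum(q * k for q, k in zip(query, slot)) for slot in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    readout = [sum(w * slot[i] for w, slot in zip(weights, memory)) for i in range(dim)]
    return weights, readout

# Three "facts" stored as toy vectors, and a query that matches slots 0 and 2.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, readout = read_memory([5.0, 0.0], memory)
```

In this toy setup the query scores slots 0 and 2 equally and slot 1 much lower, so almost all of the read weight lands on the two matching slots. Because the read is a weighted average rather than a hard lookup, it stays differentiable, which is one reason this style of memory access can be trained end to end.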
Lastly, they lack the ability to perform unsupervised learning. Animals and humans learn most of the structure of the perceptual world in an unsupervised manner. While the interest of the ML community in neural nets was revived in the mid-2000s by progress in unsupervised learning, the vast majority of practical applications of deep learning have used purely supervised learning. There is little doubt that future progress in computer vision will require breakthroughs in unsupervised learning, particularly for video understanding. But what principles should unsupervised learning be based on?
Preliminary works in each of these areas pave the way for future progress in image and video understanding.