Dr. Yasutaka Furukawa from Washington University in St. Louis presented an excellent talk on Indoor Scene Understanding and Dynamic Scene Modeling today.


I will present four of our recent projects on indoor scene understanding and dynamic scene modeling:

1) 3D sensing technologies have brought revolutionary improvements to indoor mapping. For example, Matterport is an emerging company that lets anybody 3D-scan an entire house easily with a depth camera. However, the alignment of depth data has been a challenge, and their system requires extremely dense data sampling. Our approach can significantly decrease the number of necessary scans, and hence human operating costs, by utilizing a 2D floorplan image.

2) Multi-modal data analysis between images and natural language has been popular in Computer Vision. However, multi-modal image analysis has been relatively under-explored. We study the capability of deep networks to understand the relationships between 5 million floorplan images and 80 million regular photographs. The network outperforms humans by a large margin on several multi-modal image understanding problems.

3) Single-image understanding techniques such as deep networks have shown remarkable performance in image recognition and understanding problems, but have not been utilized much by high-fidelity 3D reconstruction techniques. This is simply because single-view techniques lack sufficient precision. We show that a single-view technique can yield pixel-accurate geometric constraints for multi-view reconstruction through geometric relationship classification, enabling SfM for very challenging indoor environments.

4) 3D reconstruction techniques have had great success in static geometry inference, but dynamic scene modeling is still a big open problem. Current approaches require an extensive hardware setup (e.g., Google Jump) to produce a production-quality dynamic scene model or visualization. I will present a system that turns a regular movie of an urban scene into a Cinemagraph-style animation.


These are uncleaned notes. For more details, please visit Dr. Furukawa’s page: http://www.cse.wustl.edu/~furukawa/

Large scale indoor scanning

  • Exploiting Indoor Plan
    • RGBD Streaming
      • Kinect Fusion
      • Google Tango
    • Panorama RGBD Scanning
      • Matterport
      • Faro 3D ($18k to buy, $1k to rent for a day, 2m 3D pts)
  • Limitations
    • Requires extremely dense scanning, about one scan every 3-4 meters
    • Limited range: accuracy degrades beyond 10 meters
    • Idea for high-end 3D scanning
  • Existing approach
  • Our approach
    • 2D Scan Placement
    • S = {s1, s2, …}
    • s_i: 2D position and orientation, 4M * 4 possible values
    • Energy function: E_s(s_i) + E_{s*s}(s_i, s_j) + E_F^k(S)
      • Unary E_s: Scan-to-Floorplan consistency
      • Binary E_{s*s}: Scan-to-Scan consistency, NCC score of the patch
      • High-order E_F^k: Floorplan coverage
    • Datasets
      • 71 out of 75 are placed correctly, minimal alignment
    • Use floorplan
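
A minimal sketch of how an energy of this shape could be evaluated, to make the three terms concrete. It assumes the unary and binary terms are summed over all scans and all scan pairs, which the notes do not state. The names `ncc`, `placement_energy`, `unary`, `pairwise` and `coverage` are hypothetical and are not taken from the talk; the actual cost terms and the optimizer are not described in these notes.

```python
import numpy as np

def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation of two equally sized patches, in [-1, 1]."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A constant patch has zero variance; return 0 instead of dividing by zero.
    return float(a @ b / denom) if denom > eps else 0.0

def placement_energy(placements, unary, pairwise, coverage):
    """Total energy of a placement S = [s_1, s_2, ...]; lower is better.

    unary(s_i), pairwise(s_i, s_j) and coverage(S) are caller-supplied
    (hypothetical) cost functions standing in for the three terms.
    """
    n = len(placements)
    energy = sum(unary(s) for s in placements)            # scan-to-floorplan
    energy += sum(pairwise(placements[i], placements[j])  # scan-to-scan
                  for i in range(n) for j in range(i + 1, n))
    energy += coverage(placements)                        # floorplan coverage
    return energy
```

For example, a pairwise cost could be defined as `1 - ncc(patch_i, patch_j)` so that well-correlated overlapping patches lower the energy; that mapping is an assumption made here for illustration.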

Deep Multi-modal Image Matching

Question: Which photograph matches the floorplan pictures?

  • Require long reasoning
  • Dataset: http://www.nii.ac.jp/dsc/idr/next/homes.html
  • Multi-modal image matching
  • Non-instant reasoning
  • K-way classification
    • 100k training samples
    • vs. Amazon Mechanical Turk workers: the networks perform better than humans
  • Receptive Field Visualization (heatmap)
  • Sliding window patches to figure out the significant patch
  • The network can learn to match images across different modalities and solve a problem that requires long reasoning.
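
The sliding-window heatmap idea can be sketched as follows. The notes do not say whether each patch is blanked out or evaluated in isolation; this sketch assumes the occlusion variant, where a patch is blanked and the drop in the matching score is recorded. `occlusion_heatmap` and `score_fn` are hypothetical names, and `score_fn` stands in for whatever scalar the trained network outputs for an image.

```python
import numpy as np

def occlusion_heatmap(image, score_fn, patch=8, stride=8, fill=0.0):
    """Slide a patch over the image, blank it out, and record the score drop.

    A large value at (r, c) means the network's score depends strongly on
    that patch. score_fn(image) -> float is supplied by the caller.
    """
    h, w = image.shape[:2]
    base = score_fn(image)
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            y, x = r * stride, c * stride
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill
            heat[r, c] = base - score_fn(occluded)
    return heat
```

With a real network, `score_fn` would wrap a forward pass and return the probability of the correct match; the patch size and stride trade resolution of the heatmap against the number of forward passes.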


Deep Learning for SLAM

  • SLAM / SfM in hard cases
    • iPhone with fierce rotation and translation
    • // Multi-view community hates single-view techniques…
  • Very rough estimation -> Pixel accurate pose and 3D geometry
  • Learning pixel-accurate constraints?
    • Images-> camera poses and 3D geometry
    • Image->Geometry
  • Line-based SLAM
    • Figure out Manhattan Coplanar
    • Given a line, classify whether this line is on the floor or not
  • 5 channel – AlexNet
    • Bilinear Upsampling and Stacking
    • Get normal map from CMU
    • Manhattan coplanar
    • Horizontally coplanar
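
The "bilinear upsampling and stacking" step can be illustrated generically: maps of different resolutions are resized to a common size and stacked along a channel axis. The notes do not say which five maps are used, so the channel contents here are left entirely open, and `bilinear_resize` and `stack_channels` are hypothetical helper names written for this sketch (align-corners style interpolation is also an assumption).

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinearly resize a 2D map to (out_h, out_w), align-corners style."""
    in_h, in_w = x.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def stack_channels(maps, out_h, out_w):
    """Resize every 2D map to a common size and stack into (C, H, W)."""
    return np.stack(
        [bilinear_resize(np.asarray(m, dtype=float), out_h, out_w) for m in maps],
        axis=0,
    )
```

Passing five maps of arbitrary sizes to `stack_channels` gives a 5-channel array that could be fed to a network expecting a fixed input resolution.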

Automatic Cinemagraph

  • Static photograph with subtle animations
  • Video input
    • Re-rendering the video from a single viewpoint
  • Mask
  • Cinemagraph
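
The role of the mask can be shown with the standard cinemagraph compositing step: pixels inside the mask come from the video, and everything else is frozen to a single still frame. This is a generic sketch that assumes the video has already been re-rendered from one viewpoint; it is not the system from the talk, and `composite_cinemagraph` is a hypothetical name.

```python
import numpy as np

def composite_cinemagraph(frames, still, mask):
    """Blend animated regions of a video into a still image.

    frames: (T, H, W, C) stabilized video
    still:  (H, W, C) reference still frame
    mask:   (H, W) values in [0, 1]; 1 keeps the motion, 0 keeps the still
    """
    frames = np.asarray(frames, dtype=float)
    still = np.asarray(still, dtype=float)
    m = np.asarray(mask, dtype=float)[None, :, :, None]  # broadcast over T and C
    return m * frames + (1.0 - m) * still[None]
```

A soft (feathered) mask with values between 0 and 1 gives a smoother transition at the boundary between the animated and frozen regions.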