SIGGRAPH 2014 Technical Paper Round Up

As many of you already know, SIGGRAPH 2014 (#SIGGRAPH2014) is taking place this week in Vancouver, British Columbia through 14-AUG.  SIGGRAPH has been around for more than four decades, and the presentations there consistently represent some of the most forward thinking in the fields of computer graphics, computer vision and human-computer interface technologies and techniques. I am certainly jealous of those in attendance, so I will covet from afar as I make my way to a client visit this week.  The first pages of all of the SIGGRAPH 2014 Technical Papers can be found at the SIGGRAPH site. Here is a sampling of the papers which I personally found most interesting.  A few have already been profiled by others, and where I have seen them reviewed before, I will provide additional links.  These are not in any order of priority:

  • Learning to be a Depth Camera for Close-Range Human Capture and Interaction (Microsoft Research project which proposes a machine learning technique to estimate z-depth per pixel using any conventional single 2D camera in certain limited capture and interaction scenarios [hands and faces] – demonstrating results comparable to existing consumer depth cameras, at dramatically lower cost, power consumption and form factor).  This one, admittedly, blew me away.   I have been interested in the consumer reality capture space for a while, and have blogged previously about the PrimeSense powered ecosystem and plenoptic (a/k/a “light field”) computational cameras.  I argued that light field cameras made lots of sense (to me at least) as the technology platform for mobile consumer depth sensing solutions (form factor, power consumption, etc.).   This new paper from Microsoft Research proposes a low cost depth sensing system for specific capture and interaction scenarios (the geometry of hands and faces) – turning a “regular” 2D camera into a depth sensor.   Admittedly, doing so requires that you first calibrate the 2D camera by registering depth maps captured from a depth camera against intensity images; in this way the 2D camera “learns” and encodes such things as surface geometry and reflectance.   They demonstrate two prototype hardware designs – a modified web camera for desktop sensing and a modified camera for mobile applications – in both instances demonstrating hand and face tracking on par with existing consumer depth camera solutions.  This paper is a great read: in addition to describing their proposed techniques, the authors provide a solid overview of existing consumer depth capture solutions.

Learning to be a Depth Camera
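To make the "learning" step a little more concrete, here is a toy sketch in Python. It is not the method from the paper. It assumes (my assumption, purely for illustration) that a pixel's intensity under a fixed light falls off with the square of its distance, and it fits a simple log-linear model to registered (intensity, depth) pairs of the kind a real depth camera would supply during calibration. The function names and the synthetic data are hypothetical.

```python
import math
import random


def make_training_pairs(n, albedo=0.8, noise=0.02, seed=0):
    """Synthetic stand-in for registered (intensity, depth) calibration pairs.

    Assumption for this sketch only: intensity = albedo / depth^2, plus a
    little multiplicative noise. Real calibration data would come from a
    depth camera registered against the 2D camera's intensity images.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        depth = rng.uniform(0.2, 1.0)  # metres, close-range only
        intensity = albedo / depth ** 2 * (1.0 + rng.uniform(-noise, noise))
        pairs.append((intensity, depth))
    return pairs


def fit_log_linear(pairs):
    """Least-squares fit of log(depth) = slope * log(intensity) + intercept."""
    xs = [math.log(i) for i, _ in pairs]
    ys = [math.log(d) for _, d in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_depth(intensity, model):
    """Estimate depth for one pixel intensity using the fitted model."""
    slope, intercept = model
    return math.exp(slope * math.log(intensity) + intercept)


model = fit_log_linear(make_training_pairs(500))
```

Calling `predict_depth(3.2, model)` then returns a depth estimate for a pixel of that brightness. The sketch only works because every synthetic surface shares one albedo; real hands and faces do not, which is part of why the problem is hard.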

  • Proactive 3D Scanning of Inaccessible Parts  (proposes a 3D scanning method where a user modifies/moves the object being acquired during the scanning process to capture occluded regions, using an algorithm supporting scene movement as part of the global 3D scanning process)


  • First-person Hyper-lapse Videos – paper  + Microsoft Research site (presentation of a method to convert single camera, first-person videos into hyper-lapse videos, i.e. time lapse videos with smoothly moving camera – overcoming limitations of prior stabilization methods).  What does this mean?  If you have ever tried to take a video that you shot (particularly while the camera was moving) and speed it up – the results are often not optimal.  Because frames are dropped to compress time, any camera shake is amplified, and conventional stabilization struggles to cope at high speed-up rates.   TechCrunch reviewed the Microsoft Research project here.
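A toy way to see why a smooth camera path matters for a time lapse: the Python sketch below simulates a shaky one-dimensional camera height and compares naive frame dropping against dropping frames from a smoothed path. This is only an illustration of the "smoothly moving camera" idea, not the method in the paper. The moving-average smoother, the jitter measure and all names are my own assumptions.

```python
import random


def simulate_camera_heights(n_frames, seed=0):
    """Hypothetical camera height per frame: a slow climb plus random shake."""
    rng = random.Random(seed)
    return [0.01 * t + rng.uniform(-1.0, 1.0) for t in range(n_frames)]


def moving_average(values, radius):
    """Smooth a path by averaging each sample with its neighbours."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


def jitter(path):
    """Mean absolute second difference: larger means a shakier path."""
    diffs = [
        abs(path[i + 1] - 2 * path[i] + path[i - 1])
        for i in range(1, len(path) - 1)
    ]
    return sum(diffs) / len(diffs)


def naive_timelapse(path, speedup):
    """Keep every `speedup`-th frame of the raw, shaky path."""
    return path[::speedup]


def smoothed_timelapse(path, speedup, radius):
    """Smooth the camera path first, then keep every `speedup`-th frame."""
    return moving_average(path, radius)[::speedup]


heights = simulate_camera_heights(300)
naive_jitter = jitter(naive_timelapse(heights, 10))
smooth_jitter = jitter(smoothed_timelapse(heights, 10, 10))
```

Comparing `naive_jitter` with `smooth_jitter` shows the smoothed path is far steadier. The catch in a real video is that you cannot simply average camera positions: new frames have to be rendered from the smoothed viewpoint.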


  • Color Map Optimization for 3D Reconstruction with Consumer Depth Cameras (proposes an optimization approach to map color images onto geometric reconstructions generated from range and color videos produced by consumer grade color depth cameras – demonstrating substantially improved color mapping fidelity).  Anyone who has attempted to create a 3D reconstruction of an object or a scene using consumer depth cameras knows that it is one thing to create a generally good surface map, but an entirely more challenging problem to map color, per pixel, so that it accurately represents the captured environment.  Because consumer depth cameras are inherently noisy, and in particular because the shutters of the RGB and depth cameras are not synchronized, color information is generally “out of phase” with the reconstructed surfaces.  Their method provides for some pretty incredible results:

Improved Color Map

  • Real-time Non-rigid Reconstruction Using an RGB-D Camera (a proposed hardware and software solution, using consumer graphics cards, for markerless reconstruction in real-time (at 30 Hz) of arbitrarily shaped (e.g. faces, bodies, animals), yet moving/deforming physical objects).  Real-time reconstruction of objects or scenes without moving elements is the bread and butter of solutions such as Kinect Fusion.  Real-time 3D reconstruction of moving objects is much more challenging.  Imagine, for example, having your facial expressions and body movements “painted” in real-time onto your avatar in a virtual world.   While this solution requires a custom rig (i.e. high quality capture at close range was needed, something consumer depth cameras do not provide), it is certainly exciting to see what can be achieved with relatively modest hardware modifications.


  • Functional Map Networks for Analyzing and Exploring Large Shape Collections  (proposes a new algorithm for organizing, searching and ultimately using collections of models – first by creating high quality maps connecting the models, and then using those connections for queries, reconstruction, etc.).   Much of this paper was beyond me – but the problem is certainly understood by anyone who, even today, searches for 3D content.  Most of that data is organized/categorized by metadata – and not by the characteristics of the shapes themselves.  There are obviously some services which actually interpret and categorize the underlying shape data – but most model hosting sites do not.  Imagine if you could run an algorithm against a huge database of content (e.g. Trimble’s 3D Warehouse), or even against shapes “discovered” on the web, and immediately build connections and relationships between shapes so that you could ask the query “Show me similar doors”.  Wow.


  • Automatic Editing of Footage from Multiple Social Cameras  (presents an approach that takes multiple videos captured by “social” cameras – cameras that are carried/worn by those participating in the activity – and automatically produces a final, coherent cut video of that activity, represented from multiple camera views.)   The folks at Mashable recently looked at this approach.   While this is certainly cool, I’ve often wondered why, given all the mobile video camera solutions that exist, no application has been developed which allows an event to be “socially” captured on video and then, in real or near-real time, allows interaction with that socially captured video, navigating from camera position to camera position within a 3D environment.  Sure, it is a huge data problem, but if you have gone to a concert lately you will have noticed that many folks (thousands of them in fact) are capturing some, if not all, of the event from their unique camera position.  This is certainly true for many sporting events as well (in most cases, youth sporting events where the parents are recording their children).   Taking the Microsoft Photosynth  approach on steroids, if those camera positions are back-computed into 3D space, the video and sound could be synchronized, allowing for virtual fly throughs to different camera locations (if necessary interpolating frames along the way.)  OK, we might have to borrow all of the DARPA computing power for a month for a five minute video clip, but boy would it be cool!  😉
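The synchronization piece of that daydream is the easy part, at least in toy form. The Python sketch below estimates the time offset between two recordings of the same event by brute-force cross-correlation of their audio samples. It is my own illustration, not anything taken from the paper. The function names and the synthetic "audio" are hypothetical, and a real system would also have to cope with different microphones, echoes and clock drift.

```python
import random


def best_lag(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A result of k means other[i + k] lines up with ref[i], i.e. the second
    recording started k samples before the first.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score, count = 0.0, 0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += r * other[j]
                count += 1
        if count:
            score /= count  # normalise so short overlaps are not penalised
        if score > best_score:
            best, best_score = lag, score
    return best


def make_delayed_copy(signal, delay):
    """Hypothetical second camera: the same sound, `delay` samples late."""
    return [0.0] * delay + signal[: len(signal) - delay]


rng = random.Random(1)
camera_a = [rng.uniform(-1.0, 1.0) for _ in range(200)]
camera_b = make_delayed_copy(camera_a, 7)
offset = best_lag(camera_a, camera_b, max_lag=20)
```

Once `offset` is known for every pair of cameras, each clip can be shifted onto a common timeline, which is the precondition for cutting or flying between viewpoints.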