Tags:
- ComputerScience
- Research
This paper share covers Izadi et al.'s "KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera."
KinectFusion is a technology that allows users to hold a Kinect camera and quickly reconstruct indoor scenes. This paper mainly introduces how the Microsoft team developed this technology to enable depth cameras to be used for high-precision 3D scene reconstruction.
KinectFusion
This system allows users to perform 3D scanning of indoor scenes by moving a Kinect camera, generating a geometrically accurate 3D model in real time. The system continuously tracks the six-degree-of-freedom (6DOF) pose of the camera and fuses depth data from each viewpoint into a global 3D model, gradually adding detail and filling holes. Because measurements from many viewpoints are fused, the reconstruction is refined even though each individual depth frame is noisy and incomplete, a process analogous to image super-resolution.
Low-Cost Handheld Scanning
The paper repeatedly emphasizes that the Kinect camera itself is very low-cost. KinectFusion's mobility and real-time performance make it a low-cost object scanning tool. Users can quickly capture objects from different angles and view feedback on screen immediately. The generated 3D models can be used for CAD or 3D printing.
Direct Interaction for Object Segmentation
Users can directly segment specific objects in the scene by moving them (for example, if the user only wants to scan a teacup on the table rather than the entire scene). The system monitors model changes in real time; if significant changes are detected, it marks and segments the moved object.
Geometry-Aware Augmented Reality
KinectFusion supports geometry-aware augmented reality, where virtual objects can be overlaid directly on the 3D model and precisely aligned with the real environment. This includes occlusion handling of virtual objects and reflection of real scene details, making virtual and real objects blend more naturally.
Additionally, the system uses ray tracing to generate shadows, reflections, and other effects for objects, making virtual and real objects blend more naturally in rendering. For example, virtual objects can precisely cast shadows and reflect the surrounding real environment on complex geometric surfaces such as desks.
Note that occlusion handling is crucial for AR: whether occlusion between virtual and real objects is handled correctly largely determines how believable the virtual objects appear.
Physics Simulation
Through physics simulation, KinectFusion supports dynamic interaction of virtual objects on the 3D model. The system can perform rigid body collision simulation while reconstructing the model, achieving realistic physical interaction—for example, tens of thousands of particles can interact with the scene model in real time.
Entering the Scene
KinectFusion further extends the user's ability to physically interact directly within the 3D scene.
Traditional 3D reconstruction assumes the scene is static, but this system allows the user's hand or other objects to dynamically enter the camera's field of view without disrupting the overall camera tracking. To achieve this, the system implements segmentation between foreground and background, ensuring stable camera tracking by dynamically adjusting the relative positions of foreground objects and the background model. This segmentation and tracking technology allows users to interact directly with physical objects in the virtual scene—for example, fingers can perform touch operations on the model surface.
GPU Implementation
The main system pipeline consists of four stages: depth map conversion → camera tracking → volumetric integration → ray casting. Each stage is programmed on the GPU using CUDA.
Depth Map Conversion
The paper presents this stage's algorithm as CUDA-style GPU pseudocode.
This stage converts the real-time depth map from image coordinates to 3D points (called vertices in the paper) and normals in camera coordinate space. The depth data for each pixel is projected into a 3D point, yielding a vertex map and normal map for parallel computation. The Kinect's intrinsic matrix is used to reproject depth data into 3D space.
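This back-projection can be sketched in NumPy (a minimal, illustrative version, not the paper's CUDA code; the function names and the pinhole parameters fx, fy, cx, cy are assumptions):

```python
import numpy as np

def depth_to_vertex_map(depth, fx, fy, cx, cy):
    """Back-project each depth pixel (in meters) into a 3D point in
    camera space using the pinhole intrinsics, yielding a vertex map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # shape (h, w, 3)

def vertex_to_normal_map(vertices):
    """Estimate per-pixel normals from cross products of neighboring
    vertex differences, then normalize."""
    dx = np.diff(vertices, axis=1)[:-1]    # (h-1, w-1, 3)
    dy = np.diff(vertices, axis=0)[:, :-1] # (h-1, w-1, 3)
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, 1e-12)
```

On the GPU, one thread handles one pixel; the NumPy array operations above stand in for that per-pixel parallelism.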
Camera Tracking
This stage computes the camera's six-degree-of-freedom transformation with a GPU-implemented Iterative Closest Point (ICP) algorithm, aligning the current frame's points with the previous frame's. Each relative transformation is accumulated into a global transformation matrix, which defines the Kinect's global pose. Each GPU thread finds point correspondences between the current and previous frames using projective data association, then tests in parallel whether the point-to-plane distance and normal angle fall within preset thresholds, rejecting outliers. The output of ICP is the transformation that minimizes the point-to-plane error.
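One linearized step of the point-to-plane minimization can be sketched as follows (an illustrative NumPy version assuming correspondences are already given; the paper's GPU implementation builds the same linear system with a parallel reduction):

```python
import numpy as np

def point_to_plane_icp_step(src, dst, normals):
    """Solve one linearized ICP step for the 6-DOF increment
    (alpha, beta, gamma, tx, ty, tz) minimizing
    sum_i (((R p_i + t) - q_i) . n_i)^2 with the small-angle
    approximation R = I + skew([alpha, beta, gamma])."""
    # Residual linearizes to r.(p x n) + t.n + (p - q).n
    A = np.hstack([np.cross(src, normals), normals])   # (N, 6)
    b = -np.einsum('ij,ij->i', src - dst, normals)     # (N,)
    xi, *_ = np.linalg.lstsq(A, b, rcond=None)
    alpha, beta, gamma, tx, ty, tz = xi
    T = np.eye(4)
    T[:3, :3] = np.array([[1.0, -gamma, beta],
                          [gamma, 1.0, -alpha],
                          [-beta, alpha, 1.0]])
    T[:3, 3] = [tx, ty, tz]
    return T
```

Iterating this step, re-associating points each time, converges to the frame-to-frame transformation when the initial motion is small, which holds at 30 Hz camera rates.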
This dense tracking method has enormous computational requirements and is only feasible with a GPU implementation.
Volumetric Representation and Integration
In this stage, the system uses a voxel grid for surface reconstruction rather than simply fusing point clouds or creating meshes. Each voxel stores a distance value to the physical surface via the Truncated Signed Distance Function (TSDF).
The process of updating the voxel grid is based on the distance between each voxel and the camera, integrating the measured depth data into the global coordinate system of the grid.
The system allocates a complete 3D voxel grid on the GPU and stores it as linear memory. Although this method is not memory efficient (for example, a voxel grid at 512^3 resolution requires 512MB memory), it has advantages in speed. Due to memory alignment, each GPU thread can access memory through coalesced access, improving memory access throughput. The algorithm uses a projective approach to integrate depth data into the voxel grid. By updating the TSDF (Truncated Signed Distance Function) values in the volume, the system can update the voxel grid at real-time speed (e.g., only 2ms at 512^3 resolution) and discretize the continuous surface estimate from the Kinect depth map into the voxel grid. This method is simpler to implement than hierarchical techniques, and with modern GPU memory support, can scale to modeling entire rooms.
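The per-voxel update is a weighted running average of truncated signed distances, in the style of Curless and Levoy's volumetric fusion. Below is a minimal NumPy sketch of the projective integration (array layout, function name, and the volume-anchored-at-origin convention are assumptions for illustration; the paper runs one GPU thread per voxel column):

```python
import numpy as np

def integrate_tsdf(tsdf, weights, depth, K, cam_pose_inv, voxel_size, trunc):
    """Project every voxel center into the depth map, compute the
    truncated signed distance along the camera ray, and fold it into
    a per-voxel weighted running average."""
    # Voxel centers in world coordinates (volume anchored at the origin)
    idx = np.indices(tsdf.shape).reshape(3, -1).T
    world = (idx + 0.5) * voxel_size
    # Transform into camera space, then project with the intrinsics K
    cam = (cam_pose_inv[:3, :3] @ world.T).T + cam_pose_inv[:3, 3]
    z = cam[:, 2]
    uv = (K @ cam.T).T
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sdf = np.full(len(world), np.nan)
    sdf[valid] = depth[v[valid], u[valid]] - z[valid]
    valid &= sdf > -trunc                    # skip voxels far behind surface
    d = np.clip(sdf, -trunc, trunc) / trunc  # normalized truncated distance
    flat_t, flat_w = tsdf.reshape(-1), weights.reshape(-1)
    flat_t[valid] = (flat_t[valid] * flat_w[valid] + d[valid]) / (flat_w[valid] + 1)
    flat_w[valid] += 1
    return tsdf, weights
```

Voxels in front of the measured surface converge toward +1, voxels just behind toward -1, and the surface lies on the zero crossing; averaging over frames is what smooths away the Kinect's per-frame noise.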
Ray Casting
The system uses a ray casting-based method to extract the implicit surface from the voxel grid and generate views for the user. Each GPU thread traverses the voxel grid along a ray; when the stored signed distance changes sign (a zero crossing), it computes the surface intersection point and normal to support lighting.
This stage's rendering also performs occlusion handling between virtual geometry and the real geometry of the voxel grid, making virtual objects blend seamlessly with the real scene visually and providing additional shadow and reflection effects.
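The zero-crossing search for a single ray can be sketched as follows (a simplified nearest-neighbor-sampling version; the function name and fixed step length are assumptions, and the paper assigns one GPU thread per output pixel rather than looping over rays):

```python
import numpy as np

def raycast_tsdf(tsdf, origin, direction, voxel_size, step=0.5, max_steps=2000):
    """March one ray through the TSDF volume in fixed steps (in voxel
    units); on a positive-to-negative sign change, linearly interpolate
    between the two samples to locate the zero crossing (the surface)."""
    direction = direction / np.linalg.norm(direction)
    res = np.array(tsdf.shape)
    t, prev_d, prev_t = 0.0, None, 0.0
    for _ in range(max_steps):
        p = (origin + t * direction) / voxel_size
        idx = np.floor(p).astype(int)
        if np.any(idx < 0) or np.any(idx >= res):
            t += step * voxel_size
            continue
        d = tsdf[tuple(idx)]
        if prev_d is not None and prev_d > 0 and d < 0:
            # Linear interpolation of the ray parameter at the zero crossing
            t_hit = prev_t + (t - prev_t) * prev_d / (prev_d - d)
            return origin + t_hit * direction
        prev_d, prev_t = d, t
        t += step * voxel_size
    return None  # ray exited the volume without hitting a surface
```

The surface normal at the hit point (needed for shading) would come from the TSDF gradient, omitted here for brevity.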
The paper also describes how KinectFusion extends through the GPU to support physics collision simulation between virtual objects and the reconstructed scene, making interaction between virtual and real objects more physically realistic. The specific implementation includes:
- Particle simulation: A particle-based physics simulation is implemented on the GPU. The geometric structure of the scene is represented by a set of static particles, each corresponding to a surface voxel. These particles are spheres of the same size; although they are stationary, they can collide with dynamically simulated particles. Although this is an approximate model, it can simulate every surface position of the voxel volume in real time, even down to complex shapes like book edges or teapot handles.
- Static particle creation: Static particles are generated during the volumetric integration stage. When the volume is scanned, positions where the TSDF value is close to zero are extracted and defined as the "zero layer." For each surface voxel, the system instantiates a static particle; each particle contains a 3D vertex (in global coordinates) and an ID.
- Collision detection: Collision detection is the key challenge of this simulation. To this end, the system uses a spatially uniformly partitioned grid to identify neighboring particles and assigns each particle a unique grid cell ID. Dynamic and static particles are sorted by grid cell to enable fast neighbor detection on the GPU. Each dynamic particle checks particles in neighboring cells during the simulation step to detect and handle collisions.
- Collision handling and velocity update: In collision handling, the system uses the Discrete Element Method (DEM) to compute post-collision velocity vectors. Each particle's global velocity is updated based on the influence of neighboring collisions, gravity, and boundary conditions, and finally particles are repositioned each simulation step according to the accumulated velocity.
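The uniform-grid neighbor search driving the collision detection above can be sketched as follows (a small Python illustration using a hash map of cells; the paper instead sorts particle IDs by cell on the GPU, and all names here are assumptions):

```python
import numpy as np
from collections import defaultdict

def build_grid(positions, cell_size):
    """Bin each particle into a uniform grid cell keyed by its integer
    cell coordinates, for fast neighbor lookup."""
    grid = defaultdict(list)
    for i, p in enumerate(positions):
        grid[tuple((p // cell_size).astype(int))].append(i)
    return grid

def neighbors(positions, grid, cell_size, i, radius):
    """Find particles within `radius` of particle i by scanning only
    the 27 cells around its own cell instead of all particles."""
    c = (positions[i] // cell_size).astype(int)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for j in grid.get((c[0] + dx, c[1] + dy, c[2] + dz), []):
                    if j != i and np.linalg.norm(positions[j] - positions[i]) <= radius:
                        out.append(j)
    return out
```

With the cell size chosen as the particle diameter, any colliding pair is guaranteed to lie in adjacent cells, which is what makes the 27-cell scan sufficient and keeps the per-particle cost constant on the GPU.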
References:
- Paper original: https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15869-f11/www/readings/izadi11_kinectfusion.pdf
