Reconstructs a 3D scene from a sequence of RGB and depth images captured with a Kinect camera. Each consecutive image pair is aligned using SIFT feature matching, RANSAC outlier rejection, and ICP refinement — then merged into a single unified point cloud.
Project developed for the Vision and Image Processing course at Instituto Superior Técnico (Fall 2019).
Authors: João Ribeiro · Rafael Correia · Zuzanna Swiderska
Requirements
- MATLAB (R2018b or later)
- VLFeat — add it to the MATLAB path following the VLFeat setup instructions
- MATLAB Computer Vision Toolbox
Running
All the code lives in code.m. Open it in MATLAB and adjust the following before running:
- Line 2 — which .mat dataset file to load (default: sinteticotable.mat)
- Line 33 — path to scan for image datasets
- Line 41 — which dataset from the discovered list to process
Then run the script. Available datasets are in the datasets/ folder: sinteticotable, vianaPiv, sala, board, parede, viewfroom, doitifyoucan.
The main purpose of this project is to reconstruct a 3D scene from a set of 2D RGB and depth images acquired with a Kinect camera.
The problem is divided into sub-problems, each tackled by an individual component to increase modularity and testability. An overall view of the proposed solution is shown in the flowchart in Figure 3. The following sections present the acquisition tool (Section 1.1) and the theory (Sections 1.2 to 1.6) behind the solutions used (further discussed in Section 2).
RGB and depth images were acquired using a Kinect-style RGB-D camera (an Intel RealSense). It pairs a regular RGB camera with an infrared projector and camera that measure the distance from the device to the 3D world points visible in the RGB image.
Some drawbacks stem from the use of infrared (IR) light for depth sensing. It prevents outdoor use since sunlight's IR component overwhelms the projected IR pattern. Objects that don't reflect IR light well — mainly black-coloured surfaces — produce erroneous zero-depth readings, as can be seen in Figure 1c where the monitor and computer towers have no valid depth.
All images (RGB and depth) have a resolution of 640×480 pixels.
Figure 1: Acquisition tool and RGB/depth pair examples
The pin-hole camera model describes a camera as a transformation that projects 3D world points onto 2D points on an image plane.
Let $P = (X, Y, Z)$ be a point in the 3D world, expressed in the camera reference frame, and $p = (x, y)$ its projection on the image plane.
Figure 2: Pin-hole camera model
The projection is obtained by tracing a ray from the 3D point through the optical center $O$ to the image plane. With focal distance $f$ (the distance from $O$ to the image plane), similar triangles give

$$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}$$

With unitary focal distance ($f = 1$), this simplifies to $x = X/Z$ and $y = Y/Z$, which in homogeneous coordinates reads

$$\lambda \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}$$
The optical axis doesn't always intersect the image plane at the origin: there is an offset given by the principal point $(c_x, c_y)$, and metric coordinates must also be scaled to pixel units. Both effects are collected in the intrinsics matrix $K$, so that

$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}, \qquad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

Since we're now in image coordinates in pixels, these are denoted $(u, v)$.
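As a concrete illustration, here is a minimal MATLAB sketch of this projection; the intrinsic values are made-up, Kinect-like placeholders, not calibration results from this project:

```matlab
% Project 3D camera-frame points P (3xN) to pixel coordinates (u, v).
% The intrinsics below are illustrative placeholders, not calibrated values.
K = [525,   0, 319.5;
       0, 525, 239.5;
       0,   0,   1  ];
P = [0.5, -0.2; 0.1, 0.3; 2.0, 1.5];   % two example points, one per column
p = K * P;                             % homogeneous projection
u = p(1, :) ./ p(3, :);                % perspective division by depth
v = p(2, :) ./ p(3, :);
```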
The Scale-Invariant Feature Transform (SIFT) is an algorithm for detecting and matching features across images.
SIFT builds the scale space of an image by filtering it with Gaussian filters of increasing standard deviation; the resulting stack of filtered images is called an octave. After each octave, the image is downsampled and the process repeats at a smaller scale.
Keypoints are found at extrema of the Laplacian of the image — regions that change locally in many directions. In practice, this is approximated by taking extrema of differences between consecutive Gaussian-filtered images. For each keypoint, local gradients are computed and summarised into a descriptor histogram oriented relative to the dominant gradient direction. This orientation-relative descriptor makes SIFT robust to rotation; its position-independence makes it invariant to translation; normalisation of the histogram handles global illumination changes; and computing descriptors across the scale space makes them scale invariant.
Features are matched by comparing descriptors, providing point correspondences that survive many image transformations.
Random Sample Consensus (RANSAC) iteratively searches for the best estimate of a model's parameters from data that contains both inliers (explained by the model, possibly with noise) and outliers (data points deviating too far to be explained by any noise distribution).
The algorithm requires: a model, a dataset of $N$ points, the minimum number of points $n$ needed to fit the model, an error threshold $\delta$ separating inliers from outliers, and a number of iterations $k$.
Steps:
- Randomly sample $n$ points and fit the model to get parameters $\widehat{\Theta}$.
- Compute the error $\varepsilon_i = |Y_i - \widehat{\Theta} X_i|$ for all points.
- Classify each point as inlier or outlier: point $i$ is an inlier if $\varepsilon_i < \delta$, an outlier otherwise.
- Count inliers. If this is the best so far, save these as the current best inliers.
- Repeat for $k$ iterations.
The number of iterations needed to achieve probability $P$ of drawing at least one all-inlier sample, when each point is an inlier with probability $p$, is

$$k = \frac{\log(1 - P)}{\log\left(1 - p^n\right)}$$

The required $k$ for several combinations is shown in Table 1.
Table 1: Number of RANSAC iterations as a function of p and P
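A minimal MATLAB sketch of the loop described above; fit_model and model_error are hypothetical stand-ins for the model-specific fit and residual (here, the 3D affine model of Section 2), and X, Y are assumed 3xN matched point sets:

```matlab
% Generic RANSAC loop. fit_model / model_error are placeholder names for
% the model-specific fitting function and per-point residual.
P = 0.99;  p = 0.5;  n = 4;  delta = 0.05;
k = ceil(log(1 - P) / log(1 - p^n));   % iterations from the formula above
N = size(X, 2);
bestInliers = [];
for it = 1:k
    sample = randperm(N, n);                        % random minimal sample
    Theta  = fit_model(X(:, sample), Y(:, sample)); % fit to the sample
    eps_i  = model_error(Theta, X, Y);              % error for all points
    inliers = find(eps_i < delta);                  % threshold into inliers
    if numel(inliers) > numel(bestInliers)          % keep the best so far
        bestInliers = inliers;
    end
end
Theta = fit_model(X(:, bestInliers), Y(:, bestInliers)); % refit on inliers
```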
The transformation between two rigid point clouds $\{x_i\}$ and $\{y_i\}$ with known correspondences is the rotation $R$ and translation $T$ minimising

$$\sum_{i=1}^{N} \left\| y_i - (R\, x_i + T) \right\|^2 \tag{4}$$

First, compute and subtract the centroids from each point cloud:

$$\bar{x} = \frac{1}{N}\sum_{i} x_i, \quad \bar{y} = \frac{1}{N}\sum_{i} y_i, \qquad x_i' = x_i - \bar{x}, \quad y_i' = y_i - \bar{y}$$

The rotation and translation that minimise the equation above are obtained via SVD:

$$U \Sigma V^{\top} = \operatorname{svd}\!\left(\sum_i x_i'\, y_i'^{\top}\right), \qquad R = V \operatorname{diag}\!\left(1, 1, \det(V U^{\top})\right) U^{\top}, \qquad T = \bar{y} - R\,\bar{x}$$
A proof that these minimise Equation 4 can be found in Appendix A.
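A minimal MATLAB sketch of this estimation, assuming 3xN matched clouds X and Y as inputs (the function name rigid_fit is ours, not the project script's):

```matlab
function [R, T] = rigid_fit(X, Y)
% Estimate R, T minimising sum ||Y - (R*X + T)||^2 via SVD (Equation 4).
    cx = mean(X, 2);   cy = mean(Y, 2);    % centroids
    Xc = X - cx;       Yc = Y - cy;        % centred clouds
    [U, ~, V] = svd(Xc * Yc');             % 3x3 cross-covariance matrix
    D = diag([1, 1, det(V * U')]);         % guard against reflections
    R = V * D * U';
    T = cy - R * cx;
end
```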
The Iterative Closest Point (ICP) algorithm minimises Equation 4 without knowing in advance which points correspond between two clouds. It picks a sub-sample of one cloud, finds each point's nearest neighbour in the other, keeps only the fraction with the lowest distances (to avoid matching non-overlapping regions), and estimates $R$ and $T$ from those pairs with the SVD method above; the process repeats until the alignment converges or an iteration limit is reached.
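A compact sketch of one possible ICP loop, reusing the rigid_fit sketch above; keepFrac is the fraction of closest pairs to keep, and all names are assumptions rather than the project's identifiers:

```matlab
function [R, T] = icp_refine(P, Q, R, T, nIters, keepFrac)
% Refine an initial (R, T) aligning cloud P (3xN) onto cloud Q (3xM).
    for it = 1:nIters
        Pt = R * P + T;                     % apply the current estimate
        M = size(Pt, 2);
        idx = zeros(1, M);  d = zeros(1, M);
        for i = 1:M                         % nearest neighbour of each point in Q
            [d(i), idx(i)] = min(sum((Q - Pt(:, i)).^2, 1));
        end
        [~, order] = sort(d);               % keep only the closest pairs,
        keep = order(1:round(keepFrac * M));% avoiding non-overlapping regions
        [R, T] = rigid_fit(P(:, keep), Q(:, idx(keep)));
    end
end
```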
With the theoretical foundations in place, this section walks through the sequence of steps taken to solve the problem. A visual summary is in Figure 3.
Figure 3: Flowchart summarising the steps to solve the proposed problem
The main loop iterates over all consecutive image pairs. Let $(I_1, D_1)$ and $(I_2, D_2)$ be the RGB and depth images of the current pair.
Point cloud generation. Image coordinates $(u, v)$ with depth $Z = D(u, v)$ are back-projected into 3D using the depth camera intrinsics $K_d$:

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = Z\, K_d^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

The points then need to be transformed from the depth camera frame to the RGB camera frame, using a known rotation and translation between them:

$$P_{\mathrm{rgb}} = R_{d \to rgb}\, P_d + T_{d \to rgb}$$
This is applied to all points in the image pair. Figure 4 shows an example.
Figure 4: Point cloud generation from an RGB/depth pair
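A minimal sketch of these two steps in MATLAB; Kd, R_dtorgb, and T_dtorgb are assumed names for the depth intrinsics and the depth-to-RGB extrinsics, not the script's actual identifiers:

```matlab
% Back-project a 480x640 depth image into a 3xN cloud, then move it
% into the RGB camera frame. depth is assumed to be in metres.
[u, v] = meshgrid(0:639, 0:479);           % pixel grid in image coordinates
pix = [u(:)'; v(:)'; ones(1, numel(u))];   % homogeneous pixel coordinates
Z = double(depth(:))';                     % one depth value per pixel
P_d = (Kd \ pix) .* Z;                     % 3xN points, depth camera frame
P_rgb = R_dtorgb * P_d + T_dtorgb;         % 3xN points, RGB camera frame
```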
Feature matching. SIFT features and descriptors are extracted from the grayscale RGB images and matched between consecutive pairs. Detection thresholds are reduced to find as many features as possible — since correctly aligning 3D points is difficult, having a large pool of candidates helps ensure enough true matches survive outlier rejection. The matched points are then transformed from image coordinates into 3D coordinates in the RGB camera frame.
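A sketch of this matching step with VLFeat's vl_sift and vl_ubcmatch; the lowered PeakThresh value is illustrative, not the project's tuned setting:

```matlab
% Extract and match SIFT features between two consecutive RGB frames.
Ia = single(rgb2gray(rgb1));                 % vl_sift expects single grayscale
Ib = single(rgb2gray(rgb2));
[fa, da] = vl_sift(Ia, 'PeakThresh', 0);     % low threshold: more keypoints
[fb, db] = vl_sift(Ib, 'PeakThresh', 0);
[matches, ~] = vl_ubcmatch(da, db);          % ratio-test descriptor matching
pa = fa(1:2, matches(1, :));                 % matched pixel coords, frame 1
pb = fb(1:2, matches(2, :));                 % matched pixel coords, frame 2
```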
RANSAC. The two sets of matching 3D points are passed to RANSAC using a general 3D affine model:

$$y = A\,x + b, \qquad A \in \mathbb{R}^{3 \times 3},\ b \in \mathbb{R}^{3}$$

Fitting this model has a closed-form solution. For each 3D point correspondence $(x, y)$, the identity $\operatorname{vec}(A x) = (x^{\top} \otimes I_3)\operatorname{vec}(A)$ gives three linear equations in the 12 unknowns:

$$\begin{bmatrix} x^{\top} \otimes I_3 & I_3 \end{bmatrix} \begin{bmatrix} \operatorname{vec}(A) \\ b \end{bmatrix} = y$$

Stacking the equations of all sampled correspondences gives a system of the form:

$$M\,\Theta = z$$

With exactly $n = 4$ correspondences, $M$ is $12 \times 12$ and the system has a unique solution. When sampled points are coplanar, $M$ becomes rank-deficient and no unique solution exists, so the sample is discarded and a new one is drawn (see the sketch below).
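A sketch of building and solving this system for one RANSAC sample; X and Y are assumed 3x4 sampled correspondences, and the rank check implements the coplanarity rejection just described:

```matlab
% Fit the affine model y = A*x + b from n point pairs (columns of X, Y).
n = size(X, 2);                       % n = 4 for a minimal sample
M = zeros(3*n, 12);  z = zeros(3*n, 1);
for i = 1:n
    rows = 3*i-2 : 3*i;
    M(rows, :) = [kron(X(:, i)', eye(3)), eye(3)]; % vec(A*x) = (x' kron I)*vec(A)
    z(rows)    = Y(:, i);
end
if rank(M) < 12                       % coplanar sample: rank-deficient system
    error('degenerate sample, draw again');
end
theta = M \ z;                        % exact for n = 4, least squares otherwise
A = reshape(theta(1:9), 3, 3);        % column-stacked vec(A)
b = theta(10:12);
```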
Rigid transformation estimation. From the RANSAC inliers, $R$ and $T$ are estimated with the SVD method of Section 1.5:

$$R = V \operatorname{diag}\!\left(1, 1, \det(V U^{\top})\right) U^{\top}, \qquad T = \bar{y} - R\,\bar{x}$$

The middle matrix is the identity unless $\det(V U^{\top}) = -1$, in which case its last diagonal entry flips the sign to ensure $R$ is a proper rotation rather than a reflection.
ICP refinement. With an initial estimate of $R$ and $T$ from the previous step, ICP (Section 1.6) refines the alignment using the full point clouds rather than only the SIFT matches.
Merging. The final $R$ and $T$ map the second cloud into the first cloud's frame; applying them to every point of the second cloud and concatenating the result with the first cloud merges the pair:

$$P_{\mathrm{merged}} = P_1 \cup \left\{ R\,p + T \;:\; p \in P_2 \right\}$$
This composition extends to any number of clouds, progressively merging everything into the first cloud's coordinate frame.
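A sketch of this progressive composition; align_pair stands in for the full SIFT + RANSAC + ICP pipeline and clouds is an assumed cell array of 3xN point clouds:

```matlab
% Merge every cloud into the first cloud's coordinate frame.
world = clouds{1};
Racc = eye(3);  Tacc = zeros(3, 1);   % frame k -> frame 1, starts as identity
for k = 1:numel(clouds) - 1
    [R, T] = align_pair(clouds{k}, clouds{k+1});  % frame k+1 -> frame k
    Tacc = Racc * T + Tacc;           % compose into frame k+1 -> frame 1
    Racc = Racc * R;
    world = [world, Racc * clouds{k+1} + Tacc];   % append transformed points
end
```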
