Learning and Optimizing Camera Pose
Doctoral thesis, 2024

Many computer vision applications involve estimating the position and orientation, i.e. the pose, of one or several cameras, including object pose estimation, visual localization, and structure-from-motion. Traditionally, such problems have often been addressed by detection, extraction, and matching of image keypoints, using handcrafted local image features such as the scale-invariant feature transform (SIFT), followed by robust fitting and/or optimization to determine the unknown camera pose(s). Learning-based models have the advantage that they can learn from data what cues or patterns are relevant for the task, beyond the imagination of the engineer. However, compared with 2D vision tasks such as image classification and object detection, applying machine learning models to 3D vision tasks such as pose estimation has proven to be more challenging.

In this thesis, I explore pose estimation methods based on machine learning and optimization, from the aspects of quality, robustness, and efficiency. First, an efficient and powerful graph attention network model for learning structure-from-motion is presented, taking image point tracks as input. Generalization to novel scenes is then demonstrated, without costly fine-tuning of network parameters. Combined with bundle adjustment, accurate reconstructions are acquired, significantly faster than with off-the-shelf incremental structure-from-motion pipelines. Second, techniques are presented for improving the equivariance properties of convolutional neural network models carrying out pose estimation, either by intentionally applying radial distortion to images to reduce perspective effects, or via a geometrically sound data augmentation scheme corresponding to camera motion. Next, the power and limitations of semidefinite relaxations of pose optimization problems are explored, notably leading to the conclusion that absolute camera pose estimation is not necessarily solvable using the considered semidefinite relaxations: while they tend to almost always be tight in practice, counter-examples do exist. Finally, a rendering-based object pose refinement method is presented, robust to partial occlusion due to its implicit nature, followed by a method for long-term visual localization, which leverages a semantic segmentation model to increase robustness by promoting semantic consistency of sampled point correspondences.
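As a rough illustration of what "determining the camera pose" by optimization means, the sketch below measures how well a candidate pose explains a set of 2D-3D point correspondences via the reprojection error of a pinhole camera. It is not taken from the thesis: the function names, the intrinsic matrix K, and all numeric values are assumptions chosen for illustration only.

```python
import numpy as np

def project(points_world, R, t, K):
    """Project Nx3 world points into a pinhole camera with pose (R, t)."""
    points_cam = points_world @ R.T + t        # world -> camera coordinates
    pixels_h = points_cam @ K.T                # apply intrinsics (homogeneous pixels)
    return pixels_h[:, :2] / pixels_h[:, 2:3]  # perspective division

def reprojection_rmse(points_world, observed, R, t, K):
    """Root-mean-square distance (in pixels) between projected and observed points."""
    residuals = project(points_world, R, t, K) - observed
    return float(np.sqrt(np.mean(np.sum(residuals ** 2, axis=1))))

# Hypothetical intrinsics and ground-truth pose (illustrative values only).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(10.0)
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])  # rotation about the y-axis
t_true = np.array([0.1, -0.2, 4.0])

# Synthetic 3D points and their noise-free image observations.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])
observed = project(points, R_true, t_true, K)

# The true pose explains the observations; a wrong orientation does not.
err_true = reprojection_rmse(points, observed, R_true, t_true, K)
err_wrong = reprojection_rmse(points, observed, np.eye(3), t_true, K)
print(err_true, err_wrong)
```

A pose estimation method, whether learned or optimization-based, can be evaluated against such an error; bundle adjustment, for example, minimizes a sum of reprojection errors of this kind jointly over camera poses and 3D points.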

optimization

structure-from-motion

camera pose estimation

machine learning

EC
Opponent: Prof. Dr. Konrad Schindler, Photogrammetry and Remote Sensing, ETH Zürich, Switzerland

Author

Lucas Brynte

Chalmers, Electrical Engineering, Signal Processing and Biomedical Engineering

Semantic Match Consistency for Long-Term Visual Localization

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11206 LNCS (2018), p. 391-408

Paper in proceeding

Pose Proposal Critic: Robust Pose Refinement by Learning Reprojection Errors

Proceedings of the British Machine Vision Conference 2020 (2020)

Paper in proceeding

On the Tightness of Semidefinite Relaxations for Rotation Estimation

Journal of Mathematical Imaging and Vision, Vol. 64 (2022), p. 57-67

Journal article

Rigidity Preserving Image Transformations and Equivariance in Perspective

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 13886 LNCS (2023), p. 59-76

Paper in proceeding

Computer vision is about developing visual functions for machines and robots similar to human visual perception. If a camera takes the place of the human eye, computer vision takes the place of the visual cortex in the brain. Computer vision involves many things, including semantic understanding of image contents, as well as 3D tracking and mapping. A core concept of the latter is camera pose, meaning the position and orientation of a camera in 3D space. The thesis studies different methods for estimating camera pose, in particular machine learning methods, but also mathematical optimization methods. Estimating camera pose is relevant for tracking the movement of objects visible in an image, tracking the movement of the camera itself, or estimating the relative position and orientation between multiple cameras.

Machine learning is one of the most prominent research fields of today, being the primary method for developing artificial intelligence systems. While many computer vision challenges have been revolutionized by machine learning in recent years, vision tasks involving 3D geometrical reasoning, such as camera pose estimation, have proven more challenging to learn. The thesis presents several contributions that demonstrate performance improvements for learning-based pose estimation. In addition, methods are presented that combine machine learning and optimization, and the power and limitations of a certain type of strategy for pose optimization, known as semidefinite relaxations, are explored.

Deep Learning for 3D Recognition

Wallenberg AI, Autonomous Systems and Software Program, 2018-01-01 -- .

Areas of Advance

Information and Communication Technology

Subject Categories

Computer Vision and Robotics (Autonomous Systems)

ISBN

978-91-7905-973-6

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5439

Publisher

Chalmers

More information

Latest update

1/8/2024 1