Understanding and Overcoming the Limitations of Convolutional Neural Networks for Visual Localization (VisLocLearn)
Visual localization is the problem of estimating the position and orientation from which an image was taken with respect to the scene. In other words, visual localization allows an AI system to determine its position in the world through a camera. It is a core technology that enables embodied AI systems to navigate through and interact with the world. Examples include self-driving cars, autonomous robots such as the robotic lawn mowers developed by Bosch or the robotic vacuum cleaners from iRobot or Dyson, and other intelligent systems. Localization is also a crucial component of intelligent augmentation systems, such as Microsoft HoloLens or Google’s Augmented Reality navigation, that project virtual content into the field of view of a user (Augmented / Mixed Reality).
Machine learning, and especially Deep Learning, has revolutionized many areas of Computer Vision and AI, including speech recognition, semantic scene understanding, and object detection and recognition, by replacing handcrafted systems with end-to-end trained (convolutional) neural networks. Interestingly, visual localization is not among these fields: methods that use a neural network to directly regress the camera position and orientation (i.e., the camera pose) from which an image was taken are significantly less accurate than classical methods based on handcrafted features. We recently showed that such methods do not even consistently outperform a simple nearest neighbor approach that approximates the camera pose of a test image by the pose of the most similar training image. This clearly shows that current state-of-the-art methods for camera pose regression do not work as intended. At the same time, these methods have some desirable properties: run-time efficiency (only a single forward pass through a neural network is required), compactness of representation, ease of deployment, and the possibility of serving as a building block for larger end-to-end trainable systems, e.g., for visual navigation. Thus, understanding why current approaches fail and proposing novel approaches that are able to accurately localize a camera are problems of high practical relevance. This is the purpose of the proposed project, VisLocLearn.
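To make the nearest neighbor approach mentioned above concrete, the following is a minimal sketch and not the implementation used in our experiments. It assumes that every image is summarized by a global descriptor vector and that "most similar" means highest cosine similarity; both are illustrative assumptions, as are the names `cosine_similarity` and `nearest_neighbor_pose` and the representation of a pose as a position plus a unit quaternion.

```python
import math
from typing import Sequence, Tuple

# Hypothetical pose representation: (position xyz, orientation as unit quaternion wxyz).
Pose = Tuple[Tuple[float, float, float], Tuple[float, float, float, float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two descriptor vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor_pose(
    query_descriptor: Sequence[float],
    train_descriptors: Sequence[Sequence[float]],
    train_poses: Sequence[Pose],
) -> Pose:
    """Approximate the query pose by the pose of the most similar training image."""
    if len(train_descriptors) == 0 or len(train_descriptors) != len(train_poses):
        raise ValueError("need one pose per training descriptor, and at least one of each")
    best_index = max(
        range(len(train_descriptors)),
        key=lambda i: cosine_similarity(query_descriptor, train_descriptors[i]),
    )
    return train_poses[best_index]


# Toy data (made up for illustration): three training images with known poses.
TRAIN_DESCRIPTORS = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.7, 0.7, 0.0)]
TRAIN_POSES = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)),
    ((5.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)),
    ((2.5, 1.0, 0.0), (1.0, 0.0, 0.0, 0.0)),
]

if __name__ == "__main__":
    query = (0.9, 0.1, 0.0)
    print(nearest_neighbor_pose(query, TRAIN_DESCRIPTORS, TRAIN_POSES))
```

Note that such an approach can only ever return a pose that already occurs in the training set. This is why it serves as a useful lower bar: a pose regression network that has learned the underlying geometry should be able to do better than copying the pose of the closest training image.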
In detail, VisLocLearn aims to make the following three high-level contributions:
1) By developing theoretical models and verifying them through practical experiments, we aim to understand why pose regression algorithms do not outperform simple nearest neighbor approaches. In contrast to other research areas in Computer Vision and AI, the principles of projective geometry underlying the localization problem are well understood. This provides us with powerful tools for our analysis, allowing us to determine when a network learns something that is consistent with these principles. It also enables us to identify the shortcuts taken by existing pose regression algorithms that allow them to explain the training data without truly learning the underlying geometric relationships. This part of the project follows the call’s goal of developing theoretical foundations for AI and addresses the lack of theoretical understanding of pose regression algorithms in the literature.
2) We will use this theoretical understanding to develop pose regression algorithms that, as a first for their kind, achieve state-of-the-art accuracy and are thus of practical relevance. This will allow embodied AI systems to reliably and accurately determine their position in the world, thus enabling higher-level tasks such as navigation. This part of the project focuses on developing computational methods for AI that can be used to tackle core problems related to the deployment of AI systems.
3) We aim to transfer the lessons learned from our theoretical models and novel algorithms to other areas of AI where neural networks struggle to outperform classical baselines, such as visual navigation and single-view 3D reconstruction.
The scope of the project has been extended to also cover privacy-preserving visual localization techniques, e.g., localization approaches that do not reveal scene details or private details in images sent by a user to a localization service.
Torsten Sattler (contact)
Associate Professor at Chalmers, Electrical Engineering, Signal Processing and Biomedical Engineering, Imaging and Image Analysis
Doctoral Student at Chalmers, Electrical Engineering, Signal Processing and Biomedical Engineering, Imaging and Image Analysis
Chalmers AI Research Centre
Funding Chalmers participation during 2020–2024