Deep Learning-based Visual Odometry and SLAM

Yu Huang
Dec 25, 2020

Introduction

— — — — — — — -

Visual SLAM is a challenging problem in computer vision and robotics. In autonomous driving, SLAM is also an important technique for localization and map building. Visual SLAM is harder than SLAM with other sensors, such as RGB-D cameras or LiDAR, because 3-D reconstruction from 2-D images is an ill-posed problem [1, 2].

Since AlexNet won the ImageNet competition in 2012, deep learning (mostly CNNs) has made breakthroughs in computer vision, first in classification and detection, then in tracking and segmentation. More recently, deep learning-based methods have appeared in feature extraction, disparity/depth estimation, optical flow estimation, pose estimation, etc.

The modules mentioned above mostly exist within a visual odometry (VO) or visual SLAM framework, and various methods have been put forward in this area. An obvious trend is that brute-force learning of pose or structure from images is being replaced with a multi-task learning strategy: depth, flow, segmentation, normal and edge cues are estimated simultaneously in sub-modules, trained in a unified neural network together with the required pose/reconstruction sub-module [3, 4, 8, 9, 12]. It is worth mentioning that VO or VSLAM methods are classified as direct and indirect: the former estimate the pose directly from image intensities, while the latter extract features and compute the pose from 2D-2D feature matches or 2D feature-to-3D reconstructed point matches. Accordingly, feature-based or feature-aware deep learning methods have been proposed for VO and visual SLAM as well [7, 11, 14, 15].

Besides, a VO or visual SLAM system usually consists of a frontend and a backend: the frontend computes poses and adds newly reconstructed scene points to the map, while the backend optimizes the set of poses and the map in a nonlinear optimization framework such as Bundle Adjustment (BA) [13] or Pose Graph Optimization (PGO) [6]. In addition, a keyframe structure is widely used in VO and visual SLAM, which improves the efficiency of both frontend and backend computation [10]. How to emulate these traditional mechanisms in deep learning-based VO or visual SLAM solutions is attracting more researchers' attention as well; some papers have applied RNNs (mostly LSTMs) to VO/visual SLAM [5].

Related Work

— — — — — — — —

ORB-SLAM [1] and DSO [2] are two well-known visual SLAM methods, both released as open source. ORB-SLAM is an ORB feature-based indirect method, while DSO is a direct sparse method. Both include a keyframe-based frontend and backend, with pose optimization and loop closure detection.

ORB-SLAM [1]
DSO with loop closure

Yin & Shi [3] proposed a unsupervised NN model GeoNet to estimate the depth, flow and pose together, where the flow is calculated based on camera pose compensation, based on a pose network in advance. Zou, Luo & Huang [4] also presented a method to jointly estimate depth and flow, in which geometric consistency from static scenes are utilized for flow estimation, i.e. due to the camera motion.

GeoNet [3]
DF-Net [4]

Kim & Kim [5] designed a CNN+LSTM-based visual SLAM system following the VO pipeline. Trained in an unsupervised manner, it applies a depth network and a pose network and feeds their encoder features to LSTMs, which learn the sequential optimization process. Li et al. [6] apply traditional PGO directly to the pose sequence generated by a joint depth and pose network, split into two stages, i.e. windowed PGO and global PGO.

Visual Tracking and Mapping [5]
Visual Odometry with PGO [6]

Zhan et al. [7] solved visual odometry (VO) by leveraging both traditional feature-based methods and deep learning: epipolar geometry and the Perspective-n-Point (PnP) algorithm are integrated with a depth network and a flow network to generate VO. Kim et al. [8] proposed SimVODIS, which jointly estimates depth, semantics (instance segmentation) and pose.

DF-VO [7]
SimVODIS [8]

Ranjan et al. [9] proposed the Competitive Collaboration (CC) network, jointly training pose, flow, depth and motion segmentation in an unsupervised manner. Sheng et al. [10] tackled the joint learning of keyframe detection and VO with an end-to-end unsupervised deep framework.

CC-Net [9]
Key frame + VO [10]

Shen et al. [11] consider unsupervised depth and pose estimation by incorporating geometric quantities with feature-aware strategies that avoid low-textured regions, similar to DSO. Gordon et al. [12] simultaneously learn pose, depth, camera/object motion and camera intrinsic parameters in an unsupervised manner, where object detection is applied.

DeepMatchVO [11]
Google’s Depth Prediction [12]

Shi et al. [13] proposed to jointly optimize depth and camera motion by incorporating a differentiable Bundle Adjustment (BA) layer that minimizes a feature-metric error. Xue et al. [14] emulate the keyframe idea of traditional VO methods and propose a "Memory and Refining" component, which preserves global information in the Memory module and refines poses with context via a spatial-temporal (ST) attention mechanism. Zou et al. [15] extended the idea of [14] in a two-stage unsupervised learning framework, mimicking the loop closure mechanism and keyframe selection while also jointly estimating pose, flow and depth.

VO with BA layer [13]
VO with Memory and Refining Modules [14]
VO with RNN modeling [15]

Magic Leap's researchers [16] proposed a feature network called SuperPoint, which learns feature localization and description in a self-supervised framework using homography adaptation. Recently, Dusmanu et al. [17] proposed a CNN that serves as both feature descriptor and feature detector, called D2-Net, which is trained using pixel correspondences extracted from large-scale SfM (structure-from-motion) reconstructions.

SuperPoint[16]
D2-Net [17]

It is worth mentioning that Parisotto et al. [18] propose a Neural Graph Optimizer, which applies an attention-based RNN, similar to a Transformer, to refine the global pose, implicitly mimicking loop closure.

Attention-based Recurrent Network [18]

Proposed Method

— — — — — — — — — — -

First, we describe a semi-DL method for camera pose estimation, which uses the output of a depth network, flow network or feature network to estimate the camera pose in a traditional framework, shown in Fig. 1.

Fig. 1 Semi-traditional camera pose estimation framework

In Fig. 1, we have three networks: a depth, a flow and a feature network. If the flow magnitude given by the flow network is larger than a threshold (measured as the mean absolute value of the flow field), we extract pixel pairs that satisfy forward-backward consistency constraints. Otherwise, i.e. when the flow is too small and its estimation too noisy, we instead take features from the feature network [16] and match them between consecutive frames, by brute-force search or fast approximate nearest-neighbor search in the descriptor space, with geometric constraints (such as a homography). Next, the matched feature/pixel pairs are sent to the "PnP" module together with the depth inferred from the depth network. The estimated output is the rotation and translation.
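
Below is a minimal sketch of this front-end logic, assuming the network outputs (flow, depth, matched feature pixels) are already computed; the threshold value and sampling step are illustrative assumptions, not part of the original design.

```python
import numpy as np
import cv2

def estimate_pose(flow, depth, feat_pts0, feat_pts1, K, flow_thresh=1.0):
    """flow: (H, W, 2); depth: (H, W); feat_ptsX: (N, 2) matched feature pixels."""
    if np.mean(np.abs(flow)) > flow_thresh:
        # Flow is informative: build correspondences from the dense flow field
        # (a forward-backward consistency check would filter them in practice).
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        pts0 = np.stack([u, v], -1).reshape(-1, 2)[::97]   # sparse pixel sample
        pts1 = pts0 + flow.reshape(-1, 2)[::97]
    else:
        # Flow too small/noisy: fall back to learned feature matches [16].
        pts0, pts1 = feat_pts0, feat_pts1

    # Back-project frame-0 pixels to 3-D using the inferred depth.
    z = depth[pts0[:, 1].astype(int), pts0[:, 0].astype(int)]
    pts3d = np.column_stack([(pts0[:, 0] - K[0, 2]) * z / K[0, 0],
                             (pts0[:, 1] - K[1, 2]) * z / K[1, 1], z])

    # PnP + RANSAC yields the relative rotation and translation.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float32), pts1.astype(np.float32), K.astype(np.float64), None)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```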

Then, we follow the trend of multi-task learning in deep VO or visual SLAM and propose a framework that involves a feature network, depth network, flow network, semantic segmentation network and pose network, shown in Fig. 2.

Fig. 2 A multi-task learning framework for pose, feature, depth, flow and segmentation

In Fig. 2, the input is the monocular image from the camera; the pose network and the flow network require two images, while the others only need one. All networks share the encoder that extracts hierarchical feature maps, but each has its own decoder or regression layers leading to its respective output.
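
A minimal PyTorch sketch of this shared-encoder layout is given below; the layer sizes, channel counts and head structure are illustrative assumptions rather than the exact design.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, in_ch=3, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2 * ch, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(2 * ch, 4 * ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.net(x)

class MultiTaskVO(nn.Module):
    """Depth/seg/feature heads take one frame; pose/flow heads take a frame pair."""
    def __init__(self):
        super().__init__()
        self.encoder = SharedEncoder()
        self.depth_head = nn.Conv2d(128, 1, 3, padding=1)
        self.seg_head = nn.Conv2d(128, 19, 3, padding=1)    # e.g. 19 classes (assumed)
        self.feat_head = nn.Conv2d(128, 256, 3, padding=1)  # dense descriptors
        self.flow_head = nn.Conv2d(256, 2, 3, padding=1)    # from the concatenated pair
        self.pose_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(256, 6))   # axis-angle + translation

    def forward(self, img0, img1):
        f0, f1 = self.encoder(img0), self.encoder(img1)
        pair = torch.cat([f0, f1], dim=1)
        return {"depth": self.depth_head(f0), "seg": self.seg_head(f0),
                "feat": self.feat_head(f0), "flow": self.flow_head(pair),
                "pose": self.pose_head(pair)}
```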

The feature network targets feature detection and descriptor generation. Here we show two separate decoders for feature localization and description, although D2-Net [17] uses a single decoder for both tasks. Feature matching is done by descriptor similarity, e.g. (k-)nearest-neighbor search. The matched points are used in the loss function of the pose network.
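
A small numpy sketch of nearest-neighbor descriptor matching with a mutual (cross-check) test follows; it assumes the descriptors are L2-normalized so the dot product is a cosine similarity.

```python
import numpy as np

def mutual_nn_match(desc0, desc1):
    """desc0: (N, D), desc1: (M, D) -> (K, 2) index pairs of mutual matches."""
    sim = desc0 @ desc1.T             # cosine similarity matrix (N, M)
    nn01 = sim.argmax(axis=1)         # best match in image 1 for each point of image 0
    nn10 = sim.argmax(axis=0)         # best match in image 0 for each point of image 1
    idx0 = np.arange(desc0.shape[0])
    mutual = nn10[nn01[idx0]] == idx0 # keep only cross-consistent pairs
    return np.column_stack([idx0[mutual], nn01[idx0[mutual]]])
```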

The segmentation network generates pixel-wise scene parsing, and some object classes are suggested to be removed from the pose network's loss function, such as sky (and water-like regions, e.g. fountains), trees, mirrored surfaces, vehicles and pedestrians. Likewise, the road surface and obviously static objects on the street or highway can be removed from the loss function of residual flow estimation. Similarly, extracted features are classified based on the semantic segmentation labels and contribute to the loss function accordingly.
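
A sketch of using the segmentation labels as a mask on a per-pixel residual is shown below; the set of "unreliable / dynamic" class ids is purely illustrative.

```python
import numpy as np

DYNAMIC_CLASSES = {10, 11, 12, 13}   # e.g. sky, vehicle, pedestrian, rider (assumed ids)

def masked_photometric_loss(residual, seg_labels):
    """residual: (H, W) per-pixel error; seg_labels: (H, W) class ids."""
    mask = ~np.isin(seg_labels, list(DYNAMIC_CLASSES))   # keep static pixels only
    if mask.sum() == 0:
        return 0.0
    return float(residual[mask].mean())
```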

The depth network infers pixel depth from a single image rather than from stereo pairs, while depth consistency can still be defined in the loss function with contributions from segmentation and optical flow. Meanwhile, the optical flow network estimates the residual flow on top of the rigid flow induced by the camera motion estimated by the pose network. Depth provides a structural constraint used in the losses of both the flow network and the pose network, and cross-check consistency constraints, i.e. forward and backward warping, benefit both optical flow and pose estimation.
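
The rigid flow induced by camera motion can be computed by back-projecting pixels with the predicted depth, transforming them by (R, t), re-projecting, and subtracting the original pixel grid; the residual flow is then the total flow minus this term. A numpy sketch:

```python
import numpy as np

def rigid_flow(depth, R, t, K):
    """depth: (H, W); R: (3, 3); t: (3,); K: (3, 3) -> (H, W, 2) rigid flow."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # (3, H*W)
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                 # camera-frame 3-D points
    pts2 = R @ pts + t.reshape(3, 1)                                    # move into the next frame
    proj = K @ pts2
    proj = proj[:2] / proj[2:3]                                         # perspective divide
    return (proj - pix[:2]).T.reshape(H, W, 2)

# residual_flow = predicted_total_flow - rigid_flow(depth, R, t, K)
```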

The pose network in this joint training platform generates the output, i.e. rotation R and translation t, for VO or visual SLAM.

Correspondingly, the unsupervised learning loss is defined in the usual way: an image appearance term based on view synthesis from depth, camera pose and residual flow; a smoothness term based on an edge-aware, geometry-aware depth smoothness metric; and consistency terms for camera pose (both rotation and translation), segmentation and residual optical flow.

(Note: segmentation labels are applied implicitly in the loss function as weights or masks; features likewise enter the pose- and flow-related loss terms weighted or masked by segmentation.)
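
A PyTorch sketch of this loss structure is given below; the weights are placeholders, `warp` is an assumed differentiable bilinear-warping helper, and only the photometric and smoothness terms are spelled out.

```python
import torch

def edge_aware_smoothness(depth, img):
    """Penalize depth gradients except where the image itself has strong edges."""
    d_dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    d_dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    i_dx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    i_dy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()

def total_loss(img0, img1, depth0, flow_total, static_mask, warp,
               w_photo=1.0, w_smooth=0.1):
    """img*: (B,3,H,W); depth0, static_mask: (B,1,H,W); flow_total: (B,2,H,W)."""
    # View synthesis: warp frame 1 into frame 0 with the total (rigid + residual) flow.
    img1_warped = warp(img1, flow_total)
    photo = ((img0 - img1_warped).abs().mean(1, keepdim=True) * static_mask).mean()
    smooth = edge_aware_smoothness(depth0, img0)
    # Pose/flow/segmentation consistency terms would be added here analogously.
    return w_photo * photo + w_smooth * smooth
```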

Fig. 3 shows a simple application of the trained pose network, with sliding-window PGO as the optimization backend.
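
To illustrate the backend, here is a toy 2-D sketch of windowed pose-graph optimization: poses (x, y, yaw) in a sliding window are refined so that their composed relative poses agree with the network's relative-pose measurements. A real system would operate on SE(3); this 2-D version is only for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

def wrap(a):
    return (a + np.pi) % (2 * np.pi) - np.pi

def relative(p0, p1):
    """Relative pose of p1 expressed in the frame of p0."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    c, s = np.cos(p0[2]), np.sin(p0[2])
    return np.array([c * dx + s * dy, -s * dx + c * dy, wrap(p1[2] - p0[2])])

def windowed_pgo(poses, measurements):
    """poses: (N, 3) initial guesses; measurements: list of (i, j, rel_ij) constraints."""
    def residuals(x):
        p = x.reshape(-1, 3)
        res = []
        for i, j, m in measurements:
            d = relative(p[i], p[j]) - m
            d[2] = wrap(d[2])            # keep the angular residual wrapped
            res.append(d)
        res.append(p[0] - poses[0])      # anchor the first pose in the window
        return np.concatenate(res)
    sol = least_squares(residuals, poses.ravel())
    return sol.x.reshape(-1, 3)
```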

Fig. 3 DL-based VO/V-SLAM platform

One problem of this design is low efficiency, so a keyframe-based framework is suggested, shown in Fig. 4. The selection of keyframes is similar to DSO [2], based on camera motion.
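
A sketch of a motion-based keyframe heuristic in this spirit: create a new keyframe when the estimated motion or visual change relative to the last keyframe grows large. The threshold values are illustrative assumptions.

```python
import numpy as np

def is_new_keyframe(R_rel, t_rel, mean_flow_mag,
                    t_thresh=0.3, rot_thresh=0.1, flow_thresh=8.0):
    """R_rel, t_rel: pose relative to the last keyframe; thresholds are illustrative."""
    # rotation angle from the trace of the relative rotation matrix
    rot_angle = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
    return (np.linalg.norm(t_rel) > t_thresh
            or rot_angle > rot_thresh
            or mean_flow_mag > flow_thresh)
```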

Fig. 4 A DL-based VO/V-SLAM platform with key frame selection

In some scenarios loop closures occur frequently, for example in parking lots or residential areas. So a loop closure handling module is added and a global PGO is performed as well, shown in Fig. 5. Note that the keyframe pool is kept well organized by adding or deleting keyframes to avoid redundancy.
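
One simple way to search the keyframe pool for loop-closure candidates is to compare a global descriptor of the current keyframe (e.g. a pooled feature-network output) against stored keyframe descriptors by cosine similarity; the sketch below assumes L2-normalized descriptors and illustrative thresholds.

```python
import numpy as np

def find_loop_candidates(cur_desc, keyframe_descs, cur_idx,
                         sim_thresh=0.85, min_gap=30):
    """keyframe_descs: (N, D) L2-normalized descriptors; returns candidate indices."""
    sims = keyframe_descs @ cur_desc        # cosine similarity to every stored keyframe
    candidates = []
    for i, s in enumerate(sims):
        # skip recent neighbors: a loop closure must revisit an *old* place
        if s > sim_thresh and cur_idx - i > min_gap:
            candidates.append(i)
    return candidates
```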

Fig. 5 A DL-based VO/V-SLAM platform with key frame and loop closure mechanism

The system designs in Fig. 3-5 integrate traditional methods with deep learning models; they are not end-to-end deep learning systems. Below we propose an RNN (LSTM + attention)-based deep VO/visual SLAM system, shown in Fig. 6.

Fig. 6 An End-to-End deep VO/visual SLAM platform with LSTM

In Fig. 6, a two-layer LSTM network estimates the camera pose from the CNN encoder's output. The first LSTM layer is aimed at learning short-term pose transitions, playing the role of keyframe selection and windowed PGO, and the second layer learns long-term pose dynamics and refines the global pose, playing the role of PGO with loop closure.
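
A PyTorch sketch of this two-stage recurrent pose model follows: a first LSTM over per-frame encoder features for short-term transitions, and a second LSTM that only receives (masked) keyframe states for long-term refinement. Dimensions and the masking scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoLayerLSTMVO(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.lstm_local = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.lstm_global = nn.LSTM(hidden, hidden, batch_first=True)
        self.pose_head = nn.Linear(hidden, 6)    # relative pose (axis-angle + t)

    def forward(self, feats, keyframe_mask):
        """feats: (B, T, feat_dim); keyframe_mask: (B, T) bool keyframe selection."""
        local, _ = self.lstm_local(feats)                 # short-term pose dynamics
        rel_poses = self.pose_head(local)                 # per-frame relative poses
        # feed only keyframe states to the second, long-term layer
        kf_states = local * keyframe_mask.unsqueeze(-1).float()
        refined, _ = self.lstm_global(kf_states)
        refined_poses = self.pose_head(refined)
        return rel_poses, refined_poses
```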

In fact, the first layer's output is filtered to remove redundancy by a keyframe detection model, which is trained to select (like a switch) the keyframes based on the encoder's output, shown in Fig. 7.

Fig. 7 Key frame classifier model

Different from [14-15], we do not use the rotation and translation from the pose network to decide when a keyframe occurs; instead we measure the feature map difference from the encoder. Initially the current frame is compared with the previous frame; if the difference is large enough to generate a keyframe, the reference frame is shifted to the current frame; if not, the reference frame is kept and the comparison moves on to the next frame.
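
A sketch of this keyframe switch: compare the encoder feature map of the current frame with that of the reference frame; here a simple threshold on the mean absolute difference stands in for the learned classifier, and the threshold is an assumption.

```python
import numpy as np

class KeyframeSelector:
    def __init__(self, diff_thresh=0.5):
        self.ref_feat = None
        self.diff_thresh = diff_thresh

    def update(self, cur_feat):
        """cur_feat: encoder feature map of the current frame; returns True if keyframe."""
        if self.ref_feat is None:
            self.ref_feat = cur_feat
            return True
        diff = np.mean(np.abs(cur_feat - self.ref_feat))
        if diff > self.diff_thresh:        # big change: promote to keyframe and
            self.ref_feat = cur_feat       # shift the reference to this frame
            return True
        return False                       # small change: keep the old reference
```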

The second LSTM layer is designed to emulate the global pose optimization with loop closure used in traditional SLAM methods. The loop closure mechanism is mimicked by a Transformer-like [19] self-attention mechanism, whose similarity is measured on the feature network's output, shown in Fig. 8.
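
A minimal PyTorch sketch of this attention step: the current keyframe feature attends over all stored keyframe features, and a high attention weight on a distant past keyframe plays the role of a detected loop closure.

```python
import torch
import torch.nn.functional as F

def loop_closure_attention(query_feat, keyframe_feats):
    """query_feat: (D,); keyframe_feats: (N, D) -> context (D,), weights (N,)."""
    d = query_feat.shape[-1]
    scores = keyframe_feats @ query_feat / d ** 0.5      # scaled dot-product scores
    weights = F.softmax(scores, dim=0)
    context = weights @ keyframe_feats                   # attention-weighted summary
    return context, weights
```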

Fig. 8 Attention-based model for loop closure mechanism

Similar to [14-15], the global/absolute pose is calculated in the second LSTM layer by accumulating the relative poses predicted by the first LSTM layer.
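
A numpy sketch of this accumulation, chaining 4x4 homogeneous transforms to turn relative poses into global poses:

```python
import numpy as np

def accumulate_poses(rel_poses):
    """rel_poses: list of 4x4 transforms T_{k-1,k}; returns global 4x4 poses T_{0,k}."""
    global_pose = np.eye(4)
    global_poses = [global_pose.copy()]
    for T in rel_poses:
        global_pose = global_pose @ T    # compose with the next relative motion
        global_poses.append(global_pose.copy())
    return global_poses
```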

Different from [15], we do not use a cycle consistency loss to emulate loop closure; rather, we define a similarity in the attention mechanism to measure whether a loop closure occurs. Meanwhile, the features are taken explicitly from the feature network, instead of the encoder's feature map as in [14]. Also, the attention is a spatial-temporal attention mechanism instead of temporal attention only.

The encoder and the two-layer LSTM are trained jointly with the flow network, segmentation network, feature network, depth network, keyframe classifier and attention-based loop closure mechanism, shown in Fig. 9.

Fig. 9 A joint training framework of pose, feature, flow, depth and segmentation with LSTM

The "multi-heads" module is shown in Fig. 10 and is similar to Fig. 2. Similar to reference [15], training is done in multiple stages: the first stage trains the CNN encoder framework, as in Fig. 2; the second stage trains the CNN encoder together with the first LSTM layer; the third stage trains the CNN encoder with the full two-layer LSTM.
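
A sketch of such staged training, assuming hypothetical model parts `model.encoder_heads`, `model.lstm_local` and `model.lstm_global`; each stage freezes what was trained before and optimizes only the newly added component.

```python
import torch

def freeze(module):
    for p in module.parameters():
        p.requires_grad = False

def run_stage(params, train_one_epoch, epochs=10, lr=1e-4):
    optim = torch.optim.Adam([p for p in params if p.requires_grad], lr=lr)
    for _ in range(epochs):
        train_one_epoch(optim)           # user-supplied training-loop body

# Stage 1: CNN encoder + multi-task heads (as in Fig. 2)
#   run_stage(model.encoder_heads.parameters(), train_one_epoch)
# Stage 2: add the first LSTM layer, encoder frozen
#   freeze(model.encoder_heads); run_stage(model.lstm_local.parameters(), train_one_epoch)
# Stage 3: add the second LSTM layer, earlier parts frozen
#   freeze(model.lstm_local); run_stage(model.lstm_global.parameters(), train_one_epoch)
```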

Fig. 10 The multi-head module architecture

Summary

— — — — —

We propose an end-to-end deep VO/visual SLAM framework. In the CNN architecture, the depth network, flow network, segmentation network and feature network are jointly trained.

Besides, we provide a semi-traditional design that feeds the output of the depth network, flow network or feature network into a traditional camera pose estimation framework.

With the pose estimation result from the DL-based pose network, we design a VO/V-SLAM platform with keyframe and loop closure mechanisms.

In the LSTM architecture, a two-layer LSTM with an attention mechanism and multi-stage training is designed to mimic the ideas of keyframe selection, local/global pose graph optimization and loop closure found in traditional methods.

References

— — — — — — -

  1. R Mur-Artal, J D Tardós, “ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras”, arXiv:1610.06475, 2016
  2. J. Engel, V. Koltun and D. Cremers, “Direct Sparse Odometry”, arXiv:1607.02565, 2016
  3. Z Yin, J Shi, “GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose”, CVPR 2018.
  4. Y Zou, Z Luo, J Huang, “DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency”, ECCV, 2018
  5. Y Kim, A Kim, “Sequential Learning of Visual Tracking and Mapping Using Unsupervised Deep Neural Networks”, arXiv:1902.09826, 2019
  6. Y Li et al., “Pose Graph Optimization for Unsupervised Monocular Visual Odometry”, arXiv:1903.06315, 2019
  7. H Zhan et al., “Visual Odometry Revisited: What Should Be Learnt”, arXiv:1909.09803, 2019.
  8. U Kim, S Kim, J Kim, “SimVODIS: Simultaneous Visual Odometry, Object Detection, and Instance Segmentation”, arXiv:1911.05939, 2019.
  9. A Ranjan et al., “Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation”, CVPR, 2019
  10. L Sheng et al., “Unsupervised Collaborative Learning of Keyframe Detection and Visual Odometry Towards Monocular Deep SLAM”, ICCV, 2019
  11. T Shen et al., “Self-Supervised Learning of Depth and Motion Under Photometric Inconsistency”, ICCV Workshops, 2019.
  12. Gordon A et al., “Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras”, ICCV, 2019
  13. Y Shi et al., “Self-Supervised Learning of Depth and Ego-motion with Differentiable Bundle Adjustment”, arXiv: 1909.13163, 2019
  14. F Xue et al., “Beyond Tracking: Selecting Memory and Refining Poses for Deep Visual Odometry”, IEEE CVPR 2019.
  15. Y. Zou et al., “Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling”, arXiv:2007.10983, 2020
  16. D DeTone, T Malisiewicz, A Rabinovich, “SuperPoint: Self-Supervised Interest Point Detection and Description”, arXiv:1712.07629, 2017.
  17. M. Dusmanu et al., “D2-Net: A Trainable CNN for Joint Detection and Description of Local Features”, IEEE CVPR 2019.
  18. E Parisotto et al. “Global pose estimation with an attention-based recurrent network”, arXiv: 1802.06857, 2018.
  19. Y Tay et al., “Efficient Transformers: A Survey”, arXiv:2009.06732, 2020
  20. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., “Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting”. NeurIPS 2015

