D3VO Implementation

In my most recent Master's degree term project, I delved into the theoretical side of state estimation by exploring a recently developed monocular visual odometry technique: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry (D3VO) [1]. This work proposes a framework that leverages two self-supervised deep networks, PoseNet, which predicts the pose and brightness transformation between consecutive frames, and DepthNet, which predicts depth and photometric uncertainty, to boost both the front-end tracking and the back-end optimization of a visual odometry system. The novelty of these self-supervised networks lies in their ability to train solely on monocular video rather than on stereo data, which is often more expensive from a hardware perspective. In this project, we aimed to understand and implement D3VO. My role was to adapt PoseNet, first proposed in Monodepth2 [2], to also predict the brightness transformation parameters, and to update the loss function by incorporating this transformation into the re-projection error and adding a regularizing term for these parameters. We completed a working implementation, with further improvements left as future work that could boost the performance of both the deep networks and the visual odometry system. Check out our GitHub here.
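To make the loss modification concrete, here is a minimal numpy sketch of the idea: the source image is corrected by predicted affine brightness parameters (a, b) before computing the re-projection residual, the residual is weighted by the predicted photometric uncertainty, and a regularizer keeps (a, b) near the identity transform. This is a simplified illustration, not our actual implementation: it uses a plain L1 residual (the paper combines SSIM and L1), and the function names and the `w_ab` weight are hypothetical.

```python
import numpy as np

def photometric_loss(I_target, I_source_warped, a, b, sigma, w_ab=0.01):
    """Uncertainty-weighted re-projection loss with affine brightness
    correction, sketched after D3VO's formulation.

    I_target        : target frame intensities
    I_source_warped : source frame warped into the target view
    a, b            : predicted affine brightness parameters (a*I + b)
    sigma           : predicted per-pixel photometric uncertainty (> 0)
    w_ab            : weight of the brightness regularizer (assumed value)
    """
    # apply the predicted brightness transformation to the warped source
    corrected = a * I_source_warped + b
    # simplified L1 residual (the paper uses a SSIM + L1 photometric error)
    residual = np.abs(I_target - corrected)
    # heteroscedastic uncertainty weighting in negative log-likelihood form
    data_term = (residual / sigma + np.log(sigma)).mean()
    # regularizing term pulling (a, b) toward the identity transform (1, 0)
    reg_term = w_ab * ((a - 1.0) ** 2 + b ** 2)
    return data_term + reg_term
```

In training, (a, b) come from the PoseNet head alongside the relative pose, and sigma comes from DepthNet, so minimizing this loss jointly fits the brightness change and discounts pixels the network deems unreliable.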

  1. Yang, N., Stumberg, L. V., Wang, R., & Cremers, D. (2020). D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1281-1292).
  2. Godard, C., Mac Aodha, O., Firman, M., & Brostow, G. J. (2019). Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3828-3838).