Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships remains unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, finetuning on a single object for just one iteration yields substantial gains.
To evaluate 3D equivariance, we utilize rendered or annotated multiview correspondences from Objaverse and MVImgNet, covering both synthetic and real images. For Objaverse, we randomly select 1,000 objects and render each from 42 uniformly distributed camera views, producing 42,000 images. Dense correspondences are computed for each object across every unique ordered pair of views, resulting in 1.8 billion correspondence pairs for evaluation. Similarly, 1,000 objects are randomly drawn from MVImgNet, yielding 33.3 million annotated correspondence pairs. Since MVImgNet relies on COLMAP to reconstruct 3D points, its correspondences are sparser than those from Objaverse.
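As a concrete illustration, below is a minimal sketch of how such a correspondence-based equivariance measurement can be computed: given dense feature maps from two views and ground-truth pixel correspondences, we check how often nearest-neighbor feature matching recovers the annotated correspondence. The function name `correspondence_accuracy` and the pixel threshold are ours for illustration; this is not the released evaluation code.

```python
import torch
import torch.nn.functional as F

def correspondence_accuracy(feat_a, feat_b, pix_a, pix_b, pix_thresh=8):
    """Fraction of ground-truth correspondences recovered by nearest-neighbor
    feature matching (a hypothetical metric sketch, not the paper's exact code).

    feat_a, feat_b: (C, H, W) dense feature maps for the two views
                    (e.g., ViT patch features upsampled to image resolution).
    pix_a, pix_b:   (N, 2) integer ground-truth corresponding pixels as (x, y).
    pix_thresh:     a match counts as correct if it lands within this many
                    pixels of the annotated correspondence (assumed value).
    """
    C, H, W = feat_b.shape
    # Features at the query pixels in view A.
    q = feat_a[:, pix_a[:, 1], pix_a[:, 0]].T                # (N, C)
    # All candidate features in view B.
    kb = feat_b.reshape(C, -1).T                             # (H*W, C)
    # Cosine similarity between each query and every pixel of view B.
    sim = F.normalize(q, dim=1) @ F.normalize(kb, dim=1).T   # (N, H*W)
    idx = sim.argmax(dim=1)
    match = torch.stack([idx % W, idx // W], dim=1).float()  # (N, 2) as (x, y)
    err = (match - pix_b.float()).norm(dim=1)
    return (err < pix_thresh).float().mean().item()
```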
3D equivariance is only valuable insofar as it can be put to use. Below, we describe three mature downstream applications that require 3D equivariance, and show a correlation between the quality of a model's 3D equivariance and its performance on them.
Pose Estimation
Video Tracking
Semantic Transfer
The high-level intuition behind improving the multiview equivariance of network features is to enforce similarity between the features of pixels that correspond in 3D space. We apply LoRA to the last four blocks when finetuning large foundation models, and we additionally introduce a single convolutional layer with a kernel size of 3 and a stride of 1. This addition is motivated by the observation that ViT-family models process images as patch tokens, resulting in much lower-resolution feature maps; explicitly exchanging information between neighboring patches before interpolation yields more accurate results. A minimal sketch of both pieces follows.
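The sketch below shows the 3x3 convolutional refinement layer described above together with an assumed InfoNCE-style contrastive objective over corresponding pixels. The names `FeatureRefiner` and `correspondence_loss`, the temperature `tau`, the exact loss form, and the patch size of 14 are our assumptions for illustration; LoRA injection into the last four blocks can be handled by standard libraries such as `peft` and is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureRefiner(nn.Module):
    """3x3 conv (stride 1, padding 1) that mixes neighboring patch tokens
    before the feature map is interpolated to full resolution.
    Hypothetical module name; the conv configuration follows the text."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, stride=1, padding=1)

    def forward(self, patch_feats):            # (B, C, h, w) low-res ViT features
        x = self.conv(patch_feats)             # exchange info between neighbors
        # Upsample by the (assumed) ViT patch size of 14 to pixel resolution.
        return F.interpolate(x, scale_factor=14, mode="bilinear")

def correspondence_loss(feat_a, feat_b, pix_a, pix_b, tau=0.07):
    """InfoNCE-style loss pulling features of 3D-corresponding pixels together;
    an assumed loss form, shown for illustration.
    feat_*: (C, H, W) feature maps; pix_*: (N, 2) integer (x, y) correspondences."""
    fa = F.normalize(feat_a[:, pix_a[:, 1], pix_a[:, 0]].T, dim=1)  # (N, C)
    fb = F.normalize(feat_b[:, pix_b[:, 1], pix_b[:, 0]].T, dim=1)  # (N, C)
    logits = fa @ fb.T / tau        # (N, N); diagonal entries are true matches
    target = torch.arange(len(fa), device=fa.device)
    return F.cross_entropy(logits, target)
```

In training, refined feature maps from two views of the same object would be fed to `correspondence_loss`; per the results above, even a single gradient step on one object already yields substantial gains.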
If you find it helpful, please consider citing our work:
```bibtex
@article{you2024multiview,
  title={Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning},
  author={You, Yang and Li, Yixin and Deng, Congyue and Wang, Yue and Guibas, Leonidas},
  journal={arXiv preprint arXiv:2411.19458},
  year={2024}
}
```
If you have further questions, please feel free to email yangyou@stanford.edu or yixinli@stanford.edu.