Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships remains unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, finetuning on a single object for just one iteration yields substantial gains.
To evaluate 3D equivariance, we utilize rendered or annotated multiview correspondences from Objaverse and MVImgNet, covering both synthetic and real images. For Objaverse, we randomly select 1,000 objects and render each from 42 uniformly distributed camera views, producing 42,000 images. Dense correspondences are computed for each object across every unique ordered pair of views, resulting in 1.8 billion correspondence pairs for evaluation. Similarly, 1,000 objects are randomly drawn from MVImgNet, yielding 33.3 million annotated correspondence pairs. Since MVImgNet relies on COLMAP to reconstruct 3D points, its correspondences are sparser than those from Objaverse.
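As a concrete illustration, below is a minimal sketch of how such a correspondence-based equivariance measurement can be computed: given dense feature maps from two views and ground-truth pixel correspondences, we check how often nearest-neighbor feature matching recovers the annotated correspondence. The function name `correspondence_accuracy` and the pixel threshold are ours for illustration; this is not the released evaluation code.

```python
import torch
import torch.nn.functional as F

def correspondence_accuracy(feat_a, feat_b, pix_a, pix_b, pix_thresh=8):
    """Fraction of ground-truth correspondences recovered by nearest-neighbor
    feature matching (a hypothetical metric sketch, not the paper's exact code).

    feat_a, feat_b: (C, H, W) dense feature maps for the two views
                    (e.g., ViT patch features upsampled to image resolution).
    pix_a, pix_b:   (N, 2) integer ground-truth corresponding pixels as (x, y).
    pix_thresh:     a match counts as correct if it lands within this many
                    pixels of the annotated correspondence (assumed value).
    """
    C, H, W = feat_b.shape
    # Features at the query pixels in view A.
    q = feat_a[:, pix_a[:, 1], pix_a[:, 0]].T                # (N, C)
    # All candidate features in view B.
    kb = feat_b.reshape(C, -1).T                             # (H*W, C)
    # Cosine similarity between each query and every pixel of view B.
    sim = F.normalize(q, dim=1) @ F.normalize(kb, dim=1).T   # (N, H*W)
    idx = sim.argmax(dim=1)
    match = torch.stack([idx % W, idx // W], dim=1).float()  # (N, 2) as (x, y)
    err = (match - pix_b.float()).norm(dim=1)
    return (err < pix_thresh).float().mean().item()
```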
3D equivariance is only valuable insofar as it can be put to use. Below, we describe three mature downstream applications that require 3D equivariance, and show a correlation between the quality of a model's 3D equivariance and its performance on them.
Pose Estimation
Video Tracking
Semantic Transfer
The high-level intuition behind improving the multiview equivariance of network features is to enforce similarity between the features of pixels that correspond in 3D space. We apply LoRA to the last four blocks when finetuning large foundation models, and we additionally introduce a single convolutional layer with a kernel size of 3 and a stride of 1. This addition is motivated by the observation that ViT-family models process images as patch tokens, resulting in much lower-resolution feature maps; explicitly exchanging information between neighboring patches before interpolation yields more accurate results. A minimal sketch of both pieces follows.
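The sketch below shows the 3x3 convolutional refinement layer described above together with an assumed InfoNCE-style contrastive objective over corresponding pixels. The names `FeatureRefiner` and `correspondence_loss`, the temperature `tau`, the exact loss form, and the patch size of 14 are our assumptions for illustration; LoRA injection into the last four blocks can be handled by standard libraries such as `peft` and is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureRefiner(nn.Module):
    """3x3 conv (stride 1, padding 1) that mixes neighboring patch tokens
    before the feature map is interpolated to full resolution.
    Hypothetical module name; the conv configuration follows the text."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, stride=1, padding=1)

    def forward(self, patch_feats):            # (B, C, h, w) low-res ViT features
        x = self.conv(patch_feats)             # exchange info between neighbors
        # Upsample by the (assumed) ViT patch size of 14 to pixel resolution.
        return F.interpolate(x, scale_factor=14, mode="bilinear")

def correspondence_loss(feat_a, feat_b, pix_a, pix_b, tau=0.07):
    """InfoNCE-style loss pulling features of 3D-corresponding pixels together;
    an assumed loss form, shown for illustration.
    feat_*: (C, H, W) feature maps; pix_*: (N, 2) integer (x, y) correspondences."""
    fa = F.normalize(feat_a[:, pix_a[:, 1], pix_a[:, 0]].T, dim=1)  # (N, C)
    fb = F.normalize(feat_b[:, pix_b[:, 1], pix_b[:, 0]].T, dim=1)  # (N, C)
    logits = fa @ fb.T / tau        # (N, N); diagonal entries are true matches
    target = torch.arange(len(fa), device=fa.device)
    return F.cross_entropy(logits, target)
```

In training, refined feature maps from two views of the same object would be fed to `correspondence_loss`; per the results above, even a single gradient step on one object already yields substantial gains.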
If you find it helpful, please consider citing our work:
```bibtex
@article{you2024multiview,
  title={Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning},
  author={You, Yang and Li, Yixin and Deng, Congyue and Wang, Yue and Guibas, Leonidas},
  journal={arXiv preprint arXiv:2411.19458},
  year={2024}
}
```
If you have further questions, please feel free to email yangyou@stanford.edu or yixinli@stanford.edu.