2025
This study introduces VESSA, a novel approach for adapting vision foundation models to new domains without annotations, leveraging only short multi-view, object-centric videos. Unlike existing self-supervised strategies, VESSA carefully combines prediction-head tuning, parameter-efficient adaptation, and multi-view object observations to prevent degradation of pretrained knowledge and ensure robust adaptation. In extensive experiments, VESSA shows consistent gains across different models and datasets, demonstrating its potential for advancing visual foundation model adaptation.
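To make the combination of ingredients concrete, the following is a minimal, hypothetical sketch of this style of adaptation: the pretrained backbone weights are frozen, only a small prediction head plus low-rank (LoRA-style) adapters are trained, and the objective enforces consistency between two views of the same object. The class names, loss form, and dimensions here are illustrative assumptions, not the paper's exact method.

```python
# Hypothetical sketch: freeze a pretrained backbone, train only LoRA-style
# adapters and a prediction head, using a cross-view consistency loss.
# All names, shapes, and the loss choice are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (parameter-efficient)."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False      # preserve pretrained knowledge
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)   # start as identity w.r.t. the base layer

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

# Toy stand-in for a vision foundation model backbone.
backbone = nn.Sequential(LoRALinear(nn.Linear(32, 32)), nn.ReLU())
head = nn.Linear(32, 16)                 # trainable prediction head

params = [p for p in backbone.parameters() if p.requires_grad] + list(head.parameters())
opt = torch.optim.AdamW(params, lr=1e-3)

# Two observations of the same objects (e.g. frames from a multi-view video).
view_a, view_b = torch.randn(8, 32), torch.randn(8, 32)
log_pa = F.log_softmax(head(backbone(view_a)), dim=-1)
pb = F.softmax(head(backbone(view_b)), dim=-1).detach()  # stop-grad target view
loss = F.kl_div(log_pa, pb, reduction="batchmean")       # cross-view consistency
loss.backward()
opt.step()
```

Only the adapter and head parameters receive gradients, so the frozen backbone retains its pretrained representations while the model adapts to the new domain.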