This work introduces VESSA, a novel approach for adapting vision foundation models to new domains without annotations, leveraging only short multi-view, object-centric videos. Unlike existing self-supervised strategies, VESSA combines prediction-head tuning, parameter-efficient adaptation, and multi-view object observations to prevent degradation of pretrained knowledge and to ensure robust adaptation.
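As a rough illustration of how these three ingredients can fit together, the sketch below pairs a frozen backbone (a single linear layer standing in for a pretrained ViT) with trainable low-rank (LoRA-style) adapters and a trainable projection head, optimized with a DINO-style self-distillation loss between two views of the same object. This is a minimal sketch under those assumptions, not VESSA's actual implementation; all module and parameter names (LoRALinear, ProjectionHead, rank, tau_s, tau_t, the EMA momentum) are illustrative.

```python
# Minimal sketch: parameter-efficient adaptation (LoRA) + trainable head
# + multi-view self-distillation. Names are illustrative assumptions,
# not VESSA's actual implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (W + B A)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + F.linear(F.linear(x, self.A), self.B)


class ProjectionHead(nn.Module):
    """Trainable prediction head mapping features to prototype logits."""
    def __init__(self, dim: int, num_prototypes: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, num_prototypes))

    def forward(self, x):
        return self.mlp(x)


def distillation_loss(student_logits, teacher_logits,
                      tau_s: float = 0.1, tau_t: float = 0.04):
    """Cross-entropy between teacher and student prototype distributions."""
    t = F.softmax(teacher_logits / tau_t, dim=-1).detach()
    s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()


# Two frames of the same object from different viewpoints form a positive
# pair; only the LoRA adapters and the head receive gradients.
dim = 384
backbone = nn.Sequential(nn.Flatten(),
                         LoRALinear(nn.Linear(3 * 32 * 32, dim)))
student = nn.Sequential(backbone, ProjectionHead(dim))
teacher = copy.deepcopy(student)             # EMA copy, never backpropagated
for p in teacher.parameters():
    p.requires_grad = False

view_a, view_b = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
loss = distillation_loss(student(view_a), teacher(view_b))
loss.backward()

# After each optimizer step, update the teacher as an EMA of the student:
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.996).add_(ps, alpha=0.004)
```

In this sketch, freezing the pretrained weights and restricting updates to the low-rank adapters and the prediction head is what limits drift away from the original representation, which is one plausible way to realize the knowledge preservation the summary describes.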