

(a) A Refiner is trained on top of frozen CLIP with a global-to-local contrastive learning paradigm. The Refiner denoises CLIP's dense features, encoding high-quality local semantics (see the first sketch below).
(b) A visual-centric branch is integrated into recent Region-Language Alignment methods, distilling the Refiner's refined spatial correlations into the CLIP ViT as a spatial constraint (see the second sketch below). This design combines the strengths of the visual-language and visual-centric perspectives.
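
For intuition, here is a minimal PyTorch sketch of a global-to-local contrastive objective of the kind described in (a). The `Refiner` architecture, the mean-pooling of local patch tokens, and the InfoNCE temperature are illustrative assumptions rather than the paper's exact design; the one property the sketch does preserve is that only the Refiner is trainable while CLIP stays frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Refiner(nn.Module):
    """Lightweight trainable head over frozen CLIP dense features (assumed design)."""
    def __init__(self, dim: int = 768, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, dim)

    def forward(self, dense: torch.Tensor) -> torch.Tensor:
        # dense: (B, N, D) patch tokens from the frozen CLIP ViT
        return self.proj(self.blocks(dense))

def global_to_local_contrastive(local_patches, global_embs, temperature=0.07):
    """InfoNCE pulling pooled refined patch features of a local crop toward
    CLIP's global embedding of the same crop (positives on the diagonal)."""
    local = F.normalize(local_patches.mean(dim=1), dim=-1)  # (B, D)
    glob = F.normalize(global_embs, dim=-1)                 # (B, D)
    logits = local @ glob.t() / temperature                 # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for frozen CLIP outputs.
B, N, D = 4, 196, 768
refiner = Refiner(D)
dense = torch.randn(B, N, D)  # frozen CLIP dense (patch) features of local crops
cls_embs = torch.randn(B, D)  # frozen CLIP global embeddings of the same crops
loss = global_to_local_contrastive(refiner(dense), cls_embs)
loss.backward()               # gradients flow only into the Refiner
```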
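And a companion sketch of the spatial-correlation distillation in (b): pairwise patch correlations computed from the Refiner's output serve as a detached teacher for the CLIP ViT being fine-tuned. The cosine-correlation form, the MSE matching loss, and the loss weighting are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def patch_correlation(feats: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity matrix over patch tokens: (B, N, D) -> (B, N, N)."""
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.transpose(1, 2)

def spatial_correlation_loss(student_dense, teacher_dense):
    """Match the tuned ViT's patch correlations to the Refiner's (detached)."""
    s = patch_correlation(student_dense)
    t = patch_correlation(teacher_dense).detach()
    return F.mse_loss(s, t)

# The constraint would be added to a region-language alignment objective;
# `alignment_loss` and the 0.5 weight below are placeholders, not paper values.
B, N, D = 2, 196, 768
student = torch.randn(B, N, D, requires_grad=True)  # dense feats of tuned CLIP ViT
teacher = torch.randn(B, N, D)                      # refined feats from the Refiner
total = spatial_correlation_loss(student, teacher)  # + 0.5 * alignment_loss
total.backward()
```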
Here we showcase zero-shot semantic segmentation visualizations, highlighting how our method enhances CLIP's spatial awareness.
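
As a rough guide to how such maps are produced, the sketch below classifies each dense patch token against CLIP text embeddings of the class names and upsamples the result to pixel resolution. The tensor shapes, the source of the text embeddings, and the bilinear upsampling are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def zero_shot_segment(dense, text_embs, hw, image_size):
    """dense: (B, N, D) refined patch features; text_embs: (C, D) class embeddings.
    hw: (H_patches, W_patches); returns per-pixel class ids of shape (B, H, W)."""
    dense = F.normalize(dense, dim=-1)
    text = F.normalize(text_embs, dim=-1)
    logits = dense @ text.t()                        # (B, N, C) patch-class scores
    B, N, C = logits.shape
    h, w = hw
    logits = logits.transpose(1, 2).reshape(B, C, h, w)
    logits = F.interpolate(logits, size=image_size, mode="bilinear",
                           align_corners=False)      # patch grid -> pixel grid
    return logits.argmax(dim=1)                      # (B, H, W)

# Toy usage: 14x14 patch grid, 21 classes, 224x224 output map.
seg = zero_shot_segment(torch.randn(1, 196, 512), torch.randn(21, 512),
                        (14, 14), (224, 224))
print(seg.shape)  # torch.Size([1, 224, 224])
```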
@inproceedings{qiu2025refining,
title={Refining {CLIP}'s Spatial Awareness: A Visual-Centric Perspective},
author={Congpei Qiu and Yanhao Wu and Wei Ke and Xiuxiu Bai and Tong Zhang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=38No4B8sx6}
}