Refining CLIP's Spatial Awareness: A Visual-centric Perspective


ICLR 2025

1Xi'an Jiaotong University   2EPFL   3University of Chinese Academy of Sciences

TL;DR: We design a refiner module that extracts dense, spatially-aware features directly from CLIP, improving region-language alignment with a visual-centric focus.


Training the Refiner on frozen CLIP for high-quality dense features


Motivation

The necessity of a visual-centric constraint: Region-Language Alignment (RLA) methods align CLIP's region-level visual features with text-oriented features, often overlooking spatial details. This degrades the spatial awareness of CLIP ViTs, a property critical for dense prediction tasks. To address this, we introduce a pixel-level visual-centric constraint that complements and strengthens RLA.

Abstract

Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer a notable loss of spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates this degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on the intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.

Method Overview

(a) A Refiner is trained on top of the frozen CLIP ViT using a contrastive learning paradigm with global-to-local dynamics. The Refiner enhances CLIP's noisy dense features, encoding high-quality local semantics (see the first sketch below).

(b) A visual-centric branch is integrated into recent Region-Language Alignment methods, distilling the refined spatial correlations from the Refiner into the CLIP ViT as a spatial constraint. This combines the strengths of the visual-language and visual-centric perspectives (see the second sketch below).
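
The first sketch below is a minimal, unofficial illustration of stage (a): a small Refiner module on top of a frozen CLIP ViT, optimized with an InfoNCE-style global-to-local contrastive loss. The module design (Refiner), the loss name (global_to_local_loss), and the temperature value are our assumptions for illustration, not the released implementation.

# Minimal sketch (not the official code): a lightweight Refiner on frozen CLIP,
# trained with an assumed global-to-local contrastive objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Refiner(nn.Module):
    """A small transformer block that refines CLIP's noisy dense (patch) tokens."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens):            # (B, N, D) dense tokens from frozen CLIP
        return self.proj(self.block(patch_tokens))

def global_to_local_loss(refined_local, clip_global, temperature=0.07):
    """Contrast each image's pooled local feature against the global CLIP features
    of all images in the batch (InfoNCE-style); an assumed form of the objective."""
    local = F.normalize(refined_local.mean(dim=1), dim=-1)   # (B, D) pooled local feature
    glob = F.normalize(clip_global, dim=-1)                  # (B, D) CLIP global feature
    logits = local @ glob.t() / temperature                  # (B, B) similarity logits
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Usage: CLIP stays frozen; only the Refiner receives gradients.
# with torch.no_grad():
#     patch_tokens, clip_global = frozen_clip_vit(images)    # dense + global features
# loss = global_to_local_loss(refiner(patch_tokens), clip_global)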
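
The second sketch shows one plausible form of the distillation in stage (b), assuming the spatial correlations are pairwise cosine similarities between patch tokens and that the student's correlation map is matched to the Refiner's detached map; the function names and the MSE objective are illustrative assumptions rather than the paper's exact formulation.

# Minimal sketch (an assumed form of Spatial Correlation Distillation):
# distill the Refiner's pairwise spatial correlations into the fine-tuned CLIP ViT.
import torch
import torch.nn.functional as F

def spatial_correlation(dense_feats):
    """Pairwise cosine similarity between all patch tokens: (B, N, D) -> (B, N, N)."""
    feats = F.normalize(dense_feats, dim=-1)
    return feats @ feats.transpose(1, 2)

def scd_loss(student_dense, teacher_dense):
    """Match the student's correlation map to the (detached) teacher map."""
    s_corr = spatial_correlation(student_dense)
    t_corr = spatial_correlation(teacher_dense).detach()
    return F.mse_loss(s_corr, t_corr)

# Usage alongside a Region-Language Alignment loss (illustrative weighting):
# total_loss = rla_loss + lambda_scd * scd_loss(student_tokens, refiner_tokens)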


Visualization

Here, we showcase visualizations of zero-shot semantic segmentation results, highlighting the effectiveness of our method in enhancing CLIP's spatial awareness.

[Image comparisons: zero-shot semantic segmentation, Ours vs. CLIPSelf]

BibTeX

@inproceedings{qiu2025refining,
  title={Refining {CLIP}'s Spatial Awareness: A Visual-Centric Perspective},
  author={Congpei Qiu and Yanhao Wu and Wei Ke and Xiuxiu Bai and Tong Zhang},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=38No4B8sx6}
}