UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

CVPR 2026

Congpei Qiu1,4,* Zhaoyu Hu1,* Wei Ke1 Zhuotao Tian3,4 Yanhao Wu1 Tong Zhang2,†

1Xi'an Jiaotong University 2University of Chinese Academy of Sciences 3Harbin Institute of Technology (Shenzhen) 4Shenzhen Loop Area Institute

* Equal contribution. † Corresponding author.

PDF (coming soon) · Code (coming soon) · BibTeX (coming soon)

TL;DR

UniRefiner identifies any token that fails to preserve location-aligned semantics as spurious, then teaches pre-trained ViTs to preserve regular image tokens while pushing spurious responses into boundary registers, enabling lightweight post-hoc refinement of even 8B-scale models within roughly 30 minutes.

PCA dynamics throughout the refinement process, comparing frozen vanilla features against UniRefiner under the same rendering setup.

Reference image containing a cat, a dog, and a car.
Reference image used for the six PCA dynamics visualizations.
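For orientation, renderings like these typically follow a standard recipe: fit a 3-component PCA on the patch tokens of the reference image and map the components to RGB, one frame per checkpoint. The helper below is an illustrative sketch of that recipe, not the exact rendering code behind these visualizations; the function name and shapes are assumptions.

```python
from sklearn.decomposition import PCA

def pca_to_rgb(tokens, grid_hw):
    """Project patch tokens (N, D) to an RGB image via 3-component PCA.

    tokens:  (N, D) array, one row per patch token.
    grid_hw: (H, W) patch grid with H * W == N.
    """
    comps = PCA(n_components=3).fit_transform(tokens)  # (N, 3)
    comps -= comps.min(axis=0)                         # shift each component to start at 0
    comps /= comps.max(axis=0) + 1e-8                  # scale each component into [0, 1]
    return comps.reshape(*grid_hw, 3)                  # (H, W, 3) pseudo-color image
```

Fitting the PCA on the same reference image for both branches keeps the colors directly comparable; if the basis is instead refit as features evolve, a fast-shifting token distribution moves the projection itself, which is the instability noted for Qwen3-ViT in the footnote below.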

EVA-CLIP-8B

UniRefiner refinement dynamics for EVA-CLIP-8B from iteration 0 to 1000.
Frozen vanilla feature dynamics for EVA-CLIP-8B from iteration 0 to 1000.

SigLIP2-So400m

UniRefiner refinement dynamics for SigLIP2-So400m from iteration 0 to 1000.
Frozen vanilla feature dynamics for SigLIP2-So400m from iteration 0 to 1000.

RICE-ViT

UniRefiner refinement dynamics for RICE-ViT from iteration 0 to 1000.
Frozen vanilla feature dynamics for RICE-ViT from iteration 0 to 1000.

SigLIP2-Giant

UniRefiner refinement dynamics for SigLIP2-Giant from iteration 0 to 1000.
Frozen vanilla feature dynamics for SigLIP2-Giant from iteration 0 to 1000.

CLIP ViT-G/14

UniRefiner refinement dynamics for CLIP ViT-G/14 from iteration 0 to 1000.
Frozen vanilla feature dynamics for CLIP ViT-G/14 from iteration 0 to 1000.

Qwen3-ViT*

UniRefiner refinement dynamics for Qwen3-ViT from iteration 0 to 1000.
Frozen vanilla feature dynamics for Qwen3-ViT from iteration 0 to 1000.

* Qwen3-ViT appears less stable because the spurious-token ratio drops sharply during refinement, so the per-iteration PCA basis itself varies more noticeably.

Preliminary

Spurious tokens are a broader class than simple outliers.

UniRefiner starts from a simple view: for dense prediction, any token that no longer preserves location-aligned semantics should be treated as spurious.

Fixed Pattern

Tokens that remain similar across unrelated images instead of reflecting local visual content.

Global Proxy

Tokens that drift toward scene-level context and stop encoding the semantics of their own spatial position.

Attention Hijackee

Tokens that are overwritten by stronger neighbors through attention flow and lose their own spatial identity.

Preliminary figure illustrating Fixed Pattern, Global Proxy, and Attention Hijackee tokens.
UniRefiner characterizes three recurring token failure modes that corrupt dense spatial representations in large ViTs.
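To make these failure modes concrete, the sketch below flags each with a simple statistic: cross-image similarity for fixed patterns, similarity to the global mean token for global proxies, and concentration of incoming attention mass for hijackees. The function name, thresholds, and exact cues are illustrative assumptions for exposition; the paper's teacher-side filter combines similarity cues, attention flow, and register feedback.

```python
import torch
import torch.nn.functional as F

def flag_spurious(tokens, tokens_other, attn,
                  tau_fix=0.9, tau_glob=0.9, tau_hij=0.5):
    """Heuristic flags for the three failure modes on one image.

    tokens:       (N, D) patch tokens of the current image.
    tokens_other: (N, D) patch tokens of an unrelated image, same positions.
    attn:         (N, N) attention weights, one row per query token.
    """
    t = F.normalize(tokens, dim=-1)
    t_other = F.normalize(tokens_other, dim=-1)

    # Fixed pattern: nearly identical to the token at the same position in an
    # unrelated image, i.e. content-independent rather than local.
    fixed = (t * t_other).sum(-1) > tau_fix

    # Global proxy: high similarity to the scene-level mean token instead of
    # encoding the semantics of its own spatial position.
    g = F.normalize(t.mean(0, keepdim=True), dim=-1)
    global_proxy = (t @ g.T).squeeze(-1) > tau_glob

    # Attention hijackee: the token's attention mass collapses onto a single
    # dominant neighbor, so its own spatial identity is overwritten.
    hijacked = attn.max(dim=-1).values > tau_hij

    return fixed | global_proxy | hijacked
```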

Method

Filter the spurious tokens, then teach registers to absorb them.

UniRefiner method overview with token filtering and contrastive registers.

Spurious token filtering. A frozen teacher branch separates regular and spurious tokens using similarity cues, attention flow, and register feedback.

Contrastive register distillation. Student image tokens align to filtered regular tokens, while boundary registers align to detected spurious tokens.

Adaptive refinement. As training progresses, learned registers become stronger spurious detectors and improve the next round of filtering.
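Putting the three steps together, below is a minimal sketch of what the distillation objective could look like, assuming cosine-alignment terms, a precomputed spurious mask from the teacher branch, and K boundary registers appended to the student's token sequence. The function name, loss form, and shapes are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def refine_loss(student_tokens, registers, teacher_tokens, spurious):
    """Sketch of contrastive register distillation for one image.

    student_tokens: (N, D) student image tokens.
    registers:      (K, D) learnable boundary registers from the student.
    teacher_tokens: (N, D) frozen-teacher image tokens.
    spurious:       (N,) boolean mask from the teacher-side filter.
    """
    s = F.normalize(student_tokens, dim=-1)
    r = F.normalize(registers, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1).detach()

    # Regular tokens: the student preserves the teacher's location-aligned
    # semantics via a cosine-alignment term.
    keep = 1.0 - (s[~spurious] * t[~spurious]).sum(-1).mean()

    # Spurious tokens: each detected spurious response is absorbed by its
    # best-matching boundary register instead of lingering in image tokens.
    absorb = torch.tensor(0.0, device=s.device)
    if spurious.any():
        sim = r @ t[spurious].T            # (K, N_spurious) register-token similarity
        absorb = 1.0 - sim.max(dim=0).values.mean()

    return keep + absorb
```

Because the registers are trained to match spurious teacher responses, their similarity to incoming tokens doubles as a detection signal, which is the register feedback that sharpens the next round of filtering in the adaptive refinement step above.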

Experiment

Refined vision-language backbones can exceed vision-specialized ones.

Teaser summary figure comparing dense prediction quality before and after UniRefiner across tasks.
Across segmentation, depth estimation, and high-resolution vision-language understanding, UniRefiner consistently improves dense spatial representations in large pre-trained backbones.

UniRefiner significantly improves dense prediction performance on both vision-centric and vision-language backbones. For the full set of quantitative and qualitative results, please refer to the paper.

Citation

% BibTeX placeholder
% The citation block will be added once the final proceedings metadata is available.