UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

CVPR 2026

Congpei Qiu1,4,* Zhaoyu Hu1,* Wei Ke1 Zhuotao Tian3,4 Yanhao Wu1 Tong Zhang2,†

1Xi'an Jiaotong University 2University of Chinese Academy of Sciences 3Harbin Institute of Technology (Shenzhen) 4Shenzhen Loop Area Institute

* Equal contribution. † Corresponding author.

TL;DR

UniRefiner treats any token that fails to preserve location-aligned semantics as spurious. It then teaches pre-trained ViTs to keep regular image tokens intact while pushing spurious responses into boundary registers, which enables lightweight post-hoc refinement of even 8B-scale models in roughly 30 minutes.

PCA dynamics throughout the refining process, comparing frozen vanilla features against UniRefiner under the same rendering setup.

Reference image containing a cat, a dog, and a car.
Reference image used for the six PCA dynamics visualizations.
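
PCA feature maps like the ones on this page are commonly produced by projecting each patch token onto the top three principal components and showing the result as RGB. The sketch below illustrates that general recipe only. The function name `pca_rgb`, the 32x32 grid (what a /16 model yields at 512 px), and the random stand-in features are assumptions for illustration, and the exact rendering setup used for these figures may differ.

```python
import numpy as np

def pca_rgb(tokens: np.ndarray, grid_hw: tuple[int, int]) -> np.ndarray:
    """Project (N, D) patch tokens onto their top-3 principal components
    and rescale each component to [0, 1] so it can be displayed as RGB."""
    h, w = grid_hw
    assert tokens.shape[0] == h * w, "expected one token per grid cell"
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                       # (N, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-8)    # per-channel min-max
    return rgb.reshape(h, w, 3)

# Hypothetical stand-in for real backbone features: 32x32 patches, 64 dims.
rng = np.random.default_rng(0)
demo_tokens = rng.normal(size=(32 * 32, 64))
demo_map = pca_rgb(demo_tokens, (32, 32))
```

In maps rendered this way, spurious tokens tend to show up as isolated patches whose color is unrelated to the object underneath them.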

SigLIP 2 So400M

ViT-So400M/16-512
UniRefiner refinement dynamics for SigLIP 2 So400M/16-512 from iteration 0 to 1000.
Frozen vanilla PCA feature map for SigLIP 2 So400M/16-512.

EVA-CLIP-8B

ViT-8B/14-448
UniRefiner refinement dynamics for EVA-CLIP-8B/14-448 from iteration 0 to 1000.
Frozen vanilla PCA feature map for EVA-CLIP-8B/14-448.

SigLIP 2 Giant

ViT-G/16-384
UniRefiner refinement dynamics for SigLIP 2 Giant/16-384 from iteration 0 to 1000.
Frozen vanilla PCA feature map for SigLIP 2 Giant/16-384.

RICE-ViT-L

ViT-L/16-560
UniRefiner refinement dynamics for RICE-ViT-L/16-560 from iteration 0 to 1000.
Frozen vanilla PCA feature map for RICE-ViT-L/16-560.

DINOv2 ViT-g/14

ViT-g/14
UniRefiner refinement dynamics for DINOv2 ViT-g/14.
Frozen vanilla PCA feature map for DINOv2 ViT-g/14.

Qwen3.5-9B ViT

ViT/16-512
UniRefiner refinement dynamics for Qwen3.5-9B ViT/16-512.
Frozen vanilla PCA feature map for Qwen3.5-9B ViT/16-512.

Preliminary

Spurious tokens are broader than simple outliers.

UniRefiner starts from a simple view: for dense prediction, any token that no longer preserves location-aligned semantics should be treated as spurious.

Fixed Pattern

Tokens that remain similar across unrelated images instead of reflecting local visual content.

Global Proxy

Tokens that drift toward scene-level context and stop encoding the semantics of their own spatial position.

Attention Hijackee

Tokens that are overwritten by stronger neighbors through attention flow and lose their own spatial identity.

Preliminary figure illustrating Fixed Pattern, Global Proxy, and Attention Hijackee tokens.
UniRefiner characterizes three recurring token failure modes that corrupt dense spatial representations in large ViTs.
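
To make the three failure modes concrete, the sketch below computes one simple diagnostic score per mode. These heuristics are assumptions chosen for illustration, not the criteria used in the paper: the function names, the use of the mean token as the "global" reference, and the off-diagonal attention maximum as a hijacking signal are all hypothetical.

```python
import numpy as np

def _unit(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-8)

def fixed_pattern_score(tokens_a: np.ndarray, tokens_b: np.ndarray) -> np.ndarray:
    """Cosine similarity between same-position tokens of two unrelated images.
    A value near 1 suggests the token ignores local content."""
    return np.sum(_unit(tokens_a) * _unit(tokens_b), axis=-1)        # (N,)

def global_proxy_score(tokens: np.ndarray) -> np.ndarray:
    """Cosine similarity between each token and the image's mean token
    (assumed here as a stand-in for scene-level context)."""
    global_dir = _unit(tokens.mean(axis=0, keepdims=True))           # (1, D)
    return (_unit(tokens) @ global_dir.T)[:, 0]                      # (N,)

def hijack_score(attn: np.ndarray) -> np.ndarray:
    """Largest attention weight each token places on a *different* token,
    given a row-stochastic (N, N) attention matrix."""
    off_diag = np.asarray(attn, dtype=float).copy()
    np.fill_diagonal(off_diag, 0.0)
    return off_diag.max(axis=1)                                      # (N,)

# Toy data: two random "images" that share one identical artifact token.
rng = np.random.default_rng(0)
img_a = rng.normal(size=(16, 8))
img_b = rng.normal(size=(16, 8))
artifact = rng.normal(size=8)
img_a[3] = artifact
img_b[3] = artifact
fp = fixed_pattern_score(img_a, img_b)

toy_attn = np.array([[0.9, 0.1],
                     [0.8, 0.2]])
hj = hijack_score(toy_attn)
```

In the toy data, position 3 holds the same vector in both images, so its fixed-pattern score is the highest of the sixteen. In the toy attention matrix, the second token sends most of its attention to the first, which is the kind of pattern the hijack score is meant to surface.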

Method

Filter the spurious tokens, then teach registers to absorb them.

UniRefiner method overview with token filtering and contrastive registers.

Spurious token filtering. A frozen teacher branch separates regular and spurious tokens using similarity cues, attention flow, and register feedback.

Contrastive register distillation. Student image tokens align to filtered regular tokens, while boundary registers align to detected spurious tokens.

Adaptive refinement. As training progresses, learned registers become stronger spurious detectors and improve the next round of filtering.
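
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's objective: a plain cosine-alignment loss stands in for the contrastive formulation, registers are matched to spurious tokens by nearest-register similarity, and the names `distill_loss`, `register_feedback`, and the 0.99 threshold are hypothetical.

```python
import numpy as np

def _unit(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-8)

def distill_loss(student_tokens, student_regs, teacher_tokens, spurious):
    """Cosine-alignment stand-in for the distillation objective.
    spurious: boolean (N,) mask produced by the teacher-side filter."""
    s, r, t = _unit(student_tokens), _unit(student_regs), _unit(teacher_tokens)
    regular = ~spurious
    # Student image tokens follow the teacher only at regular positions.
    token_loss = (1.0 - np.sum(s[regular] * t[regular], axis=-1)).mean() if regular.any() else 0.0
    # Each detected spurious token should be close to at least one register.
    reg_loss = (1.0 - (t[spurious] @ r.T).max(axis=-1)).mean() if spurious.any() else 0.0
    return float(token_loss + reg_loss)

def register_feedback(teacher_tokens, regs):
    """Max cosine similarity of each teacher token to any register.
    High values can feed the next round of filtering."""
    return (_unit(teacher_tokens) @ _unit(regs).T).max(axis=-1)

# Toy round: 6 teacher tokens, 2 of them flagged as spurious by the filter.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 4))
spurious = np.array([False, False, True, False, True, False])

# Idealized student: copies regular tokens, registers equal the spurious ones.
perfect_loss = distill_loss(teacher.copy(), teacher[spurious], teacher, spurious)

# Adaptive step: registers now score teacher tokens for the next filter pass.
feedback = register_feedback(teacher, teacher[spurious])
next_mask = feedback > 0.99   # hypothetical threshold
```

With the idealized student, the loss is zero up to floating-point error, and the register feedback flags the positions that were marked spurious in the first place. In actual training the student and registers would be updated by gradient descent between filtering rounds, which this sketch omits.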

Experiment

Refined vision-language backbones can exceed vision-specialized ones.

Teaser summary figure comparing dense prediction quality before and after UniRefiner across tasks.
Across segmentation, depth estimation, and high-resolution vision-language understanding, UniRefiner consistently improves dense spatial representations in large pre-trained backbones.

UniRefiner significantly improves dense prediction performance on both vision-centric and vision-language backbones. The full set of quantitative and qualitative results is available in the paper.

Citation

@article{qiu2026unirefiner,
  title={UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register},
  author={Qiu, Congpei and Hu, Zhaoyu and Ke, Wei and Tian, Zhuotao and Wu, Yanhao and Zhang, Tong},
  journal={arXiv preprint arXiv:2605.19622},
  year={2026},
  eprint={2605.19622},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.19622}
}