DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer

Overview

DetRefiner refines object detection results in three steps:

1. Refines global-level and patch-level DINOv3 features with a lightweight 2-layer Transformer Encoder, then pools box-level representations with ROI Align.
2. Matches class and ROI features against CLIP text embeddings. CLIP-based distillation for semantic alignment is applied only during training.
3. Merges the image-level, box-level, and base detector scores to recalibrate detections, without accessing internal detector features or retraining the detector.
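
The third step can be illustrated with a small sketch. This page does not give the fusion formula, so the weighted geometric mean below, the default weights, and the name `fuse_scores` are assumptions chosen for illustration and should not be read as DetRefiner's actual rule.

```python
import math


def fuse_scores(det_score, image_score, box_score,
                weights=(0.5, 0.25, 0.25), eps=1e-6):
    """Combine three confidences in [0, 1] into one score.

    ASSUMPTION: a weighted geometric mean stands in for the fusion rule,
    which is not specified here. `weights` are hypothetical values.
    """
    scores = (det_score, image_score, box_score)
    total = sum(weights)
    # Average in log space; eps guards against log(0).
    log_mean = sum(w * math.log(max(s, eps))
                   for w, s in zip(weights, scores)) / total
    return math.exp(log_mean)


# A low detector score backed by strong image- and box-level evidence rises.
rescued = fuse_scores(det_score=0.2, image_score=0.9, box_score=0.9)

# A high detector score with weak image- and box-level evidence drops.
suppressed = fuse_scores(det_score=0.8, image_score=0.1, box_score=0.1)
```

Under this assumed rule, agreement between the three sources leaves the score unchanged, while disagreement pulls the detector score toward the image-level and box-level evidence.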

DetRefiner has the following inference characteristics:

Uses only the input image and detector outputs, i.e., labels, boxes, and confidence scores.
Does not change box coordinates or category labels.
Rescues low-scoring true positives and suppresses false positives.
Generalizes to different detectors from a single trained model, without further retraining.
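
The inference contract listed above (only scores change, and only the image and detector outputs are read) can be sketched as a thin wrapper. The `Detection` type, the `refine_scores` function, and the `rescore` callable are hypothetical names introduced for this illustration, not part of any released DetRefiner code.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class Detection:
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    score: float


def refine_scores(image: Any,
                  detections: List[Detection],
                  rescore: Callable[[Any, Detection], float]) -> List[Detection]:
    """Return copies of `detections` with recalibrated scores.

    `rescore` is a stand-in for a trained refiner (hypothetical). It sees
    only the image and one detector output, never detector internals.
    Labels and boxes are copied through unchanged.
    """
    refined = []
    for det in detections:
        new_score = min(1.0, max(0.0, rescore(image, det)))  # clamp to [0, 1]
        refined.append(replace(det, score=new_score))
    return refined
```

Because the wrapper depends only on labels, boxes, and scores, the same `rescore` callable could be placed behind any detector that emits those three outputs.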

Quantitative Results

DetRefiner consistently improves the performance of various base detectors under zero-shot or cross-domain settings on datasets such as COCO, LVIS, Pascal VOC, and ODinW-13. The evaluation covers four base detectors (GLIP, MM-Grounding DINO, Grounding DINO, and LLMDet) at varying model sizes. The following table shows the results on LVIS.

Qualitative Results

Four qualitative comparisons of detection results before and after applying DetRefiner.

Top: base detector predictions (left) vs. predictions refined by DetRefiner (right)
Bottom: predictions based on the class vector (left) and patch vector (right)

BibTeX

@article{okazaki2026detrefiner,
  title={DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer},
  author={Okazaki, Soichiro and Sasaki, Tatsuya and Ohashi, Hiroki},
  journal={arXiv preprint arXiv:2605.10190},
  year={2026}
}
