CVPR 2026 Findings

DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer

Soichiro Okazaki¹, Tatsuya Sasaki¹, Hiroki Ohashi¹
¹Hitachi, Ltd., Research and Development Group
[Figure: DetRefiner teaser result]

DetRefiner refines object detection results by recalibrating detector confidence scores using global and local visual features extracted from foundation models such as DINOv3.
As a plug-and-play module, DetRefiner improves diverse open-vocabulary object detection (OVOD) models, such as Grounding DINO and LLMDet, without accessing internal detector features or retraining the base model.

Overview

The process of refining object detection results with DetRefiner consists of three steps:

  1. Refine global-level and patch-level DINOv3 features with a lightweight 2-layer Transformer encoder, then pool box-level representations using ROI Align.
  2. Match class and ROI features with CLIP text embeddings. CLIP-based distillation for semantic alignment is used only during training.
  3. Merge image-level, box-level, and base detector scores to recalibrate detections, without accessing internal detector features or retraining the detector.
[Figure: Overview of DetRefiner]
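
The box-level pooling and text matching in steps 1 and 2 can be illustrated with a toy sketch. This is not the DetRefiner implementation: the patch features and text embeddings below are hand-written 2-D vectors standing in for DINOv3 and CLIP outputs, `pool_box` is a simplified average over patch centres rather than true ROI Align (no bilinear sampling), and the use of cosine similarity as the matching score is an assumption made for illustration. All function and variable names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def pool_box(patch_grid, box, patch_size):
    """Average the features of patches whose centres fall inside `box`.

    Simplified stand-in for ROI Align (assumption, not the paper's operator).
    patch_grid[r][c] is the feature vector of the patch at row r, column c;
    box is (x1, y1, x2, y2) in pixels.
    """
    x1, y1, x2, y2 = box
    selected = []
    for r, row in enumerate(patch_grid):
        for c, feat in enumerate(row):
            cx = (c + 0.5) * patch_size  # patch centre, x
            cy = (r + 0.5) * patch_size  # patch centre, y
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                selected.append(feat)
    if not selected:
        raise ValueError("box covers no patch centre")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

# Toy 2x2 grid of patch features for a 32x32 image with 16-pixel patches.
patch_grid = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
# Toy text embeddings, assumed to live in the same space as the patch features.
text_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

# Pool the top row of patches and score the box against each class name.
box_feature = pool_box(patch_grid, (0, 0, 32, 16), patch_size=16)
box_scores = {name: cosine(box_feature, emb) for name, emb in text_embeddings.items()}
```

In this toy setup the pooled box feature aligns with the "cat" embedding and is orthogonal to the "dog" embedding, so the box-level score favors "cat". The real module would also need the learned projection that places visual and text features in a shared space, which is omitted here.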

DetRefiner has the following inference characteristics:

  • Uses only the input image and detector outputs, i.e., labels, boxes, and confidence scores.
  • Does not change box coordinates or category labels.
  • Rescues low-scored true positives and suppresses false positives.
  • Generalizes to different detectors with a single trained DetRefiner, without retraining.
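
The inference contract listed above can be sketched as a small function that takes a detection plus image-level and box-level scores and returns the same detection with only its confidence changed. The fusion rule used here, a weighted geometric mean with hand-picked weights, is an assumption for illustration; this page does not specify how DetRefiner combines the three scores. The `Detection` class, `recalibrate` function, and all numeric values are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Detection:
    label: str
    box: tuple      # (x1, y1, x2, y2); passed through unchanged
    score: float    # confidence in [0, 1]

def recalibrate(det, image_score, box_score, weights=(0.5, 0.25, 0.25)):
    """Return a copy of `det` whose score fuses base, image-level, and box-level scores.

    Assumed fusion rule: weighted geometric mean. Label and box are never modified.
    """
    w_base, w_image, w_box = weights
    eps = 1e-6  # guards against a zero score wiping out the product
    fused = (
        max(det.score, eps) ** w_base
        * max(image_score, eps) ** w_image
        * max(box_score, eps) ** w_box
    )
    return replace(det, score=fused)

# A true positive the base detector scored low, and a confident false positive.
low_scored_tp = Detection("cat", (10, 20, 110, 220), 0.20)
confident_fp = Detection("dog", (50, 60, 90, 100), 0.80)

# Illustrative visual evidence: strong support for "cat", weak support for "dog".
refined_tp = recalibrate(low_scored_tp, image_score=0.9, box_score=0.9)
refined_fp = recalibrate(confident_fp, image_score=0.1, box_score=0.1)
```

With these illustrative inputs, the low-scored true positive ends up ranked above the false positive, while both detections keep their original labels and boxes. Because the function only reads labels, boxes, and scores, it does not depend on which base detector produced them.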

Quantitative Results

DetRefiner consistently improves the performance of various base detectors under zero-shot or cross-domain settings on datasets such as COCO, LVIS, Pascal VOC, and ODinW-13. The evaluation uses four types of base detectors (GLIP, MM-Grounding DINO, Grounding DINO, and LLMDet) with varying model sizes. The following table shows the results on LVIS.

[Table: LVIS quantitative results]

Qualitative Results

The four examples below compare detection results before and after applying DetRefiner. In each example:

  • Top: base detector predictions (left) vs. predictions refined by DetRefiner (right)
  • Bottom: predictions based on the class vector (left) and patch vector (right)
[Figure: Qualitative comparison result 1]
[Figure: Qualitative comparison result 2]
[Figure: Qualitative comparison result 3]
[Figure: Qualitative comparison result 4]

BibTeX

@article{okazaki2026detrefiner,
title={DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer},
author={Okazaki, Soichiro and Sasaki, Tatsuya and Ohashi, Hiroki},
journal={arXiv preprint arXiv:2605.10190},
year={2026}
}