Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

Yulong Zhang, Tianyi Liang, Xinyue Huang, Erfei Cui, Xu Guo, Pei Chu, Chenhui Li, Ru Zhang, Wenhai Wang, Gongshen Liu

CVPR 2026 Accepted

CE-OCR Framework

Given an input image, multiple VLMs independently generate OCR predictions. Pairwise similarities among these predictions define a consensus distribution, from which Consensus Entropy is derived. A threshold gate accepts low-entropy ensemble outputs and routes high-entropy cases to a stronger model for rephrasing, enabling self-verifying and self-improving OCR without supervision.

Abstract

OCR is a core capability for vision-language models and an important source of high-quality data for LLM training, yet strong VLMs still lack reliable sample-level quality control. Consensus Entropy (CE) is a training-free and model-agnostic reliability metric that measures agreement entropy across multiple VLM predictions. Building on CE, CE-OCR verifies OCR results through ensemble agreement, selects low-entropy outputs, and improves efficiency with adaptive routing. The method requires no supervision and can be integrated as a plug-and-play quality-control layer for OCR pipelines.

Why Consensus Entropy Works

Prediction behaviors across entropy levels

Low-entropy predictions form a tight cluster around the ground truth, while medium- and high-entropy cases show increasing disagreement among VLMs.

Normalized entropy analysis compares combination strategies and motivates CE as a compact agreement signal bounded between single-point and uniform prediction distributions.
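The bounds mentioned above can be written out as a short worked relation. This is a sketch under the assumption that CE is a Shannon entropy over a consensus distribution $p = (p_1, \dots, p_n)$ on $n$ candidate predictions; the symbols $p$, $n$, and $\hat{H}$ are introduced here for illustration, and how $p$ is constructed from pairwise similarities is not specified on this page.

```latex
% Shannon entropy of an assumed consensus distribution p over n candidates
H(p) = -\sum_{i=1}^{n} p_i \log p_i , \qquad \sum_{i=1}^{n} p_i = 1 .

% Lower bound: all mass on a single prediction (single-point distribution)
% Upper bound: mass spread evenly over all candidates (uniform distribution)
0 \;\le\; H(p) \;\le\; \log n .

% Dividing by the upper bound gives a normalized value that is comparable
% across ensembles of different size
\hat{H}(p) = \frac{H(p)}{\log n} \in [0, 1] .
```

Under this reading, $\hat{H} = 0$ corresponds to full agreement and $\hat{H} = 1$ to maximally fragmented predictions.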

Main Results

OCR verification

+20.0% to +42.1% overall F1 improvement over VLM-as-Judge across GPT-4o, Qwen2-VL-7B, and Qwen2-VL-72B reference settings.

CE-Ensemble

Positive gains in most 3-5 model ensembles on OCRBench, with larger ensembles producing higher and more stable improvements.

CE-OCR routing

On OCRBench-V2, CE-OCR improves over the best single model on English OCR, Math, Element Parsing, and Chinese OCR, and over CE-Ensemble on all of these except Element Parsing (33.8 vs. 34.0).

OCRBench performance under CE thresholds

Varying the CE threshold shows how routing uncertain cases improves OCRBench accuracy while controlling the rephrasing budget.

Performance comparison across token lengths

CE-based routing remains effective across token lengths compared with self-consistency and single-model baselines.

Key Quantitative Results

OCR verification: CE vs. VLM-as-Judge

Consensus Entropy is computed directly from VLM predictions and substantially improves F1 over prompting a VLM to judge OCR correctness.

Reference VLM	VLM-as-Judge F1	CE F1	Relative Gain
GPT-4o	40.0	48.0	+20.0%
Qwen2-VL-7B	36.1	51.3	+42.1%
Qwen2-VL-72B	39.8	51.0	+28.1%
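The Relative Gain column is consistent with the usual relative-improvement formula, (CE F1 − Judge F1) / Judge F1. The short check below recomputes it from the two F1 columns of the table; the function name `relative_gain` is chosen here for illustration, and the formula is inferred from the numbers rather than stated on this page.

```python
# F1 scores copied from the table: (VLM-as-Judge F1, CE F1)
f1_scores = {
    "GPT-4o": (40.0, 48.0),
    "Qwen2-VL-7B": (36.1, 51.3),
    "Qwen2-VL-72B": (39.8, 51.0),
}


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Round to one decimal place, matching the precision used in the table.
gains = {
    name: round(relative_gain(judge_f1, ce_f1), 1)
    for name, (judge_f1, ce_f1) in f1_scores.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```

The recomputed values match the reported +20.0%, +42.1%, and +28.1%.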

CE-Ensemble gains with more participating models

Across 3-5 model ensembles on OCRBench, CE-based output selection improves on average over both the weakest and the strongest individual model, and it beats the strongest model in most cases, with the share of positive cases growing as models are added.

# Models	Avg. gain over weakest	Avg. gain over best	Positive cases over best	Positive cases over average
3	50.2	4.2	66.2%	94.7%
4	67.5	12.7	82.2%	100.0%
5	78.3	17.8	91.1%	100.0%

CE-OCR on OCRBench-V2

With GPT-4o rephrasing for routed samples, CE-OCR improves over CE-Ensemble and the best single model on most OCRBench-V2 categories.

Method	English OCR	Math	Element Parsing	Chinese Overall
Best single model	65.6	47.7	32.6	44.2
CE-Ensemble	67.2	50.1	34.0	45.7
CE-OCR	71.6	53.1	33.8	48.0

Method Details

Collect independent OCR hypotheses

For each image, CE-OCR queries several VLMs or OCR systems independently. The method does not assume access to logits, confidence scores, or training labels; it only uses the resulting text strings.
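A minimal sketch of this collection step follows, assuming each backend can be wrapped as a plain function from image bytes to a text string. The `OcrModel` type, the `collect_hypotheses` helper, and the stub backends are hypothetical names for illustration and are not part of the released package.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any VLM or OCR system wrapped as bytes -> text.
OcrModel = Callable[[bytes], str]


def collect_hypotheses(image: bytes, models: Dict[str, OcrModel]) -> List[str]:
    """Query every model independently and keep only the text it returns."""
    hypotheses: List[str] = []
    for name, run_model in models.items():
        try:
            text = run_model(image)
        except Exception:
            # Assumption: a failing backend simply contributes no hypothesis.
            continue
        # Only the output string is kept; no logits or confidences are needed.
        hypotheses.append(text.strip())
    return hypotheses


def broken_backend(image: bytes) -> str:
    raise RuntimeError("backend unavailable")


# Stub backends standing in for real VLM calls.
stub_models: Dict[str, OcrModel] = {
    "model_a": lambda image: "Hello World",
    "model_b": lambda image: " Hello Wrld\n",
    "model_c": broken_backend,
    "model_d": lambda image: "Hallo World",
}

hypotheses = collect_hypotheses(b"fake-image-bytes", stub_models)
print(hypotheses)
```

Because only strings are exchanged, closed API models and local models can be mixed freely in the same ensemble.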

Convert pairwise agreement into an entropy signal

Pairwise string similarities define a consensus distribution for every candidate. Predictions that agree with most other candidates have lower entropy; fragmented predictions have higher entropy and are treated as uncertain.
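To make the idea concrete, the sketch below scores each candidate by its mean normalized edit distance to the other candidates, so that candidates agreeing with the majority receive low scores. This is an assumed stand-in for the agreement signal, not the paper's exact CE formula; the function names `levenshtein`, `normalized_distance`, and `disagreement_scores` are chosen here for illustration.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def disagreement_scores(candidates: List[str]) -> List[float]:
    """Mean normalized distance from each candidate to all the others."""
    n = len(candidates)
    if n < 2:
        return [0.0] * n
    return [
        sum(normalized_distance(c, other)
            for k, other in enumerate(candidates) if k != i) / (n - 1)
        for i, c in enumerate(candidates)
    ]


candidates = ["Hello World", "Hello Wrld", "Hallo World"]
scores = disagreement_scores(candidates)
best = candidates[scores.index(min(scores))]
print(scores, best)
```

In this example "Hello World" is one edit away from each of the other two candidates, while those two are two edits apart from each other, so "Hello World" receives the lowest score and is selected.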

Select, ensemble, or route based on CE

Low-entropy outputs can be directly accepted or selected as ensemble results. High-entropy cases are routed to a stronger model for rephrasing, giving a controllable trade-off between OCR quality and inference cost.
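The threshold gate described above can be sketched as follows. For brevity this version uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure instead of edit distance, and scores candidates by mean dissimilarity rather than the paper's CE; `disagreement`, `select_or_route`, the threshold value, and the `rephrase` callback are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def disagreement(candidates: List[str]) -> List[float]:
    """Mean (1 - similarity ratio) from each candidate to the others."""
    n = len(candidates)
    if n < 2:
        return [0.0] * n
    return [
        sum(1.0 - SequenceMatcher(None, c, other).ratio()
            for k, other in enumerate(candidates) if k != i) / (n - 1)
        for i, c in enumerate(candidates)
    ]


def select_or_route(
    candidates: List[str],
    threshold: float,
    rephrase: Callable[[List[str]], str],
) -> Tuple[str, str]:
    """Accept the most agreed-upon candidate, or defer to a stronger model."""
    scores = disagreement(candidates)
    best_index = min(range(len(scores)), key=scores.__getitem__)
    if scores[best_index] <= threshold:
        return candidates[best_index], "accepted"
    # High disagreement: pay for one call to the stronger model.
    return rephrase(candidates), "routed"


def stub_rephrase(candidates: List[str]) -> str:
    # Placeholder for a call to a stronger VLM; returns a fixed string here.
    return "<stronger-model output>"


agreeing = ["Hello World", "Hello Wrld", "Hallo World"]
conflicting = ["Invoice 2024", "Total due: 41", "xq#7"]

accepted = select_or_route(agreeing, threshold=0.2, rephrase=stub_rephrase)
routed = select_or_route(conflicting, threshold=0.2, rephrase=stub_rephrase)
print(accepted, routed)
```

Raising the threshold accepts more ensemble outputs and lowers cost; lowering it sends more samples to the stronger model, which is the quality versus cost trade-off described above.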

Usage

pip install consensus-entropy

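from consensus_entropy import calculate_consensus_entropy, get_best_ocr_result

# Candidate transcriptions of the same image, e.g. from three different VLMs
ocr_results = ["Hello World", "Hello Wrld", "Hallo World"]
# Consensus Entropy values for the candidates
entropy_values = calculate_consensus_entropy(ocr_results, task_type="ocr")
# Candidate selected by CE, together with its entropy
best_result, best_entropy = get_best_ocr_result(ocr_results, task_type="ocr")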

Supplementary Findings

The supplementary material further validates Consensus Entropy beyond the main OCRBench setting, including calibration, stochastic same-model ensembles, output cleaning, distance metrics, computational cost, and human-labeled OCR evaluation.

Analysis	Takeaway	Evidence
Same-family / identical-model ensembles	CE remains effective even when candidates come from similar model families or stochastic runs of the same model.	Identical model with T=0.7: +2.1% average gain, +3.2% max gain, positive in 100% of cases.
CE vs. ROVER	Traditional voting-style OCR fusion can fail on open-ended VQA/OCR tasks, while CE maintains positive gains.	CE-Ensemble average: 66.3 vs. Best Single 60.3 and ROVER 50.4.
Non-OCR VQA tasks	CE generalizes to semantic settings when paired with an appropriate distance metric.	Reported gains include Science-VQA +3.9%, Math-VQA +14.0%, and Visual Understanding +8.9%.
Computational cost	Edit-distance CE is extremely lightweight for long OCR strings.	1,000 long-text pairs: edit distance total 0.163s on CPU, compared with embedding-based metrics requiring GPU memory.

Human-Labeled OCR Dataset

We use a human-labeled OCR evaluation set to validate whether Consensus Entropy aligns with human quality judgments. The dataset supports analysis of OCR correctness, calibration, and agreement-based verification across diverse document types.

The public Hugging Face dataset card is being confirmed and will be linked here and in the GitHub README once finalized.

Reproducible Demo

The released supplementary package includes a compact OCRBench demo for CE filtering and CE-based multi-model ensembling. It consumes VLMEvalKit-style OCRBench Excel outputs and produces JSON/XLSX analysis files.

pip install pandas Levenshtein numpy openpyxl

# CE threshold analysis
python cal_avg_scores_CE_thresholds.py   ./samples/InternVL2_5-8B_OCRBench.xlsx   ./samples/Qwen2-VL-7B-Instruct_OCRBench.xlsx   -o ./res_avg_scores_CE/test.json

# CE-based multi-model ensemble
python ensemble_consensus_entropy_from_xlsx.py   ./samples/InternVL2_5-8B_OCRBench.xlsx   ./samples/Qwen2-VL-7B-Instruct_OCRBench.xlsx   ./samples/Qwen2.5-VL-7B_OCRBench.xlsx   -o ./res_ensemble/

BibTeX

@misc{zhang2025consensusentropyharnessingmultivlm,
  title={Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR},
  author={Yulong Zhang and Tianyi Liang and Xinyue Huang and Erfei Cui and Xu Guo and Pei Chu and Chenhui Li and Ru Zhang and Wenhai Wang and Gongshen Liu},
  year={2025},
  eprint={2504.11101},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.11101}
}