the voting ensemble paradox — resolved
You build an ensemble of compression models, each fine-tuned from a different checkpoint. You expect the ensemble to be better than any single model. It's worse.
That's the voting ensemble paradox.
The paradox
We formalized it: under k-of-N drop voting, the ensemble eviction indicator equals the k-th order statistic of the per-voter indicators. The ensemble collapses to its weakest member on every stratum.
In plain terms: if you have 3 models voting on whether to keep or drop each token, and any 2 agree, the ensemble score is the second-worst model's score. The best model's judgment gets diluted by the weaker ones. Adding more models makes it worse, not better.
We proved this (Theorem 1 + Corollary 1 + Remark 1) and validated it empirically: an ensemble of v3, v3.1, and v3.2 scored 0.931 heretic exact — worse than v4 alone at 0.967. The ensemble wasn't just not better. It was actively harmful.
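The order-statistic identity is easy to check by brute force. The sketch below is not the paper's proof, only a sanity check under one reading of the rule: each voter emits a drop indicator (1 = drop), the ensemble drops a token when at least k voters say drop, and "k-th order statistic" means the k-th largest indicator. The function names are hypothetical.

```python
from itertools import product

def ensemble_drop(votes, k):
    """k-of-N drop voting: drop the token when at least k voters vote drop."""
    return int(sum(votes) >= k)

def kth_largest(votes, k):
    """k-th largest per-voter drop indicator (k is 1-indexed)."""
    return sorted(votes, reverse=True)[k - 1]

def identity_holds(n, k):
    """Compare both definitions on every possible vote pattern of n voters."""
    return all(
        ensemble_drop(votes, k) == kth_largest(votes, k)
        for votes in product((0, 1), repeat=n)
    )

# Three voters under a 2-of-3 rule; two of them vote drop.
votes = (1, 0, 1)
print(ensemble_drop(votes, k=2), kth_largest(votes, k=2))  # prints: 1 1
print(identity_holds(3, 2))                                # prints: True
```

Because the ensemble decision is a single order statistic, it is always exactly one voter's decision on each token, never a blend of all of them.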
The fix
The fix is a 3.0× weighted cross-entropy penalty on critical-syntactic tokens: signal names, file paths, exit codes, compiler flags, anything the agent needs intact.
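A minimal sketch of what such a penalty could look like, in plain Python rather than the actual training code. It assumes a per-token binary keep/drop label and a boolean mask marking critical-syntactic tokens, and it normalizes by the sum of weights. The post does not specify the normalization, so that choice is an assumption, as are all names here.

```python
import math

CRITICAL_WEIGHT = 3.0  # the 3.0x penalty described in the post

def weighted_keep_loss(keep_probs, labels, is_critical,
                       critical_weight=CRITICAL_WEIGHT):
    """Weighted mean binary cross-entropy over tokens.

    keep_probs:  predicted P(keep) for each token
    labels:      1 = keep, 0 = drop
    is_critical: True for critical-syntactic tokens
    """
    total, weight_sum = 0.0, 0.0
    for p, y, crit in zip(keep_probs, labels, is_critical):
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
        ce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        w = critical_weight if crit else 1.0
        total += w * ce
        weight_sum += w
    return total / weight_sum  # assumption: normalize by total weight

# Same prediction error on the second token, with and without the critical flag.
probs, labels = [0.9, 0.2], [1, 1]
plain = weighted_keep_loss(probs, labels, [False, False])
crit = weighted_keep_loss(probs, labels, [False, True])
print(crit > plain)  # prints: True
```

The same mistake costs more when it lands on a critical token, which is what pushes the model toward keeping those tokens at the expense of compression.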
Three mechanisms work together:
- Mechanism A (training): The 3.0× loss weight forces the model to prioritize must-keep tokens during fine-tuning. We mapped the full Pareto frontier: 3× → 0.955 heretic (15% compression), 5× → 0.963 (3.7%), 10× → 0.972 (2.8%). The tradeoff is fundamental.
- Mechanism B (inference override): A post-inference regex safety net catches what the subword tokenizer splits across tokens. Compiler flags, hex addresses, file paths: patterns the single-token classifier misses. Deployed in headroom PR #1419. Pushes agent mk_in_ref from 0.652 to 1.000.
- Mechanism C (self-labeling loop): The model labels its own training data, a stronger teacher corrects the mistakes, the corrected labels train the next version. v3→v4 internalized the override behavior (delta collapsed from +0.027 to 0.000). v8 used Qwen2.5-7B as the teacher for C3 self-distillation.
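The inference override in Mechanism B can be sketched roughly as follows. This is not the code from the headroom PR: the patterns are hypothetical stand-ins, and the token offsets come from a toy regex splitter rather than a real subword tokenizer. The idea is only that any token overlapping a matched span is forced to keep, whatever the classifier said.

```python
import re

# Hypothetical patterns for illustration; the deployed rules are not shown in the post.
CRITICAL_PATTERNS = [
    re.compile(r"(?<!\S)--?[A-Za-z][\w=+.,/-]*"),      # flags such as -O2, --std=c++17
    re.compile(r"\b0x[0-9A-Fa-f]+\b"),                 # hex addresses
    re.compile(r"(?:\.{0,2}/)?(?:[\w.-]+/)+[\w.-]+"),  # paths with at least one slash
]

def critical_spans(text):
    """Character spans (start, end) of every pattern match in the text."""
    return [m.span() for pat in CRITICAL_PATTERNS for m in pat.finditer(text)]

def apply_override(text, token_offsets, keep_mask):
    """Force keep=True for each token that overlaps a critical span.

    token_offsets: (start, end) character offsets, one pair per token
    keep_mask:     classifier decisions, True = keep
    """
    spans = critical_spans(text)
    out = list(keep_mask)
    for i, (t_start, t_end) in enumerate(token_offsets):
        if any(t_start < s_end and s_start < t_end for s_start, s_end in spans):
            out[i] = True  # the override only ever adds keeps, never drops
    return out

text = "gcc -O2 src/main.c crashed at 0x7ffe12"
# Toy stand-in for a subword tokenizer: it splits flags and paths into pieces.
pieces = list(re.finditer(r"\w+|[^\w\s]", text))
offsets = [m.span() for m in pieces]
mask = apply_override(text, offsets, [False] * len(offsets))
print([m.group() for m, keep in zip(pieces, mask) if keep])
# prints: ['-', 'O2', 'src', '/', 'main', '.', 'c', '0x7ffe12']
```

Working on character spans instead of individual tokens is what lets the override recover patterns that only exist once the split pieces are read together.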
kompress-v8
The production model is a 149M-parameter dual-head ModernBERT, trained via C3 self-distillation with a Qwen2.5-7B teacher on 97 carefully labeled pairs at a 33% C3 ratio.
| Metric | Value |
|---|---|
| Heretic exact (32 prompts) | 0.955 |
| Keep rate | 0.854 |
| Override delta | 0.000 |
| Agent mk_in_ref (with override) | 1.000 |
| Token savings | 15% |
| Base model | kompress-v2-base |
The experiment
17 models trained. 8 teachers. 4 architectures. $38.95 total.
| Version | What we tried | Heretic | Lesson |
|---|---|---|---|
| v2 | — | 0.975 | Precision ceiling |
| v4 | Self-labels | 0.943 | Override internalized |
| v6 | Agent-distribution | 0.962 | Dead end |
| v8 | Qwen2.5 teacher | 0.955 | Production |
| v9 | C3-only | 0.921 | Overfit |
| v11 | Larger encoder | 0.906 | Capacity ≠ precision |
| v14 | Council training | 0.882 | Concept proven |
| v16 | 10× weight | 0.972 | Pareto endpoint |
11 of 17 were dead ends. We published them all. The dead ends are the research.
Open science
The interactive paper is live at kompress.vaked.dev — WebGL neural field background, live paradox simulation, baseline comparison.
- Paper PDF: peterlodri-sec.github.io/longrun-eval-kompress/paper/main.pdf
- GitHub: github.com/peterlodri-sec/longrun-eval-kompress
- Model: huggingface.co/PeetPedro/kompress-v8
- All 18 models: huggingface.co/PeetPedro
- Experiment logs: pocoo.vaked.dev
ICLR 2027 submission. All code, data, models open source.
This is an inner loop of the ultrawhale project. The outer loop cost $37.19 in DeepSeek API fees for the agent that orchestrated the experiments. The inner loop cost $1.76 in GPU compute on vast.ai RTX 4090s. The whole thing cost less than a conference registration.
Label quality is the bottleneck, not model capacity or data quantity. Loop engineering works. The loop shipped.
— peter
This is the research paper companion post. See also: the kompress heretic eval (full experiment log), the loop shipped (closing essay), LoopKit (starter kit), and the interactive paper.