The loop shipped. Here's what it produced.
We closed the loop. 17 models, 8 teachers, 4 architectures, $38.95 total. The manuscript is written, the baselines are run, the paradox is proven, the fix works.
What we built
- An ICLR 2027 manuscript proving the Voting Ensemble Paradox
- kompress-v8: a production compression model (0.955 heretic exact, 1.000 agent mk_in_ref with override)
- LoopKit — a loop-experiment-researcher template so anyone can scaffold from here
- Interactive docs, a Colab notebook, a Telegram bot with council, an MCP-ready evaluation server
- 18 model cards on HuggingFace with full benchmarks, training details, and cross-references
What we learned
Label quality is the bottleneck. Not model capacity (v11: ModernBERT-large, 352M params, collapsed to 0.906). Not data quantity (v15: 983 pairs, regressed to 0.878). Only label-quality interventions worked:
| Intervention | Version | Heretic | Δ |
|---|---|---|---|
| Self-labeling | v3→v4 | 0.942→0.943 | +0.001 |
| C3 distillation (Qwen teacher) | v8 | 0.955 | +0.012 |
| λ-ablation (loss weight) | v8→v17→v16 | 0.955→0.963→0.972 | +0.017 |
The λ-ablation mapped the full Pareto frontier: higher must-keep weight in the loss function improves heretic precision linearly but kills compression. At 3x (v8): 0.955 heretic, 15% compression. At 5x (v17): 0.963 heretic, 3.7% compression. At 10x (v16): 0.972 heretic, 2.8% compression. The tradeoff is fundamental — you cannot have perfect precision and aggressive compression simultaneously with this architecture.
The dead ends are the research. 11 of 17 models were dead ends. We published them all:
| Version | Attempt | Heretic | Lesson |
|---|---|---|---|
| v6 | Agent-distribution training | 0.962 | Dead end — more conservative |
| v7 | Sliding-window self-labeling | 0.956 | Dead end — regressed precision |
| v9 | C3-only, no generic | 0.921 | Overfit — need diversity |
| v11 | Larger encoder (352M) | 0.906 | Capacity ≠ precision |
| v12 | Qwen3-Coder teacher | 0.949 | Teacher too conservative |
| v13 | GLM scenarios + regex | 0.951 | Regex teacher too conservative |
| v14 | Council-controlled training | 0.882 | Concept proven, needs work |
| v15 | Everything bagel (983 pairs) | 0.878 | More data ≠ better |
Every dead end taught us something. The loop doesn't produce models — it produces understanding.
What's next
The loop has stopped improving. v8 is the production model — 0.955 heretic exact, 1.000 agent mk_in_ref with the must-keep override, 15% token savings. The Voting Ensemble Paradox is proven. The paper is written.
But the loop pattern itself is now open source. LoopKit — part of the growing loop engineering ecosystem with Addy Osmani, LangChain, and Cobus Greyling — is available for anyone to clone and extend. Build your own loop. Find your own v8.
Until the next loop starts.
— peter
This post is the closing entry in the kompress experiment. See also: the heretic eval, LoopKit, all 18 models on HuggingFace, the ultrawhale training repo, and headroom.