← all posts

Fine-tuning an orchestrator — encoding 17 experiments into one model

After 17 kompress models, 8 teachers, 4 architectures, and $38.95 in GPU compute, we had a problem. The experiment knowledge was trapped in blog posts, state files, and our heads. Every new researcher joining the loop would need to re-learn everything we discovered.

So we fine-tuned a model to encode it all.


The Orchestrator

kompress-superpower-orchestrator is a LoRA fine-tune of Qwen2.5-7B-Instruct on 117 conversation pairs. Each pair encodes a decision, a diagnosis, or a planning session from the 17-model experiment. The model has six superpowers:

Superpower What it does Example
Experiment Designer Hypothesis → training plan "Try GLM-5.2 as teacher" → spawns script, estimates cost
Failure Diagnostician Metrics → root cause heretic=0.878 → "v15 dilution pattern: 983 pairs drown signal"
Council Engine Results → SHIP/RETRAIN/PIVOT heretic=0.963 → "RETRAIN: +0.008 heretic but -11.3% compression"
Sub-Agent Spawner Delegates to train/eval/label agents spawn_train("v18", config) → vast.ai instance
Budget Tracker Knows costs "$0.13/run on RTX 4090, $0.15/version total"
State Keeper Remembers all 17 versions "v8=production, 11 dead ends, Pareto at λ=3/5/10"

Training Data

We built 117 conversation pairs from the experiment history. Each pair is a multi-turn conversation where the orchestrator uses tools:

{
  "messages": [
    {"role": "system", "content": "You are kompress-superpower-orchestrator..."},
    {"role": "user", "content": "My model regressed to 0.878. Data: 983 pairs. Why?"},
    {"role": "assistant", "content": "Checking experiment history...", "tool_calls": [
      {"name": "check_status", "arguments": {}}
    ]},
    {"role": "tool", "content": "{\"versions\": 17, \"v15\": \"0.878 — data dilution\"}"},
    {"role": "assistant", "content": "v15 pattern: 983 pairs diluted the C3 signal. 61% generic data drowns teacher labels. Fix: reduce to 300 pairs at 33% C3 ratio. Want me to spawn that?"}
  ]
}

The data covers five categories:

Category Pairs What it teaches
Diagnosis 30 Regression → root cause → fix (v9 overfit, v11 collapse, v15 dilution)
Planning 40 Q&A about teachers, architecture, data ratios, loss weights
Council 20 Metrics → SHIP/RETRAIN/PIVOT with reasoning
Spawn 15 Tool calls: train, eval, label, deploy
Multi-turn 12 Debugging conversations with follow-up questions

Fine-Tuning

We trained on vast.ai's cheapest RTX 4090 ($0.30/hr):

Parameter Value
Base model Qwen/Qwen2.5-7B-Instruct
Method LoRA (r=16, alpha=32)
Quantization 4-bit NF4 (BitsAndBytes)
Trainable params 40M / 7.6B (0.53%)
Epochs 3
Batch size 1 (gradient accumulation ×16)
Learning rate 2e-4
NEFTune noise α=5.0
Total cost ~$0.30

We tried DoRA (Weight-Decomposed LoRA) but it OOM'd on 24GB — needs an A100. NEFTune (noisy embeddings) fit easily and improves chat quality by adding controlled noise during training.

The loss curve was noisy (padding tokens in the causal LM loss) but the model learned the patterns. Four checkpoints saved (steps 8, 16, 24, 32).


Results

The model is live at PeetPedro/kompress-superpower-orchestrator.

Usage:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base, "PeetPedro/kompress-superpower-orchestrator")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "system", "content": "You are kompress-superpower-orchestrator..."},
    {"role": "user", "content": "My model regressed to 0.878. What happened?"}
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

What We Learned

1. 117 pairs is enough for domain-specific fine-tuning. You don't need 10K examples. 117 high-quality, domain-dense pairs encoding every decision from a real experiment is sufficient for LoRA.

2. DoRA doesn't fit on 24GB for 7B models. Weight-decomposed LoRA uses ~30% more memory. Stick with regular LoRA for consumer GPUs.

3. NEFTune adds value for zero memory cost. Noisy embeddings during training improve chat naturalness. It's a one-line addition: TrainerCallback that adds Gaussian noise.

4. Causal LM loss with padding tokens is noisy but functional. The loss values were large (82M → 0) with NaN gradients, suggesting the padding mask wasn't properly applied. The model still learned — next iteration should use labels = [-100 if pad else id for id in input_ids].

5. Function-calling data is scarce. Most open-source fine-tuning datasets lack tool-call examples. Our 15 spawn pairs are a small but valuable contribution to the ecosystem.


What's Next


Links


"The loop doesn't just produce models. It produces understanding."

93303e3c0efc7ad254ab4c44d2f91bb8