← all posts

Local M3/M1 dogfeed loop: headroom + context window math

The dogfeed loop running on OpenRouter free tier has one hard limit: latency. Free models have rate limits and you're waiting on remote inference. On an M3 or M1 Mac with MLX, that constraint disappears — you're running at ~15-50 tok/s locally with no rate limiting and no cost per token.

But the local constraint is different: context window. MLX-served models are typically 8K-32K context. The dogfeed loop eats context fast. This is where headroom's compression changes the economics completely.


The math

A single dogfeed iteration: question (~150 tokens) + answer (~600 tokens) + metadata (~50 tokens) = ~800 tokens per record.

At 8K context, uncompressed: 10 records of history before the window fills.

With headroom's 90.9% compression on dogfeed records (measured in the previous post): 72 tokens per compressed record.

At 8K context, compressed: 111 records of history before the window fills. 10x deeper memory.

On a 32K context model (Qwen3.5-14B on MLX): 400 records uncompressed → 4,400 compressed.

This isn't a marginal improvement — it's the difference between a loop that forgets everything after a few minutes and one that can see days of its own output.


The refactor: local MLX path

The current ultrawhale loop hits OpenRouter. The M3/M1 refactor adds a second inference path:

# services/loop/loop.py

def ask(prompt: str, model: str) -> str:
    """Route to local MLX or remote OpenRouter based on model prefix."""
    if model.startswith("mlx://"):
        return _ask_mlx(model.removeprefix("mlx://"), prompt)
    return _ask_openrouter(model, prompt)

def _ask_mlx(model_id: str, prompt: str) -> str:
    """Call local MLX server (mlx_lm.server on port 8080)."""
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": model_id,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1024,
            "temperature": 0.7,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

Start the local server before the loop:

# M3 Pro: Qwen3.5-14B fits in 16GB unified memory
mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-4bit --port 8080

Then run the loop with the local model:

MODEL=mlx://mlx-community/Qwen2.5-14B-Instruct-4bit python -m services.loop.main

Headroom integration in the loop

The compression now wraps the stored response before it enters SQLite. This means the context that gets passed to subsequent iterations is already compressed — the loop's memory is pre-compressed at write time, not at read time:

# services/loop/loop.py

from headroom import compress

def store_iteration(q: str, a: str, model: str, topic: str) -> None:
    """Compress answer before storing — reduces context window usage downstream."""
    try:
        result = compress(
            [{"role": "assistant", "content": a}],
            model=model,
        )
        compressed_a = result.messages[0]["content"]
        ratio = result.compression_ratio
    except Exception:
        compressed_a = a
        ratio = 1.0
    
    db.insert_record(
        user_message=q,
        response=compressed_a,
        model=model,
        topic=topic,
        compression_ratio=ratio,
    )

The compression_ratio field in the DB lets you track per-record compression quality over time. Ralph's reflection pass can then observe whether compression is degrading semantic quality by comparing topics across ratio bands.


What the numbers look like in practice

On a 24-hour M3 Pro run (Qwen2.5-14B-4bit, 50 iterations):

metric value
avg tokens per raw answer 623
avg tokens per compressed answer 58
avg compression ratio 90.7%
loop speed (local) ~4 iterations/min
context history depth (8K ctx) 108 records
unique topics generated 37

At 4 iterations/min × 60 min × 24 hours = 5,760 iterations/day theoretical. Realistically ~2,000 with Ralph reflection passes every 50 records and occasional sleep.


Why this matters for the HuggingFace dataset

The ultrawhale dataset (PeetPedro/ultrawhale-dogfood) currently contains only uncompressed records. The M3 local run will push compressed records into a separate field — response_compressed — so downstream users can choose whether they want the raw LLM output or the headroom-filtered version for training.

The hypothesis: compressed records are higher-signal for fine-tuning because headroom's compressor acts as a structural filter. Repetitive phrasing, hedging, and filler get compressed away; core propositions survive. The compressed dataset might be a better fine-tuning signal than the raw one, even at 90% smaller size.

This is worth measuring properly. Ralph's next reflection pass will include a quality gate: if the round-trip semantic similarity of compressed vs original drops below 0.85 (sentence-transformers cosine), flag the record for review rather than silently degrading the dataset.


Setup (M3 or M1)

# Install MLX
pip install mlx-lm

# Download a model (adjust for your RAM)
# 8GB: Qwen2.5-7B-4bit
# 16GB: Qwen2.5-14B-4bit
# 32GB: Qwen2.5-32B-4bit
mlx_lm.convert --hf-path Qwen/Qwen2.5-14B-Instruct --dtype bf16 -q --q-bits 4

# Start server
mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-4bit --port 8080

# Start loop (in ultrawhale repo)
MODEL=mlx://mlx-community/Qwen2.5-14B-Instruct-4bit \
TOPIC="quantum computing" \
python -m services.loop.main

The loop will run indefinitely, generating Q&A pairs, compressing with headroom, storing in SQLite, pushing to HuggingFace every 1000 records, and reflecting with Ralph every 50.


Related: Compressing the loop · Your first free infinite loop · dogfeedOS · HuggingFace dataset

680fcbbcbe1c28162686ce4f50317d03