What Is Parameter Golf?
Parameter Golf is an OpenAI research competition where the objective is to train a language model that achieves the lowest possible bits-per-byte (BPB) score on a held-out evaluation set — subject to strict constraints:
- Training time: 10 minutes on 8×H100 SXM GPUs
- Evaluation time: 10 minutes on 8×H100 SXM GPUs
- Total file size: ≤16 MB (code + compressed weights combined)
- Statistical validity: 3-seed average required for p<0.01 significance
BPB (bits per byte) measures how well a model predicts text. Lower is better. The SOTA baseline at the start of this campaign was 1.0810 BPB.
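For concreteness, BPB can be computed from a model's summed cross-entropy loss over the eval set; a minimal sketch (the function name and example numbers are illustrative, not from the competition harness):

import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    # Summed cross-entropy over the eval set, in nats, converted
    # to bits and normalized by the raw UTF-8 byte count.
    return (total_nats / math.log(2)) / total_bytes

# e.g. 1.5e9 nats of summed loss over 2.0e9 eval bytes -> ~1.082 BPB
print(bits_per_byte(1.5e9, 2_000_000_000))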
The challenge is not just model quality — it is model quality inside an extreme size budget. Every byte of code and every compressed weight must earn its place.
Results Summary
| Metric | Value |
|---|---|
| Best verified single-seed result | 1.21925538 BPB (seed 42, PR #1974) |
| Latest submission (unreviewed) | 1.15865 BPB (3-seed mean, PR #1983 — not yet accepted) |
| Total experiments run | 260 logged runs |
| Machines used | 4 for experiments (Local Nodes ×2, ARM Nodes ×2) + RunPod 8×H100 for final runs |
| Campaign duration | 11 days (April 19–30, 2026) |
The strongest single-seed result shown here is 1.21925538 BPB from PR #1974, and the latest submission is 1.15865 BPB as a 3-seed mean in PR #1983. Neither PR has been reviewed, merged, or officially scored by OpenAI.
Competition Submissions
Two pull requests were opened against the openai/parameter-golf repository in the track_10min_16mb track.
Important — neither submission has been accepted or merged by OpenAI. Both PRs were opened before the April 30, 2026 deadline and are currently open and unreviewed. The OpenAI parameter golf leaderboard only scores submissions after maintainer review and merge. As of the date of this guide, neither PR #1974 nor PR #1983 has received a review, approval, or official score from the competition maintainers. The BPB figures on this page are self-reported from our own training runs — not officially validated scores.
PR #1983 — Latest Submission (Unreviewed)
Int5/Int6 + BigramHash + SmearGate + SWA + LLMAdvisor
3-seed mean · Branch: submission/llmadvisor-3seed-1.15865
PR #1974 — First Submission (Unreviewed)
SP8192 + Depth Recurrence + Parallel Residuals + TTT + SDCLIP + GPTQ-Brotli
Best seed (42, SDCLIP enabled) · Branch: submission/sdclip-1.2192-clean
Submission Snapshots
PR #1974 — Details Verified In The Submission Body
Architecture summary from PR #1974:
- 11-layer transformer, 512d hidden, 8 attention heads (4 KV heads)
- SP8192 vocabulary (SentencePiece BPE, 8192 tokens)
- Depth recurrence: layers 3-5, NUM_LOOPS=2
- Parallel residuals: layers 7+
- TTT: 1 epoch, LR=0.005, momentum=0.9, chunk=32k tokens
- SDCLIP: 20 steps
- GPTQ int6 quantization + Brotli compression
- Result: val_bpb 1.21925538, best seed 42
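To make the depth-recurrence item concrete, here is a hypothetical sketch of how layers 3-5 might be looped. This is not the submission code (which lives on the PR branch); all names are illustrative:

import torch.nn as nn

NUM_LOOPS = 2
RECURRENT_LAYERS = {3, 4, 5}   # depth recurrence: layers 3-5

class DepthRecurrentStack(nn.Module):
    # 11 transformer blocks at 512d hidden; the middle blocks are
    # re-applied to add effective depth without adding parameters.
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            loops = NUM_LOOPS if i in RECURRENT_LAYERS else 1
            for _ in range(loops):
                x = block(x)
        return x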
PR #1983 — Currently Verified From The Public PR
Verified from PR #1983: the public PR title states Int5/Int6 + BigramHash + SmearGate + SWA + LLMAdvisor and the commit comment reports a 3-seed mean val_bpb of 1.15865. A full architecture breakdown is not yet published here — only the PR title techniques and mean BPB are cited.
Infrastructure
| Machine | GPU | Memory | Throughput | Role |
|---|---|---|---|---|
| Local Node 1 | GB10 Blackwell | 128 GB unified | ~490k tok/s | Primary training + validation |
| Local Node 2 | GB10 Blackwell | 128 GB unified | ~490k tok/s | 100GbE QSFP direct-link partner |
| ARM Node 1 | Ampere SM87 | 64 GB unified | ~3,500 tok/s | Parallel hyperparameter sweeps |
| ARM Node 2 | Ampere SM87 | 7.4 GB + 19 GB swap | ~1,800 tok/s | Architecture probes |
| RunPod 8×H100 | H100 SXM ×8 | 80 GB/GPU | — | Final submission runs only (~545 W/GPU power draw) |
Why multiple machines? The 10-minute H100 budget is expensive in both money and access. Running sweeps on local hardware let us validate hyperparameter choices cheaply before committing to H100 — saving approximately 15–20 H100 pod-hours over a naive "run everything on H100" strategy.
Required NCCL Configuration (8×H100)
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_CUMEM_ENABLE=0
export NCCL_NVLS_ENABLE=0
These settings are required to prevent communication deadlocks on multi-GPU distributed training. Omitting any one of them can cause silent hangs.
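Because the failure mode is a silent hang rather than a readable error, it can be worth asserting the configuration before workers spawn. A minimal guard (illustrative, not part of the submission code) that could sit at the top of train_gpt.py:

import os
import sys

REQUIRED = {
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_IB_DISABLE": "1",
    "NCCL_P2P_DISABLE": "1",
    "NCCL_CUMEM_ENABLE": "0",
    "NCCL_NVLS_ENABLE": "0",
}

# Fail fast with a readable error instead of hanging mid-run.
bad = {k: os.environ.get(k) for k, v in REQUIRED.items() if os.environ.get(k) != v}
if bad:
    sys.exit(f"NCCL env misconfigured: {bad}")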
RunPod Operational Lessons
The 8×H100 pod is where your final submission run happens — it is not the place to be figuring out credentials, tooling, or file management. Every minute the pod is alive costs money. These are the hard lessons from running this campaign.
Lesson 1 — Dial everything before you fire the pod. SSH key pair uploaded to RunPod, GitHub SSH key configured and tested, Hugging Face token set, dataset pre-staged or ready to pull, training script committed and at the right commit hash. Test every credential from a throwaway pod first. Debugging auth issues on a live 8×H100 pod is expensive in both time and money.
Pre-Pod Checklist
# Verify SSH key is loaded and reachable
ssh -i ~/.ssh/id_ed25519_runpod -o BatchMode=yes root@<pod-ip> -p <port> echo "auth ok"
# Verify GitHub SSH auth
ssh -T git@github.com
# Verify HuggingFace token (if pulling datasets)
huggingface-cli whoami
# Verify dataset is present and shards are healthy
find ~/data -name "*.bin" | wc -l
find ~/data -name "*.bin" -size -100M -exec echo "SUSPECT: {}" \;
Lesson 2 — Don't push commands too fast; don't let the AI hang evaluating. When training starts, agentic AI tooling monitoring the run can stall the process if it tries to evaluate output mid-run or send commands during a busy GPU phase. Set up your monitoring passively — tail logs, don't interact. Issue commands only between training phases, never during active GPU work. A hung evaluation step on an 8×H100 pod wastes the entire run.
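One way to keep monitoring strictly passive is to tail the log read-only from a separate process; a minimal sketch (the log path matches the save sequence under Lesson 3):

import time

def tail(path: str, poll_s: float = 2.0):
    # Read-only: observes training progress without ever sending
    # commands to the training process or its shell.
    with open(path) as f:
        f.seek(0, 2)                # jump to the end of the file
        while True:
            line = f.readline()
            if line:
                print(line, end="")
            else:
                time.sleep(poll_s)

tail("/tmp/train_seed42.log")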
Lesson 3 — Capture everything before you stop the pod. RunPod pods are ephemeral. Once stopped, the volume may not persist. Before terminating: copy all training logs, the final submission.json, the training script, and any checkpoint files to either GitHub or a persistent volume. One lost pod mid-campaign set back a full day of work.
Post-Run Save Sequence
# After every run — before touching the pod stop button
cp /tmp/train_seed42.log ~/saved_logs/
cp /tmp/train_seed1337.log ~/saved_logs/
cp /tmp/train_seed314.log ~/saved_logs/
# Push logs and submission artifacts to GitHub immediately
cd ~/parameter-golf-fork
git add records/ submission.json
git commit -m "save: final run logs seed 42/1337/314"
git push origin your-submission-branch
# Only terminate the pod AFTER the push confirms success
Campaign Timeline
April 19 — Foundation Setup
All four machines bootstrapped: PyTorch installed, SP8192 tokenizer deployed, 19 GB dataset transferred (101 shards). Smoke tests (200 steps each) were run to confirm each machine was operational.
April 21 — Research Prioritization
Evaluated 10 candidate improvements ranked by expected BPB gain per hour of engineering cost. Key decision locked: 3-seed averaging is non-negotiable — single-seed results used only for directional signal.
April 25 — Mid-Campaign Pivot
Critical finding: EMA blend is harmful after quantization. The pre-quant gain of ~0.020 BPB reversed post-GPTQ. Immediately dropped from the submission path. All effort pivoted to Muon optimizer hyperparameter grid.
April 27 — Parallel Sweep Phase
ARM Node 1 running parallel hyperparameter sweeps at ~3,500 tok/s. ARM Node 2 running architecture probes. Local nodes validating the Muon LR grid. All machines occupied simultaneously.
April 29 — Configuration Lock
Final config locked: MATRIX_LR=0.010, SCALAR_LR=0.030, TTT 1 epoch/LR=0.005/32k chunks, SDCLIP enabled, seeds 42/1337/314. No further changes — locked for submission.
April 30 — Submission Day
Final run on RunPod 8×H100. All 3 seeds completed. Compression verified ≤16 MB. PR #1983 opened. Campaign complete.
Ablation Study Results
Top 5 Configurations by BPB
| Rank | Configuration | BPB | Seed | SDCLIP |
|---|---|---|---|---|
| 1 | final_e7_s42_sdclip | 1.2192 | 42 | ✅ |
| 2 | final_e7_s1337_ttt | 1.2273 | 1337 | ❌ |
| 3 | final_e7_s42_ttt | 1.2279 | 42 | ❌ |
| 4 | ttt_lr3x_s1337_v2 | 1.2600 | 1337 | ❌ |
| 5 | ttt_chunk16k_s1337 | 1.2609 | 1337 | ❌ |
Key Findings
Ablation signal from PR #1974: rows 1–3 are directly supported by the multi-seed table in the PR body. Rows 4–5 (ttt_lr3x, ttt_chunk16k) are from campaign run logs, not the PR submission table, and are included as directional context only.
Seed Variance
| Seed | BPB (PR #1974) | Notes |
|---|---|---|
| 42 | 1.2192 | Best result, SDCLIP enabled |
| 1337 | 1.2273 | Stable baseline |
| 314 | 1.2743 | Highest BPB of the three seeds |
Verified Submission Metrics
| Metric | Value |
|---|---|
| PR #1974 val_bpb | 1.21925538 |
| PR #1974 best seed | 42 (SDCLIP enabled) |
| PR #1974 model size | 15,457,746 bytes |
| PR #1974 training wallclock | 595 s |
| PR #1974 inference time | 272 s |
| PR #1974 seed 1337 | 1.2273 |
| PR #1974 seed 314 | 1.2743 |
| PR #1983 3-seed mean | 1.15865 |
All metrics above are taken directly from the submission body of PR #1974 and PR #1983.
What Didn't Work
Documenting failures is as important as documenting wins.
EMA Blend at Eval
Idea: At evaluation, replace pure EMA weights with a blended checkpoint. Theory: EMA over-smooths, base under-smooths, blend is optimal.
Result: a ~0.02 BPB regression (i.e., higher BPB) after quantization. The EMA blend interacts badly with GPTQ: the pre-quant gain disappears and reverses post-quant.
Lesson: Always test technique interactions with quantization before committing. Pre-quant gains are not guaranteed to survive compression.
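A cheap first-order check is to fake-quantize the candidate weights and re-measure val BPB before committing. The sketch below is a plain symmetric int6 round-trip, not GPTQ (which is error-aware), so treat it as a screening test only:

import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor int6 round-trip (range [-31, 31]):
    # quantize, then dequantize, so the model can be re-evaluated
    # as if it had been compressed.
    qmax = 31
    scale = w.abs().max().clamp(min=1e-12) / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

# Screening procedure:
#   1) measure val BPB on the raw weights,
#   2) apply fake_quant_int6 to every weight matrix and measure again,
#   3) keep the technique only if the post-quant number still improves.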
Auto-Calibrated Training Schedule
Useful for reducing cross-machine variance (~0.008 BPB σ reduction) but irrelevant for single-machine final submission. Engineering cost not justified.
Broad Cross-Machine Sweeps
Cross-machine variance (σ ~0.027) overwhelmed any signal from narrow hyperparameter differences (σ ~0.003–0.010). Local node sweeps were more reliable than cross-machine averaging.
Lesson: Characterize your noise floor before designing sweeps. Cross-machine σ = 0.027 means you need effects >0.054 to be detectable at 2σ. Most hyperparameter effects are smaller than this.
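The arithmetic, made explicit with the two σ values measured during the campaign:

sigma_cross_machine = 0.027
sigma_seed = 0.018

# An effect must exceed 2*sigma to stand out from run-to-run noise.
print(2 * sigma_cross_machine)   # 0.054 -> cross-machine comparisons
print(2 * sigma_seed)            # 0.036 -> same machine, different seed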
High MATRIX_LR (≥0.042)
Consistently regressed vs lower settings on the local node. The optimal band was confirmed at 0.010–0.015. The high-LR assumption was a false inheritance from early ARM node results where batch-scaling shifted the effective LR.
Lessons Learned
- Validate post-quantization, not pre. Every experiment that showed gains pre-quantization had to be re-validated after GPTQ. Several techniques performed well before compression and worse after. The final eval metric is post-quant BPB.
- 3-seed averaging is non-negotiable. The σ ~0.018 seed variance means a single-seed result can mislead by up to ±0.036 BPB at 2σ. Single-seed directional signal is fine; single-seed submission decisions are not (see the sketch after this list).
- Know your noise floor before designing sweeps. Cross-machine σ ~0.027. Seed σ ~0.018. Effect sizes below 2σ are undetectable. This eliminated most speculative ideas early.
- Local validation is cheaper than you think. A 3,000-step local node run (~12 min) provides directional signal reliable enough to make hyperparameter decisions before committing to H100.
- Hardware diversity reduces bottlenecks. Running ARM node sweeps in parallel with local node validations meant the sweep pipeline was never bottlenecked on a single machine.
- The byte budget interacts with everything. At 16 MB total (code + compressed weights), a technique that adds 0.5 MB might be worth it at 0.010 BPB gain but not at 0.002. Always check the byte budget.
- One shot means prioritize reliability over peak. With only one H100 run for submission, use the configuration most likely to land well across 3 seeds — not the highest-ceiling configuration.
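The sketch referenced in the 3-seed bullet, using the documented PR #1974 seed results. Note that the campaign-wide seed σ ~0.018 was estimated over many runs; this 3-point sample naturally reads higher:

import statistics

# Seed-level val BPB from PR #1974 (table above)
seed_bpb = {42: 1.2192, 1337: 1.2273, 314: 1.2743}

mean = statistics.mean(seed_bpb.values())
sigma = statistics.stdev(seed_bpb.values())
print(f"3-seed mean:  {mean:.4f}")   # ~1.2403
print(f"sample sigma: {sigma:.4f}")  # ~0.0297 on this 3-run sample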
A Note on What This Actually Represents
This guide documents numbers and techniques. But the numbers don't fully capture what happened here.
What Was Actually Accomplished
One person, working with AI-assisted tooling, entered a research competition designed for and dominated by academic teams, large labs, and professional ML engineers with dedicated infrastructure. The playing field includes contributors from major research institutions with purpose-built training clusters, full-time engineering support, and years of accumulated domain expertise.
The result: 260 experiments across 4 machines in 11 days, two pull requests opened against the competition repository before the deadline — PR #1974 (1.21925538 BPB seed-42) and PR #1983 (1.15865 BPB 3-seed mean). Neither has been reviewed or accepted by OpenAI. The BPB figures are from our own training runs and are self-reported, not officially validated scores.
What This Proves About AI-Assisted Computing
The core thesis of vibecodinggpt.ai is that the right combination of operator intent and AI tooling can compress the gap between expert and non-expert in compute-heavy technical domains. This campaign is direct, documented evidence of that thesis working in practice.
The Traditional Assumption
To compete in frontier ML research, you need a team, institutional compute, and years of specialization. The knowledge barrier alone — quantization interactions, NCCL config, TTT sensitivity, variance decomposition — is prohibitive.
What Actually Happened
An operator working with AI-assisted iteration navigated all of that — not by knowing everything in advance, but by having a system that could rapidly test hypotheses, recover from failures, and maintain coherent direction across 260 experiments.
The Field Optics Are Wrong
The current consensus framing in ML is that meaningful research contribution requires institutional affiliation, dedicated H100-class compute, a deep specialist background, and significant time for infrastructure setup. This campaign challenges all four simultaneously.
The compute was heterogeneous edge hardware plus a single RunPod pod used once for submission. The infrastructure debugging happened in real time. The entire pipeline — from first experiment to two submitted PRs — ran in 11 days.
The implication is significant: The barrier to contributing meaningfully at the frontier of ML research is not primarily compute, credentials, or team size. It is the quality of the operator-AI loop. With the right tooling and intent, a single operator can go further and faster than the field's assumptions suggest is possible.
Why This Matters Beyond the Score
The gap to SOTA is real and acknowledged. This is not a claim to have solved parameter-efficient language modeling.
What is claimed: that the approach that produced this result — AI-assisted systematic experimentation, applied with discipline and documented rigorously — is replicable, scalable, and applicable to any hard compute problem, not just language model training. Every technique in this guide was discovered, validated, and operationalized through that loop.
That process is the product. The BPB score is the proof.
Reproducibility
Shard Verification
# Verify shard sizes — corrupt shards silently degrade BPB
find ~/data -name "*.bin" -size -100M -exec echo "SUSPECT: {}" \;
# Any shard under ~100MB should be investigated
Training Command (Generic)
python3 -u train_gpt.py \
  --MATRIX_LR=0.010 \
  --SCALAR_LR=0.030 \
  --ITERATIONS=20000 \
  --WARMDOWN_ITERS=3000 \
  --MAX_WALLCLOCK_SECONDS=600 \
  --TTT_EPOCHS=1 \
  --TTT_LR=0.005 \
  --TTT_CHUNK_SIZE=32768 \
  --SDCLIP_STEPS=20 \
  --seed=42
Final Submission Command (8×H100)
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_CUMEM_ENABLE=0
export NCCL_NVLS_ENABLE=0
torchrun --nproc_per_node=8 train_gpt.py \
  --MATRIX_LR=0.010 \
  --SCALAR_LR=0.030 \
  --TTT_EPOCHS=1 \
  --TTT_LR=0.005 \
  --SDCLIP_STEPS=20 \
  --seed=42
What's Next
The next step is straightforward: iterate on the strongest documented levers from the PR #1974 and PR #1983 branches.
High Confidence
- SDCLIP on all submission seeds — the PR #1974 body documents a 0.0081 BPB improvement from SDCLIP on seed 42 (1.2273→1.2192). Low-risk, well-validated.
- Best-of-K seed selection — running K=5 seeds and submitting the best individual seed gains approximately 0.013 BPB in expectation.
Speculative but Promising
- Quant-aware EMA (N3) — track rounded weights in EMA during final 10% of training. Directly targets the pre→post quant gap.
- Post-quant temperature calibration (N9) — scan temperature τ ∈ {0.97..1.03} on val after quantization, bake argmin into submission. 0.002–0.005 BPB expected.
- Model soup (N2) — average final weights of 3 seeds before submission. Reduces seed variance σ 0.018→0.010 at K=3 (see the sketch below).
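A sketch of the soup idea, assuming plain PyTorch state_dict checkpoints; the file names are hypothetical:

import torch

def model_soup(paths):
    # Uniform average of K seed checkpoints. Averaging K i.i.d. runs
    # shrinks seed noise by ~1/sqrt(K): 0.018 / sqrt(3) ~= 0.010.
    dicts = [torch.load(p, map_location="cpu") for p in paths]
    return {k: torch.stack([d[k].float() for d in dicts]).mean(0)
            for k in dicts[0]}

souped = model_soup(["ckpt_s42.pt", "ckpt_s1337.pt", "ckpt_s314.pt"])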