I Built a Tool That Optimizes Code While You Sleep

41x faster in 20 iterations. No human in the loop.

A few weeks ago, I came across Karpathy's autoresearch repository. The core idea: run an agentic loop to auto-tune LLM fine-tuning pipelines. Give the agent a goal, a way to measure progress, and let it iterate autonomously until it gets there.

I couldn't stop thinking about it.

Not because of the fine-tuning use case — but because the pattern felt universally useful. Most software has something you want to improve and a way to measure it. Why are we still doing the iteration loop by hand?

So I built Hone — a side project to experiment and learn.

Hone — automated software optimization

What Hone Does

Hone is a CLI tool. You give it three things:

  1. A goal, in plain English
  2. A file or directory to optimize
  3. A benchmark command that outputs a number

Then you leave.

Hone runs a loop: it asks an LLM what to try next, applies the changes, runs your benchmark, and decides whether to keep the result or revert it. It logs every iteration — the score, the diff, and the agent's reasoning — and stops when it hits your target or you tell it to.

hone "Optimize process_logs.py to run under 0.02 seconds" \
     --bench "python bench_logs.py" \
     --files "process_logs.py" \
     --optimize lower \
     --target 0.02 \
     --budget 2.0

That's the entire interface.
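Under the hood, the cycle is conceptually simple. Here is a minimal sketch of that keep-or-revert loop. This is not Hone's actual implementation: the LLM call, the patching, and the logging are abstracted into callables you pass in, and the names are mine.

```python
# Minimal sketch of a keep-or-revert optimization loop (not Hone's real code).
# benchmark:     returns a single score, like the number Hone parses from stdout
# propose:       stand-in for the LLM's "what should we try next?" step
# apply/revert:  stand-ins for patching the target files
def hone_loop(benchmark, propose, apply_change, revert_change,
              target, max_iters=20, direction="lower"):
    best = benchmark()
    history = []  # every attempt: what was tried, its score, kept or not
    for _ in range(max_iters):
        change = propose(history)
        apply_change(change)
        score = benchmark()
        improved = score < best if direction == "lower" else score > best
        if improved:
            best = score               # keep the change
        else:
            revert_change(change)      # revert it and try something else
        history.append((change, score, improved))
        hit = best <= target if direction == "lower" else best >= target
        if hit:
            break                      # target reached: stop
    return best
```

Passing `history` into `propose` matters: it's the agent's memory of what worked and what was reverted.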


Experiment 1: The Log Parser

The first real test was a deliberately naive Python log parser. The task: analyze 150,000 lines of server logs and return the top 3 most-visited endpoints with unique IP counts.

The baseline code was the kind you'd write in an interview warm-up: readlines() into memory, a list for uniqueness checking (O(n) per insert), a regex match on every line. It took 1.54 seconds.
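For reference, the shape of that baseline, sketched with a hypothetical log format (the real file differs in detail, but the anti-patterns are the ones described above):

```python
import re

# Hypothetical log format: '1.2.3.4 - - "GET /path HTTP/1.1" 200'
LINE_RE = re.compile(r'(\d+\.\d+\.\d+\.\d+) .* "GET (\S+) HTTP')

def top_endpoints(path, n=3):
    counts = {}
    unique_ips = {}
    with open(path) as f:
        lines = f.readlines()           # whole file into a Python list
    for line in lines:
        m = LINE_RE.search(line)        # regex match on every line
        if not m:
            continue
        ip, endpoint = m.group(1), m.group(2)
        counts[endpoint] = counts.get(endpoint, 0) + 1
        ips = unique_ips.setdefault(endpoint, [])
        if ip not in ips:               # O(n) list scan per insert
            ips.append(ip)
    top = sorted(counts, key=counts.get, reverse=True)[:n]
    return [(e, counts[e], len(unique_ips[e])) for e in top]
```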

I set a target of 0.02 seconds — roughly 75x faster — and launched Hone with a $2 budget.

Log parser optimization — 20 iterations, 41x speedup

The final move was the interesting one. The agent didn't just tune the existing approach — it recognized the approach itself was the bottleneck and replaced it. That pivot happened at iteration 18, after the agent wrote in its reasoning:

“The real bottleneck is the Python loop and split() calls. Try using a compiled regex to extract the endpoint in one operation across the entire file.”

Final result: 1.54s → 0.037s. A 41x speedup. Autonomously.

It didn't hit the 0.02 target — that's likely beyond what single-threaded Python can do on this task without going to C extensions. But a 41x improvement for $1.84 in API costs is a real result.
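The agent's final code isn't reproduced here, but the technique it described, one compiled regex applied to the whole file plus sets instead of lists for unique IPs, looks something like this (same hypothetical log format as before):

```python
import re
from collections import Counter, defaultdict

# One compiled multiline regex extracts every (ip, endpoint) pair;
# the regex engine does the line scanning in C, not in a Python loop.
PAIR_RE = re.compile(r'^(\d+\.\d+\.\d+\.\d+) .* "GET (\S+) HTTP', re.MULTILINE)

def top_endpoints_fast(path, n=3):
    with open(path) as f:
        text = f.read()                 # one read, one pass
    counts = Counter()
    unique_ips = defaultdict(set)       # set membership is O(1), not O(n)
    for ip, endpoint in PAIR_RE.findall(text):
        counts[endpoint] += 1
        unique_ips[endpoint].add(ip)
    return [(e, c, len(unique_ips[e])) for e, c in counts.most_common(n)]
```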


Experiment 2: Nearest Driver Dispatch

The second experiment was closer to production code. The problem: given a set of riders and a pool of drivers, find the nearest driver for each rider using haversine distance.

The baseline was an O(R × D) brute-force loop — calculate the full haversine distance between every rider and every driver. With 500 riders and 1,000 drivers, that's 500,000 distance calculations per call. Baseline: 2.18 seconds.
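In sketch form, that baseline is just a nested loop over riders and drivers:

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearest_drivers(riders, drivers):
    """O(R x D): full haversine from every rider to every driver."""
    return [min(drivers, key=lambda d: haversine(rlat, rlon, d[0], d[1]))
            for rlat, rlon in riders]
```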

Run 1 — I launched Hone with no hints. Just: “optimize this to run faster.”

The agent went straight for spatial indexing. It built a grid over the geographic area, bucketed drivers into cells, and used Manhattan distance pre-filtering to eliminate distant candidates before running haversine. It also replaced the standard math module haversine with a vectorized approximation valid for short distances.

Result: 0.1496 seconds. A 14.6x speedup.
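A minimal sketch of the grid-bucketing idea. The cell size here is a hypothetical value, and the Manhattan pre-filter and vectorized haversine from the agent's version are omitted to keep it short:

```python
from collections import defaultdict
from math import radians, sin, cos, asin, sqrt

CELL = 0.05  # grid cell size in degrees; hypothetical, not the agent's value

def haversine(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def build_grid(drivers):
    """Bucket each driver into a cell keyed by floor-divided coordinates."""
    grid = defaultdict(list)
    for lat, lon in drivers:
        grid[(int(lat // CELL), int(lon // CELL))].append((lat, lon))
    return grid

def nearest(rider, grid, rings=2):
    """Check only the rider's cell and its neighbors instead of every driver."""
    rlat, rlon = rider
    ci, cj = int(rlat // CELL), int(rlon // CELL)
    candidates = [d
                  for di in range(-rings, rings + 1)
                  for dj in range(-rings, rings + 1)
                  for d in grid.get((ci + di, cj + dj), [])]
    if not candidates:                  # real code would widen the search here
        candidates = [d for cell in grid.values() for d in cell]
    return min(candidates, key=lambda d: haversine(rlat, rlon, d[0], d[1]))
```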

Run 2 — I ran Hone again on the output from Run 1.

This is where it got interesting. The agent looked at the already-optimized code and found something the previous run missed: the grid search still checked every driver in candidate cells, even after it had already found a close one.

The fix: stop searching the moment you find a driver within an acceptable radius. Expand the search radius incrementally — start small, grow outward — instead of checking all candidates at once.

“The algorithm beats the data structure. Grid resolution barely matters. Early termination dominates.”

Result: 0.069 seconds. Another 2.1x on top of an already fast baseline.
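A sketch of that expanding-ring early termination, using a hypothetical degree-based grid (a dict mapping cell coordinates to driver lists); the acceptance radius and cell size are illustrative, not the agent's values:

```python
from math import radians, sin, cos, asin, sqrt

CELL = 0.05  # degrees per grid cell; hypothetical resolution

def haversine(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearest_expanding(rider, grid, accept_km=1.0, max_rings=10):
    """Search ring by ring outward; stop as soon as a driver is close enough."""
    rlat, rlon = rider
    ci, cj = int(rlat // CELL), int(rlon // CELL)
    best, best_d = None, float("inf")
    for r in range(max_rings + 1):
        for di in range(-r, r + 1):
            for dj in range(-r, r + 1):
                if max(abs(di), abs(dj)) != r:
                    continue            # only visit cells exactly r rings out
                for lat, lon in grid.get((ci + di, cj + dj), []):
                    d = haversine(rlat, rlon, lat, lon)
                    if d < best_d:
                        best, best_d = (lat, lon), d
        if best_d <= accept_km:
            break                       # early termination: good enough, stop
    return best
```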

Distance calculation — two runs, 31x total speedup

Two runs, $3 total, brute-force O(R×D) → smart early-termination spatial search. The agent arrived at an approach that a senior engineer would recognize as correct — not by knowing the algorithm upfront, but by observing what the benchmark rewarded.


What I Learned

The benchmark is everything. Hone is only as good as your measurement. If your benchmark is slow to run, the loop is slow. If it doesn't capture what you actually care about, the agent will optimize the wrong thing. The one question to answer before you start: does this number actually reflect what I want?
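Concretely, a benchmark for Hone is just a script that prints one number. A minimal timing harness might look like this (`work()` is a stand-in for whatever you're optimizing):

```python
# bench.py: prints a single number for the optimizer to parse.
import time

def work():
    """Stand-in workload; in the log-parser experiment this would call the parser."""
    return sum(i * i for i in range(100_000))

runs = []
for _ in range(5):
    start = time.perf_counter()
    work()
    runs.append(time.perf_counter() - start)

# Best-of-N reduces noise, so the agent doesn't keep a bad change
# (or revert a good one) because of one flaky timing run.
print(min(runs))
```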

The agent is a good low-level optimizer. It reliably finds the obvious wins: wrong data structures, redundant computations, missed language primitives. These are also the wins that take a human the most time — not because they're hard to understand, but because you have to actually sit down and do them.

It surprises you at the edges. The log parser pivot from line-by-line to whole-file regex wasn't something I would have thought to suggest in the initial prompt. It emerged from the agent hitting a wall and reasoning about why it had hit a wall. That's the behavior that makes agentic loops interesting.

The conversation thread is the memory. The most important architectural decision in Hone was keeping the LLM conversation alive across iterations. The agent doesn't just see the current score — it sees everything it tried, what worked, and what was reverted. That's what allows the pivot at iteration 18. Without it, the agent would start fresh each time and repeat the same early optimizations.

Cost is low. Time savings are high. Both experiments ran under $4. The engineering time to achieve the same results manually — writing hypotheses, applying changes, running benchmarks, reverting dead ends — would have been hours. The ROI on agentic loops is already real, and we're at the beginning.


What's Next

Hone v0 is rough. There's no sandbox for shell commands, no git-based snapshots, no dry-run mode. These are on the list.

More interesting to me is expanding the use cases. The same loop that optimizes a log parser can optimize anything with a benchmark attached.

The pattern is the same. The benchmark changes. Hone doesn't care.

If you want to try it:

git clone https://github.com/laxmena/hone
cd hone && pip install -e .

And if you have a benchmark that Hone should try — I want to hear about it.