<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>engineering &#8212; laxmena</title>
    <link>https://laxmena.com/tag:engineering</link>
    <description></description>
    <pubDate>Wed, 15 Apr 2026 01:21:20 +0000</pubDate>
    <image>
      <url>https://i.snap.as/n9575tJN.png</url>
      <title>engineering &#8212; laxmena</title>
      <link>https://laxmena.com/tag:engineering</link>
    </image>
    <item>
      <title>Hone vs. The 1 Billion Row Challenge</title>
      <link>https://laxmena.com/hone-vs-the-1-billion-row-challenge?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[1,000,000,000 rows of data. No hand-tuning. Just an agent, a benchmark, and a budget.&#xA;&#xA;The 1 Billion Row Challenge is simple on paper: read a file with 1B rows of weather station measurements, compute min/mean/max per station, as fast as possible. In Python, a naive solution takes minutes. The best human-optimized ones use memory-mapped files, multiprocessing, and numpy.&#xA;&#xA;I&#39;m not optimizing it by hand. I&#39;m giving it to Hone — and letting it figure it out.&#xA;&#xA;Hone is now on PyPI. Install it with pip install hone-ai.&#xA;&#xA;This is a living document. I&#39;ll update it as each run completes. Follow the code at laxmena/hone-1brc.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;The Setup&#xA;&#xA;The challenge: Parse a 1B-row file. Each row: Hamburg;12.0. Compute min/mean/max per station. Print results sorted alphabetically.&#xA;&#xA;The metric: Wall-clock runtime in seconds. Lower is better.&#xA;&#xA;The constraints: Python standard library only. No numpy, no pandas, no third-party packages. Correctness must be preserved — output format and values must not change.&#xA;&#xA;The baseline: Simple. Correct. Slow. One thread, one line at a time, float() on every value.&#xA;&#xA;---&#xA;&#xA;Results at a Glance&#xA;&#xA;| Run | Model | Dataset | Baseline | Optimized | Improvement |&#xA;|-----|-------|---------|----------|-----------|-------------|&#xA;| 1 | Haiku | 1M rows | 0.546s | 0.471s | 13.7% |&#xA;| 2 | Haiku | 100M rows | 47.197s | 42.739s | 9.4% |&#xA;| 3 | Sonnet | 100M rows | 48.104s | 10.110s | 79% |&#xA;| 4 | Sonnet (100M solution, no re-run) | 1B rows | 487.525s | 130.080s | 73.3% |&#xA;| 5 | Sonnet | 1B rows | 487.525s | 90.929s | 81.4% |&#xA;&#xA;---&#xA;&#xA;Episode 1: Haiku, 1M rows — 13.7% faster&#xA;March 25, 2026&#xA;&#xA;0.546s → 0.471s&#xA;&#xA;First run: claude-haiku-4-5, 1M rows, $5 budget, 50 max iterations.&#xA;&#xA;The 13.7% gain looks decent on paper. It isn&#39;t. 
The absolute numbers are tiny — we&#39;re talking 75 milliseconds. At this scale, Python startup time and OS disk caching dominate. The agent is optimizing noise, not the algorithm. Haiku made incremental tweaks but never found a structural breakthrough.&#xA;&#xA;Wrong dataset size. Move on.&#xA;&#xA;---&#xA;&#xA;Hone v1.2.0: --goal-file&#xA;March 25, 2026&#xA;&#xA;Episode 1 exposed a friction point. Pasting a long goal string into the terminal every run is error-prone and hard to version. For complex, multi-constraint goals it breaks down fast.&#xA;&#xA;I added --goal-file to Hone — pass a path to a plain text file, Hone reads the goal from there. Same idea as Karpathy&#39;s program.md in autoresearch. The goal now lives alongside the code, versioned in git.&#xA;&#xA;Live in v1.2.0. pip install --upgrade hone-ai.&#xA;&#xA;---&#xA;&#xA;Episode 2: Haiku, 100M rows — 9.4% faster&#xA;March 25, 2026&#xA;&#xA;47.197s → 42.739s&#xA;&#xA;10x harder dataset. Now I/O pressure actually matters — 4.5 seconds saved is a real signal.&#xA;&#xA;But Haiku still couldn&#39;t find the structural moves. It made safe, local edits — better buffering, minor parsing cleanup — and never stepped back to reconsider the architecture. No parallelism. No mmap. No integer parsing. It hit its ceiling.&#xA;&#xA;---&#xA;&#xA;Episode 3: Sonnet, 100M rows — 79% faster&#xA;March 25, 2026&#xA;&#xA;48.104s → 10.110s&#xA;&#xA;Same benchmark. Same constraints. One change: claude-haiku-4-5 → claude-sonnet-4-6.&#xA;&#xA;38 seconds saved. The agent didn&#39;t tune the baseline — it replaced it.&#xA;&#xA;What Sonnet actually did&#xA;&#xA;1. Text → Binary reads with mmap&#xA;&#xA;The baseline opens the file in text mode and reads line by line. Sonnet switched to binary mode with memory-mapped I/O — the OS maps the file directly into memory, eliminating repeated read syscalls.&#xA;&#xA;2. float() → integer arithmetic&#xA;&#xA;Every float() call in the baseline is expensive. Sonnet eliminated them entirely. 
Temperatures are stored as integers ×10 — 12.3 becomes 123. The decimal point is skipped by knowing its fixed position in the byte string. Division back to float happens only once, at output time. It also pre-built a lookup table for all valid temperature values (-99.9 to 99.9) to skip even manual parsing on the common case.&#xA;&#xA;3. Multiprocessing across all CPU cores&#xA;&#xA;The baseline is single-threaded. Sonnet split the file into cpu_count() × 8 chunks, aligned each boundary to the next newline to avoid splitting rows, and ran each chunk in a separate process. Results merged at the end.&#xA;&#xA;4. strip() + index() → partition()&#xA;&#xA;The baseline does line.strip() then line.index(&#34;;&#34;) — two passes. Sonnet used line.partition(b&#39;;&#39;) — one pass, station and temperature in a single call.&#xA;&#xA;Why Haiku couldn&#39;t find this&#xA;&#xA;Haiku made safe, local edits. It never stepped back to reconsider the architecture. Sonnet saw the whole picture: the bottleneck isn&#39;t any single line, it&#39;s the approach. Single-threaded text parsing doesn&#39;t scale. The winning move was to throw it out and start from a parallel, binary-aware design.&#xA;&#xA;Q: Does model choice matter more than iteration count?&#xA;&#xA;---&#xA;&#xA;Episode 4: Sonnet&#39;s 100M solution, dropped on 1B rows — 73.3% faster&#xA;April 7, 2026&#xA;&#xA;487.525s → 130.080s&#xA;&#xA;Before spending more API budget, I wanted to answer a simpler question first: does the architecture Sonnet found at 100M rows even generalize? No new Hone run. No new cost. Just the existing solution, run unchanged against the full 1B dataset.&#xA;&#xA;357 seconds saved. The answer is yes — it generalizes. mmap, multiprocessing, and integer arithmetic aren&#39;t tricks tuned to a particular file size. They&#39;re structural. The solution held up.&#xA;&#xA;But 130 seconds also exposed the ceiling. Optimizing against a 10x smaller proxy leaves performance on the table. 
The solution was good — just not good enough. Time to run Hone against the real target.&#xA;&#xA;Source code as Gist here&#xA;&#xA;---&#xA;&#xA;Episode 5: Sonnet, 1B rows directly — 81.4% faster&#xA;April 7, 2026&#xA;&#xA;487.525s → 90.929s&#xA;&#xA;Same model, same constraints. This time Hone optimized against the full 1B row dataset from the start.&#xA;&#xA;396 seconds saved. Under 91 seconds for a billion rows of Python.&#xA;&#xA;The gains from Episode 4 weren&#39;t wasted — they were the floor. Hone started from a strong architecture and pushed further. 81.4% beats the 79% from Episode 3. More data, better result. The solution isn&#39;t fragile.&#xA;&#xA;Source code as Gist here&#xA;&#xA;The lesson: optimize against the real target. A proxy dataset is useful for iteration speed, but the final run needs to face the actual problem.&#xA;&#xA;---&#xA;&#xA;What&#39;s Next&#xA;&#xA;Under 91 seconds on 1B rows. The question now is how much headroom is left — and whether Hone can find it without numpy or third-party packages.&#xA;&#xA;---&#xA;&#xA;Updates appear here as experiments run. Subscribe below or follow via RSS.&#xA;&#xA;#engineering #hone #ai&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p>1,000,000,000 rows of data. No hand-tuning. Just an agent, a benchmark, and a budget.</p>

<p>The <a href="https://github.com/gunnarmorling/1brc">1 Billion Row Challenge</a> is simple on paper: read a file with 1B rows of weather station measurements, compute min/mean/max per station, as fast as possible. In Python, a naive solution takes minutes. The best human-optimized ones use memory-mapped files, multiprocessing, and numpy.</p>

<p>I&#39;m not optimizing it by hand. I&#39;m giving it to <a href="https://github.com/laxmena/hone">Hone</a> — and letting it figure it out.</p>

<p>Hone is now on PyPI. Install it with <code>pip install hone-ai</code>.</p>

<p>This is a living document. I&#39;ll update it as each run completes. Follow the code at <a href="https://github.com/laxmena/hone-1brc">laxmena/hone-1brc</a>.</p>



<hr/>

<h2 id="the-setup">The Setup</h2>

<p><strong>The challenge:</strong> Parse a 1B-row file. Each row: <code>Hamburg;12.0</code>. Compute min/mean/max per station. Print results sorted alphabetically.</p>

<p><strong>The metric:</strong> Wall-clock runtime in seconds. Lower is better.</p>

<p><strong>The constraints:</strong> Python standard library only. No numpy, no pandas, no third-party packages. Correctness must be preserved — output format and values must not change.</p>

<p><strong>The baseline:</strong> Simple. Correct. Slow. One thread, one line at a time, <code>float()</code> on every value.</p>
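
<p>For concreteness, the baseline has roughly this shape (a sketch of the description above, not the exact benchmark file; names are illustrative):</p>

<pre><code class="language-python"># Naive baseline sketch: one thread, text mode, one float() per row.
from collections import defaultdict

def baseline(path):
    # per-station running stats: [min, max, total, count]
    stats = defaultdict(lambda: [float('inf'), float('-inf'), 0.0, 0])
    with open(path, encoding='utf-8') as f:
        for line in f:                      # one line at a time
            station, value = line.strip().split(';')
            t = float(value)                # the expensive call per row
            s = stats[station]
            s[0] = min(s[0], t)
            s[1] = max(s[1], t)
            s[2] += t
            s[3] += 1
    # (min, mean, max) per station, sorted alphabetically
    return {k: (v[0], v[2] / v[3], v[1]) for k, v in sorted(stats.items())}
</code></pre>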

<hr/>

<h2 id="results-at-a-glance">Results at a Glance</h2>

<table>
<thead>
<tr>
<th>Run</th>
<th>Model</th>
<th>Dataset</th>
<th>Baseline</th>
<th>Optimized</th>
<th>Improvement</th>
</tr>
</thead>

<tbody>
<tr>
<td>1</td>
<td>Haiku</td>
<td>1M rows</td>
<td>0.546s</td>
<td>0.471s</td>
<td>13.7%</td>
</tr>

<tr>
<td>2</td>
<td>Haiku</td>
<td>100M rows</td>
<td>47.197s</td>
<td>42.739s</td>
<td>9.4%</td>
</tr>

<tr>
<td>3</td>
<td>Sonnet</td>
<td>100M rows</td>
<td>48.104s</td>
<td>10.110s</td>
<td><strong>79%</strong></td>
</tr>

<tr>
<td>4</td>
<td>Sonnet (100M solution, no re-run)</td>
<td>1B rows</td>
<td>487.525s</td>
<td>130.080s</td>
<td>73.3%</td>
</tr>

<tr>
<td>5</td>
<td>Sonnet</td>
<td>1B rows</td>
<td>487.525s</td>
<td>90.929s</td>
<td><strong>81.4%</strong></td>
</tr>
</tbody>
</table>

<hr/>

<h2 id="episode-1-haiku-1m-rows-13-7-faster">Episode 1: Haiku, 1M rows — 13.7% faster</h2>

<p><em>March 25, 2026</em></p>

<p><code>0.546s → 0.471s</code></p>

<p>First run: <code>claude-haiku-4-5</code>, 1M rows, $5 budget, 50 max iterations.</p>

<p>The 13.7% gain looks decent on paper. It isn&#39;t. The absolute numbers are tiny — we&#39;re talking 75 milliseconds. At this scale, Python startup time and OS disk caching dominate. The agent is optimizing noise, not the algorithm. Haiku made incremental tweaks but never found a structural breakthrough.</p>

<p>Wrong dataset size. Move on.</p>

<hr/>

<h2 id="hone-v1-2-0-goal-file">Hone v1.2.0: <code>--goal-file</code></h2>

<p><em>March 25, 2026</em></p>

<p>Episode 1 exposed a friction point. Pasting a long goal string into the terminal every run is error-prone and hard to version. For complex, multi-constraint goals it breaks down fast.</p>

<p>I added <code>--goal-file</code> to Hone — pass a path to a plain text file, Hone reads the goal from there. Same idea as Karpathy&#39;s <code>program.md</code> in autoresearch. The goal now lives alongside the code, versioned in git.</p>

<p>Live in <a href="https://github.com/laxmena/hone/commit/477a83d5050628355bf45ceded3807fea75b8ce6">v1.2.0</a>. <code>pip install --upgrade hone-ai</code>.</p>

<hr/>

<h2 id="episode-2-haiku-100m-rows-9-4-faster">Episode 2: Haiku, 100M rows — 9.4% faster</h2>

<p><em>March 25, 2026</em></p>

<p><code>47.197s → 42.739s</code></p>

<p>10x harder dataset. Now I/O pressure actually matters — 4.5 seconds saved is a real signal.</p>

<p>But Haiku still couldn&#39;t find the structural moves. It made safe, local edits — better buffering, minor parsing cleanup — and never stepped back to reconsider the architecture. No parallelism. No mmap. No integer parsing. It hit its ceiling.</p>

<hr/>

<h2 id="episode-3-sonnet-100m-rows-79-faster">Episode 3: Sonnet, 100M rows — <strong>79% faster</strong></h2>

<p><em>March 25, 2026</em></p>

<p><code>48.104s → 10.110s</code></p>

<p>Same benchmark. Same constraints. One change: <code>claude-haiku-4-5</code> → <code>claude-sonnet-4-6</code>.</p>

<p>38 seconds saved. The agent didn&#39;t tune the baseline — it replaced it.</p>

<h3 id="what-sonnet-actually-did">What Sonnet actually did</h3>

<p><strong>1. Text → Binary reads with <code>mmap</code></strong></p>

<p>The baseline opens the file in text mode and reads line by line. Sonnet switched to binary mode with memory-mapped I/O — the OS maps the file directly into memory, eliminating repeated read syscalls.</p>
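
<p>A minimal sketch of the idea (illustrative, not the generated solution):</p>

<pre><code class="language-python"># Sketch: memory-map the file and walk it as raw bytes, instead of
# text-mode line reads. Slicing the mmap yields bytes without extra
# read syscalls.
import mmap

def scan_lines(path):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            end = len(mm)
            while start != end:
                nl = mm.find(b'\n', start)
                if nl == -1:          # last line has no trailing newline
                    yield mm[start:end]
                    break
                yield mm[start:nl]
                start = nl + 1
</code></pre>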

<p><strong>2. <code>float()</code> → integer arithmetic</strong></p>

<p>Every <code>float()</code> call in the baseline is expensive. Sonnet eliminated them entirely. Temperatures are stored as integers ×10 — <code>12.3</code> becomes <code>123</code>. The decimal point is skipped by knowing its fixed position in the byte string. Division back to float happens only once, at output time. It also pre-built a lookup table for all valid temperature values (<code>-99.9</code> to <code>99.9</code>) to skip even manual parsing on the common case.</p>
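
<p>A sketch of the fixed-point idea, assuming exactly one decimal digit per the challenge format (names are illustrative):</p>

<pre><code class="language-python"># Parse b'12.3' / b'-0.5' style temperatures to integers x10 without float().
def parse_temp(b):
    if b[0:1] == b'-':
        return -parse_temp(b[1:])
    # the dot sits at a known position: last two bytes are '.' and one digit
    return int(b[:-2]) * 10 + (b[-1] - 48)   # 48 is ord('0')

# The lookup-table variant precomputes every value in the valid range once,
# so the hot loop is a single dict hit for common readings:
TABLE = {}
for t in range(-999, 1000):
    TABLE[('%.1f' % (t / 10)).encode()] = t
</code></pre>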

<p><strong>3. Multiprocessing across all CPU cores</strong></p>

<p>The baseline is single-threaded. Sonnet split the file into <code>cpu_count() × 8</code> chunks, aligned each boundary to the next newline to avoid splitting rows, and ran each chunk in a separate process. Results merged at the end.</p>
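
<p>The interesting part is the boundary alignment. A sketch of just that piece (illustrative; the generated code also fans the chunks out to worker processes):</p>

<pre><code class="language-python"># Compute chunk boundaries aligned to newlines, so a row is never
# split across workers.
import os

def chunk_offsets(path, n_chunks):
    size = os.path.getsize(path)
    step = max(1, size // n_chunks)
    bounds = [0]
    with open(path, 'rb') as f:
        for i in range(1, n_chunks):
            f.seek(min(i * step, size))
            f.readline()              # advance past the row the seek landed in
            cut = f.tell()
            if cut != bounds[-1] and cut != size:
                bounds.append(cut)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))
</code></pre>

<p>Each <code>(start, end)</code> pair then goes to one worker process, which merges its per-station stats back at the end.</p>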

<p><strong>4. <code>strip()</code> + <code>index()</code> → <code>partition()</code></strong></p>

<p>The baseline does <code>line.strip()</code> then <code>line.index(&#34;;&#34;)</code> — two passes. Sonnet used <code>line.partition(b&#39;;&#39;)</code> — one pass, station and temperature in a single call.</p>
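
<p>The same split in miniature, on an illustrative bytes line:</p>

<pre><code class="language-python">line = b'Hamburg;12.3\n'

# baseline: two passes over the line
stripped = line.strip()
i = stripped.index(b';')
station_a, temp_a = stripped[:i], stripped[i + 1:]

# optimized: one call returns both halves
station_b, _, rest = line.partition(b';')
temp_b = rest.strip()
</code></pre>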

<h3 id="why-haiku-couldn-t-find-this">Why Haiku couldn&#39;t find this</h3>

<p>Haiku made safe, local edits. It never stepped back to reconsider the architecture. Sonnet saw the whole picture: the bottleneck isn&#39;t any single line, it&#39;s the approach. Single-threaded text parsing doesn&#39;t scale. The winning move was to throw it out and start from a parallel, binary-aware design.</p>

<p><strong>Q: Does model choice matter more than iteration count?</strong></p>

<hr/>

<h2 id="episode-4-sonnet-s-100m-solution-dropped-on-1b-rows-73-3-faster">Episode 4: Sonnet&#39;s 100M solution, dropped on 1B rows — 73.3% faster</h2>

<p><em>April 7, 2026</em></p>

<p><code>487.525s → 130.080s</code></p>

<p>Before spending more API budget, I wanted to answer a simpler question first: does the architecture Sonnet found at 100M rows even generalize? No new Hone run. No new cost. Just the existing solution, run unchanged against the full 1B dataset.</p>

<p>357 seconds saved. The answer is yes — it generalizes. mmap, multiprocessing, and integer arithmetic aren&#39;t tricks tuned to a particular file size. They&#39;re structural. The solution held up.</p>

<p>But 130 seconds also exposed the ceiling. Optimizing against a 10x smaller proxy leaves performance on the table. The solution was good — just not good enough. Time to run Hone against the real target.</p>

<p><a href="https://gist.github.com/laxmena/a55238ce48ab0a3157087b8f345a0775">Source code as Gist here</a></p>

<hr/>

<h2 id="episode-5-sonnet-1b-rows-directly-81-4-faster">Episode 5: Sonnet, 1B rows directly — <strong>81.4% faster</strong></h2>

<p><em>April 7, 2026</em></p>

<p><code>487.525s → 90.929s</code></p>

<p>Same model, same constraints. This time Hone optimized against the full 1B row dataset from the start.</p>

<p>396 seconds saved. Under 91 seconds for a billion rows of Python.</p>

<p>The gains from Episode 4 weren&#39;t wasted — they were the floor. Hone started from a strong architecture and pushed further. 81.4% beats the 79% from Episode 3. More data, better result. The solution isn&#39;t fragile.</p>

<p><a href="https://gist.github.com/laxmena/0cb6ba6c5d8a5e235d245295afa0b9fd">Source code as Gist here</a></p>

<p>The lesson: optimize against the real target. A proxy dataset is useful for iteration speed, but the final run needs to face the actual problem.</p>

<hr/>

<h2 id="what-s-next">What&#39;s Next</h2>

<p>Under 91 seconds on 1B rows. The question now is how much headroom is left — and whether Hone can find it without numpy or third-party packages.</p>

<hr/>

<p><em>Updates appear here as experiments run. Subscribe below or follow via <a href="https://write.as/laxmena/feed/">RSS</a>.</em></p>

<p><a href="https://laxmena.com/tag:engineering" class="hashtag"><span>#</span><span class="p-category">engineering</span></a> <a href="https://laxmena.com/tag:hone" class="hashtag"><span>#</span><span class="p-category">hone</span></a> <a href="https://laxmena.com/tag:ai" class="hashtag"><span>#</span><span class="p-category">ai</span></a></p>


]]></content:encoded>
      <guid>https://laxmena.com/hone-vs-the-1-billion-row-challenge</guid>
      <pubDate>Wed, 25 Mar 2026 04:06:42 +0000</pubDate>
    </item>
    <item>
      <title>I Built a Tool That Optimizes Code While You Sleep</title>
      <link>https://laxmena.com/i-built-a-tool-that-optimizes-code-while-you-sleep?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[A few weeks ago, I watched a Karpathy talk where he described running an agentic loop to auto-tune LLM fine-tuning pipelines. The core idea was simple: give the agent a goal, a way to measure progress, and let it iterate autonomously until it gets there.&#xA;&#xA;I couldn&#39;t stop thinking about it.&#xA;&#xA;Not because of the fine-tuning use case — but because the pattern felt universally useful. Most software has something you want to improve and a way to measure it. Why are we still doing the iteration loop by hand?&#xA;&#xA;So I built Hone.&#xA;&#xA;!--more--&#xA;&#xA;What Hone Does&#xA;&#xA;Hone is a CLI tool. You give it three things:&#xA;&#xA;A goal, in plain English&#xA;&#xA;A file or directory to optimize&#xA;&#xA;A benchmark command that outputs a number&#xA;&#xA;Then you leave.&#xA;&#xA;Hone runs a loop: it asks an LLM what to try next, applies the changes, runs your benchmark, and decides whether to keep the result or revert it. It logs every iteration — the score, the diff, and the agent&#39;s reasoning — and stops when it hits your target or you tell it to.&#xA;&#xA;hone &#34;Optimize processlogs.py to run under 0.02 seconds&#34; &#xA;     --bench &#34;python benchlogs.py&#34; &#xA;     --files &#34;processlogs.py&#34; &#xA;     --optimize lower &#xA;     --target 0.02 &#xA;     --budget 2.0&#xA;&#xA;That&#39;s the entire interface.&#xA;&#xA;---&#xA;&#xA;Experiment 1: The Log Parser&#xA;&#xA;The first real test was a deliberately naive Python log parser. The task: analyze 150,000 lines of server logs and return the top 3 most-visited endpoints with unique IP counts.&#xA;&#xA;The baseline code was the kind you&#39;d write in an interview warm-up: readlines() into memory, a list for uniqueness checking (O(n) per insert), a regex match on every line. 
It took 1.54 seconds.&#xA;&#xA;I set a target of 0.02 seconds — roughly 75x faster — and launched Hone with a $2 budget.&#xA;&#xA;Here&#39;s what happened over 20 iterations:&#xA;&#xA;| Iter | Score | What the agent did |&#xA;|------|-------|--------------------|&#xA;| 1–4 | 0.8s → 0.4s | Replaced list with set for O(1) uniqueness, pre-bound set.add to skip attribute lookup overhead |&#xA;| 5–9 | 0.4s → 0.15s | Switched from readlines() to streaming with f, dropped unnecessary string allocations |&#xA;| 10–14 | 0.15s → 0.09s | Compiled regex outside the loop, switched from re.match to re.search with anchored pattern |&#xA;| 15–17 | 0.09s → 0.07s | Plateaued. Agent recognized it had hit the ceiling of single-threaded Python looping. |&#xA;| 18–20 | 0.07s → 0.037s | Changed the rules entirely. Abandoned line-by-line parsing. Read the file as a raw binary blob. Deployed re.findall() over the entire content in one pass. |&#xA;&#xA;The final move was the interesting one. The agent didn&#39;t just tune the existing approach — it recognized the approach itself was the bottleneck and replaced it. That pivot happened at iteration 18, after the agent wrote in its reasoning:&#xA;&#xA;  &#34;The real bottleneck is the Python loop and split() calls. Try using a compiled regex to extract the endpoint in one operation across the entire file.&#34;&#xA;&#xA;Final result: 1.54s → 0.037s. A 41x speedup. Autonomously.&#xA;&#xA;It didn&#39;t hit the 0.02 target — that&#39;s likely beyond what single-threaded Python can do on this task without going to C extensions. But a 41x improvement for $1.84 in API costs is a real result.&#xA;&#xA;---&#xA;&#xA;Experiment 2: Nearest Driver Dispatch&#xA;&#xA;The second experiment was closer to production code. 
The problem: given a set of riders and a pool of drivers, find the nearest driver for each rider using haversine distance.&#xA;&#xA;The baseline was an O(R × D) brute-force loop — calculate the full haversine distance between every rider and every driver. With 500 riders and 1,000 drivers, that&#39;s 500,000 distance calculations per call. Baseline: 2.18 seconds.&#xA;&#xA;Run 1 — I launched Hone with no hints. Just: &#34;optimize this to run faster.&#34;&#xA;&#xA;The agent went straight for spatial indexing. It built a grid over the geographic area, bucketed drivers into cells, and used Manhattan distance pre-filtering to eliminate distant candidates before running haversine. It also replaced the standard math module haversine with a vectorized approximation valid for short distances.&#xA;&#xA;Result: 0.1496 seconds. A 14.6x speedup.&#xA;&#xA;Run 2 — I ran Hone again on the output from Run 1.&#xA;&#xA;This is where it got interesting. The agent looked at the already-optimized code and found something the previous run missed: the grid search still checked every driver in candidate cells, even after it had already found a close one.&#xA;&#xA;The fix: stop searching the moment you find a driver within an acceptable radius. Expand the search radius incrementally — start small, grow outward — instead of checking all candidates at once.&#xA;&#xA;  &#34;The algorithm beats the data structure. Grid resolution barely matters. Early termination dominates.&#34;&#xA;&#xA;Result: 0.069 seconds. Another 2.1x on top of an already fast baseline.&#xA;&#xA;Two runs, $3 total, brute-force O(R×D) → smart early-termination spatial search. The agent arrived at an approach that a senior engineer would recognize as correct — not by knowing the algorithm upfront, but by observing what the benchmark rewarded.&#xA;&#xA;---&#xA;&#xA;What I Learned&#xA;&#xA;The benchmark is everything. Hone is only as good as your measurement. If your benchmark is slow to run, the loop is slow. 
If it doesn&#39;t capture what you actually care about, the agent will optimize the wrong thing. The one thing you must get right before you start is: &#34;does this number actually reflect what I want?&#34;&#xA;&#xA;The agent is a good low-level optimizer. It reliably finds the obvious wins: wrong data structures, redundant computations, missed language primitives. These are also the wins that take a human the most time — not because they&#39;re hard to understand, but because you have to actually sit down and do them.&#xA;&#xA;It surprises you at the edges. The log parser pivot from line-by-line to whole-file regex wasn&#39;t something I would have thought to suggest in the initial prompt. It emerged from the agent hitting a wall and reasoning about why it had hit a wall. That&#39;s the behavior that makes agentic loops interesting.&#xA;&#xA;The conversation thread is the memory. The most important architectural decision in Hone was keeping the LLM conversation alive across iterations. The agent doesn&#39;t just see the current score — it sees everything it tried, what worked, and what was reverted. That&#39;s what allows the pivot at iteration 18. Without it, the agent would start fresh each time and repeat the same early optimizations.&#xA;&#xA;Cost is low. Time savings are high. Both experiments ran under $4. The engineering time to achieve the same results manually — writing hypotheses, applying changes, running benchmarks, reverting dead ends — would have been hours. The ROI on agentic loops is already real, and we&#39;re at the beginning.&#xA;&#xA;---&#xA;&#xA;What&#39;s Next&#xA;&#xA;Hone v0 is rough. There&#39;s no sandbox for shell commands, no git-based snapshots, no dry-run mode. These are on the list.&#xA;&#xA;More interesting to me is expanding the use cases. 
The same loop that optimizes a log parser can optimize:&#xA;&#xA;LLM prompts against an eval suite (highest impact use case)&#xA;RAG pipelines against a retrieval benchmark&#xA;API costs against a quality-constrained spend target&#xA;&#xA;The pattern is the same. The benchmark changes. Hone doesn&#39;t care.&#xA;&#xA;If you want to try it:&#xA;&#xA;git clone https://github.com/laxmena/hone&#xA;cd hone &amp;&amp; pip install -e .&#xA;&#xA;And if you have a benchmark that Hone should try — I want to hear about it.&#xA;&#xA;#engineering #ai&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p>A few weeks ago, I watched a <a href="https://x.com/karpathy">Karpathy talk</a> where he described running an agentic loop to auto-tune LLM fine-tuning pipelines. The core idea was simple: give the agent a goal, a way to measure progress, and let it iterate autonomously until it gets there.</p>

<p>I couldn&#39;t stop thinking about it.</p>

<p>Not because of the fine-tuning use case — but because the <em>pattern</em> felt universally useful. Most software has something you want to improve and a way to measure it. Why are we still doing the iteration loop by hand?</p>

<p>So I built <a href="https://github.com/laxmena/hone">Hone</a>.</p>



<h2 id="what-hone-does">What Hone Does</h2>

<p>Hone is a CLI tool. You give it three things:</p>
<ol><li><p>A goal, in plain English</p></li>

<li><p>A file or directory to optimize</p></li>

<li><p>A benchmark command that outputs a number</p></li></ol>

<p>Then you leave.</p>

<p>Hone runs a loop: it asks an LLM what to try next, applies the changes, runs your benchmark, and decides whether to keep the result or revert it. It logs every iteration — the score, the diff, and the agent&#39;s reasoning — and stops when it hits your target or you tell it to.</p>

<pre><code class="language-bash">hone &#34;Optimize process_logs.py to run under 0.02 seconds&#34; 
     --bench &#34;python bench_logs.py&#34; 
     --files &#34;process_logs.py&#34; 
     --optimize lower 
     --target 0.02 
     --budget 2.0
</code></pre>

<p>That&#39;s the entire interface.</p>
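
<p>Under the hood, the loop is conceptually simple. This is a sketch of the pattern, not Hone&#39;s actual source; <code>run_benchmark</code> and <code>propose_change</code> are hypothetical stand-ins for the benchmark command and the LLM call:</p>

<pre><code class="language-python"># Keep-or-revert optimization loop, lower score is better.
def optimize(code, run_benchmark, propose_change, max_iters=50):
    best = code
    best_score = run_benchmark(code)
    history = []
    for _ in range(max_iters):
        candidate = propose_change(best, history)   # LLM sees what was tried
        score = run_benchmark(candidate)
        kept = score < best_score                   # keep only improvements
        if kept:
            best, best_score = candidate, score
        history.append((score, kept))               # log: score + keep/revert
    return best, best_score
</code></pre>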

<hr/>

<h2 id="experiment-1-the-log-parser">Experiment 1: The Log Parser</h2>

<p>The first real test was a deliberately naive Python log parser. The task: analyze 150,000 lines of server logs and return the top 3 most-visited endpoints with unique IP counts.</p>

<p>The baseline code was the kind you&#39;d write in an interview warm-up: <code>readlines()</code> into memory, a list for uniqueness checking (O(n) per insert), a regex match on every line. It took <strong>1.54 seconds</strong>.</p>

<p>I set a target of 0.02 seconds — roughly 75x faster — and launched Hone with a $2 budget.</p>

<p>Here&#39;s what happened over 20 iterations:</p>

<table>
<thead>
<tr>
<th>Iter</th>
<th>Score</th>
<th>What the agent did</th>
</tr>
</thead>

<tbody>
<tr>
<td>1–4</td>
<td>0.8s → 0.4s</td>
<td>Replaced list with <code>set</code> for O(1) uniqueness, pre-bound <code>set.add</code> to skip attribute lookup overhead</td>
</tr>

<tr>
<td>5–9</td>
<td>0.4s → 0.15s</td>
<td>Switched from <code>readlines()</code> to streaming over the open file object line by line, dropped unnecessary string allocations</td>
</tr>

<tr>
<td>10–14</td>
<td>0.15s → 0.09s</td>
<td>Compiled regex outside the loop, switched from <code>re.match</code> to <code>re.search</code> with anchored pattern</td>
</tr>

<tr>
<td>15–17</td>
<td>0.09s → 0.07s</td>
<td>Plateaued. Agent recognized it had hit the ceiling of single-threaded Python looping.</td>
</tr>

<tr>
<td>18–20</td>
<td>0.07s → <strong>0.037s</strong></td>
<td><strong>Changed the rules entirely.</strong> Abandoned line-by-line parsing. Read the file as a raw binary blob. Deployed <code>re.findall()</code> over the entire content in one pass.</td>
</tr>
</tbody>
</table>

<p>The final move was the interesting one. The agent didn&#39;t just tune the existing approach — it recognized the approach itself was the bottleneck and replaced it. That pivot happened at iteration 18, after the agent wrote in its reasoning:</p>

<blockquote><p><em>“The real bottleneck is the Python loop and split() calls. Try using a compiled regex to extract the endpoint in one operation across the entire file.”</em></p></blockquote>

<p><strong>Final result: 1.54s → 0.037s. A 41x speedup. Autonomously.</strong></p>

<p>It didn&#39;t hit the 0.02 target — that&#39;s likely beyond what single-threaded Python can do on this task without going to C extensions. But a 41x improvement for $1.84 in API costs is a real result.</p>
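
<p>The final approach, in miniature (a sketch with an illustrative log format, not the agent&#39;s exact output):</p>

<pre><code class="language-python"># Instead of looping line by line, read the whole file as one bytes blob and
# let a single compiled regex extract every endpoint in one pass. The log
# format assumed here (... &#34;GET /path ...&#34;) is illustrative.
import re
from collections import Counter

ENDPOINT = re.compile(rb'"(?:GET|POST) (/\S+)')

def top_endpoints(path, n=3):
    blob = open(path, 'rb').read()            # one read, no line loop
    return Counter(ENDPOINT.findall(blob)).most_common(n)
</code></pre>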

<hr/>

<h2 id="experiment-2-nearest-driver-dispatch">Experiment 2: Nearest Driver Dispatch</h2>

<p>The second experiment was closer to production code. The problem: given a set of riders and a pool of drivers, find the nearest driver for each rider using haversine distance.</p>

<p>The baseline was an O(R × D) brute-force loop — calculate the full haversine distance between every rider and every driver. With 500 riders and 1,000 drivers, that&#39;s 500,000 distance calculations per call. Baseline: <strong>2.18 seconds</strong>.</p>

<p><strong>Run 1</strong> — I launched Hone with no hints. Just: <em>“optimize this to run faster.”</em></p>

<p>The agent went straight for spatial indexing. It built a grid over the geographic area, bucketed drivers into cells, and used Manhattan distance pre-filtering to eliminate distant candidates before running haversine. It also replaced the standard <code>math</code> module haversine with a vectorized approximation valid for short distances.</p>

<p>Result: <strong>0.1496 seconds. A 14.6x speedup.</strong></p>

<p><strong>Run 2</strong> — I ran Hone again on the output from Run 1.</p>

<p>This is where it got interesting. The agent looked at the already-optimized code and found something the previous run missed: the grid search still checked every driver in candidate cells, even after it had already found a close one.</p>

<p>The fix: stop searching the moment you find a driver within an acceptable radius. Expand the search radius incrementally — start small, grow outward — instead of checking all candidates at once.</p>

<blockquote><p><em>“The algorithm beats the data structure. Grid resolution barely matters. Early termination dominates.”</em></p></blockquote>

<p>Result: <strong>0.069 seconds. Another 2.1x on top of an already fast baseline.</strong></p>

<p>Two runs, $3 total, brute-force O(R×D) → smart early-termination spatial search. The agent arrived at an approach that a senior engineer would recognize as correct — not by knowing the algorithm upfront, but by observing what the benchmark rewarded.</p>
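
<p>The two runs combine into roughly this shape (a sketch: plain squared-degree deltas stand in for haversine, and all names and constants are illustrative):</p>

<pre><code class="language-python"># Run 1's idea: bucket drivers into grid cells. Run 2's idea: search rings
# outward from the rider's cell and stop at the first ring that yields a hit.
import math
from collections import defaultdict

CELL = 0.05   # grid cell size in degrees (illustrative)

def build_grid(drivers):
    grid = defaultdict(list)
    for lat, lon in drivers:
        grid[(math.floor(lat / CELL), math.floor(lon / CELL))].append((lat, lon))
    return grid

def nearest(grid, rider, max_rings=50):
    lat, lon = rider
    cx, cy = math.floor(lat / CELL), math.floor(lon / CELL)
    for r in range(max_rings):                # expand the search ring by ring
        best = None
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                if max(abs(dx), abs(dy)) != r:
                    continue                  # only cells on the border of ring r
                for cand in grid.get((cx + dx, cy + dy), ()):
                    d = (cand[0] - lat) ** 2 + (cand[1] - lon) ** 2
                    if best is None or d < best[0]:
                        best = (d, cand)
        if best is not None:
            return best[1]                    # early termination: first hit ring
    return None
</code></pre>

<p>Note the tradeoff: stopping at the first non-empty ring accepts a driver within that radius rather than guaranteeing the global nearest, which is exactly the &#34;acceptable radius&#34; compromise described above.</p>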

<hr/>

<h2 id="what-i-learned">What I Learned</h2>

<p><strong>The benchmark is everything.</strong> Hone is only as good as your measurement. If your benchmark is slow to run, the loop is slow. If it doesn&#39;t capture what you actually care about, the agent will optimize the wrong thing. The one thing you must get right before you start is: “does this number actually reflect what I want?”</p>

<p><strong>The agent is a good low-level optimizer.</strong> It reliably finds the obvious wins: wrong data structures, redundant computations, missed language primitives. These are also the wins that take a human the most time — not because they&#39;re hard to understand, but because you have to actually sit down and do them.</p>

<p><strong>It surprises you at the edges.</strong> The log parser pivot from line-by-line to whole-file regex wasn&#39;t something I would have thought to suggest in the initial prompt. It emerged from the agent hitting a wall and reasoning about <em>why</em> it had hit a wall. That&#39;s the behavior that makes agentic loops interesting.</p>

<p><strong>The conversation thread is the memory.</strong> The most important architectural decision in Hone was keeping the LLM conversation alive across iterations. The agent doesn&#39;t just see the current score — it sees everything it tried, what worked, and what was reverted. That&#39;s what allows the pivot at iteration 18. Without it, the agent would start fresh each time and repeat the same early optimizations.</p>
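<p>The loop shape is roughly this (a skeleton with stand-in callables, not Hone's actual API):</p>

```python
def optimize(code, run_benchmark, llm_propose_patch, apply_patch, max_iters=50):
    """Keep one conversation alive so every attempt stays visible to the agent."""
    messages = [{"role": "user", "content": f"Optimize this code:\n{code}"}]
    best_code, best_score = code, run_benchmark(code)
    for _ in range(max_iters):
        patch, messages = llm_propose_patch(messages)  # full history goes back in
        candidate = apply_patch(best_code, patch)
        score = run_benchmark(candidate)
        if score < best_score:  # lower wall-clock time is better
            best_code, best_score = candidate, score
            outcome = f"Improved: {score:.3f}s"
        else:
            outcome = f"Reverted: {score:.3f}s (best is {best_score:.3f}s)"
        messages.append({"role": "user", "content": outcome})  # memory accumulates
    return best_code, best_score
```

<p>Because <code>messages</code> only grows, the agent at iteration 18 can see iterations 1 through 17, including the reverts.</p>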

<p><strong>Cost is low. Time savings are high.</strong> Both experiments ran under $4. The engineering time to achieve the same results manually — writing hypotheses, applying changes, running benchmarks, reverting dead ends — would have been hours. The ROI on agentic loops is already real, and we&#39;re at the beginning.</p>

<hr/>

<h2 id="what-s-next" id="what-s-next">What&#39;s Next</h2>

<p>Hone v0 is rough. There&#39;s no sandbox for shell commands, no git-based snapshots, no dry-run mode. These are on the list.</p>

<p>More interesting to me is expanding the use cases. The same loop that optimizes a log parser can optimize:</p>
<ul><li><strong>LLM prompts</strong> against an eval suite (highest impact use case)</li>
<li><strong>RAG pipelines</strong> against a retrieval benchmark</li>
<li><strong>API costs</strong> against a quality-constrained spend target</li></ul>

<p>The pattern is the same. The benchmark changes. Hone doesn&#39;t care.</p>

<p>If you want to try it:</p>

<pre><code class="language-bash">git clone https://github.com/laxmena/hone
cd hone &amp;&amp; pip install -e .
</code></pre>

<p>And if you have a benchmark that Hone should try — I want to hear about it.</p>

<p><a href="https://laxmena.com/tag:engineering" class="hashtag"><span>#</span><span class="p-category">engineering</span></a> <a href="https://laxmena.com/tag:ai" class="hashtag"><span>#</span><span class="p-category">ai</span></a></p>



]]></content:encoded>
      <guid>https://laxmena.com/i-built-a-tool-that-optimizes-code-while-you-sleep</guid>
      <pubDate>Tue, 24 Mar 2026 03:06:12 +0000</pubDate>
    </item>
    <item>
      <title>RiskChain: The Messy Middle: Building a Risk Graph from Scratch</title>
      <link>https://laxmena.com/riskchain-the-messy-middle-building-a-risk-graph-from-scratch?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[An ongoing weekend project documenting the journey of uncovering hidden connections in corporate financial filings—the stumbles, the learnings, the &#39;aha!&#39; moments, and everything in between. Started January 2025.&#xA;&#xA;---&#xA;&#xA;What is RiskChain?&#xA;&#xA;The core idea is simple but ambitious: find hidden connections and risk trails that aren&#39;t immediately obvious when you&#39;re just reading through a 10-K filing.&#xA;&#xA;!--more--&#xA;&#xA;Instead of treating each financial document as an isolated artifact, I&#39;m building a system to:&#xA;&#xA;Extract risk factors from 10-K filings (2004-2025) across 75 companies&#xA;Embed and connect these risks to find non-obvious relationships&#xA;Build a graph that reveals risk clusters, patterns, and &#34;trails&#34; that could signal systemic weaknesses or early warning signs&#xA;&#xA;Why 10-K filings? Because companies are required to disclose risks in specific sections (Item 1 and Item 1a), and there&#39;s a decade+ of structured data just sitting there.&#xA;&#xA;---&#xA;&#xA;The Vision&#xA;&#xA;Here&#39;s the full pipeline I&#39;m building toward:&#xA;&#xA;[Raw Financial Data]&#xA;  ├── SEC Filings (10-K/Q) ── News Articles ── Earnings Transcripts ── Other Reports&#xA;          │&#xA;          ▼&#xA;[1. Ingestion &amp; Chunking]&#xA;  → Parse documents (PDF/HTML) → Split into sentences → Group into ~500-word chunks&#xA;          │&#xA;          ▼&#xA;[2. Risk Extraction]&#xA;  → Use Gemini Flash per chunk → Extract 3-5 specific risk factors + severity&#xA;          │&#xA;          ▼&#xA;[3. Storage &amp; Embeddings]&#xA;  → SQLite DB (with sqlite-vec) → Embed risk labels (embedding-gemma-300m) → Deduplicate similar risks&#xA;          │&#xA;          ▼&#xA;[4. Graph Construction]&#xA;  → Nodes = unique risks&#xA;  → Edges = &#xA;      ├─ Semantic similarity (embeddings)&#xA;      └─ Statistical co-occurrence (PMI)&#xA;          │&#xA;          ▼&#xA;[5. 
Hierarchical Clustering]&#xA;  → Apply Leiden algorithm (Surprise function) → Build risk hierarchy tree&#xA;  → Compute novelty scores for under-explored areas&#xA;          │&#xA;          ▼&#xA;[6. CLI / Interface Layer]&#xA;  → Persistent server for fast queries&#xA;  → Commands: search_risks, browse_tree, cross_report_risks, etc.&#xA;          │&#xA;          ▼&#xA;[7. Agent Workflow (Claude / similar)]&#xA;  ├── Stage 1: Ideation ── Browse tree → Propose novel risk chains (novelty bias)&#xA;  ├── Stage 2: Research ── Dive into chunks → Extract &amp; order excerpts&#xA;  └── Stage 3: Output ── Generate RiskChain (visual trail with edges + narrative)&#xA;          │&#xA;          ▼&#xA;[8. Presentation &amp; Action]&#xA;  → Web dashboard / exported report&#xA;  → Visual graph + highlighted excerpts + suggested hedges / alerts&#xA;  → Human review → Iterate via feedback&#xA;&#xA;It&#39;s ambitious. It&#39;s probably overambitious. But that&#39;s the goal.&#xA;&#xA;---&#xA;&#xA;Current Status&#xA;&#xA;Phase: 2 - Chunking Strategy ✓&#xA;Progress: Data downloaded → Chunking complete → Ready for Risk Extraction&#xA;&#xA;---&#xA;&#xA;Stay Updated&#xA;&#xA;I&#39;m documenting this journey every weekend—the wins, the blockers, the learnings. If you want regular updates on how RiskChain develops, subscribe below to get new posts delivered to your inbox.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Progress Log&#xA;&#xA;Weekend 1 | Jan 18, 2025 | Phase 1: Download Script ✓&#xA;&#xA;What I built:&#xA;Downloaded 10-K filings for 75 companies from 2004-2025 using the Python edgartools library. Curated a list of significant companies (including ones that went bankrupt in 2008—why not?). Got the script working and only extracting the relevant sections (Item 1, Item 7, Item 8) to keep things lean.&#xA;&#xA;The messy parts (aka real life):&#xA;I initially tried sec-edgar-downloader to connect to SEC and download. 
Spent way too much time on this approach, got stuck in the data cleaning rabbit hole, and realized I was losing sight of the actual goal. The real issue? Many of the 10-K filings before the SEC standardized their item categorization didn&#39;t play nice with the tool.&#xA;&#xA;  Lesson learned: when you&#39;re iterating, it&#39;s okay to abandon the &#34;perfect&#34; approach for one that ships faster.&#xA;&#xA;Then I switched to edgartools (also known as edgar). This library gave me more flexibility, though the documentation still wasn&#39;t intuitive for my specific use case. But instead of giving up, I dug into the source code. That&#39;s when things clicked. Sometimes the best learning comes from reading other people&#39;s code instead of waiting for docs to explain everything.&#xA;&#xA;The &#39;aha!&#39; moment:&#xA;&#xA;  My wife helped me understand what Item 1, Item 1a, Item 7, and Item 8 actually mean in a 10-K filing. She translated the financial jargon into plain English, and suddenly the document structure made sense. Having someone who can bridge the domain knowledge gap is invaluable. I realized I was building this in a foreign domain—finance is not my native language, and that&#39;s okay.&#xA;&#xA;What blocked me:&#xA;&#xA;Figuring out the right tool for downloading (sec-edgar-downloader vs edgartools vs rolling my own)&#xA;Understanding that parsing 10-K files is genuinely harder than it looks (inconsistent structures across years, weird formatting, embedded tables)&#xA;&#xA;Next up: Phase 2: Chunking strategy. Need to figure out how to split these documents intelligently for downstream LLM tasks.&#xA;&#xA;---&#xA;&#xA;Weekend 2 | Jan 23, 2025 | Phase 2: Chunking Strategy ✓&#xA;&#xA;What I built:&#xA;Implemented chunking using wtpsplitter and stored all chunks as markdown files with YAML frontmatter metadata (ticker, filing date, company name, chunk ID, item section). 
Now sitting on several thousand chunks, each ~1000 characters max, ready for extraction.&#xA;&#xA;The messy parts (aka real life):&#xA;I tried two chunking strategies: RecursiveChunker and wtpsplitter. RecursiveChunker felt like brute force—just splitting on token counts. But wtpsplitter was smarter; it respects sentence boundaries and creates more semantically coherent chunks.&#xA;&#xA;Storing these as markdown files locally feels like a step backward (shouldn&#39;t I be using a database?), but honestly, it&#39;s perfect for iteration. I can inspect the chunks, debug the metadata, and understand what&#39;s happening before I add the complexity of a full DB setup.&#xA;&#xA;The &#39;aha!&#39; moment:&#xA;&#xA;  Chunk quality matters way more than I initially thought. The way you split text directly impacts whether an LLM can extract meaningful risk factors later. Sentence-aware chunking beats token-counting brutality. This made me reconsider the whole &#34;let me jump straight to a database&#34; instinct. Sometimes you need to slow down and get the fundamentals right first.&#xA;&#xA;What blocked me:&#xA;&#xA;Deciding between chunking strategies (trial and error on a few approaches)&#xA;Understanding the tradeoff between local file storage and &#34;proper&#34; database setup (spoiler: local storage is fine for now)&#xA;Realizing I was overthinking this phase when the real value comes next&#xA;&#xA;Next up: Phase 3: Risk Extraction. I&#39;ll iterate through each chunk and use Claude/Gemini to extract 3-5 risk factors per chunk. This is where the actual signal starts emerging.&#xA;&#xA;---&#xA;&#xA;Why This Matters (and Why I&#39;m Excited)&#xA;&#xA;Most financial analysis tools treat risks as isolated items. 
&#34;Company X faces supply chain risk.&#34; &#34;Company Y has regulatory exposure.&#34; But what if you could see that 40 companies in the industrial sector all mention the same emerging regulatory risk, and 3 of them went bankrupt 2 years later?&#xA;&#xA;That&#39;s the thesis here. Hidden connections. Patterns that emerge when you look at scale.&#xA;&#xA;Also, I&#39;m learning a ton: SEC filing structures, chunking strategies, embedding models, graph theory, the Leiden algorithm... This is weekend learning on steroids.&#xA;&#xA;---&#xA;&#xA;Updates added weekly (weekends permitting). Check back for new learnings, blockers, and wins.&#xA;&#xA;---&#xA;&#xA;Resources &amp; References&#xA;&#xA;Inspiration: Syntopic Reading with Claude — The original spark for connecting documents at scale&#xA;Graph Clustering: Leiden Algorithm Documentation — For hierarchical risk clustering&#xA;SEC Data Tool: edgartools (edgar) — Python library for downloading SEC filings&#xA;Alternative Tool: sec-edgar-downloader — The tool I explored first (works well for recent filings; struggled with older 10-Ks before SEC standardization)&#xA;&#xA;#engineering #ai]]&gt;</description>
      <content:encoded><![CDATA[<p><em>An ongoing weekend project documenting the journey of uncovering hidden connections in corporate financial filings—the stumbles, the learnings, the &#39;aha!&#39; moments, and everything in between. Started January 2025.</em></p>

<hr/>

<h3 id="what-is-riskchain" id="what-is-riskchain">What is RiskChain?</h3>

<p>The core idea is simple but ambitious: <strong>find hidden connections and risk trails that aren&#39;t immediately obvious</strong> when you&#39;re just reading through a 10-K filing.</p>



<p>Instead of treating each financial document as an isolated artifact, I&#39;m building a system to:</p>
<ul><li>Extract risk factors from 10-K filings (2004-2025) across 75 companies</li>
<li>Embed and connect these risks to find non-obvious relationships</li>
<li>Build a graph that reveals risk clusters, patterns, and “trails” that could signal systemic weaknesses or early warning signs</li></ul>

<p>Why 10-K filings? Because companies are <em>required</em> to disclose risks in specific sections (Item 1 and Item 1a), and there&#39;s a decade+ of structured data just sitting there.</p>

<hr/>

<h2 id="the-vision" id="the-vision">The Vision</h2>

<p>Here&#39;s the full pipeline I&#39;m building toward:</p>

<pre><code>[Raw Financial Data]
  ├── SEC Filings (10-K/Q) ── News Articles ── Earnings Transcripts ── Other Reports
          │
          ▼
[1. Ingestion &amp; Chunking]
  → Parse documents (PDF/HTML) → Split into sentences → Group into ~500-word chunks
          │
          ▼
[2. Risk Extraction]
  → Use Gemini Flash per chunk → Extract 3-5 specific risk factors + severity
          │
          ▼
[3. Storage &amp; Embeddings]
  → SQLite DB (with sqlite-vec) → Embed risk labels (embedding-gemma-300m) → Deduplicate similar risks
          │
          ▼
[4. Graph Construction]
  → Nodes = unique risks
  → Edges = 
      ├─ Semantic similarity (embeddings)
      └─ Statistical co-occurrence (PMI)
          │
          ▼
[5. Hierarchical Clustering]
  → Apply Leiden algorithm (Surprise function) → Build risk hierarchy tree
  → Compute novelty scores for under-explored areas
          │
          ▼
[6. CLI / Interface Layer]
  → Persistent server for fast queries
  → Commands: search_risks, browse_tree, cross_report_risks, etc.
          │
          ▼
[7. Agent Workflow (Claude / similar)]
  ├── Stage 1: Ideation ── Browse tree → Propose novel risk chains (novelty bias)
  ├── Stage 2: Research ── Dive into chunks → Extract &amp; order excerpts
  └── Stage 3: Output ── Generate RiskChain (visual trail with edges + narrative)
          │
          ▼
[8. Presentation &amp; Action]
  → Web dashboard / exported report
  → Visual graph + highlighted excerpts + suggested hedges / alerts
  → Human review → Iterate via feedback
</code></pre>
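<p>As one concrete piece of step 4, the PMI co-occurrence edges could be computed roughly like this (the function name, input shape, and threshold are my sketch, not the project's actual code):</p>

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(filings, min_pmi=0.0):
    """Score risk pairs by pointwise mutual information.

    `filings` is a list of sets, each set holding the risk labels of one filing.
    PMI(a, b) = log(p(a, b) / (p(a) * p(b))); a positive score means the pair
    co-occurs more often than two independent risks would.
    """
    n = len(filings)
    single, pair = Counter(), Counter()
    for risks in filings:
        single.update(risks)
        pair.update(combinations(sorted(risks), 2))
    edges = {}
    for (a, b), c_ab in pair.items():
        pmi = math.log((c_ab / n) / ((single[a] / n) * (single[b] / n)))
        if pmi > min_pmi:
            edges[(a, b)] = pmi
    return edges
```

<p>Pairs that clear the threshold become graph edges alongside the embedding-similarity edges.</p>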

<p>It&#39;s ambitious. It&#39;s probably overambitious. But that&#39;s the goal.</p>

<hr/>

<h2 id="current-status" id="current-status">Current Status</h2>

<p><strong>Phase: 2 – Chunking Strategy</strong> ✓
<strong>Progress:</strong> Data downloaded → Chunking complete → Ready for Risk Extraction</p>

<hr/>

<h2 id="stay-updated" id="stay-updated">Stay Updated</h2>

<p>I&#39;m documenting this journey every weekend—the wins, the blockers, the learnings. If you want regular updates on how RiskChain develops, subscribe below to get new posts delivered to your inbox.</p>



<hr/>

<h2 id="progress-log" id="progress-log">Progress Log</h2>

<h3 id="weekend-1-jan-18-2025-phase-1-download-script" id="weekend-1-jan-18-2025-phase-1-download-script">Weekend 1 | Jan 18, 2025 | Phase 1: Download Script ✓</h3>

<p><strong>What I built:</strong>
Downloaded 10-K filings for 75 companies from 2004-2025 using the Python <code>edgartools</code> library. Curated a list of significant companies (including ones that went bankrupt in 2008—why not?). Got the script working so it extracts only the relevant sections (Item 1, Item 7, Item 8) to keep things lean.</p>

<p><strong>The messy parts (aka real life):</strong>
I initially tried <code>sec-edgar-downloader</code> to connect to SEC and download. Spent way too much time on this approach, got stuck in the data cleaning rabbit hole, and realized I was losing sight of the actual goal. The real issue? Many of the 10-K filings before the SEC standardized their item categorization didn&#39;t play nice with the tool.</p>

<blockquote><p><strong>Lesson learned:</strong> when you&#39;re iterating, it&#39;s okay to abandon the “perfect” approach for one that ships faster.</p></blockquote>

<p>Then I switched to <code>edgartools</code> (also known as <code>edgar</code>). This library gave me more flexibility, though the documentation still wasn&#39;t intuitive for my specific use case. But instead of giving up, I dug into the source code. That&#39;s when things clicked. Sometimes the best learning comes from reading other people&#39;s code instead of waiting for docs to explain everything.</p>

<p><strong>The &#39;aha!&#39; moment:</strong></p>

<blockquote><p>My wife helped me understand what Item 1, Item 1a, Item 7, and Item 8 actually mean in a 10-K filing. She translated the financial jargon into plain English, and suddenly the document structure made sense. <strong>Having someone who can bridge the domain knowledge gap is invaluable.</strong> I realized I was building this in a foreign domain—finance is not my native language, and that&#39;s okay.</p></blockquote>

<p><strong>What blocked me:</strong></p>
<ul><li>Figuring out the right tool for downloading (<code>sec-edgar-downloader</code> vs <code>edgartools</code> vs rolling my own)</li>
<li>Understanding that parsing 10-K files is genuinely harder than it looks (inconsistent structures across years, weird formatting, embedded tables)</li></ul>

<p><strong>Next up:</strong> Phase 2: Chunking strategy. Need to figure out how to split these documents intelligently for downstream LLM tasks.</p>

<hr/>

<h3 id="weekend-2-jan-23-2025-phase-2-chunking-strategy" id="weekend-2-jan-23-2025-phase-2-chunking-strategy">Weekend 2 | Jan 23, 2025 | Phase 2: Chunking Strategy ✓</h3>

<p><strong>What I built:</strong>
Implemented chunking using <code>wtpsplitter</code> and stored all chunks as markdown files with YAML frontmatter metadata (ticker, filing date, company name, chunk ID, item section). Now sitting on several thousand chunks, each ~1000 characters max, ready for extraction.</p>
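<p>The grouping and storage step is simple enough to sketch (assuming sentences arrive pre-split, e.g. from wtpsplitter; the frontmatter fields mirror the ones listed above, but the function names are mine):</p>

```python
def group_sentences(sentences, max_chars=1000):
    """Pack whole sentences into chunks of at most `max_chars` characters."""
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)  # never split mid-sentence
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def to_markdown(chunk_id, ticker, filing_date, company, item, text):
    """Render one chunk as markdown with YAML frontmatter metadata."""
    return (
        "---\n"
        f"chunk_id: {chunk_id}\n"
        f"ticker: {ticker}\n"
        f"filing_date: {filing_date}\n"
        f"company: {company}\n"
        f"item: {item}\n"
        "---\n\n"
        f"{text}\n"
    )
```

<p>Keeping each chunk a whole number of sentences is what makes the files inspectable and the later LLM extraction coherent.</p>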

<p><strong>The messy parts (aka real life):</strong>
I tried two chunking strategies: <code>RecursiveChunker</code> and <code>wtpsplitter</code>. RecursiveChunker felt like brute force—just splitting on token counts. But <code>wtpsplitter</code> was smarter; it respects sentence boundaries and creates more semantically coherent chunks.</p>

<p>Storing these as markdown files locally feels like a step backward (shouldn&#39;t I be using a database?), but honestly, it&#39;s perfect for iteration. I can inspect the chunks, debug the metadata, and understand what&#39;s happening before I add the complexity of a full DB setup.</p>

<p><strong>The &#39;aha!&#39; moment:</strong></p>

<blockquote><p><strong>Chunk quality matters way more than I initially thought.</strong> The way you split text directly impacts whether an LLM can extract meaningful risk factors later. Sentence-aware chunking beats token-counting brutality. This made me reconsider the whole “let me jump straight to a database” instinct. Sometimes you need to slow down and get the fundamentals right first.</p></blockquote>

<p><strong>What blocked me:</strong></p>
<ul><li>Deciding between chunking strategies (trial and error on a few approaches)</li>
<li>Understanding the tradeoff between local file storage and “proper” database setup (spoiler: local storage is fine for now)</li>
<li>Realizing I was overthinking this phase when the real value comes next</li></ul>

<p><strong>Next up:</strong> Phase 3: Risk Extraction. I&#39;ll iterate through each chunk and use Claude/Gemini to extract 3-5 risk factors per chunk. This is where the actual signal starts emerging.</p>

<hr/>

<h2 id="why-this-matters-and-why-i-m-excited" id="why-this-matters-and-why-i-m-excited">Why This Matters (and Why I&#39;m Excited)</h2>

<p>Most financial analysis tools treat risks as isolated items. “Company X faces supply chain risk.” “Company Y has regulatory exposure.” But <strong>what if you could see that 40 companies in the industrial sector all mention the same emerging regulatory risk, and 3 of them went bankrupt 2 years later?</strong></p>

<p>That&#39;s the thesis here. Hidden connections. Patterns that emerge when you look at scale.</p>

<p>Also, I&#39;m learning a <em>ton</em>: SEC filing structures, chunking strategies, embedding models, graph theory, the Leiden algorithm... This is weekend learning on steroids.</p>

<hr/>

<p><em>Updates added weekly (weekends permitting). Check back for new learnings, blockers, and wins.</em></p>

<hr/>

<h2 id="resources-references" id="resources-references">Resources &amp; References</h2>
<ul><li><strong>Inspiration:</strong> <a href="https://pieterma.es/syntopic-reading-claude/">Syntopic Reading with Claude</a> — The original spark for connecting documents at scale</li>
<li><strong>Graph Clustering:</strong> <a href="https://leidenalg.readthedocs.io/en/stable/">Leiden Algorithm Documentation</a> — For hierarchical risk clustering</li>
<li><strong>SEC Data Tool:</strong> <a href="https://github.com/dgunning/edgartools">edgartools (edgar)</a> — Python library for downloading SEC filings</li>
<li><strong>Alternative Tool:</strong> <a href="https://sec-edgar-downloader.readthedocs.io/en/latest/">sec-edgar-downloader</a> — The tool I explored first (works well for recent filings; struggled with older 10-Ks before SEC standardization)</li></ul>

<p><a href="https://laxmena.com/tag:engineering" class="hashtag"><span>#</span><span class="p-category">engineering</span></a> <a href="https://laxmena.com/tag:ai" class="hashtag"><span>#</span><span class="p-category">ai</span></a></p>
]]></content:encoded>
      <guid>https://laxmena.com/riskchain-the-messy-middle-building-a-risk-graph-from-scratch</guid>
      <pubDate>Sat, 24 Jan 2026 04:11:23 +0000</pubDate>
    </item>
  </channel>
</rss>