RiskChain: The Messy Middle: Building a Risk Graph from Scratch

An ongoing weekend project documenting the journey of uncovering hidden connections in corporate financial filings—the stumbles, the learnings, the 'aha!' moments, and everything in between. Started January 2025.


What is RiskChain?

The core idea is simple but ambitious: find hidden connections and risk trails that aren't immediately obvious when you're just reading through a 10-K filing.

Instead of treating each financial document as an isolated artifact, I'm building a system to:
– Extract risk factors from 10-K filings (2004-2025) across 75 companies
– Embed and connect these risks to find non-obvious relationships
– Build a graph that reveals risk clusters, patterns, and “trails” that could signal systemic weaknesses or early warning signs

Why 10-K filings? Because companies are required to disclose risks in specific sections (Item 1 and Item 1A), and there are two decades of structured data just sitting there.


The Vision

Here's the full pipeline I'm building toward:

[Raw Financial Data]
  ├── SEC Filings (10-K/Q) ── News Articles ── Earnings Transcripts ── Other Reports
          │
          ▼
[1. Ingestion & Chunking]
  → Parse documents (PDF/HTML) → Split into sentences → Group into ~500-word chunks
          │
          ▼
[2. Risk Extraction]
  → Use Gemini Flash per chunk → Extract 3-5 specific risk factors + severity
          │
          ▼
[3. Storage & Embeddings]
  → SQLite DB (with sqlite-vec) → Embed risk labels (embedding-gemma-300m) → Deduplicate similar risks
          │
          ▼
[4. Graph Construction]
  → Nodes = unique risks
  → Edges = 
      ├─ Semantic similarity (embeddings)
      └─ Statistical co-occurrence (PMI)
          │
          ▼
[5. Hierarchical Clustering]
  → Apply Leiden algorithm (Surprise function) → Build risk hierarchy tree
  → Compute novelty scores for under-explored areas
          │
          ▼
[6. CLI / Interface Layer]
  → Persistent server for fast queries
  → Commands: search_risks, browse_tree, cross_report_risks, etc.
          │
          ▼
[7. Agent Workflow (Claude / similar)]
  ├── Stage 1: Ideation ── Browse tree → Propose novel risk chains (novelty bias)
  ├── Stage 2: Research ── Dive into chunks → Extract & order excerpts
  └── Stage 3: Output ── Generate RiskChain (visual trail with edges + narrative)
          │
          ▼
[8. Presentation & Action]
  → Web dashboard / exported report
  → Visual graph + highlighted excerpts + suggested hedges / alerts
  → Human review → Iterate via feedback

It's ambitious. It's probably overambitious. But that's the goal.
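To make a couple of those steps concrete: step 4's co-occurrence edges boil down to pointwise mutual information over which risks show up together in the same filings. Below is a minimal sketch, not the real implementation; risk_sets and the risk labels are hypothetical stand-ins for whatever the storage layer eventually emits.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(risk_sets, min_count=2):
    """Score risk pairs by pointwise mutual information over filings:
    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) )."""
    n = len(risk_sets)
    risk_counts = Counter()
    pair_counts = Counter()
    for risks in risk_sets:
        risks = set(risks)
        risk_counts.update(risks)
        pair_counts.update(frozenset(pair) for pair in combinations(sorted(risks), 2))

    edges = []
    for pair, joint in pair_counts.items():
        if joint < min_count:
            continue  # ignore pairs seen too rarely to trust
        a, b = tuple(pair)
        pmi = math.log((joint / n) / ((risk_counts[a] / n) * (risk_counts[b] / n)))
        if pmi > 0:  # keep only pairs that co-occur more than chance predicts
            edges.append((a, b, pmi))
    return edges

# Hypothetical input: each entry is the set of deduplicated risk labels from one filing.
risk_sets = [
    {"supply chain disruption", "single-supplier dependence"},
    {"supply chain disruption", "regulatory exposure"},
    {"supply chain disruption", "single-supplier dependence", "fx volatility"},
    {"regulatory exposure", "fx volatility"},
]
print(pmi_edges(risk_sets))
```

A positive score means two risks co-occur across filings more often than chance would predict, which is exactly the kind of non-obvious relationship the graph is supposed to surface; the semantic-similarity edges from embeddings would be layered on top of these.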
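Step 5's clustering should then be a small call into the leidenalg package, using the Surprise quality function from the diagram. Again a hedged sketch, assuming a weighted edge list of (risk_a, risk_b, score) tuples like the PMI edges; the hierarchy tree and novelty scores are left out entirely.

```python
import igraph as ig  # pip install python-igraph
import leidenalg     # pip install leidenalg

# Hypothetical weighted edges, e.g. the output of the PMI sketch.
edges = [
    ("supply chain disruption", "single-supplier dependence", 0.29),
    ("supply chain disruption", "regulatory exposure", 0.05),
    ("regulatory exposure", "fx volatility", 0.11),
]

# TupleList reads the third element of each tuple as an edge "weight" attribute
# and uses the first two elements as vertex names.
graph = ig.Graph.TupleList(edges, weights=True)

# Leiden with the Surprise quality function; edge weights and the multi-level
# hierarchy are omitted from this sketch.
partition = leidenalg.find_partition(graph, leidenalg.SurpriseVertexPartition)

for community_id, members in enumerate(partition):
    print(community_id, [graph.vs[i]["name"] for i in members])
```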


Current Status

Phase: 2 – Chunking Strategy
Progress: Data downloaded → Chunking complete → Ready for Risk Extraction


Stay Updated

I'm documenting this journey every weekend—the wins, the blockers, the learnings. If you want regular updates on how RiskChain develops, subscribe below to get new posts delivered to your inbox.


Progress Log

Weekend 1 | Jan 18, 2025 | Phase 1: Download Script ✓

What I built: Downloaded 10-K filings for 75 companies from 2004-2025 using the Python edgartools library. Curated a list of significant companies (including ones that went bankrupt in 2008—why not?). Got the script working so it extracts only the relevant sections (Item 1, Item 7, Item 8) to keep things lean.

The messy parts (aka real life): I initially tried sec-edgar-downloader to connect to SEC EDGAR and pull the filings. Spent way too much time on this approach, got stuck in the data-cleaning rabbit hole, and realized I was losing sight of the actual goal. The real issue? Many of the 10-K filings from before the SEC standardized its item categorization didn't play nice with the tool.

Lesson learned: when you're iterating, it's okay to abandon the “perfect” approach for one that ships faster.

Then I switched to edgartools (imported as edgar). This library gave me more flexibility, though the documentation still wasn't intuitive for my specific use case. But instead of giving up, I dug into the source code. That's when things clicked. Sometimes the best learning comes from reading other people's code instead of waiting for docs to explain everything.
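For the curious, the download loop ended up roughly the shape below. I'm reconstructing the edgartools calls from memory, so treat them (especially the item access) as assumptions to verify against the library's source, which is where the real answers live anyway; the ticker list and output path are placeholders.

```python
from pathlib import Path

from edgar import Company, set_identity  # pip install edgartools

set_identity("Your Name you@example.com")  # SEC requires an identity for EDGAR requests

TICKERS = ["AAPL", "GE", "LEH"]            # placeholder; the real list has 75 companies
SECTIONS = ["Item 1", "Item 7", "Item 8"]  # only the sections we care about
OUT = Path("data/raw")

for ticker in TICKERS:
    # Date filtering down to 2004-2025 is omitted to keep the sketch short.
    for filing in Company(ticker).get_filings(form="10-K"):
        tenk = filing.obj()  # structured 10-K data object
        for section in SECTIONS:
            # Item access by name is how I remember the API; older filings with
            # non-standard item headings may hand back nothing here.
            text = tenk[section]
            if not text:
                continue
            out_file = OUT / ticker / f"{filing.filing_date}_{section.replace(' ', '_')}.txt"
            out_file.parent.mkdir(parents=True, exist_ok=True)
            out_file.write_text(text)
```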

The 'aha!' moment:

> My wife helped me understand what Item 1, Item 1A, Item 7, and Item 8 actually mean in a 10-K filing. She translated the financial jargon into plain English, and suddenly the document structure made sense. Having someone who can bridge the domain knowledge gap is invaluable. I realized I was building this in a foreign domain—finance is not my native language, and that's okay.

What blocked me:
– Figuring out the right tool for downloading (sec-edgar-downloader vs edgartools vs rolling my own)
– Understanding that parsing 10-K files is genuinely harder than it looks (inconsistent structures across years, weird formatting, embedded tables)

Next up: Phase 2: Chunking strategy. Need to figure out how to split these documents intelligently for downstream LLM tasks.


Weekend 2 | Jan 23, 2025 | Phase 2: Chunking Strategy ✓

What I built: Implemented chunking using wtpsplitter and stored all chunks as markdown files with YAML frontmatter metadata (ticker, filing date, company name, chunk ID, item section). Now sitting on several thousand chunks, each ~1000 characters max, ready for extraction.
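Roughly what the chunker looks like, paraphrased from memory rather than copied verbatim: the sentence splitting comes from the wtpsplit package (the exact SaT model name below is an assumption), and the frontmatter fields mirror the metadata listed above.

```python
from pathlib import Path

import yaml               # pip install pyyaml
from wtpsplit import SaT  # pip install wtpsplit

sat = SaT("sat-3l-sm")  # small sentence-segmentation model; exact choice may differ
MAX_CHARS = 1000

def chunk_text(text, max_chars=MAX_CHARS):
    """Group sentence-level splits into chunks of at most ~max_chars characters."""
    chunks, current, current_len = [], [], 0
    for sentence in sat.split(text):
        sentence = sentence.strip()
        if current and current_len + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks

def write_chunks(text, meta, out_dir):
    """Write each chunk as a markdown file with YAML frontmatter metadata."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, chunk in enumerate(chunk_text(text)):
        frontmatter = {**meta, "chunk_id": i}
        body = f"---\n{yaml.safe_dump(frontmatter)}---\n\n{chunk}\n"
        item = meta["item"].replace(" ", "_")
        (out_dir / f"{meta['ticker']}_{item}_{i:04d}.md").write_text(body)

# Hypothetical call with placeholder values for one extracted 10-K section.
write_chunks(
    text="Our business depends on a limited number of suppliers. ...",
    meta={"ticker": "AAPL", "filing_date": "2023-11-03",
          "company": "Apple Inc.", "item": "Item 1A"},
    out_dir="data/chunks/AAPL",
)
```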

The messy parts (aka real life): I tried two chunking strategies: RecursiveChunker and wtpsplitter. RecursiveChunker felt like brute force—just splitting on token counts. But wtpsplitter was smarter; it respects sentence boundaries and creates more semantically coherent chunks.

Storing these as markdown files locally feels like a step backward (shouldn't I be using a database?), but honestly, it's perfect for iteration. I can inspect the chunks, debug the metadata, and understand what's happening before I add the complexity of a full DB setup.

The 'aha!' moment:

> Chunk quality matters way more than I initially thought. The way you split text directly impacts whether an LLM can extract meaningful risk factors later. Sentence-aware chunking beats token-counting brutality.

This made me reconsider the whole “let me jump straight to a database” instinct. Sometimes you need to slow down and get the fundamentals right first.

What blocked me:
– Deciding between chunking strategies (trial and error on a few approaches)
– Understanding the tradeoff between local file storage and “proper” database setup (spoiler: local storage is fine for now)
– Realizing I was overthinking this phase when the real value comes next

Next up: Phase 3: Risk Extraction. I'll iterate through each chunk and use Claude/Gemini to extract 3-5 risk factors per chunk. This is where the actual signal starts emerging.
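I haven't written this yet, so the sketch below is just the shape I have in mind: one Gemini Flash call per chunk through the google-generativeai package, asking for a small JSON list. The prompt, model name, and severity scale are placeholders I fully expect to iterate on.

```python
import json
import os

import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # "Gemini Flash"; exact model TBD

PROMPT = """You are reading one chunk of a 10-K filing.
Extract 3-5 specific risk factors from the text below.
Return only JSON: a list of objects with "label" (a short phrase)
and "severity" (low / medium / high).

Chunk:
{chunk}
"""

def extract_risks(chunk: str) -> list[dict]:
    """One LLM call per chunk; the real version will need fence-stripping and retries."""
    response = model.generate_content(PROMPT.format(chunk=chunk))
    try:
        return json.loads(response.text)
    except json.JSONDecodeError:
        return []

# Hypothetical usage: risks = extract_risks(chunk_text)  # one call per chunk file from Phase 2
```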


Why This Matters (and Why I'm Excited)

Most financial analysis tools treat risks as isolated items. “Company X faces supply chain risk.” “Company Y has regulatory exposure.” But what if you could see that 40 companies in the industrial sector all mention the same emerging regulatory risk, and 3 of them went bankrupt 2 years later?

That's the thesis here. Hidden connections. Patterns that emerge when you look at scale.

Also, I'm learning a ton: SEC filing structures, chunking strategies, embedding models, graph theory, the Leiden algorithm... This is weekend learning on steroids.


Updates added weekly (weekends permitting). Check back for new learnings, blockers, and wins.


Resources & References