
      DOGE Lense: Building a RAG Pipeline for Legacy COBOL and Fortran Code

      March 05, 2026

      The Problem

      Legacy codebases are a nightmare to navigate. LAPACK alone is 1.5 million lines of Fortran. COBOL systems power banks, governments, and airlines — and the developers who wrote them are retiring. You can't just grep for "how does matrix multiplication work" or ask a junior dev to trace a PERFORM chain through 50 paragraphs. The code is dense, domain-specific, and spread across files that haven't been touched in decades.

      We needed a way to ask questions in plain English and get answers grounded in the actual source — with citations, not hallucinations.

      The Solution: DOGE Lense

      DOGE Lense is a RAG (Retrieval-Augmented Generation) pipeline that makes COBOL and Fortran codebases queryable through natural language. Ask "How does DGETRF perform LU factorization?" or "What does the CALCULATE-INTEREST paragraph do?" and you get answers with file:line citations, grounded in retrieved code chunks. No more digging through millions of lines by hand.

      We built it from scratch — no LangChain, no LlamaIndex. Every stage is purpose-built for legacy code.


      Architecture Overview

      The pipeline has two phases: ingestion (offline, per codebase) and retrieval (online, per query).

      Ingestion: Raw source → language detection → preprocessing → chunking → metadata extraction → batch embedding → Qdrant storage.

      Retrieval: User query → embed query → hybrid search (dense + BM25) → rerank (metadata + Cohere) → format context → GPT-4o generation → cited answer.

      Five codebases indexed: LAPACK (12,515 chunks), BLAS (814), OpenCOBOL Contrib (3,893), GnuCOBOL (3), and GNU Fortran (varies). Two languages. Eight analysis features. Web UI and CLI both hit the same backend.


      Why a Custom Pipeline?

      Off-the-shelf RAG frameworks assume document-style content. Legacy code has structure: COBOL paragraphs, Fortran subroutines, fixed-format columns, continuation lines. Generic chunkers would slice through the middle of a PERFORM block. We needed:

      • Language-aware preprocessing — COBOL strips cols 1–6 and 73–80; Fortran handles fixed vs. free form
      • Syntax-aware chunking — Boundaries at paragraphs and subroutines, not arbitrary token windows
      • Adaptive sizing — 64–768 tokens per chunk, merged or split on structural boundaries
      • Hybrid search — Identifier-heavy queries (e.g., "DGETRF") need BM25; semantic queries need dense vectors
      • Citation enforcement — Every answer cites file:line; the model is instructed to answer ONLY from context

      We could have forced legacy code into a generic pipeline. We chose to build one that respects it.


      The Ingestion Pipeline

      Language detection routes files by extension: .cob, .cbl, .cpy → COBOL; .f, .f90, .f77, .f95 → Fortran.

      COBOL preprocessing strips fixed-format artifacts (sequence numbers, identification columns), detects encoding via chardet, extracts comments, handles continuation lines. Cols 7–72 are the actual code.
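
      The column-stripping step is the simplest piece to show. A minimal sketch (the real preprocessor also handles encoding, comments, and continuations):

```python
def strip_cobol_columns(line: str) -> str:
    """Keep only columns 7-72 of a fixed-format COBOL line.

    Cols 1-6 hold sequence numbers and cols 73-80 hold identification
    text -- neither is code.
    """
    # Pad short lines so slicing is always safe, then take the code area.
    padded = line.rstrip("\n").ljust(72)
    return padded[6:72].rstrip()
```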

      Fortran preprocessing detects fixed vs. free form (extension-based with heuristic override), extracts ! comments and & continuations, handles both Fortran 77 and Fortran 90+.
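
      A sketch of the extension-plus-heuristic detection. The heuristic shown here (column-1 comment markers vs. trailing `&` continuations) is illustrative, not the project's exact rule:

```python
from pathlib import Path

FREE_EXTS = {".f90", ".f95"}

def detect_fortran_form(path: str, source: str) -> str:
    """Return 'free' or 'fixed'. Extension decides first; a simple
    content heuristic overrides for ambiguous .f/.f77 files."""
    if Path(path).suffix.lower() in FREE_EXTS:
        return "free"
    for line in source.splitlines():
        if not line.strip():
            continue
        # '*' or a lone 'C' in column 1 marks a fixed-form comment line.
        if line[0] in "*C" and not line[1:2].isalpha():
            return "fixed"
        # A trailing '&' continuation is free-form-only syntax.
        if line.rstrip().endswith("&"):
            return "free"
    return "fixed"
```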

      Chunking splits on structural boundaries. COBOL: paragraph boundaries (PERFORM targets, standalone paragraphs). Fortran: SUBROUTINE, FUNCTION, MODULE, PROGRAM, BLOCK DATA. Chunks are 64–768 tokens; small adjacent chunks merge, oversized chunks split on line boundaries.
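
      The merge/split pass can be sketched as below. Token counts here are approximated by word count; the real pipeline would use the embedder's tokenizer:

```python
MIN_TOKENS, MAX_TOKENS = 64, 768

def ntokens(text: str) -> int:
    # Rough proxy for a real tokenizer: one token per whitespace word.
    return len(text.split())

def adapt_chunks(chunks: list[str]) -> list[str]:
    """Merge small adjacent chunks, then split oversized ones on lines."""
    merged: list[str] = []
    for chunk in chunks:
        if merged and ntokens(merged[-1]) < MIN_TOKENS:
            merged[-1] = merged[-1] + "\n" + chunk  # absorb into the runt
        else:
            merged.append(chunk)
    out: list[str] = []
    for chunk in merged:
        if ntokens(chunk) <= MAX_TOKENS:
            out.append(chunk)
            continue
        cur: list[str] = []
        for line in chunk.splitlines():
            cur.append(line)
            if ntokens("\n".join(cur)) >= MAX_TOKENS:
                out.append("\n".join(cur))  # split on a line boundary
                cur = []
        if cur:
            out.append("\n".join(cur))
    return out
```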

      Metadata extracted per chunk: file_path, line_start, line_end, paragraph_name (or subroutine name), division, chunk_type, language, codebase, dependencies (CALL, USE, INCLUDE).
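
      Dependency extraction reduces to a few regex passes over each chunk. The patterns below are a simplified sketch of that idea:

```python
import re

# One pattern per dependency keyword; groups capture the called name.
DEP_PATTERNS = [
    re.compile(r"\bCALL\s+['\"]?([A-Z0-9\-_]+)['\"]?", re.IGNORECASE),
    re.compile(r"\bUSE\s+([A-Z0-9_]+)", re.IGNORECASE),
    re.compile(r"\bINCLUDE\s+['\"]([^'\"]+)['\"]", re.IGNORECASE),
]

def extract_dependencies(chunk: str) -> list[str]:
    """Collect CALL/USE/INCLUDE targets, deduplicated, order preserved."""
    deps: list[str] = []
    for pat in DEP_PATTERNS:
        for m in pat.finditer(chunk):
            name = m.group(1).upper()
            if name not in deps:
                deps.append(name)
    return deps
```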

      Embedding via Voyage Code 2 — 1536 dimensions, batch of 128 texts per API call. Code-optimized model.
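
      The batching logic looks roughly like this. The actual Voyage client call is injected here as a function argument so the sketch stays offline-testable; in production it would be one API request per batch:

```python
from typing import Callable

BATCH_SIZE = 128

def embed_in_batches(texts: list[str],
                     embed_batch: Callable[[list[str]], list[list[float]]],
                     batch_size: int = BATCH_SIZE) -> list[list[float]]:
    """Send texts to the embedding backend in fixed-size batches,
    preserving input order in the returned vectors."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[i:i + batch_size]))
    return vectors
```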

      Storage in Qdrant Cloud with payload indexes on paragraph_name, division, file_path, language, codebase for fast filtering.


      The Retrieval Pipeline

      Query embedding — Same Voyage Code 2 model, input_type="query" for retrieval.

      Hybrid search — Qdrant native dense + BM25. We fuse both with query-adaptive weighting: identifier-heavy queries (e.g., "DGETRF", "MAIN-LOGIC") get 0.6 BM25 / 0.4 dense; semantic queries get 0.7 dense / 0.3 BM25. Optional codebase filter isolates results to one codebase.
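
      The query-adaptive weighting can be sketched as follows. The identifier heuristic here (an all-caps token of three or more characters, like DGETRF or MAIN-LOGIC) is an illustrative stand-in for the project's classifier:

```python
import re

IDENTIFIER = re.compile(r"\b[A-Z][A-Z0-9\-_]{2,}\b")

def fusion_weights(query: str) -> tuple[float, float]:
    """Return (dense_weight, bm25_weight) for hybrid score fusion."""
    if IDENTIFIER.search(query):
        return 0.4, 0.6   # identifier-heavy: favor BM25 exact matching
    return 0.7, 0.3       # semantic: favor dense vectors

def fuse(dense_score: float, bm25_score: float, query: str) -> float:
    wd, wb = fusion_weights(query)
    return wd * dense_score + wb * bm25_score
```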

      Re-ranking — Metadata scoring first (paragraph name match, division hints, file path overlap, dependency overlap). Free, always runs. Then Cohere cross-encoder when configured — 40/60 blend with metadata scores. Confidence: HIGH/MEDIUM/LOW from normalized scores. Graceful fallback to metadata-only when Cohere is unavailable.

      Context assembly — Top reranked chunks formatted for the LLM. Dynamic token budget (5,000 total). System prompt enforces citation format and "answer ONLY from context."
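
      Budgeted packing can be sketched like this, with the ~4-characters-per-token approximation standing in for a real tokenizer:

```python
TOKEN_BUDGET = 5000

def approx_tokens(text: str) -> int:
    # ~4 characters per token is a common rough estimate.
    return len(text) // 4

def build_context(chunks: list[dict], budget: int = TOKEN_BUDGET) -> str:
    """Pack top-ranked chunks into the prompt until the budget is spent,
    each prefixed with its file:line citation header."""
    parts: list[str] = []
    used = 0
    for c in chunks:
        block = (f"[{c['file_path']}:{c['line_start']}-{c['line_end']}]\n"
                 f"{c['text']}")
        cost = approx_tokens(block)
        if used + cost > budget:
            break  # chunks are already ranked, so stop at the budget
        parts.append(block)
        used += cost
    return "\n\n".join(parts)
```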

      Generation — GPT-4o with GPT-4o-mini fallback. Feature-specific prompts shape the answer style (explanation vs. translation vs. dependency mapping). All eight features use the same retrieval path; differentiation is via prompts.


      Eight Code Understanding Features

      1. Code Explanation — Plain English explanation of what code does
      2. Dependency Mapping — Trace PERFORM/CALL/USE chains
      3. Pattern Detection — Identify common code structures
      4. Impact Analysis — What breaks if this code changes?
      5. Documentation Gen — Generate docs for undocumented code
      6. Translation Hints — Map to modern language equivalents (e.g., Python)
      7. Bug Pattern Search — Find potential issues and anti-patterns
      8. Business Logic — Extract business rules from PROCEDURE DIVISION

      All flow through the same pipeline. The feature parameter selects the prompt template. No custom retrieval strategies — prompt differentiation was enough.
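
      Prompt selection reduces to a dictionary lookup. A minimal sketch — the feature names and prompt wording here are illustrative, not the project's actual templates:

```python
# Illustrative templates; the real pipeline has one per feature.
FEATURE_PROMPTS = {
    "explain": "Explain in plain English what the code does.",
    "dependencies": "Trace every PERFORM, CALL, and USE chain.",
    "translate": "Suggest modern Python equivalents for each construct.",
}

def system_prompt(feature: str) -> str:
    """Same retrieval path for every feature; only the prompt differs."""
    style = FEATURE_PROMPTS.get(feature, FEATURE_PROMPTS["explain"])
    return ("Answer ONLY from the provided context. "
            "Cite every claim as file:line. " + style)
```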


      Evaluation

      We built a ground truth dataset: 27 queries across 5 codebases and 6 features. Each query has expected files/names; we measure precision@5 (at least one expected hit in top-5 chunks).
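
      The metric itself is tiny. A sketch of the hit test (the real script matches on expected file names and paragraph/subroutine names; here each retrieved chunk is reduced to a single key for brevity):

```python
def precision_at_5(retrieved: list[str], expected: set[str]) -> bool:
    """A query scores a hit if any of its top-5 chunks matches an
    expected file or paragraph/subroutine name."""
    return any(r in expected for r in retrieved[:5])

def eval_score(results: list[tuple[list[str], set[str]]]) -> float:
    """Fraction of queries with at least one expected hit in the top 5."""
    hits = sum(precision_at_5(r, e) for r, e in results)
    return hits / len(results)
```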

      Result: 81.5% (22/27). Target was 70%. Per-codebase: gnucobol 100%, opencobol-contrib 100%, blas 88%, lapack 86%. gfortran was 0% at eval time (ingestion pending); excluding it, we're at 91.7%.

      The evaluation script is reproducible: python evaluation/evaluate.py. Same queries, same metric, same story.


      Tech Stack

      • Embeddings: Voyage Code 2 (1536 dims, batch 128)
      • Vector DB: Qdrant Cloud
      • Search: Hybrid dense + BM25 (Qdrant native)
      • Re-ranking: Metadata-first + Cohere cross-encoder
      • Generation: GPT-4o (fallback: GPT-4o-mini)
      • API: FastAPI on Render
      • Frontend: Next.js 14 on Vercel
      • CLI: Click + Rich

      Cost and Scaling

      Development cost: ~$4.50 (ingestion + ~200 test queries). Voyage has 50M free tokens; GPT-4o dominated.

      Per-query cost: ~$0.015 (Voyage embed negligible, Cohere optional, GPT-4o input/output dominant). GPT-4o-mini would cut that by ~90%.

      Scaling: At 1K users × 10 queries/month → ~$185/mo. At 10K users → ~$1,670/mo. Embedding and reranking stay cheap; generation scales with usage.


      Lessons Learned

      Encoding is tricky. chardet mis-detected four LAPACK files as UTF-7, producing surrogate characters. We added a UTF-7 guard and fixed the affected files. Fortran and COBOL come from the era of EBCDIC and punched cards — encoding detection can't be an afterthought.

      Provider limits bite. Voyage has a 120K token cap per request; we hit it with OpenCOBOL Contrib. Added throttled sub-batching. gfortran ingestion hit rate limits on the free tier; we added exponential backoff. Production ingestion needs rate-limit-aware orchestration.

      Chunking quality is everything. Bad chunks → bad retrieval → bad answers. Paragraph and subroutine boundaries matter. Adaptive sizing (64–768 tokens) keeps chunks coherent without arbitrary cuts.

      Metadata reranking is free and effective. Cohere improves relevance, but metadata-only (paragraph name, division, codebase) gets you most of the way. We run both and blend; Cohere is optional.

      Prompt differentiation beats custom retrieval. We considered per-feature retrieval strategies. Turns out the same hybrid search + rerank + different prompts works. Simpler, easier to maintain.


      Try It

      Web UI: https://frontend-pied-alpha-71.vercel.app

      API: https://gauntlet-assignment-3.onrender.com

      First request may take 30 seconds if Render is cold (free tier spins down). Totally normal.


      Conclusion

      DOGE Lense proves that legacy codebases can be made queryable with a custom RAG pipeline. No frameworks, no black boxes — just purpose-built preprocessing, chunking, search, and generation. Five codebases, two languages, eight features, 81.5% retrieval precision. And yes, we went full DOGE on the name.
