DOGE Lense: Building a RAG Pipeline for Legacy COBOL and Fortran Code
The Problem
Legacy codebases are a nightmare to navigate. LAPACK alone is 1.5 million lines of Fortran. COBOL systems power banks, governments, and airlines — and the developers who wrote them are retiring. You can't just grep for "how does matrix multiplication work" or ask a junior dev to trace a PERFORM chain through 50 paragraphs. The code is dense, domain-specific, and spread across files that haven't been touched in decades.
We needed a way to ask questions in plain English and get answers grounded in the actual source — with citations, not hallucinations.
The Solution: DOGE Lense
DOGE Lense is a RAG (Retrieval-Augmented Generation) pipeline that makes COBOL and Fortran codebases queryable through natural language. Ask "How does DGETRF perform LU factorization?" or "What does the CALCULATE-INTEREST paragraph do?" and you get answers with file:line citations, grounded in retrieved code chunks. No more digging through millions of lines by hand.
We built it from scratch — no LangChain, no LlamaIndex. Every stage is purpose-built for legacy code.
Architecture Overview
The pipeline has two phases: ingestion (offline, per codebase) and retrieval (online, per query).
Ingestion: Raw source → language detection → preprocessing → chunking → metadata extraction → batch embedding → Qdrant storage.
Retrieval: User query → embed query → hybrid search (dense + BM25) → rerank (metadata + Cohere) → format context → GPT-4o generation → cited answer.
Five codebases indexed: LAPACK (12,515 chunks), BLAS (814), OpenCOBOL Contrib (3,893), GnuCOBOL (3), and GNU Fortran (varies). Two languages. Eight analysis features. Web UI and CLI both hit the same backend.
Why a Custom Pipeline?
Off-the-shelf RAG frameworks assume document-style content. Legacy code has structure: COBOL paragraphs, Fortran subroutines, fixed-format columns, continuation lines. Generic chunkers would slice through the middle of a PERFORM block. We needed:
- Language-aware preprocessing — COBOL strips cols 1–6 and 73–80; Fortran handles fixed vs. free form
- Syntax-aware chunking — Boundaries at paragraphs and subroutines, not arbitrary token windows
- Adaptive sizing — 64–768 tokens per chunk, merged or split on structural boundaries
- Hybrid search — Identifier-heavy queries (e.g., "DGETRF") need BM25; semantic queries need dense vectors
- Citation enforcement — Every answer cites file:line; the model is instructed to answer ONLY from context
We could have forced legacy code into a generic pipeline. We chose to build one that respects it.
The Ingestion Pipeline
Language detection routes files by extension: .cob, .cbl, .cpy → COBOL; .f, .f90, .f77, .f95 → Fortran.
COBOL preprocessing strips fixed-format artifacts (sequence numbers, identification columns), detects encoding via chardet, extracts comments, handles continuation lines. Col 7 is the indicator area (comment and continuation markers); cols 8–72 are the actual code.
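The column handling can be sketched like this, assuming the standard fixed-format layout (cols 1–6 sequence, col 7 indicator, cols 8–72 code, cols 73–80 identification); the real preprocessor also does encoding detection and comment capture:

```python
def strip_fixed_format(line: str) -> tuple[str, bool]:
    """Return (code_text, is_comment) for one fixed-format COBOL line."""
    indicator = line[6] if len(line) > 6 else " "
    is_comment = indicator in ("*", "/")  # comment indicator markers
    code = line[7:72]                     # code area, cols 8-72 (0-indexed 7:72)
    return code.rstrip(), is_comment
```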
Fortran preprocessing detects fixed vs. free form (extension-based with heuristic override), extracts ! comments and & continuations, handles both Fortran 77 and Fortran 90+.
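One plausible shape for that heuristic override, a sketch rather than the production detector (the majority-vote rule and threshold are assumptions):

```python
def looks_fixed_form(lines: list[str]) -> bool:
    """Guess fixed form: comments flagged by C/c/* in column 1,
    statement fields starting at column 7 or later."""
    fixed_hits = 0
    checked = 0
    for line in lines:
        if not line.strip():
            continue
        checked += 1
        if line[0] in ("C", "c", "*"):               # fixed-form comment marker
            fixed_hits += 1
        elif line[:6] == " " * 6 and len(line) > 6:  # statement field at col 7+
            fixed_hits += 1
    return checked > 0 and fixed_hits / checked > 0.5
```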
Chunking splits on structural boundaries. COBOL: paragraph boundaries (PERFORM targets, standalone paragraphs). Fortran: SUBROUTINE, FUNCTION, MODULE, PROGRAM, BLOCK DATA. Chunks are 64–768 tokens; small adjacent chunks merge, oversized chunks split on line boundaries.
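The merge half of that adaptive sizing can be sketched as follows; counting tokens by whitespace splitting is an approximation of whatever tokenizer the pipeline actually uses:

```python
MIN_TOKENS, MAX_TOKENS = 64, 768

def merge_small_chunks(chunks: list[str]) -> list[str]:
    """Merge adjacent undersized chunks, never exceeding MAX_TOKENS."""
    merged: list[str] = []
    for chunk in chunks:
        if merged:
            prev_tokens = len(merged[-1].split())
            cur_tokens = len(chunk.split())
            # Grow the previous chunk while it is below the minimum
            # and the merge stays within the maximum.
            if prev_tokens < MIN_TOKENS and prev_tokens + cur_tokens <= MAX_TOKENS:
                merged[-1] = merged[-1] + "\n" + chunk
                continue
        merged.append(chunk)
    return merged
```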
Metadata extracted per chunk: file_path, line_start, line_end, paragraph_name (or subroutine name), division, chunk_type, language, codebase, dependencies (CALL, USE, INCLUDE).
Embedding via Voyage Code 2 — 1536 dimensions, batch of 128 texts per API call. Code-optimized model.
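The batching is straightforward; in this sketch `embed_batch` stands in for the Voyage API client call and is hypothetical:

```python
BATCH_SIZE = 128  # texts per API call, as described above

def embed_all(texts, embed_batch):
    """Embed texts in fixed-size batches; embed_batch(list) -> list of vectors."""
    vectors = []
    for i in range(0, len(texts), BATCH_SIZE):
        vectors.extend(embed_batch(texts[i:i + BATCH_SIZE]))
    return vectors
```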
Storage in Qdrant Cloud with payload indexes on paragraph_name, division, file_path, language, codebase for fast filtering.
The Retrieval Pipeline
Query embedding — Same Voyage Code 2 model, input_type="query" for retrieval.
Hybrid search — Qdrant native dense + BM25. We fuse both with query-adaptive weighting: identifier-heavy queries (e.g., "DGETRF", "MAIN-LOGIC") get 0.6 BM25 / 0.4 dense; semantic queries get 0.7 dense / 0.3 BM25. Optional codebase filter isolates results to one codebase.
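The weight selection can be sketched like this; the weights come from the description above, while the identifier heuristic (ALL-CAPS tokens with digits, hyphens, or underscores) is an assumption about how "identifier-heavy" is detected:

```python
import re

IDENTIFIER_RE = re.compile(r"\b[A-Z][A-Z0-9_-]{2,}\b")

def search_weights(query: str) -> tuple[float, float]:
    """Return (dense_weight, bm25_weight) for score fusion."""
    if IDENTIFIER_RE.search(query):
        return 0.4, 0.6   # identifier-heavy: favor BM25
    return 0.7, 0.3       # semantic: favor dense vectors

def fuse(dense_score: float, bm25_score: float, query: str) -> float:
    wd, wb = search_weights(query)
    return wd * dense_score + wb * bm25_score
```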
Re-ranking — Metadata scoring first (paragraph name match, division hints, file path overlap, dependency overlap). Free, always runs. Then Cohere cross-encoder when configured — 40/60 blend with metadata scores. Confidence: HIGH/MEDIUM/LOW from normalized scores. Graceful fallback to metadata-only when Cohere is unavailable.
Context assembly — Top reranked chunks formatted for the LLM. Dynamic token budget (5,000 total). System prompt enforces citation format and "answer ONLY from context."
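A greedy packing sketch under that budget; whitespace-split token counting and the header format are approximations, not the production code:

```python
TOKEN_BUDGET = 5000

def assemble_context(chunks: list[dict]) -> str:
    """Pack reranked chunks (dicts with 'text', 'file_path', 'line_start',
    'line_end') into the LLM context until the budget is exhausted."""
    parts, used = [], 0
    for c in chunks:
        cost = len(c["text"].split())
        if used + cost > TOKEN_BUDGET:
            break
        header = f"--- {c['file_path']}:{c['line_start']}-{c['line_end']} ---"
        parts.append(header + "\n" + c["text"])
        used += cost
    return "\n\n".join(parts)
```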
Generation — GPT-4o with GPT-4o-mini fallback. Feature-specific prompts shape the answer style (explanation vs. translation vs. dependency mapping). All eight features use the same retrieval path; differentiation is via prompts.
Eight Code Understanding Features
- Code Explanation — Plain English explanation of what code does
- Dependency Mapping — Trace PERFORM/CALL/USE chains
- Pattern Detection — Identify common code structures
- Impact Analysis — What breaks if this code changes?
- Documentation Gen — Generate docs for undocumented code
- Translation Hints — Map to modern language equivalents (e.g., Python)
- Bug Pattern Search — Find potential issues and anti-patterns
- Business Logic — Extract business rules from PROCEDURE DIVISION
All flow through the same pipeline. The feature param selects the prompt template. No custom retrieval strategies — prompt differentiation was enough.
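That routing amounts to a template lookup; the strings below are placeholders sketching the idea, not the production prompts:

```python
PROMPTS = {
    "explain": "Explain what this code does in plain English.",
    "dependencies": "Trace the PERFORM/CALL/USE chains in this code.",
    "patterns": "Identify common code structures in this code.",
    "impact": "Describe what could break if this code changes.",
    "docs": "Generate documentation for this code.",
    "translate": "Suggest modern-language (e.g. Python) equivalents.",
    "bugs": "Point out potential issues and anti-patterns.",
    "business_logic": "Extract the business rules implemented here.",
}

def system_prompt(feature: str) -> str:
    """Shared citation rules plus the feature-specific instruction."""
    base = ("Answer ONLY from the provided context. "
            "Cite every claim as file:line.")
    return base + "\n\n" + PROMPTS.get(feature, PROMPTS["explain"])
```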
Evaluation
We built a ground truth dataset: 27 queries across 5 codebases and 6 features. Each query has expected files/names; we measure precision@5 (at least one expected hit in top-5 chunks).
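The metric as defined above, in sketch form (a query counts as a hit if any expected identifier appears among its top-5 retrieved chunks):

```python
def precision_at_5(results: list[tuple[list[str], list[str]]]) -> float:
    """results: (top5_identifiers, expected_identifiers) per query."""
    hits = sum(
        1 for top5, expected in results
        if any(e in top5 for e in expected)
    )
    return hits / len(results)
```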
Result: 81.5% (22/27). Target was 70%. Per-codebase: gnucobol 100%, opencobol-contrib 100%, blas 88%, lapack 86%. gfortran was 0% at eval time (ingestion pending); excluding it, we're at 91.7%.
The evaluation script is reproducible: python evaluation/evaluate.py. Same queries, same metric, same story.
Tech Stack
| Component | Technology |
|---|---|
| Embeddings | Voyage Code 2 (1536 dims, batch 128) |
| Vector DB | Qdrant Cloud |
| Search | Hybrid dense + BM25 (Qdrant native) |
| Re-ranking | Metadata-first + Cohere cross-encoder |
| Generation | GPT-4o (fallback: GPT-4o-mini) |
| API | FastAPI on Render |
| Frontend | Next.js 14 on Vercel |
| CLI | Click + Rich |
Cost and Scaling
Development cost: ~$4.50 (ingestion plus ~200 test queries). Voyage's 50M free tokens covered embedding; GPT-4o generation dominated the spend.
Per-query cost: ~$0.015 (Voyage embed negligible, Cohere optional, GPT-4o input/output dominant). GPT-4o-mini would cut that by ~90%.
Scaling: At 1K users × 10 queries/month → ~$185/mo. At 10K users → ~$1,670/mo. Embedding and reranking stay cheap; generation scales with usage.
Lessons Learned
Encoding is tricky. chardet mis-detected four LAPACK files as UTF-7, producing surrogate characters. We added a UTF-7 guard and fixed the affected files. Fortran and COBOL come from the era of EBCDIC and punched cards — encoding detection can't be an afterthought.
Provider limits bite. Voyage has a 120K token cap per request; we hit it with OpenCOBOL Contrib. Added throttled sub-batching. gfortran ingestion hit rate limits on the free tier; we added exponential backoff. Production ingestion needs rate-limit-aware orchestration.
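The backoff wrapper looks roughly like this; the exception type and delay schedule here are stand-ins, not the production values:

```python
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `call` with exponential backoff on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:  # stand-in for the provider's rate-limit error
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```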
Chunking quality is everything. Bad chunks → bad retrieval → bad answers. Paragraph and subroutine boundaries matter. Adaptive sizing (64–768 tokens) keeps chunks coherent without arbitrary cuts.
Metadata reranking is free and effective. Cohere improves relevance, but metadata-only (paragraph name, division, codebase) gets you most of the way. We run both and blend; Cohere is optional.
Prompt differentiation beats custom retrieval. We considered per-feature retrieval strategies. Turns out the same hybrid search + rerank + different prompts works. Simpler, easier to maintain.
Try It
Web UI: https://frontend-pied-alpha-71.vercel.app
API: https://gauntlet-assignment-3.onrender.com
First request may take 30 seconds if Render is cold (free tier spins down). Totally normal.
Conclusion
DOGE Lense proves that legacy codebases can be made queryable with a custom RAG pipeline. No frameworks, no black boxes — just purpose-built preprocessing, chunking, search, and generation. Five codebases, two languages, eight features, 81.5% retrieval precision. And yes, we went full DOGE on the name.