
      DOGE Lense: Building a RAG Pipeline for Legacy COBOL and Fortran Code

      March 05, 2026

      The Problem

      Legacy codebases are a nightmare to navigate. LAPACK alone is 1.5 million lines of Fortran. COBOL systems power banks, governments, and airlines — and the developers who wrote them are retiring. You can't just grep for "how does matrix multiplication work" or ask a junior dev to trace a PERFORM chain through 50 paragraphs. The code is dense, domain-specific, and spread across files that haven't been touched in decades.

      We needed a way to ask questions in plain English and get answers grounded in the actual source — with citations, not hallucinations.

      The Solution: DOGE Lense

      DOGE Lense is a RAG (Retrieval-Augmented Generation) pipeline that makes COBOL and Fortran codebases queryable through natural language. Ask "How does DGETRF perform LU factorization?" or "What does the CALCULATE-INTEREST paragraph do?" and you get answers with file:line citations, grounded in retrieved code chunks. No more digging through millions of lines by hand.

      We built it from scratch — no LangChain, no LlamaIndex. Every stage is purpose-built for legacy code.


      Architecture Overview

      The pipeline has two phases: ingestion (offline, per codebase) and retrieval (online, per query).

      Ingestion: Raw source → language detection → preprocessing → chunking → metadata extraction → batch embedding → Qdrant storage.

      Retrieval: User query → embed query → hybrid search (dense + BM25) → rerank (metadata + Cohere) → format context → GPT-4o generation → cited answer.

      Five codebases indexed: LAPACK (12,515 chunks), BLAS (814), OpenCOBOL Contrib (3,893), GnuCOBOL (3), and GNU Fortran (varies). Two languages. Eight analysis features. Web UI and CLI both hit the same backend.


      Why a Custom Pipeline?

      Off-the-shelf RAG frameworks assume document-style content. Legacy code has structure: COBOL paragraphs, Fortran subroutines, fixed-format columns, continuation lines. Generic chunkers would slice through the middle of a PERFORM block. We needed:

      • Language-aware preprocessing — COBOL strips cols 1–6 and 73–80; Fortran handles fixed vs. free form
      • Syntax-aware chunking — Boundaries at paragraphs and subroutines, not arbitrary token windows
      • Adaptive sizing — 64–768 tokens per chunk, merged or split on structural boundaries
      • Hybrid search — Identifier-heavy queries (e.g., "DGETRF") need BM25; semantic queries need dense vectors
      • Citation enforcement — Every answer cites file:line; the model is instructed to answer ONLY from context

      We could have forced legacy code into a generic pipeline. We chose to build one that respects it.


      The Ingestion Pipeline

      Language detection routes files by extension: .cob, .cbl, .cpy → COBOL; .f, .f90, .f77, .f95 → Fortran.

      COBOL preprocessing strips fixed-format artifacts (sequence numbers, identification columns), detects encoding via chardet, extracts comments, handles continuation lines. Cols 7–72 are the actual code.
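
      The column-stripping step is the simplest piece to show. A minimal sketch (the real preprocessor also handles encoding, comments, and continuations):

```python
def strip_cobol_columns(line: str) -> str:
    """Keep only columns 7-72 of a fixed-format COBOL line.

    Cols 1-6 hold sequence numbers and cols 73-80 hold identification
    text -- neither is code.
    """
    # Pad short lines so slicing is always safe, then take the code area.
    padded = line.rstrip("\n").ljust(72)
    return padded[6:72].rstrip()
```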

      Fortran preprocessing detects fixed vs. free form (extension-based with heuristic override), extracts ! comments and & continuations, handles both Fortran 77 and Fortran 90+.
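
      A sketch of the extension-plus-heuristic detection. The heuristic shown here (column-1 comment markers vs. trailing `&` continuations) is illustrative, not the project's exact rule:

```python
from pathlib import Path

FREE_EXTS = {".f90", ".f95"}

def detect_fortran_form(path: str, source: str) -> str:
    """Return 'free' or 'fixed'. Extension decides first; a simple
    content heuristic overrides for ambiguous .f/.f77 files."""
    if Path(path).suffix.lower() in FREE_EXTS:
        return "free"
    for line in source.splitlines():
        if not line.strip():
            continue
        # '*' or a lone 'C' in column 1 marks a fixed-form comment line.
        if line[0] in "*C" and not line[1:2].isalpha():
            return "fixed"
        # A trailing '&' continuation is free-form-only syntax.
        if line.rstrip().endswith("&"):
            return "free"
    return "fixed"
```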

      Chunking splits on structural boundaries. COBOL: paragraph boundaries (PERFORM targets, standalone paragraphs). Fortran: SUBROUTINE, FUNCTION, MODULE, PROGRAM, BLOCK DATA. Chunks are 64–768 tokens; small adjacent chunks merge, oversized chunks split on line boundaries.
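
      The merge/split pass can be sketched as below. Token counts here are approximated by word count; the real pipeline would use the embedder's tokenizer:

```python
MIN_TOKENS, MAX_TOKENS = 64, 768

def ntokens(text: str) -> int:
    # Rough proxy for a real tokenizer: one token per whitespace word.
    return len(text.split())

def adapt_chunks(chunks: list[str]) -> list[str]:
    """Merge small adjacent chunks, then split oversized ones on lines."""
    merged: list[str] = []
    for chunk in chunks:
        if merged and ntokens(merged[-1]) < MIN_TOKENS:
            merged[-1] = merged[-1] + "\n" + chunk  # absorb into the runt
        else:
            merged.append(chunk)
    out: list[str] = []
    for chunk in merged:
        if ntokens(chunk) <= MAX_TOKENS:
            out.append(chunk)
            continue
        cur: list[str] = []
        for line in chunk.splitlines():
            cur.append(line)
            if ntokens("\n".join(cur)) >= MAX_TOKENS:
                out.append("\n".join(cur))  # split on a line boundary
                cur = []
        if cur:
            out.append("\n".join(cur))
    return out
```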

      Metadata extracted per chunk: file_path, line_start, line_end, paragraph_name (or subroutine name), division, chunk_type, language, codebase, dependencies (CALL, USE, INCLUDE).
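
      Dependency extraction reduces to a few regex passes over each chunk. The patterns below are a simplified sketch of that idea:

```python
import re

# One pattern per dependency keyword; groups capture the called name.
DEP_PATTERNS = [
    re.compile(r"\bCALL\s+['\"]?([A-Z0-9\-_]+)['\"]?", re.IGNORECASE),
    re.compile(r"\bUSE\s+([A-Z0-9_]+)", re.IGNORECASE),
    re.compile(r"\bINCLUDE\s+['\"]([^'\"]+)['\"]", re.IGNORECASE),
]

def extract_dependencies(chunk: str) -> list[str]:
    """Collect CALL/USE/INCLUDE targets, deduplicated, order preserved."""
    deps: list[str] = []
    for pat in DEP_PATTERNS:
        for m in pat.finditer(chunk):
            name = m.group(1).upper()
            if name not in deps:
                deps.append(name)
    return deps
```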

      Embedding via Voyage Code 2 — 1536 dimensions, batch of 128 texts per API call. Code-optimized model.
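
      The batching logic looks roughly like this. The actual Voyage client call is injected here as a function argument so the sketch stays offline-testable; in production it would be one API request per batch:

```python
from typing import Callable

BATCH_SIZE = 128

def embed_in_batches(texts: list[str],
                     embed_batch: Callable[[list[str]], list[list[float]]],
                     batch_size: int = BATCH_SIZE) -> list[list[float]]:
    """Send texts to the embedding backend in fixed-size batches,
    preserving input order in the returned vectors."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[i:i + batch_size]))
    return vectors
```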

      Storage in Qdrant Cloud with payload indexes on paragraph_name, division, file_path, language, codebase for fast filtering.


      The Retrieval Pipeline

      Query embedding — Same Voyage Code 2 model, input_type="query" for retrieval.

      Hybrid search — Qdrant native dense + BM25. We fuse both with query-adaptive weighting: identifier-heavy queries (e.g., "DGETRF", "MAIN-LOGIC") get 0.6 BM25 / 0.4 dense; semantic queries get 0.7 dense / 0.3 BM25. Optional codebase filter isolates results to one codebase.
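
      The query-adaptive weighting can be sketched as follows. The identifier heuristic here (an all-caps token of three or more characters, like DGETRF or MAIN-LOGIC) is an illustrative stand-in for the project's classifier:

```python
import re

IDENTIFIER = re.compile(r"\b[A-Z][A-Z0-9\-_]{2,}\b")

def fusion_weights(query: str) -> tuple[float, float]:
    """Return (dense_weight, bm25_weight) for hybrid score fusion."""
    if IDENTIFIER.search(query):
        return 0.4, 0.6   # identifier-heavy: favor BM25 exact matching
    return 0.7, 0.3       # semantic: favor dense vectors

def fuse(dense_score: float, bm25_score: float, query: str) -> float:
    wd, wb = fusion_weights(query)
    return wd * dense_score + wb * bm25_score
```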

      Re-ranking — Metadata scoring first (paragraph name match, division hints, file path overlap, dependency overlap). Free, always runs. Then Cohere cross-encoder when configured — 40/60 blend with metadata scores. Confidence: HIGH/MEDIUM/LOW from normalized scores. Graceful fallback to metadata-only when Cohere is unavailable.

      Context assembly — Top reranked chunks formatted for the LLM. Dynamic token budget (5,000 total). System prompt enforces citation format and "answer ONLY from context."
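
      Budgeted packing can be sketched like this, with the ~4-characters-per-token approximation standing in for a real tokenizer:

```python
TOKEN_BUDGET = 5000

def approx_tokens(text: str) -> int:
    # ~4 characters per token is a common rough estimate.
    return len(text) // 4

def build_context(chunks: list[dict], budget: int = TOKEN_BUDGET) -> str:
    """Pack top-ranked chunks into the prompt until the budget is spent,
    each prefixed with its file:line citation header."""
    parts: list[str] = []
    used = 0
    for c in chunks:
        block = (f"[{c['file_path']}:{c['line_start']}-{c['line_end']}]\n"
                 f"{c['text']}")
        cost = approx_tokens(block)
        if used + cost > budget:
            break  # chunks are already ranked, so stop at the budget
        parts.append(block)
        used += cost
    return "\n\n".join(parts)
```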

      Generation — GPT-4o with GPT-4o-mini fallback. Feature-specific prompts shape the answer style (explanation vs. translation vs. dependency mapping). All eight features use the same retrieval path; differentiation is via prompts.


      Eight Code Understanding Features

      1. Code Explanation — Plain English explanation of what code does
      2. Dependency Mapping — Trace PERFORM/CALL/USE chains
      3. Pattern Detection — Identify common code structures
      4. Impact Analysis — What breaks if this code changes?
      5. Documentation Gen — Generate docs for undocumented code
      6. Translation Hints — Map to modern language equivalents (e.g., Python)
      7. Bug Pattern Search — Find potential issues and anti-patterns
      8. Business Logic — Extract business rules from PROCEDURE DIVISION

      All flow through the same pipeline. The feature parameter selects the prompt template. No custom retrieval strategies — prompt differentiation was enough.
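
      Prompt selection reduces to a dictionary lookup. A minimal sketch — the feature names and prompt wording here are illustrative, not the project's actual templates:

```python
# Illustrative templates; the real pipeline has one per feature.
FEATURE_PROMPTS = {
    "explain": "Explain in plain English what the code does.",
    "dependencies": "Trace every PERFORM, CALL, and USE chain.",
    "translate": "Suggest modern Python equivalents for each construct.",
}

def system_prompt(feature: str) -> str:
    """Same retrieval path for every feature; only the prompt differs."""
    style = FEATURE_PROMPTS.get(feature, FEATURE_PROMPTS["explain"])
    return ("Answer ONLY from the provided context. "
            "Cite every claim as file:line. " + style)
```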


      Evaluation

      We built a ground truth dataset: 27 queries across 5 codebases and 6 features. Each query has expected files/names; we measure precision@5 (at least one expected hit in top-5 chunks).
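
      The metric itself is tiny. A sketch of the hit test (the real script matches on expected file names and paragraph/subroutine names; here each retrieved chunk is reduced to a single key for brevity):

```python
def precision_at_5(retrieved: list[str], expected: set[str]) -> bool:
    """A query scores a hit if any of its top-5 chunks matches an
    expected file or paragraph/subroutine name."""
    return any(r in expected for r in retrieved[:5])

def eval_score(results: list[tuple[list[str], set[str]]]) -> float:
    """Fraction of queries with at least one expected hit in the top 5."""
    hits = sum(precision_at_5(r, e) for r, e in results)
    return hits / len(results)
```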

      Result: 81.5% (22/27). Target was 70%. Per-codebase: gnucobol 100%, opencobol-contrib 100%, blas 88%, lapack 86%. gfortran was 0% at eval time (ingestion pending); excluding it, we're at 91.7%.

      The evaluation script is reproducible: python evaluation/evaluate.py. Same queries, same metric, same story.


      Tech Stack

      • Embeddings: Voyage Code 2 (1536 dims, batch 128)
      • Vector DB: Qdrant Cloud
      • Search: Hybrid dense + BM25 (Qdrant native)
      • Re-ranking: Metadata-first + Cohere cross-encoder
      • Generation: GPT-4o (fallback: GPT-4o-mini)
      • API: FastAPI on Render
      • Frontend: Next.js 14 on Vercel
      • CLI: Click + Rich

      Cost and Scaling

      Development cost: ~$4.50 (ingestion + ~200 test queries). Voyage has 50M free tokens; GPT-4o dominated.

      Per-query cost: ~$0.015 (Voyage embed negligible, Cohere optional, GPT-4o input/output dominant). GPT-4o-mini would cut that by ~90%.

      Scaling: At 1K users × 10 queries/month → ~$185/mo. At 10K users → ~$1,670/mo. Embedding and reranking stay cheap; generation scales with usage.


      Lessons Learned

      Encoding is tricky. chardet mis-detected four LAPACK files as UTF-7, producing surrogate characters. We added a UTF-7 guard and fixed the affected files. Fortran and COBOL come from the era of EBCDIC and punched cards — encoding detection can't be an afterthought.

      Provider limits bite. Voyage has a 120K token cap per request; we hit it with OpenCOBOL Contrib. Added throttled sub-batching. gfortran ingestion hit rate limits on the free tier; we added exponential backoff. Production ingestion needs rate-limit-aware orchestration.

      Chunking quality is everything. Bad chunks → bad retrieval → bad answers. Paragraph and subroutine boundaries matter. Adaptive sizing (64–768 tokens) keeps chunks coherent without arbitrary cuts.

      Metadata reranking is free and effective. Cohere improves relevance, but metadata-only (paragraph name, division, codebase) gets you most of the way. We run both and blend; Cohere is optional.

      Prompt differentiation beats custom retrieval. We considered per-feature retrieval strategies. Turns out the same hybrid search + rerank + different prompts works. Simpler, easier to maintain.


      Try It

      Web UI: https://frontend-pied-alpha-71.vercel.app

      API: https://gauntlet-assignment-3.onrender.com

      First request may take 30 seconds if Render is cold (free tier spins down). Totally normal.


      Conclusion

      DOGE Lense proves that legacy codebases can be made queryable with a custom RAG pipeline. No frameworks, no black boxes — just purpose-built preprocessing, chunking, search, and generation. Five codebases, two languages, eight features, 81.5% retrieval precision. And yes, we went full DOGE on the name.
