
The Retrieval Problem No One Talked About
How early RAG pipelines gave us confidence in the wrong things and what broke when “relevant chunks” weren’t enough.

Kodezi Team
Jul 23, 2025
At the beginning of the Chronos project, we treated retrieval like an infrastructure problem. It was not glamorous. It was not new. You build a vector store, embed your files, drop a similarity model on top, and pick the top few chunks that match your input. Simple. Efficient. Everywhere.
It worked in our Q&A demos. It worked in internal tests. It worked in codebase summarization and documentation alignment.
And yet, it failed when it mattered most.
When Chronos needed to diagnose real bugs, this pipeline produced confident, well-formatted, utterly incorrect patches. Fixes that passed syntax checks. Fixes that passed tests. Fixes that quietly broke everything else.
Traditional flat retrieval achieves only 31.7% success on our Multi Random Retrieval benchmark. Even advanced RAG techniques like HyDE reach just 42.3%, Self-RAG manages 48.1%, and GraphRAG peaks at 51.7%.
The retrieval had done its job. It had returned the top-k most "relevant" files based on cosine similarity to the stack trace or error string. The model saw exactly what we told it to look at. And that was the problem.
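To make that concrete, here is a minimal sketch of the kind of flat pipeline we are describing. Every name and the toy hashed "embedder" are hypothetical stand-ins, not our production code; the point is only the shape of the logic: score every chunk against the error string by cosine similarity and keep the top k.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model: a tiny hashed bag-of-words,
    # just enough to make the sketch self-contained and runnable.
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def top_k_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    # Classic flat retrieval: score each chunk independently by cosine
    # similarity to the query and keep the k best. No structure, no
    # causality, just "what looks most like the error string?"
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: float(np.dot(q, embed(c))), reverse=True)
    return ranked[:k]

# Usage: the query is the stack trace or error message; whatever scores
# highest becomes the model's entire view of the codebase.
chunks = [
    "def create_session(provider): ...   # auth/session.py",
    "DEFAULT_PROVIDER = LegacyProvider   # config/providers.py",
    "def test_session_creation(): ...    # tests/test_session.py",
]
print(top_k_chunks("TypeError in create_session, auth/session.py line 88", chunks, k=2))
```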
Retrieval Was the First Illusion of Progress
We saw it early, but we ignored it. One of the first CI failures Chronos was tested on involved a bug in a dependency injection chain. The error happened inside a factory function in auth/session.py. The retriever pulled in that file. It pulled in the stack trace. It pulled in the unit test that failed.
Chronos patched it within seconds. The patch was valid Python. It was even conceptually correct if you looked only at the file in isolation. But the bug was not in that file. It was in config/providers.py, which redefined a constant injected into the factory at runtime.
That file was never retrieved.
This is exactly the problem: real debugging routinely requires context that sits 3-5 hops away in the dependency graph, not just the files that look textually similar.
Chronos had all the signal it needed to guess. It just did not have the signal it needed to understand.
What "Relevant" Actually Means
Most retrieval pipelines use similarity as the core heuristic. You embed the query, embed the context, and pull the top k most similar items. That works when the answer lives near the question.
In debugging, it almost never does.
Our analysis of 5,000 real-world debugging scenarios found that bugs are typically located within 3-5 hops of the error location in the dependency graph, but share minimal textual similarity (average cosine similarity < 0.3).
Bugs are distributed across time, structure, and intent. They rarely live in the same file as the error. Often, they do not even share function ancestry. The test that fails may have no direct connection to the cause.
We learned this the hard way. Again and again, Chronos pulled in the "right" files. It fixed the symptom. It passed the test. But it missed the reason the bug happened in the first place.
This was not a hallucination problem. It was a context design failure.
Debugging Is Not Document Lookup
In QA settings, relevance means proximity. In documentation, relevance means topical similarity. But in debugging, relevance means causality. It means identifying the code that made the system behave differently from what was intended.
That requires more than matching strings. It requires navigating paths.
Consider a concrete example from our evaluation: a null pointer exception at PaymentProcessor.java:142. Vector search retrieves the top-k chunks around line 142 by embedding similarity, missing that customerAccount is null because its initialization happens in AccountService.java. It also fails to connect that a recent commit changed the timeout in config/payment.yml from 30s to 5s, which is what caused the null in the first place.
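To see how little the error surface and the root cause can share textually, here is a toy comparison. The strings are hypothetical and TF-IDF stands in for the real embedding model, but the effect is the same: almost no lexical overlap, so similarity-based retrieval has nothing to grab onto.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# What the retriever sees: the error surface.
error_text = "NullPointerException: customerAccount is null at PaymentProcessor.java:142"

# What actually explains the failure: a one-line config change from a recent commit.
root_cause = "payment.yml: connection timeout reduced from 30s to 5s"

vectors = TfidfVectorizer().fit_transform([error_text, root_cause])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]

# Almost no shared tokens, so the score lands near zero, well below the
# sub-0.3 average similarity between bugs and root causes noted earlier.
print(f"cosine similarity: {score:.2f}")
```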
And our early retrieval systems were built for flat answers, not relational structure. They reduced codebases to bags of tokens. They treated logic trees as searchable text. They retrieved content that looked useful instead of content that explained what broke.
This is where we began to pivot.
Enter AGR: Retrieval as Traversal
We rebuilt our system around a simple premise: debugging is not about finding the most similar code. It is about reconstructing the path that caused a failure.
AGR (Adaptive Graph-Guided Retrieval) replaced our flat vector store with a multi-signal code graph that achieves 92% precision at 85% recall. Its nodes represent functions, files, tests, logs, commits, and documentation entries. Its edges encode real structure rather than textual proximity: imports and dependencies, call relationships, test coverage, stack-trace paths, and time-decayed commit co-occurrence, each with its own confidence weight (the full weighting scheme is broken out in the next section).
Instead of scoring each file individually, AGR starts at the error surface and walks outward. It uses adaptive k-hop expansion with O(k log d) complexity, typically exploring 3.7 hops for complex bugs versus 1.2 for simple ones.
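As a rough sketch of the idea, not the production algorithm (the node names, thresholds, and confidence rule below are illustrative assumptions): expansion follows the highest-confidence edges outward from the error node and stops when nothing credible remains, so depth adapts to the bug rather than being a fixed top-k.

```python
import heapq

def adaptive_expand(graph, error_node, min_confidence=0.45, node_budget=127):
    # graph maps node -> [(neighbor, edge_weight), ...]; edge weights are the
    # per-signal confidences (stack-trace paths near 0.97, imports near 0.9).
    # A path's confidence is the product of its edge weights. We always expand
    # the most confident frontier node and stop once nothing credible remains
    # or the budget is spent, so simple bugs stay shallow and hard ones go deep.
    best = {error_node: 1.0}
    frontier = [(-1.0, 0, error_node)]       # (negated confidence, hops, node)
    visited, retrieved = set(), []
    while frontier and len(retrieved) < node_budget:
        neg_conf, hops, node = heapq.heappop(frontier)
        conf = -neg_conf
        if node in visited:
            continue
        if conf < min_confidence:            # nothing credible left to explore
            break
        visited.add(node)
        retrieved.append((node, conf, hops))
        for neighbor, weight in graph.get(node, []):
            new_conf = conf * weight
            if new_conf > best.get(neighbor, 0.0):
                best[neighbor] = new_conf
                heapq.heappush(frontier, (-new_conf, hops + 1, neighbor))
    return retrieved

# Hypothetical slice of a code graph around a failing test.
code_graph = {
    "test:test_checkout":        [("log:payment timeout", 0.95)],
    "log:payment timeout":       [("PaymentProcessor.process", 0.92)],
    "PaymentProcessor.process":  [("AccountService.load", 0.85),
                                  ("config/payment.yml", 0.90)],
    "AccountService.load":       [("config/payment.yml", 0.90)],
}
for node, conf, hops in adaptive_expand(code_graph, "test:test_checkout"):
    print(f"hops={hops}  confidence={conf:.2f}  {node}")
```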
The result was not just better patches. It was better reasoning.
Chronos now retrieves an average of 127 nodes versus 500+ for flat top-k approaches, reducing token usage by 65% while improving fix accuracy from 23.4% to 87.1%.
The Graph That Changed Everything
AGR's graph combines eight distinct signal types into a single traversable structure:
AST parent/child relationships (weight: 0.8)
Import/dependency edges (weight: 0.9)
Function call graphs (weight: 0.85)
Test coverage mappings (weight: 0.95)
Log emission points (weight: 0.92)
Stack trace paths (weight: 0.97)
Commit co-occurrence (weight: 0.7 × e^(-λt))
PR review comments (weight: 0.6)
No existing system combines more than 2-3 of these signals. This multi-modal fusion enables AGR to trace paths like: test failure → log entry → emitting function → upstream caller → config error → fix location.
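As a rough illustration of what that fusion looks like in practice, here is a hedged sketch: the nodes and edges are hypothetical, the weights are borrowed from the list above, and networkx is used purely for illustration. Tracing the most confident path is equivalent to a shortest-path query once each weight is converted to a cost.

```python
import math
import networkx as nx

# Time-decayed commit co-occurrence, per the 0.7 * e^(-lambda * t) weight above.
def commit_weight(age_days, lam=0.01):
    return 0.7 * math.exp(-lam * age_days)

G = nx.DiGraph()
edges = [
    # (source, target, signal type, confidence weight)
    ("test:test_checkout",       "PaymentProcessor.process", "test coverage",     0.95),
    ("log:payment timeout",      "AccountService.load",      "log emission",      0.92),
    ("PaymentProcessor.process", "AccountService.load",      "call graph",        0.85),
    ("AccountService.load",      "config/payment.yml",       "import/dependency", 0.90),
    ("PaymentProcessor.process", "config/payment.yml",       "commit co-change",  commit_weight(14)),
]
for src, dst, signal, weight in edges:
    G.add_edge(src, dst, signal=signal, weight=weight)

# Trace the most confident explanation path from the failing test to the
# suspected fix location: maximizing the product of weights is the same as
# minimizing the sum of -log(weight).
path = nx.shortest_path(
    G, "test:test_checkout", "config/payment.yml",
    weight=lambda u, v, data: -math.log(data["weight"]),
)
print(" -> ".join(path))
```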
Failure Made Us Reconsider Relevance
For months, we chased patch quality by tuning generation. We tried prompt variants. We changed decoding temperatures. We added diff examples and patch templates. None of it worked consistently.
What finally shifted performance was giving Chronos the right ingredients.
Our ablation studies show that removing AGR causes debugging success to drop from 67.3% to 42.6%, a catastrophic fall of 24.7 percentage points. Each component contributes measurably:
Removing AST edges: -18.3% performance
Removing test trace edges: -15.5% performance
Fixed k=3 instead of adaptive: -19.9% performance
No confidence weighting: -11.9% performance
The model began fixing bugs it previously failed on. Not because it got smarter. But because it saw the real picture.
What Benchmarks Hid from Us
On standard coding benchmarks, our old retrieval system scored well. HumanEval: 89.7%. MBPP: 88.9%. But on our Multi Random Retrieval (MRR) benchmark with 5,000 real-world scenarios, it achieved only 31.7% success.
The difference was that benchmarks are designed for closure. They are finite and curated. Real bugs are not. They are systemic. The bug is not the file. It is the way the file fits into everything else.
MRR specifically tests this by:
Scattering relevant debugging information across 10-50 files
Requiring 3-7 retrieval steps to gather complete context
Including 70% plausible but irrelevant content as noise
Spanning multiple commits across 3-12 months
Chronos needed to learn that. And our retrieval had to stop pretending that relevance was enough.
The Numbers That Matter
AGR's superiority isn't just theoretical. On MRR, it achieves:
89.2% precision (vs 55.2% for GPT-4.1 + RAG)
84.7% recall (vs 42.3% for GPT-4.1 + RAG)
67.3% fix accuracy (vs 13.8% for GPT-4.1 + RAG)
31.2K average tokens (vs 89K for flat retrieval)
47ms retrieval for cached patterns (vs 3.2min cold start)
Final Thoughts
Most debugging failures we encountered were not due to poor generation. They were due to incomplete understanding of what to retrieve.
Top-k gives you neighbors. But debugging needs paths. Flat retrieval gives you locality. But debugging demands lineage.
Chronos only started improving when we stopped asking, "What's near this error?" and started asking, "What flows into it?"
The difference seems small. But in debugging, it is everything.
Today, AGR enables Chronos to achieve 4-5x better debugging performance than state-of-the-art models. The key wasn't a bigger model or better prompts. It was understanding that debugging is fundamentally about tracing causal paths, not finding similar text.
If you want to fix a system, you need to understand how it behaves, not just where it breaks. Retrieval is the first place that understanding needs to show up. We just took too long to see it.