
The Retrieval Problem No One Talked About
How early RAG pipelines gave us confidence in the wrong things and what broke when “relevant chunks” weren’t enough.

Kodezi Team
Jul 23, 2025
At the beginning of the Chronos project, we treated retrieval like an infrastructure problem. It was not glamorous. It was not new. You build a vector store, embed your files, drop a similarity model on top, and pick the top few chunks that match your input. Simple. Efficient. Everywhere.
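For the curious, here is a minimal sketch of that kind of flat pipeline. Everything in it is illustrative: the embedding function is a toy bag-of-tokens stand-in so the example runs end to end, and the file snippets are invented.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy stand-in for a real embedding model: a normalized bag-of-tokens vector.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def top_k(query: str, chunks: dict[str, str], k: int = 3) -> list[tuple[str, float]]:
    # Score every chunk by cosine similarity to the query and keep the best k.
    q = embed(query)
    scored = [(path, float(embed(text) @ q)) for path, text in chunks.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

chunks = {
    "auth/session.py": "def build_session(provider): ...  # factory raises at runtime",
    "config/providers.py": "SESSION_PROVIDER = 'legacy'",
    "tests/test_auth.py": "def test_session(): assert build_session(provider)",
}

# The file that shares tokens with the error wins; the file that caused the
# error, sharing none, never surfaces.
print(top_k("Error in build_session, auth/session.py", chunks))
```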
It worked in our Q&A demos. It worked in internal tests. It worked in codebase summarization and documentation alignment.
And yet, it failed when it mattered most.
When Chronos needed to diagnose real bugs, this pipeline produced confident, well-formatted, utterly incorrect patches. Fixes that passed syntax checks. Fixes that passed the failing tests. Fixes that quietly broke everything else.
The retrieval had done its job. It had returned the top-k most “relevant” files based on cosine similarity to the stack trace or error string. The model saw exactly what we told it to look at. And that was the problem.
Retrieval Was the First Illusion of Progress
We saw it early, but we ignored it. One of the first CI failures Chronos was tested on involved a bug in a dependency injection chain. The error happened inside a factory function in auth/session.py. The retriever pulled in that file. It pulled in the stack trace. It pulled in the unit test that failed.
Chronos patched it within seconds. The patch was valid Python. It was even conceptually correct if you looked only at the file in isolation. But the bug was not in that file. It was in config/providers.py, which redefined a constant injected into the factory at runtime.
That file was never retrieved.
Chronos had all the signal it needed to guess. It just did not have the signal it needed to understand.
What "Relevant" Actually Means
Most retrieval pipelines use similarity as the core heuristic. You embed the query, embed the candidate chunks, and pull the top-k most similar items. That works when the answer lives near the question.
In debugging, it almost never does.
Bugs are distributed across time, structure, and intent. They rarely live in the same file as the error. Often, they do not even share function ancestry. The test that fails may have no direct connection to the cause.
We learned this the hard way. Again and again, Chronos pulled in the “right” files. It fixed the symptom. It passed the test. But it missed the reason the bug happened in the first place.
This was not a hallucination problem. It was a context design failure.
Debugging Is Not Document Lookup
In QA settings, relevance means proximity. In documentation, relevance means topical similarity. But in debugging, relevance means causality. It means identifying the code that made the system behave differently from what was intended.
That requires more than matching strings. It requires navigating paths.
And our early retrieval systems were built for flat answers, not relational structure. They reduced codebases to bags of tokens. They treated logic trees as searchable text. They retrieved content that looked useful instead of content that explained what broke.
This is where we began to pivot.
Placeholder: prompt
“A vertical split-screen view. Left side: ‘Traditional RAG Pipeline’ shows cosine similarity scores for five code files matched against an error message. Right side: ‘Actual Causal Chain’ shows a multi-step call graph leading to a hidden config file. Visual focus is on the mismatch between what was retrieved and what was needed.”
Enter AGR: Retrieval as Traversal
We rebuilt our system around a simple premise: debugging is not about finding the most similar code. It is about reconstructing the path that caused a failure.
AGR — Adaptive Graph-Guided Retrieval — replaced our flat vector store with a code graph. This graph is made of nodes that represent functions, files, tests, logs, commits, and documentation entries. The edges encode real structure: call chains, import relationships, commit co-diffs, test ancestry, and semantic similarity.
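To make the shape of that graph concrete, here is a toy version built with networkx. The node identifiers, edge relations, and the commit hash are invented for illustration; they are not Chronos internals.

```python
import networkx as nx

# A tiny code graph: nodes are artifacts, edges are structural relationships.
g = nx.MultiDiGraph()

g.add_node("auth/session.py::build_session", kind="function")
g.add_node("config/providers.py::SESSION_PROVIDER", kind="constant")
g.add_node("tests/test_auth.py::test_session", kind="test")
g.add_node("commit:4f2a9c", kind="commit")

# Edges encode real structure rather than textual similarity.
g.add_edge("tests/test_auth.py::test_session",
           "auth/session.py::build_session", relation="calls")
g.add_edge("auth/session.py::build_session",
           "config/providers.py::SESSION_PROVIDER", relation="reads_config")
g.add_edge("commit:4f2a9c",
           "config/providers.py::SESSION_PROVIDER", relation="co_diff")
```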
Instead of scoring each file individually, AGR starts at the error surface and walks outward. It traces paths through the graph, scores based on traversal depth, signal types, and memory strength, and retrieves based on how connected a node is to the actual failure, not how similar it looks to the query.
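A rough sketch of that traversal idea, continuing the toy graph above: walk outward from the node the stack trace points at, decay scores with depth, and weight edges by their relation type. The weights and decay constant are arbitrary placeholders, not AGR's actual scoring.

```python
from collections import deque

EDGE_WEIGHT = {"calls": 1.0, "reads_config": 0.9, "co_diff": 0.7}
DECAY = 0.6  # each additional hop contributes less

def walk_from(graph, error_nodes, max_depth=3):
    # Breadth-first expansion from the error surface, keeping the best score per node.
    scores = {n: 1.0 for n in error_nodes}
    frontier = deque((n, 0) for n in error_nodes)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        # Follow edges in both directions: the cause often sits upstream of the error.
        neighbours = [(v, d) for _, v, d in graph.out_edges(node, data=True)]
        neighbours += [(u, d) for u, _, d in graph.in_edges(node, data=True)]
        for nbr, data in neighbours:
            gain = scores[node] * DECAY * EDGE_WEIGHT.get(data.get("relation"), 0.5)
            if gain > scores.get(nbr, 0.0):
                scores[nbr] = gain
                frontier.append((nbr, depth + 1))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# The config constant is surfaced even though its text shares nothing
# with the stack trace: it is connected, not similar.
print(walk_from(g, ["auth/session.py::build_session"]))
```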
The result was not just better patches. It was better reasoning.
Chronos began retrieving fewer files. But those files were the ones that mattered. The model started producing fewer guesses. The fixes became specific, structural, and explainable.
Placeholder: prompt
“A full-screen code graph visualization. Nodes represent files and functions. Orange lines trace a traversal from a stack trace node to a config function three hops away. Adjacent is a collapsed RAG list showing unrelated files with high similarity scores but no causal connection.”
Failure Made Us Reconsider Relevance
For months, we chased patch quality by tuning generation. We tried prompt variants. We changed decoding temperatures. We added diff examples and patch templates. None of it worked consistently.
What finally shifted performance was giving Chronos the right ingredients.
And the right ingredients meant context that explained failure.
We built path-based retrievers. We tagged commit edges with timestamps and test impacts. We gave higher weights to artifacts mentioned in CI logs. We allowed context expansion based on output behavior, not just input match.
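As a rough illustration of that kind of weighting (the constants, relation names, and the 30-day decay window are invented for the sketch, not Chronos's actual heuristics):

```python
import math
import time

def edge_weight(relation, commit_timestamp=None):
    # Structural edges start from a base weight by relation type.
    base = {"calls": 1.0, "imports": 0.8, "test_ancestry": 0.9, "co_diff": 0.7}.get(relation, 0.5)
    # Commit co-change edges decay with age: recent co-diffs count for more.
    if relation == "co_diff" and commit_timestamp is not None:
        age_days = (time.time() - commit_timestamp) / 86400
        base *= math.exp(-age_days / 30)
    return base

def node_boost(node_path, ci_log):
    # Artifacts mentioned in the failing CI run are weighted up.
    return 1.5 if node_path.split("::")[0] in ci_log else 1.0
```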
The model began fixing bugs it previously failed on. Not because it got smarter. But because it saw the real picture.
What Benchmarks Hid from Us
On standard debugging benchmarks like SWE-bench or QuixBugs, our old retrieval system scored well. It often pulled in the bug file, the failing test, and the error message. The model could then stitch together a valid fix.
But on real CI regressions, it broke.
The difference was that benchmarks are designed for closure. They are finite and curated. Real bugs are not. They are systemic. The bug is not the file. It is the way the file fits into everything else.
Chronos needed to learn that. And our retrieval had to stop pretending that relevance was enough.
Final Thoughts
Most debugging failures we encountered were not due to poor generation. They were due to incomplete understanding of what to retrieve.
Top-k gives you neighbors. But debugging needs paths. Flat retrieval gives you locality. But debugging demands lineage.
Chronos only started improving when we stopped asking, “What’s near this error?” and started asking, “What flows into it?”
The difference seems small. But in debugging, it is everything.
If you want to fix a system, you need to understand how it behaves, not just where it breaks.
Retrieval is the first place that understanding needs to show up.
We just took too long to see it.