An excellent critical friend of mine challenged me to consider the potential of retrieval-augmented generation (RAG) to reduce hallucinations in large language model (LLM)-generated text. Because if I’m gonna diss it, I need to understand it. So I had a look.
I’m not a software person, and I don’t want to pretend I understand anything better than I do. So this is not a programmer’s explanation. It is a laylady’s[1] explanation of the underlying logic of RAG.
Once a user has submitted a prompt (that is, a request for output) to the algorithm, RAG has two phases:
Phase 1: Retrieval
The LLM searches for and retrieves snippets of information relevant to the user’s prompt. This search draws on the dataset accessible to the LLM.
Phase 2: Generation
The LLM creates a sort of convincing amalgam of the relevant snippets, then produces a text output phrased as a response to the prompt.
If that sounds like how you already thought LLMs worked, that’s because it is.
Here’s the difference.
Figure 1. Prompting an LLM[2]
Figure 2. Prompting an LLM with RAG
That’s right — you have to give it the data you want it to find, and then it will bring it back to you. Like a good dog.
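For the code-curious, that whole dance boils down to something like the toy sketch below. Nothing in it is anyone’s real API: `search_my_documents` and `call_llm` are placeholders I made up, and the keyword matching is a crude stand-in for whatever the retrieval step actually does. The point is just the shape: fetch from the data you supplied, then hand the fetched bits to the model to compose an answer from.

```python
# Toy sketch of the two RAG phases. All names here are illustrative placeholders.

def search_my_documents(prompt: str, documents: list[str]) -> list[str]:
    """Phase 1: Retrieval. Pull back snippets that look relevant to the prompt."""
    words = set(prompt.lower().split())
    # Crude keyword overlap stands in for whatever retrieval the real layer does.
    return [doc for doc in documents if words & set(doc.lower().split())]

def call_llm(text: str) -> str:
    """Placeholder for the model call; a real setup would hit an LLM API here."""
    return f"[generated answer conditioned on: {text[:60]}...]"

def rag_answer(prompt: str, documents: list[str]) -> str:
    snippets = search_my_documents(prompt, documents)                  # Phase 1
    stuffed = prompt + "\n\nUse these snippets:\n" + "\n".join(snippets)
    return call_llm(stuffed)                                           # Phase 2: Generation

# You supply the data it is allowed to fetch from.
my_documents = [
    "Aspirin was first synthesised in 1897.",
    "Cats are obligate carnivores.",
]
print(rag_answer("When was aspirin first synthesised?", my_documents))
```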
How this is done can vary a bit, of course. I asked a nerd I know, and he explained:
"What RAG does is put an inference-time layer in between the LLM and your data, which usually involves either (if it’s good) converting the prompt to a search query and using something like SQL or JSON on the database or (if it’s bad) sending the LLM to wander across unstructured data with slightly better-defined parameters."
Those familiar with RAG would at this point exclaim that RAG’s standout feature is that it enables the LLM to retrieve “real time” information. That is, the LLM can go find information that is up to date and bring it faithfully to the user.
You know, like a search engine.
Except it’s not like a web search, in two very important ways.
Way one
A web search engine (Google, Baidu, Bing, DuckDuckGo, etc.) draws on its index of the web (that is, all the web pages findable via HTTP) and then tells you what it found. Web pages are generally datestamped: you can tell when they were created, when they were modified, and what was modified.
An LLM draws on the data it has in its dataset. How current that data is depends on what sources are available to it. If the live indexed web is available to it, then you have a real-time dataset. If not, then the search is only going to be as current as the sources are. For example, ChatGPT’s training data apparently only goes up to 2023 at the moment.
Way two
A web search engine straight up tells you what it found. How it presents its results depends on how the search engine was designed, and of course Google is becoming less and less of a search engine, and more and more of an ad catalogue (but then, so is ChatGPT).
An LLM doesn’t do that. It “transforms” the search results into a kind of composite that it calculates as most acceptable to the prompter. It doesn’t present any of the results as they are; it generates something that, for all the prompter knows, might as well be them.
Figure 3. An LLM makes a composite cat

If the dataset is big and contains enough quality examples, the transformed version will usually sound convincing. But there’s always a chance it’ll misinterpret its data, or draw on enough bad examples to spit out something that is not a satisfactory answer.
We call that “hallucinating”. It’s not. It just means the LLM has said something that’s not true. But the LLM doesn’t know that. It doesn’t know anything. It did exactly, and I mean exactly, what you asked — based on your prompt, it reached into its dataset and transformed what it found into a composite for you.
But with RAG, the LLM is looking at a better subset of available data.
You know, like specifically stuff from a particular relevant time period. Or stuff that is properly tagged as credible.
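In code terms, that “better subset” is often just a metadata filter applied before anything reaches the model. A minimal sketch, with field names I have made up for illustration:

```python
# Filter candidate snippets on metadata before the generation phase.
# The field names ("published", "credible") are illustrative assumptions.
from datetime import date

snippets = [
    {"text": "Guideline updated in 2024: dose X recommended.",
     "published": date(2024, 3, 1), "credible": True},
    {"text": "Random forum post about dose Y.",
     "published": date(2018, 6, 5), "credible": False},
]

def better_subset(snippets, earliest, credible_only=True):
    return [
        s for s in snippets
        if s["published"] >= earliest and (s["credible"] or not credible_only)
    ]

# Only recent, credible material survives to be composited.
print(better_subset(snippets, earliest=date(2023, 1, 1)))
```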
It doesn’t prevent “hallucinations”, because that’s impossible.
There is no way for an algorithm that doesn’t know what truth is to know whether the things it says are true.
Anyway, if this all sounds like a gimme, let me clarify: the current commercial models, like ChatGPT and Claude, do not use RAG out of the box. They draw only from their global training dataset. Developers need to set up RAG themselves, by linking the LLM to an additional, better dataset. So you could hook it up to a medical database and produce all sorts of amazing diagnostics.
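For what it’s worth, “setting up RAG yourself” roughly means picking that better dataset and building the retrieval step that links it to the model. Real pipelines usually use learned embeddings and a vector store; the dependency-free bag-of-words similarity below, and the tiny made-up “medical” corpus, are just stand-ins for the idea.

```python
# Sketch of linking a model to your own (better) dataset. Real pipelines use
# learned embeddings and a vector store; this bag-of-words cosine similarity
# and the tiny corpus are stand-ins so the example stays self-contained.
import math
from collections import Counter

def vectorise(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

my_better_dataset = [
    "Hypertension guideline, 2024 revision: first-line treatment is ...",
    "Case study: atrial fibrillation presenting as fatigue.",
]

def retrieve(prompt: str, corpus: list[str], k: int = 1) -> list[str]:
    query = vectorise(prompt)
    return sorted(corpus, key=lambda doc: cosine(query, vectorise(doc)), reverse=True)[:k]

# These snippets, not the open web, are what the model gets to compose from.
print(retrieve("What is the first-line treatment for hypertension?", my_better_dataset))
```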
It makes sense not to have RAG going in commercial models. We have this funky reticence about giving OpenAI access to all the health information in the world, so ChatGPT doesn’t just have your cardiovascular history.
However.
If you have a better dataset.
And you don’t want the RAG-augmented LLM to hallucinate at you.
You could also.
Use a search engine.
You could use any of the search tools we already have.
Because they won’t give you a composite.
They will give you the data you are looking for.
[1] Don’t give me that. Laylady is a word now, and you already know what it means.
[2] Of course the mechanics of the LLM and its transformer processes are totally not shown in this diagram, because that’s not what this post is about. If you’re like me, your brain glazes over when these explanations start getting particularly sophisticated. However, you might like to read this conversation with data scientist Colin Fraser to understand a bit more about how LLMs generate outputs, including “hallucinations”.