RAG Explained: How to Connect Your Own Documents to an LLM
One of the most common problems when working with LLMs in production is that the model doesn't know your specific information: your internal documents, your knowledge base, your company's data. Retraining the model to learn that information is expensive and impractical. RAG solves this problem elegantly.
RAG stands for Retrieval-Augmented Generation. It's a technique that combines a search system with an LLM: instead of asking the model to recall information from its training, you provide the relevant fragments at query time. The model generates its response based on what you just gave it — not on what it memorized during training.
Why RAG and Not Fine-Tuning
Before getting into the implementation, it's worth clarifying when to use RAG versus fine-tuning, because this is a frequent source of confusion.
Fine-tuning means continuing to train the model on your data so that the model internalizes it. It makes sense when you want to change the model's style or behavior, not when you want it to know specific information.
RAG is the right choice when:
- Your documents change frequently
- You need the model to cite specific sources
- You want precise control over what information the model uses
- The volume of information is too large to fit in a single context window
For most enterprise use cases — chatbots over documentation, knowledge base assistants, contract search — RAG is the correct solution. OpenAI's team documented this in detail in their guide on RAG vs fine-tuning.
How RAG Works Step by Step
A RAG system has three phases:
1. Indexing (done once per document, and again whenever a document changes)
- Split your documents into chunks
- Convert each chunk into an embedding (a numerical vector)
- Store those vectors in a vector database
2. Retrieval (on each query)
- Convert the user's question into an embedding
- Search for the most similar chunks in the database
- Retrieve the N most relevant chunks
3. Generation
- Build a prompt that includes the retrieved chunks
- Send that prompt to the LLM
- The model generates a response based on those chunks
The key is the embeddings: they are numerical representations of the meaning of text. Two phrases with similar meaning have embeddings that are close together in vector space, even if they use different words. This enables searching by meaning, not just exact keywords.
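To make "close together in vector space" concrete, here is a minimal sketch using only the standard library. The three-dimensional vectors are hand-made for illustration and are not the output of any embedding model (real embeddings have hundreds or thousands of dimensions); the helper name `cosine` is also just an assumption for this example.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean 'same direction'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (NOT real model output), chosen only for illustration.
toy_embeddings = {
    "How do I return an item?": [0.9, 0.1, 0.2],
    "What is your refund policy?": [0.8, 0.2, 0.3],
    "When will my package arrive?": [0.1, 0.9, 0.3],
}

query = "How do I return an item?"
for text, vector in toy_embeddings.items():
    score = cosine(toy_embeddings[query], vector)
    print(f"{score:.3f}  {text}")
```

In this toy setup the refund sentence scores much higher against the query than the package sentence does, even though "return" and "refund" are different words. That is the behavior a real embedding model is expected to provide.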
Implementation with Python
Let's build a minimal functional RAG system. We'll use:
- openai for embeddings and generation
- numpy for vector operations (no external database, to keep it simple)
- tiktoken for counting tokens
Install the dependencies:
pip install openai numpy tiktoken
Step 1: Prepare the Documents
In a real system, documents would come from PDFs, web pages, or databases. Here we use text directly to keep the example clear:
import os

import numpy as np
from openai import OpenAI

# The API key is read from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

documents = [
    "Our return policy allows returning any product within 30 days with proof of purchase.",
    "Standard shipping takes 3 to 5 business days. Express shipping arrives within 24 hours.",
    "Technical support is available Monday through Friday from 9am to 6pm via email and phone.",
    "Digital products cannot be returned once downloaded.",
    "Premium customers get free shipping on all orders over $50."
]
Step 2: Generate Embeddings
OpenAI offers dedicated embedding models. The text-embedding-3-small model is cost-effective and powerful enough for most use cases:
def generate_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Index all documents
document_embeddings = []
for doc in documents:
    embedding = generate_embedding(doc)
    document_embeddings.append(embedding)

print(f"Indexed {len(document_embeddings)} documents.")
print(f"Embedding dimensions: {len(document_embeddings[0])}")
Step 3: Similarity Search Function
We use cosine similarity to find the most relevant documents for a query:
def cosine_similarity(vec1: list, vec2: list) -> float:
    a = np.array(vec1)
    b = np.array(vec2)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_relevant_documents(query: str, top_k: int = 2) -> list[str]:
    query_embedding = generate_embedding(query)
    # Score every indexed document against the query
    similarities = []
    for i, doc_embedding in enumerate(document_embeddings):
        sim = cosine_similarity(query_embedding, doc_embedding)
        similarities.append((sim, documents[i]))
    # Highest similarity first, then keep the top_k documents
    similarities.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in similarities[:top_k]]
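The ranking logic can also be exercised without calling the embeddings API by passing vectors in directly. The following is a minimal sketch using only the standard library; `rank_chunks`, `toy_index`, and the two-dimensional vectors are hypothetical names and values made up for this example, not part of the system above.

```python
import heapq
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_chunks(query_vec: list[float],
                index: list[tuple[str, list[float]]],
                top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k (score, text) pairs, highest similarity first."""
    scored = ((cosine(query_vec, vec), text) for text, vec in index)
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])

# Hand-made toy index (NOT real embeddings): (text, vector) pairs.
toy_index = [
    ("returns policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("digital returns", [0.8, 0.6]),
]

for score, text in rank_chunks([1.0, 0.1], toy_index, top_k=2):
    print(f"{score:.3f}  {text}")
```

Returning the scores alongside the texts makes it easier to spot queries where even the best match is weak, which is often a sign that the answer is not in the indexed documents.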
Step 4: Generation with Retrieved Context
This is the core of RAG: building a prompt that includes the retrieved fragments before the user's question:
def answer_with_rag(question: str) -> str:
    # Retrieve the most relevant chunks
    chunks = find_relevant_documents(question, top_k=2)
    context = "\n".join(f"- {chunk}" for chunk in chunks)

    # Build the prompt with context
    system_prompt = (
        "Answer the user's question based solely on "
        "the information provided in the context below. "
        "If the information is not in the context, say you don't have that information. "
        "Do not make up data."
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}"

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    return response.choices[0].message.content

# Test the system
questions = [
    "Can I return a product after 20 days?",
    "How long does express shipping take?",
    "Is support available on weekends?"
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {answer_with_rag(question)}")
    print()
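If you need the model to cite specific sources, one possible approach is to number the retrieved chunks in the prompt and ask for those numbers in the answer. The sketch below only assembles the messages and makes no API call; the function name `build_messages` and the bracketed `[n]` convention are assumptions for illustration, and a model will not necessarily follow the citation instruction every time.

```python
def build_messages(question: str, chunks: list[str]) -> list[dict[str, str]]:
    """Assemble chat messages with numbered context chunks so answers can cite [n]."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    system_prompt = (
        "Answer the user's question based solely on the context below. "
        "Cite the bracketed number of each chunk you rely on. "
        "If the information is not in the context, say you don't have it."
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages(
    "Can I return a product after 20 days?",
    [
        "Our return policy allows returning any product within 30 days with proof of purchase.",
        "Digital products cannot be returned once downloaded.",
    ],
)
print(messages[1]["content"])
```

Keeping prompt assembly in its own function also makes it easy to inspect exactly what the model receives before any tokens are spent.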
Chunking: How to Split Large Documents
In the example above each document is a single sentence. In practice, you'll work with PDFs or long texts that need to be split into chunks before indexing.
Your chunking strategy directly affects system quality. Key considerations:
Chunk size: between 200 and 500 tokens is a common range. Chunks that are too small lose context; chunks that are too large include too much noise.
Overlap: chunks should overlap slightly (50-100 tokens) to avoid cutting relevant information right at a boundary.
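import tiktoken

def split_into_chunks(text: str, max_tokens: int = 300, overlap: int = 50) -> list[str]:
    # Guard against a window that never advances
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    encoder = tiktoken.encoding_for_model("gpt-4o")
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunk_text = encoder.decode(chunk_tokens)
        chunks.append(chunk_text)
        # Stop once the text is exhausted, so no trailing chunk made only of overlap is emitted
        if end == len(tokens):
            break
        # Step forward, keeping `overlap` tokens shared with the previous chunk
        start += max_tokens - overlap
    return chunks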
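To see how chunk size and overlap interact, it helps to look at the token positions alone. The sketch below is a hypothetical helper (`chunk_windows` is not part of the system above) that computes the start and end offsets of each chunk without needing a tokenizer. It assumes the window advances by `max_tokens - overlap` tokens each step and stops as soon as the end of the text is reached.

```python
def chunk_windows(n_tokens: int, max_tokens: int = 300, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for overlapping chunks over n_tokens tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    stride = max_tokens - overlap  # how far each new chunk advances
    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return windows

print(chunk_windows(700))
# → [(0, 300), (250, 550), (500, 700)]
```

A 700-token document therefore yields three chunks with the defaults, and each neighboring pair shares 50 tokens. Increasing the overlap reduces the stride, which raises the chunk count and the indexing cost.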
Vector Databases for Production
The system we built stores embeddings in memory, which works for prototypes but doesn't scale. For production you need a vector database.
The most widely used options:
- Pinecone: managed service, easy to integrate, free tier available
- Chroma: open source, runs locally, ideal for getting started
- Weaviate: open source with cloud option, very feature-complete
- pgvector: PostgreSQL extension for storing vectors — great if you already use Postgres
Chroma is the simplest for a first real project:
pip install chromadb
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection("my_documents")

# Index documents
collection.add(
    documents=documents,
    embeddings=document_embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

# Search
results = collection.query(
    query_embeddings=[generate_embedding("return policy")],
    n_results=2
)
print(results["documents"])
RAG Limitations You Should Know
RAG is not perfect. The most common production issues:
Incorrect retrieval: if the system retrieves irrelevant chunks, the model generates incorrect answers with full confidence. The quality of chunking and embeddings is critical.
Questions requiring global synthesis: RAG works poorly when the answer requires combining information scattered across many chunks. For these queries, other architectures like MapReduce over documents work better.
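As a rough illustration of the MapReduce idea, the sketch below processes chunks in batches (map) and then combines the partial results into one answer (reduce). Everything here is hypothetical: `map_reduce_answer`, `stub_map`, and `stub_reduce` are names invented for this example, and the stubs stand in for what would normally be LLM calls.

```python
from typing import Callable

def map_reduce_answer(
    question: str,
    chunks: list[str],
    map_fn: Callable[[str, list[str]], str],
    reduce_fn: Callable[[str, list[str]], str],
    batch_size: int = 3,
) -> str:
    """Map: derive a partial note from each batch of chunks. Reduce: merge the notes."""
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    partial_notes = [map_fn(question, batch) for batch in batches]
    return reduce_fn(question, partial_notes)

# Stubs standing in for LLM calls (assumption: in practice each would prompt a model).
def stub_map(question: str, batch: list[str]) -> str:
    return f"{len(batch)} chunk(s) read"

def stub_reduce(question: str, notes: list[str]) -> str:
    return " | ".join(notes)

print(map_reduce_answer("Summarize all policies", ["a", "b", "c", "d"], stub_map, stub_reduce))
# → 3 chunk(s) read | 1 chunk(s) read
```

Unlike top-k retrieval, every chunk is read at least once, so the cost grows with the size of the corpus rather than with the number of retrieved chunks.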
Mixed languages: embeddings work best when the query and documents are in the same language. Mixing languages reduces search quality.
The LangChain documentation covers advanced strategies to address these issues, including re-ranking and query expansion.
Next Steps
This basic system covers the complete RAG flow. To take it to production, the natural next steps are:
- Replace in-memory storage with Chroma or Pinecone
- Implement overlapping chunking for real documents
- Add re-ranking to improve chunk relevance
- Evaluate system quality with a set of reference questions
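For the evaluation step, one simple starting point is a hit rate: the fraction of reference questions for which the expected chunk appears among the top-k retrieved chunks. The sketch below is an assumption-laden example. `hit_rate_at_k`, `keyword_retrieve`, and the tiny corpus are invented for illustration, and the keyword-overlap retriever is only a stand-in so the code runs offline.

```python
from typing import Callable

def hit_rate_at_k(
    reference_set: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 2,
) -> float:
    """Fraction of (question, expected_chunk) pairs whose chunk is in the top-k results."""
    if not reference_set:
        raise ValueError("reference_set must not be empty")
    hits = sum(1 for question, expected in reference_set if expected in retrieve(question, k))
    return hits / len(reference_set)

# Stand-in retriever based on word overlap (NOT embeddings), so the example runs offline.
corpus = [
    "Returns are accepted within 30 days.",
    "Express shipping arrives within 24 hours.",
]

def keyword_retrieve(question: str, k: int) -> list[str]:
    words = set(question.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(words & set(doc.lower().split())), reverse=True)
    return ranked[:k]

reference_set = [
    ("how fast is express shipping", corpus[1]),
    ("are returns accepted", corpus[0]),
]
print(hit_rate_at_k(reference_set, keyword_retrieve, k=1))
# → 1.0
```

With the system built in this article, `find_relevant_documents` could be passed as `retrieve`, since it accepts a query and a top-k value in the same positions. Tracking this number before and after a change to chunking or re-ranking shows whether retrieval actually improved.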
RAG is today one of the most widely used architectures for LLM applications in enterprise environments, precisely because it combines the generative power of models with controlled, up-to-date information.