RAG Explained: How to Connect Your Own Documents to an LLM

One of the most common problems when working with LLMs in production is that the model doesn't know your specific information: your internal documents, your knowledge base, your company's data. Retraining the model to learn that information is expensive and impractical. RAG solves this problem elegantly.

RAG stands for Retrieval-Augmented Generation. It's a technique that combines a search system with an LLM: instead of asking the model to recall information from its training, you provide the relevant fragments at query time. The model generates its response based on what you just gave it — not on what it memorized during training.

Why RAG and Not Fine-Tuning

Before getting into the implementation, it's worth clarifying when to use RAG versus fine-tuning, because this is a frequent source of confusion.

Fine-tuning means further training the model on your data so that it internalizes patterns from that data. It makes sense when you want to change the model's style or behavior, not when you want it to know specific facts.

RAG is the right choice when:

  • Your documents change frequently
  • You need the model to cite specific sources
  • You want precise control over what information the model uses
  • The volume of information is too large to fit in a single context window

For most enterprise use cases — chatbots over documentation, knowledge base assistants, contract search — RAG is the correct solution. OpenAI's team documented this in detail in their guide on RAG vs fine-tuning.

How RAG Works Step by Step

A RAG system has three phases:

1. Indexing (done ahead of time, and repeated when documents change)

  • Split your documents into chunks
  • Convert each chunk into an embedding (a numerical vector)
  • Store those vectors in a vector database

2. Retrieval (on each query)

  • Convert the user's question into an embedding
  • Search for the most similar chunks in the database
  • Retrieve the N most relevant chunks

3. Generation

  • Build a prompt that includes the retrieved chunks
  • Send that prompt to the LLM
  • The model generates a response based on those chunks

The key is the embeddings: they are numerical representations of the meaning of text. Two phrases with similar meaning have embeddings that are close together in vector space, even if they use different words. This enables searching by meaning, not just exact keywords.
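
To make the geometry concrete, here is a toy illustration with hand-written 3-dimensional vectors. The numbers are invented for this example (real embeddings come from a model and have hundreds or thousands of dimensions), but the comparison works the same way:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings (illustration only).
toy_embeddings = {
    "How do I send an item back?": [0.9, 0.1, 0.0],
    "What is your return policy?": [0.8, 0.2, 0.1],
    "When does support open?": [0.1, 0.1, 0.9],
}

query = toy_embeddings["How do I send an item back?"]
for text, vector in toy_embeddings.items():
    print(f"{cosine(query, vector):.2f}  {text}")
```

With these toy values, the two return-related phrases score far closer to each other than either does to the support question, even though they share almost no words.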

Implementation with Python

Let's build a minimal functional RAG system. We'll use:

  • openai for embeddings and generation
  • numpy for vector operations (no external database, to keep it simple)
  • tiktoken for counting tokens

Install the dependencies:

pip install openai numpy tiktoken

Step 1: Prepare the Documents

In a real system, documents would come from PDFs, web pages, or databases. Here we use text directly to keep the example clear:

import os
from openai import OpenAI
import numpy as np

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

documents = [
    "Our return policy allows returning any product within 30 days with proof of purchase.",
    "Standard shipping takes 3 to 5 business days. Express shipping arrives within 24 hours.",
    "Technical support is available Monday through Friday from 9am to 6pm via email and phone.",
    "Digital products cannot be returned once downloaded.",
    "Premium customers get free shipping on all orders over $50."
]

Step 2: Generate Embeddings

OpenAI offers dedicated embedding models. The text-embedding-3-small model is cost-effective and powerful enough for most use cases:

def generate_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Index all documents
document_embeddings = []
for doc in documents:
    embedding = generate_embedding(doc)
    document_embeddings.append(embedding)

print(f"Indexed {len(document_embeddings)} documents.")
print(f"Embedding dimensions: {len(document_embeddings[0])}")
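
The loop above sends one request per document. The embeddings endpoint also accepts a list of strings as input, so a batched variant can cut the number of requests. The sketch below is one possible way to do it, not the only one: the function name generate_embeddings_batch and the batch_size default are assumptions made for illustration (check the current API limits before picking a value), and client is expected to be the OpenAI client created in Step 1.

```python
def generate_embeddings_batch(client, texts: list[str],
                              model: str = "text-embedding-3-small",
                              batch_size: int = 100) -> list[list[float]]:
    """Embed many texts with one API request per batch instead of one per text."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        # Each result carries the index of the input it belongs to; sort to be safe.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

Calling generate_embeddings_batch(client, documents) would then return one vector per document, in the same order as the input list.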

Step 3: Similarity Search Function

We use cosine similarity to find the most relevant documents for a query:

def cosine_similarity(vec1: list, vec2: list) -> float:
    a = np.array(vec1)
    b = np.array(vec2)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_relevant_documents(query: str, top_k: int = 2) -> list[str]:
    query_embedding = generate_embedding(query)

    similarities = []
    for i, doc_embedding in enumerate(document_embeddings):
        sim = cosine_similarity(query_embedding, doc_embedding)
        similarities.append((sim, documents[i]))

    similarities.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in similarities[:top_k]]
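
The Python loop is fine for a handful of documents. For a few thousand, the same ranking can be computed with a single matrix product. The following is a minimal sketch under that assumption; top_k_by_cosine is a hypothetical helper name, and the 2-D vectors are invented stand-ins for real embeddings:

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_matrix, top_k: int = 2):
    """Return (index, score) pairs for the top_k rows of doc_matrix most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Normalize so that a plain dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q                        # one score per document
    best = np.argsort(scores)[::-1][:top_k]  # indices, highest score first
    return [(int(i), float(scores[i])) for i in best]

# Toy 2-D vectors (invented for illustration) in place of real embeddings.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k_by_cosine([1.0, 0.1], toy_docs, top_k=2))
```

Returning indices rather than text keeps the function independent of how the documents themselves are stored.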

Step 4: Generation with Retrieved Context

This is the core of RAG: building a prompt that includes the retrieved fragments before the user's question:

def answer_with_rag(question: str) -> str:
    # Retrieve the most relevant chunks
    chunks = find_relevant_documents(question, top_k=2)
    context = "\n".join(f"- {chunk}" for chunk in chunks)

    # Build the prompt with context
    system_prompt = """Answer the user's question based solely on
the information provided in the context below.
If the information is not in the context, say you don't have that information.
Do not make up data."""

    user_prompt = f"""Context:
{context}

Question: {question}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    return response.choices[0].message.content

# Test the system
questions = [
    "Can I return a product after 20 days?",
    "How long does express shipping take?",
    "Is support available on weekends?"
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {answer_with_rag(question)}")
    print()
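
If you need the model to cite its sources, one option is to number the chunks in the prompt and ask for those numbers back. The sketch below separates prompt construction from the API call so it can be inspected and tested offline. The name build_rag_messages and the citation wording are assumptions for illustration, and a model will not necessarily follow the citation format every time:

```python
def build_rag_messages(question: str, chunks: list[str]) -> list[dict]:
    """Build chat messages whose context is a numbered list of retrieved chunks."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    system_prompt = (
        "Answer the user's question based solely on the numbered context below. "
        "Cite the sources you used as [1], [2], and so on. "
        "If the information is not in the context, say you don't have that information."
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages = build_rag_messages(
    "Can I return a product after 20 days?",
    ["Our return policy allows returning any product within 30 days with proof of purchase."],
)
print(messages[1]["content"])
```

The returned list can be passed as the messages argument of client.chat.completions.create, as in the function above.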

Chunking: How to Split Large Documents

In the example above each document is a single sentence. In practice, you'll work with PDFs or long texts that need to be split into chunks before indexing.

Your chunking strategy directly affects system quality. Key considerations:

Chunk size: between 200 and 500 tokens is a common range. Chunks that are too small lose context; chunks that are too large include too much noise.

Overlap: chunks should overlap slightly (50-100 tokens) to avoid cutting relevant information right at a boundary.

import tiktoken

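def split_into_chunks(text: str, max_tokens: int = 300, overlap: int = 50) -> list[str]:
    # The overlap must be smaller than the chunk size, or the window never advances
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    encoder = tiktoken.encoding_for_model("gpt-4o")
    tokens = encoder.encode(text)

    chunks = []
    start = 0

    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(encoder.decode(tokens[start:end]))
        if end == len(tokens):
            break  # last window reached; avoid a trailing chunk made only of overlap
        start += max_tokens - overlap

    return chunks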
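
To see how chunk size and overlap interact, it helps to look at the token offsets alone. The helper below is a hypothetical illustration (chunk_spans is not part of tiktoken) that computes the sliding-window boundaries without any tokenizer and stops once a window reaches the end of the text:

```python
def chunk_spans(n_tokens: int, max_tokens: int = 300, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for each chunk of a sliding window."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    stride = max_tokens - overlap  # how far the window moves each step
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans

print(chunk_spans(700))  # [(0, 300), (250, 550), (500, 700)]
```

Each window starts 250 tokens after the previous one, so consecutive chunks share 50 tokens. A sentence that straddles a boundary therefore appears whole in at least one chunk, as long as it is shorter than the overlap.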

Vector Databases for Production

The system we built stores embeddings in memory, which works for prototypes but doesn't scale. For production you need a vector database.

The most widely used options:

  • Pinecone: managed service, easy to integrate, free tier available
  • Chroma: open source, runs locally, ideal for getting started
  • Weaviate: open source with cloud option, very feature-complete
  • pgvector: PostgreSQL extension for storing vectors — great if you already use Postgres

Chroma is the simplest for a first real project:

pip install chromadb

import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection("my_documents")

# Index documents
collection.add(
    documents=documents,
    embeddings=document_embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

# Search
results = collection.query(
    query_embeddings=[generate_embedding("return policy")],
    n_results=2
)
print(results["documents"])

RAG Limitations You Should Know

RAG is not perfect. The most common production issues:

Incorrect retrieval: if the system retrieves irrelevant chunks, the model generates incorrect answers with full confidence. The quality of chunking and embeddings is critical.
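
One simple mitigation is to drop chunks whose similarity score is too low and skip generation entirely when nothing remains. This is only a sketch: filter_by_score is a hypothetical helper, and the 0.3 threshold is an arbitrary placeholder that would need tuning against your own data and embedding model.

```python
def filter_by_score(scored_chunks: list[tuple[float, str]],
                    min_score: float = 0.3) -> list[str]:
    """Keep only chunks whose similarity score reaches min_score, best first."""
    kept = [(score, text) for score, text in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept]

# Invented scores for illustration.
scored = [(0.12, "Shipping times"), (0.81, "Return policy"), (0.45, "Digital products")]
print(filter_by_score(scored))
```

If the filtered list comes back empty, the application can reply that it has no relevant information instead of passing weak context to the model.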

Questions requiring global synthesis: RAG works poorly when the answer requires combining information scattered across many chunks. For these queries, other architectures like MapReduce over documents work better.

Mixed languages: embeddings work best when the query and documents are in the same language. Mixing languages reduces search quality.

The LangChain documentation covers advanced strategies to address these issues, including re-ranking and query expansion.

Next Steps

This basic system covers the complete RAG flow. To take it to production, the natural next steps are:

  • Replace in-memory storage with Chroma or Pinecone
  • Implement overlapping chunking for real documents
  • Add re-ranking to improve chunk relevance
  • Evaluate system quality with a set of reference questions
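
For the evaluation step, a reasonable starting metric is the retrieval hit rate: the fraction of reference questions for which the expected chunk shows up in the top k results. The sketch below assumes a retriever with the signature (question, k) -> list of chunk texts; hit_rate_at_k and the keyword-based fake_retrieve are hypothetical names used only for this demonstration.

```python
from typing import Callable

def hit_rate_at_k(reference: list[tuple[str, str]],
                  retrieve: Callable[[str, int], list[str]],
                  k: int = 2) -> float:
    """Fraction of questions whose expected chunk appears in the top-k results."""
    if not reference:
        return 0.0
    hits = 0
    for question, expected_chunk in reference:
        if expected_chunk in retrieve(question, k):
            hits += 1
    return hits / len(reference)

# Stand-in retriever for the demo: keyword matching instead of embeddings.
def fake_retrieve(question: str, k: int) -> list[str]:
    corpus = ["returns within 30 days", "express shipping in 24 hours"]
    words = question.lower().split()
    ranked = sorted(corpus, key=lambda doc: -sum(w in doc for w in words))
    return ranked[:k]

reference = [
    ("returns deadline", "returns within 30 days"),
    ("express shipping time", "express shipping in 24 hours"),
]
print(hit_rate_at_k(reference, fake_retrieve, k=1))
```

The find_relevant_documents function from Step 3 takes the same two arguments, so it could be passed in place of the stand-in retriever. Tracking this number before and after a change to chunking or re-ranking shows whether the change actually helped.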

RAG is today one of the most widely used architectures for LLM applications in enterprise environments, precisely because it combines the generative power of models with controlled, up-to-date information.