What is RAG?
RAG (Retrieval-Augmented Generation) is a technique that grounds AI responses in your actual content instead of relying solely on the model's training data.
The core idea is simple: before the LLM generates a response, the system retrieves the most relevant documents from your content and includes them in the prompt. The LLM then answers based on real, up-to-date information rather than what it memorized during training.
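The retrieve-then-generate flow can be sketched in a few lines. This is a minimal illustration, not Turing ES code: `retrieve` and `generate` are hypothetical stand-ins for the real retriever and LLM.

```python
def answer(question, retrieve, generate):
    # 1. Retrieve the most relevant chunks for the question.
    context = retrieve(question)
    # 2. Build a prompt that pairs the question with that context.
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
    # 3. The LLM answers from the supplied context, not from memorized data.
    return generate(prompt)

# Toy stand-ins, just to show the wiring:
docs = ["Turing ES supports AEM and web-crawler connectors."]
reply = answer(
    "Which connectors are supported?",
    retrieve=lambda q: docs,
    generate=lambda p: p.splitlines()[1],  # echoes the retrieved context line
)
```

The key point is the order of operations: retrieval happens first, and the LLM never sees your content except through the prompt.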
The Problem RAG Solves
Large Language Models are trained on vast amounts of text, but they have critical limitations:
- Knowledge cutoff — They don't know about content created after their training date
- No access to private data — They've never seen your internal documents, product specs, or company policies
- Hallucination — When they don't know an answer, they may generate plausible-sounding but incorrect information
- Generic responses — Without specific context, answers are broad and imprecise
RAG mitigates these problems by giving the LLM access to your actual content at query time.
How RAG Works — Step by Step
Phase 1 — Indexing (one-time per document)
- Document ingestion — Content enters Turing ES via connectors (AEM, web crawler) or file uploads (Assets/Knowledge Base)
- Text extraction — Apache Tika extracts text from PDFs, DOCX, XLSX, HTML, and other formats
- Chunking — The extracted text is split into chunks (default: 1,024 characters) to fit within embedding model limits
- Embedding — Each chunk is passed through the Embedding Model, which converts text into a high-dimensional numerical vector (e.g., a 1,536-dimension array of floats)
- Storage — The vectors are stored in the Embedding Store alongside metadata (source file, chunk position, original text)
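The indexing steps above can be sketched as follows. The `embed` function here is a placeholder, not a real embedding model, and the in-memory list stands in for the Embedding Store; chunk size matches the 1,024-character default.

```python
def chunk(text, size=1024):
    """Split extracted text into fixed-size chunks (default 1,024 chars)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk_text):
    # Placeholder: a real embedding model returns a high-dimensional
    # vector, e.g. a 1,536-dimension array of floats.
    return [float(ord(c)) for c in chunk_text[:4]]

def index_document(source, text, store):
    for position, piece in enumerate(chunk(text)):
        store.append({
            "vector": embed(piece),   # the embedding
            "source": source,         # metadata: source file
            "position": position,     # metadata: chunk position
            "text": piece,            # metadata: original text
        })

store = []
index_document("guide.pdf", "x" * 2500, store)  # 2,500 chars -> 3 chunks
```

Storing the original text alongside each vector is what lets the retrieval phase hand readable chunks back to the LLM instead of raw numbers.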
Phase 2 — Retrieval (on every user question)
- Query embedding — The user's question is converted into a vector using the same Embedding Model
- Similarity search — The Embedding Store finds the vectors most similar to the query vector (cosine similarity)
- Threshold filtering — Only chunks with similarity ≥ 0.7 are included (configurable)
- Top-K selection — The top 10 most relevant chunks are selected
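A sketch of the retrieval steps, assuming the same toy store shape as above: cosine similarity against every stored vector, a 0.7 threshold, then top-K selection. The 3-dimension vectors are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, threshold=0.7, top_k=10):
    scored = [(cosine(query_vec, item["vector"]), item) for item in store]
    # Threshold filtering: drop chunks below the similarity cutoff.
    kept = [pair for pair in scored if pair[0] >= threshold]
    # Top-K selection: most similar chunks first.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in kept[:top_k]]

store = [
    {"vector": [1.0, 0.0, 0.0], "text": "exact match"},
    {"vector": [0.8, 0.6, 0.0], "text": "related"},
    {"vector": [0.0, 1.0, 0.0], "text": "unrelated"},
]
results = retrieve([1.0, 0.0, 0.0], store)
```

Here "unrelated" scores 0.0 and is filtered out by the threshold, while the two remaining chunks come back ranked by similarity.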
Phase 3 — Generation
- Prompt construction — Turing ES builds a prompt containing the user's question plus the retrieved document chunks as context
- LLM generation — The LLM reads the context and generates an answer based on the real content
- Streaming response — The answer is streamed back to the user via SSE
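Prompt construction can be sketched like this. The template wording is illustrative, not the exact prompt Turing ES builds:

```python
def build_prompt(question, chunks):
    """Combine the user's question with retrieved chunks as context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What formats does text extraction support?",
    ["Apache Tika extracts text from PDFs, DOCX, XLSX, and HTML."],
)
```

Numbering the chunks (`[1]`, `[2]`, …) is a common convention that lets the LLM cite which retrieved passage supports its answer.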
The Three Components
LLM (Large Language Model)
The LLM is the "brain" that reads the retrieved context and generates a natural language response. It does the reasoning, summarization, and articulation — but it does not search or retrieve content on its own.
In Turing ES, LLMs are configured as LLM Instances supporting several providers, including:
| Provider | Example Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o-mini |
| Anthropic | Claude Sonnet 4 |
| Ollama | Mistral, Llama, Qwen |
| Google Gemini | Gemini 2.0 Flash |
| Azure OpenAI | GPT-4o (Azure-hosted) |
The LLM does not store knowledge. It only processes what's given to it in the prompt. RAG ensures the prompt contains the right content.