RAG Systems

RAG Implementation Guide: How to Build a Production Knowledge Assistant in 4 Weeks

May 2025 · 12 min read · by ZYNETRA

Retrieval-Augmented Generation (RAG) is the most impactful AI architecture we have shipped for B2B companies in 2025. Done right, a RAG system transforms your pile of PDFs, wikis, SOPs, and CRM data into a trusted, always-available knowledge layer that answers questions with cited sources — no hallucinated answers, no manual digging through folders.

Done wrong, it becomes yet another demo that impresses in the pitch and fails in week two when users start asking real questions.

This guide covers exactly how we architect and ship production RAG systems. No hand-waving. Real architecture decisions, real failure modes, and the patterns that actually hold under production load.

What Is RAG and Why Does It Matter?

A standard large language model (LLM) like GPT-4 or Claude is trained on data up to a cutoff date. It has no knowledge of your internal documents, your policies, your product, or your customers. When you ask it a question about your business, it either hallucinates an answer or tells you it does not know.

RAG solves this by retrieving relevant context from your own data sources at query time, then passing that context into the LLM prompt. The model answers based on your data, not its training data — and can cite exactly which document each piece of information came from.

RAG is not a chatbot. It is a knowledge infrastructure layer with an LLM interface on top.

The RAG Pipeline: All 6 Stages

Stage 1: Document Ingestion

Everything starts with your raw documents. These can be PDFs, Word files, Notion pages, Confluence wikis, Zendesk articles, Google Docs, database records, or CRM exports. The ingestion layer needs to:

  • Extract text faithfully (PDF parsing is harder than it looks — use pymupdf or unstructured, not basic text extraction)
  • Preserve metadata (document title, author, date, section headers) — you will need this for citations
  • Handle incremental updates without re-indexing everything from scratch
  • Detect and skip corrupted or empty files

The mistake most teams make: they assume all documents are well-structured. In practice, scanned PDFs, mixed-language documents, and tables without headers all require special handling.
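
As a rough illustration of this stage, here is a minimal ingestion sketch using PyMuPDF (the pymupdf package mentioned above). The ingest_pdf helper and the metadata fields are illustrative choices, not a fixed schema, and OCR for scanned pages is out of scope here:

import fitz  # PyMuPDF (pip install pymupdf)

def ingest_pdf(path: str) -> list[dict]:
    """Extract per-page text plus the metadata needed later for citations."""
    doc = fitz.open(path)
    records = []
    for page_number, page in enumerate(doc, start=1):
        text = page.get_text("text").strip()
        if not text:  # skip empty pages; scanned pages need OCR, handled elsewhere
            continue
        records.append({
            "text": text,
            "source": path,
            "title": doc.metadata.get("title") or path,
            "page": page_number,
        })
    doc.close()
    return records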

Stage 2: Chunking

You cannot embed an entire 200-page document into a single vector. You chunk it into smaller pieces that can be individually embedded and retrieved. The chunking strategy has an enormous impact on retrieval quality:

  • Fixed-size chunking (e.g. 512 tokens): simple but breaks context mid-sentence
  • Semantic chunking: splits at natural boundaries (paragraphs, sections). More reliable, harder to implement
  • Hierarchical chunking: stores both small chunks (for precision) and large parent chunks (for context) — the best approach for complex documents

We default to hierarchical chunking with 256-token leaf chunks and 1024-token parent chunks. Small chunks retrieve precisely; parent chunks provide full context to the LLM.
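
To make that default concrete, here is a sketch of hierarchical chunking. It approximates token counts with word counts (a real pipeline would use a proper tokenizer such as tiktoken), and the chunk dictionary layout is an assumption for the sake of the example:

def hierarchical_chunks(text: str, parent_size: int = 1024, leaf_size: int = 256) -> list[dict]:
    """Split text into ~parent_size-word parents, each cut into ~leaf_size-word leaves."""
    words = text.split()
    chunks = []
    for p_start in range(0, len(words), parent_size):
        parent_words = words[p_start:p_start + parent_size]
        parent_text = " ".join(parent_words)
        parent_id = f"parent-{p_start // parent_size}"
        for l_start in range(0, len(parent_words), leaf_size):
            leaf_text = " ".join(parent_words[l_start:l_start + leaf_size])
            chunks.append({
                "leaf": leaf_text,       # embedded and searched
                "parent_id": parent_id,  # looked up at generation time
                "parent": parent_text,   # full context handed to the LLM
            })
    return chunks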

Stage 3: Embedding and Vector Storage

Each chunk is converted into a high-dimensional vector (embedding) using an embedding model. We typically use text-embedding-3-large from OpenAI or voyage-large-2 from Voyage AI (the embedding partner Anthropic recommends) for production work.

These vectors are stored in a vector database. Our go-to choices:

  • pgvector (on Supabase or Postgres): best for teams already on Postgres, no extra infrastructure
  • Pinecone: managed, fast, scales to tens of millions of vectors without ops overhead
  • Weaviate: strong multi-tenancy support, good for SaaS products serving multiple clients
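
A minimal sketch of this stage with OpenAI embeddings and pgvector might look like the following. The chunks table, its columns, and the embed_and_store helper are illustrative assumptions, and batching and error handling are omitted:

from openai import OpenAI
import psycopg

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_and_store(chunks: list[dict], dsn: str) -> None:
    """Embed chunk texts and insert them into a pgvector-backed table."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=[c["text"] for c in chunks],
    )
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        for chunk, item in zip(chunks, response.data):
            # pgvector accepts a "[x1,x2,...]" text literal for vector columns
            vector_literal = "[" + ",".join(str(x) for x in item.embedding) + "]"
            cur.execute(
                "INSERT INTO chunks (text, source, page, embedding) VALUES (%s, %s, %s, %s)",
                (chunk["text"], chunk["source"], chunk["page"], vector_literal),
            )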

Stage 4: Query Processing and Retrieval

When a user asks a question, the query is embedded using the same model that embedded the documents. The system then performs approximate nearest-neighbour (ANN) search to find the most semantically similar chunks.

The key production upgrades most RAG tutorials skip:

  • Query expansion: rewrite the user query into 3–5 variations before embedding to improve recall
  • Hybrid search: combine vector search (semantic) with BM25 full-text search (keyword). Neither alone is sufficient
  • Re-ranking: use a cross-encoder model (e.g. Cohere Rerank) to score top-K results for true relevance before passing to the LLM. This alone often improves answer quality by 20–30%
  • Metadata filtering: pre-filter by document type, date range, or department before running vector search
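
As one example of the hybrid-search step, here is a sketch of reciprocal rank fusion, a common way to merge a vector ranking and a BM25 ranking into a single list. The function name and the k constant of 60 are conventional choices, not requirements:

def reciprocal_rank_fusion(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs (best first) into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in (vector_hits, keyword_hits):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; a cross-encoder re-ranker would then rescore
    # the top 20-50 of these before they reach the LLM.
    return sorted(scores, key=scores.get, reverse=True)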

Stage 5: LLM Generation with Grounding

The retrieved chunks are assembled into a context window and passed to the LLM with a strict system prompt. The prompt matters:

You are a knowledge assistant with access to the following documents.
Answer ONLY based on the provided context.
If the answer is not in the context, say: "I don't have information on this."
Always cite the source document and page number for each claim.

Context:
{retrieved_chunks}

Question: {user_query}

We use Claude 3.5 Sonnet or GPT-4o for most production deployments. They follow instructions more reliably than smaller models, which matters when you cannot afford hallucinations.
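
Putting this stage together, a generation call might look like the following sketch using the Anthropic SDK. The model alias, the chunk fields, and the bracketed citation labels are illustrative assumptions:

import anthropic

SYSTEM_PROMPT = (
    "You are a knowledge assistant with access to the following documents. "
    "Answer ONLY based on the provided context. If the answer is not in the "
    'context, say: "I don\'t have information on this." '
    "Always cite the source document and page number for each claim."
)

def answer(question: str, chunks: list[dict]) -> str:
    """Assemble the grounding prompt and call the LLM."""
    context = "\n\n".join(f"[{c['source']} p.{c['page']}]\n{c['parent']}" for c in chunks)
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model alias
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return response.content[0].text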

Stage 6: Citation and Grounding Verification

Every claim in the response should trace back to a specific document and chunk. We implement a post-processing step that:

  • Parses the LLM's response to extract cited sources
  • Verifies each citation actually appears in the retrieved chunks
  • Strips or flags any claim without a valid source

This is the difference between a RAG system users trust and one they abandon after the first wrong answer.
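
A minimal version of that verification pass might look like this sketch, assuming the model cites sources in the same bracketed "[source p.N]" form the chunks were labelled with in the prompt; the regex and the flagging behaviour are illustrative:

import re

CITATION_RE = re.compile(r"\[([^\]]+?) p\.(\d+)\]")

def verify_citations(answer_text: str, chunks: list[dict]) -> list[str]:
    """Return citations in the answer that do not match any retrieved chunk."""
    retrieved = {(c["source"], c["page"]) for c in chunks}
    invalid = []
    for source, page in CITATION_RE.findall(answer_text):
        if (source, int(page)) not in retrieved:
            invalid.append(f"{source} p.{page}")
    return invalid  # non-empty -> strip or flag the offending claims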

The 4-Week Production Timeline

Week 1: Data audit, ingestion pipeline, first embedding index. Run baseline retrieval tests with real queries from your team.

Week 2: Chunking strategy refinement, add hybrid search + re-ranking, build the LLM generation layer with grounding. Internal testing with edge cases.

Week 3: UI integration, feedback loop (thumbs up/down on answers), monitoring setup. Soft launch to a small group of real users.

Week 4: Iteration on failure cases from soft launch, load testing, production deployment. Handoff documentation.

The Most Common Failure Modes

  • Chunk boundary cuts context in half — answerable question fails because the relevant sentence spans two chunks. Fix: use overlapping chunks or hierarchical chunking.
  • Retrieved chunks are relevant but outdated — answer is wrong because old document version is still indexed. Fix: implement document versioning with automatic re-embedding on update.
  • LLM ignores grounding instructions — hallucination slips through. Fix: use models with strong instruction following (Claude 3.5+ or GPT-4o), add post-processing validation.
  • Retrieval fails on abbreviations and domain jargon — user types "PO", knowledge base uses "Purchase Order". Fix: query expansion with domain synonym expansion before embedding.
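
For the jargon failure above, the synonym-expansion fix can be as simple as the following sketch; the mapping entries are examples and would be maintained per client:

DOMAIN_SYNONYMS = {
    "po": "purchase order",
    "sla": "service level agreement",
    "sop": "standard operating procedure",
}

def expand_query(query: str) -> str:
    """Append full terms for any known abbreviations before the query is embedded."""
    expansions = [
        DOMAIN_SYNONYMS[token]
        for token in (t.strip(".,?!").lower() for t in query.split())
        if token in DOMAIN_SYNONYMS
    ]
    return query if not expansions else f"{query} ({'; '.join(expansions)})"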

What Does This Cost?

For a corpus of 10,000 documents (roughly 50 million tokens), initial embedding costs around $50–$150 depending on the model. Monthly re-embedding of updated documents adds $10–$50. LLM inference for production traffic (1,000 queries/day) typically runs $200–$600/month. Vector DB hosting (pgvector on Supabase or Pinecone starter): $0–$100/month.

Total production cost for a medium-sized deployment: $300–$800/month. For a system that replaces multiple hours of human search and research time per day, the ROI is immediate.

Ready to Ship?

ZYNETRA has deployed RAG systems for insurance, SaaS, fintech, and e-commerce companies. If you are considering a knowledge assistant for your team, book a 20-minute call and we will tell you exactly what your implementation would look like and how long it would take.