Build a RAG Chatbot in 30 Minutes: A Step-by-Step Guide for 2026

A practical guide to building a Retrieval-Augmented Generation chatbot using LangChain, Ollama, and Chroma — from zero to working chatbot in half an hour.

Build a fully local RAG chatbot using LangChain, Ollama, and Chroma — no cloud APIs, no monthly bills, and no machine learning experience required.

What Is RAG, Really?

Retrieval-Augmented Generation (RAG) is a technique that lets a language model answer questions about information it was never trained on. Instead of relying solely on memorized knowledge — which is frozen in time and prone to hallucination — RAG gives the model a searchable knowledge base. When you ask a question, the system retrieves the most relevant documents, then feeds them to the LLM alongside your query. The model reads those documents and crafts an answer grounded in facts.

Think of it as an open-book exam. The student is capable, but they do not have every fact memorized. Hand them the right textbook pages, and suddenly they answer with precision. That is RAG: bridging general intelligence with domain-specific knowledge.
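The retrieve-then-generate flow can be sketched in plain Python before any real tooling is involved. This is only an illustration: the function names and sample sentences are made up for the example, and word overlap stands in for the embedding search built later in this guide.

```python
import re


def words(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, documents, k=2):
    """Return the k documents sharing the most words with the question.

    Word overlap is a toy stand-in for vector similarity search.
    """
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_docs):
    """Place the retrieved text ahead of the question, as a RAG prompt does."""
    context = "\n\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical three-sentence knowledge base
knowledge_base = [
    "Chroma persists embeddings to disk.",
    "Ollama runs language models locally.",
    "Bananas are rich in potassium.",
]
question = "Where does Chroma keep its embeddings?"
top_docs = retrieve(question, knowledge_base, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The prompt that reaches the model contains only the retrieved sentence plus the question, which is the whole idea: the model answers from the page it was handed.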

In 2026, running LLMs locally has become remarkably accessible. You no longer need a data center budget. With the right open-source tools, you can build a RAG chatbot on your laptop in under 30 minutes.

The Tools You Will Need

  • Ollama — A lightweight tool for running LLMs locally. It handles model downloads and inference with a single command. We will use the llama3.2 model (3B parameters) for text generation and nomic-embed-text for creating embeddings. Both run comfortably on machines with 8GB of RAM.
  • LangChain — The orchestration framework that ties everything together. It provides ready-made components for document loading, text splitting, embedding, vector storage, and retrieval. You compose these building blocks rather than writing glue code from scratch.
  • Chroma — An open-source vector database that stores embeddings and enables fast similarity search. Chroma runs in-process, needs zero configuration, and persists data to disk so your knowledge base survives restarts.

You will also need Python 3.10 or newer and these pip packages: langchain, langchain-community (it provides the document loaders used in Step 2), langchain-ollama, langchain-chroma, and chromadb.
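A quick way to confirm the packages are installed is to ask Python whether each one can be imported. The sketch below is one possible check; the mapping from pip names to import names is an assumption based on the usual convention of replacing hyphens with underscores, and langchain-community is included because Step 2 imports from langchain_community.

```python
from importlib.util import find_spec

# pip package name -> importable module name (assumed naming convention)
REQUIRED = {
    "langchain": "langchain",
    "langchain-community": "langchain_community",
    "langchain-ollama": "langchain_ollama",
    "langchain-chroma": "langchain_chroma",
    "chromadb": "chromadb",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]
```

Calling missing_packages() returns an empty list when everything is in place; otherwise it lists what still needs a pip install.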

Step-by-Step: Building Your RAG Chatbot

Step 1: Setting Up Ollama

Install Ollama from ollama.com, then pull the models you need. Run ollama pull llama3.2 for text generation and ollama pull nomic-embed-text for embeddings. Verify with ollama run llama3.2 and a test prompt.
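If you would rather verify the setup from Python, the following sketch asks the local Ollama server which models it has. It assumes Ollama is listening on its default address (http://localhost:11434) and uses its /api/tags endpoint, which lists pulled models; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default local address
REQUIRED_MODELS = ("llama3.2", "nomic-embed-text")


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required models absent from an /api/tags response.

    Model names come back with a tag suffix such as "llama3.2:latest",
    so only the part before the colon is compared.
    """
    installed = {m["name"].split(":")[0] for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def check_ollama(base_url=OLLAMA_URL):
    """Query the running Ollama server and report any models still to pull."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return missing_models(json.load(resp))
```

check_ollama() raises a connection error if the server is not running, and otherwise returns the names you still need to pull.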

Step 2: Loading and Splitting Documents

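Create a folder called documents and drop in your files: PDFs, text files, markdown notes, or HTML pages. LangChain supports dozens of loaders. Use DirectoryLoader from langchain_community.document_loaders to scan the folder, with TextLoader as the loader class for plain-text and markdown files. Other formats need their own loader; PDFs, for instance, can be read with PyPDFLoader, which requires the pypdf package.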

Once loaded, split documents into chunks using RecursiveCharacterTextSplitter. Set a chunk size of around 1000 characters with a 200-character overlap. The overlap prevents sentences from losing context when split across boundaries. After splitting, you will have a list of small text chunks ready for embedding; how many depends on the size of your collection.
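Put together, loading and splitting might look like the sketch below. It assumes the folder holds UTF-8 .txt files; the function name, the glob pattern, and the encoding are choices made for this example, and RecursiveCharacterTextSplitter is imported from langchain_text_splitters, which is installed alongside langchain.

```python
def load_and_split(folder="documents", chunk_size=1000, chunk_overlap=200):
    """Load every .txt file under `folder` and split it into overlapping chunks."""
    # Imported here so this module still loads if the packages are missing.
    from langchain_community.document_loaders import DirectoryLoader, TextLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    loader = DirectoryLoader(
        folder,
        glob="**/*.txt",                      # assumption: plain-text files only
        loader_cls=TextLoader,
        loader_kwargs={"encoding": "utf-8"},  # assumption: UTF-8 sources
    )
    docs = loader.load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    # Each chunk is a Document with page_content and metadata (source path).
    return splitter.split_documents(docs)
```

A call such as chunks = load_and_split() returns the list of Document chunks that the later steps embed and store.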

Step 3: Creating Embeddings

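Import OllamaEmbeddings from langchain_ollama and initialize it with the model name nomic-embed-text. Its embed_documents method takes a list of strings (the page_content of each chunk, not the Document objects themselves) and returns one high-dimensional vector per string: a numerical fingerprint that captures semantic meaning. You rarely need to call it by hand, because Chroma invokes it for you in the next step, but running it on a few sentences is a quick check that the model works. Embedding a full collection may take a minute or two, and as long as you persist the results it only has to run once.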
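To get a feel for what these vectors are for, the sketch below pairs a thin wrapper around OllamaEmbeddings with a plain-Python cosine similarity, the usual measure of how closely two embeddings point in the same direction. The helper names are illustrative, and embed_texts needs a running Ollama server with nomic-embed-text pulled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed_texts(texts, model="nomic-embed-text"):
    """Embed a list of strings with a local Ollama embedding model."""
    # Imported here so cosine_similarity works without the package installed.
    from langchain_ollama import OllamaEmbeddings

    embeddings = OllamaEmbeddings(model=model)
    return embeddings.embed_documents(texts)  # one vector (list of floats) per string
```

Embedding two related sentences and one unrelated sentence, then comparing them with cosine_similarity, should show the related pair scoring higher.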

Step 4: Storing Vectors in Chroma

Import Chroma from langchain_chroma and call its from_documents class method, passing your document chunks, the embedding object from Step 3, and a persist_directory such as ./chroma_db. Chroma embeds each chunk, builds an index for fast similarity search, and saves everything to disk. On later runs, skip the rebuild by constructing Chroma directly with the same persist_directory and embedding function. Test it by calling vectorstore.similarity_search("your query", k=3) to see the three most relevant chunks.
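A minimal sketch of this step follows, assuming chunks is the list of split documents from Step 2. The function names and the preview helper are made up for the example; both functions rely on Chroma's default collection name so that the reopened store sees the same data.

```python
def build_vectorstore(chunks, persist_directory="./chroma_db"):
    """Embed the chunks and write a new Chroma index to disk."""
    from langchain_chroma import Chroma
    from langchain_ollama import OllamaEmbeddings

    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    return Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
    )


def open_vectorstore(persist_directory="./chroma_db"):
    """Reopen an index built earlier, without re-embedding anything."""
    from langchain_chroma import Chroma
    from langchain_ollama import OllamaEmbeddings

    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    return Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings,
    )


def preview(results, width=80):
    """Return the first `width` characters of each retrieved chunk."""
    return [doc.page_content[:width] for doc in results]
```

After building the store once, something like preview(vectorstore.similarity_search("your query", k=3)) gives a compact look at what retrieval returns.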

Step 5: Building the Retrieval Chain

Import ChatOllama from langchain_ollama to wrap your llama3.2 model. Create a prompt template using ChatPromptTemplate from langchain_core.prompts with two variables: context and question. Write a system message instructing the model to answer only from the provided context and to say "I don't know" when the answer is not found. This reduces hallucination substantially, though it does not eliminate it.

Now compose the full chain: retrieve relevant documents from Chroma, format them into a context string, insert both context and question into your prompt template, and send the result to the LLM. Use LangChain's pipe syntax, with RunnablePassthrough carrying the question through unchanged, and the entire pipeline becomes a single callable object.
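One way to wire this up is sketched below, assuming vectorstore is the Chroma store from Step 4. The prompt wording, the function names, the default of four retrieved chunks, and temperature=0 are choices made for this example rather than requirements.

```python
SYSTEM_PROMPT = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say \"I don't know\".\n\n"
    "Context:\n{context}"
)


def format_docs(docs):
    """Join retrieved chunks into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_chain(vectorstore, k=4):
    """Compose retriever -> prompt -> LLM -> plain string."""
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_ollama import ChatOllama

    retriever = vectorstore.as_retriever(search_kwargs={"k": k})
    prompt = ChatPromptTemplate.from_messages(
        [("system", SYSTEM_PROMPT), ("human", "{question}")]
    )
    llm = ChatOllama(model="llama3.2", temperature=0)

    return (
        # The input string goes to the retriever and, unchanged, to "question".
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
```

With the chain built, chain.invoke("What does the handbook say about refunds?") returns the model's answer as a string; the question here is only a placeholder.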

Step 6: The Interactive Loop

Wrap everything in a simple loop using Python's input() function. Feed each question into your chain and print the response. Add a quit command, an optional sources flag to show retrieved chunks, and basic error handling for empty queries. That is it — fire up your script and ask questions about your documents. Answers arrive in seconds, all generated locally.
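The loop itself needs nothing beyond the standard library. In the sketch below, answer is any function that takes a question and returns a string (chain.invoke from Step 5 would fit); the read and write parameters are an illustrative addition that makes the loop easy to try without a terminal, and the sources flag is left out for brevity.

```python
def chat_loop(answer, read=input, write=print):
    """Ask for questions until the user types quit or exit."""
    while True:
        try:
            question = read("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # Ctrl-D or Ctrl-C ends the session quietly

        if not question:
            write("Please type a question.")
            continue
        if question.lower() in {"quit", "exit"}:
            break

        try:
            write(f"Bot: {answer(question)}")
        except Exception as exc:  # keep the session alive if one call fails
            write(f"Error: {exc}")
```

In a script, chat_loop(chain.invoke) starts the conversation.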

Common Pitfalls and How to Avoid Them

  • Chunk size is critical. Too small and you lose context; too large and relevance scores suffer. Start at 1000 characters and adjust. Dense technical content benefits from larger chunks; FAQ-style content works better with smaller ones.
  • Garbage in, garbage out. RAG is not magic. Messy or contradictory source documents produce poor answers. Clean your data first — remove duplicates, fix formatting, and standardize terminology.
  • Retrieval count needs tuning. Retrieving too few documents misses critical context. Too many introduces noise. Four to six documents is a solid default for most use cases.
  • Embeddings can be slow on CPU. Generating embeddings for thousands of chunks takes time. Use a smaller embedding model if speed matters, or pre-compute embeddings during setup rather than at query time.
  • Never skip the system prompt. Without explicit instructions to rely only on retrieved context, LLMs confidently invent plausible-sounding fiction. Always constrain the model.

What You Can Build Next

  • A personal knowledge base. Feed in your notes, journal entries, and saved articles. Query your second brain in natural language.
  • A documentation chatbot. Point it at your company wiki or product docs. Colleagues get instant answers without hunting through pages.
  • A research assistant. Load academic papers and legal documents. Ask comparative questions, summarize findings, or identify gaps — backed by source citations.
  • A customer support agent. Index your knowledge base and past support tickets. Deploy it as a first-line agent that handles common questions before escalating.
  • A multi-source aggregator. Combine documents from different domains. Use Chroma's metadata filtering to tag categories and scope searches to specific collections.
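For the multi-source idea, metadata filtering could be sketched as follows. It assumes each chunk is a LangChain Document with a metadata dictionary and that vectorstore is the Chroma store from Step 4; the "category" key and both function names are made up for the example.

```python
def tag_documents(docs, category):
    """Stamp every document with a category before it is stored."""
    for doc in docs:
        doc.metadata["category"] = category
    return docs


def search_category(vectorstore, query, category, k=4):
    """Similarity search restricted to chunks tagged with one category."""
    return vectorstore.similarity_search(
        query,
        k=k,
        filter={"category": category},  # matched against stored metadata
    )
```

Tag each source's chunks before adding them to the store, then call something like search_category(vectorstore, "notice period", "legal") to keep results within one domain.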

Everything in this stack runs locally: no API keys, no usage limits, and your documents never leave your machine. As open-source models improve through 2026, you can upgrade by swapping in a newer chat model; if you change the embedding model, re-embed your documents so stored and query vectors match. Take the next 30 minutes and build something that would have seemed like science fiction a few years ago.

Tharun Ramagiri is a web developer, security researcher, and AI enthusiast exploring the intersection of LLMs and everyday technology. He writes about practical AI tools, cybersecurity awareness, and developer workflows that actually work.

6 min read
May 17, 2026
By Tharun Ramagiri