Build Your Own AI Chatbot — Python RAG System From Scratch to Deployment [2026 Tutorial]

This is Part 2 of the series.

  • Part 1: Why Python Still Dominates in 2026
  • Part 2: Build Your Own AI Chatbot — RAG From Scratch to Deployment ← You are here
  • Part 3: One AI Is No Longer Enough — LangGraph Multi-Agent Systems

📌 Level: Intermediate (assumes basic Python syntax knowledge)
⏱️ Reading time: ~12 min / Hands-on time: ~2–3 hours
🛠️ End result: An AI chatbot API that reads uploaded PDFs and answers questions about them


“Can you build something like ChatGPT?”

It’s the first question people ask after learning Python. And in 2026, the answer is: “Yes — and you can do it today.”

We’re going to use a technique called RAG (Retrieval-Augmented Generation) to build a personal AI chatbot from scratch, all the way through deployment. A bot that answers questions about your company’s internal documents, a PDF-based Q&A system, a customer support assistant — the core of all these is RAG.


📊 Table of Contents

  1. Understanding RAG in 3 Minutes
  2. Why a Vanilla ChatGPT API Won’t Cut It
  3. The Full Architecture of What We’re Building
  4. Environment Setup — Done in 5 Minutes
  5. Step-by-Step Code Implementation (The Core!)
  6. Building the API Server with FastAPI
  7. Testing Locally
  8. Cloud Deployment — Live on Railway in 5 Minutes
  9. Next Steps: Where to Go From Here

1. Understanding RAG in 3 Minutes

RAG stands for Retrieval-Augmented Generation. In plain English:

Make the AI search for relevant documents first, then answer based on what it found.

A standard LLM (ChatGPT, Claude, etc.) answers only from its training data. Ask it about your internal documents, recent events, or personal files and it makes things up — that’s called hallucination.

RAG fixes this.

[User Question]
[Document Retrieval] ← Find semantically similar passages from your documents
[Retrieved Passages + Question sent to LLM]
[LLM answers accurately, grounded in the provided documents]

That’s the whole thing. It looks like magic when it works, but the mechanism is simple.
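The four-step flow above can be sketched in plain Python. Here a toy word-overlap retriever stands in for the real embedding search we build later; every name and document in this snippet is illustrative:

```python
import re

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
]

def words(text: str) -> set[str]:
    """Lowercased word set with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, documents: list[str]) -> str:
    """Step 2: return the document sharing the most words with the question
    (a crude stand-in for the embedding similarity search built later)."""
    return max(documents, key=lambda d: len(words(question) & words(d)))

def build_prompt(question: str, context: str) -> str:
    """Step 3: pack the retrieved passage and the question into one grounded prompt."""
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "What is the refund policy?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)  # the refund passage is selected, not the support-hours one
```

Swap `retrieve` for a vector search and send `prompt` to an LLM, and you have the whole RAG loop.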


2. Why a Vanilla ChatGPT API Won’t Cut It

An honest comparison:

Approach      | Pros                        | Cons
Plain LLM API | Fast and simple             | Knows nothing about your docs; hallucinates; can get expensive
Fine-tuning   | Model learns your data      | Costs hundreds to thousands of dollars to train; must re-train on every data update
RAG           | Accurate and cost-efficient | Requires initial setup

RAG never re-trains the model. When documents change, you just update the vector database. That’s why it’s become the standard architecture for enterprise AI projects in 2026.

💡 RAG’s core strength: It connects private data (internal docs, personal notes, customer data) to the LLM safely. Documents are not sent wholesale to the API — only the relevant passages are passed as context.


3. The Full Architecture

Here’s the architecture we’ll build today.

📄 PDF Upload
[Document Chunking] → Split long documents into small passages
[Embedding] → Convert each passage into a numerical vector
[Vector DB Storage] → ChromaDB (local, free)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📨 User Submits a Question
[Question Embedding] → Convert the question into a vector too
[Similarity Search] → Find the 3–5 most relevant passages in the DB
[LLM Call] → "Answer this question using only the provided passages"
✅ Accurate Answer Returned
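The [Similarity Search] step ranks passages by how close their vectors are to the question's vector, most commonly via cosine similarity. A minimal illustration with hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions, and the passage labels here are made up):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

question_vec = [0.9, 0.1, 0.0]
passages = {
    "refund policy": [0.8, 0.2, 0.1],   # points roughly the same way → high score
    "office hours":  [0.0, 0.1, 0.9],   # nearly orthogonal → low score
}

ranked = sorted(passages, key=lambda p: cosine_similarity(question_vec, passages[p]),
                reverse=True)
print(ranked[0])  # → refund policy
```

ChromaDB does exactly this ranking for us, just over thousands of stored chunks at once.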

Tech stack:

  • LangChain — pipeline assembly
  • ChromaDB — vector database (free, local)
  • Anthropic Claude — the LLM
  • FastAPI — API server
  • PyMuPDF — PDF parsing

4. Environment Setup — Done in 5 Minutes

Install Dependencies

pip install langchain langchain-anthropic langchain-community \
    chromadb fastapi uvicorn python-dotenv \
    pymupdf sentence-transformers

Project Structure

my-rag-chatbot/
├── .env # API keys (never commit this!)
├── main.py # FastAPI server
├── rag_pipeline.py # RAG logic
├── uploads/ # PDF upload folder (auto-created)
└── vector_store/ # ChromaDB storage (auto-created)

.env File

ANTHROPIC_API_KEY=your-api-key-here

5. Step-by-Step Code Implementation

Step 1 — Load & Split the PDF

# rag_pipeline.py
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def load_and_split_pdf(file_path: str):
    """
    Load a PDF and split it into chunks.
    Chunks too large overflow the LLM context window;
    chunks too small lose their context. 500 chars is a good starting point.
    """
    # 1. Load PDF
    loader = PyMuPDFLoader(file_path)
    documents = loader.load()

    # 2. Split text
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,     # Max 500 chars per chunk
        chunk_overlap=50,   # 50-char overlap between chunks (context preservation)
        length_function=len,
    )
    chunks = splitter.split_documents(documents)

    print(f"✅ {len(documents)} pages → split into {len(chunks)} chunks")
    return chunks
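To see what chunk_size and chunk_overlap actually do, here is a deliberately simplified character-window splitter. RecursiveCharacterTextSplitter is smarter (it prefers to break on paragraphs and sentences before falling back to raw characters), but the sliding-window idea is the same:

```python
def naive_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a window of chunk_size characters, stepping forward by
    chunk_size - chunk_overlap so consecutive chunks share some text."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghijklmnopqrstuvwxyz" * 2   # 52 characters
chunks = naive_split(text, chunk_size=20, chunk_overlap=5)

for c in chunks:
    print(repr(c))
# Each chunk repeats the last 5 characters of the previous one, so a
# sentence cut at a boundary still appears whole in one of the chunks.
```

That shared tail is why retrieval rarely loses the sentence that happens to straddle a chunk boundary.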

Step 2 — Embed & Store in Vector DB

# rag_pipeline.py (continued)
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# Embedding model init (free, runs locally)
# First run downloads the model (~90MB)
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

VECTOR_STORE_PATH = "./vector_store"


def save_to_vector_db(chunks, collection_name: str = "my_docs"):
    """
    Convert chunks to vectors and save to ChromaDB.
    The same collection_name appends to existing data.
    """
    vector_db = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_model,
        persist_directory=VECTOR_STORE_PATH,
        collection_name=collection_name,
    )
    vector_db.persist()
    print(f"✅ {len(chunks)} chunks saved to vector DB")
    return vector_db


def load_vector_db(collection_name: str = "my_docs"):
    """Load an existing vector DB."""
    return Chroma(
        persist_directory=VECTOR_STORE_PATH,
        embedding_function=embedding_model,
        collection_name=collection_name,
    )

💡 Embedding model tip: all-MiniLM-L6-v2 is free, fast, and great for English text. If you need multilingual support, swap it for paraphrase-multilingual-MiniLM-L12-v2.

Step 3 — Build the RAG Chain (The Core)

# rag_pipeline.py (continued)
import os

from langchain_anthropic import ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser


def create_rag_chain(vector_db):
    """
    Assemble the RAG pipeline.
    Question → retrieve docs → LLM answer, wired as a single chain.
    """
    # 1. Retriever: fetch the top 4 most similar chunks
    retriever = vector_db.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4},
    )

    # 2. Prompt template
    prompt = ChatPromptTemplate.from_template("""
You are an AI assistant that answers questions strictly based on the provided documents.

[Reference Documents]
{context}

[Question]
{question}

Answer based only on the documents above.
If the answer cannot be found in the documents, say: "I couldn't find that information in the provided documents."
Do not make anything up.
""")

    # 3. LLM config
    llm = ChatAnthropic(
        model="claude-sonnet-4-20250514",
        api_key=os.getenv("ANTHROPIC_API_KEY"),
        temperature=0,      # 0 = most consistent, minimizes hallucination
        max_tokens=1024,
    )

    # 4. Assemble the chain (LangChain Expression Language)
    def format_docs(docs):
        return "\n\n---\n\n".join(
            f"[Source: {doc.metadata.get('source', 'Unknown')}, "
            f"Page: {doc.metadata.get('page', '?')}]\n{doc.page_content}"
            for doc in docs
        )

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_chain


def ask_question(question: str, collection_name: str = "my_docs") -> dict:
    """Accept a question, run it through the RAG pipeline, and return the answer."""
    vector_db = load_vector_db(collection_name)
    chain = create_rag_chain(vector_db)

    # Also return the source docs so the answer can cite them
    retriever = vector_db.as_retriever(search_kwargs={"k": 4})
    source_docs = retriever.invoke(question)
    answer = chain.invoke(question)

    return {
        "answer": answer,
        "sources": [
            {
                "page": doc.metadata.get("page", "?"),
                "source": doc.metadata.get("source", "Unknown"),
                "snippet": doc.page_content[:200] + "...",
            }
            for doc in source_docs
        ],
    }
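The format_docs helper is worth understanding on its own: it turns the retriever's output into the {context} string the prompt receives, with a source tag above each passage. The same logic, runnable standalone with a minimal stand-in for LangChain's Document class (the sample passages are made up):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for LangChain's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join passages with a divider, each prefixed by its source and page."""
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'Unknown')}, "
        f"Page: {doc.metadata.get('page', '?')}]\n{doc.page_content}"
        for doc in docs
    )

docs = [
    Document("Refunds are accepted within 30 days.", {"source": "policy.pdf", "page": 2}),
    Document("Contact support for exceptions.", {"source": "policy.pdf"}),
]
print(format_docs(docs))
```

Those inline source tags are what let the LLM ground its answer in specific pages, and what the /ask endpoint later echoes back as citations.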

6. Building the API Server with FastAPI

# main.py
import shutil
from pathlib import Path

from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from dotenv import load_dotenv

from rag_pipeline import (
    load_and_split_pdf,
    save_to_vector_db,
    ask_question,
    load_vector_db,
)

load_dotenv()

app = FastAPI(
    title="My RAG Chatbot API",
    description="Upload a PDF and ask questions about its content",
    version="1.0.0",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

UPLOAD_DIR = Path("./uploads")
UPLOAD_DIR.mkdir(exist_ok=True)


# ── Request / Response Models ──────────────────────
class QuestionRequest(BaseModel):
    question: str
    collection_name: str = "my_docs"


class QuestionResponse(BaseModel):
    answer: str
    sources: list


# ── Endpoints ──────────────────────────────────────
@app.get("/")
async def root():
    return {"message": "RAG Chatbot API is running 🚀"}


@app.post("/upload")
async def upload_pdf(
    file: UploadFile = File(...),
    collection_name: str = "my_docs",
):
    """Upload a PDF and store it in the vector DB."""
    if not file.filename or not file.filename.lower().endswith(".pdf"):
        raise HTTPException(status_code=400, detail="Only PDF files are accepted.")

    file_path = UPLOAD_DIR / file.filename
    with open(file_path, "wb") as f:
        shutil.copyfileobj(file.file, f)

    try:
        chunks = load_and_split_pdf(str(file_path))
        save_to_vector_db(chunks, collection_name)
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Processing error: {e}")

    return {
        "message": f"✅ '{file.filename}' uploaded successfully",
        "chunks_created": len(chunks),
        "collection": collection_name,
    }


@app.post("/ask", response_model=QuestionResponse)
async def ask(request: QuestionRequest):
    """Answer a question based on the uploaded documents."""
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Please enter a question.")
    try:
        result = ask_question(request.question, request.collection_name)
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Answer generation error: {e}")
    return QuestionResponse(answer=result["answer"], sources=result["sources"])


@app.delete("/collection/{collection_name}")
async def delete_collection(collection_name: str):
    """Delete a document collection."""
    # Let ChromaDB drop the collection itself — it manages its own
    # on-disk layout, so deleting directories by hand is unreliable.
    try:
        load_vector_db(collection_name).delete_collection()
    except Exception:
        raise HTTPException(status_code=404, detail="Collection not found.")
    return {"message": f"'{collection_name}' collection deleted"}

7. Testing Locally

Start the Server

uvicorn main:app --reload --port 8000

Once running, open http://localhost:8000/docs in your browser to see the auto-generated Swagger UI — test your endpoints right there.

Test with curl

# 1. Upload a PDF
curl -X POST "http://localhost:8000/upload" \
  -F "file=@./my_document.pdf" \
  -F "collection_name=my_docs"

# Sample response:
# {"message": "✅ 'my_document.pdf' uploaded successfully", "chunks_created": 47, "collection": "my_docs"}

# 2. Ask a question
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the main points of this document?", "collection_name": "my_docs"}'

Test with Python

import requests

BASE_URL = "http://localhost:8000"

# Upload a PDF
with open("my_document.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/upload",
        files={"file": f},
        data={"collection_name": "test"},
    )
print(response.json())

# Ask a question
response = requests.post(
    f"{BASE_URL}/ask",
    json={"question": "What is the main conclusion?", "collection_name": "test"},
)
result = response.json()
print("Answer:", result["answer"])

print("\nSources:")
for src in result["sources"]:
    print(f"  - Page {src['page']}: {src['snippet'][:80]}...")

8. Cloud Deployment — Live on Railway in 5 Minutes

Required Files

requirements.txt

langchain==0.3.19
langchain-anthropic
langchain-community
chromadb
fastapi
uvicorn
python-dotenv
pymupdf
sentence-transformers

Procfile

web: uvicorn main:app --host 0.0.0.0 --port $PORT

.gitignore (critical!)

.env
uploads/
vector_store/
__pycache__/
*.pyc

Deployment Steps

# 1. Initialize Git and push to GitHub
git init
git add .
git commit -m "Initial RAG chatbot deployment"
git remote add origin https://github.com/your-username/my-rag-chatbot.git
git push -u origin main
# 2. Go to railway.app → "New Project" → "Deploy from GitHub"
# 3. Select your repository
# 4. Settings → Environment Variables → Add ANTHROPIC_API_KEY
# 5. Auto-deploy complete — access your live URL

⚠️ Deployment notes

  • Never push your .env file to GitHub. Set secrets via Railway environment variables.
  • The vector_store/ folder resets on server restart. For production, use a persistent vector DB like Pinecone or Supabase pgvector.
  • The free plan includes $5/month in credits — plenty for personal projects.

9. Next Steps: Where to Go From Here

What we built today is the most fundamental form of RAG. To reach production quality, consider these improvements.

Performance Improvements

Technique                        | Effect                             | Difficulty
Hybrid search (keyword + vector) | 30–50% retrieval accuracy gain     | ★★★
Re-ranking (Cohere Rerank, etc.) | Higher answer quality              | ★★★
Chunk size tuning                | Optimal size varies by domain      | ★★
Multi-query retrieval            | Better recall via diverse queries  | ★★
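The hybrid-search idea from the table is simple to sketch: score each passage with both a keyword signal and a vector signal, then blend them. Everything below is illustrative pure Python; in a real system you would combine a BM25 retriever with your vector retriever (LangChain ships an EnsembleRetriever for exactly this):

```python
import math

def keyword_score(query: str, passage: str) -> float:
    """Fraction of query words found in the passage (crude BM25 stand-in)."""
    q = set(query.lower().split())
    return len(q & set(passage.lower().split())) / len(q)

def vector_score(q_vec: list[float], p_vec: list[float]) -> float:
    """Cosine similarity between pre-computed embeddings."""
    dot = sum(a * b for a, b in zip(q_vec, p_vec))
    return dot / (math.sqrt(sum(a * a for a in q_vec)) *
                  math.sqrt(sum(b * b for b in p_vec)))

def hybrid_score(query, passage, q_vec, p_vec, alpha=0.5):
    """Blend the two signals; alpha weights keyword vs. vector evidence."""
    return alpha * keyword_score(query, passage) + (1 - alpha) * vector_score(q_vec, p_vec)

# An exact keyword match boosts a passage beyond its pure vector similarity.
score = hybrid_score("refund deadline", "the refund deadline is 30 days",
                     [1.0, 0.0], [0.7, 0.7])
print(round(score, 3))  # → 0.854
```

Tuning alpha per domain is where the reported accuracy gains come from: jargon-heavy corpora reward the keyword side, conversational ones the vector side.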

Adding Conversation History

# Multi-turn conversation with memory
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(
    k=5,                    # Remember the last 5 turns
    return_messages=True,
    memory_key="chat_history",
)
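If you'd rather not depend on LangChain's memory classes, the same windowed-history idea is easy to hand-roll and inject into the prompt as an extra template variable. A sketch, with all names illustrative:

```python
from collections import deque

class ChatHistory:
    """Keep only the last k question/answer turns, mirroring the
    windowed behavior of ConversationBufferWindowMemory."""

    def __init__(self, k: int = 5):
        self.turns = deque(maxlen=k)   # oldest turns fall off automatically

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_prompt_block(self) -> str:
        """Render the window as text for a {chat_history} slot in the prompt."""
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)

history = ChatHistory(k=2)
for i in range(3):
    history.add(f"question {i}", f"answer {i}")

print(history.as_prompt_block())
# Only turns 1 and 2 remain; turn 0 was evicted by the window.
```

The window matters for RAG specifically: every remembered turn eats context tokens that could otherwise hold retrieved passages, so keep k small.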

Wrapping Up — Building It Is How You Really Learn

Building a RAG system makes you realize how much thoughtful design goes into something that looks simple.

  • How should chunks be split?
  • Which embedding model fits the use case?
  • Too many retrieved documents? Too few?
  • How should the prompt be written to minimize hallucination?

Working through those questions is what makes you an AI engineer. The code is just the implementation of that design thinking.

In Part 3, we’ll take this chatbot and give it a team of AI agents to work with — a research agent, a writing agent, and a fact-checking agent that collaborate to produce higher-quality, more reliable output than any single AI call could.


🔖 Other posts in this series

  • Part 1: Why Python Still Dominates in 2026
  • Part 2: Build Your Own AI Chatbot — RAG From Scratch to Deployment ← You are here
  • Part 3: One AI Is No Longer Enough — LangGraph Multi-Agent Systems (next)

Tags: #Python #RAG #LangChain #FastAPI #AIChatbot #ChromaDB #MachineLearning #DevTutorial #AIDevelopment #2026