This is Part 2 of the series.
- Part 1: Why Python Still Dominates in 2026
- Part 2: Build Your Own AI Chatbot — RAG From Scratch to Deployment ← You are here
- Part 3: One AI Is No Longer Enough — LangGraph Multi-Agent Systems
📌 Level: Intermediate (assumes basic Python syntax knowledge) ⏱️ Reading time: ~12 min / Hands-on time: ~2–3 hours 🛠️ End result: An AI chatbot API that reads uploaded PDFs and answers questions about them
“Can you build something like ChatGPT?”
It’s the first question people ask after learning Python. And in 2026, the answer is: “Yes — and you can do it today.”
We’re going to use a technique called RAG (Retrieval-Augmented Generation) to build a personal AI chatbot from scratch, all the way through deployment. A bot that answers questions about your company’s internal documents, a PDF-based Q&A system, a customer support assistant — the core of all these is RAG.
📊 Table of Contents
- Understanding RAG in 3 Minutes
- Why a Vanilla ChatGPT API Won’t Cut It
- The Full Architecture of What We’re Building
- Environment Setup — Done in 5 Minutes
- Step-by-Step Code Implementation (The Core!)
- Building the API Server with FastAPI
- Testing Locally
- Cloud Deployment — Live on Railway in 5 Minutes
- Next Steps: Where to Go From Here
1. Understanding RAG in 3 Minutes
RAG stands for Retrieval-Augmented Generation. In plain English:
Make the AI search for relevant documents first, then answer based on what it found.
A standard LLM (ChatGPT, Claude, etc.) answers only from its training data. Ask it about your internal documents, recent events, or personal files and it makes things up — that’s called hallucination.
RAG fixes this.
```
[User Question]
      ↓
[Document Retrieval] ← Find semantically similar passages from your documents
      ↓
[Retrieved Passages + Question sent to LLM]
      ↓
[LLM answers accurately, grounded in the provided documents]
```
That’s the whole thing. It looks like magic when it works, but the mechanics are simple.
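To make the flow concrete before we bring in any libraries, here's a toy sketch of the retrieve-then-answer loop in pure Python. Word-overlap scoring stands in for real embeddings, and the document list is made up for illustration:

```python
# Toy RAG loop: retrieval scored by naive word overlap (a stand-in for embeddings)
docs = [
    "The refund policy allows returns within 30 days of purchase.",
    "Shipping is free for orders over 50 dollars.",
    "Support is available by email from 9am to 5pm.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Paste the retrieved passages in front of the question as grounding context."""
    context = "\n".join(passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "What is the refund policy for returns?"
prompt = build_prompt(question, retrieve(question, k=1))
print(prompt)  # the refund passage, then the question
```

Everything that follows in this post is this same loop, with real embeddings replacing the word overlap and a real LLM consuming the prompt.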
2. Why a Vanilla ChatGPT API Won’t Cut It
An honest comparison:
| Approach | Pros | Cons |
|---|---|---|
| Plain LLM API | Fast and simple | Knows nothing about your docs, hallucinates, can get expensive |
| Fine-tuning | Model learns your data | Costs hundreds to thousands of dollars per training run; must re-train on every data update |
| RAG | Accurate and cost-efficient | Requires initial setup |
RAG never re-trains the model. When documents change, you just update the vector database. That’s why it’s become the standard architecture for enterprise AI projects in 2026.
💡 RAG’s core strength: It connects private data (internal docs, personal notes, customer data) to the LLM safely. Documents are not sent wholesale to the API — only the relevant passages are passed as context.
3. The Full Architecture
Here’s the architecture we’ll build today.
```
📄 PDF Upload
      ↓
[Document Chunking] → Split long documents into small passages
      ↓
[Embedding] → Convert each passage into a numerical vector
      ↓
[Vector DB Storage] → ChromaDB (local, free)
      ↓
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📨 User Submits a Question
      ↓
[Question Embedding] → Convert the question into a vector too
      ↓
[Similarity Search] → Find the 3–5 most relevant passages in the DB
      ↓
[LLM Call] → "Answer this question using only the provided passages"
      ↓
✅ Accurate Answer Returned
```
Tech stack:
- LangChain — pipeline assembly
- ChromaDB — vector database (free, local)
- Anthropic Claude — the LLM
- FastAPI — API server
- PyMuPDF — PDF parsing
4. Environment Setup — Done in 5 Minutes
Install Dependencies
```bash
pip install langchain langchain-anthropic langchain-community \
    chromadb fastapi uvicorn python-dotenv \
    pymupdf sentence-transformers
```
Project Structure
```
my-rag-chatbot/
├── .env              # API keys (never commit this!)
├── main.py           # FastAPI server
├── rag_pipeline.py   # RAG logic
├── uploads/          # PDF upload folder (auto-created)
└── vector_store/     # ChromaDB storage (auto-created)
```
.env File
ANTHROPIC_API_KEY=your-api-key-here
5. Step-by-Step Code Implementation
Step 1 — Load & Split the PDF
```python
# rag_pipeline.py
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_and_split_pdf(file_path: str):
    """
    Load a PDF and split it into chunks.
    Chunks too large overflow the LLM context window;
    chunks too small lose their context. 500 chars is a good starting point.
    """
    # 1. Load PDF
    loader = PyMuPDFLoader(file_path)
    documents = loader.load()

    # 2. Split text
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,      # Max 500 chars per chunk
        chunk_overlap=50,    # 50-char overlap between chunks (context preservation)
        length_function=len,
    )
    chunks = splitter.split_documents(documents)

    print(f"✅ {len(documents)} pages → split into {len(chunks)} chunks")
    return chunks
```
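To see what `chunk_size` and `chunk_overlap` actually do, here's a deliberately simplified sliding-window splitter in plain Python. The real `RecursiveCharacterTextSplitter` is smarter, preferring to break on paragraphs and sentences before falling back to raw characters, but the window-plus-overlap idea is the same:

```python
def naive_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character windows; each chunk repeats the last
    `chunk_overlap` characters of the previous one to preserve context."""
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, len(text), step)
        if text[i:i + chunk_size]
    ]

chunks = naive_split("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Notice how each chunk starts with the last three characters of the previous one. That overlap is what keeps a sentence cut at a chunk boundary from losing its meaning entirely.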
Step 2 — Embed & Store in Vector DB
```python
# rag_pipeline.py (continued)
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# Embedding model init (free, runs locally)
# First run downloads the model (~90MB)
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

VECTOR_STORE_PATH = "./vector_store"

def save_to_vector_db(chunks, collection_name: str = "my_docs"):
    """
    Convert chunks to vectors and save to ChromaDB.
    Same collection_name appends to existing data.
    """
    vector_db = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_model,
        persist_directory=VECTOR_STORE_PATH,
        collection_name=collection_name,
    )
    # Chroma 0.4+ persists automatically when persist_directory is set;
    # on older versions, call vector_db.persist() here explicitly.
    print(f"✅ {len(chunks)} chunks saved to vector DB")
    return vector_db

def load_vector_db(collection_name: str = "my_docs"):
    """Load an existing vector DB."""
    return Chroma(
        persist_directory=VECTOR_STORE_PATH,
        embedding_function=embedding_model,
        collection_name=collection_name,
    )
```
💡 Embedding model tip
`all-MiniLM-L6-v2` is free, fast, and great for English text. If you need multilingual support, swap it for `paraphrase-multilingual-MiniLM-L12-v2`.
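Under the hood, "semantically similar" means the embedding vectors point in nearly the same direction. Here's a minimal sketch of the cosine similarity the vector DB computes during search, using made-up 3-dimensional vectors for readability (real MiniLM embeddings have 384 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.2]
passage_about_topic = [0.8, 0.2, 0.1]  # points in a similar direction: high score
passage_off_topic = [0.1, 0.9, 0.0]    # points elsewhere: low score

print(cosine_similarity(query, passage_about_topic))  # close to 1.0
print(cosine_similarity(query, passage_off_topic))    # much lower
```

The retriever's "top k" is simply the k stored vectors with the highest score against the query vector.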
Step 3 — Build the RAG Chain (The Core)
```python
# rag_pipeline.py (continued)
from langchain_anthropic import ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
import os

def create_rag_chain(vector_db):
    """
    Assemble the RAG pipeline.
    Question → retrieve docs → LLM answer, wired as a single chain.
    """
    # 1. Retriever: fetch top 4 similar chunks
    retriever = vector_db.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}
    )

    # 2. Prompt template
    prompt = ChatPromptTemplate.from_template("""You are an AI assistant that answers questions strictly based on the provided documents.

[Reference Documents]
{context}

[Question]
{question}

Answer based only on the documents above.
If the answer cannot be found in the documents, say: "I couldn't find that information in the provided documents."
Do not make anything up.""")

    # 3. LLM config
    llm = ChatAnthropic(
        model="claude-sonnet-4-20250514",
        api_key=os.getenv("ANTHROPIC_API_KEY"),
        temperature=0,    # 0 = most consistent, minimizes hallucination
        max_tokens=1024
    )

    # 4. Assemble the chain (LangChain Expression Language)
    def format_docs(docs):
        return "\n\n---\n\n".join(
            f"[Source: {doc.metadata.get('source', 'Unknown')}, "
            f"Page: {doc.metadata.get('page', '?')}]\n{doc.page_content}"
            for doc in docs
        )

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_chain

def ask_question(question: str, collection_name: str = "my_docs") -> dict:
    """
    Accept a question, run it through the RAG pipeline, return the answer.
    """
    vector_db = load_vector_db(collection_name)
    chain = create_rag_chain(vector_db)

    # Also return the source docs for citation
    retriever = vector_db.as_retriever(search_kwargs={"k": 4})
    source_docs = retriever.invoke(question)

    answer = chain.invoke(question)

    return {
        "answer": answer,
        "sources": [
            {
                "page": doc.metadata.get("page", "?"),
                "source": doc.metadata.get("source", "Unknown"),
                "snippet": doc.page_content[:200] + "..."
            }
            for doc in source_docs
        ]
    }
```
6. Building the API Server with FastAPI
```python
# main.py
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from pathlib import Path
import shutil
from dotenv import load_dotenv

from rag_pipeline import (
    load_and_split_pdf,
    save_to_vector_db,
    ask_question,
    load_vector_db,
)

load_dotenv()

app = FastAPI(
    title="My RAG Chatbot API",
    description="Upload a PDF and ask questions about its content",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

UPLOAD_DIR = Path("./uploads")
UPLOAD_DIR.mkdir(exist_ok=True)

# ── Request / Response Models ──────────────────────
class QuestionRequest(BaseModel):
    question: str
    collection_name: str = "my_docs"

class QuestionResponse(BaseModel):
    answer: str
    sources: list

# ── Endpoints ──────────────────────────────────────
@app.get("/")
async def root():
    return {"message": "RAG Chatbot API is running 🚀"}

@app.post("/upload")
async def upload_pdf(
    file: UploadFile = File(...),
    collection_name: str = "my_docs"
):
    """Upload a PDF and store it in the vector DB."""
    if not file.filename.lower().endswith(".pdf"):
        raise HTTPException(status_code=400, detail="Only PDF files are accepted.")

    file_path = UPLOAD_DIR / file.filename
    with open(file_path, "wb") as f:
        shutil.copyfileobj(file.file, f)

    try:
        chunks = load_and_split_pdf(str(file_path))
        save_to_vector_db(chunks, collection_name)
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Processing error: {str(e)}")

    return {
        "message": f"✅ '{file.filename}' uploaded successfully",
        "chunks_created": len(chunks),
        "collection": collection_name
    }

@app.post("/ask", response_model=QuestionResponse)
async def ask(request: QuestionRequest):
    """Answer a question based on the uploaded documents."""
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Please enter a question.")

    try:
        result = ask_question(request.question, request.collection_name)
        return QuestionResponse(
            answer=result["answer"],
            sources=result["sources"]
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Answer generation error: {str(e)}")

@app.delete("/collection/{collection_name}")
async def delete_collection(collection_name: str):
    """Delete a document collection.
    Chroma keeps every collection inside one persist directory,
    so delete through the client rather than removing folders by name."""
    try:
        load_vector_db(collection_name).delete_collection()
    except Exception:
        raise HTTPException(status_code=404, detail="Collection not found.")
    return {"message": f"'{collection_name}' collection deleted"}
```
7. Testing Locally
Start the Server
uvicorn main:app --reload --port 8000
Once running, open http://localhost:8000/docs in your browser to see the auto-generated Swagger UI — test your endpoints right there.
Test with curl
```bash
# 1. Upload a PDF
curl -X POST "http://localhost:8000/upload" \
  -F "file=@./my_document.pdf" \
  -F "collection_name=my_docs"

# Sample response:
# {"message": "✅ 'my_document.pdf' uploaded successfully", "chunks_created": 47, "collection": "my_docs"}

# 2. Ask a question
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the main points of this document?", "collection_name": "my_docs"}'
```
Test with Python
```python
import requests

BASE_URL = "http://localhost:8000"

# Upload PDF
with open("my_document.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/upload",
        files={"file": f},
        data={"collection_name": "test"}
    )
print(response.json())

# Ask a question
response = requests.post(
    f"{BASE_URL}/ask",
    json={"question": "What is the main conclusion?", "collection_name": "test"}
)
result = response.json()

print("Answer:", result["answer"])
print("\nSources:")
for src in result["sources"]:
    print(f"  - Page {src['page']}: {src['snippet'][:80]}...")
```
8. Cloud Deployment — Live on Railway in 5 Minutes
Required Files
requirements.txt
```
langchain==0.3.19
langchain-anthropic
langchain-community
chromadb
fastapi
uvicorn
python-dotenv
pymupdf
sentence-transformers
```
Procfile
web: uvicorn main:app --host 0.0.0.0 --port $PORT
.gitignore (critical!)
```
.env
uploads/
vector_store/
__pycache__/
*.pyc
```
Deployment Steps
```bash
# 1. Initialize Git and push to GitHub
git init
git add .
git commit -m "Initial RAG chatbot deployment"
git remote add origin https://github.com/your-username/my-rag-chatbot.git
git push -u origin main

# 2. Go to railway.app → "New Project" → "Deploy from GitHub"
# 3. Select your repository
# 4. Settings → Environment Variables → Add ANTHROPIC_API_KEY
# 5. Auto-deploy complete — access your live URL
```
⚠️ Deployment notes
- Never push your `.env` file to GitHub. Set secrets via Railway environment variables.
- The `vector_store/` folder resets on server restart. For production, use a persistent vector DB like Pinecone or Supabase pgvector.
- The free plan includes $5/month in credits — plenty for personal projects.
9. Next Steps: Where to Go From Here
What we built today is the most fundamental form of RAG. To reach production quality, consider these improvements.
Performance Improvements
| Technique | Effect | Difficulty |
|---|---|---|
| Hybrid Search (keyword + vector) | 30–50% retrieval accuracy gain | ★★★ |
| Re-ranking (Cohere Rerank, etc.) | Higher answer quality | ★★★ |
| Chunk size tuning | Optimal varies by domain | ★★ |
| Multi-query retrieval | Better recall via diverse queries | ★★ |
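Of these, hybrid search is the most common first upgrade. It typically runs a keyword ranking (e.g. BM25) and a vector ranking separately, then merges the two lists; one popular merge is Reciprocal Rank Fusion (RRF). Here's a minimal sketch over two hypothetical ranked lists of document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each doc's score is the sum of
    1 / (k + rank) across every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranking = ["doc3", "doc1", "doc7"]  # e.g. from a BM25 index
vector_ranking = ["doc1", "doc5", "doc3"]   # e.g. from the ChromaDB retriever

print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
# → ['doc1', 'doc3', 'doc5', 'doc7']
```

Documents that rank well in *both* lists (here `doc1` and `doc3`) float to the top, which is exactly why hybrid search beats either ranking alone on queries that mix exact terms with vague phrasing.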
Adding Conversation History
```python
# Multi-turn conversation with memory
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(
    k=5,                  # Remember last 5 turns
    return_messages=True,
    memory_key="chat_history"
)
```
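If you'd rather not depend on the memory abstraction (recent LangChain releases encourage managing history yourself), the same windowing idea is a few lines of plain Python: keep only the last k exchanges and prepend them to the prompt. A minimal sketch:

```python
class ChatWindow:
    """Keep only the last k (user, assistant) turns for the prompt."""

    def __init__(self, k: int = 5):
        self.k = k
        self.turns: list[tuple[str, str]] = []

    def add(self, user_msg: str, ai_msg: str) -> None:
        self.turns.append((user_msg, ai_msg))
        self.turns = self.turns[-self.k:]  # drop the oldest turns beyond the window

    def as_context(self) -> str:
        """Render the window as text to prepend to the next RAG prompt."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)

window = ChatWindow(k=2)
for i in range(4):
    window.add(f"question {i}", f"answer {i}")
print(window.as_context())  # only the last 2 turns survive
```

The window matters because every remembered turn consumes context tokens on each call; a hard cap keeps latency and cost flat no matter how long the conversation runs.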
Wrapping Up — Building It Is How You Really Learn
Building a RAG system makes you realize how much thoughtful design goes into something that looks simple.
- How should chunks be split?
- Which embedding model fits the use case?
- Too many retrieved documents? Too few?
- How should the prompt be written to minimize hallucination?
Working through those questions is what makes you an AI engineer. The code is just the implementation of that design thinking.
In Part 3, we’ll take this chatbot and give it a team of AI agents to work with — a research agent, a writing agent, and a fact-checking agent that collaborate to produce higher-quality, more reliable output than any single AI call could.
🔖 Other posts in this series
- Part 1: Why Python Still Dominates in 2026
- Part 2: Build Your Own AI Chatbot — RAG From Scratch to Deployment ← You are here
- Part 3: One AI Is No Longer Enough — LangGraph Multi-Agent Systems (next)
Tags: #Python #RAG #LangChain #FastAPI #AIChatbot #ChromaDB #MachineLearning #DevTutorial #AIDevelopment #2026