Opening the AI Agent Black Box — How to See Inside Your Agent with LangSmith [2026]

📌 Level: Intermediate (AI agent development experience recommended) ⏱️ Reading time: ~13 minutes 🛠️ After reading this: You’ll be able to implement agent tracing, evaluation, and monitoring with real, working code


You deployed your agent.

It works at first. Then a few days later a user says:

“It was fine yesterday, but now it’s giving weird answers.”

You open the logs. All you see is ERROR: None. What the agent was thinking, which tool it chose and why, where it went wrong — none of it is visible.

This is the most common pain in AI agent development.

With regular software, when something breaks you read the stack trace. But an AI agent is a complex flow where the LLM makes decisions at every step, calls tools in sequence, and passes intermediate results to the next stage. If you can’t see what’s happening, you can’t fix it.

LangSmith was built to solve exactly this problem.


Why Agent Observability Is Non-Negotiable

“Traces are the only record of what your agent did and why.” — LangChain Official Documentation

LLM-based agents are inherently non-deterministic. The same input can lead to different execution paths. That’s why reading the code alone isn’t enough to understand actual behavior — you need to see the execution record.

What LangSmith provides:

  • Tracing — Visualizes every LLM call, tool execution, and intermediate reasoning step
  • Evaluation — Measures agent response quality with data-driven metrics
  • Monitoring — Tracks cost, latency, and error rates in real-time in production
  • Continuous improvement — Turns failure cases into datasets for regression testing

📊 Table of Contents

  1. Setting Up LangSmith in 5 Minutes
  2. Tracing — Watching Your Agent’s Internals in Real Time
  3. Reading Traces — Diagnosing Where Things Went Wrong
  4. Evaluation — Measuring Quality in Numbers
  5. LLM-as-Judge — Using AI to Evaluate AI
  6. Production Monitoring — Real-Time Dashboards
  7. The Continuous Improvement Loop — Failures → Datasets → Regression Tests
  8. Cost & Latency Optimization

1. Setting Up LangSmith in 5 Minutes

pip install langsmith langchain langchain-anthropic python-dotenv

.env file:

# LangSmith API key (free at smith.langchain.com)
LANGCHAIN_API_KEY=your-langsmith-api-key
LANGCHAIN_TRACING_V2=true # Enable automatic tracing
LANGCHAIN_PROJECT=my-agent-project # Project name
# Anthropic
ANTHROPIC_API_KEY=your-anthropic-api-key

That’s it. With just these three LangSmith environment variables, every execution of a LangChain/LangGraph-based agent is automatically recorded in LangSmith.

# Verify tracing is working
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage
load_dotenv()
llm = ChatAnthropic(model="claude-sonnet-4-20250514")
response = llm.invoke([HumanMessage(content="Hello! This is a tracing test.")])
print(response.content)
# Check the trace at smith.langchain.com
print("✅ View your trace at: https://smith.langchain.com")

2. Tracing — Watching Your Agent’s Internals in Real Time

Tracing records the agent’s execution as a hierarchical tree structure.

[Full Execution Trace]
├── User input: "What's the weather in Tokyo?"
├── LLM Call #1
│   ├── Input prompt (full text)
│   ├── Output: {"action": "get_weather", "input": "Tokyo"}
│   ├── Tokens: 247 in / 32 out
│   └── Latency: 1.2s
├── Tool execution: get_weather("Tokyo")
│   ├── Input: "Tokyo"
│   ├── Output: "Tokyo: Cloudy, 22°C"
│   └── Latency: 0.3s
├── LLM Call #2
│   ├── Input: [prior conversation + tool result]
│   ├── Output: "Tokyo is currently cloudy at 22°C."
│   ├── Tokens: 312 in / 18 out
│   └── Latency: 0.9s
└── Final answer: "Tokyo is currently cloudy at 22°C."

Total cost: $0.0023 / Total time: 2.4s
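The per-span numbers in a trace roll up into the totals on the last line. A minimal pure-Python sketch of that roll-up (the span dicts below are illustrative and mirror the tree above — they are not LangSmith's actual run schema):

```python
# Illustrative span records mirroring the trace above (NOT LangSmith's schema)
spans = [
    {"name": "LLM Call #1", "tokens_in": 247, "tokens_out": 32, "latency_s": 1.2},
    {"name": "get_weather", "tokens_in": 0, "tokens_out": 0, "latency_s": 0.3},
    {"name": "LLM Call #2", "tokens_in": 312, "tokens_out": 18, "latency_s": 0.9},
]

def summarize(spans: list) -> dict:
    """Roll per-span stats up into trace-level totals."""
    return {
        "total_tokens_in": sum(s["tokens_in"] for s in spans),
        "total_tokens_out": sum(s["tokens_out"] for s in spans),
        "total_latency_s": round(sum(s["latency_s"] for s in spans), 2),
    }

print(summarize(spans))
# → {'total_tokens_in': 559, 'total_tokens_out': 50, 'total_latency_s': 2.4}
```

LangSmith computes these totals for you in the UI; the point is just that every trace-level number is traceable back to an individual span.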

Adding Custom Traces

Beyond LangChain’s automatic tracing, you can manually add custom traces.

# custom_tracing.py
from langsmith import traceable
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-20250514")

# The @traceable decorator includes this function in the trace
@traceable(name="Weather API Call")
def fetch_weather(city: str) -> dict:
    """Calls an external weather API."""
    # Simulated API call
    return {"city": city, "temp": 22, "condition": "Cloudy"}

@traceable(name="Data Formatting")
def format_weather(data: dict) -> str:
    """Formats weather data into human-readable text."""
    return f"{data['city']}: {data['condition']}, {data['temp']}°C"

@traceable(name="Weather Agent Run", run_type="chain")
def weather_agent(user_query: str) -> str:
    """
    Agent that receives a user query and provides weather info.
    This entire function is recorded as one trace.
    """
    city = "Tokyo"  # Simplified for demo
    # Each step is recorded as a nested trace
    raw_data = fetch_weather(city)
    formatted = format_weather(raw_data)
    # The LLM call is automatically included in the trace
    response = llm.invoke(f"Please explain this weather information clearly: {formatted}")
    return response.content

# Run it
result = weather_agent("What's the weather like in Tokyo?")
print(result)

In the LangSmith UI, you’ll see this:

[Weather Agent Run]  2.8s, $0.0031
├── [Weather API Call]  0.3s
├── [Data Formatting]  0.001s
└── [ChatAnthropic]  2.5s
    ├── Input tokens: 89
    └── Output tokens: 42

3. Reading Traces — Diagnosing Where Things Went Wrong

The ability to read traces is a core skill in agent development. Here are the key failure patterns and how to diagnose them.

Failure Pattern 1: Wrong Tool Selection

[What you see in the trace]
LLM Call #1
Output: {"action": "send_email", "input": "weather is clear"}
← Should have checked weather, selected email tool instead!
[Diagnosis]
→ Tool description is unclear, OR
→ System prompt lacks tool selection guidance
[Fix]
→ Sharpen tool descriptions
→ Add tool selection guidelines to the system prompt
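What "sharpen the tool description" means in practice — a before/after shown as plain dicts, following the common tools-as-JSON convention rather than any specific library's API (the exact field names here are illustrative):

```python
# Vague: the model has to guess when this tool applies
vague_tool = {
    "name": "get_weather",
    "description": "Gets weather.",
}

# Sharp: says what it does, when to use it, what the input must be,
# and when NOT to use it
clear_tool = {
    "name": "get_weather",
    "description": (
        "Look up the CURRENT weather for a single city. "
        "Use this whenever the user asks about weather, temperature, "
        "or outdoor conditions. Input: the city name as a plain string, "
        "e.g. 'Tokyo'. Do NOT use this for forecasts or historical data."
    ),
}

# Matching guidance added to the system prompt
SYSTEM_PROMPT_ADDENDUM = (
    "Tool selection rules:\n"
    "- Use get_weather ONLY for current-weather questions.\n"
    "- Use send_email ONLY when the user explicitly asks to send an email.\n"
    "- If no tool fits, answer directly without calling a tool."
)
```

After a change like this, rerun the failing trace and confirm LLM Call #1 now emits the correct action.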

Failure Pattern 2: Hallucination

[What you see in the trace]
Tool execution: search_product("iPhone 17")
Output: "Product not found"
LLM Call #2
Input: [includes tool result]
Output: "The iPhone 17 costs $1,299."
← The tool said "not found" but the LLM invented a price!
[Diagnosis]
→ System prompt is missing explicit instruction to not invent information
→ Temperature may be set too high
[Fix]
→ Strengthen system prompt + set temperature=0
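A cheap belt-and-suspenders guard alongside the prompt fix: short-circuit before the second LLM call when the tool already reported failure, so the model never gets the chance to invent details. A sketch — `generate_answer` stands in for your real LLM call, and the marker list is an assumption you'd tailor to your tools:

```python
# Tool-output phrases that signal a failed lookup (adjust per tool)
NOT_FOUND_MARKERS = ("not found", "no results", "no data")

def grounded_answer(tool_output: str, generate_answer) -> str:
    """Refuse to let the LLM elaborate on a failed tool lookup."""
    if any(marker in tool_output.lower() for marker in NOT_FOUND_MARKERS):
        # Don't hand a failed lookup to the LLM at all -- answer honestly
        return "I couldn't find that information, so I can't give you details."
    return generate_answer(tool_output)

# The LLM callback only runs when the tool actually found something
print(grounded_answer("Product not found", lambda out: "never called"))
print(grounded_answer("iPhone 15: $799", lambda out: f"Based on the lookup: {out}"))
```

This keeps the hallucination fix deterministic instead of relying solely on the model obeying the prompt.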

Failure Pattern 3: Infinite Loop

[What you see in the trace]
LLM Call #1 → Tool A → LLM Call #2 → Tool A
LLM Call #3 → Tool A → LLM Call #4 → Tool A
... (repeating)
[Diagnosis]
→ Tool A's output isn't giving the LLM enough information to move forward
→ Or the LLM isn't recognizing the termination condition
[Fix]
→ Add maximum iteration limit
→ Improve tool output format
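The iteration cap is a few lines in a hand-rolled agent loop (sketch with a stubbed `step` function; if you use LangGraph, the built-in equivalent is the `recursion_limit` entry in the invoke config):

```python
def run_agent_loop(step, max_iterations: int = 10) -> str:
    """
    Drive the think->act loop, but refuse to spin forever.
    `step` returns either ("final", answer) or ("continue", state).
    """
    state = None
    for _ in range(max_iterations):
        kind, value = step(state)
        if kind == "final":
            return value
        state = value
    # Hit the cap: fail loudly instead of looping silently
    raise RuntimeError(f"Agent did not terminate within {max_iterations} iterations")

# A stub that never produces a final answer -- triggers the guard
try:
    run_agent_loop(lambda s: ("continue", s), max_iterations=3)
except RuntimeError as e:
    print(f"Caught: {e}")
```

The resulting `RuntimeError` shows up as a failed run in the trace, which is far easier to spot (and cheaper) than forty identical LLM calls.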

4. Evaluation — Measuring Quality in Numbers

Tracing tells you “what happened.” Evaluation measures “how well it went.”

# evaluation_basic.py
from langsmith import Client
from langsmith.evaluation import evaluate
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage

client = Client()
llm = ChatAnthropic(model="claude-sonnet-4-20250514")

# ── Step 1: Create Evaluation Dataset ─────────────────
def create_dataset():
    """Create an input-output pair dataset for evaluation"""
    examples = [
        {
            "inputs": {"question": "Why is Python popular in AI development?"},
            "outputs": {"answer": "Python's rich libraries (PyTorch, TensorFlow, etc.) and clean syntax make it ideal for AI work."}
        },
        {
            "inputs": {"question": "What are the main features of FastAPI?"},
            "outputs": {"answer": "FastAPI is a high-performance async Python web framework based on type hints, with automatic API documentation generation."}
        },
        {
            "inputs": {"question": "What is RAG?"},
            "outputs": {"answer": "RAG (Retrieval-Augmented Generation) is a technique that searches relevant documents before generating LLM responses to improve accuracy."}
        },
    ]
    dataset = client.create_dataset(
        dataset_name="AI QA Eval Set v1",
        description="Evaluation of response quality for AI development questions"
    )
    client.create_examples(
        inputs=[e["inputs"] for e in examples],
        outputs=[e["outputs"] for e in examples],
        dataset_id=dataset.id
    )
    print(f"✅ Dataset created (ID: {dataset.id})")
    return dataset

# ── Step 2: Define the Agent to Evaluate ──────────────
def my_agent(inputs: dict) -> dict:
    """The agent being evaluated"""
    response = llm.invoke([HumanMessage(content=inputs["question"])])
    return {"answer": response.content}

# ── Step 3: Define Evaluators ─────────────────────────
def correctness_evaluator(run, example) -> dict:
    """
    Check whether the agent's response contains key keywords.
    In production, use LLM-as-Judge or semantic similarity instead.
    """
    agent_answer = run.outputs.get("answer", "").lower()
    reference = example.outputs.get("answer", "").lower()
    ref_words = set(reference.split())
    agent_words = set(agent_answer.split())
    important_words = {w for w in ref_words if len(w) > 3}
    if not important_words:
        return {"key": "correctness", "score": 0.5}
    overlap = len(important_words & agent_words) / len(important_words)
    return {
        "key": "correctness",
        "score": overlap,
        "comment": f"Keyword match rate: {overlap:.0%}"
    }

def length_evaluator(run, example) -> dict:
    """Check whether the response length is appropriate."""
    answer = run.outputs.get("answer", "")
    length = len(answer)
    if 50 <= length <= 300:
        score, comment = 1.0, "Appropriate length"
    elif length < 50:
        score, comment = 0.3, f"Too short ({length} chars)"
    else:
        score, comment = 0.7, f"On the long side ({length} chars)"
    return {"key": "length_score", "score": score, "comment": comment}

# ── Step 4: Run Evaluation ────────────────────────────
def run_evaluation():
    dataset = create_dataset()
    results = evaluate(
        my_agent,
        data=dataset.name,
        evaluators=[correctness_evaluator, length_evaluator],
        experiment_prefix="baseline-claude-sonnet",
        metadata={"model": "claude-sonnet-4-20250514", "version": "v1"}
    )
    # evaluate() returns an ExperimentResults object, not a dict --
    # summarize the feedback scores via its pandas export
    df = results.to_pandas()
    print("\n📊 Evaluation Results:")
    print(f"  - Avg correctness: {df['feedback.correctness'].mean():.1%}")
    print(f"  - Avg length score: {df['feedback.length_score'].mean():.1%}")
    print("  → View details at: https://smith.langchain.com")
    return results

if __name__ == "__main__":
    run_evaluation()

5. LLM-as-Judge — Using AI to Evaluate AI

Keyword matching has limits. Having another LLM judge the response quality is far more accurate.

# llm_judge_evaluator.py
from langsmith.evaluation import run_evaluator, EvaluationResult
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage
import json

judge_llm = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    temperature=0  # Fix at 0 for consistency
)

@run_evaluator
def llm_judge_helpfulness(run, example) -> EvaluationResult:
    """
    An LLM scores the response's helpfulness on a 1–5 scale.
    This evaluator itself is traced in LangSmith.
    """
    question = example.inputs.get("question", "")
    agent_answer = run.outputs.get("answer", "")
    reference = example.outputs.get("answer", "")
    prompt = f"""
You are an expert evaluating AI response quality.

[Question]
{question}

[Reference Answer]
{reference}

[Response to Evaluate]
{agent_answer}

Score from 1–5 using these criteria:
- 5: Perfectly accurate and helpful
- 4: Mostly accurate and sufficiently helpful
- 3: Partially accurate or incomplete
- 2: Inaccurate or potentially misleading
- 1: Completely wrong or unhelpful

Respond ONLY in JSON:
{{"score": integer 1-5, "reasoning": "your evaluation"}}
"""
    response = judge_llm.invoke([HumanMessage(content=prompt)])
    try:
        result = json.loads(response.content)
        score_normalized = (result["score"] - 1) / 4  # normalize to 0–1
        return EvaluationResult(
            key="helpfulness",
            score=score_normalized,
            comment=result.get("reasoning", "")
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        return EvaluationResult(key="helpfulness", score=0.5, comment="Parse failed")

@run_evaluator
def llm_judge_no_hallucination(run, example) -> EvaluationResult:
    """An LLM determines whether the response contains hallucinations."""
    question = example.inputs.get("question", "")
    agent_answer = run.outputs.get("answer", "")
    prompt = f"""
Determine whether the following AI response contains factually incorrect information (hallucinations).

[Question]: {question}
[AI Response]: {agent_answer}

Respond ONLY in JSON:
{{"has_hallucination": true/false, "explanation": "your reasoning"}}
"""
    response = judge_llm.invoke([HumanMessage(content=prompt)])
    try:
        result = json.loads(response.content)
        score = 0.0 if result["has_hallucination"] else 1.0
        return EvaluationResult(
            key="no_hallucination",
            score=score,
            comment=result.get("explanation", "")
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        return EvaluationResult(key="no_hallucination", score=0.5)

6. Production Monitoring — Real-Time Dashboards

After deploying, use Online Evaluation to monitor live traffic.

# production_monitoring.py
from langsmith.evaluation import run_evaluator, EvaluationResult

@run_evaluator
def safety_check(run, example) -> EvaluationResult:
    """
    Safety evaluator that runs automatically on production traffic.
    Checks for harmful content or personal information leakage.
    """
    answer = run.outputs.get("answer", "").lower()
    # Danger keyword detection (use more sophisticated logic in production)
    danger_words = ["password", "ssn", "social security", "credit card"]
    has_danger = any(word in answer for word in danger_words)
    return EvaluationResult(
        key="safety",
        score=0.0 if has_danger else 1.0,
        comment="Sensitive info detected" if has_danger else "Safe"
    )

@run_evaluator
def format_validator(run, example) -> EvaluationResult:
    """Automatically validates response format."""
    answer = run.outputs.get("answer", "")
    checks = {
        "Not empty": len(answer.strip()) > 0,
        "Not too long": len(answer) < 2000,
        "No error message": "error" not in answer.lower(),
    }
    score = sum(checks.values()) / len(checks)
    failed = [k for k, v in checks.items() if not v]
    return EvaluationResult(
        key="format_valid",
        score=score,
        comment=f"Failed: {failed}" if failed else "All format checks passed"
    )

Key metrics tracked in the LangSmith dashboard:

Metric               Description                      Alert Threshold
P50 / P99 Latency    Median and worst-case latency    P99 > 5s
Token Cost           Average cost per request         +20% vs. prior week
Error Rate           Percentage of error responses    > 1%
Quality Score        LLM-as-Judge average             < 0.7
Tool Failure Rate    Tool call failure percentage     > 5%
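The thresholds above are easy to encode as a dashboard-side check. A pure-Python sketch — the metric names and limits mirror the table, but the input dict shape is an assumption, not a LangSmith API:

```python
def check_alerts(metrics: dict) -> list:
    """Compare live metric values against the alert thresholds in the table."""
    alerts = []
    if metrics.get("p99_latency_s", 0) > 5:
        alerts.append("P99 latency above 5s")
    if metrics.get("cost_change_vs_prior_week", 0) > 0.20:
        alerts.append("Token cost up more than 20% week-over-week")
    if metrics.get("error_rate", 0) > 0.01:
        alerts.append("Error rate above 1%")
    if metrics.get("quality_score", 1.0) < 0.7:
        alerts.append("LLM-as-Judge quality below 0.7")
    if metrics.get("tool_failure_rate", 0) > 0.05:
        alerts.append("Tool failure rate above 5%")
    return alerts

print(check_alerts({"p99_latency_s": 6.1, "quality_score": 0.65}))
# → ['P99 latency above 5s', 'LLM-as-Judge quality below 0.7']
```

In practice you'd configure these as LangSmith automation rules or feed the metrics into your existing alerting stack; the thresholds themselves are the part worth standardizing.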

7. The Continuous Improvement Loop — Failures → Datasets → Regression Tests

LangSmith’s real power is automating the observe → evaluate → improve → redeploy cycle.

[Continuous Improvement Cycle]

Collect production traces
        ↓
Insights Agent analyzes patterns
(What types of failures are most common?)
        ↓
Add failure cases to evaluation dataset
        ↓
Modify prompts / tools
        ↓
Run offline evaluation with updated version
(Did the score improve vs. the old dataset?)
        ↓
Pass → Deploy / Fail → Revise again
        ↓
Collect production traces (repeat)
# continuous_improvement.py
from langsmith import Client
from langsmith.evaluation import evaluate

# Judge evaluators defined in llm_judge_evaluator.py (section 5)
from llm_judge_evaluator import llm_judge_helpfulness, llm_judge_no_hallucination

client = Client()

def collect_failures_to_dataset(
    project_name: str,
    dataset_name: str,
    min_score_threshold: float = 0.5,
    limit: int = 50
):
    """
    Automatically adds low-quality production runs to a dataset.
    This dataset becomes the foundation for future evaluation and regression testing.
    """
    runs = client.list_runs(
        project_name=project_name,
        filter=f'and(lt(feedback_stats.helpfulness.avg, {min_score_threshold}), eq(error, null))',
        limit=limit
    )
    collected = 0
    for run in runs:
        if run.inputs and run.outputs:
            client.create_example(
                inputs=run.inputs,
                outputs=run.outputs,
                dataset_name=dataset_name,
                metadata={
                    "source_run_id": str(run.id),
                    "quality_score": (run.feedback_stats or {}).get("helpfulness", {}).get("avg"),
                    "reason": "production_failure"
                }
            )
            collected += 1
    print(f"✅ {collected} failure cases added to dataset")
    return collected

def regression_test_before_deploy(
    new_agent_fn,
    dataset_name: str,
    min_pass_score: float = 0.75
) -> bool:
    """
    Regression test before deployment.
    Blocks deployment if the score falls below the threshold.
    Can be integrated into CI/CD pipelines.
    """
    results = evaluate(
        new_agent_fn,
        data=dataset_name,
        evaluators=[llm_judge_helpfulness, llm_judge_no_hallucination],
        experiment_prefix="pre-deploy-regression"
    )
    # evaluate() returns an ExperimentResults object; average the judge scores
    avg_score = results.to_pandas()["feedback.helpfulness"].mean()
    passed = avg_score >= min_pass_score
    if passed:
        print(f"✅ Regression test passed (score: {avg_score:.2f} ≥ {min_pass_score})")
    else:
        print(f"❌ Regression test failed (score: {avg_score:.2f} < {min_pass_score})")
        print("   Deployment blocked. Please fix the prompt or logic.")
    return passed

8. Cost & Latency Optimization

LangSmith trace data gives you the evidence to optimize costs.

# cost_optimization.py
from langsmith import Client
from collections import defaultdict
from datetime import datetime, timedelta

client = Client()

def analyze_cost_by_step(project_name: str, days: int = 7):
    """
    Analyze cost by execution step for the past N days.
    Pinpoints which LLM calls are consuming the most budget.
    """
    runs = client.list_runs(
        project_name=project_name,
        start_time=datetime.now() - timedelta(days=days)
    )
    cost_by_node = defaultdict(float)
    token_by_node = defaultdict(int)
    for run in runs:
        node_name = run.name
        if run.total_cost:
            cost_by_node[node_name] += float(run.total_cost)  # total_cost may be Decimal
        if run.prompt_tokens:
            token_by_node[node_name] += run.prompt_tokens + (run.completion_tokens or 0)
    print(f"\n💰 Cost analysis by step (last {days} days):")
    for node, cost in sorted(cost_by_node.items(), key=lambda x: -x[1]):
        tokens = token_by_node[node]
        print(f"  {node}: ${cost:.4f} ({tokens:,} tokens)")
    return cost_by_node

Cost reduction strategies based on trace data:

# Strategy 1: Route simple queries to cheaper models
def smart_model_routing(query: str) -> str:
    """Auto-select model based on query complexity"""
    if len(query) < 100 and "?" in query:
        return "claude-haiku-4-5-20251001"  # Fast and cheap for simple questions
    else:
        return "claude-sonnet-4-20250514"  # High-performance for complex tasks

# Strategy 2: Cache repeated requests
from functools import lru_cache
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-20250514")

@lru_cache(maxsize=1000)
def cached_llm_call(prompt: str) -> str:
    """Return a cached response for identical prompts (lru_cache keys on the prompt string itself)"""
    response = llm.invoke(prompt)
    return response.content

Wrapping Up

As long as your agent remains a black box, improving it is guesswork.

LangSmith opens that black box so you can see what’s actually happening. Trace the execution, measure quality with evaluations, watch production with monitoring, and turn failures into datasets for continuous improvement.

Once this loop is running, your agent gets better every day it operates.

The next post will cover AI Agent Cost Optimization — practical strategies to cut API costs in half while maintaining quality.


🔖 AI Agent Development Series

  • The Complete AI Agent Development Guide — From Concepts to Production Architecture
  • The Complete MCP Guide — The Standard Protocol That Gives Your AI Agent Hands and Feet
  • Opening the AI Agent Black Box — How to See Inside Your Agent with LangSmith ← You are here

Tags: #LangSmith #AIAgents #Observability #Tracing #LLMEvaluation #LangChain #ProductionAI #2026 #DevTutorial #AgentDevelopment


Sources: LangChain Official Docs · DigitalOcean LangSmith Tutorial · LangSmith Evaluation Docs · Dev.to AI Agent Observability 2026