📋 The Scenario (Used Throughout This Playbook)
Client: Mid-market B2B SaaS company (CRM platform, ~500 users)
Problem: Customer support team is drowning in tickets. They need to surface relevant past customer conversations faster.
Stated Solution: "Build us an AI chatbot that can search our ticket history"
Real Problem (discovered via discovery call): Support reps spend 10 minutes per ticket hunting for context. Customers wait. Quality suffers.
Timeline: MVP in 3 weeks. 1 engineer on their team (must be able to maintain it).
You've heard the problem. They want "AI chatbot search" or "intelligent document classification." Now comes the work: translating vague pain into a specific solution. Here's the exact sequence.
You ask: "What's the chatbot solving? How many queries/day? What are they asking? How long do reps spend answering?"
Reality: 50 queries/day, rep spends 5 min per query hunting for context, customer wait time kills deals.
Cost: 50 queries/day × 5 min = 250 min/day of rep time = $1,250/day (at $30/hr).
Quality: 70% customer satisfaction (ambiguous answers).
Success metric: Cut time to 2 min/ticket, 85% satisfaction.
Freshness: New tickets added daily, but search only uses historical context (not real-time).
Constraints: <2s latency (user can't wait), $50–100/month budget (no GPT-4), junior engineer maintains.
• Latency: <2s (p99)
• Quality: 80%+ relevant results in top 3
• Cost: <$75/month
• Adoption: 90%+ of reps use daily
• Maintenance: Junior engineer can add tickets without consultant help
✅ Best accuracy | ❌ $200/month, 3–5s latency, black box
Option B (Hybrid SQL+Vector): PostgreSQL + pgvector + Sonnet
✅ $60/month, <2s, transparent | ⚠️ Slightly lower quality, needs setup
Option C (Full-text only): PostgreSQL full-text search
✅ Free, <0.5s latency | ❌ Fails on semantic mismatches
Why: Meets all constraints, client can maintain, transparent failure modes.
ADR: ADR-001-hybrid-search.md (in repo, explains tradeoffs)
LLM: Claude 3.5 Sonnet (cost-optimal for rerank task)
Vector DB: pgvector (built into PostgreSQL, no new infrastructure)
Deployment: Railway or Azure Container Apps
Observability: SQLite metrics + LangSmith (for token tracking)
Week 2: Reranking + latency optimization, observability setup
Week 3: Testing, edge cases, deployment to staging
Handoff: Runbook, tests, ADRs, demo
• Architecture diagram (boxes: Postgres, pgvector, Claude, retrieval pipeline)
• Why these tools (ADR)
• Timeline: 3 weeks to MVP
• Infrastructure cost: ~$50/month
• Consultant fee: $X
• What's in / out of scope (no agents, no chat history, v1 memory only)
✓ Checkpoint: Before You Code
- Problem statement: "Current state: X → Desired state: Y" (quantified)
- Success metrics: Latency, cost, quality, adoption (measurable)
- Architecture: Diagram + ADR explaining choices
- Tools: Specific tools + why (vs. alternatives)
- Timeline: Realistic breakdown by week
- Signed-off scope: Client agrees to what's in/out
Once you know the architecture, use this matrix to pick specific tools. The constraints from your discovery should drive the choice.
| Category | Decision Point | If Constraint Is… | Choose… | Not This | Cost / Notes |
|---|---|---|---|---|---|
| 🤖 LLM Provider | Cost vs. Quality | $50/month budget | Claude 3.5 Sonnet (rerank, not gen) | GPT-4 (10x cost) | $0.003 / 1K input tokens |
| Cost vs. Quality | $500/month budget | GPT-4 Turbo or Claude Opus | Free tier models | Better quality for generation tasks | |
| Latency critical | <1s response time | Sonnet (fast) or local Llama (cached) | Opus (slower) | Sonnet ~400ms, Opus ~2s | |
| 🗄️ Vector DB | Operational burden | Junior engineer, no DevOps | pgvector (in Postgres) or Pinecone | Qdrant, Weaviate (self-hosted) | pgvector = free, Pinecone = managed |
| Scale + cost | >100M vectors | Qdrant or Milvus (self-hosted) | Pinecone (goes 10x over budget) | Self-hosted = cheap at scale | |
| Existing DB | Data already in PostgreSQL | pgvector (no migration) | Separate vector DB (ETL overhead) | Same connection pool, simpler | |
| ⚙️ Framework | Multi-step orchestration? | Yes, complex agent flows | LangGraph or CrewAI | Simple retrieval (don't overcomplicate) | Graph = state management |
| Multi-step orchestration? | No, simple retrieval + rerank | None (write plain Python) | LangChain (adds complexity) | Less abstraction = more control | |
| Team maintenance | Junior engineer, limited ML | No framework (plain Python + SQL) | LangChain abstraction (hides bugs) | Transparency > magic | |
| 📊 Embedding | Cost + quality | Semantic search (not generation) | text-embedding-3-small or voyage-3 | text-embedding-3-large (2x cost) | Small = $0.02 / 1M tokens |
| Privacy critical | No API calls allowed | all-minilm, bge-small (local) | OpenAI embeddings (API) | Run locally, ~50ms per query | |
| 🚀 Deployment | Speed to MVP | 3-week sprint, simple app | Railway or Render | Kubernetes (too much setup) | Deploy with `git push` |
| Enterprise, Azure ecosystem | Client uses Azure already | Azure Container Apps or App Service | AWS services (different ecosystem) | Same dashboard, billing, SSO | |
| Scale + cost | >1M requests/month | Kubernetes or managed Container Apps | Serverless (cost explodes on volume) | Containers = per-replica billing | |
| 📈 Observability | Startup phase | MVP, <100 searches/day | SQLite (free, transparent) | Datadog or New Relic (overkill) | $0, queryable, no vendor lock |
| Scale phase | 10k+ searches/day, LLM evals | LangSmith + custom monitoring | Separate tools (splinter visibility) | LangSmith = $30/mo for traces |
🎯 Decision Algorithm
- Start with constraints: latency, budget, scale, team skill, deployment env
- Pick the most restrictive constraint: If budget is $50, eliminate expensive tools first
- Within that bucket, optimize for maintainability: Junior engineer? Pick transparent tools
- Benchmark against existing infrastructure: Using PostgreSQL? Use pgvector (no ETL)
- Document the "why not" for each alternative: Creates your ADR
Write an ADR Before Code
❌ Vibe Approach AVOID
"Let's use LangChain + Pinecone + GPT-4. It's the most powerful setup. We'll build a RAG pipeline and the LLM will figure it out."
No tradeoff analysis. No mention of cost, latency, or maintainability. Optimized for "sounds impressive."
✅ Senior Approach DO THIS
Write an ADR that shows the decision, the constraints, alternatives considered + why they were rejected, tradeoffs, and how you'll measure success.
📄 ADR-001: Hybrid Search Architecture (SQL + Vector)
Status: ACCEPTED | Date: 2024-04-15
Decision: Use PostgreSQL (full-text search) + pgvector (semantic search) + Claude 3.5 Sonnet, NOT pure vector RAG with Pinecone/GPT-4.
Context & Constraints:
- Latency SLA: <2 seconds end-to-end
- Monthly budget: $50–100 in AI costs
- Team: 1 junior backend engineer, not an ML specialist
- Data: ~50k support tickets, mostly text
- Maintenance: Client must understand and modify it 6 months from now
Considered Alternatives:
| Option | Latency | Cost/mo | Maintainability | Why Not |
|---|---|---|---|---|
| Pure Vector (Pinecone + GPT-4) | ~1.5s ✅ | $150–250 ❌ | Black box ❌ | Over budget; client can't debug failures |
| Full-text only (PostgreSQL) | ~0.2s ✅ | $0 ✅ | Simple ✅ | Fails on semantic mismatches ("Can't login" vs "access denied") |
| Hybrid (SQL + pgvector) ← CHOSEN | ~1.8s ✅ | $60 ✅ | Transparent ✅ | — |
Why This Works:
- Cost: Sonnet 3.5 is 1/10th the price of GPT-4. One reranking call per search = ~$0.005 per query.
- Latency: Parallel SQL + vector search, then rerank = stays under 2s.
- Maintainability: Junior engineer can understand SQL + embeddings. No black-box prompt magic.
- Failure modes clear: If results are bad, they know to add more training data or tweak the embedding model.
Tradeoffs:
How We'll Know This Works:
- Support reps find the right ticket in top 3 results 80%+ of the time
- Time-to-answer drops from 10 min → 2 min
- Monthly token bill stays under $75
- Client's engineer can add new tickets to index without our help
Path Forward If This Breaks:
- If semantic quality is bad → try voyage-3 instead of text-embedding-3-small
- If latency creeps over 2s → add caching + prioritize recent tickets
- If cost explodes → reduce rerank frequency (only rerank top 10, not 500)
- If client grows to 1M tickets → migrate to Qdrant (same pgvector code, different backend)
- Client sees exactly why you chose this — they can debate it before code starts
- If something breaks, you have a clear roadmap
- Their engineer can extend it because they understand the reasoning
Add Observability from Day 1
❌ Vibe Approach AVOID
"We'll add monitoring in phase 2 if there are issues." → (There are issues. It's too late to instrument. Client has no data.)
✅ Senior Approach DO THIS
Instrument before the first API call. Show cost + quality + latency from day 1.
langsmith>=0.1.0 anthropic>=0.25.0 psycopg[binary]>=3.1.0 pgvector>=0.2.0
import json
import time
from datetime import datetime
from typing import Any
import sqlite3
class CostTracker:
"""Track cost, latency, and quality metrics for each search."""
def __init__(self, db_path: str = "search_metrics.db"):
self.db_path = db_path
self._init_db()
def _init_db(self):
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS search_metrics (
id INTEGER PRIMARY KEY,
timestamp TEXT NOT NULL,
query TEXT,
sql_latency_ms REAL,
vector_latency_ms REAL,
rerank_latency_ms REAL,
total_latency_ms REAL,
input_tokens INTEGER,
output_tokens INTEGER,
estimated_cost_cents REAL,
result_count INTEGER,
user_feedback_relevant BOOLEAN,
user_feedback_rating INTEGER
)
""")
def log_search(self, **kwargs):
"""Log a single search operation."""
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
INSERT INTO search_metrics
(timestamp, query, sql_latency_ms, vector_latency_ms,
rerank_latency_ms, total_latency_ms, input_tokens,
output_tokens, estimated_cost_cents, result_count)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
datetime.utcnow().isoformat(),
kwargs.get('query'),
kwargs.get('sql_latency_ms'),
kwargs.get('vector_latency_ms'),
kwargs.get('rerank_latency_ms'),
kwargs.get('total_latency_ms'),
kwargs.get('input_tokens', 0),
kwargs.get('output_tokens', 0),
self._estimate_cost(kwargs.get('input_tokens', 0),
kwargs.get('output_tokens', 0)),
kwargs.get('result_count', 0)
))
@staticmethod
def _estimate_cost(input_tokens: int, output_tokens: int) -> float:
"""Estimate cost in cents for Claude 3.5 Sonnet."""
# Input: $3/1M tokens → 0.0003 cents per token
# Output: $15/1M tokens → 0.0015 cents per token
input_cost = input_tokens * 0.0003
output_cost = output_tokens * 0.0015
return input_cost + output_cost
def get_daily_summary(self, days: int = 7) -> dict:
"""Get cost, latency, and quality metrics for the last N days."""
with sqlite3.connect(self.db_path) as conn:
conn.row_factory = sqlite3.Row
rows = conn.execute("""
SELECT
DATE(timestamp) as date,
COUNT(*) as search_count,
ROUND(AVG(total_latency_ms), 1) as avg_latency_ms,
ROUND(MAX(total_latency_ms), 1) as p99_latency_ms,
ROUND(SUM(estimated_cost_cents) / 100, 2) as cost_usd,
ROUND(AVG(CAST(user_feedback_relevant AS FLOAT)), 2) as quality_relevant_ratio
FROM search_metrics
WHERE timestamp > datetime('now', '-' || ? || ' days')
GROUP BY DATE(timestamp)
ORDER BY date DESC
""", (days,))
return [dict(row) for row in rows]
from anthropic import Anthropic
from observability import CostTracker
import time
import json
class HybridSearchService:
def __init__(self, db_conn, tracker: CostTracker):
self.db = db_conn
self.tracker = tracker
self.client = Anthropic()
def search(self, query: str, customer_id: str, limit: int = 10) -> list[dict]:
"""Execute hybrid search with full observability."""
total_start = time.time()
# 1. SQL full-text search
sql_start = time.time()
sql_results = self._sql_search(query, customer_id, limit * 2)
sql_latency = (time.time() - sql_start) * 1000
# 2. Vector semantic search
vector_start = time.time()
vector_results = self._vector_search(query, customer_id, limit * 2)
vector_latency = (time.time() - vector_start) * 1000
# 3. Combine + rerank using Claude
combined = self._deduplicate_results(sql_results, vector_results)
rerank_start = time.time()
ranked = self._rerank_with_claude(query, combined, limit)
rerank_latency = (time.time() - rerank_start) * 1000
total_latency = (time.time() - total_start) * 1000
# 4. Log everything
self.tracker.log_search(
query=query,
sql_latency_ms=sql_latency,
vector_latency_ms=vector_latency,
rerank_latency_ms=rerank_latency,
total_latency_ms=total_latency,
input_tokens=150, # approximate; replace with response.usage.input_tokens
output_tokens=50,
result_count=len(ranked)
)
return ranked
def _rerank_with_claude(self, query: str, candidates: list, limit: int) -> list:
"""Use Claude to rerank candidates."""
prompt = f"""Given this support ticket search query:
"{query}"
And these candidate tickets, rank them by relevance (most relevant first):
{json.dumps(candidates, indent=2)}
Return a JSON array of the top {limit} tickets, ordered by relevance."""
response = self.client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1000,
messages=[{"role": "user", "content": prompt}]
)
try:
return json.loads(response.content[0].text)
except json.JSONDecodeError:
return candidates[:limit]
def _sql_search(self, query: str, customer_id: str, limit: int) -> list:
cur = self.db.cursor()
cur.execute("""
SELECT id, customer_id, content, created_at
FROM tickets
WHERE customer_id = %s
AND to_tsvector('english', content) @@ plainto_tsquery('english', %s)
LIMIT %s
""", (customer_id, query, limit))
return [dict(row) for row in cur.fetchall()]
def _vector_search(self, query: str, customer_id: str, limit: int) -> list:
cur = self.db.cursor()
cur.execute("""
SELECT id, customer_id, content, created_at,
1 - (embedding <=> %s::vector) as similarity
FROM tickets
WHERE customer_id = %s
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (query, customer_id, query, limit))
return [dict(row) for row in cur.fetchall()]
def _deduplicate_results(self, sql_results, vector_results) -> list:
seen, combined = set(), []
for r in sql_results + vector_results:
if r['id'] not in seen:
seen.add(r['id'])
combined.append(r)
return combined
API Dashboard: Support Ticket Search Last 7 days: 📊 PERFORMANCE Avg Latency: 1.65s (target: <2s) ✅ P99 Latency: 1.98s Search Count: 412 queries 💰 COST Day 1: $0.34 Day 2: $0.41 Day 3: $0.38 Week Total: $2.65 (annualized: ~$138) 📈 QUALITY Relevant (user rated): 78% ✅ Not Relevant: 14% Skipped Rating: 8% 🔴 ALERTS None this week!
- Client sees exactly what they're paying for ($0.34/day, not "thousands/month")
- You catch latency creep before it hits SLA
- When results are bad, you have data: is it the rerank step or the retrieval?
- Client's engineer can monitor in production without calling you
Write 3 Edge Case Tests
❌ Vibe Approach AVOID
"I tested it locally and it works. We're shipping." → (Crashes in production on edge cases)
✅ Senior Approach DO THIS
Identify the 3 failure modes most likely to bite you in production. Write tests that catch them.
For this project, the 3 edge cases are:
- Empty/null results: Customer has no tickets. Search returns 0 results. UI crashes.
- Timeout on rerank: Claude API is slow (>5s). Total latency exceeds SLA. Request hangs.
- Duplicate tickets in results: SQL and vector search return same ticket twice. UI shows dups.
import pytest
import sqlite3
import time
from unittest.mock import patch, MagicMock
from search_service import HybridSearchService
from observability import CostTracker
class TestHybridSearchEdgeCases:
@pytest.fixture
def db_conn(self):
"""Create in-memory DB with test data."""
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("""
CREATE TABLE tickets (
id INTEGER PRIMARY KEY,
customer_id TEXT,
content TEXT,
created_at TEXT,
embedding VECTOR(1536)
)
""")
conn.execute("""
INSERT INTO tickets (customer_id, content, created_at)
VALUES ('cust_123', 'User cannot login to account', '2024-04-10')
""")
conn.commit()
return conn
def test_empty_result_set_no_tickets_for_customer(self, db_conn):
"""EDGE CASE 1: Customer has no tickets.
Should return empty list, NOT crash."""
tracker = CostTracker()
service = HybridSearchService(db_conn, tracker)
results = service.search(
query="login issue",
customer_id="cust_999" # No tickets for this customer
)
assert results == []
assert isinstance(results, list)
def test_timeout_on_rerank_exceeds_latency_sla(self, db_conn):
"""EDGE CASE 2: Rerank step times out (>5s).
Should degrade gracefully, NOT exceed 2s SLA."""
tracker = CostTracker()
service = HybridSearchService(db_conn, tracker)
with patch.object(service.client.messages, 'create') as mock_claude:
def slow_api_call(*args, **kwargs):
time.sleep(0.1)
raise TimeoutError("API took too long")
mock_claude.side_effect = slow_api_call
start = time.time()
results = service.search(query="login", customer_id="cust_123")
elapsed = (time.time() - start) * 1000
assert len(results) > 0 # Still return something
assert elapsed < 2000 # Stayed under SLA
def test_duplicate_tickets_in_combined_results(self, db_conn):
"""EDGE CASE 3: SQL and vector search both return same ticket.
Should deduplicate, NOT show ticket twice."""
tracker = CostTracker()
service = HybridSearchService(db_conn, tracker)
ticket = {
'id': 1,
'customer_id': 'cust_123',
'content': 'User cannot login',
'created_at': '2024-04-10'
}
with patch.object(service, '_sql_search', return_value=[ticket]):
with patch.object(service, '_vector_search', return_value=[ticket]):
with patch.object(service, '_rerank_with_claude', return_value=[ticket]):
results = service.search(
query="login", customer_id="cust_123", limit=10
)
ids = [r['id'] for r in results]
assert len(ids) == len(set(ids)), "Duplicate IDs found"
assert ids.count(1) == 1
def test_malformed_rerank_response_json(self, db_conn):
"""EDGE CASE 4 (bonus): Claude returns invalid JSON.
Should fallback to original results."""
tracker = CostTracker()
service = HybridSearchService(db_conn, tracker)
with patch.object(service.client.messages, 'create') as mock_claude:
mock_response = MagicMock()
mock_response.content = [MagicMock(text="Not valid JSON {{{")]
mock_claude.return_value = mock_response
candidates = [
{'id': 1, 'content': 'ticket 1'},
{'id': 2, 'content': 'ticket 2'}
]
results = service._rerank_with_claude(
query="test", candidates=candidates, limit=2
)
assert len(results) <= 2
assert results[0]['id'] in [1, 2]
def test_cost_tracking_under_high_load(self, db_conn):
"""EDGE CASE 5 (bonus): 100 searches in a row.
Cost tracking should not slow down searches."""
tracker = CostTracker()
service = HybridSearchService(db_conn, tracker)
times = []
for i in range(100):
start = time.time()
service.search(query=f"query {i}", customer_id="cust_123")
times.append((time.time() - start) * 1000)
avg_latency = sum(times) / len(times)
assert avg_latency < 300, f"Avg latency {avg_latency}ms is too high"
$ pytest tests/test_search_edge_cases.py -v tests/...::test_empty_result_set_no_tickets_for_customer PASSED tests/...::test_timeout_on_rerank_exceeds_latency_sla PASSED tests/...::test_duplicate_tickets_in_combined_results PASSED tests/...::test_malformed_rerank_response_json PASSED tests/...::test_cost_tracking_under_high_load PASSED ====== 5 passed in 2.34s ======
- Happy path works → you still look bad when an edge case breaks in week 2
- Client's engineer runs tests, knows what's safe to change
- When something breaks later, they add a test; regression stays fixed
- You have proof you thought about failure modes
Document "Why Not" Decisions
❌ Vibe Approach AVOID
"I chose X because it's the best." (No context on what you considered or why alternatives lost.)
✅ Senior Approach DO THIS
Document every alternative seriously considered. Make the tradeoff matrix visible to the client.
# Architecture Decisions
## Decision 1: PostgreSQL + pgvector, NOT Neo4j or Pinecone
### Option A: Neo4j (Knowledge Graph)
Pros: Excellent for "related tickets" queries, strong for cross-customer insights
Cons: Overkill for single-tenant search; requires Cypher (client doesn't know it); 5–10s latency on large datasets
Why Not: "The client's problem is 'find the right past ticket for this one customer, fast.'
Graph queries are powerful, but we'd pay in complexity and latency for features they don't need."
### Option B: Pinecone Vector-Only
Pros: Managed service, fastest single-query latency (~500ms), battle-tested
Cons: $1,500–2,000/month at their scale; proprietary; junior engineer can't debug failures
Why Not: "Cost explodes budget 20x. Also junior engineer can't debug why results are bad —
Pinecone is a black box. We need transparency for maintenance."
### Option C: Elasticsearch
Pros: Open source, excellent hybrid search, widely known
Cons: Requires separate infrastructure; more complex than PostgreSQL; overkill for 50k tickets
Why Not: "PostgreSQL + pgvector gets 95% of value with 50% less complexity.
Client has existing Postgres. No new infra to manage."
### ✅ Final Choice: PostgreSQL + pgvector
1. Data already in Postgres → no migration
2. Full-text search is fast for exact matches
3. Vector search is fast for semantic matches
4. Junior engineer understands SQL
5. Cost = hosting only (~$50/month)
6. Transparent failure modes
7. Upgrade path: Qdrant if they outgrow it (same pgvector code)
---
## Decision 2: Claude 3.5 Sonnet, NOT GPT-4
### Option A: GPT-4
Pros: Slightly better semantic understanding
Cons: 10x cost; for a reranking task the extra quality buys almost nothing; would triple monthly budget
Why Not: "Reranking is binary ('is this relevant?'). GPT-4 and Sonnet are both 99% accurate.
GPT-4 costs 10x more. Spend that on better retrieval instead."
### Option B: Open Source (Llama 2, Mixtral)
Pros: Free (no per-token cost); runs on their server; complete control
Cons: Requires GPU infrastructure ($500–1000/month); slower inference (3–5s); hard to maintain
Why Not: "Infrastructure + maintenance + latency hit cancels the 'free' savings.
Managed API is cheaper and faster."
### ✅ Final Choice: Claude 3.5 Sonnet
1. Cost-optimal for this task (reranking, not generation)
2. Fast (<1s)
3. Client pays per-use (no waste if searches drop)
4. Anthropic API is reliable
---
## Decision 3: SQLite for Metrics, NOT Datadog or CloudWatch
### Option A: Datadog
Pros: Enterprise monitoring, dashboards, alerting
Cons: Minimum $25/month + $0.25/GB; overkill for <100 searches/day
Why Not: "Client doesn't need enterprise monitoring for an MVP. SQLite is free and queryable."
### Option B: CloudWatch (AWS)
Pros: Integrated if on AWS, scales easily
Cons: Client is on Azure — adds vendor complexity
Why Not: "Client's stack is Azure. Keep it simple: one vendor."
### ✅ Final Choice: SQLite + simple Python dashboard
1. Zero cost
2. Client's engineer can query it directly: `SELECT * FROM metrics`
3. Easy to migrate: `INSERT INTO datadog_metrics SELECT * FROM sqlite_metrics`
4. Metrics live with the code (version controlled together)
---
## When to Revisit These Decisions
- If search volume exceeds 1,000 queries/day → Migrate pgvector to Qdrant or Weaviate
- If monthly cost exceeds $150 → Evaluate open-source LLM (self-hosted)
- If latency hits 2.5s → Add caching, prioritize recent tickets, or switch embedding model
- If quality drops below 70% → Collect more training data, or use GPT-4 for rerank
- 6 months later, client's engineer asks "Why not use Neo4j?" — you have the answer
- If a "better" tool comes out next month, you have a decision framework to evaluate it
- You look thoughtful, not impulsive ("I picked this because it's trendy")
- When they grow and need to optimize, they know what knobs to turn
🎯 The 4 Moves Checklist
- ✅ ADR-001: Document the main choice (hybrid SQL + vector), alternatives rejected, and tradeoffs
- ✅ Observability: Cost tracker (daily spend), latency dashboard, quality metric (% relevant results)
- ✅ Edge Case Tests: Empty results, timeout degradation, duplicate deduplication, malformed JSON fallback
- ✅ Why Not Decisions: Why PostgreSQL (not Neo4j), why Sonnet (not GPT-4), why SQLite (not Datadog)
Database schemas, API route handlers, test skeletons, config files. AI excels here.
- Example: "Generate a SQLAlchemy ORM model for a tickets table with id, customer_id, content, created_at"
- What AI does right: Correct syntax, standard patterns, no mistakes
- Your job: Verify it matches your schema, tweak indexes/constraints
- Time saved: 20 mins → 2 mins
Search pipelines, data transformations, multi-step flows. AI needs direction.
- Example: "Write a Python function that combines SQL + vector search results, deduplicates by ticket id, and reranks using Claude"
- What AI does right: Structure, skeleton, most of the code
- Where it fails: Edge cases (empty results, timeouts), error handling, cost tracking
- Your job: Add edge case handling, integrate observability, verify latency
- Time saved: 2 hours → 30 mins (you fix the important parts)
System design, error recovery, performance optimization. AI generates syntax; you drive strategy.
- Example: "Should we add caching to the retrieval pipeline?"
- What AI does wrong: Suggests cache without measuring latency, assumes all loads are equal
- Your job: Measure first (where's the bottleneck?), then propose solution
- What AI CAN do: Generate the cache implementation once you've decided to add it
🔄 Pattern 1: Structured Prompt for Code Generation
Bad prompt: "Write a search function"
Good prompt:
I'm building a support ticket search service. The stack is: - Python 3.11 + Flask - PostgreSQL with pgvector extension - Claude 3.5 Sonnet API for reranking Write a Python function called `hybrid_search` that: 1. Takes query (str), customer_id (str), limit (int=10) 2. Runs SQL full-text search on tickets.content 3. Runs pgvector semantic search (same query embedding) 4. Combines & deduplicates results (by ticket.id) 5. Calls Claude with prompt: "Rank these tickets by relevance" 6. Returns top N results as dicts with id, content, customer_id, similarity_score Requirements: - Max latency: 2 seconds (timeout rerank at 1.5s) - Handle empty result sets (return []) - Log tokens + latency to a metrics dict for observability - Use the existing db connection (psycopg connection object) Include error handling for: - Database connection errors - Claude API timeouts - Malformed JSON responses from Claude Also include docstring with example usage.
🔍 Pattern 2: Iterative Refinement (Don't Trust First Output)
Step 1: Get the initial code from AI
Step 2: Run it. Find issues.
Step 3: Give AI the error + constraint. Ask for fix.
You: "Generate hybrid_search function with the constraints above" AI: ✓ Returns code You: "When query returns 0 results, it crashes on rerank. Fix the empty result case." AI: ✓ Adds guard clause You: "The rerank step is timing out at 1.8s. Add timeout handling that falls back to original ranking." AI: ✓ Adds timeout + fallback You: "Now track tokens + latency in a metrics dict. Log before returning." AI: ✓ Adds observability
🧪 Pattern 3: AI + Test-Driven Development
- You write the test first: "Write a pytest test that verifies empty result set returns []"
- AI generates the code to pass the test
- You write another test (edge case): "Write a pytest test that verifies timeout on rerank falls back to original order"
- AI generates the implementation
Why this works: Tests define the contract. AI fills in the implementation. You control behavior.
def hybrid_search(query: str, customer_id: str, limit: int = 10):
# SQL full-text search
sql_results = db.execute("SELECT ... WHERE to_tsvector ...")
# Vector search
vec_results = db.execute("SELECT ... WHERE embedding <=> ...")
# Combine
combined = deduplicate(sql_results + vec_results)
# Rerank
ranked = rerank_with_claude(query, combined)
return ranked[:limit]
❌ No error handling | ❌ No empty result check | ❌ No timeout handling | ❌ No observability | ❌ No latency measurement
# Prompt to AI:
"Add:
1. Timeout handling for rerank step (>1.5s → return SQL+vector unranked, log timeout)
2. Empty result handling (if combined has 0 results, return [] immediately)
3. Metrics tracking (sql_latency_ms, vector_latency_ms, rerank_latency_ms, tokens)
4. Error handling for DB errors, Claude API errors, JSON parse errors
5. Docstring with example usage
Rewrite the full function with all of these."
# AI Output (now production-grade):
def hybrid_search(query: str, customer_id: str, limit: int = 10):
metrics = {
'sql_latency_ms': 0, 'vector_latency_ms': 0,
'rerank_latency_ms': 0, 'total_latency_ms': 0,
'input_tokens': 0, 'output_tokens': 0
}
start = time.time()
try:
sql_start = time.time()
sql_results = db.execute(...)
metrics['sql_latency_ms'] = (time.time() - sql_start) * 1000
vec_start = time.time()
vec_results = db.execute(...)
metrics['vector_latency_ms'] = (time.time() - vec_start) * 1000
combined = deduplicate(sql_results + vec_results)
if not combined:
metrics['total_latency_ms'] = (time.time() - start) * 1000
return [], metrics
rerank_start = time.time()
try:
ranked = rerank_with_claude(query, combined, timeout_seconds=1.5)
metrics['rerank_latency_ms'] = (time.time() - rerank_start) * 1000
except TimeoutError:
logger.warning(f"Rerank timeout for query={query}")
ranked = combined
metrics['rerank_latency_ms'] = 1500
metrics['total_latency_ms'] = (time.time() - start) * 1000
return ranked[:limit], metrics
except Exception as e:
logger.error(f"hybrid_search failed: {e}")
metrics['total_latency_ms'] = (time.time() - start) * 1000
return combined[:limit], metrics
def test_empty_result_set():
results, metrics = hybrid_search("xyz", "unknown_customer")
assert results == []
assert metrics['total_latency_ms'] < 100 # Should be fast
def test_rerank_timeout_fallback():
with patch('client.messages.create') as mock:
mock.side_effect = TimeoutError()
results, metrics = hybrid_search("query", "cust_123")
assert len(results) > 0 # Still returns unranked
assert metrics['total_latency_ms'] < 2000 # Under SLA
def test_claude_api_error_fallback():
with patch('client.messages.create') as mock:
mock.return_value.content = [MagicMock(text="garbage")]
results, metrics = hybrid_search("query", "cust_123")
assert len(results) > 0 # Still returns unranked
✓ What to Check After AI Generates Code
- Error handling: Does it crash on empty results, timeouts, API errors?
- Observability: Can you measure latency, cost, tokens used?
- Edge cases: What happens if query is empty? N=0? Very large N?
- Latency: Does it exceed SLA? Any N+1 queries?
- Cost: How many API calls per request? Can it be optimized?
- Testability: Can you unit test it? Is it overfit to specific data?
- Maintainability: Can a junior engineer understand and modify it?
- Documentation: Docstring + examples? Known limitations?
9:00 AM — Sketch the function
Whiteboard or docstring: "We need hybrid_search. Takes query, returns ranked results. Must be <2s."
9:15 AM — Write tests first
3–5 pytest tests for happy path + edge cases (empty, timeout, error).
9:30 AM — Detailed prompt to AI
Full context: stack, constraints, requirements, test cases. Ask for code to pass tests.
9:35 AM — AI generates code
Paste into IDE. Run tests.
9:40 AM — Review + identify gaps
Check error handling, observability, latency. Find 3–5 issues. Give specific prompts to fix.
9:50 AM — AI refines
"Add timeout handling with fallback," "Add metrics tracking," "Handle JSON errors."
10:05 AM — Tests pass ✓
All tests green. Code is production-ready. Docstring + examples included.
10:10 AM — Benchmark
Run locally with real data. Verify <2s latency, cost/query, quality metrics.
10:30 AM — Move to next component
This function took 1.5 hours and is bulletproof. Scale to the full system.
❌ Vibe Approach to AI Code
"Generate a full search service" → copy-paste entire output → ship
- No error handling (crashes on edge cases)
- No observability (no visibility into cost/latency)
- No tests (client finds bugs, not you)
- Code is opaque (you can't fix it)
✅ Senior Approach to AI Code
Use AI to 10x boilerplate; hand-build critical paths
- AI: Schema, routes, test skeletons
- You: Business logic, edge cases, observability
- AI: Refinements based on test failures
- You: Architecture, monitoring, handoff docs
Week 1: Discovery + Architecture
- Days 1–2: Client discovery calls. Map problem → constraints → metrics.
- Day 2 PM: Generate 2–3 solution options. Map to constraints.
- Day 3: Client picks architecture. Write ADR + tool selection justification.
- Days 4–5: Setup infrastructure (DB schema, repo structure, dev environment).
Week 2: Core Implementation + Tests
- Days 1–2: Write tests (TDD). Build retrieval pipeline (SQL + vector).
- Days 3–4: Build reranking + observability. Integrate Claude API.
- Day 5: All tests pass. Verify latency <2s, cost <$0.01/query.
Week 3: Hardening + Handoff
- Days 1–2: Edge case tests. Error handling. Load testing.
- Days 3–4: Demo + metrics dashboard. Document ADR + DECISIONS + runbooks.
- Day 5: Deploy to staging. Client trains their team. Handoff.
↓
Architecture Options → Chosen Architecture → Tool Selection
↓
Tech Stack + Timeline → Scope Lock-In
↓
(NOW: Code Phase)
↓
Tests (TDD) → Function Skeleton (AI) → Refinement (You + AI loop)
↓
Production Code (error handling, observability, docs)
↓
Demo → Handoff (ADRs, tests, runbooks, dashboards)
Key Insight
Discovery defines success. Architecture limits tool choice. Tests define the contract. AI fills in the syntax. You ensure it's production-ready.
🚀 Handoff Package Structure
ticket-search-service/ ├── README.md # Quick start guide ├── DECISIONS.md # Why you chose what you chose ← THEY READ THIS ├── ADR-001-hybrid-search.md # The architecture decision record ├── src/ │ ├── search_service.py │ ├── observability.py # Cost tracker (transparent) │ └── main.py ├── tests/ │ └── test_search_edge_cases.py # Edge case tests ← THEY RUN THESE ├── requirements.txt ├── docker-compose.yml # Local dev environment └── metrics.db # Their costs, latency, quality data
When they ask "Can you add X to the search?" they can:
- Read
DECISIONS.mdto understand what you tried - Look at edge case tests to see what could break
- Check
metrics.dbto see if their change improved cost/latency/quality - Run tests before pushing to prod
✓ Final Checklist: AI-Assisted Code (Production Ready)
- Tests pass: All edge cases, happy path, error cases
- Latency verified: Measured with real data, under SLA
- Observability baked in: Metrics dict, logging, tokens tracked
- Error handling: Every exception caught + fallback defined
- Documented: Docstring + usage example + known limitations
- Code reviewed: You understand every line; can explain to client
- Maintainable: Junior engineer can modify + extend without breaking
- ADR written: Architecture decision + alternatives documented
- "Why not" documented: Every rejected option has a reason
ticket-search-service/ │ ├── README.md # Quick-start: install, run, test, deploy ├── DECISIONS.md # Living "why not" decision log ├── CHANGELOG.md # Version history and notable changes ├── .env.example # All required env vars (no secrets) ├── .gitignore ├── Makefile # Shortcuts: make test, make run, make deploy │ ├── docs/ │ ├── architecture.md # System diagram + data-flow narrative │ ├── runbook.md # How to operate this in production │ ├── onboarding.md # Guide for client's junior engineer │ │ │ ├── adr/ # Architecture Decision Records │ │ ├── ADR-001-hybrid-search.md # SQL + pgvector vs pure vector │ │ ├── ADR-002-llm-provider.md # Claude Sonnet vs GPT-4 vs OSS │ │ ├── ADR-003-metrics-storage.md # SQLite vs Datadog vs CloudWatch │ │ ├── ADR-004-deployment-target.md # Railway vs Azure vs Kubernetes │ │ └── ADR-005-embedding-model.md # text-embedding-3-small vs voyage-3 │ │ │ └── diagrams/ │ ├── system-overview.mmd # Mermaid: top-level architecture │ ├── search-pipeline.mmd # Mermaid: retrieval + rerank flow │ └── data-flow.mmd # Mermaid: ingest → index → query │ ├── src/ │ ├── __init__.py │ ├── main.py # App entrypoint (Flask/FastAPI) │ ├── config.py # Settings loaded from env vars │ │ │ ├── search/ │ │ ├── __init__.py │ │ ├── hybrid_search.py # Core: SQL + vector + rerank │ │ ├── sql_retriever.py # PostgreSQL full-text search │ │ ├── vector_retriever.py # pgvector semantic search │ │ ├── reranker.py # Claude rerank call + fallback │ │ └── deduplicator.py # Merge + deduplicate result sets │ │ │ ├── ingestion/ │ │ ├── __init__.py │ │ ├── embedder.py # Batch embed tickets → pgvector │ │ ├── indexer.py # Upsert tickets into search index │ │ └── preprocessor.py # Clean + chunk raw ticket text │ │ │ ├── observability/ │ │ ├── __init__.py │ │ ├── cost_tracker.py # Log tokens, cost, latency per query │ │ ├── metrics_db.py # SQLite schema + query helpers │ │ └── dashboard.py # CLI summary: cost/day, p99 latency │ │ │ ├── api/ │ │ ├── __init__.py │ │ ├── routes.py # REST endpoints: /search, /health │ │ ├── schemas.py # Pydantic request/response models │ │ └── middleware.py # Auth, rate limiting, error handlers │ │ │ └── db/ │ ├── __init__.py │ ├── connection.py # PostgreSQL connection pool │ ├── migrations/ │ │ ├── 001_create_tickets.sql │ │ ├── 002_add_pgvector.sql │ │ └── 003_add_indexes.sql │ └── seed_data.sql # Dev/test sample tickets │ ├── tests/ │ ├── conftest.py # Shared fixtures (in-memory DB, mock Claude) │ │ │ ├── unit/ │ │ ├── test_hybrid_search.py # Happy path + edge cases │ │ ├── test_sql_retriever.py # Full-text search correctness │ │ ├── test_vector_retriever.py # Semantic search correctness │ │ ├── test_reranker.py # Timeout fallback, JSON parse error │ │ ├── test_deduplicator.py # Duplicate ID removal │ │ └── test_cost_tracker.py # Token math, latency logging │ │ │ ├── integration/ │ │ ├── test_search_pipeline.py # Full SQL→vector→rerank flow │ │ ├── test_ingestion.py # Embed + index a batch of tickets │ │ └── test_api_routes.py # HTTP endpoint smoke tests │ │ │ ├── edge_cases/ │ │ ├── test_empty_results.py # Customer with zero tickets │ │ ├── test_timeout_handling.py # Claude API slow/down │ │ ├── test_duplicate_tickets.py# Same ticket from SQL + vector │ │ └── test_malformed_json.py # Claude returns bad JSON │ │ │ └── performance/ │ ├── test_latency_sla.py # Assert p99 < 2000ms under load │ └── test_cost_per_query.py # Assert cost < $0.01 per search │ ├── scripts/ │ ├── embed_all_tickets.py # One-time: embed historical tickets │ ├── backfill_embeddings.py # Incremental: embed new tickets │ ├── export_metrics.py # Dump SQLite metrics to CSV/JSON │ └── check_health.py # Ping /health, verify DB + Claude API │ ├── infra/ │ ├── docker-compose.yml # Local dev: Postgres + pgvector + app │ ├── Dockerfile # Production container image │ ├── railway.toml # Railway deployment config (MVP) │ └── azure/ │ ├── container-app.bicep # Azure Container Apps (enterprise) │ └── keyvault-secrets.bicep # Azure Key Vault for secrets │ ├── .github/ │ └── workflows/ │ ├── ci.yml # On PR: run tests, lint, type-check │ └── deploy.yml # On merge to main: deploy to staging │ ├── requirements.txt # Production dependencies ├── requirements-dev.txt # Dev + test dependencies └── pyproject.toml # Black, mypy, pytest config
Key Principle: Every File Has a Job
docs/ → Everything a human needs to understand or operate the system. Client reads this.
src/ → Modular by domain (search, ingestion, observability, api, db). Junior engineer edits one folder without touching others.
tests/ → Separated by type. Unit tests run in milliseconds. Integration tests need DB. Edge cases document known failure modes. Performance tests enforce SLAs.
scripts/ → Operational one-offs. Runbook references these by name.
infra/ → Everything needed to deploy. Railway for MVP, Azure for enterprise.
ADR Naming Convention
ADR-001 through ADR-00N — one file per major architectural decision. Each ADR covers: Decision, Status, Context, Options Considered, Chosen Option, Tradeoffs, and Revisit Triggers. The numbering is permanent — never delete or renumber, only add Status: SUPERSEDED by ADR-00X.