Senior AI Consultant · Complete Playbook

Discovery → Architecture → Production

From "client explains problem" to deployed, maintainable system — every move, every decision, every line of thinking.

Part I: Discovery Part II: 4 Moves in Action Part III: AI-Assisted Code

📋 The Scenario (Used Throughout This Playbook)

Client: Mid-market B2B SaaS company (CRM platform, ~500 users)

Problem: Customer support team is drowning in tickets. They need to surface relevant past customer conversations faster.

Stated Solution: "Build us an AI chatbot that can search our ticket history"

Real Problem (discovered via discovery call): Support reps spend 10 minutes per ticket hunting for context. Customers wait. Quality suffers.

Timeline: MVP in 3 weeks. 1 engineer on their team (must be able to maintain it).

💰 Budget Constraint: $50–100/month in AI costs (no sky-high token bills)

⏱️ Latency Constraint: Must return results in <2 seconds (they're not waiting)

🔧 Team Constraint: One junior backend engineer. They know Python, not ML.

Part One

After the Client Explains the Problem: What's Next?

The 9-step sequence from vague pain to locked scope — before a single line of code.

You've heard the problem. They want "AI chatbot search" or "intelligent document classification." Now comes the work: translating vague pain into a specific solution. Here's the exact sequence.

Uncover the Real Problem (Not the Imagined Solution)

Clients pitch the solution they think they want. Your job: ask until you find the actual pain.

Client says: "Build us an AI chatbot."
You ask: "What's the chatbot solving? How many queries/day? What are they asking? How long do reps spend answering?"
Reality: 50 queries/day, rep spends 5 min per query hunting for context, customer wait time kills deals.

Map the Current State (Baseline)

Document how things work now. Measure cost, latency, quality, human effort. This is your benchmark.

Current state: Support reps spend 5 min/ticket → answer (no AI).
Cost: 50 queries/day × 5 min = 250 min/day of rep time = $1,250/day (at $30/hr).
Quality: 70% customer satisfaction (ambiguous answers).
Success metric: Cut time to 2 min/ticket, 85% satisfaction.

Scope the Data & Constraints

What data feeds the system? How much? How fresh? What are the hard constraints?

Data: 50k past support tickets (text + metadata like customer_id, resolved_date).
Freshness: New tickets added daily, but search only uses historical context (not real-time).
Constraints: <2s latency (user can't wait), $50–100/month budget (no GPT-4), junior engineer maintains.

Define Success Metrics (Quantified)

Not "good results." Specific, measurable outcomes the client cares about.

Metrics:
• Latency: <2s (p99)
• Quality: 80%+ relevant results in top 3
• Cost: <$75/month
• Adoption: 90%+ of reps use daily
• Maintenance: Junior engineer can add tickets without consultant help

Generate & Evaluate Solution Options

Don't jump to one answer. Show 2–3 architectures. Map each to constraints. Client picks knowing tradeoffs.

Option A (RAG-only): Pinecone + GPT-4 + LangChain
✅ Best accuracy | ❌ $200/month, 3–5s latency, black box

Option B (Hybrid SQL+Vector): PostgreSQL + pgvector + Sonnet
✅ $60/month, <2s, transparent | ⚠️ Slightly lower quality, needs setup

Option C (Full-text only): PostgreSQL full-text search
✅ Free, <0.5s latency | ❌ Fails on semantic mismatches

Pick the Framework & Architecture

Framework choice gates everything. Don't skip this. Write an ADR (see Step 3 of this playbook).

Chosen: Option B (Hybrid SQL + pgvector + Claude 3.5 Sonnet)
Why: Meets all constraints, client can maintain, transparent failure modes.
ADR: ADR-001-hybrid-search.md (in repo, explains tradeoffs)

Map Tools & Dependencies

Now that architecture is locked, choose specific tools. Use the matrix below.

Orchestration: None needed (simple retrieval, no multi-step agents)
LLM: Claude 3.5 Sonnet (cost-optimal for rerank task)
Vector DB: pgvector (built into PostgreSQL, no new infrastructure)
Deployment: Railway or Azure Container Apps
Observability: SQLite metrics + LangSmith (for token tracking)

Estimate Effort & Timeline

Break into components. Map to sprint. Be realistic about what you can do.

Week 1: Data pipeline + embedding + schema design (3 days), first retrieval working (2 days)
Week 2: Reranking + latency optimization, observability setup
Week 3: Testing, edge cases, deployment to staging
Handoff: Runbook, tests, ADRs, demo

Get Approval & Lock Scope

Show client: architecture diagram, tool choices, timeline, cost (your time + infrastructure). Lock scope.

Approval doc includes:
• Architecture diagram (boxes: Postgres, pgvector, Claude, retrieval pipeline)
• Why these tools (ADR)
• Timeline: 3 weeks to MVP
• Infrastructure cost: ~$50/month
• Consultant fee: $X
• What's in / out of scope (no agents, no chat history, v1 memory only)

✓ Checkpoint: Before You Code

Problem statement: "Current state: X → Desired state: Y" (quantified)
Success metrics: Latency, cost, quality, adoption (measurable)
Architecture: Diagram + ADR explaining choices
Tools: Specific tools + why (vs. alternatives)
Timeline: Realistic breakdown by week
Signed-off scope: Client agrees to what's in/out

↗

Tool Selection Framework

Picking the Right Tools for Every Constraint

Constraints from discovery drive tool choice — not trends, not "what's most popular."

Once you know the architecture, use this matrix to pick specific tools. The constraints from your discovery should drive the choice.

Category	Decision Point	If Constraint Is…	Choose…	Not This	Cost / Notes
🤖 LLM Provider	Cost vs. Quality	$50/month budget	Claude 3.5 Sonnet (rerank, not gen)	GPT-4 (10x cost)	$0.003 / 1K input tokens
	Cost vs. Quality	$500/month budget	GPT-4 Turbo or Claude Opus	Free tier models	Better quality for generation tasks
	Latency critical	<1s response time	Sonnet (fast) or local Llama (cached)	Opus (slower)	Sonnet ~400ms, Opus ~2s
🗄️ Vector DB	Operational burden	Junior engineer, no DevOps	pgvector (in Postgres) or Pinecone	Qdrant, Weaviate (self-hosted)	pgvector = free, Pinecone = managed
	Scale + cost	>100M vectors	Qdrant or Milvus (self-hosted)	Pinecone (goes 10x over budget)	Self-hosted = cheap at scale
	Existing DB	Data already in PostgreSQL	pgvector (no migration)	Separate vector DB (ETL overhead)	Same connection pool, simpler
⚙️ Framework	Multi-step orchestration?	Yes, complex agent flows	LangGraph or CrewAI	Simple retrieval (don't overcomplicate)	Graph = state management
	Multi-step orchestration?	No, simple retrieval + rerank	None (write plain Python)	LangChain (adds complexity)	Less abstraction = more control
	Team maintenance	Junior engineer, limited ML	No framework (plain Python + SQL)	LangChain abstraction (hides bugs)	Transparency > magic
📊 Embedding	Cost + quality	Semantic search (not generation)	text-embedding-3-small or voyage-3	text-embedding-3-large (2x cost)	Small = $0.02 / 1M tokens
	Privacy critical	No API calls allowed	all-minilm, bge-small (local)	OpenAI embeddings (API)	Run locally, ~50ms per query
🚀 Deployment	Speed to MVP	3-week sprint, simple app	Railway or Render	Kubernetes (too much setup)	Deploy with `git push`
	Enterprise, Azure ecosystem	Client uses Azure already	Azure Container Apps or App Service	AWS services (different ecosystem)	Same dashboard, billing, SSO
	Scale + cost	>1M requests/month	Kubernetes or managed Container Apps	Serverless (cost explodes on volume)	Containers = per-replica billing
📈 Observability	Startup phase	MVP, <100 searches/day	SQLite (free, transparent)	Datadog or New Relic (overkill)	$0, queryable, no vendor lock
	Scale phase	10k+ searches/day, LLM evals	LangSmith + custom monitoring	Separate tools (splinter visibility)	LangSmith = $30/mo for traces

🎯 Decision Algorithm

Start with constraints: latency, budget, scale, team skill, deployment env
Pick the most restrictive constraint: If budget is $50, eliminate expensive tools first
Within that bucket, optimize for maintainability: Junior engineer? Pick transparent tools
Benchmark against existing infrastructure: Using PostgreSQL? Use pgvector (no ETL)
Document the "why not" for each alternative: Creates your ADR

Part Two

The 4 Moves in Action

From vibe coding to real delivery — the four things that separate a senior consultant from someone "shipping demos."

1️⃣

Write an ADR Before Code

Make the tradeoff matrix explicit. Why this, not that?

❌ Vibe Approach AVOID

"Let's use LangChain + Pinecone + GPT-4. It's the most powerful setup. We'll build a RAG pipeline and the LLM will figure it out."

No tradeoff analysis. No mention of cost, latency, or maintainability. Optimized for "sounds impressive."

✅ Senior Approach DO THIS

Write an ADR that shows the decision, the constraints, alternatives considered + why they were rejected, tradeoffs, and how you'll measure success.

📄 ADR-001: Hybrid Search Architecture (SQL + Vector)

Status: ACCEPTED | Date: 2024-04-15

Decision: Use PostgreSQL (full-text search) + pgvector (semantic search) + Claude 3.5 Sonnet, NOT pure vector RAG with Pinecone/GPT-4.

Context & Constraints:

Latency SLA: <2 seconds end-to-end
Monthly budget: $50–100 in AI costs
Team: 1 junior backend engineer, not an ML specialist
Data: ~50k support tickets, mostly text
Maintenance: Client must understand and modify it 6 months from now

Considered Alternatives:

Option	Latency	Cost/mo	Maintainability	Why Not
Pure Vector (Pinecone + GPT-4)	~1.5s ✅	$150–250 ❌	Black box ❌	Over budget; client can't debug failures
Full-text only (PostgreSQL)	~0.2s ✅	$0 ✅	Simple ✅	Fails on semantic mismatches ("Can't login" vs "access denied")
Hybrid (SQL + pgvector) ← CHOSEN	~1.8s ✅	$60 ✅	Transparent ✅	—

Why This Works:

Cost: Sonnet 3.5 is 1/10th the price of GPT-4. One reranking call per search = ~$0.005 per query.
Latency: Parallel SQL + vector search, then rerank = stays under 2s.
Maintainability: Junior engineer can understand SQL + embeddings. No black-box prompt magic.
Failure modes clear: If results are bad, they know to add more training data or tweak the embedding model.

Tradeoffs:

Lose: Absolute state-of-the-art semantic understanding (GPT-4 is slightly better at nuance)

Gain: Cost predictability, <2s latency guarantee, debuggable system, trainable staff

How We'll Know This Works:

Support reps find the right ticket in top 3 results 80%+ of the time
Time-to-answer drops from 10 min → 2 min
Monthly token bill stays under $75
Client's engineer can add new tickets to index without our help

Path Forward If This Breaks:

If semantic quality is bad → try voyage-3 instead of text-embedding-3-small
If latency creeps over 2s → add caching + prioritize recent tickets
If cost explodes → reduce rerank frequency (only rerank top 10, not 500)
If client grows to 1M tickets → migrate to Qdrant (same pgvector code, different backend)

✅ What This ADR Does:

Client sees exactly why you chose this — they can debate it before code starts
If something breaks, you have a clear roadmap
Their engineer can extend it because they understand the reasoning

2️⃣

Add Observability from Day 1

LangSmith + cost tracker. Show the client what they're paying for.

❌ Vibe Approach AVOID

"We'll add monitoring in phase 2 if there are issues." → (There are issues. It's too late to instrument. Client has no data.)

✅ Senior Approach DO THIS

Instrument before the first API call. Show cost + quality + latency from day 1.

requirements.txt — add observability deps

langsmith>=0.1.0
anthropic>=0.25.0
psycopg[binary]>=3.1.0
pgvector>=0.2.0

observability.py — custom cost tracker

import json
import time
from datetime import datetime
from typing import Any
import sqlite3

class CostTracker:
    """Track cost, latency, and quality metrics for each search."""
    
    def __init__(self, db_path: str = "search_metrics.db"):
        self.db_path = db_path
        self._init_db()
    
    def _init_db(self):
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS search_metrics (
                    id INTEGER PRIMARY KEY,
                    timestamp TEXT NOT NULL,
                    query TEXT,
                    sql_latency_ms REAL,
                    vector_latency_ms REAL,
                    rerank_latency_ms REAL,
                    total_latency_ms REAL,
                    input_tokens INTEGER,
                    output_tokens INTEGER,
                    estimated_cost_cents REAL,
                    result_count INTEGER,
                    user_feedback_relevant BOOLEAN,
                    user_feedback_rating INTEGER
                )
            """)
    
    def log_search(self, **kwargs):
        """Log a single search operation."""
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                INSERT INTO search_metrics 
                (timestamp, query, sql_latency_ms, vector_latency_ms, 
                 rerank_latency_ms, total_latency_ms, input_tokens, 
                 output_tokens, estimated_cost_cents, result_count)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                datetime.utcnow().isoformat(),
                kwargs.get('query'),
                kwargs.get('sql_latency_ms'),
                kwargs.get('vector_latency_ms'),
                kwargs.get('rerank_latency_ms'),
                kwargs.get('total_latency_ms'),
                kwargs.get('input_tokens', 0),
                kwargs.get('output_tokens', 0),
                self._estimate_cost(kwargs.get('input_tokens', 0), 
                                    kwargs.get('output_tokens', 0)),
                kwargs.get('result_count', 0)
            ))
    
    @staticmethod
    def _estimate_cost(input_tokens: int, output_tokens: int) -> float:
        """Estimate cost in cents for Claude 3.5 Sonnet."""
        # Input: $3/1M tokens  → 0.0003 cents per token
        # Output: $15/1M tokens → 0.0015 cents per token
        input_cost  = input_tokens  * 0.0003
        output_cost = output_tokens * 0.0015
        return input_cost + output_cost
    
    def get_daily_summary(self, days: int = 7) -> dict:
        """Get cost, latency, and quality metrics for the last N days."""
        with sqlite3.connect(self.db_path) as conn:
            conn.row_factory = sqlite3.Row
            rows = conn.execute("""
                SELECT 
                    DATE(timestamp) as date,
                    COUNT(*) as search_count,
                    ROUND(AVG(total_latency_ms), 1) as avg_latency_ms,
                    ROUND(MAX(total_latency_ms), 1) as p99_latency_ms,
                    ROUND(SUM(estimated_cost_cents) / 100, 2) as cost_usd,
                    ROUND(AVG(CAST(user_feedback_relevant AS FLOAT)), 2) as quality_relevant_ratio
                FROM search_metrics
                WHERE timestamp > datetime('now', '-' || ? || ' days')
                GROUP BY DATE(timestamp)
                ORDER BY date DESC
            """, (days,))
            return [dict(row) for row in rows]

search_service.py — instrument the search pipeline

from anthropic import Anthropic
from observability import CostTracker
import time
import json

class HybridSearchService:
    def __init__(self, db_conn, tracker: CostTracker):
        self.db = db_conn
        self.tracker = tracker
        self.client = Anthropic()
    
    def search(self, query: str, customer_id: str, limit: int = 10) -> list[dict]:
        """Execute hybrid search with full observability."""
        
        total_start = time.time()
        
        # 1. SQL full-text search
        sql_start = time.time()
        sql_results = self._sql_search(query, customer_id, limit * 2)
        sql_latency = (time.time() - sql_start) * 1000
        
        # 2. Vector semantic search
        vector_start = time.time()
        vector_results = self._vector_search(query, customer_id, limit * 2)
        vector_latency = (time.time() - vector_start) * 1000
        
        # 3. Combine + rerank using Claude
        combined = self._deduplicate_results(sql_results, vector_results)
        
        rerank_start = time.time()
        ranked = self._rerank_with_claude(query, combined, limit)
        rerank_latency = (time.time() - rerank_start) * 1000
        
        total_latency = (time.time() - total_start) * 1000
        
        # 4. Log everything
        self.tracker.log_search(
            query=query,
            sql_latency_ms=sql_latency,
            vector_latency_ms=vector_latency,
            rerank_latency_ms=rerank_latency,
            total_latency_ms=total_latency,
            input_tokens=150,   # approximate; replace with response.usage.input_tokens
            output_tokens=50,
            result_count=len(ranked)
        )
        
        return ranked
    
    def _rerank_with_claude(self, query: str, candidates: list, limit: int) -> list:
        """Use Claude to rerank candidates."""
        prompt = f"""Given this support ticket search query:
"{query}"

And these candidate tickets, rank them by relevance (most relevant first):
{json.dumps(candidates, indent=2)}

Return a JSON array of the top {limit} tickets, ordered by relevance."""
        
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}]
        )
        
        try:
            return json.loads(response.content[0].text)
        except json.JSONDecodeError:
            return candidates[:limit]
    
    def _sql_search(self, query: str, customer_id: str, limit: int) -> list:
        cur = self.db.cursor()
        cur.execute("""
            SELECT id, customer_id, content, created_at
            FROM tickets
            WHERE customer_id = %s
            AND to_tsvector('english', content) @@ plainto_tsquery('english', %s)
            LIMIT %s
        """, (customer_id, query, limit))
        return [dict(row) for row in cur.fetchall()]
    
    def _vector_search(self, query: str, customer_id: str, limit: int) -> list:
        cur = self.db.cursor()
        cur.execute("""
            SELECT id, customer_id, content, created_at,
                   1 - (embedding <=> %s::vector) as similarity
            FROM tickets
            WHERE customer_id = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query, customer_id, query, limit))
        return [dict(row) for row in cur.fetchall()]
    
    def _deduplicate_results(self, sql_results, vector_results) -> list:
        seen, combined = set(), []
        for r in sql_results + vector_results:
            if r['id'] not in seen:
                seen.add(r['id'])
                combined.append(r)
        return combined

What the client sees — daily dashboard output

API Dashboard: Support Ticket Search
Last 7 days:

📊 PERFORMANCE
  Avg Latency:     1.65s (target: <2s) ✅
  P99 Latency:     1.98s
  Search Count:    412 queries

💰 COST
  Day 1: $0.34
  Day 2: $0.41
  Day 3: $0.38
  Week Total: $2.65 (annualized: ~$138)

📈 QUALITY
  Relevant (user rated):  78% ✅
  Not Relevant:           14%
  Skipped Rating:          8%

🔴 ALERTS
  None this week!

Avg Latency

1.65s ✅

Week Cost

$2.65

Quality Score

78% ✅

✅ Why This Matters:

Client sees exactly what they're paying for ($0.34/day, not "thousands/month")
You catch latency creep before it hits SLA
When results are bad, you have data: is it the rerank step or the retrieval?
Client's engineer can monitor in production without calling you

3️⃣

Write 3 Edge Case Tests

Not happy-path tests. Break it before production does.

❌ Vibe Approach AVOID

"I tested it locally and it works. We're shipping." → (Crashes in production on edge cases)

✅ Senior Approach DO THIS

Identify the 3 failure modes most likely to bite you in production. Write tests that catch them.

For this project, the 3 edge cases are:

Empty/null results: Customer has no tickets. Search returns 0 results. UI crashes.
Timeout on rerank: Claude API is slow (>5s). Total latency exceeds SLA. Request hangs.
Duplicate tickets in results: SQL and vector search return same ticket twice. UI shows dups.

tests/test_search_edge_cases.py

import pytest
import sqlite3
import time
from unittest.mock import patch, MagicMock
from search_service import HybridSearchService
from observability import CostTracker

class TestHybridSearchEdgeCases:
    
    @pytest.fixture
    def db_conn(self):
        """Create in-memory DB with test data."""
        conn = sqlite3.connect(":memory:")
        conn.row_factory = sqlite3.Row
        conn.execute("""
            CREATE TABLE tickets (
                id INTEGER PRIMARY KEY,
                customer_id TEXT,
                content TEXT,
                created_at TEXT,
                embedding VECTOR(1536)
            )
        """)
        conn.execute("""
            INSERT INTO tickets (customer_id, content, created_at)
            VALUES ('cust_123', 'User cannot login to account', '2024-04-10')
        """)
        conn.commit()
        return conn
    
    def test_empty_result_set_no_tickets_for_customer(self, db_conn):
        """EDGE CASE 1: Customer has no tickets.
           Should return empty list, NOT crash."""
        tracker = CostTracker()
        service = HybridSearchService(db_conn, tracker)
        
        results = service.search(
            query="login issue",
            customer_id="cust_999"   # No tickets for this customer
        )
        
        assert results == []
        assert isinstance(results, list)
    
    def test_timeout_on_rerank_exceeds_latency_sla(self, db_conn):
        """EDGE CASE 2: Rerank step times out (>5s).
           Should degrade gracefully, NOT exceed 2s SLA."""
        tracker = CostTracker()
        service = HybridSearchService(db_conn, tracker)
        
        with patch.object(service.client.messages, 'create') as mock_claude:
            def slow_api_call(*args, **kwargs):
                time.sleep(0.1)
                raise TimeoutError("API took too long")
            mock_claude.side_effect = slow_api_call
            
            start = time.time()
            results = service.search(query="login", customer_id="cust_123")
            elapsed = (time.time() - start) * 1000
            
            assert len(results) > 0    # Still return something
            assert elapsed < 2000      # Stayed under SLA
    
    def test_duplicate_tickets_in_combined_results(self, db_conn):
        """EDGE CASE 3: SQL and vector search both return same ticket.
           Should deduplicate, NOT show ticket twice."""
        tracker = CostTracker()
        service = HybridSearchService(db_conn, tracker)
        
        ticket = {
            'id': 1,
            'customer_id': 'cust_123',
            'content': 'User cannot login',
            'created_at': '2024-04-10'
        }
        
        with patch.object(service, '_sql_search', return_value=[ticket]):
            with patch.object(service, '_vector_search', return_value=[ticket]):
                with patch.object(service, '_rerank_with_claude', return_value=[ticket]):
                    results = service.search(
                        query="login", customer_id="cust_123", limit=10
                    )
                    ids = [r['id'] for r in results]
                    assert len(ids) == len(set(ids)), "Duplicate IDs found"
                    assert ids.count(1) == 1
    
    def test_malformed_rerank_response_json(self, db_conn):
        """EDGE CASE 4 (bonus): Claude returns invalid JSON.
           Should fallback to original results."""
        tracker = CostTracker()
        service = HybridSearchService(db_conn, tracker)
        
        with patch.object(service.client.messages, 'create') as mock_claude:
            mock_response = MagicMock()
            mock_response.content = [MagicMock(text="Not valid JSON {{{")]
            mock_claude.return_value = mock_response
            
            candidates = [
                {'id': 1, 'content': 'ticket 1'},
                {'id': 2, 'content': 'ticket 2'}
            ]
            results = service._rerank_with_claude(
                query="test", candidates=candidates, limit=2
            )
            assert len(results) <= 2
            assert results[0]['id'] in [1, 2]
    
    def test_cost_tracking_under_high_load(self, db_conn):
        """EDGE CASE 5 (bonus): 100 searches in a row.
           Cost tracking should not slow down searches."""
        tracker = CostTracker()
        service = HybridSearchService(db_conn, tracker)
        
        times = []
        for i in range(100):
            start = time.time()
            service.search(query=f"query {i}", customer_id="cust_123")
            times.append((time.time() - start) * 1000)
        
        avg_latency = sum(times) / len(times)
        assert avg_latency < 300, f"Avg latency {avg_latency}ms is too high"

Run the tests

$ pytest tests/test_search_edge_cases.py -v

tests/...::test_empty_result_set_no_tickets_for_customer PASSED
tests/...::test_timeout_on_rerank_exceeds_latency_sla PASSED
tests/...::test_duplicate_tickets_in_combined_results PASSED
tests/...::test_malformed_rerank_response_json PASSED
tests/...::test_cost_tracking_under_high_load PASSED

====== 5 passed in 2.34s ======

✅ Why Edge Cases Matter:

Happy path works → you still look bad when an edge case breaks in week 2
Client's engineer runs tests, knows what's safe to change
When something breaks later, they add a test; regression stays fixed
You have proof you thought about failure modes

4️⃣

Document "Why Not" Decisions

Show the paths you explicitly rejected and why.

❌ Vibe Approach AVOID

"I chose X because it's the best." (No context on what you considered or why alternatives lost.)

✅ Senior Approach DO THIS

Document every alternative seriously considered. Make the tradeoff matrix visible to the client.

DECISIONS.md — living document in the repo

# Architecture Decisions

## Decision 1: PostgreSQL + pgvector, NOT Neo4j or Pinecone

### Option A: Neo4j (Knowledge Graph)
Pros: Excellent for "related tickets" queries, strong for cross-customer insights
Cons: Overkill for single-tenant search; requires Cypher (client doesn't know it); 5–10s latency on large datasets
Why Not: "The client's problem is 'find the right past ticket for this one customer, fast.'
Graph queries are powerful, but we'd pay in complexity and latency for features they don't need."

### Option B: Pinecone Vector-Only
Pros: Managed service, fastest single-query latency (~500ms), battle-tested
Cons: $1,500–2,000/month at their scale; proprietary; junior engineer can't debug failures
Why Not: "Cost explodes budget 20x. Also junior engineer can't debug why results are bad —
Pinecone is a black box. We need transparency for maintenance."

### Option C: Elasticsearch
Pros: Open source, excellent hybrid search, widely known
Cons: Requires separate infrastructure; more complex than PostgreSQL; overkill for 50k tickets
Why Not: "PostgreSQL + pgvector gets 95% of value with 50% less complexity.
Client has existing Postgres. No new infra to manage."

### ✅ Final Choice: PostgreSQL + pgvector
1. Data already in Postgres → no migration
2. Full-text search is fast for exact matches
3. Vector search is fast for semantic matches
4. Junior engineer understands SQL
5. Cost = hosting only (~$50/month)
6. Transparent failure modes
7. Upgrade path: Qdrant if they outgrow it (same pgvector code)

---

## Decision 2: Claude 3.5 Sonnet, NOT GPT-4

### Option A: GPT-4
Pros: Slightly better semantic understanding
Cons: 10x cost; for a reranking task the extra quality buys almost nothing; would triple monthly budget
Why Not: "Reranking is binary ('is this relevant?'). GPT-4 and Sonnet are both 99% accurate.
GPT-4 costs 10x more. Spend that on better retrieval instead."

### Option B: Open Source (Llama 2, Mixtral)
Pros: Free (no per-token cost); runs on their server; complete control
Cons: Requires GPU infrastructure ($500–1000/month); slower inference (3–5s); hard to maintain
Why Not: "Infrastructure + maintenance + latency hit cancels the 'free' savings.
Managed API is cheaper and faster."

### ✅ Final Choice: Claude 3.5 Sonnet
1. Cost-optimal for this task (reranking, not generation)
2. Fast (<1s)
3. Client pays per-use (no waste if searches drop)
4. Anthropic API is reliable

---

## Decision 3: SQLite for Metrics, NOT Datadog or CloudWatch

### Option A: Datadog
Pros: Enterprise monitoring, dashboards, alerting
Cons: Minimum $25/month + $0.25/GB; overkill for <100 searches/day
Why Not: "Client doesn't need enterprise monitoring for an MVP. SQLite is free and queryable."

### Option B: CloudWatch (AWS)
Pros: Integrated if on AWS, scales easily
Cons: Client is on Azure — adds vendor complexity
Why Not: "Client's stack is Azure. Keep it simple: one vendor."

### ✅ Final Choice: SQLite + simple Python dashboard
1. Zero cost
2. Client's engineer can query it directly: `SELECT * FROM metrics`
3. Easy to migrate: `INSERT INTO datadog_metrics SELECT * FROM sqlite_metrics`
4. Metrics live with the code (version controlled together)

---

## When to Revisit These Decisions
- If search volume exceeds 1,000 queries/day → Migrate pgvector to Qdrant or Weaviate
- If monthly cost exceeds $150 → Evaluate open-source LLM (self-hosted)
- If latency hits 2.5s → Add caching, prioritize recent tickets, or switch embedding model
- If quality drops below 70% → Collect more training data, or use GPT-4 for rerank

✅ Why Document "Why Not":

6 months later, client's engineer asks "Why not use Neo4j?" — you have the answer
If a "better" tool comes out next month, you have a decision framework to evaluate it
You look thoughtful, not impulsive ("I picked this because it's trendy")
When they grow and need to optimize, they know what knobs to turn

🎯 The 4 Moves Checklist

✅ ADR-001: Document the main choice (hybrid SQL + vector), alternatives rejected, and tradeoffs
✅ Observability: Cost tracker (daily spend), latency dashboard, quality metric (% relevant results)
✅ Edge Case Tests: Empty results, timeout degradation, duplicate deduplication, malformed JSON fallback
✅ Why Not Decisions: Why PostgreSQL (not Neo4j), why Sonnet (not GPT-4), why SQLite (not Datadog)

III

Part Three

How to Use AI for Writing Code

The gap between "AI wrote this" and "production-grade code" is intention and specificity.

The Three Zones of AI Code Assistance

✅ Zone 1: Boilerplate & Scaffolding — High Leverage

Database schemas, API route handlers, test skeletons, config files. AI excels here.

Example: "Generate a SQLAlchemy ORM model for a tickets table with id, customer_id, content, created_at"
What AI does right: Correct syntax, standard patterns, no mistakes
Your job: Verify it matches your schema, tweak indexes/constraints
Time saved: 20 mins → 2 mins

⚠️ Zone 2: Business Logic & Integration — Medium Leverage

Search pipelines, data transformations, multi-step flows. AI needs direction.

Example: "Write a Python function that combines SQL + vector search results, deduplicates by ticket id, and reranks using Claude"
What AI does right: Structure, skeleton, most of the code
Where it fails: Edge cases (empty results, timeouts), error handling, cost tracking
Your job: Add edge case handling, integrate observability, verify latency
Time saved: 2 hours → 30 mins (you fix the important parts)

❌ Zone 3: Architecture Decisions — Low Leverage

System design, error recovery, performance optimization. AI generates syntax; you drive strategy.

Example: "Should we add caching to the retrieval pipeline?"
What AI does wrong: Suggests cache without measuring latency, assumes all loads are equal
Your job: Measure first (where's the bottleneck?), then propose solution
What AI CAN do: Generate the cache implementation once you've decided to add it

🔄 Pattern 1: Structured Prompt for Code Generation

Bad prompt: "Write a search function"

Good prompt:

I'm building a support ticket search service. The stack is:
- Python 3.11 + Flask
- PostgreSQL with pgvector extension
- Claude 3.5 Sonnet API for reranking

Write a Python function called `hybrid_search` that:
1. Takes query (str), customer_id (str), limit (int=10)
2. Runs SQL full-text search on tickets.content
3. Runs pgvector semantic search (same query embedding)
4. Combines & deduplicates results (by ticket.id)
5. Calls Claude with prompt: "Rank these tickets by relevance"
6. Returns top N results as dicts with id, content, customer_id, similarity_score

Requirements:
- Max latency: 2 seconds (timeout rerank at 1.5s)
- Handle empty result sets (return [])
- Log tokens + latency to a metrics dict for observability
- Use the existing db connection (psycopg connection object)

Include error handling for:
- Database connection errors
- Claude API timeouts
- Malformed JSON responses from Claude

Also include docstring with example usage.

🔍 Pattern 2: Iterative Refinement (Don't Trust First Output)

Step 1: Get the initial code from AI

Step 2: Run it. Find issues.

Step 3: Give AI the error + constraint. Ask for fix.

You:  "Generate hybrid_search function with the constraints above"
AI:   ✓ Returns code

You:  "When query returns 0 results, it crashes on rerank. Fix the empty result case."
AI:   ✓ Adds guard clause

You:  "The rerank step is timing out at 1.8s. Add timeout handling that falls back to original ranking."
AI:   ✓ Adds timeout + fallback

You:  "Now track tokens + latency in a metrics dict. Log before returning."
AI:   ✓ Adds observability

🧪 Pattern 3: AI + Test-Driven Development

You write the test first: "Write a pytest test that verifies empty result set returns []"
AI generates the code to pass the test
You write another test (edge case): "Write a pytest test that verifies timeout on rerank falls back to original order"
AI generates the implementation

Why this works: Tests define the contract. AI fills in the implementation. You control behavior.

Pattern 4: Full Iteration Loop (with Code)

Step 1: AI generates the skeleton

def hybrid_search(query: str, customer_id: str, limit: int = 10):
    # SQL full-text search
    sql_results = db.execute("SELECT ... WHERE to_tsvector ...")
    
    # Vector search
    vec_results = db.execute("SELECT ... WHERE embedding <=> ...")
    
    # Combine
    combined = deduplicate(sql_results + vec_results)
    
    # Rerank
    ranked = rerank_with_claude(query, combined)
    
    return ranked[:limit]

Issues found in AI's first output:
❌ No error handling | ❌ No empty result check | ❌ No timeout handling | ❌ No observability | ❌ No latency measurement

Step 3: You request specific improvements

# Prompt to AI:
"Add:
1. Timeout handling for rerank step (>1.5s → return SQL+vector unranked, log timeout)
2. Empty result handling (if combined has 0 results, return [] immediately)
3. Metrics tracking (sql_latency_ms, vector_latency_ms, rerank_latency_ms, tokens)
4. Error handling for DB errors, Claude API errors, JSON parse errors
5. Docstring with example usage

Rewrite the full function with all of these."

# AI Output (now production-grade):
def hybrid_search(query: str, customer_id: str, limit: int = 10):
    metrics = {
        'sql_latency_ms': 0, 'vector_latency_ms': 0,
        'rerank_latency_ms': 0, 'total_latency_ms': 0,
        'input_tokens': 0, 'output_tokens': 0
    }
    start = time.time()
    
    try:
        sql_start = time.time()
        sql_results = db.execute(...)
        metrics['sql_latency_ms'] = (time.time() - sql_start) * 1000
        
        vec_start = time.time()
        vec_results = db.execute(...)
        metrics['vector_latency_ms'] = (time.time() - vec_start) * 1000
        
        combined = deduplicate(sql_results + vec_results)
        
        if not combined:
            metrics['total_latency_ms'] = (time.time() - start) * 1000
            return [], metrics
        
        rerank_start = time.time()
        try:
            ranked = rerank_with_claude(query, combined, timeout_seconds=1.5)
            metrics['rerank_latency_ms'] = (time.time() - rerank_start) * 1000
        except TimeoutError:
            logger.warning(f"Rerank timeout for query={query}")
            ranked = combined
            metrics['rerank_latency_ms'] = 1500
        
        metrics['total_latency_ms'] = (time.time() - start) * 1000
        return ranked[:limit], metrics
        
    except Exception as e:
        logger.error(f"hybrid_search failed: {e}")
        metrics['total_latency_ms'] = (time.time() - start) * 1000
        return combined[:limit], metrics

Step 4: You write tests for edge cases

def test_empty_result_set():
    results, metrics = hybrid_search("xyz", "unknown_customer")
    assert results == []
    assert metrics['total_latency_ms'] < 100  # Should be fast

def test_rerank_timeout_fallback():
    with patch('client.messages.create') as mock:
        mock.side_effect = TimeoutError()
        results, metrics = hybrid_search("query", "cust_123")
        assert len(results) > 0       # Still returns unranked
        assert metrics['total_latency_ms'] < 2000  # Under SLA

def test_claude_api_error_fallback():
    with patch('client.messages.create') as mock:
        mock.return_value.content = [MagicMock(text="garbage")]
        results, metrics = hybrid_search("query", "cust_123")
        assert len(results) > 0       # Still returns unranked

Pattern 5: Code Review Checklist (After AI Generates)

✓ What to Check After AI Generates Code

Error handling: Does it crash on empty results, timeouts, API errors?
Observability: Can you measure latency, cost, tokens used?
Edge cases: What happens if query is empty? N=0? Very large N?
Latency: Does it exceed SLA? Any N+1 queries?
Cost: How many API calls per request? Can it be optimized?
Testability: Can you unit test it? Is it overfit to specific data?
Maintainability: Can a junior engineer understand and modify it?
Documentation: Docstring + examples? Known limitations?

A Day in the Life: AI + Code

9:00 AM — Sketch the function

Whiteboard or docstring: "We need hybrid_search. Takes query, returns ranked results. Must be <2s."

9:15 AM — Write tests first

3–5 pytest tests for happy path + edge cases (empty, timeout, error).

9:30 AM — Detailed prompt to AI

Full context: stack, constraints, requirements, test cases. Ask for code to pass tests.

9:35 AM — AI generates code

Paste into IDE. Run tests.

9:40 AM — Review + identify gaps

Check error handling, observability, latency. Find 3–5 issues. Give specific prompts to fix.

9:50 AM — AI refines

"Add timeout handling with fallback," "Add metrics tracking," "Handle JSON errors."

10:05 AM — Tests pass ✓

All tests green. Code is production-ready. Docstring + examples included.

10:10 AM — Benchmark

Run locally with real data. Verify <2s latency, cost/query, quality metrics.

10:30 AM — Move to next component

This function took 1.5 hours and is bulletproof. Scale to the full system.

❌ Vibe Approach to AI Code

"Generate a full search service" → copy-paste entire output → ship

No error handling (crashes on edge cases)
No observability (no visibility into cost/latency)
No tests (client finds bugs, not you)
Code is opaque (you can't fix it)

✅ Senior Approach to AI Code

Use AI to 10x boilerplate; hand-build critical paths

AI: Schema, routes, test skeletons
You: Business logic, edge cases, observability
AI: Refinements based on test failures
You: Architecture, monitoring, handoff docs

↗

Synthesis

The Full 3-Week Sprint

Discovery → Architecture → Implementation → Handoff

Week 1: Discovery + Architecture

Days 1–2: Client discovery calls. Map problem → constraints → metrics.
Day 2 PM: Generate 2–3 solution options. Map to constraints.
Day 3: Client picks architecture. Write ADR + tool selection justification.
Days 4–5: Setup infrastructure (DB schema, repo structure, dev environment).

Week 2: Core Implementation + Tests

Days 1–2: Write tests (TDD). Build retrieval pipeline (SQL + vector).
Days 3–4: Build reranking + observability. Integrate Claude API.
Day 5: All tests pass. Verify latency <2s, cost <$0.01/query.

Week 3: Hardening + Handoff

Days 1–2: Edge case tests. Error handling. Load testing.
Days 3–4: Demo + metrics dashboard. Document ADR + DECISIONS + runbooks.
Day 5: Deploy to staging. Client trains their team. Handoff.

Discovery → Code: Mental Model

Problem Statement → Success Metrics → Constraints
    ↓
Architecture Options → Chosen Architecture → Tool Selection
    ↓
Tech Stack + Timeline → Scope Lock-In
    ↓
(NOW: Code Phase)
    ↓
Tests (TDD) → Function Skeleton (AI) → Refinement (You + AI loop)
    ↓
Production Code (error handling, observability, docs)
    ↓
Demo → Handoff (ADRs, tests, runbooks, dashboards)

Key Insight

Discovery defines success. Architecture limits tool choice. Tests define the contract. AI fills in the syntax. You ensure it's production-ready.

↗

Delivery

What You Hand Off to the Client

A folder structure that says "This is production-ready, not a demo."

🚀 Handoff Package Structure

ticket-search-service/
├── README.md                  # Quick start guide
├── DECISIONS.md               # Why you chose what you chose  ← THEY READ THIS
├── ADR-001-hybrid-search.md   # The architecture decision record
├── src/
│   ├── search_service.py
│   ├── observability.py       # Cost tracker (transparent)
│   └── main.py
├── tests/
│   └── test_search_edge_cases.py  # Edge case tests  ← THEY RUN THESE
├── requirements.txt
├── docker-compose.yml         # Local dev environment
└── metrics.db                 # Their costs, latency, quality data

When they ask "Can you add X to the search?" they can:

Read DECISIONS.md to understand what you tried
Look at edge case tests to see what could break
Check metrics.db to see if their change improved cost/latency/quality
Run tests before pushing to prod

✓ Final Checklist: AI-Assisted Code (Production Ready)

Tests pass: All edge cases, happy path, error cases
Latency verified: Measured with real data, under SLA
Observability baked in: Metrics dict, logging, tokens tracked
Error handling: Every exception caught + fallback defined
Documented: Docstring + usage example + known limitations
Code reviewed: You understand every line; can explain to client
Maintainable: Junior engineer can modify + extend without breaking
ADR written: Architecture decision + alternatives documented
"Why not" documented: Every rejected option has a reason

📁

Project Structure

Complete Project Skeleton

Every folder, file, and purpose — the full handoff-ready repo structure for a production AI consulting engagement.

Full repo layout — copy this as your starting point

ticket-search-service/
│
├── README.md                        # Quick-start: install, run, test, deploy
├── DECISIONS.md                     # Living "why not" decision log
├── CHANGELOG.md                     # Version history and notable changes
├── .env.example                     # All required env vars (no secrets)
├── .gitignore
├── Makefile                         # Shortcuts: make test, make run, make deploy
│
├── docs/
│   ├── architecture.md              # System diagram + data-flow narrative
│   ├── runbook.md                   # How to operate this in production
│   ├── onboarding.md                # Guide for client's junior engineer
│   │
│   ├── adr/                         # Architecture Decision Records
│   │   ├── ADR-001-hybrid-search.md         # SQL + pgvector vs pure vector
│   │   ├── ADR-002-llm-provider.md          # Claude Sonnet vs GPT-4 vs OSS
│   │   ├── ADR-003-metrics-storage.md       # SQLite vs Datadog vs CloudWatch
│   │   ├── ADR-004-deployment-target.md     # Railway vs Azure vs Kubernetes
│   │   └── ADR-005-embedding-model.md       # text-embedding-3-small vs voyage-3
│   │
│   └── diagrams/
│       ├── system-overview.mmd              # Mermaid: top-level architecture
│       ├── search-pipeline.mmd              # Mermaid: retrieval + rerank flow
│       └── data-flow.mmd                    # Mermaid: ingest → index → query
│
├── src/
│   ├── __init__.py
│   ├── main.py                      # App entrypoint (Flask/FastAPI)
│   ├── config.py                    # Settings loaded from env vars
│   │
│   ├── search/
│   │   ├── __init__.py
│   │   ├── hybrid_search.py         # Core: SQL + vector + rerank
│   │   ├── sql_retriever.py         # PostgreSQL full-text search
│   │   ├── vector_retriever.py      # pgvector semantic search
│   │   ├── reranker.py              # Claude rerank call + fallback
│   │   └── deduplicator.py          # Merge + deduplicate result sets
│   │
│   ├── ingestion/
│   │   ├── __init__.py
│   │   ├── embedder.py              # Batch embed tickets → pgvector
│   │   ├── indexer.py               # Upsert tickets into search index
│   │   └── preprocessor.py          # Clean + chunk raw ticket text
│   │
│   ├── observability/
│   │   ├── __init__.py
│   │   ├── cost_tracker.py          # Log tokens, cost, latency per query
│   │   ├── metrics_db.py            # SQLite schema + query helpers
│   │   └── dashboard.py             # CLI summary: cost/day, p99 latency
│   │
│   ├── api/
│   │   ├── __init__.py
│   │   ├── routes.py                # REST endpoints: /search, /health
│   │   ├── schemas.py               # Pydantic request/response models
│   │   └── middleware.py            # Auth, rate limiting, error handlers
│   │
│   └── db/
│       ├── __init__.py
│       ├── connection.py            # PostgreSQL connection pool
│       ├── migrations/
│       │   ├── 001_create_tickets.sql
│       │   ├── 002_add_pgvector.sql
│       │   └── 003_add_indexes.sql
│       └── seed_data.sql            # Dev/test sample tickets
│
├── tests/
│   ├── conftest.py                  # Shared fixtures (in-memory DB, mock Claude)
│   │
│   ├── unit/
│   │   ├── test_hybrid_search.py    # Happy path + edge cases
│   │   ├── test_sql_retriever.py    # Full-text search correctness
│   │   ├── test_vector_retriever.py # Semantic search correctness
│   │   ├── test_reranker.py         # Timeout fallback, JSON parse error
│   │   ├── test_deduplicator.py     # Duplicate ID removal
│   │   └── test_cost_tracker.py     # Token math, latency logging
│   │
│   ├── integration/
│   │   ├── test_search_pipeline.py  # Full SQL→vector→rerank flow
│   │   ├── test_ingestion.py        # Embed + index a batch of tickets
│   │   └── test_api_routes.py       # HTTP endpoint smoke tests
│   │
│   ├── edge_cases/
│   │   ├── test_empty_results.py    # Customer with zero tickets
│   │   ├── test_timeout_handling.py # Claude API slow/down
│   │   ├── test_duplicate_tickets.py# Same ticket from SQL + vector
│   │   └── test_malformed_json.py   # Claude returns bad JSON
│   │
│   └── performance/
│       ├── test_latency_sla.py      # Assert p99 < 2000ms under load
│       └── test_cost_per_query.py   # Assert cost < $0.01 per search
│
├── scripts/
│   ├── embed_all_tickets.py         # One-time: embed historical tickets
│   ├── backfill_embeddings.py       # Incremental: embed new tickets
│   ├── export_metrics.py            # Dump SQLite metrics to CSV/JSON
│   └── check_health.py              # Ping /health, verify DB + Claude API
│
├── infra/
│   ├── docker-compose.yml           # Local dev: Postgres + pgvector + app
│   ├── Dockerfile                   # Production container image
│   ├── railway.toml                 # Railway deployment config (MVP)
│   └── azure/
│       ├── container-app.bicep      # Azure Container Apps (enterprise)
│       └── keyvault-secrets.bicep   # Azure Key Vault for secrets
│
├── .github/
│   └── workflows/
│       ├── ci.yml                   # On PR: run tests, lint, type-check
│       └── deploy.yml               # On merge to main: deploy to staging
│
├── requirements.txt                 # Production dependencies
├── requirements-dev.txt             # Dev + test dependencies
└── pyproject.toml                   # Black, mypy, pytest config

Key Principle: Every File Has a Job

docs/ → Everything a human needs to understand or operate the system. Client reads this.

src/ → Modular by domain (search, ingestion, observability, api, db). Junior engineer edits one folder without touching others.

tests/ → Separated by type. Unit tests run in milliseconds. Integration tests need DB. Edge cases document known failure modes. Performance tests enforce SLAs.

scripts/ → Operational one-offs. Runbook references these by name.

infra/ → Everything needed to deploy. Railway for MVP, Azure for enterprise.

ADR Naming Convention

ADR-001 through ADR-00N — one file per major architectural decision. Each ADR covers: Decision, Status, Context, Options Considered, Chosen Option, Tradeoffs, and Revisit Triggers. The numbering is permanent — never delete or renumber, only add Status: SUPERSEDED by ADR-00X.

✓

Master Checklist

End-to-End Delivery Checklist

Check each item off as you complete it. Every box is a deliberate decision — not just a task.