Chapter 01 — Overview

AI Project Testing Framework

Complete testing strategy from data pipelines to model evaluation.

Why Testing Matters in AI

AI projects have unique testing challenges: data quality affects model performance, silent failures in data pipelines are common, and debugging requires both code inspection and data flow validation. This guide covers the full testing lifecycle.

Key Principle: Build from the foundation. Data validation (bottom) → unit tests → integration tests → model evaluation → end-to-end tests (top). Each layer must pass before moving higher.

Testing Pyramid for AI

1. Data Validation: Check data quality, schema, nulls, outliers, and distributions before any processing.
2. Unit Testing: Test individual functions: data loaders, preprocessors, vectorizers, prompts.
3. Integration Testing: Test components together: RAG pipelines, multi-step workflows, API interactions.
4. Model Testing: Evaluate LLM outputs, RAG quality, vector search accuracy, response consistency.
5. End-to-End Testing: Full user workflows from input to output; error handling and edge cases.
6. Performance Testing: Latency, throughput, cost tracking, memory usage, concurrent users.

Key Principles

Test data flow first: Data issues cascade downstream. Validate early and often.
Breakpoints are debugging tools, not tests: Use them to inspect state during development, then convert insights into automated tests.
Pytest for automation: All repeatable tests belong in pytest. Breakpoints are for exploration.
Test happy path AND edge cases: Off-by-one errors, null values, empty inputs, malformed data.
Separate concerns: Keep unit tests, integration tests, and model tests in different files.
Use fixtures for data: Reusable test data across multiple test functions.
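
The edge cases named in the list above can be made concrete with a small sketch. The `chunk_text` helper below is hypothetical (it is not part of this guide's project); it splits text into fixed-size word windows and is exercised with a happy path, an empty input, a whitespace-only input, and an exact-boundary input where off-by-one errors tend to hide.

```python
def chunk_text(text: str, size: int) -> list[str]:
    """Split text into chunks of at most `size` words (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def test_chunk_text_happy_path():
    assert chunk_text("a b c d e", 2) == ["a b", "c d", "e"]


def test_chunk_text_empty_input():
    assert chunk_text("", 3) == []


def test_chunk_text_whitespace_only():
    # Malformed input: nothing but spaces should yield no chunks
    assert chunk_text("   ", 3) == []


def test_chunk_text_exact_boundary():
    # Off-by-one check: exactly one full chunk, no trailing empty chunk
    assert chunk_text("a b c", 3) == ["a b c"]
```

Each test names the condition it guards, so a failure points directly at the edge case that regressed.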

Chapter 02 — Data Validation

Phase 1: Data Validation Testing

Ensure data quality before it enters your pipeline.

What to Validate

Schema & Types

Column names, types, order, data type correctness (int vs float vs string), format consistency for dates, emails, phone numbers.

Quality Checks

Missing values (nulls, empty strings, NaN), duplication and uniqueness constraints, referential integrity (foreign keys).

Statistical Checks

Value ranges (min/max, outliers), distribution checks (skew, balance), class balance for classification tasks.

Volume Checks

Minimum dataset size, row counts within expected bounds, batch completeness for time-series or streaming data.
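
Volume checks are the one category the validation examples in this chapter do not spell out beyond a minimum row count, so here is a minimal sketch using only the standard library. The function names and the daily-batch scenario are assumptions made for illustration; a real pipeline would take the observed dates from its own storage layer.

```python
from datetime import date, timedelta


def check_row_count(n_rows: int, min_rows: int, max_rows: int) -> None:
    """Raise if the row count falls outside the expected bounds."""
    if not min_rows <= n_rows <= max_rows:
        raise ValueError(f"Row count {n_rows} outside [{min_rows}, {max_rows}]")


def missing_days(observed: set[date], start: date, end: date) -> list[date]:
    """Return every calendar day in [start, end] with no batch present."""
    n_days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(n_days))
    return [d for d in expected if d not in observed]


# Example: a daily feed where the batch for 3 January never arrived
batches = {date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)}
gaps = missing_days(batches, date(2024, 1, 1), date(2024, 1, 4))
print(gaps)  # [datetime.date(2024, 1, 3)]

check_row_count(n_rows=1500, min_rows=1000, max_rows=5000)  # passes silently
```

An upper bound on row count is as useful as a lower one: a sudden doubling often signals a duplicated batch rather than real growth.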

Pandas Data Validation

Python

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('data.csv')

# Schema validation
expected_columns = ['id', 'text', 'label', 'timestamp']
assert list(df.columns) == expected_columns, "Column mismatch"

# Check nulls
print(df.isnull().sum())  # Should be 0 for critical columns
assert df['id'].isnull().sum() == 0, "ID has null values"

# Check data types
assert df['id'].dtype == 'int64', "ID should be integer"
assert df['label'].isin([0, 1]).all(), "Label must be 0 or 1"

# Check duplicates
assert df['id'].is_unique, "ID has duplicates"

# Value range checks
assert df['id'].min() > 0, "ID should be positive"
assert len(df) > 100, "Dataset too small"

print("✓ All validation checks passed")

Python

import pandas as pd
from scipy import stats
import re

# Distribution check
skewness = stats.skew(df['numeric_col'])
print(f"Skewness: {skewness:.2f}")
assert abs(skewness) < 2, "Data too skewed"

# Class balance (for classification)
class_counts = df['label'].value_counts()
ratio = class_counts.max() / class_counts.min()
print(f"Class imbalance ratio: {ratio:.2f}x")
assert ratio < 3, "Classes severely imbalanced"

# Format validation (regex)
email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
assert df['email'].str.match(email_pattern).all(), \
    "Invalid email format"

# Outlier detection (IQR method)
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['value'] < Q1 - 1.5*IQR) |
              (df['value'] > Q3 + 1.5*IQR)]
print(f"Outliers: {len(outliers)}/{len(df)}")

# Date validation
df['date'] = pd.to_datetime(df['date'])
assert (df['date'] > '2020-01-01').all(), "Dates too old"

print("✓ Advanced validation passed")

Pytest for Data Validation

Python — tests/test_data_validation.py

import pytest
import pandas as pd
from src.data.loader import load_data

@pytest.fixture
def sample_data():
    return pd.DataFrame({
        'id': [1, 2, 3],
        'text': ['hello', 'world', 'test'],
        'label': [0, 1, 0],
    })

def test_no_null_values(sample_data):
    """Critical columns should have no nulls"""
    assert sample_data['id'].isnull().sum() == 0
    assert sample_data['text'].isnull().sum() == 0

def test_correct_schema(sample_data):
    """Columns and types must match spec"""
    expected = ['id', 'text', 'label']
    assert list(sample_data.columns) == expected
    assert sample_data['id'].dtype == 'int64'

def test_no_duplicates(sample_data):
    """Primary key must be unique"""
    assert sample_data['id'].is_unique

def test_label_values_valid(sample_data):
    """Labels must be in valid range"""
    assert sample_data['label'].isin([0, 1]).all()

def test_text_length_reasonable(sample_data):
    """Text field should have content"""
    assert (sample_data['text'].str.len() > 0).all()
    assert (sample_data['text'].str.len() < 10000).all()

def test_load_data_from_file():
    """Integration: test actual data loading"""
    df = load_data('data/sample.csv')
    assert len(df) > 0
    assert 'id' in df.columns

Common Data Issues

Issue	Detection	Fix
Missing values	`df.isnull().sum()`	Drop rows, forward-fill, or impute with the mean
Type mismatch	`df.dtypes`	`pd.to_numeric()`, `astype()`
Duplicates	`df.duplicated().sum()`	`df.drop_duplicates()`
Outliers	IQR or z-score	Clip, cap, or remove
Encoding issues	Garbled text, unicode errors	Re-read with the correct codec, e.g. `pd.read_csv(path, encoding='latin-1')`
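
The table lists z-score detection and clipping as an alternative to the IQR method. A minimal standard-library sketch follows; the threshold of 3 standard deviations is a common convention rather than a rule, and the helper names are illustrative only.

```python
import statistics


def zscore_outliers(values: list[float], threshold: float = 3.0) -> list[int]:
    """Return indices of values more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def clip(values: list[float], lower: float, upper: float) -> list[float]:
    """Cap every value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


values = [10.0] * 19 + [1000.0]
print(zscore_outliers(values))        # [19]
print(clip(values, 0.0, 100.0)[-1])   # 100.0
```

Note that z-scores are themselves distorted by extreme values, which is one reason the IQR method is often preferred on small or heavy-tailed samples.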

Chapter 03 — Unit Testing

Phase 2: Unit Testing

Test individual functions with pytest.

What to Unit Test in AI

Data loaders and preprocessing functions
Tokenizers and text cleaners
Feature extraction and embedding functions
Prompt templates and prompt formatting
Configuration loading and validation
Utility functions (parsers, formatters, validators)
Vector search queries
Post-processing and output formatting
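
Configuration loading and validation appears in the list above but has no example elsewhere in this chapter, so here is a sketch of what such a unit could look like. `PipelineConfig`, `load_config`, the field names, and the temperature range of 0 to 2 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings for an embedding pipeline."""
    embedding_dim: int
    top_k: int
    temperature: float


def load_config(raw: dict) -> PipelineConfig:
    """Validate a raw settings dict and return a typed config."""
    missing = {"embedding_dim", "top_k", "temperature"} - set(raw)
    if missing:
        raise KeyError(f"Missing config keys: {sorted(missing)}")
    config = PipelineConfig(
        embedding_dim=int(raw["embedding_dim"]),
        top_k=int(raw["top_k"]),
        temperature=float(raw["temperature"]),
    )
    if config.embedding_dim <= 0 or config.top_k <= 0:
        raise ValueError("embedding_dim and top_k must be positive")
    if not 0.0 <= config.temperature <= 2.0:
        raise ValueError("temperature must be in [0, 2]")  # assumed range
    return config


def test_load_config_valid():
    config = load_config({"embedding_dim": 384, "top_k": 5, "temperature": 0.2})
    assert config.top_k == 5
    assert config.embedding_dim == 384


def test_load_config_coerces_strings():
    # Values read from environment variables usually arrive as strings
    config = load_config({"embedding_dim": "384", "top_k": "5", "temperature": "0"})
    assert config.temperature == 0.0
```

Failing fast at load time keeps a bad setting from surfacing later as a confusing shape mismatch or API error.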

Project Structure

Directory Structure

ai_project/
├── src/
│   ├── data/
│   │   ├── loader.py
│   │   └── preprocessor.py
│   ├── models/
│   │   └── embedder.py
│   ├── prompts/
│   │   └── templates.py
│   └── utils/
│       └── helpers.py
├── tests/
│   ├── test_data_loader.py
│   ├── test_preprocessor.py
│   ├── test_embedder.py
│   └── test_prompts.py
├── requirements.txt
├── requirements-dev.txt
└── pytest.ini

Unit Test Examples

Python — tests/test_data_loader.py

import pytest
from pathlib import Path
from src.data.loader import load_csv, clean_text

@pytest.fixture
def test_data_dir(tmp_path):
    """Create temporary test data"""
    csv_file = tmp_path / "test.csv"
    csv_file.write_text("id,text,label\n1,hello,0\n2,world,1")
    return tmp_path

@pytest.fixture
def sample_texts():
    return [
        "Hello world!",
        "Test    with   spaces",
        "Special chars: @#$%",
        "   leading/trailing   ",
    ]

def test_load_csv_success(test_data_dir):
    """CSV loads correctly"""
    df = load_csv(test_data_dir / "test.csv")
    assert len(df) == 2
    assert list(df.columns) == ['id', 'text', 'label']

def test_load_csv_file_not_found():
    """Missing file raises error"""
    with pytest.raises(FileNotFoundError):
        load_csv("nonexistent.csv")

def test_clean_text_removes_special_chars(sample_texts):
    """Special characters are removed"""
    result = clean_text(sample_texts[2])
    assert not any(c in result for c in "@#$%")

def test_clean_text_normalizes_spaces(sample_texts):
    """Multiple spaces become single"""
    result = clean_text(sample_texts[1])
    assert "  " not in result  # no run of two or more spaces remains

def test_clean_text_strips_whitespace(sample_texts):
    """Leading/trailing spaces removed"""
    result = clean_text(sample_texts[3])
    assert not result.startswith(" ")
    assert not result.endswith(" ")

Python — tests/test_preprocessor.py

import pytest
import numpy as np
from src.data.preprocessor import tokenize, vectorize, normalize_embeddings

@pytest.fixture
def sample_texts():
    return ["hello world", "test data", "example text"]

def test_tokenize_basic(sample_texts):
    """Tokenizer splits text correctly"""
    tokens = tokenize(sample_texts[0])
    assert tokens == ["hello", "world"]

def test_tokenize_empty_string():
    """Empty strings handled gracefully"""
    tokens = tokenize("")
    assert tokens == []

def test_vectorize_output_shape(sample_texts):
    """Vectorizer returns correct shape"""
    vectors = vectorize(sample_texts)
    assert vectors.shape == (3, 384)  # 3 texts, 384-dim embeddings

def test_vectorize_no_nan(sample_texts):
    """Vectors contain no NaN values"""
    vectors = vectorize(sample_texts)
    assert not np.isnan(vectors).any()

def test_normalize_embeddings_magnitude():
    """Unit norm after normalization"""
    vectors = np.random.randn(5, 384)
    normalized = normalize_embeddings(vectors)
    norms = np.linalg.norm(normalized, axis=1)
    np.testing.assert_array_almost_equal(norms, np.ones(5))

Python — tests/test_prompts.py

import pytest
from src.prompts.templates import format_rag_prompt, format_system_message

@pytest.fixture
def sample_context():
    return {
        "query": "What is AI?",
        "documents": [
            "AI is artificial intelligence",
            "Machine learning is a subset of AI"
        ],
        "user_name": "Alice"
    }

def test_format_rag_prompt_includes_query(sample_context):
    """Query is included in prompt"""
    prompt = format_rag_prompt(**sample_context)
    assert sample_context["query"] in prompt

def test_format_rag_prompt_includes_documents(sample_context):
    """Retrieved documents are in prompt"""
    prompt = format_rag_prompt(**sample_context)
    for doc in sample_context["documents"]:
        assert doc in prompt

def test_format_rag_prompt_empty_documents():
    """Empty documents handled gracefully"""
    prompt = format_rag_prompt(query="test", documents=[], user_name="Test")
    assert len(prompt) > 0

@pytest.mark.parametrize("query", [
    "short",
    "a" * 5000,
    "query with special chars: @#$%",
])
def test_format_rag_prompt_various_queries(query):
    """Handles various query formats"""
    prompt = format_rag_prompt(query=query, documents=["doc1"], user_name="Test")
    assert query in prompt

Running Pytest

Shell

# Run all tests
pytest

# Run specific file
pytest tests/test_data_loader.py

# Run specific test
pytest tests/test_data_loader.py::test_load_csv_success

# Verbose output
pytest -v

# Coverage report
pytest --cov=src --cov-report=html

# Run only tests matching pattern
pytest -k "test_clean_text"

# Stop at first failure
pytest -x

# Show print statements
pytest -s

pytest.ini Configuration

INI — pytest.ini

[pytest]
minversion = 7.0
addopts =
    -v
    --strict-markers
    --tb=short
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
markers =
    slow: marks tests as slow
    integration: marks tests as integration
    unit: marks tests as unit
    e2e: marks tests as end-to-end

Chapter 04 — Integration Testing

Phase 3: Integration Testing

Test components working together.

What to Test

RAG pipeline (retrieve → rerank → context building)
Multi-step workflows (load → process → vectorize → store)
Database connections and queries
Vector store operations (upsert, search)
LLM calls with retries and fallbacks
Cache functionality
Async operations and concurrent requests
Error handling across components
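
Two items in the list above, LLM calls with fallbacks and cache functionality, can be tested together without any network access by injecting stub providers. The sketch below is one possible shape; `generate_with_fallback` and the choice of `ConnectionError` as the failure signal are assumptions, not this project's actual API.

```python
from typing import Callable


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    cache: dict[str, str],
) -> str:
    """Return a cached answer, else try `primary`, else `fallback`."""
    if prompt in cache:
        return cache[prompt]
    try:
        answer = primary(prompt)
    except ConnectionError:
        answer = fallback(prompt)
    cache[prompt] = answer
    return answer


def test_fallback_then_cache():
    calls = []

    def failing_primary(prompt: str) -> str:
        calls.append("primary")
        raise ConnectionError("provider unavailable")

    def stub_fallback(prompt: str) -> str:
        calls.append("fallback")
        return "fallback answer"

    cache: dict[str, str] = {}
    first = generate_with_fallback("hi", failing_primary, stub_fallback, cache)
    second = generate_with_fallback("hi", failing_primary, stub_fallback, cache)

    assert first == second == "fallback answer"
    # The second call is served from the cache, so no provider is called again
    assert calls == ["primary", "fallback"]
```

Recording the call order in a list makes the interaction between components visible, which is the point of an integration test.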

RAG Pipeline Integration Test

Python — tests/test_rag_pipeline.py

import pytest
from src.rag.pipeline import RAGPipeline

@pytest.fixture
def rag_pipeline():
    return RAGPipeline(
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
        vector_db="chroma",
        llm_provider="openai"
    )

@pytest.fixture
def test_documents():
    return [
        {"id": "1", "content": "Python is a programming language"},
        {"id": "2", "content": "Python was created in 1989"},
        {"id": "3", "content": "Java is another programming language"},
    ]

@pytest.mark.integration
def test_rag_indexing(rag_pipeline, test_documents):
    """Documents are indexed correctly"""
    rag_pipeline.index_documents(test_documents)
    assert rag_pipeline.doc_count() == 3

@pytest.mark.integration
def test_rag_retrieval(rag_pipeline, test_documents):
    """Query retrieves relevant documents"""
    rag_pipeline.index_documents(test_documents)
    results = rag_pipeline.retrieve("What is Python?", top_k=2)

    assert len(results) == 2
    assert results[0]['score'] >= results[1]['score']  # sorted best-first; ties allowed
    assert "Python" in results[0]['content']

@pytest.mark.integration
def test_rag_end_to_end(rag_pipeline, test_documents):
    """Full RAG: index → retrieve → augment → generate"""
    rag_pipeline.index_documents(test_documents)
    response = rag_pipeline.query(
        query="When was Python created?",
        use_rag=True
    )
    assert response['answer'] is not None
    assert len(response['sources']) > 0

@pytest.mark.integration
def test_rag_error_handling_invalid_query(rag_pipeline, test_documents):
    """Handles invalid queries"""
    rag_pipeline.index_documents(test_documents)
    with pytest.raises(ValueError):
        rag_pipeline.retrieve("")
    with pytest.raises(TypeError):
        rag_pipeline.retrieve(None)

Database Integration Test

Python — tests/test_database.py

import pytest
from src.database import Database

@pytest.fixture
def db():
    """In-memory database for testing"""
    db = Database(":memory:")
    db.create_tables()
    yield db
    db.close()

@pytest.mark.integration
def test_insert_and_retrieve(db):
    """Data insert and retrieval"""
    db.insert_document(doc_id="1", content="Test document", embedding=[0.1, 0.2, 0.3])
    result = db.get_document("1")
    assert result is not None
    assert result['content'] == "Test document"

@pytest.mark.integration
def test_vector_search(db):
    """Vector similarity search"""
    db.insert_document("1", "Python programming", [0.1, 0.2, 0.3])
    db.insert_document("2", "Java programming", [0.1, 0.2, 0.4])
    results = db.search_similar([0.1, 0.2, 0.3], top_k=2)
    assert len(results) == 2
    assert results[0]['doc_id'] == "1"

@pytest.mark.integration
def test_batch_operations(db):
    """Batch insert and delete"""
    docs = [
        {"doc_id": "1", "content": "doc1", "embedding": [0.1]},
        {"doc_id": "2", "content": "doc2", "embedding": [0.2]},
        {"doc_id": "3", "content": "doc3", "embedding": [0.3]},
    ]
    db.batch_insert(docs)
    assert db.count() == 3
    db.batch_delete(["1", "2"])
    assert db.count() == 1

Async Integration Test

Python — tests/test_async_pipeline.py

import pytest
import asyncio
from src.async_pipeline import AsyncPipeline

@pytest.mark.asyncio
async def test_concurrent_requests():
    """Multiple async requests work concurrently"""
    pipeline = AsyncPipeline()
    tasks = [
        pipeline.process_async("query 1"),
        pipeline.process_async("query 2"),
        pipeline.process_async("query 3"),
    ]
    results = await asyncio.gather(*tasks)
    assert len(results) == 3
    assert all(r is not None for r in results)

@pytest.mark.asyncio
async def test_error_handling_async():
    """Async errors propagate correctly"""
    pipeline = AsyncPipeline()
    with pytest.raises(ValueError):
        await pipeline.process_async(None)

Chapter 05 — Model Testing

Phase 4: Model Testing

Evaluate LLM outputs and AI quality.

What to Test

Output Format

JSON structure, required keys, response schema, token counts and length limits.

Consistency

Same input → similar outputs, determinism at temperature=0, diversity at high temperature.

RAG Quality

Faithfulness, answer relevancy, context precision, context recall via RAGAS metrics.

Embedding Quality

Correct dimensions, unit norm, semantic similarity ordering, no NaN values.
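
The embedding checks listed above (dimensions, unit norm, no NaN, semantic similarity ordering) can be sketched without a model by using toy vectors. The three-dimensional vectors below are stand-ins chosen by hand; with a real model the same assertions would run on its output.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def check_embedding(vec: list[float], dim: int) -> None:
    """Assert dimension, absence of NaN, and unit norm."""
    assert len(vec) == dim, f"expected {dim} dims, got {len(vec)}"
    assert not any(math.isnan(x) for x in vec), "NaN in embedding"
    assert math.isclose(math.hypot(*vec), 1.0, abs_tol=1e-6), "not unit norm"


# Toy 3-dimensional vectors standing in for real model output
cat = [0.6, 0.8, 0.0]
kitten = [0.8, 0.6, 0.0]
car = [0.0, 0.0, 1.0]

for v in (cat, kitten, car):
    check_embedding(v, dim=3)

# Semantic ordering: related items should score higher than unrelated ones
assert cosine(cat, kitten) > cosine(cat, car)
```

Ordering assertions of this kind tend to be more stable across model versions than assertions on exact similarity values.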

LLM Output Format Tests

Python — tests/test_llm_output.py

import pytest
import json
from src.llm.client import LLMClient

@pytest.fixture
def llm_client():
    return LLMClient(provider="openai", model="gpt-4")

def test_json_output_format(llm_client):
    """LLM returns valid JSON when requested"""
    response = llm_client.generate(
        prompt="Extract entities from: 'John works at Apple'",
        response_format="json",
        schema={"type": "object", "properties": {
            "person": {"type": "string"},
            "company": {"type": "string"}
        }}
    )
    data = json.loads(response)
    assert data['person'] == "John"
    assert data['company'] == "Apple"

def test_structured_output_required_fields(llm_client):
    """All required fields present in output"""
    response = llm_client.generate(
        prompt="Classify: 'I love this product!'",
        response_format="json",
        schema={"required": ["sentiment", "confidence", "keywords"]}
    )
    data = json.loads(response)
    assert 'sentiment' in data
    assert 'confidence' in data
    assert 'keywords' in data

def test_response_within_length_limit(llm_client):
    """Response respects max_tokens"""
    response = llm_client.generate(prompt="Summarize in 2 sentences", max_tokens=50)
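    # Whitespace-split words only approximate tokens; count with the model's
    # tokenizer (e.g. tiktoken) when an exact check is needed.
    assert len(response.split()) < 50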

@pytest.mark.parametrize("invalid_prompt", [None, "", " " * 100])
def test_invalid_prompts_handled(llm_client, invalid_prompt):
    """Invalid prompts raise appropriate errors"""
    with pytest.raises((ValueError, TypeError)):
        llm_client.generate(invalid_prompt)
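
Format checks like the ones above can also run against recorded responses, which avoids a live API call on every test run. The validator below is a minimal standard-library sketch; `validate_llm_json` and the canned response are hypothetical, and a project with richer schemas might reach for a dedicated schema library instead.

```python
import json


def validate_llm_json(raw: str, required: dict[str, type]) -> dict:
    """Parse `raw` as JSON and check required keys and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"Missing required key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"Key {key!r} should be {expected_type.__name__}")
    return data


# Canned response standing in for a real model call
raw_response = '{"sentiment": "positive", "confidence": 0.93, "keywords": ["love"]}'
schema = {"sentiment": str, "confidence": float, "keywords": list}
data = validate_llm_json(raw_response, schema)
print(data["sentiment"])  # positive
```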

Consistency Testing

Python — tests/test_consistency.py

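import pytest
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from src.llm.client import LLMClient

@pytest.fixture
def llm_client():
    return LLMClient(provider="openai", model="gpt-4")

@pytest.fixture(scope="module")
def embeddings_model():
    # Assumed embedding backend; any model exposing .encode() would work here
    return SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")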

def test_response_consistency(llm_client, embeddings_model):
    """Same prompt produces similar responses"""
    prompt = "What is the capital of France?"
    responses = [llm_client.generate(prompt, temperature=0.7) for _ in range(3)]
    embeddings = embeddings_model.encode(responses)
    similarity_matrix = cosine_similarity(embeddings)
    avg_similarity = np.mean([similarity_matrix[i][j]
        for i in range(3) for j in range(i+1, 3)])
    assert avg_similarity > 0.7, "Responses too inconsistent"

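def test_deterministic_with_temperature_zero(llm_client):
    """Same prompt with temperature=0 should give identical output"""
    # Hosted models are not always bit-for-bit deterministic even at
    # temperature=0; relax this to a similarity check if it proves flaky.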
    prompt = "Spell 'hello'"
    responses = [llm_client.generate(prompt, temperature=0) for _ in range(2)]
    assert responses[0] == responses[1]

RAG Quality Metrics (RAGAS)

Python — tests/test_rag_quality.py

import pytest
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

@pytest.fixture
def rag_eval_data():
    return Dataset.from_dict({
        "question": ["What is Python?", "When was Python created?"],
        "answer": [
            "Python is a high-level programming language.",
            "Python was created in 1989.",
        ],
        "contexts": [
            ["Python is a high-level, interpreted programming language.", "Created by Guido van Rossum"],
            ["Python was first released in 1991.", "Guido van Rossum started the project in 1989."]
        ],
        "ground_truth": [
            "Python is a programming language",
            "Python was created in 1989 and released in 1991",
        ]
    })

# Assumes the result supports dict-style lookup of the aggregate score by
# metric name (ragas 0.1.x behaviour); check this against your installed version.

def test_rag_faithfulness(rag_eval_data):
    """Answers grounded in context (target > 0.7)"""
    score = evaluate(rag_eval_data, metrics=[faithfulness])
    assert score['faithfulness'] > 0.7

def test_rag_answer_relevancy(rag_eval_data):
    """Answers address the question (target > 0.7)"""
    score = evaluate(rag_eval_data, metrics=[answer_relevancy])
    assert score['answer_relevancy'] > 0.7

def test_rag_context_precision(rag_eval_data):
    """Retrieved context is relevant (target > 0.7)"""
    score = evaluate(rag_eval_data, metrics=[context_precision])
    assert score['context_precision'] > 0.7

RAGAS Metrics Reference

Faithfulness: Are claims in the answer grounded in retrieved context?
Answer Relevancy: Does the answer actually address the question?
Context Precision: Is the retrieved context relevant to the question?
Context Recall: Does the retrieved context contain the ground truth?

Chapter 06 — End-to-End Testing

Phase 5: End-to-End Testing

Test complete user workflows from input to output.

What to Test

Full user workflows (user input → AI response)
Error scenarios and recovery
Timeout handling
Retry logic with exponential backoff
Fallback mechanisms
Rate limiting
Session persistence across requests
Multi-turn conversations
File uploads and processing
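
Retry logic with exponential backoff, listed above, is easiest to test when the sleep function is injectable so the test does not actually wait. The following is a sketch under that assumption; `retry_with_backoff`, its defaults, and the use of `TimeoutError` as the retryable error are illustrative choices.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on TimeoutError with delays that double each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(base_delay * 2 ** attempt)


def test_retries_then_succeeds():
    outcomes = iter([TimeoutError(), TimeoutError(), "ok"])
    delays = []

    def flaky():
        result = next(outcomes)
        if isinstance(result, Exception):
            raise result
        return result

    # Passing delays.append as `sleep` records the waits instead of sleeping
    assert retry_with_backoff(flaky, sleep=delays.append) == "ok"
    assert delays == [0.5, 1.0]  # 0.5 * 2**0, then 0.5 * 2**1
```

Production implementations commonly add random jitter to the delay so that many clients do not retry in lockstep; it is omitted here to keep the test deterministic.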

Full Workflow E2E Test

Python — tests/test_e2e.py

import pytest
from src.app import App

@pytest.fixture
def app():
    return App(env="test")

@pytest.mark.e2e
def test_user_query_happy_path(app):
    """Full: user sends query → gets relevant response"""
    session = app.create_session(user_id="test_user")
    response = session.ask("What is machine learning?")

    assert response['answer'] is not None
    assert len(response['answer']) > 0
    assert response['sources'] is not None
    assert response['latency_ms'] < 5000  # Under 5 seconds

@pytest.mark.e2e
def test_multi_turn_conversation(app):
    """Conversation maintains context across turns"""
    session = app.create_session(user_id="test_user")

    session.ask("My name is Alice")
    response = session.ask("What's my name?")

    assert "Alice" in response['answer']

@pytest.mark.e2e
def test_error_recovery(app):
    """App recovers from partial failures gracefully"""
    session = app.create_session(user_id="test_user")

    # Simulate a query that might stress the system
    response = session.ask("a" * 10000)  # Very long query
    assert response is not None
    assert 'error' in response or 'answer' in response

@pytest.mark.e2e
def test_rate_limiting(app):
    """Rate limiting triggers and returns proper error"""
    session = app.create_session(user_id="test_user")
    responses = [session.ask(f"Query {i}") for i in range(100)]
    rate_limited = [r for r in responses if r.get('status') == 429]
    assert len(rate_limited) > 0

API E2E with httpx

Python — tests/test_api_e2e.py

import pytest
import httpx

BASE_URL = "http://localhost:8000"

@pytest.mark.e2e
def test_health_endpoint():
    """Health check returns 200"""
    r = httpx.get(f"{BASE_URL}/health")
    assert r.status_code == 200
    assert r.json()["status"] == "ok"

@pytest.mark.e2e
def test_query_endpoint():
    """Query endpoint returns structured response"""
    payload = {"query": "What is RAG?", "user_id": "test"}
    r = httpx.post(f"{BASE_URL}/query", json=payload, timeout=10)
    assert r.status_code == 200
    data = r.json()
    assert "answer" in data
    assert "sources" in data

@pytest.mark.e2e
def test_invalid_query_returns_422():
    """Missing required fields → validation error"""
    r = httpx.post(f"{BASE_URL}/query", json={})
    assert r.status_code == 422

Chapter 07 — Performance Testing

Phase 6: Performance Testing

Measure latency, throughput, and cost.

Key Metrics

Latency

P50, P95, P99 response times. Time-to-first-token (TTFT) for streaming responses.

Throughput

Requests per second under load. Token generation speed (tokens/sec).

Cost

API token usage and estimated USD cost per query. Memory footprint during inference.
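
To make the latency and cost metrics above concrete, here is a small sketch of how percentiles and per-query cost might be computed. It uses the nearest-rank percentile definition, which is one of several in common use, and the helper names are hypothetical. The prices in the usage lines are placeholders, not a statement of any provider's current rates.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_usd(prompt_tokens: int, completion_tokens: int,
             prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    """Estimated cost of one call from token counts and per-1K-token prices."""
    return (prompt_tokens * prompt_price_per_1k
            + completion_tokens * completion_price_per_1k) / 1000


latencies_ms = list(range(100, 2100, 100))  # 20 samples: 100, 200, ..., 2000
print(percentile(latencies_ms, 50))  # 1000
print(percentile(latencies_ms, 95))  # 1900
print(percentile(latencies_ms, 99))  # 2000

# Placeholder prices per 1K tokens; substitute your provider's rates
print(f"${cost_usd(1000, 500, 0.03, 0.06):.4f}")
```

With only 20 samples, P99 coincides with the maximum, which is a reminder that tail percentiles need a sample size large enough to resolve them.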

Latency Benchmarking

Python — tests/test_performance.py

import pytest
import time
import statistics
from src.llm.client import LLMClient

@pytest.fixture
def llm_client():
    return LLMClient(provider="openai", model="gpt-4")

@pytest.mark.slow
def test_latency_p95(llm_client):
    """P95 latency under 3 seconds"""
    latencies = []
    prompt = "Summarize AI in one sentence."
    for _ in range(20):
        start = time.perf_counter()
        llm_client.generate(prompt)
        latencies.append(time.perf_counter() - start)

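    # statistics.quantiles with n=20 returns 19 cut points; the last is P95.
    # (Indexing the sorted list at int(len * 0.95) would return the maximum
    # when there are only 20 samples.)
    p95 = statistics.quantiles(latencies, n=20)[-1]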
    print(f"\nP50: {statistics.median(latencies):.2f}s")
    print(f"P95: {p95:.2f}s")
    assert p95 < 3.0, f"P95 latency {p95:.2f}s exceeds 3s threshold"

@pytest.mark.slow
def test_throughput_under_load(llm_client):
    """At least 5 req/s under concurrent load"""
    import concurrent.futures

    def single_request():
        start = time.perf_counter()
        llm_client.generate("What is 2+2?")
        return time.perf_counter() - start

    n = 20
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as ex:
        list(ex.map(lambda _: single_request(), range(n)))
    elapsed = time.perf_counter() - start
    rps = n / elapsed
    print(f"\nThroughput: {rps:.1f} req/s")
    assert rps >= 5, f"Throughput {rps:.1f} req/s below 5 req/s threshold"

@pytest.mark.slow
def test_token_cost_tracking(llm_client):
    """Track token usage and estimated cost"""
    response = llm_client.generate_with_usage("Explain embeddings briefly.")
    usage = response['usage']
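    # Hard-coded prices: $0.03 per 1K prompt tokens and $0.06 per 1K
    # completion tokens. Update these to your provider's current rates.
    cost = (usage['prompt_tokens'] * 0.00003 +
            usage['completion_tokens'] * 0.00006)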
    print(f"\nTokens: {usage['total_tokens']}, Cost: ${cost:.6f}")
    assert cost < 0.01, "Single query cost exceeds $0.01"

Memory Profiling

Python

import tracemalloc
from src.models.embedder import EmbeddingModel

def test_embedding_memory_usage():
    """Embedding model stays under 500MB"""
    tracemalloc.start()
    model = EmbeddingModel()
    texts = ["sample text"] * 1000
    model.embed_batch(texts)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    peak_mb = peak / 1024 / 1024
    print(f"\nPeak memory: {peak_mb:.1f} MB")
    assert peak_mb < 500, f"Memory usage {peak_mb:.1f}MB exceeds 500MB"

Performance Thresholds (recommended baseline): P95 latency < 3s · P99 latency < 5s · Throughput ≥ 5 req/s · Memory < 500MB · Cost < $0.01/query

Chapter 08 — Debugging & Breakpoints

Debugging & Breakpoints

Using breakpoints for exploration, converting insights to tests.

Breakpoint Workflow

1. Exploration: Set breakpoints to inspect state during development. Check variable values, data shapes, intermediate outputs.
2. Insight: Identify what's wrong or unexpected. Document the issue with concrete values and conditions.
3. Test: Convert your insight into a pytest test that would have caught this. Add to your test suite.
4. Fix: Implement the fix. Run the new test to confirm it passes. Run full suite to check for regressions.
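
As an illustration of steps 3 and 4, suppose a breakpoint inside a text preprocessing function revealed that one record arrived with `text=None`, which crashed on `.lower()`. That scenario is hypothetical, but the pattern is general: the observed value becomes the input of a regression test, and the fix is whatever makes that test pass.

```python
from typing import Optional


def preprocess_text(text: Optional[str]) -> str:
    """Lowercase and collapse all whitespace runs to single spaces."""
    if text is None:
        return ""  # the fix: a missing field used to raise AttributeError here
    return " ".join(text.lower().split())


def test_preprocess_text_handles_none():
    """Regression test for the value observed at the breakpoint: text=None."""
    assert preprocess_text(None) == ""


def test_preprocess_text_collapses_whitespace():
    assert preprocess_text("  Hello\tWorld \n") == "hello world"
```

Writing the test before the fix confirms that it fails for the right reason, so a later pass is meaningful.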

Setting Breakpoints

Python

def preprocess_text(text: str) -> str:
    breakpoint()  # drops into interactive pdb debugger
    text = text.lower()
    tokens = text.split()
    return " ".join(tokens)

# pdb commands:
# n  → next line
# s  → step into function
# c  → continue to next breakpoint
# p variable  → print value
# pp variable → pretty-print value
# l  → list surrounding code
# q  → quit debugger

Shell

# Drop into pdb on any failure
pytest --pdb

# Drop into pdb at the first failure only
pytest --pdb -x

# Show local variable values in tracebacks
pytest --showlocals --tb=long -v

# Run only the failing test, drop into pdb
pytest tests/test_data_loader.py::test_load_csv_success --pdb

Python

import logging

import numpy as np

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s [%(levelname)s] %(name)s - %(message)s"
)
logger = logging.getLogger(__name__)

def embed_documents(docs: list[str]) -> np.ndarray:
    # `model` is assumed to be an already-loaded embedding model whose
    # encode() returns a NumPy array
    logger.debug(f"Embedding {len(docs)} documents")
    embeddings = model.encode(docs)
    logger.debug(f"Embedding shape: {embeddings.shape}")
    if np.isnan(embeddings).any():  # ndarray has no .isnan() method
        logger.error("NaN detected in embeddings!")
    return embeddings

# pytest.ini — capture logs during tests
# [pytest]
# log_cli = true
# log_cli_level = DEBUG

Breakpoints vs Tests: Breakpoints are for exploration; remove them before committing. Every insight from a debugging session should be encoded as a permanent pytest test. If you found a bug with a breakpoint, write a test that would have caught it first.

Common Debugging Patterns for AI

Python — Common Inspection Snippets

# Check embedding shape and quality
print(f"Shape: {embeddings.shape}, NaN: {np.isnan(embeddings).sum()}, Norm: {np.linalg.norm(embeddings, axis=1).mean():.3f}")

# Inspect retrieval scores
for i, (doc, score) in enumerate(zip(results, scores)):
    print(f"[{i}] score={score:.4f} | {doc['content'][:80]}")

# Check prompt token count
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode(prompt)
print(f"Prompt tokens: {len(tokens)}")

# Inspect dataframe at a stage
print(df.describe())
print(df.dtypes)
print(df.isnull().sum())

Chapter 09 — Testing Checklist

Complete Testing Checklist

Use this before going to production.

Data Validation Phase

Schema validated (columns, types, order)

Null values handled in critical columns

Duplicates removed or flagged

Outliers detected and handled

Class balance checked (for classification)

Format validation (dates, emails, etc.)

Unit Testing Phase

Data loaders tested with fixtures

Preprocessing functions covered

Prompt templates tested with edge cases

Error cases tested (FileNotFoundError, ValueError, etc.)

Parameterized tests for varied inputs

Coverage > 80% for core modules

Integration Testing Phase

RAG pipeline end-to-end tested

Database insert / retrieve / search tested

Async operations verified concurrently

Error handling across components verified

Model Testing Phase

LLM output format (JSON, schema) validated

Response consistency tested (same prompt → similar output)

RAGAS metrics: faithfulness > 0.7

RAGAS metrics: answer relevancy > 0.7

Embedding dimensions and norms verified

End-to-End & Performance Phase

Full user workflow passes E2E test

Multi-turn conversation context maintained

P95 latency < 3 seconds

Throughput ≥ 5 req/s under load

Cost per query < $0.01

Memory usage < 500MB at peak

Chapter 10 — Tools & Resources

Tools & Resources

Essential testing libraries for AI projects.

Core Libraries

Tool	Purpose	Install
pytest	Testing framework	`pip install pytest`
pytest-cov	Coverage reporting	`pip install pytest-cov`
pytest-asyncio	Async test support	`pip install pytest-asyncio`
RAGAS	RAG quality evaluation	`pip install ragas`
LangSmith	LLM tracing & observability	`pip install langsmith`
deepeval	LLM evaluation framework	`pip install deepeval`
httpx	API E2E testing	`pip install httpx`
faker	Generate test data	`pip install faker`

Requirements Files

Text — requirements-dev.txt

# Testing
pytest>=7.0
pytest-cov>=4.0
pytest-asyncio>=0.21
pytest-mock>=3.10

# AI Evaluation
ragas>=0.1
deepeval>=0.20

# API Testing
httpx>=0.25

# Data Generation
faker>=19.0

# Profiling
memory-profiler>=0.61
py-spy>=0.3

CI/CD Integration

YAML — .github/workflows/test.yml

name: AI Project Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install -r requirements.txt -r requirements-dev.txt

      - name: Run unit tests
        run: pytest -m "not integration and not slow and not e2e" -v --cov=src --cov-report=xml

      - name: Run integration tests
        run: pytest -m integration -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Upload coverage
        uses: codecov/codecov-action@v4

Testing Strategy Summary: Run unit tests on every commit (fast, no external deps). Run integration tests on PRs (need API keys). Run performance and E2E tests before releases. Gate production deploys on RAGAS score thresholds.
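
Gating a deploy on score thresholds can be as simple as a script whose return value becomes the CI step's exit code. The sketch below assumes the scores are already available as a plain dict of metric name to float; `gate`, `failing_metrics`, and the example scores are hypothetical, and the 0.7 thresholds mirror the targets used in this guide's RAGAS tests.

```python
# Thresholds mirror the 0.7 targets used in this guide's RAGAS tests
THRESHOLDS = {"faithfulness": 0.7, "answer_relevancy": 0.7, "context_precision": 0.7}


def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return names of metrics that are missing or not above their threshold."""
    return [
        name for name, minimum in thresholds.items()
        if scores.get(name, 0.0) <= minimum
    ]


def gate(scores: dict[str, float]) -> int:
    """Return an exit code: 0 to allow the deploy, 1 to block it."""
    failures = failing_metrics(scores, THRESHOLDS)
    for name in failures:
        print(f"FAIL {name}: {scores.get(name)} (needs > {THRESHOLDS[name]})")
    return 1 if failures else 0


# Hypothetical scores; in practice these would come from an evaluation run
example_scores = {"faithfulness": 0.82, "answer_relevancy": 0.74, "context_precision": 0.65}
exit_code = gate(example_scores)  # a CI entry point would pass this to sys.exit()
print(exit_code)  # 1
```

Treating a missing metric as a failure is deliberate: a gate that passes when the evaluation silently produced nothing protects no one.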
