AI Project Testing Framework
Complete testing strategy from data pipelines to model evaluation.
Why Testing Matters in AI
AI projects have unique testing challenges: data quality affects model performance, silent failures in data pipelines are common, and debugging requires both code inspection and data flow validation. This guide covers the full testing lifecycle.
Testing Pyramid for AI
-
1
Data Validation
Check data quality, schema, nulls, outliers, and distributions before any processing.
-
2
Unit Testing
Test individual functions: data loaders, preprocessors, vectorizers, prompts.
-
3
Integration Testing
Test components together: RAG pipelines, multi-step workflows, API interactions.
-
4
Model Testing
Evaluate LLM outputs, RAG quality, vector search accuracy, response consistency.
-
5
End-to-End Testing
Full user workflows from input to output; error handling and edge cases.
-
6
Performance Testing
Latency, throughput, cost tracking, memory usage, concurrent users.
Key Principles
- Test data flow first: Data issues cascade downstream. Validate early and often.
- Breakpoints are debugging tools, not tests: Use them to inspect state during development, then convert insights into automated tests.
- Pytest for automation: All repeatable tests belong in pytest. Breakpoints are for exploration.
- Test happy path AND edge cases: Off-by-one errors, null values, empty inputs, malformed data.
- Separate concerns: Keep unit tests, integration tests, and model tests in different files.
- Use fixtures for data: Reusable test data across multiple test functions.
Phase 1: Data Validation Testing
Ensure data quality before it enters your pipeline.
What to Validate
Schema & Types
Column names, types, order, data type correctness (int vs float vs string), format consistency for dates, emails, phone numbers.
Quality Checks
Missing values (nulls, empty strings, NaN), duplication and uniqueness constraints, referential integrity (foreign keys).
Statistical Checks
Value ranges (min/max, outliers), distribution checks (skew, balance), class balance for classification tasks.
Volume Checks
Minimum dataset size, row counts within expected bounds, batch completeness for time-series or streaming data.
Pandas Data Validation
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv('data.csv')
# Schema validation
expected_columns = ['id', 'text', 'label', 'timestamp']
assert list(df.columns) == expected_columns, "Column mismatch"
# Check nulls
print(df.isnull().sum()) # Should be 0 for critical columns
assert df['id'].isnull().sum() == 0, "ID has null values"
# Check data types
assert df['id'].dtype == 'int64', "ID should be integer"
assert df['label'].isin([0, 1]).all(), "Label must be 0 or 1"
# Check duplicates
assert df['id'].is_unique, "ID has duplicates"
# Value range checks
assert df['id'].min() > 0, "ID should be positive"
assert len(df) > 100, "Dataset too small"
print("✓ All validation checks passed")
import pandas as pd
from scipy import stats
import re
# Distribution check
skewness = stats.skew(df['numeric_col'])
print(f"Skewness: {skewness:.2f}")
assert abs(skewness) < 2, "Data too skewed"
# Class balance (for classification)
class_counts = df['label'].value_counts()
ratio = class_counts.max() / class_counts.min()
print(f"Class imbalance ratio: {ratio:.2f}x")
assert ratio < 3, "Classes severely imbalanced"
# Format validation (regex)
email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
assert df['email'].str.match(email_pattern).all(), \
"Invalid email format"
# Outlier detection (IQR method)
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['value'] < Q1 - 1.5*IQR) |
(df['value'] > Q3 + 1.5*IQR)]
print(f"Outliers: {len(outliers)}/{len(df)}")
# Date validation
df['date'] = pd.to_datetime(df['date'])
assert (df['date'] > '2020-01-01').all(), "Dates too old"
print("✓ Advanced validation passed")
Pytest for Data Validation
import pytest
import pandas as pd
from src.data.loader import load_data
@pytest.fixture
def sample_data():
return pd.DataFrame({
'id': [1, 2, 3],
'text': ['hello', 'world', 'test'],
'label': [0, 1, 0],
})
def test_no_null_values(sample_data):
"""Critical columns should have no nulls"""
assert sample_data['id'].isnull().sum() == 0
assert sample_data['text'].isnull().sum() == 0
def test_correct_schema(sample_data):
"""Columns and types must match spec"""
expected = ['id', 'text', 'label']
assert list(sample_data.columns) == expected
assert sample_data['id'].dtype == 'int64'
def test_no_duplicates(sample_data):
"""Primary key must be unique"""
assert sample_data['id'].is_unique
def test_label_values_valid(sample_data):
"""Labels must be in valid range"""
assert sample_data['label'].isin([0, 1]).all()
def test_text_length_reasonable(sample_data):
"""Text field should have content"""
assert (sample_data['text'].str.len() > 0).all()
assert (sample_data['text'].str.len() < 10000).all()
def test_load_data_from_file():
"""Integration: test actual data loading"""
df = load_data('data/sample.csv')
assert len(df) > 0
assert 'id' in df.columns
Common Data Issues
| Issue | Detection | Fix |
|---|---|---|
| Missing values | df.isnull().sum() | Drop rows, forward-fill, or use mean |
| Type mismatch | df.dtypes | pd.to_numeric(), astype() |
| Duplicates | df.duplicated().sum() | df.drop_duplicates() |
| Outliers | IQR or z-score | Clip, cap, or remove |
| Encoding issues | Garbled text, unicode errors | df['col'].str.encode('utf-8') |
Phase 2: Unit Testing
Test individual functions with pytest.
What to Unit Test in AI
- Data loaders and preprocessing functions
- Tokenizers and text cleaners
- Feature extraction and embedding functions
- Prompt templates and prompt formatting
- Configuration loading and validation
- Utility functions (parsers, formatters, validators)
- Vector search queries
- Post-processing and output formatting
Project Structure
ai_project/ ├── src/ │ ├── data/ │ │ ├── loader.py │ │ └── preprocessor.py │ ├── models/ │ │ └── embedder.py │ ├── prompts/ │ │ └── templates.py │ └── utils/ │ └── helpers.py ├── tests/ │ ├── test_data_loader.py │ ├── test_preprocessor.py │ ├── test_embedder.py │ └── test_prompts.py ├── requirements.txt ├── requirements-dev.txt └── pytest.ini
Unit Test Examples
import pytest
from pathlib import Path
from src.data.loader import load_csv, clean_text
@pytest.fixture
def test_data_dir(tmp_path):
"""Create temporary test data"""
csv_file = tmp_path / "test.csv"
csv_file.write_text("id,text,label\n1,hello,0\n2,world,1")
return tmp_path
@pytest.fixture
def sample_texts():
return [
"Hello world!",
"Test with spaces",
"Special chars: @#$%",
" leading/trailing ",
]
def test_load_csv_success(test_data_dir):
"""CSV loads correctly"""
df = load_csv(test_data_dir / "test.csv")
assert len(df) == 2
assert list(df.columns) == ['id', 'text', 'label']
def test_load_csv_file_not_found():
"""Missing file raises error"""
with pytest.raises(FileNotFoundError):
load_csv("nonexistent.csv")
def test_clean_text_removes_special_chars(sample_texts):
"""Special characters are removed"""
result = clean_text(sample_texts[2])
assert not any(c in result for c in "@#$%")
def test_clean_text_normalizes_spaces(sample_texts):
"""Multiple spaces become single"""
result = clean_text(sample_texts[1])
assert " " not in result
def test_clean_text_strips_whitespace(sample_texts):
"""Leading/trailing spaces removed"""
result = clean_text(sample_texts[3])
assert not result.startswith(" ")
assert not result.endswith(" ")
import pytest
import numpy as np
from src.data.preprocessor import tokenize, vectorize, normalize_embeddings
@pytest.fixture
def sample_texts():
return ["hello world", "test data", "example text"]
def test_tokenize_basic(sample_texts):
"""Tokenizer splits text correctly"""
tokens = tokenize(sample_texts[0])
assert tokens == ["hello", "world"]
def test_tokenize_empty_string():
"""Empty strings handled gracefully"""
tokens = tokenize("")
assert tokens == []
def test_vectorize_output_shape(sample_texts):
"""Vectorizer returns correct shape"""
vectors = vectorize(sample_texts)
assert vectors.shape == (3, 384) # 3 texts, 384-dim embeddings
def test_vectorize_no_nan(sample_texts):
"""Vectors contain no NaN values"""
vectors = vectorize(sample_texts)
assert not np.isnan(vectors).any()
def test_normalize_embeddings_magnitude():
"""Unit norm after normalization"""
vectors = np.random.randn(5, 384)
normalized = normalize_embeddings(vectors)
norms = np.linalg.norm(normalized, axis=1)
np.testing.assert_array_almost_equal(norms, np.ones(5))
import pytest
from src.prompts.templates import format_rag_prompt, format_system_message
@pytest.fixture
def sample_context():
return {
"query": "What is AI?",
"documents": [
"AI is artificial intelligence",
"Machine learning is a subset of AI"
],
"user_name": "Alice"
}
def test_format_rag_prompt_includes_query(sample_context):
"""Query is included in prompt"""
prompt = format_rag_prompt(**sample_context)
assert sample_context["query"] in prompt
def test_format_rag_prompt_includes_documents(sample_context):
"""Retrieved documents are in prompt"""
prompt = format_rag_prompt(**sample_context)
for doc in sample_context["documents"]:
assert doc in prompt
def test_format_rag_prompt_empty_documents():
"""Empty documents handled gracefully"""
prompt = format_rag_prompt(query="test", documents=[], user_name="Test")
assert len(prompt) > 0
@pytest.mark.parametrize("query", [
"short",
"a" * 5000,
"query with special chars: @#$%",
])
def test_format_rag_prompt_various_queries(query):
"""Handles various query formats"""
prompt = format_rag_prompt(query=query, documents=["doc1"], user_name="Test")
assert query in prompt
Running Pytest
# Run all tests pytest # Run specific file pytest tests/test_data_loader.py # Run specific test pytest tests/test_data_loader.py::test_load_csv_success # Verbose output pytest -v # Coverage report pytest --cov=src --cov-report=html # Run only tests matching pattern pytest -k "test_clean_text" # Stop at first failure pytest -x # Show print statements pytest -s
pytest.ini Configuration
[pytest]
minversion = 7.0
addopts =
-v
--strict-markers
--tb=short
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
markers =
slow: marks tests as slow
integration: marks tests as integration
unit: marks tests as unit
Phase 3: Integration Testing
Test components working together.
What to Test
- RAG pipeline (retrieve → rerank → context building)
- Multi-step workflows (load → process → vectorize → store)
- Database connections and queries
- Vector store operations (upsert, search)
- LLM calls with retries and fallbacks
- Cache functionality
- Async operations and concurrent requests
- Error handling across components
RAG Pipeline Integration Test
import pytest
from src.rag.pipeline import RAGPipeline
@pytest.fixture
def rag_pipeline():
return RAGPipeline(
embedding_model="sentence-transformers/all-MiniLM-L6-v2",
vector_db="chroma",
llm_provider="openai"
)
@pytest.fixture
def test_documents():
return [
{"id": "1", "content": "Python is a programming language"},
{"id": "2", "content": "Python was created in 1989"},
{"id": "3", "content": "Java is another programming language"},
]
@pytest.mark.integration
def test_rag_indexing(rag_pipeline, test_documents):
"""Documents are indexed correctly"""
rag_pipeline.index_documents(test_documents)
assert rag_pipeline.doc_count() == 3
@pytest.mark.integration
def test_rag_retrieval(rag_pipeline, test_documents):
"""Query retrieves relevant documents"""
rag_pipeline.index_documents(test_documents)
results = rag_pipeline.retrieve("What is Python?", top_k=2)
assert len(results) == 2
assert results[0]['score'] > results[1]['score']
assert "Python" in results[0]['content']
@pytest.mark.integration
def test_rag_end_to_end(rag_pipeline, test_documents):
"""Full RAG: index → retrieve → augment → generate"""
rag_pipeline.index_documents(test_documents)
response = rag_pipeline.query(
query="When was Python created?",
use_rag=True
)
assert response['answer'] is not None
assert len(response['sources']) > 0
@pytest.mark.integration
def test_rag_error_handling_invalid_query(rag_pipeline, test_documents):
"""Handles invalid queries"""
rag_pipeline.index_documents(test_documents)
with pytest.raises(ValueError):
rag_pipeline.retrieve("")
with pytest.raises(TypeError):
rag_pipeline.retrieve(None)
Database Integration Test
import pytest
from src.database import Database
@pytest.fixture
def db():
"""In-memory database for testing"""
db = Database(":memory:")
db.create_tables()
yield db
db.close()
@pytest.mark.integration
def test_insert_and_retrieve(db):
"""Data insert and retrieval"""
db.insert_document(doc_id="1", content="Test document", embedding=[0.1, 0.2, 0.3])
result = db.get_document("1")
assert result is not None
assert result['content'] == "Test document"
@pytest.mark.integration
def test_vector_search(db):
"""Vector similarity search"""
db.insert_document("1", "Python programming", [0.1, 0.2, 0.3])
db.insert_document("2", "Java programming", [0.1, 0.2, 0.4])
results = db.search_similar([0.1, 0.2, 0.3], top_k=2)
assert len(results) == 2
assert results[0]['doc_id'] == "1"
@pytest.mark.integration
def test_batch_operations(db):
"""Batch insert and delete"""
docs = [
{"doc_id": "1", "content": "doc1", "embedding": [0.1]},
{"doc_id": "2", "content": "doc2", "embedding": [0.2]},
{"doc_id": "3", "content": "doc3", "embedding": [0.3]},
]
db.batch_insert(docs)
assert db.count() == 3
db.batch_delete(["1", "2"])
assert db.count() == 1
Async Integration Test
import pytest
import asyncio
from src.async_pipeline import AsyncPipeline
@pytest.mark.asyncio
async def test_concurrent_requests():
"""Multiple async requests work concurrently"""
pipeline = AsyncPipeline()
tasks = [
pipeline.process_async("query 1"),
pipeline.process_async("query 2"),
pipeline.process_async("query 3"),
]
results = await asyncio.gather(*tasks)
assert len(results) == 3
assert all(r is not None for r in results)
@pytest.mark.asyncio
async def test_error_handling_async():
"""Async errors propagate correctly"""
pipeline = AsyncPipeline()
with pytest.raises(ValueError):
await pipeline.process_async(None)
Phase 4: Model Testing
Evaluate LLM outputs and AI quality.
What to Test
Output Format
JSON structure, required keys, response schema, token counts and length limits.
Consistency
Same input → similar outputs, determinism at temperature=0, diversity at high temperature.
RAG Quality
Faithfulness, answer relevancy, context precision, context recall via RAGAS metrics.
Embedding Quality
Correct dimensions, unit norm, semantic similarity ordering, no NaN values.
LLM Output Format Tests
import pytest
import json
from src.llm.client import LLMClient
@pytest.fixture
def llm_client():
return LLMClient(provider="openai", model="gpt-4")
def test_json_output_format(llm_client):
"""LLM returns valid JSON when requested"""
response = llm_client.generate(
prompt="Extract entities from: 'John works at Apple'",
response_format="json",
schema={"type": "object", "properties": {
"person": {"type": "string"},
"company": {"type": "string"}
}}
)
data = json.loads(response)
assert data['person'] == "John"
assert data['company'] == "Apple"
def test_structured_output_required_fields(llm_client):
"""All required fields present in output"""
response = llm_client.generate(
prompt="Classify: 'I love this product!'",
response_format="json",
schema={"required": ["sentiment", "confidence", "keywords"]}
)
data = json.loads(response)
assert 'sentiment' in data
assert 'confidence' in data
assert 'keywords' in data
def test_response_within_length_limit(llm_client):
"""Response respects max_tokens"""
response = llm_client.generate(prompt="Summarize in 2 sentences", max_tokens=50)
assert len(response.split()) < 50
@pytest.mark.parametrize("invalid_prompt", [None, "", " " * 100])
def test_invalid_prompts_handled(llm_client, invalid_prompt):
"""Invalid prompts raise appropriate errors"""
with pytest.raises((ValueError, TypeError)):
llm_client.generate(invalid_prompt)
Consistency Testing
import pytest
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from src.llm.client import LLMClient
@pytest.fixture
def llm_client():
return LLMClient(provider="openai", model="gpt-4")
def test_response_consistency(llm_client, embeddings_model):
"""Same prompt produces similar responses"""
prompt = "What is the capital of France?"
responses = [llm_client.generate(prompt, temperature=0.7) for _ in range(3)]
embeddings = embeddings_model.encode(responses)
similarity_matrix = cosine_similarity(embeddings)
avg_similarity = np.mean([similarity_matrix[i][j]
for i in range(3) for j in range(i+1, 3)])
assert avg_similarity > 0.7, "Responses too inconsistent"
def test_deterministic_with_temperature_zero(llm_client):
"""Same prompt with temperature=0 is deterministic"""
prompt = "Spell 'hello'"
responses = [llm_client.generate(prompt, temperature=0) for _ in range(2)]
assert responses[0] == responses[1]
RAG Quality Metrics (RAGAS)
import pytest
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
@pytest.fixture
def rag_eval_data():
return Dataset.from_dict({
"question": ["What is Python?", "When was Python created?"],
"answer": [
"Python is a high-level programming language.",
"Python was created in 1989.",
],
"contexts": [
["Python is a high-level, interpreted programming language.", "Created by Guido van Rossum"],
["Python was first released in 1991.", "Guido van Rossum started the project in 1989."]
],
"ground_truth": [
"Python is a programming language",
"Python was created in 1989 and released in 1991",
]
})
def test_rag_faithfulness(rag_eval_data):
"""Answers grounded in context — target > 0.7"""
score = evaluate(rag_eval_data, metrics=[faithfulness])
assert score['faithfulness'].score > 0.7
def test_rag_answer_relevancy(rag_eval_data):
"""Answers address the question — target > 0.7"""
score = evaluate(rag_eval_data, metrics=[answer_relevancy])
assert score['answer_relevancy'].score > 0.7
def test_rag_context_precision(rag_eval_data):
"""Retrieved context is relevant — target > 0.7"""
score = evaluate(rag_eval_data, metrics=[context_precision])
assert score['context_precision'].score > 0.7
Answer Relevancy: Does the answer actually address the question?
Context Precision: Is the retrieved context relevant to the question?
Context Recall: Does the retrieved context contain the ground truth?
Phase 5: End-to-End Testing
Test complete user workflows from input to output.
What to Test
- Full user workflows (user input → AI response)
- Error scenarios and recovery
- Timeout handling
- Retry logic with exponential backoff
- Fallback mechanisms
- Rate limiting
- Session persistence across requests
- Multi-turn conversations
- File uploads and processing
Full Workflow E2E Test
import pytest
from src.app import App
@pytest.fixture
def app():
return App(env="test")
@pytest.mark.e2e
def test_user_query_happy_path(app):
"""Full: user sends query → gets relevant response"""
session = app.create_session(user_id="test_user")
response = session.ask("What is machine learning?")
assert response['answer'] is not None
assert len(response['answer']) > 0
assert response['sources'] is not None
assert response['latency_ms'] < 5000 # Under 5 seconds
@pytest.mark.e2e
def test_multi_turn_conversation(app):
"""Conversation maintains context across turns"""
session = app.create_session(user_id="test_user")
session.ask("My name is Alice")
response = session.ask("What's my name?")
assert "Alice" in response['answer']
@pytest.mark.e2e
def test_error_recovery(app):
"""App recovers from partial failures gracefully"""
session = app.create_session(user_id="test_user")
# Simulate a query that might stress the system
response = session.ask("a" * 10000) # Very long query
assert response is not None
assert 'error' in response or 'answer' in response
@pytest.mark.e2e
def test_rate_limiting(app):
"""Rate limiting triggers and returns proper error"""
session = app.create_session(user_id="test_user")
responses = [session.ask(f"Query {i}") for i in range(100)]
rate_limited = [r for r in responses if r.get('status') == 429]
assert len(rate_limited) > 0
API E2E with httpx
import pytest
import httpx
BASE_URL = "http://localhost:8000"
@pytest.mark.e2e
def test_health_endpoint():
"""Health check returns 200"""
r = httpx.get(f"{BASE_URL}/health")
assert r.status_code == 200
assert r.json()["status"] == "ok"
@pytest.mark.e2e
def test_query_endpoint():
"""Query endpoint returns structured response"""
payload = {"query": "What is RAG?", "user_id": "test"}
r = httpx.post(f"{BASE_URL}/query", json=payload, timeout=10)
assert r.status_code == 200
data = r.json()
assert "answer" in data
assert "sources" in data
@pytest.mark.e2e
def test_invalid_query_returns_422():
"""Missing required fields → validation error"""
r = httpx.post(f"{BASE_URL}/query", json={})
assert r.status_code == 422
Phase 6: Performance Testing
Measure latency, throughput, and cost.
Key Metrics
Latency
P50, P95, P99 response times. Time-to-first-token (TTFT) for streaming responses.
Throughput
Requests per second under load. Token generation speed (tokens/sec).
Cost
API token usage and estimated USD cost per query. Memory footprint during inference.
Latency Benchmarking
import pytest
import time
import statistics
from src.llm.client import LLMClient
@pytest.fixture
def llm_client():
return LLMClient(provider="openai", model="gpt-4")
@pytest.mark.slow
def test_latency_p95(llm_client):
"""P95 latency under 3 seconds"""
latencies = []
prompt = "Summarize AI in one sentence."
for _ in range(20):
start = time.perf_counter()
llm_client.generate(prompt)
latencies.append(time.perf_counter() - start)
latencies.sort()
p95 = latencies[int(len(latencies) * 0.95)]
print(f"\nP50: {statistics.median(latencies):.2f}s")
print(f"P95: {p95:.2f}s")
assert p95 < 3.0, f"P95 latency {p95:.2f}s exceeds 3s threshold"
@pytest.mark.slow
def test_throughput_under_load(llm_client):
"""At least 5 req/s under concurrent load"""
import concurrent.futures
def single_request():
start = time.perf_counter()
llm_client.generate("What is 2+2?")
return time.perf_counter() - start
n = 20
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as ex:
list(ex.map(lambda _: single_request(), range(n)))
elapsed = time.perf_counter() - start
rps = n / elapsed
print(f"\nThroughput: {rps:.1f} req/s")
assert rps >= 5, f"Throughput {rps:.1f} req/s below 5 req/s threshold"
@pytest.mark.slow
def test_token_cost_tracking(llm_client):
"""Track token usage and estimated cost"""
response = llm_client.generate_with_usage("Explain embeddings briefly.")
usage = response['usage']
cost = (usage['prompt_tokens'] * 0.00003 +
usage['completion_tokens'] * 0.00006)
print(f"\nTokens: {usage['total_tokens']}, Cost: ${cost:.6f}")
assert cost < 0.01, "Single query cost exceeds $0.01"
Memory Profiling
import tracemalloc
from src.models.embedder import EmbeddingModel
def test_embedding_memory_usage():
"""Embedding model stays under 500MB"""
tracemalloc.start()
model = EmbeddingModel()
texts = ["sample text"] * 1000
model.embed_batch(texts)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
peak_mb = peak / 1024 / 1024
print(f"\nPeak memory: {peak_mb:.1f} MB")
assert peak_mb < 500, f"Memory usage {peak_mb:.1f}MB exceeds 500MB"
Debugging & Breakpoints
Using breakpoints for exploration, converting insights to tests.
Breakpoint Workflow
-
1
Exploration
Set breakpoints to inspect state during development. Check variable values, data shapes, intermediate outputs.
-
2
Insight
Identify what's wrong or unexpected. Document the issue with concrete values and conditions.
-
3
Test
Convert your insight into a pytest test that would have caught this. Add to your test suite.
-
4
Fix
Implement the fix. Run the new test to confirm it passes. Run full suite to check for regressions.
Setting Breakpoints
def preprocess_text(text: str) -> str:
breakpoint() # drops into interactive pdb debugger
text = text.lower()
tokens = text.split()
return " ".join(tokens)
# pdb commands:
# n → next line
# s → step into function
# c → continue to next breakpoint
# p variable → print value
# pp variable → pretty-print value
# l → list surrounding code
# q → quit debugger
# Drop into pdb on any failure pytest --pdb # Drop into pdb at the first failure only pytest --pdb -x # Show local variable values in tracebacks pytest --tb=long -v # Run only the failing test, drop into pdb pytest tests/test_data_loader.py::test_load_csv_success --pdb
import logging
logging.basicConfig(
level=logging.DEBUG,
format="%(asctime)s [%(levelname)s] %(name)s — %(message)s"
)
logger = logging.getLogger(__name__)
def embed_documents(docs: list[str]) -> list:
logger.debug(f"Embedding {len(docs)} documents")
embeddings = model.encode(docs)
logger.debug(f"Embedding shape: {embeddings.shape}")
if embeddings.isnan().any():
logger.error("NaN detected in embeddings!")
return embeddings
# pytest.ini — capture logs during tests
# [pytest]
# log_cli = true
# log_cli_level = DEBUG
Common Debugging Patterns for AI
# Check embedding shape and quality
print(f"Shape: {embeddings.shape}, NaN: {np.isnan(embeddings).sum()}, Norm: {np.linalg.norm(embeddings, axis=1).mean():.3f}")
# Inspect retrieval scores
for i, (doc, score) in enumerate(zip(results, scores)):
print(f"[{i}] score={score:.4f} | {doc['content'][:80]}")
# Check prompt token count
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode(prompt)
print(f"Prompt tokens: {len(tokens)}")
# Inspect dataframe at a stage
print(df.describe())
print(df.dtypes)
print(df.isnull().sum())
Complete Testing Checklist
Use this before going to production.
Data Validation Phase
Unit Testing Phase
Integration Testing Phase
Model Testing Phase
End-to-End & Performance Phase
Tools & Resources
Essential testing libraries for AI projects.
Core Libraries
| Tool | Purpose | Install |
|---|---|---|
| pytest | Testing framework | pip install pytest |
| pytest-cov | Coverage reporting | pip install pytest-cov |
| pytest-asyncio | Async test support | pip install pytest-asyncio |
| RAGAS | RAG quality evaluation | pip install ragas |
| LangSmith | LLM tracing & observability | pip install langsmith |
| deepeval | LLM evaluation framework | pip install deepeval |
| httpx | API E2E testing | pip install httpx |
| faker | Generate test data | pip install faker |
Requirements Files
# Testing pytest>=7.0 pytest-cov>=4.0 pytest-asyncio>=0.21 pytest-mock>=3.10 # AI Evaluation ragas>=0.1 deepeval>=0.20 # API Testing httpx>=0.25 # Data Generation faker>=19.0 # Profiling memory-profiler>=0.61 py-spy>=0.3
CI/CD Integration
name: AI Project Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install dependencies
run: pip install -r requirements-dev.txt
- name: Run unit tests
run: pytest -m "not integration and not slow" -v --cov=src
- name: Run integration tests
run: pytest -m integration -v
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- name: Upload coverage
uses: codecov/codecov-action@v4