01 Single Server Setup (All-in-One)
Everything running on one machine: web app, database, cache.
Problem 1.1
Single Point of Failure (SPOF)
Server goes down → entire website/app becomes inaccessible.
✓ No immediate solution (accept risk in MVP)
At single-server stage, focus on feature velocity. Monitor uptime and treat failures as learning.
Contingency: Set up automated backups to external storage.
Risk Management | MVP Stage
Problem 1.2
Traffic Bottleneck & Scaling Limit
As traffic grows, single server cannot handle simultaneous users. Response times degrade, requests fail.
✓ Vertical Scaling (Temporary)
Add CPU, RAM, and disk storage to the single server.
Cost-effective short-term solution (~weeks to months).
Buys time before reaching hardware limits (max ~256 GB RAM, 64 vCPUs on most commodity hardware).
Cost-Effective | Quick Implementation
Problem 1.3
No Redundancy or Backup
Data loss risk if hardware fails. No way to serve users while recovering.
✓ Automated Backups + Manual Recovery
Set up daily/hourly backups to external storage (S3, backup service).
Document recovery procedure for manual failover.
Recovery time: hours to days (RTO) — not acceptable for production.
Data Protection | Manual Failover
02a Separate Web Tier from Data Tier
Move database to its own server, web traffic to another.
Problem 2.1
Web & Database Compete for Resources
Heavy database queries slow down web request handling; web traffic spikes starve the DB.
✓ Independent Tier Separation
Move database server to separate machine (10.0.0.100).
Web server handles requests on 10.0.0.1, database serves data independently.
Can scale each tier independently: add CPU to web without touching DB, and vice versa.
Resource Isolation | Independent Scaling
Problem 2.2
Both Tiers Still Have SPOF
Web server down → no requests served; DB server down → no data access.
✓ Each Tier Gets Failover in Next Stages
Web tier: Add load balancer + multiple web servers (Stage 3).
DB tier: Add database replication — master/slave (Stage 4).
Roadmap
02b Choose Database Type
SQL vs NoSQL: structural vs performance trade-off.
Decision Point
Which Database Fits Your Data Model?
Wrong database choice = costly migration later.
✓ Use SQL (Relational) If:
Data is structured (users, posts, comments with relationships).
Need ACID transactions (financial data, orders).
Need JOIN operations across tables.
Data model is known upfront.
Examples: MySQL, PostgreSQL, Oracle.
Reliability | Consistency
✓ Use NoSQL If:
Data is unstructured or semi-structured (logs, events, documents).
Need super-low latency (<10 ms response).
Horizontal scaling is critical (millions of records/sec).
Data model evolves rapidly (schema flexibility).
Examples: MongoDB, DynamoDB, Cassandra, Redis.
Performance | Scalability
02c Choose Scaling Approach
Vertical vs Horizontal scaling strategy.
Problem 2.3
Single Server Cannot Scale Beyond Hardware Limits
Max CPU, RAM, disk on commodity hardware ≈ 256 GB RAM, 64 vCPUs. Millions of users need more.
✓ Vertical Scaling (Scale Up) — Short-term
Add more CPU, RAM, disk to single server.
Cheap, simple, works for moderate growth.
BUT: hardware limits still apply (the largest AWS instances offer hundreds of vCPUs and terabytes of RAM, not unlimited capacity); cost rises much faster than capacity; still a SPOF.
Quick | Simple
✓ Horizontal Scaling (Scale Out) — Long-term
Add multiple servers (10.0.0.1, 10.0.0.2, 10.0.0.3…).
Near-unlimited scalability: keep adding servers as needed, as long as shared dependencies (load balancer, database) keep up.
Provides redundancy: if one server fails, others still serve traffic.
Requires load balancer to distribute requests.
Unlimited Scale | Redundancy | Recommended
03 Multiple Web Servers (Horizontal Scaling)
Added multiple web servers (10.0.0.1, 10.0.0.2, 10.0.0.3).
Problem 3.1
How Do Users Know Which Server to Connect To?
With 3 web servers, users need a single entry point. Which IP do they use?
✓ Load Balancer (Public IP 88.88.88.1)
Single public IP that users connect to.
Load balancer routes requests to private IPs using round-robin, least-connections, or other algorithms.
Users never see private IPs; load balancer handles routing transparently.
Single Entry Point | Abstraction
Problem 3.2
Traffic Not Evenly Distributed
Without a load balancer, all users might hit Server 1, leaving Servers 2 & 3 idle. Server 1 overloads.
✓ Load Balancer Distributes Traffic Evenly
Round-robin: Server 1, then 2, then 3, then repeat.
Least-connections: route to server with fewest active connections.
IP hash: same client always routes to same server (if sticky sessions needed).
Load Distribution
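The three strategies above can be sketched in a few lines of Python. This is a toy illustration, not how a production balancer is built: the `Balancer` class and its method names are hypothetical, and the server list reuses the private IPs from this stage.

```python
import hashlib
import itertools

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # private IPs from this stage


class Balancer:
    """Toy load balancer (hypothetical helper, for illustration only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = itertools.cycle(self.servers)
        # Active connection count per server, used by least-connections.
        self.active = {s: 0 for s in self.servers}

    def round_robin(self):
        # Server 1, then 2, then 3, then repeat.
        return next(self._cycle)

    def least_connections(self):
        # Pick the server with the fewest active connections.
        return min(self.servers, key=lambda s: self.active[s])

    def ip_hash(self, client_ip):
        # A stable hash, so the same client always maps to the same server.
        digest = hashlib.sha256(client_ip.encode()).digest()
        return self.servers[int.from_bytes(digest[:8], "big") % len(self.servers)]


lb = Balancer(SERVERS)
rr = [lb.round_robin() for _ in range(4)]
lb.active.update({"10.0.0.1": 5, "10.0.0.2": 1, "10.0.0.3": 3})
```

Note that IP hash only stays stable while the server list is unchanged; adding or removing a server changes the modulus and remaps clients.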
Problem 3.3
No Failover If One Server Goes Down
Server 1 crashes → load balancer still tries to route to it → requests time out or fail.
✓ Load Balancer Health Checks
Balancer pings each server every 10 seconds: "Are you alive?"
If Server 1 doesn't respond, balancer marks it DOWN and removes it from rotation.
Requests now route only to Servers 2 & 3; users don't see the outage.
Ops team can add a replacement Server 4; once it is registered and passes health checks, the balancer adds it to rotation.
Automatic Failover | High Availability
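A rough sketch of the health-check bookkeeping, assuming the balancer simply counts failed probes per server. `HealthCheckedPool`, `max_failures`, and the address 10.0.0.4 for the replacement server are illustrative assumptions; the actual probe (an HTTP or TCP check every 10 seconds) is left out.

```python
class HealthCheckedPool:
    """Tracks which servers are in rotation (hypothetical helper)."""

    def __init__(self, servers, max_failures=1):
        self.failures = {s: 0 for s in servers}
        self.max_failures = max_failures

    def record_check(self, server, alive):
        # A failed probe counts against the server; a success resets it.
        # Unknown servers (e.g. a replacement) are registered on first probe.
        self.failures[server] = 0 if alive else self.failures.get(server, 0) + 1

    def in_rotation(self):
        # Only servers below the failure threshold receive traffic.
        return [s for s, f in self.failures.items() if f < self.max_failures]


pool = HealthCheckedPool(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
pool.record_check("10.0.0.1", alive=False)   # Server 1 stops responding
after_crash = pool.in_rotation()
pool.record_check("10.0.0.4", alive=True)    # replacement Server 4 passes a check
after_replacement = pool.in_rotation()
```

Real balancers usually require several consecutive failures before marking a server DOWN, which is what a larger `max_failures` would model.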
Problem 3.4
Load Balancer Itself Is a SPOF
Load balancer goes down → users can't reach any web server (even if all 3 are healthy).
✓ Redundant Load Balancers
Deploy 2+ load balancers in active-passive or active-active config.
DNS points to primary; if primary fails, DNS fails over to secondary.
Or: use managed load balancer service (AWS ALB, GCP LB) — provider handles redundancy.
High Availability
04 Database Replication (Master-Slave)
While the web tier has redundancy, the database is still a single point of failure.
Problem 4.1
Master Database Is a SPOF
Master DB crashes → no reads, no writes → entire app becomes unusable (even if web servers are healthy).
✓ Master-Slave Database Replication
Master DB (10.0.0.100): handles ALL writes, updates, deletes.
Slave DB(s) (10.0.0.101, 10.0.0.102): receive copies of data from master in real-time.
Web servers send reads to any slave, writes to master only.
If master fails: promote one slave to become new master.
Replication lag: typically <100 ms; eventual consistency.
High Availability | Read Scaling
Problem 4.2
Master Promotion Is Manual & Error-Prone
When master fails, ops engineer must manually SSH, run promotion scripts, update configs. Takes 15+ minutes. Data loss possible.
✓ Automated Failover (Advanced)
Tools: MGR (MySQL Group Replication), Patroni (PostgreSQL), etc.
Automatically detect master failure, promote best slave, update all clients.
Failover time: <30 seconds; minimises data loss.
Production-Grade
Problem 4.3
Slave Data Might Be Out of Sync with Master
Replication lag: master gets an update, but slave takes 500 ms to apply it. Web server reads slave → gets stale data.
✓ Accept Eventual Consistency (or use stronger guarantees)
For most apps, eventual consistency is fine: <100 ms lag is imperceptible.
Critical reads (financial, current user profile): route to MASTER only.
Less critical reads (user feed, analytics): read from SLAVE (stale is OK).
Pragmatic Trade-off
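The routing rule above can be expressed as a small function. This is a sketch under assumptions: `pick_database` and its `critical` flag are hypothetical names, and a real application would normally get this from its database driver or a proxy rather than hand-rolled code.

```python
import itertools

MASTER = "10.0.0.100"
REPLICAS = ["10.0.0.101", "10.0.0.102"]
_replica_cycle = itertools.cycle(REPLICAS)


def pick_database(is_write, critical=False):
    """Return the DB host a query should go to (hypothetical router)."""
    if is_write or critical:
        # Writes, and reads that must not be stale, go to the master.
        return MASTER
    # Other reads tolerate replication lag and are spread over the replicas.
    return next(_replica_cycle)
```

For example, `pick_database(is_write=False, critical=True)` returns the master for a balance lookup, while a feed query with `pick_database(is_write=False)` alternates between the two replicas.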
05 Cache Layer (Redis / Memcached)
Database is now highly available, but still bottlenecked by repeated read requests.
Problem 5.1
Database Is Hit by Thousands of Repeated Read Queries
3 web servers × 100 users each, at one query per user per second = 300 largely identical DB queries/sec → DB CPU maxes out, latency spikes.
✓ Cache Layer (Redis / Memcached at 10.0.0.50)
Before hitting DB, web server checks cache: "Is this data in memory?"
Cache hit (data found): return in <1 ms.
Cache miss (data not found): query DB, store result in cache with TTL, return.
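Typical hit rate: 90%+ → 90% of reads never reach the database.
Result: DB read load drops about 10×; cache hits return in <1 ms, often 10-100× faster than a DB query. Average latency improves less, because misses still pay the full DB cost.
10-100× Faster | DB Load Reduction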
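The hit/miss flow described above is commonly called cache-aside. A minimal sketch follows, using an in-process dictionary as a stand-in for Redis/Memcached; `TTLCache`, `query_db`, and `get_user` are hypothetical names, and the 60-second TTL is an arbitrary choice.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for Redis/Memcached (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, time.monotonic() + ttl)


db_queries = 0


def query_db(user_id):
    # Hypothetical slow database call; counts how often it is hit.
    global db_queries
    db_queries += 1
    return {"id": user_id, "name": f"user-{user_id}"}


cache = TTLCache()


def get_user(user_id):
    key = f"user:{user_id}"
    user = cache.get(key)              # 1. check the cache first
    if user is None:                   # 2. miss: fall back to the DB
        user = query_db(user_id)
        cache.set(key, user, ttl=60)   # 3. store the result with a TTL
    return user


results = [get_user(123) for _ in range(10)]
```

Ten reads of the same user reach the database once here, which is the effect the hit-rate figure above describes.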
Problem 5.2
Cache Holds Stale Data
User updates profile → written to DB. But cache still has old data for 30 minutes → next read gets stale profile.
✓ TTL (Time-To-Live) Policy
Cache all data with expiration: cache.set("user:123", {...}, ttl=60)
After 60 s, cache entry expires automatically; next read hits DB for fresh data.
Invalidation: proactively delete from cache on write: cache.delete("user:123")
Freshness Guarantee
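Invalidation on write can be sketched as follows, with plain dictionaries standing in for the cache server and the database. The function names are hypothetical; the order shown (write the database first, then delete the cached copy) is one common choice, not the only one.

```python
cache = {"user:123": {"name": "Old Name"}}   # stand-in for the cache server
db = {123: {"name": "Old Name"}}             # stand-in for the database


def update_user(user_id, fields):
    # Write to the source of truth first...
    db[user_id].update(fields)
    # ...then drop the cached copy so the next read repopulates it.
    cache.pop(f"user:{user_id}", None)


def read_user(user_id):
    key = f"user:{user_id}"
    if key not in cache:
        cache[key] = dict(db[user_id])   # cache miss: reload from the DB
    return cache[key]


update_user(123, {"name": "New Name"})
fresh = read_user(123)
```

The TTL remains useful as a safety net for cases where the delete is missed, for example if the process crashes between the two steps.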
Problem 5.3
Cache Is a SPOF
Cache server goes down → all requests become cache misses → DB gets hammered → cascading failure ("thundering herd").
✓ Redundant Cache Servers (Cluster)
Deploy cache as cluster (2–3 replicas) instead of single instance.
If one node fails, cluster routes to others.
Overprovision capacity: cache 10 GB data, allocate 15 GB (50% headroom).
High Availability | Resilience
Problem 5.4
Cache Memory Is Full
Cache can hold only ~16 GB data. After reaching 16 GB, where do new entries go?
✓ Cache Eviction Policy
LRU (Least-Recently-Used): evict the entry that has gone longest without being accessed.
LFU (Least-Frequently-Used): evict the entry with the fewest accesses.
FIFO (First-In-First-Out): evict the entry that was inserted first, regardless of use.
Most common: LRU (recently used data stays, cold data gets evicted).
Memory Management
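LRU is simple to sketch with Python's `collections.OrderedDict`. The `LRUCache` class below is illustrative and counts capacity in entries rather than bytes; cache servers such as Redis apply their own (approximate) variants of these policies.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)   # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used

    def keys(self):
        return list(self._items)


lru = LRUCache(capacity=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")        # "a" is now more recently used than "b"
lru.put("c", 3)     # cache is full, so "b" is evicted
```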
Considerations for Using Cache
When to use cache. Consider using cache when data is read frequently but modified infrequently. Since cached data is stored in volatile memory, a cache server is not ideal for persisting data. If a cache server restarts, all data in memory is lost. Important data should always be saved in persistent data stores.
Expiration policy. It is good practice to implement an expiration policy. Once cached data expires, it is removed from the cache. Without an expiration policy, cached data remains in memory until it is evicted or the server restarts. The expiration time should be neither too short (causes frequent reloads from the database) nor too long (data can become stale).
Consistency. Keeping the data store and the cache in sync is critical. Inconsistency can occur because data-modifying operations on the data store and cache are not in a single transaction. When scaling across multiple regions, maintaining consistency is especially challenging. For further details, refer to "Scaling Memcache at Facebook" published by Facebook.
Mitigating failures. A single cache server represents a potential single point of failure (SPOF) — a part of a system that, if it fails, will stop the entire system from working. Multiple cache servers across different data centres are recommended to avoid SPOF. Another approach is to overprovision the required memory by a certain percentage, providing a buffer as memory usage increases.
Eviction policy. Once the cache is full, any request to add items may cause existing items to be removed (cache eviction). Least-Recently-Used (LRU) is the most popular policy. Other policies such as Least Frequently Used (LFU) or First In First Out (FIFO) can be adopted to satisfy different use cases.
Best Practices | Cache Design | Production Readiness
06 Content Delivery Network (CDN)
Database and app are optimised, but static assets (JS, CSS, images) are still served from origin.
Problem 6.1
Static Assets Downloaded from Origin Every Time
User in London requests logo.jpg from a US server → ~120 ms round trip per request; multiplied across thousands of users, the origin's bandwidth and connections saturate.
✓ CDN at the Edge
CDN is a global network of servers physically located near users.
Static assets cached on CDN edges near users.
User in London → served from CDN UK edge (30 ms) instead of US origin (120 ms).
4× Faster | Global Reach
Problem 6.2
CDN Serves Stale Assets
Release new logo.jpg but CDN edges still serve old cached version for hours.
✓ Cache Invalidation + Versioning
Set TTL on CDN: cache static assets for 1 hour, then re-fetch from origin.
Use versioning in URL: logo.jpg?v=2 → new cache key, CDN fetches fresh.
Purge CDN manually via API when deploying new assets.
Asset Freshness
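The text bumps the version by hand (`?v=2`). A common variant, shown here as an assumption rather than the notes' own method, derives the version from the file's content so that it changes whenever the file does. `versioned_url` and the `/static/` path are hypothetical.

```python
import hashlib


def versioned_url(path, content):
    """Append a short content hash so a changed file gets a new cache key."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{path}?v={digest}"


old_url = versioned_url("/static/logo.jpg", b"old logo bytes")
new_url = versioned_url("/static/logo.jpg", b"new logo bytes")
```

Because `old_url` and `new_url` differ, CDN edges treat the new logo as a different object and fetch it from origin, while unchanged files keep their cached URL.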
Problem 6.3
CDN Outage Breaks Static Assets
CDN provider has a regional outage → users can't load CSS/JS → site is broken.
✓ CDN Fallback (Origin Fallback)
HTML includes fallback: if CDN request fails, browser requests asset from origin.
Ensures graceful degradation: worse latency, but site stays functional.
Resilience
07 Stateless Web Tier
Web tier is scaled horizontally, but session state makes servers interdependent.
Problem 7.1
Session State Couples Servers to Users (Sticky Sessions)
User A logs in on Server 1 → session stored in Server 1 memory. Load balancer MUST always route User A to Server 1, or auth fails.
✓ Move Session State to a Shared Store
Store session data in Redis/Memcached or database, NOT in server memory.
Load balancer can route User A to ANY server; all servers fetch session from Redis.
Servers become stateless: can be killed/restarted without losing user sessions.
Auto-scaling now works: spin up 10 servers, kill 5, users don't notice.
Stateless | Auto-Scaling
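A minimal sketch of the idea: the servers keep no session data of their own and look everything up in a shared store. Here a module-level dictionary stands in for Redis, and `WebServer`, `login`, and `whoami` are hypothetical names.

```python
import uuid

session_store = {}   # stand-in for a shared Redis instance


class WebServer:
    def __init__(self, name):
        self.name = name   # note: no per-server session dictionary

    def login(self, user):
        session_id = uuid.uuid4().hex
        session_store[session_id] = {"user": user}
        return session_id

    def whoami(self, session_id):
        session = session_store.get(session_id)
        return session["user"] if session else None


server1, server2 = WebServer("server-1"), WebServer("server-2")
sid = server1.login("user-a")
del server1                                   # Server 1 is killed
user_seen_by_server2 = server2.whoami(sid)    # session survives on Server 2
```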
Problem 7.2
Cannot Add/Remove Servers Without Breaking Users
When session is tied to Server 1, removing Server 1 logs out all its users.
✓ Auto-Scaling Becomes Possible
With sessions in Redis, can spin up/down servers freely without user impact.
Monitor CPU/memory; if >80%, add servers; if <20%, remove servers.
Tools: Kubernetes auto-scaler, AWS auto-scaling groups.
Elasticity | Cost Efficiency
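The threshold rule above can be written as a small decision function. This is a sketch: `desired_servers` is a hypothetical name, the one-server step size and the `minimum`/`maximum` bounds are assumptions, and real autoscalers add cooldown periods to avoid flapping.

```python
def desired_servers(current, cpu_percent, minimum=2, maximum=20):
    """Threshold rule from the text: >80% add a server, <20% remove one."""
    if cpu_percent > 80:
        target = current + 1
    elif cpu_percent < 20:
        target = current - 1
    else:
        target = current
    # Clamp so the tier never scales to zero or grows without bound.
    return max(minimum, min(maximum, target))
```

With three servers at 90% CPU this returns 4; at 10% CPU it returns 2, and it will not go below the assumed minimum of 2.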
08 Multi-Data Center Setup
Single data center is now a regional SPOF; global users have high latency.
Problem 8.1
Single Data Center Failure = Total Outage
Fire in US-East data center → all servers, DBs, cache offline → millions of users unreachable.
✓ Deploy Redundant Data Centers (DC1 + DC2)
Deploy identical infrastructure in US-East (DC1) and US-West (DC2).
Each DC has: web servers, cache, master DB, workers, etc.
If US-East fails, GeoDNS routes 100% of traffic to DC2. RTO: roughly the health-check interval plus the DNS TTL (<30 s only with a very short TTL).
Disaster Recovery | High Availability
Problem 8.2
High Latency for Users Far from Data Center
User in Australia requests from US-only server → 200 ms latency (vs 30 ms from local edge).
✓ GeoDNS Routing + Multiple DCs
GeoDNS resolves the domain based on user location: users nearest US-East → DC1, users nearest US-West → DC2.
Route to nearest DC → lower latency, better user experience.
Low Latency | Global Reach
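Nearest-DC routing and failover (Problems 8.1 and 8.2) can be combined in one resolution function. This is a sketch under assumptions: the region names, the `NEAREST` mapping, and the default to DC1 for unknown regions are all illustrative, and a managed GeoDNS service would do this based on health checks it runs itself.

```python
DATA_CENTERS = {"DC1": "us-east", "DC2": "us-west"}
# Hypothetical mapping from a user's region to the preferred data center.
NEAREST = {"us-east": "DC1", "us-west": "DC2"}


def resolve(user_region, healthy):
    """Return the data center a user's DNS query should resolve to."""
    preferred = NEAREST.get(user_region, "DC1")   # assumed default
    if preferred in healthy:
        return preferred
    # Preferred DC is down: fail over to any healthy one.
    for dc in DATA_CENTERS:
        if dc in healthy:
            return dc
    raise RuntimeError("no healthy data center")
```

For example, `resolve("us-east", {"DC2"})` returns "DC2" while DC1 is down, and goes back to "DC1" once it is healthy again.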
Problem 8.3
Data Inconsistency Across DCs
User updates profile in DC1 → replicated to DC2 asynchronously (100–500 ms lag) → user logs in from DC2, sees old profile.
✓ Asynchronous Multi-DC Replication
Master DB in each DC handles local writes; replicates to other DCs asynchronously.
Eventual consistency: within seconds/minutes, all DCs converge to same state.
For critical data (financial): stick to single DC for writes.
Eventual Consistency
Problem 8.4
Complex Operational Burden
Deploying to 2 DCs means 2× infrastructure, 2× monitoring, 2× backups, 2× troubleshooting.
✓ Infrastructure as Code (IaC) + Automation
Terraform/Bicep: define entire DC infrastructure once, deploy to both regions with one command.
Centralised monitoring: one dashboard showing both DCs' health.
Operational Simplicity
REF Correct End-to-End Architecture
User → DNS / Geo-routing (Route53 / Cloudflare) → CDN (optional cache) → Load Balancer → Stateless Servers (API layer) → Cache (Redis) → Databases / Storage (SQL, NoSQL, S3)

Key Corrections

Statement | Status | Correction
LB distributes across servers | Correct |
LB distributes across databases | ⚠️ | Usually via DB proxies, not LB
LB chooses data center | Incorrect | DNS / geo-routing does this
Data centers store data | Incomplete | Also host compute + networking
Servers host backend/frontend | Correct |

Advanced Note — Multi-Region Production Systems

Layer 1: DNS decides region
Layer 2: LB distributes within region
Layer 3: DB replication handles data locality

Final Takeaway

Load Balancer: intra-region traffic distribution (servers)
DNS / CDN: inter-region routing (data centers)
Data Center: full stack (compute + storage)
Servers: execution layer (APIs, apps, jobs)
09 Message Queue & Async Workers
Web servers handling long-running tasks (photo processing, email) block user requests.
Problem 9.1
Long-Running Tasks Block User Requests
User uploads photo → web server processes it (crop, resize, blur = 10 s) → request blocked → user sees timeout.
✓ Message Queue (RabbitMQ / Kafka at 10.0.0.60)
Web server publishes task to queue: queue.publish("photo_process", {photo_id: 123})
Returns immediately: "Your photo is processing" (200 ms response).
Worker picks up task from queue asynchronously and processes offline.
Non-Blocking | Better UX
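The publish-and-return pattern can be sketched with Python's standard `queue` and `threading` modules standing in for RabbitMQ/Kafka and a separate worker process. `handle_upload` and `worker` are hypothetical names, and the actual image processing is reduced to recording the photo ID.

```python
import queue
import threading

tasks = queue.Queue()      # stand-in for RabbitMQ/Kafka
processed = []


def handle_upload(photo_id):
    # Web server: enqueue the work and return right away.
    tasks.put({"type": "photo_process", "photo_id": photo_id})
    return "Your photo is processing"


def worker():
    while True:
        task = tasks.get()
        if task is None:            # sentinel: shut the worker down
            tasks.task_done()
            break
        processed.append(task["photo_id"])   # real work (crop, resize) goes here
        tasks.task_done()


t = threading.Thread(target=worker)
t.start()
response = handle_upload(123)
tasks.put(None)
tasks.join()    # wait until the worker has drained the queue
t.join()
```

The web handler's response does not depend on how long the worker takes, which is the point of moving the work behind a queue.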
Problem 9.2
Cannot Scale Processing Tasks Independently
10,000 photo uploads → should spin up 100 workers, each handling 100 photos. Not 10,000 servers.
✓ Decouple Web Tier from Worker Tier
Web servers only handle requests and publish to queue.
Workers independently consume from queue and process photos.
If queue backlog grows, spin up more workers. If queue empties, kill workers.
Web and worker tiers scale independently.
Independent Scaling | Cost Efficient
Problem 9.3
Worker Crashes Lose Tasks
Worker picks up task, starts processing, then crashes → task lost, photo never processed.
✓ Durable Message Queue + Acknowledgments
Queue persists all messages to disk before worker processes them.
If crash before ACK, queue re-delivers to another worker.
Guarantees: "at least once" or "exactly once" (more complex, slower).
Durability | Fault Tolerance
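Acknowledgment-based redelivery can be sketched with a toy in-memory queue. `AckQueue` and its methods are hypothetical and keep everything in memory; a real broker persists messages to disk and detects dead workers through connection loss or timeouts.

```python
from collections import deque


class AckQueue:
    """Toy at-least-once queue (hypothetical; real brokers persist to disk)."""

    def __init__(self):
        self._ready = deque()
        self._unacked = {}
        self._next_tag = 0

    def publish(self, message):
        self._ready.append(message)

    def deliver(self):
        message = self._ready.popleft()
        self._next_tag += 1
        self._unacked[self._next_tag] = message   # held until acknowledged
        return self._next_tag, message

    def ack(self, tag):
        del self._unacked[tag]

    def requeue_unacked(self):
        # Called when a worker dies: its in-flight messages become ready again.
        self._ready.extend(self._unacked.values())
        self._unacked.clear()


q = AckQueue()
q.publish({"photo_id": 123})
tag, msg = q.deliver()      # worker A takes the task, then crashes before ACK
q.requeue_unacked()
tag2, msg2 = q.deliver()    # worker B receives the same task
q.ack(tag2)
```

Because the same task can be delivered twice under at-least-once delivery, workers are usually written so that processing a task a second time is harmless.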
Problem 9.4
Queue Server Is a SPOF
Queue crashes → all task publishing fails → web servers can't offload → requests time out.
✓ Distributed Queue Cluster
Deploy queue as 3-node cluster (RabbitMQ cluster, Kafka brokers).
If one node fails, cluster rebalances; publishers/consumers reconnect to other nodes.
High Availability
10 Database Sharding (Horizontal Scaling)
Database has grown to massive scale; single master can't handle read/write volume.
Problem 10.1
Single Master Database Hits Scalability Ceiling
Millions of concurrent users, billions of rows → single master can't handle the throughput; queries take 5 s.
✓ Vertical Scaling (Temporary, ~weeks)
Add massive CPU/RAM to single server: 512 GB RAM, 96 vCPUs.
Works for medium scale (~100 M records).
BUT: cost rises steeply; hardware limits still apply; still a SPOF.
Quick Fix
✓ Horizontal Database Scaling = Sharding
Partition data across multiple databases (Shard 0, 1, 2, 3).
Sharding key: user_id. Hash function: user_id % 4.
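Each shard holds about 1/4 of the total data, so a single-shard query works on roughly 1/4 as many rows as the monolith.
More shards can be added as data grows, though with plain modulo hashing a change in shard count remaps most keys (see resharding under Problem 10.2).
Linear Scaling | Unlimited Growth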
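A short sketch of the `user_id % 4` rule, plus a check of what happens when a fifth shard is added. `shard_for` is a hypothetical helper and the 10,000 sequential user IDs are made-up sample data.

```python
NUM_SHARDS = 4


def shard_for(user_id, num_shards=NUM_SHARDS):
    # Sharding key: user_id. Hash function: user_id % num_shards.
    return user_id % num_shards


# How many of 10,000 users land on a different shard if a 5th shard is added?
user_ids = range(10_000)
moved = sum(1 for u in user_ids if shard_for(u, 4) != shard_for(u, 5))
moved_fraction = moved / len(user_ids)
```

With this sample, `moved_fraction` comes out at 0.8: going from 4 to 5 shards relocates 80% of users under plain modulo hashing. This is why schemes such as consistent hashing, or splitting an existing shard in two, are often preferred when the shard count has to change.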
Problem 10.2
Sharding Introduces Complexity
Cannot JOIN across shards; resharding when data grows; hotspot keys overload one shard.
✓ Accept Complexity or Mitigate
Avoid cross-shard JOINs: denormalise data (duplicate in each shard).
Resharding: split Shard 0 into 0a, 0b; move half the data.
Hotspot keys: split that shard further, allocate dedicated replicas.
Planned Complexity
Problem 10.3
Choosing Wrong Sharding Key Is Fatal
Shard by country → US shard has 100 B rows, EU shard has 1 B rows → uneven, one shard overloaded.
✓ Choose Sharding Key Carefully
Shard key must be: (1) immutable, (2) even distribution, (3) queries mostly within shard.
Good: user_id (every user unique, query by user common).
Bad: timestamp (all new data goes to the latest shard, which becomes a hotspot while old shards sit idle).
Design Decision
11 Monitoring, Logging & Cost Tracking
System is complex (2 DCs, multiple DBs, caches, queues, workers); hard to debug issues at scale.
Problem 11.1
Cannot Debug Production Issues
3 web servers, 2 DBs, 1 cache, 1 queue, 3 workers across 2 DCs. User reports "photo not processing." Where is the bug?
✓ Centralised Logging (ELK, Splunk, CloudWatch)
Every component logs to central location.
Search all logs by user_id: see every request, DB query, worker task.
Alerts: if error rate > 5%, page on-call engineer.
Observability
Problem 11.2
No Visibility into System Health
Is the system healthy? Are users happy? Is the DB slow? No idea until customers complain.
✓ Metrics & Dashboards (Prometheus, Grafana, Datadog)
Collect metrics: CPU, memory, disk, request latency, error rate, queue size, DB query time.
Dashboard shows: latency, queue backlog, cache hit rate — spot problems before users notice.
Proactive Monitoring
Problem 11.3
Cloud Costs Are Spiralling
2 DCs × 10 servers + DB + cache + queue + worker tier = $50K/month. Is this efficient?
✓ Cost Tracking & Automation
Tag every resource by service: "photo_service", "payment_service", "worker_tier".
Track: cost per service, cost per user, cost per transaction.
Alerts: if service cost > budget, shut down unused resources.
Tools: LangSmith for LLM application tracing and token costs; Kubecost for Kubernetes; CloudHealth for AWS and other clouds.
Financial Control
Problem 11.4
Deployments Are Manual and Error-Prone
Deploy new code: SSH to 6 servers, restart services, hope nothing breaks. Takes 30 minutes.
✓ CI/CD Automation
Continuous Integration: every commit runs tests automatically.
Continuous Deployment: tests pass → deploy to staging → approve → auto-deploy to production.
Result: deploy in 2 minutes with no manual steps beyond the approval; roll back if needed.
Velocity | Reliability
Complete Problems & Solutions Flowchart
[Figure: System Design: Problems & Solutions Flowchart. Complete flowchart showing all scaling problems and their solutions, from single server to millions of users, covering Stages 1 to 11. Legend: Problem (red), Solution (green). Stats: 11 Stages | 20+ Problems | 20+ Solutions.]