System Design: Problems & Solutions

01 Single Server Setup (All-in-One)

Everything running on one machine: web app, database, cache.

Problem 1.1

Single Point of Failure (SPOF)

Server goes down → entire website/app becomes inaccessible.

✓ No immediate solution (accept risk in MVP)

At single-server stage, focus on feature velocity. Monitor uptime and treat failures as learning.

Contingency: Set up automated backups to external storage.

Risk Management · MVP Stage

Problem 1.2

Traffic Bottleneck & Scaling Limit

As traffic grows, single server cannot handle simultaneous users. Response times degrade, requests fail.

✓ Vertical Scaling (Temporary)

Add CPU, RAM, and disk storage to the single server.

Cost-effective short-term solution (~weeks to months).

Buys time before reaching hardware limits (the exact ceiling depends on the machine; roughly 256 GB RAM and 64 vCPUs on a typical commodity server).

Cost-Effective · Quick Implementation

Problem 1.3

No Redundancy or Backup

Data loss risk if hardware fails. No way to serve users while recovering.

✓ Automated Backups + Manual Recovery

Set up daily/hourly backups to external storage (S3, backup service).

Document recovery procedure for manual failover.

Recovery time (RTO): hours to days, which is not acceptable for production.

Data Protection · Manual Failover

02a Separate Web Tier from Data Tier

Move database to its own server, web traffic to another.

Problem 2.1

Web & Database Compete for Resources

Heavy database queries slow down web request handling; web traffic spikes starve the DB.

✓ Independent Tier Separation

Move database server to separate machine (10.0.0.100).

Web server handles requests on 10.0.0.1, database serves data independently.

Can scale each tier independently: add CPU to web without touching DB, and vice versa.

Resource Isolation · Independent Scaling

Problem 2.2

Both Tiers Still Have SPOF

Web server down → no requests served; DB server down → no data access.

✓ Each Tier Gets Failover in Next Stages

Web tier: Add load balancer + multiple web servers (Stage 3).

DB tier: Add database replication — master/slave (Stage 4).

Roadmap

02b Choose Database Type

SQL vs NoSQL: structural vs performance trade-off.

Decision Point

Which Database Fits Your Data Model?

Wrong database choice = costly migration later.

✓ Use SQL (Relational) If:

Data is structured (users, posts, comments with relationships).

Need ACID transactions (financial data, orders).

Need JOIN operations across tables.

Data model is known upfront.

Examples: MySQL, PostgreSQL, Oracle.

Reliability · Consistency

✓ Use NoSQL If:

Data is unstructured or semi-structured (logs, events, documents).

Need very low latency (<10 ms per request).

Horizontal scaling is critical (millions of reads/writes per second).

Data model evolves rapidly (schema flexibility).

Examples: MongoDB, DynamoDB, Cassandra, Redis.

Performance · Scalability

02c Choose Scaling Approach

Vertical vs Horizontal scaling strategy.

Problem 2.3

Single Server Cannot Scale Beyond Hardware Limits

Max CPU, RAM, disk on commodity hardware ≈ 256 GB RAM, 64 vCPUs. Millions of users need more.

✓ Vertical Scaling (Scale Up) — Short-term

Add more CPU, RAM, disk to single server.

Cheap, simple, works for moderate growth.

BUT: there is a hard ceiling (even the largest cloud instances top out, on the order of hundreds of vCPUs and terabytes of RAM); cost grows much faster than capacity; still a SPOF.

Quick · Simple

✓ Horizontal Scaling (Scale Out) — Long-term

Add multiple servers (10.0.0.1, 10.0.0.2, 10.0.0.3…).

Scales far beyond a single machine: keep adding servers as needed (until a shared dependency such as the database becomes the bottleneck).

Provides redundancy: if one server fails, others still serve traffic.

Requires load balancer to distribute requests.

Unlimited Scale · Redundancy · Recommended

03 Multiple Web Servers (Horizontal Scaling)

Added multiple web servers (10.0.0.1, 10.0.0.2, 10.0.0.3).

Problem 3.1

How Do Users Know Which Server to Connect To?

With 3 web servers, users need a single entry point. Which IP do they use?

✓ Load Balancer (Public IP 88.88.88.1)

Single public IP that users connect to.

Load balancer routes requests to private IPs using round-robin, least-connections, or other algorithms.

Users never see private IPs; load balancer handles routing transparently.

Single Entry Point · Abstraction

Problem 3.2

Traffic Not Evenly Distributed

Without a load balancer, all users might hit Server 1, leaving Servers 2 & 3 idle. Server 1 overloads.

✓ Load Balancer Distributes Traffic Evenly

Round-robin: Server 1, then 2, then 3, then repeat.

Least-connections: route to server with fewest active connections.

IP hash: same client always routes to same server (if sticky sessions needed).

Load Distribution
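The three selection strategies above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular load balancer implements them: the function names, the `active` connection-count dictionary, and the client address 203.0.113.9 are assumptions made for the example; the server IPs are the ones used in the text.

```python
import hashlib
import itertools

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # private IPs from the text


def round_robin(servers):
    """Return an iterator that yields servers in a repeating 1, 2, 3, 1... order."""
    return itertools.cycle(servers)


def least_connections(active):
    """Pick the server with the fewest active connections.

    `active` maps server -> current connection count (hypothetical bookkeeping).
    """
    return min(active, key=active.get)


def ip_hash(client_ip, servers):
    """Map a client IP to a fixed server so repeat requests land on the same one."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


rr = round_robin(SERVERS)
first_four = [next(rr) for _ in range(4)]
least_loaded = least_connections({"10.0.0.1": 12, "10.0.0.2": 3, "10.0.0.3": 7})
sticky = ip_hash("203.0.113.9", SERVERS)
```

Note that the IP-hash mapping changes for many clients whenever the server list changes, which is one reason sticky sessions are usually avoided where possible.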

Problem 3.3

No Failover If One Server Goes Down

Server 1 crashes → load balancer still tries to route to it → requests timeout/fail.

✓ Load Balancer Health Checks

Balancer pings each server every 10 seconds: "Are you alive?"

If Server 1 doesn't respond, balancer marks it DOWN and removes it from rotation.

Requests now route only to Servers 2 & 3; users don't see the outage.

Ops team can add a replacement Server 4; balancer detects it and adds to rotation.

Automatic Failover · High Availability
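The health-check loop described above can be modelled as follows. This is a sketch under stated assumptions: `ServerPool`, `probe`, and `max_failures` are hypothetical names, the probe is a plain function instead of a real network ping, and a real balancer would call `run_health_checks()` on a timer (every 10 seconds in the text).

```python
class ServerPool:
    """Tracks which backends are in rotation, based on health probes."""

    def __init__(self, servers, probe, max_failures=1):
        self.servers = list(servers)
        self.probe = probe                # callable: server -> bool (assumed)
        self.max_failures = max_failures  # failed checks tolerated before DOWN
        self.failures = {s: 0 for s in self.servers}

    def add(self, server):
        """Register a replacement server."""
        self.servers.append(server)
        self.failures[server] = 0

    def run_health_checks(self):
        """Probe every server once; reset the counter on success."""
        for server in self.servers:
            if self.probe(server):
                self.failures[server] = 0
            else:
                self.failures[server] += 1

    def in_rotation(self):
        """Servers that have not reached the failure limit."""
        return [s for s in self.servers if self.failures[s] < self.max_failures]


# Simulated cluster state: Server 1 has crashed.
alive = {"10.0.0.1": False, "10.0.0.2": True, "10.0.0.3": True}
pool = ServerPool(alive, probe=lambda server: alive[server])
pool.run_health_checks()
after_crash = pool.in_rotation()

# Ops adds a replacement, Server 4.
alive["10.0.0.4"] = True
pool.add("10.0.0.4")
pool.run_health_checks()
after_replacement = pool.in_rotation()
```

Production balancers typically require several consecutive failures before marking a server DOWN, to avoid flapping on a single dropped probe; raising `max_failures` models that.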

Problem 3.4

Load Balancer Itself Is a SPOF

Load balancer goes down → users can't reach any web server (even if all 3 are healthy).

✓ Redundant Load Balancers

Deploy 2+ load balancers in active-passive or active-active config.

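DNS points to primary; if primary fails, DNS fails over to secondary (failover speed is bounded by the DNS TTL; a floating virtual IP shared between the balancers switches over faster).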

Or: use managed load balancer service (AWS ALB, GCP LB) — provider handles redundancy.

High Availability

04 Database Replication (Master-Slave)

While the web tier has redundancy, the database is still a single point of failure.

Problem 4.1

Master Database Is a SPOF

Master DB crashes → no reads, no writes → entire app becomes unusable (even if web servers are healthy).

✓ Master-Slave Database Replication

Master DB (10.0.0.100): handles ALL writes, updates, deletes.

Slave DB(s) (10.0.0.101, 10.0.0.102): receive copies of data from master in near real time.

Web servers send reads to any slave, writes to master only.

If master fails: promote one slave to become new master.

Replication lag: typically <100 ms; eventual consistency.

High Availability · Read Scaling
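A minimal sketch of the read/write split and the promotion step, assuming a hypothetical `ReplicatedDB` router that only decides which host should run a statement (it does not talk to a real database, and detecting reads by a leading `SELECT` is a simplification):

```python
import itertools


class ReplicatedDB:
    """Routes writes to the master and spreads reads across the slaves."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)
        self._reads = itertools.cycle(self.replicas)

    def route(self, sql):
        """Return the host that should run this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self.replicas:
            return next(self._reads)
        return self.master

    def promote(self):
        """On master failure, promote the first slave to be the new master."""
        self.master = self.replicas.pop(0)
        self._reads = itertools.cycle(self.replicas)


db = ReplicatedDB("10.0.0.100", ["10.0.0.101", "10.0.0.102"])
write_host = db.route("INSERT INTO users (name) VALUES ('ada')")
read_hosts = [db.route("SELECT * FROM users") for _ in range(3)]

db.promote()  # the master at 10.0.0.100 has failed
new_master = db.master
read_after_failover = db.route("SELECT * FROM users")
```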

Problem 4.2

Master Promotion Is Manual & Error-Prone

When the master fails, an ops engineer must SSH in, run promotion scripts, and update configs by hand. This takes 15+ minutes, and writes that had not yet replicated can be lost.

✓ Automated Failover (Advanced)

Tools: MGR (MySQL Group Replication), Patroni (PostgreSQL), etc.

Automatically detect master failure, promote best slave, update all clients.

Failover time: <30 seconds; minimises data loss.

Production-Grade

Problem 4.3

Slave Data Might Be Out of Sync with Master

Replication lag: master gets an update, but slave takes 500 ms to apply it. Web server reads slave → gets stale data.

✓ Accept Eventual Consistency (or use stronger guarantees)

For most apps, eventual consistency is fine: <100 ms lag is imperceptible.

Critical reads (financial, current user profile): route to MASTER only.

Less critical reads (user feed, analytics): read from SLAVE (stale is OK).

Pragmatic Trade-off
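The routing rule above reduces to a one-line decision per read. A hedged sketch, where `pick_read_host` and the `critical` flag are hypothetical names and the hosts are the ones from Stage 4:

```python
import random

MASTER = "10.0.0.100"
REPLICAS = ["10.0.0.101", "10.0.0.102"]


def pick_read_host(critical, rng=random):
    """Critical reads go to the master; others may hit a possibly stale slave."""
    if critical:
        return MASTER
    return rng.choice(REPLICAS)


balance_host = pick_read_host(critical=True)   # e.g. account balance
feed_host = pick_read_host(critical=False)     # e.g. user feed
```

Every read marked critical adds load to the master, so the flag is best reserved for the few queries where stale data is actually harmful.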

05 Cache Layer (Redis / Memcached)

Database is now highly available, but still bottlenecked by repeated read requests.

Problem 5.1

Database Is Hit by Thousands of Repeated Read Queries

3 web servers × 100 users each, at roughly one read per user per second = 300 near-identical DB queries/sec → DB CPU maxes out, latency spikes.

✓ Cache Layer (Redis / Memcached at 10.0.0.50)

Before hitting DB, web server checks cache: "Is this data in memory?"

Cache hit (data found): return in <1 ms.

Cache miss (data not found): query DB, store result in cache with TTL, return.

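Typical hit rate: 90%+ → 90% of reads never reach the database.

Result: DB load drops ~10×. Cache hits can be ~100× faster than a DB query (e.g. <1 ms vs ~100 ms), but average latency improves by closer to 10×, because the remaining 10% of reads still pay the full DB cost.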

10-100× Faster · DB Load Reduction
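The effect of the hit rate can be worked through numerically. The figures below are assumptions for illustration only: 300 reads/sec (from the problem statement), a 90% hit rate, a 1 ms cache lookup, and a 100 ms database query.

```python
def db_queries_per_sec(total_qps, hit_rate):
    """Only cache misses reach the database."""
    return total_qps * (1 - hit_rate)


def average_latency_ms(hit_rate, cache_ms, db_ms):
    """Every read pays the cache lookup; misses additionally pay the DB query."""
    return cache_ms + (1 - hit_rate) * db_ms


db_load = db_queries_per_sec(300, 0.90)                  # about 30 queries/sec
avg = average_latency_ms(0.90, cache_ms=1, db_ms=100)    # about 11 ms
speedup = 100 / avg                                      # about 9x on average
```

Under these assumptions the database sees a tenth of the traffic, while the average read is roughly nine times faster. The full 100× gain applies only to the individual requests that hit the cache.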

Problem 5.2

Cache Holds Stale Data

User updates profile → written to DB. But cache still has old data for 30 minutes → next read gets stale profile.

✓ TTL (Time-To-Live) Policy

Cache all data with expiration: cache.set("user:123", {...}, ttl=60)

After 60 s, cache entry expires automatically; next read hits DB for fresh data.

Invalidation: proactively delete from cache on write: cache.delete("user:123")

Freshness Guarantee
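The read path (check cache, fall back to DB, store with TTL) and the write path (update DB, then invalidate) can be sketched together. `TTLCache` here is a tiny in-process stand-in written for this example, not the Redis or Memcached client API; the `db` dictionary and the `read_user` / `update_user` helpers are likewise hypothetical.

```python
import time


class TTLCache:
    """Minimal in-memory cache with per-key expiry (stand-in for Redis)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def delete(self, key):
        self._data.pop(key, None)


db = {"user:123": {"name": "Ada"}}  # hypothetical stand-in for the database
cache = TTLCache()


def read_user(key):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    value = cache.get(key)
    if value is None:
        value = db[key]
        cache.set(key, value, ttl=60)
    return value


def update_user(key, value):
    """Write to the DB first, then invalidate so the next read is fresh."""
    db[key] = value
    cache.delete(key)


first = read_user("user:123")              # miss: loaded from db and cached
update_user("user:123", {"name": "Grace"})
second = read_user("user:123")             # miss again: entry was invalidated
```

TTL alone bounds staleness to the TTL window; explicit invalidation on write is what removes the stale read in the profile-update scenario.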

Problem 5.3

Cache Is a SPOF

Cache server goes down → all requests become cache misses → DB gets hammered → cascading failure ("thundering herd").

✓ Redundant Cache Servers (Cluster)

Deploy cache as cluster (2–3 replicas) instead of single instance.

If one node fails, cluster routes to others.

Overprovision capacity: cache 10 GB data, allocate 15 GB (50% headroom).

High Availability · Resilience

Problem 5.4

Cache Memory Is Full

Cache can hold only ~16 GB data. After reaching 16 GB, where do new entries go?

✓ Cache Eviction Policy

LRU (Least-Recently-Used): evict the entry that was accessed longest ago to make room.

LFU (Least-Frequently-Used): delete least-popular entry.

FIFO (First-In-First-Out): delete oldest entry.

Most common: LRU (hottest data stays, cold data gets evicted).

Memory Management
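LRU eviction is short enough to show in full. The sketch below uses Python's `collections.OrderedDict`; the class name `LRUCache` and the capacity of 2 entries are assumptions chosen to keep the example small (a real cache limits memory in bytes, not entry count).

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


lru = LRUCache(capacity=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")     # "a" is now the most recently used entry
lru.put("c", 3)  # cache is full, so "b" is evicted
```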

Considerations for Using Cache

When to use cache. Consider using cache when data is read frequently but modified infrequently. Since cached data is stored in volatile memory, a cache server is not ideal for persisting data. If a cache server restarts, all data in memory is lost. Important data should always be saved in persistent data stores.

Expiration policy. It is good practice to implement an expiration policy. Once cached data expires, it is removed from the cache. Without an expiration policy, cached data remains in memory until it is evicted or the server restarts. The expiration time should be neither too short (causes frequent reloads from the database) nor too long (data can become stale).

Consistency. Keeping the data store and the cache in sync is critical. Inconsistency can occur because data-modifying operations on the data store and cache are not in a single transaction. When scaling across multiple regions, maintaining consistency is especially challenging. For further details, refer to "Scaling Memcache at Facebook" published by Facebook.

Mitigating failures. A single cache server represents a potential single point of failure (SPOF) — a part of a system that, if it fails, will stop the entire system from working. Multiple cache servers across different data centres are recommended to avoid SPOF. Another approach is to overprovision the required memory by a certain percentage, providing a buffer as memory usage increases.

Eviction policy. Once the cache is full, any request to add items may cause existing items to be removed (cache eviction). Least-Recently-Used (LRU) is the most popular policy. Other policies such as Least Frequently Used (LFU) or First In First Out (FIFO) can be adopted to satisfy different use cases.

Best Practices · Cache Design · Production Readiness

06 Content Delivery Network (CDN)

Database and app are optimised, but static assets (JS, CSS, images) are still served from origin.

Problem 6.1

Static Assets Downloaded from Origin Every Time

User in London requests logo.png from a US server → ~120 ms round trip per asset; multiplied across many assets per page and thousands of users, this means slow page loads and heavy load on the origin.

✓ CDN at the Edge

CDN is a global network of servers physically located near users.

Static assets cached on CDN edges near users.

User in London → served from CDN UK edge (30 ms) instead of US origin (120 ms).

4× Faster · Global Reach

Problem 6.2

CDN Serves Stale Assets

Release new logo.jpg but CDN edges still serve old cached version for hours.

✓ Cache Invalidation + Versioning

Set TTL on CDN: cache static assets for 1 hour, then re-fetch from origin.

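Use versioning in the URL: logo.jpg?v=2 → new cache key, so the CDN fetches a fresh copy (this assumes the CDN is configured to include the query string in its cache key).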

Purge CDN manually via API when deploying new assets.

Asset Freshness
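A common variant of URL versioning puts a content hash in the filename instead of a `?v=2` query string, so the cache key changes exactly when the file changes. This is a swapped-in technique, not the one in the text, and the helper name `versioned_name` and the 8-character digest length are assumptions.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(filename, content, digest_len=8):
    """Return e.g. 'logo.<hash>.jpg' so changed content gets a new URL."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


old_url = versioned_name("logo.jpg", b"old logo bytes")
new_url = versioned_name("logo.jpg", b"new logo bytes")
```

Because each release produces a new URL, fingerprinted assets can be cached with a very long TTL without risking stale content; only the HTML that references them needs a short TTL.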

Problem 6.3

CDN Outage Breaks Static Assets

CDN provider has a regional outage → users can't load CSS/JS → site is broken.

✓ CDN Fallback (Origin Fallback)

HTML includes fallback: if CDN request fails, browser requests asset from origin.

Ensures graceful degradation: worse latency, but site stays functional.

Resilience

07 Stateless Web Tier

Web tier is scaled horizontally, but session state makes servers interdependent.

Problem 7.1

Session State Couples Servers to Users (Sticky Sessions)

User A logs in on Server 1 → session stored in Server 1 memory. Load balancer MUST always route User A to Server 1, or auth fails.

✓ Move Session State to Shared Store

Store session data in Redis/Memcached or database, NOT in server memory.

Load balancer can route User A to ANY server; all servers fetch session from Redis.

Servers become stateless: can be killed/restarted without losing user sessions.

Auto-scaling now works: spin up 10 servers, kill 5, users don't notice.

Stateless · Auto-Scaling
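The idea can be shown with two servers that share one session store. Everything here is illustrative: `SessionStore` is an in-memory stand-in for Redis/Memcached, and `WebServer`, `login`, and `whoami` are hypothetical names.

```python
import secrets


class SessionStore:
    """In-memory stand-in for a shared Redis/Memcached session store."""

    def __init__(self):
        self._sessions = {}

    def create(self, user):
        token = secrets.token_hex(16)
        self._sessions[token] = {"user": user}
        return token

    def get(self, token):
        return self._sessions.get(token)


class WebServer:
    """Keeps no per-user state; every lookup goes to the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user):
        return self.store.create(user)

    def whoami(self, token):
        session = self.store.get(token)
        return session["user"] if session else None


store = SessionStore()
server1 = WebServer("10.0.0.1", store)
server2 = WebServer("10.0.0.2", store)

token = server1.login("user_a")          # User A logs in on Server 1
seen_by_server2 = server2.whoami(token)  # the next request lands on Server 2

del server1                              # Server 1 is removed
still_logged_in = server2.whoami(token)
```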

Problem 7.2

Cannot Add/Remove Servers Without Breaking Users

When session is tied to Server 1, removing Server 1 logs out all its users.

✓ Auto-Scaling Becomes Possible

With sessions in Redis, can spin up/down servers freely without user impact.

Monitor CPU/memory; if >80%, add servers; if <20%, remove servers.

Tools: Kubernetes Horizontal Pod Autoscaler, AWS Auto Scaling groups.

Elasticity · Cost Efficiency
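The threshold rule above (add servers above 80%, remove below 20%) can be written as a small pure function. The one-server step size and the `min_servers` / `max_servers` bounds are assumptions added so the sketch never scales to zero or without limit.

```python
def desired_servers(current, cpu_percent, scale_up_at=80, scale_down_at=20,
                    min_servers=2, max_servers=10):
    """Return the target server count for the observed average CPU."""
    if cpu_percent > scale_up_at:
        target = current + 1
    elif cpu_percent < scale_down_at:
        target = current - 1
    else:
        target = current
    # Clamp to the allowed fleet size.
    return max(min_servers, min(max_servers, target))


busy = desired_servers(current=3, cpu_percent=85)     # scale out to 4
idle = desired_servers(current=3, cpu_percent=10)     # scale in to 2
steady = desired_servers(current=3, cpu_percent=50)   # stay at 3
```

Real autoscalers also apply a cooldown between actions so that a brief spike does not cause repeated scale-out and scale-in.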

08 Multi-Data Center Setup

Single data center is now a regional SPOF; global users have high latency.

Problem 8.1

Single Data Center Failure = Total Outage

Fire in US-East data center → all servers, DBs, cache offline → service unreachable for millions of users.

✓ Deploy Redundant Data Centers (DC1 + DC2)

Deploy identical infrastructure in US-East (DC1) and US-West (DC2).

Each DC has: web servers, cache, master DB, workers, etc.

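If US-East fails, GeoDNS routes 100% of traffic to DC2. RTO depends on the health-check interval and DNS TTL: roughly 30 s with aggressive settings, longer where resolvers keep serving cached records.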

Disaster Recovery · High Availability

Problem 8.2

High Latency for Users Far from Data Center

User in Australia requests from US-only server → 200 ms latency (vs 30 ms from local edge).

✓ GeoDNS Routing + Multiple DCs

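GeoDNS resolves the domain based on user location: users nearest US-East → DC1, users nearest US-West → DC2.

Route each user to the nearest DC → lower latency, better user experience. Two US data centers still leave the Australian user far away; matching the ~30 ms figure needs a DC in or near that region.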

Low Latency · Global Reach
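The routing decision, including failover when a data center is down, can be approximated as "nearest healthy DC". This is only a sketch: real GeoDNS services use IP geolocation databases and latency measurements, and the coordinates and the `resolve` helper below are rough, illustrative assumptions.

```python
import math

DATA_CENTERS = {
    "DC1": (39.0, -77.5),    # US-East (rough, illustrative coordinates)
    "DC2": (45.6, -121.2),   # US-West (rough, illustrative coordinates)
}


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))


def resolve(user_location, healthy):
    """Return the nearest data center among those currently marked healthy."""
    candidates = [dc for dc in DATA_CENTERS if dc in healthy]
    if not candidates:
        raise RuntimeError("no healthy data center")
    return min(candidates,
               key=lambda dc: distance_km(user_location, DATA_CENTERS[dc]))


new_york = (40.7, -74.0)
san_francisco = (37.8, -122.4)

ny_normal = resolve(new_york, healthy={"DC1", "DC2"})
sf_normal = resolve(san_francisco, healthy={"DC1", "DC2"})
ny_failover = resolve(new_york, healthy={"DC2"})   # US-East is down
```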

Problem 8.3

Data Inconsistency Across DCs

User updates profile in DC1 → replicated to DC2 asynchronously (100–500 ms lag) → user logs in from DC2, sees old profile.

✓ Asynchronous Multi-DC Replication

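Master DB in each DC handles local writes and replicates to other DCs asynchronously (concurrent writes to the same record in different DCs then need a conflict-resolution rule, such as last-write-wins).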

Eventual consistency: within seconds/minutes, all DCs converge to same state.

For critical data (financial): stick to single DC for writes.

Eventual Consistency

Problem 8.4

Complex Operational Burden

Deploying to 2 DCs means 2× infrastructure, 2× monitoring, 2× backups, 2× troubleshooting.

✓ Infrastructure as Code (IaC) + Automation

Terraform/Bicep: define entire DC infrastructure once, deploy to both regions with one command.

Centralised monitoring: one dashboard showing both DCs' health.

Operational Simplicity

REF Correct End-to-End Architecture

User
↓
DNS / Geo-routing (Route53 / Cloudflare)
↓
CDN (optional cache)
↓
Load Balancer
↓
Stateless Servers (API layer)
↓
Cache (Redis)
↓
Databases / Storage (SQL, NoSQL, S3)

Key Corrections

Statement	Status	Correction
LB distributes across servers	✅	Correct
LB distributes across databases	⚠️	Usually via DB proxies, not LB
LB chooses data center	❌	DNS / geo-routing does this
Data centers store data	✅	Also host compute + networking
Servers host backend/frontend	✅	Correct

Advanced Note — Multi-Region Production Systems

Layer 1: DNS decides region

Layer 2: LB distributes within region

Layer 3: DB replication handles data locality

Final Takeaway

Load Balancer

Intra-region traffic distribution (servers)

DNS / Geo-routing

Inter-region routing (data centers)

Data Center

Full stack (compute + storage)

Servers

Execution layer (APIs, apps, jobs)

09 Message Queue & Async Workers

Web servers handling long-running tasks (photo processing, email) block user requests.

Problem 9.1

Long-Running Tasks Block User Requests

User uploads photo → web server processes it (crop, resize, blur = 10 s) → request blocked → user sees timeout.

✓ Message Queue (RabbitMQ / Kafka at 10.0.0.60)

Web server publishes task to queue: queue.publish("photo_process", {photo_id: 123})

Returns immediately: "Your photo is processing" (response in ~200 ms instead of 10 s).

Worker picks up task from queue asynchronously and processes offline.

Non-Blocking · Better UX
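The publish-and-return pattern can be demonstrated with Python's standard `queue` and `threading` modules standing in for RabbitMQ/Kafka. The names `handle_upload` and `worker` are hypothetical, and the "processing" step is a placeholder that just records the photo ID.

```python
import queue
import threading

tasks = queue.Queue()   # in-process stand-in for the message queue
processed = []          # photo IDs the worker has finished


def handle_upload(photo_id):
    """Web handler: enqueue the work and return immediately."""
    tasks.put({"type": "photo_process", "photo_id": photo_id})
    return "Your photo is processing"


def worker():
    """Background worker: pull tasks and do the slow processing."""
    while True:
        task = tasks.get()
        if task is None:        # sentinel value: shut down
            tasks.task_done()
            break
        processed.append(task["photo_id"])  # placeholder for crop/resize/blur
        tasks.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

reply = handle_upload(123)   # returns without waiting for the worker
tasks.put(None)              # ask the worker to stop after draining
tasks.join()                 # wait here only so the example can inspect results
```

In a real deployment the queue is a separate service and the worker runs on another machine, so the web server never calls anything like `tasks.join()`.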

Problem 9.2

Cannot Scale Processing Tasks Independently

10,000 photo uploads → should spin up 100 workers, each handling 100 photos. Not 10,000 servers.

✓ Decouple Web Tier from Worker Tier

Web servers only handle requests and publish to queue.

Workers independently consume from queue and process photos.

If queue backlog grows, spin up more workers. If queue empties, kill workers.

Web and worker tiers scale independently.

Independent Scaling · Cost Efficient

Problem 9.3

Worker Crashes Lose Tasks

Worker picks up task, starts processing, then crashes → task lost, photo never processed.

✓ Durable Message Queue + Acknowledgments

Queue persists all messages to disk before worker processes them.

If crash before ACK, queue re-delivers to another worker.

Guarantees: "at least once" (a task may be delivered twice, so workers should be idempotent) or "exactly once" (more complex, slower).

Durability · Fault Tolerance
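Acknowledgment and redelivery can be modelled with a toy queue that keeps a message "in flight" until it is acked. This sketch is an assumption-laden illustration: `AckQueue` is not a real broker API, it does not persist to disk, and crash detection is simulated by calling `requeue_unacked()` by hand.

```python
from collections import deque


class AckQueue:
    """Toy queue: a message stays in flight until the worker acknowledges it."""

    def __init__(self):
        self._ready = deque()
        self._in_flight = {}
        self._next_tag = 0

    def publish(self, body):
        self._ready.append(body)

    def receive(self):
        """Hand out the next message with a delivery tag, or None if empty."""
        if not self._ready:
            return None
        body = self._ready.popleft()
        self._next_tag += 1
        self._in_flight[self._next_tag] = body
        return self._next_tag, body

    def ack(self, tag):
        self._in_flight.pop(tag, None)

    def requeue_unacked(self):
        """Called when a worker is found dead: redeliver its messages."""
        self._ready.extend(self._in_flight.values())
        self._in_flight.clear()


q = AckQueue()
q.publish({"photo_id": 123})

tag, task = q.receive()   # worker A takes the task...
q.requeue_unacked()       # ...and crashes before acking, so it is requeued

done = set()              # hypothetical record of completed photo IDs
tag, task = q.receive()   # worker B receives the same task again
if task["photo_id"] not in done:   # idempotency guard against duplicates
    done.add(task["photo_id"])
q.ack(tag)
remaining = q.receive()
```

The `done` check is what makes at-least-once delivery safe: if worker A had finished the photo but crashed just before acking, worker B would see the ID and skip the duplicate work.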

Problem 9.4

Queue Server Is a SPOF

Queue crashes → all task publishing fails → web servers can't offload → requests timeout.

✓ Distributed Queue Cluster

Deploy queue as 3-node cluster (RabbitMQ cluster, Kafka brokers).

If one node fails, cluster rebalances; publishers/consumers reconnect to other nodes.

High Availability

10 Database Sharding (Horizontal Scaling)

Database has grown to massive scale; single master can't handle read/write volume.

Problem 10.1

Single Master Database Hits Scalability Ceiling

Millions of concurrent users, billions of rows → single master can't handle the throughput; queries take 5 s.

✓ Vertical Scaling (Temporary, ~weeks)

Add massive CPU/RAM to single server: 512 GB RAM, 96 vCPUs.

Works for medium scale (~100 M records).

BUT: cost grows much faster than capacity; hits hardware limits; still a SPOF.

Quick Fix

✓ Horizontal Database Scaling = Sharding

Partition data across multiple databases (Shard 0, 1, 2, 3).

Sharding key: user_id. Hash function: user_id % 4.

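Each shard holds ~1/4 of the data and takes ~1/4 of the load, so single-user queries run against a much smaller dataset.

Can add more shards as data grows, but with a plain modulo hash, changing the shard count remaps most keys; consistent hashing or a lookup directory limits how much data has to move.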

Linear Scaling · Unlimited Growth
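Routing by `user_id % 4` looks like this in practice. The four dictionaries stand in for four separate database servers, and `save_user` / `load_user` are hypothetical helper names.

```python
from collections import Counter

NUM_SHARDS = 4


def shard_for(user_id, num_shards=NUM_SHARDS):
    """Hash function from the text: user_id modulo the shard count."""
    return user_id % num_shards


# Each dict stands in for one database server (Shard 0..3).
shards = {i: {} for i in range(NUM_SHARDS)}


def save_user(user_id, record):
    shards[shard_for(user_id)][user_id] = record


def load_user(user_id):
    return shards[shard_for(user_id)].get(user_id)


save_user(123, {"name": "Ada"})
home_shard = shard_for(123)     # 123 % 4 == 3
record = load_user(123)

# Sequential IDs spread evenly across the four shards.
spread = Counter(shard_for(uid) for uid in range(1000))
```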

Problem 10.2

Sharding Introduces Complexity

Cannot JOIN across shards; resharding when data grows; hotspot keys overload one shard.

✓ Accept Complexity or Mitigate

Avoid cross-shard JOINs: denormalise data (duplicate in each shard).

Resharding: split Shard 0 into 0a, 0b; move half the data.

Hotspot keys: split that shard further, allocate dedicated replicas.

Planned Complexity
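How much data resharding moves depends heavily on the new shard count. The sketch below measures it for the modulo scheme from the previous problem; `moved_fraction` is a hypothetical helper and the sample of 100,000 sequential IDs is an assumption.

```python
def moved_fraction(old_shards, new_shards, sample=range(100_000)):
    """Fraction of user IDs whose shard changes under modulo hashing."""
    moved = sum(1 for uid in sample if uid % old_shards != uid % new_shards)
    return moved / len(sample)


add_one_shard = moved_fraction(4, 5)   # 0.8: most users change shard
double_shards = moved_fraction(4, 8)   # 0.5: each shard splits in half
```

Doubling corresponds to the "split Shard 0 into 0a, 0b" approach: half of each shard stays put. Adding a single shard under plain modulo hashing relocates 80% of users, which is why shard counts are usually doubled or a consistent-hashing scheme is used instead.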

Problem 10.3

Choosing Wrong Sharding Key Is Fatal

Shard by country → US shard has 100 B rows, EU shard has 1 B rows → uneven, one shard overloaded.

✓ Choose Sharding Key Carefully

Shard key must be: (1) immutable, (2) evenly distributed, (3) aligned with queries so that most stay within one shard.

Good: user_id (unique per user, and querying by user is common).

Bad: timestamp (all new writes land on the latest shard while older shards sit idle).

Design Decision

11 Monitoring, Logging & Cost Tracking

System is complex (2 DCs, multiple DBs, caches, queues, workers); hard to debug issues at scale.

Problem 11.1

Cannot Debug Production Issues

3 web servers, 2 DBs, 1 cache, 1 queue, 3 workers across 2 DCs. User reports "photo not processing." Where is the bug?

✓ Centralised Logging (ELK, Splunk, CloudWatch)

Every component logs to central location.

Search all logs by user_id: see every request, DB query, worker task.

Alerts: if error rate > 5%, page on-call engineer.

Observability
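With logs in one place, tracing a single user and alerting on error rate become simple queries. The log entries, field names, and helper functions below are invented for illustration; a real setup would run equivalent queries in ELK, Splunk, or CloudWatch.

```python
# Hypothetical structured log entries collected from every component.
logs = [
    {"user_id": 42, "component": "web", "status": 200, "msg": "upload accepted"},
    {"user_id": 42, "component": "queue", "status": 200, "msg": "task published"},
    {"user_id": 42, "component": "worker", "status": 500, "msg": "resize failed"},
    {"user_id": 7, "component": "web", "status": 200, "msg": "page served"},
]


def trace_user(entries, user_id):
    """All log entries for one user, in arrival order."""
    return [e for e in entries if e["user_id"] == user_id]


def error_rate(entries):
    """Share of entries with a 5xx status."""
    if not entries:
        return 0.0
    return sum(1 for e in entries if e["status"] >= 500) / len(entries)


def should_page(entries, threshold=0.05):
    """Alert rule from the text: page on-call if error rate exceeds 5%."""
    return error_rate(entries) > threshold


path = [e["component"] for e in trace_user(logs, 42)]
failing = [e["component"] for e in trace_user(logs, 42) if e["status"] >= 500]
page = should_page(logs)
```

For the "photo not processing" report, the trace shows the request passed the web server and queue and failed in the worker.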

Problem 11.2

No Visibility into System Health

Is the system healthy? Are users happy? Is the DB slow? No idea until customers complain.

✓ Metrics & Dashboards (Prometheus, Grafana, Datadog)

Collect metrics: CPU, memory, disk, request latency, error rate, queue size, DB query time.

Dashboard shows: latency, queue backlog, cache hit rate — spot problems before users notice.

Proactive Monitoring

Problem 11.3

Cloud Costs Are Spiralling

2 DCs × 10 servers + DB + cache + queue + worker tier = $50K/month. Is this efficient?

✓ Cost Tracking & Automation

Tag every resource by service: "photo_service", "payment_service", "worker_tier".

Track: cost per service, cost per user, cost per transaction.

Alerts: if service cost > budget, shut down unused resources.

Tools: LangSmith for LLM/AI workload usage tracking; Kubecost for Kubernetes; CloudHealth or AWS Cost Explorer for cloud spend.

Financial Control
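Once resources carry a service tag, cost per service is a simple aggregation. The resource list, dollar amounts, and budgets below are made-up example data, and `cost_by_service` / `over_budget` are hypothetical helpers, not part of any billing API.

```python
from collections import defaultdict

# Hypothetical billing export: one row per tagged resource.
resources = [
    {"id": "web-1", "service": "photo_service", "monthly_cost": 1200.0},
    {"id": "worker-1", "service": "photo_service", "monthly_cost": 300.0},
    {"id": "web-2", "service": "payment_service", "monthly_cost": 800.0},
]

budgets = {"photo_service": 1000.0, "payment_service": 1000.0}


def cost_by_service(rows):
    """Sum monthly cost per service tag."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["service"]] += row["monthly_cost"]
    return dict(totals)


def over_budget(totals, limits):
    """Services whose cost exceeds their budget (no budget means no alert)."""
    return sorted(s for s, cost in totals.items()
                  if cost > limits.get(s, float("inf")))


totals = cost_by_service(resources)
alerts = over_budget(totals, budgets)
```

Dividing each total by active users or transactions gives the per-user and per-transaction figures mentioned above.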

Problem 11.4

Deployments Are Manual and Error-Prone

Deploy new code: SSH to 6 servers, restart services, hope nothing breaks. Takes 30 minutes.

✓ CI/CD Automation

Continuous Integration: every commit runs tests automatically.

Continuous Delivery: tests pass → deploy to staging → approve → auto-deploy to production. (Full Continuous Deployment drops the manual approval.)

Result: deploy in ~2 minutes with no manual steps beyond the approval; automated rollback if needed.

Velocity · Reliability