AiTechWorlds
It is 2018. You are in a conference room at Google's San Francisco office. The interviewer writes three words on the whiteboard and steps back.
"Design Twitter."
They are not asking you to write code. They are not asking for a data structure or an algorithm. They want to watch how you think about building a system used by 400 million people simultaneously — a system that must accept 6,000 tweets per second, store them durably, and deliver them to followers across the world in under 200 milliseconds.
This question separates engineers who can code from engineers who can architect. System design is the discipline of making high-level structural decisions before a single line of production code is written.
System design is the process of defining the architecture, components, interfaces, and data flows of a system to satisfy specified requirements — particularly non-functional requirements like scale, availability, and latency.
It answers questions that no individual class or function can answer:
Experienced engineers follow a consistent process when tackling system design questions — in interviews and in real projects.
Step 1 — Clarify Requirements: Before drawing anything, ask questions. What exactly must the system do? How many users? What is the read-to-write ratio? Do we need strong consistency or eventual consistency? Real interviews fail because candidates assume instead of ask.
Step 2 — Estimate Scale: Back-of-envelope calculations anchor the design in reality. A system for 1,000 users is radically different from one for 1 billion.
Step 3 — High-Level Design: Identify the major components and how data flows between them. Boxes and arrows.
Step 4 — Detailed Design: Deep-dive into the 2–3 most critical or interesting components.
Step 5 — Identify Bottlenecks: Where will the system fail? How does it recover? How does it scale?
Estimation is a required skill. Nobody expects exact numbers; interviewers want to see how you reason.
Useful constants to memorise:
| Metric | Value |
|---|---|
| Seconds per day | 86,400 (~10^5) |
| 1 million users × 1 request/day | ~12 requests/second |
| 1 KB text | 1,000 bytes |
| 1 image (compressed) | ~300 KB |
| 1 minute HD video (compressed) | ~100 MB |
| 1 TB storage | ~1 billion KB |
Twitter estimation example:
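One way to run such an estimate is sketched below. The inputs are illustrative assumptions, not measured Twitter figures: a hypothetical 60 million tweets per day, 100 reads per write, and about 300 bytes stored per tweet.

```python
SECONDS_PER_DAY = 86_400

def per_second(events_per_day: float) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY

# Illustrative assumptions, not measured figures.
tweets_per_day = 60_000_000   # hypothetical daily tweet volume
read_write_ratio = 100        # hypothetical reads per write
bytes_per_tweet = 300         # hypothetical text plus metadata

writes_per_sec = per_second(tweets_per_day)
reads_per_sec = writes_per_sec * read_write_ratio
storage_per_year_tb = tweets_per_day * bytes_per_tweet * 365 / 1e12

print(f"writes/sec: {writes_per_sec:,.0f}")
print(f"reads/sec:  {reads_per_sec:,.0f}")
print(f"text storage/year: {storage_per_year_tb:.1f} TB")
```

Under these assumptions the system averages roughly 700 writes per second and roughly 70,000 reads per second. These are averages; peak load would be higher, and the point of the exercise is the order of magnitude rather than the exact value.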
These are the dimensions that determine how the system is built, not what it does.
The percentage of time a system is operational.
| SLA | Annual Downtime | Example |
|---|---|---|
| 99% ("two nines") | 87.6 hours | Internal tools |
| 99.9% ("three nines") | 8.76 hours | Small businesses |
| 99.99% ("four nines") | 52 minutes | AWS EC2 SLA |
| 99.999% ("five nines") | 5 minutes | Telecom, payment processors |
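The downtime column follows directly from the availability percentage. A minimal sketch, assuming a 365-day year; the function name `annual_downtime_hours` is chosen here for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a system may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for sla in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(sla)
    print(f"{sla}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why moving from three nines to five nines is so much harder than the numbers suggest at first glance.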
Amazon has been widely cited as finding that every 100 ms of added latency cost it about 1% in sales, and Google reported that a 500 ms delay in search results caused a 20% drop in traffic. Availability and latency are not academic concerns: they are directly tied to revenue.
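The ability to handle increasing load. There are two approaches: vertical scaling (moving to a more powerful machine) and horizontal scaling (adding more machines).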
In a distributed system, all replicas should agree on the current state of data. But achieving perfect consistency means slower responses and lower availability.
Eric Brewer's CAP theorem (2000, formally proven 2002): in a distributed system experiencing a network partition, you must choose between consistency and availability. The options are usually summarised as:
| Choice | Sacrifice | Examples | Use When |
|---|---|---|---|
| CP (Consistent + Partition Tolerant) | Availability | MongoDB, HBase, Zookeeper | Banking transactions, inventory (correctness critical) |
| AP (Available + Partition Tolerant) | Strong Consistency | Cassandra, DynamoDB, CouchDB | Social feeds, shopping carts (availability critical) |
| CA (Consistent + Available) | Partition Tolerance | Traditional RDBMS (single-node) | Not viable in distributed systems — partitions always happen |
Real example: Amazon's Dynamo, the internal key-value store whose design inspired DynamoDB (AP), lets a shopping cart accept items even when some replicas are unavailable. Two replicas might briefly disagree on cart contents. Amazon accepts this eventual consistency trade-off because a cart that refuses items loses sales.
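The trade-off can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not Amazon's implementation: two replicas accept writes independently during a partition and are later reconciled by taking the union of their items. `CartReplica` and `merge` are hypothetical names.

```python
class CartReplica:
    """Toy replica of one shopping cart; it always accepts writes."""

    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

def merge(a, b):
    """Reconcile two replicas by union so no added item is lost."""
    combined = a.items | b.items
    a.items = set(combined)
    b.items = set(combined)

# During a partition each replica sees a different write.
east, west = CartReplica(), CartReplica()
east.add("book")
west.add("headphones")
print(east.items == west.items)   # the replicas disagree for now

# Once the partition heals, reconciliation makes them converge.
merge(east, west)
print(sorted(east.items), sorted(west.items))
```

A union-based merge never loses an added item, but an item removed on one replica can reappear after the merge. That kind of anomaly is what an AP design agrees to tolerate in exchange for always accepting writes.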
Every large-scale system assembles from a standard toolkit of components:
| Component | Role | Examples |
|---|---|---|
| DNS | Translates domain names to IP addresses | Route 53, Cloudflare |
| CDN | Serves static assets from edge locations near users | Cloudflare, AWS CloudFront, Akamai |
| Load Balancer | Distributes requests across app servers | AWS ALB, Nginx, HAProxy |
| API Gateway | Single entry point for clients; handles auth, rate limiting | AWS API Gateway, Kong |
| Cache | In-memory store for frequent reads | Redis, Memcached |
| Message Queue | Asynchronous communication between services | Kafka, AWS SQS, RabbitMQ |
| Database | Durable data storage | PostgreSQL, MySQL, MongoDB, DynamoDB |
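To show how two of these components interact, here is a minimal sketch of a cache-aside read path. Plain dictionaries stand in for the cache and the database, and `TTL_SECONDS`, `slow_db_read`, and `get` are hypothetical names chosen for illustration.

```python
import time

database = {"user:1": "Ada"}   # stand-in for a durable store
cache = {}                     # stand-in for Redis or Memcached
TTL_SECONDS = 60               # hypothetical expiry for cached entries

def slow_db_read(key):
    """Pretend this is an expensive database query."""
    return database.get(key)

def get(key, now=None):
    """Return the value for key, consulting the cache first."""
    now = time.monotonic() if now is None else now
    entry = cache.get(key)
    if entry is not None and entry[1] > now:
        return entry[0]                      # cache hit
    value = slow_db_read(key)                # cache miss: go to the database
    cache[key] = (value, now + TTL_SECONDS)  # remember it until the TTL expires
    return value
```

If the database value changes while a cached copy is still within its TTL, `get` keeps returning the old value until the entry expires. That is the price of serving most reads from memory.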
Applying the process to Twitter's core features (post tweet, read home timeline):
Functional requirements clarified: Post tweets, follow users, read home timeline (tweets from followed users, most recent first).
Scale estimates (from above): ~700 writes/sec, ~70,000 reads/sec — this is a read-heavy system (100:1 read/write ratio).
Key design decisions:
| Non-Functional Req | Metric | How to Achieve | Trade-off |
|---|---|---|---|
| Availability | 99.99% | Multi-region deployment, read replicas | Higher cost and complexity |
| Scalability | 10× traffic spike | Horizontal app servers, auto-scaling | Stateless services required |
| Consistency | Eventual (seconds) | AP design (Cassandra for timelines) | Stale timeline briefly visible |
| Latency | p99 < 200ms | Redis caching, CDN, read replicas | Cache invalidation complexity |
| Durability | No tweet loss | Replication factor 3, write-ahead log | Storage cost |
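The home timeline requirement (tweets from followed users, most recent first) can be sketched in a few lines. This is a toy in-memory model that builds the timeline at read time; it assumes each user's tweets are appended in increasing timestamp order, and all names (`post_tweet`, `follow`, `home_timeline`) are hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import islice

tweets_by_user = defaultdict(list)   # user -> [(timestamp, text)], oldest first
following = defaultdict(set)         # user -> set of followed users

def post_tweet(user, timestamp, text):
    tweets_by_user[user].append((timestamp, text))

def follow(follower, followee):
    following[follower].add(followee)

def home_timeline(user, limit=10):
    """Merge followed users' tweets, most recent first."""
    # Reversing each per-user list yields its tweets newest first.
    streams = [reversed(tweets_by_user[u]) for u in following[user]]
    merged = heapq.merge(*streams, key=lambda t: t[0], reverse=True)
    return list(islice(merged, limit))

post_tweet("alice", 1, "hello")
post_tweet("bob", 2, "hi")
post_tweet("alice", 3, "again")
follow("carol", "alice")
follow("carol", "bob")
print(home_timeline("carol"))
```

Merging at read time keeps writes cheap but repeats the merge on every timeline request. With reads outnumbering writes by about 100:1, that cost is the reason the latency row above leans on caching.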