AiTechWorlds
AiTechWorlds
The year is 2007. Amazon's engineers publish a research paper titled "Dynamo: Amazon's Highly Available Key-Value Store." In it, they describe a radical design decision: in order to ensure that a shopping cart never fails to accept an item, they deliberately sacrificed strict data consistency.
A traditional relational database would refuse a write if it could not guarantee correctness across all replicas. Amazon chose the opposite: always accept the write, reconcile conflicts later. The shopping cart might briefly show a duplicate item, but it would never show a confusing error to a customer trying to buy something.
This paper shocked the database world. It said, plainly, that the decades-old assumption — that every database must be ACID-compliant and strongly consistent — was wrong for certain problems. Within five years, dozens of new database systems had been built on Dynamo's ideas. The NoSQL revolution had begun. Understanding the trade-off Amazon chose is the foundation of modern database design.
Relational databases, invented by Edgar Codd at IBM in 1970, organise data into tables with rows and columns. Relationships between tables are expressed through foreign keys. Queries are written in SQL (Structured Query Language).
The defining guarantee of relational databases is ACID:
These guarantees make relational databases the right choice for financial systems, e-commerce orders, and any data where correctness is non-negotiable.
When to use SQL:
Major systems: PostgreSQL, MySQL, Oracle, Microsoft SQL Server.
Scaling SQL:
SQL databases scale vertically (bigger servers) by default, but several techniques extend their capacity:
"NoSQL" is not one thing — it is a family of four different data models, each optimised for different access patterns.
Store data as JSON-like documents. Each document can have a different structure — no fixed schema.
Examples: MongoDB, Google Firestore, CouchDB.
Best for: Product catalogues (each product has different attributes), user profiles, content management systems, applications where the schema evolves frequently.
Example document:
{
"user_id": "u_8472",
"name": "Priya Sharma",
"preferences": { "theme": "dark", "language": "en" },
"addresses": [
{ "type": "home", "city": "Bangalore" }
]
}
The simplest model: a dictionary. Look up any value instantly by its key.
Examples: Redis, Amazon DynamoDB, Riak.
Best for: Session storage, shopping carts, caching, feature flags, leaderboards. Any access pattern that is always "get item by ID."
Store data in tables, but each row can have a different set of columns. Designed for massive scale — billions of rows distributed across hundreds of servers.
Examples: Apache Cassandra (originally built at Facebook), Google Bigtable, HBase.
Best for: Time-series data, IoT sensor readings, messaging (Instagram's message history used Cassandra), write-heavy workloads distributed globally.
Model data as nodes (entities) and edges (relationships). Finding deeply connected data is fast because relationships are first-class citizens, not JOINs.
Examples: Neo4j, Amazon Neptune.
Best for: Social networks ("friends of friends"), fraud detection (unusual relationship patterns), recommendation engines, knowledge graphs.
| Use Case | Recommended Type | Example System |
|---|---|---|
| Banking transactions | SQL | PostgreSQL |
| User profiles | Document | MongoDB |
| Session storage | Key-Value | Redis |
| Social media posts at scale | Wide-Column | Cassandra |
| Friend recommendations | Graph | Neo4j |
| Product catalogue | Document | Firestore |
| IoT sensor data | Wide-Column | HBase |
When a single database server cannot hold all your data, you shard — split the data across multiple servers, each responsible for a portion.
Three sharding strategies:
Hash sharding: Apply a hash function to the partition key (e.g., hash(user_id) % 4). Distributes data evenly. Problem: adding a new shard requires resharding everything — consistent hashing solves this.
Range sharding: Shard by value ranges. Shard 1 holds orders from 2020, Shard 2 holds orders from 2021. Simple, but creates hot spots if one range receives disproportionate traffic.
Directory sharding: A lookup table maps each key to its shard. Flexible but the lookup table itself becomes a bottleneck and single point of failure.
Facebook's MySQL at scale: Facebook has sharded MySQL to handle over 10 billion rows per table. Each shard is a MySQL primary with replicas. Facebook also built TAO — a distributed cache layer sitting above MySQL — to serve billions of social graph queries per second without hitting the database.
Replication means keeping copies of the same data on multiple servers. It serves two purposes: high availability (if one server dies, another takes over) and read scaling (distribute reads across replicas).
| Model | How It Works | Trade-Off |
|---|---|---|
| Primary-Replica | Primary accepts all writes; replicas receive copies and serve reads | Replicas may lag — stale reads possible |
| Multi-Primary | Multiple nodes accept writes simultaneously | Conflict resolution required when same record modified on two nodes |
| Synchronous | Primary waits for replica to confirm write before acknowledging client | Strong consistency, higher write latency |
| Asynchronous | Primary acknowledges client immediately; replica catches up later | Lower latency, risk of data loss on primary crash |
Most production systems use asynchronous primary-replica replication — trading the small risk of data loss for significantly better write performance.
| Database | Type | ACID? | Scale | Consistency | Use Case | Example Company |
|---|---|---|---|---|---|---|
| PostgreSQL | Relational | Yes | Vertical + read replicas | Strong | Transactions, reporting | Instagram, Shopify |
| MySQL | Relational | Yes | Vertical + sharding | Strong | Web applications | Facebook, Twitter |
| MongoDB | Document | Partial | Horizontal | Eventual | Flexible schemas | Airbnb, Forbes |
| Redis | Key-Value | No (by default) | Horizontal (cluster) | Eventual | Caching, sessions | Twitter, GitHub |
| DynamoDB | Key-Value | Configurable | Fully managed | Tunable | High-scale web apps | Amazon, Lyft |
| Cassandra | Wide-Column | No | Horizontal (native) | Eventual | High-write, global | Instagram, Netflix |
| Neo4j | Graph | Yes | Limited | Strong | Social graphs, fraud | LinkedIn, eBay |
In 2000, Eric Brewer proposed the CAP theorem: a distributed database can guarantee at most two of three properties:
Since network partitions are unavoidable in real distributed systems, the practical choice is between CP (consistent but may become unavailable during partitions) and AP (always available but may return stale data).
Amazon chose AP for DynamoDB. Traditional SQL databases choose CP. Understanding this trade-off — which Amazon's 2007 Dynamo paper made concrete — is what separates junior engineers from senior architects.
Amazon's 2007 decision to sacrifice consistency for availability was not a mistake — it was a deliberate engineering trade-off matched to a specific business requirement. Shopping carts that silently show a duplicate item are a minor inconvenience. Shopping carts that refuse to accept items lose sales.
The lesson is not "use NoSQL." The lesson is: match your database choice to your data model and consistency requirements. Financial transactions demand ACID. Social feeds tolerate eventual consistency. A product catalogue needs flexible schemas. Every major system you use today — Instagram, Netflix, Uber, Spotify — uses multiple database types simultaneously, each chosen for the specific trade-offs that match its workload.
Get this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises