AiTechWorlds
A social network stores millions of users, and every user has a different combination of profile fields, privacy settings, and connected friends. A product catalog has phones with specs like "battery capacity" and shoes with specs like "heel height" — completely different attributes. A gaming leaderboard needs to update 50,000 scores per second.
Try to model all three in a relational schema. The phone-shoe problem means either dozens of NULL columns or a complex EAV (Entity-Attribute-Value) anti-pattern. The social graph means recursive self-joins that bring PostgreSQL to its knees. The leaderboard means locking and write contention that kills response times.
Not all data fits in tables. NoSQL databases were built for the shapes that relational models handle poorly.
| Limitation | Relational Approach | The Problem |
|---|---|---|
| Horizontal scaling | Vertical scaling (bigger server) | One machine has physical limits |
| Schema flexibility | Rigid schema, ALTER TABLE needed | Product catalogs, user profiles vary by row |
| Write throughput | ACID transactions with locks | Social media, IoT need millions of writes/sec |
| Graph traversal | Recursive SQL joins | 6 degrees of separation = catastrophic query |
| Semi-structured data | Normalize into many tables | JSON APIs, documents need direct storage |
Model (document databases such as MongoDB): Collections of JSON-like documents. Each document can have its own schema.
// products collection
{
  "_id": "prod_001",
  "name": "iPhone 15 Pro",
  "category": "phone",
  "specs": {
    "battery_mah": 3274,
    "chip": "A17 Pro",
    "storage_options": [128, 256, 512, 1024]
  },
  "price": 999.00,
  "tags": ["apple", "smartphone", "5G"]
}
{
  "_id": "prod_002",
  "name": "Air Max 90",
  "category": "shoe",
  "specs": {
    "heel_height_mm": 32,
    "material": "leather/mesh",
    "sizes": [7, 8, 9, 10, 11, 12]
  },
  "price": 110.00
}
Query example:
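// Find all phones under $800 that offer at least one storage option above 256 GB
// (a comparison on an array field matches if any element satisfies it)
db.products.find({
  category: "phone",
  price: { $lt: 800 },
  "specs.storage_options": { $gt: 256 }
})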
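To make the matching rules in that query concrete, here is a minimal in-memory sketch in Python. It is not MongoDB's implementation: `get_path`, `matches`, and `find` are hypothetical helpers, only `$eq`, `$lt`, and `$gt` are handled, and the sample documents are trimmed copies of the two above.

```python
import operator
from typing import Any

# Trimmed copies of the two sample documents (assumption: only the fields we query).
PRODUCTS = [
    {"_id": "prod_001", "name": "iPhone 15 Pro", "category": "phone",
     "specs": {"storage_options": [128, 256, 512, 1024]}, "price": 999.00},
    {"_id": "prod_002", "name": "Air Max 90", "category": "shoe",
     "specs": {"heel_height_mm": 32}, "price": 110.00},
]

OPERATORS = {"$eq": operator.eq, "$lt": operator.lt, "$gt": operator.gt}
_MISSING = object()  # sentinel meaning "this document has no such field"


def get_path(doc: dict, path: str) -> Any:
    """Follow a dotted path such as 'specs.storage_options'."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def matches(doc: dict, query: dict) -> bool:
    """Return True if every condition in the query holds for the document."""
    for path, condition in query.items():
        value = get_path(doc, path)
        if value is _MISSING:
            return False
        # An array field matches when ANY element satisfies the operator.
        candidates = value if isinstance(value, list) else [value]
        if not isinstance(condition, dict):
            condition = {"$eq": condition}
        for op, operand in condition.items():
            if op not in OPERATORS:
                raise ValueError(f"unsupported operator: {op}")
            if not any(OPERATORS[op](c, operand) for c in candidates):
                return False
    return True


def find(collection: list, query: dict) -> list:
    return [doc for doc in collection if matches(doc, query)]
```

With the sample data, the query from the text returns nothing, because the only phone costs 999. Raising the price limit to 1000 returns `prod_001`, and the shoe is never inspected for `storage_options` because documents without a field simply do not match.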
Best use cases: Product catalogs, content management systems (CMS), user profiles, event logging, mobile app backends.
Model (key-value stores such as Redis): A giant hash map. Every value is stored and retrieved by a unique key. Values can be strings, lists, sets, hashes, or sorted sets.
SET session:user_4821 '{"user_id":4821,"role":"admin","expires":1720000000}'
GET session:user_4821
→ '{"user_id":4821,"role":"admin","expires":1720000000}'
SETEX rate_limit:ip_192.168.1.1 60 "42" -- expires in 60 seconds
INCR rate_limit:ip_192.168.1.1 -- atomic increment
ZADD leaderboard 98500 "player_alice"
ZADD leaderboard 87200 "player_bob"
ZREVRANGE leaderboard 0 9 WITHSCORES -- Top 10 players
→ 1) "player_alice" 2) "98500"
3) "player_bob" 4) "87200"
Speed: Redis keeps its working dataset in RAM (with optional persistence to disk), so typical operations complete in under 1 millisecond.
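The leaderboard commands above rely on a sorted set: members are unique, and each carries a score that determines its rank. The sketch below imitates that behavior in plain Python. The `Leaderboard` class is hypothetical, it supports only non-negative indices, and it says nothing about how Redis implements sorted sets internally.

```python
import bisect


class Leaderboard:
    """Toy stand-in for a sorted set: unique members ranked by score."""

    def __init__(self) -> None:
        self._scores: dict[str, float] = {}           # member -> current score
        self._sorted: list[tuple[float, str]] = []    # ascending (score, member)

    def zadd(self, score: float, member: str) -> None:
        """Insert a member, or move it if it already has a score."""
        old = self._scores.get(member)
        if old is not None:
            # Remove the stale (score, member) entry before re-inserting.
            idx = bisect.bisect_left(self._sorted, (old, member))
            del self._sorted[idx]
        self._scores[member] = score
        bisect.insort(self._sorted, (score, member))

    def zrevrange(self, start: int, stop: int) -> list[tuple[str, float]]:
        """Return members ranked from highest score, indices inclusive."""
        ranked = self._sorted[::-1]
        return [(member, score) for score, member in ranked[start:stop + 1]]


board = Leaderboard()
board.zadd(98500, "player_alice")
board.zadd(87200, "player_bob")
top_ten = board.zrevrange(0, 9)
```

Here `top_ten` lists `player_alice` first and `player_bob` second, mirroring the `ZREVRANGE leaderboard 0 9 WITHSCORES` output shown earlier.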
Best use cases: Session storage, caching (cache-aside pattern), rate limiting, real-time leaderboards, pub/sub messaging.
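The cache-aside pattern mentioned above works like this: read from the cache first, fall back to the primary database on a miss, then store the result with an expiry. The following is a rough sketch under stated assumptions: `TTLCache` is a hypothetical in-memory stand-in for a key-value store with `GET`/`SETEX`-like operations, and `load_from_db` is a placeholder for whatever slow lookup the application performs.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: dict[str, tuple[Any, float]] = {}  # key -> (value, expires_at)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired entries on read
            return None
        return value

    def setex(self, key: str, ttl_seconds: float, value: Any) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_user(user_id: int, cache: TTLCache,
             load_from_db: Callable[[int], Any], ttl_seconds: float = 60) -> Any:
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)
    cache.setex(key, ttl_seconds, value)
    return value
```

The TTL bounds how stale a cached value can become. A write path would normally also delete or overwrite the cached key, which this sketch leaves out.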
Model (wide-column stores such as Cassandra): Data is organized into rows whose columns can vary. Rows are grouped into partitions identified by a partition key, which makes writes and time-series reads fast.
-- Cassandra CQL (similar to SQL)
CREATE TABLE sensor_readings (
    device_id UUID,
    recorded_at TIMESTAMP,
    temperature FLOAT,
    humidity FLOAT,
    PRIMARY KEY (device_id, recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);
-- Write a single reading (the cluster as a whole is built to sustain very high write rates)
INSERT INTO sensor_readings (device_id, recorded_at, temperature, humidity)
VALUES (uuid(), toTimestamp(now()), 22.4, 61.2);
-- Fast range read for one device
SELECT * FROM sensor_readings
WHERE device_id = 550e8400-e29b-41d4-a716-446655440000
AND recorded_at >= '2024-01-01'
AND recorded_at < '2024-02-01';
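The primary key `(device_id, recorded_at)` does two jobs: `device_id` picks the partition, and `recorded_at` keeps rows sorted inside it, so a time-range read for one device touches one contiguous slice. The sketch below models that layout in Python. `SensorTable` is a hypothetical class, not Cassandra's storage engine, and unlike Cassandra it does not overwrite a row when the same primary key is inserted twice.

```python
import bisect
from collections import defaultdict
from datetime import datetime


class SensorTable:
    """Toy model: one sorted list of rows per partition key."""

    def __init__(self) -> None:
        # device_id -> rows sorted ascending by (recorded_at, temperature, humidity)
        self._partitions: dict[str, list[tuple]] = defaultdict(list)

    def insert(self, device_id: str, recorded_at: datetime,
               temperature: float, humidity: float) -> None:
        bisect.insort(self._partitions[device_id],
                      (recorded_at, temperature, humidity))

    def range_read(self, device_id: str, start: datetime, end: datetime) -> list[tuple]:
        """Rows with start <= recorded_at < end, newest first."""
        rows = self._partitions.get(device_id, [])
        # A 1-tuple sorts before any longer tuple with the same first element,
        # so these bounds land exactly on the timestamp boundaries.
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return rows[lo:hi][::-1]


table = SensorTable()
table.insert("dev-a", datetime(2024, 1, 5), 22.4, 61.2)
table.insert("dev-a", datetime(2024, 1, 20), 21.9, 58.0)
table.insert("dev-a", datetime(2024, 2, 2), 23.1, 60.5)
table.insert("dev-b", datetime(2024, 1, 10), 19.0, 70.0)
january = table.range_read("dev-a", datetime(2024, 1, 1), datetime(2024, 2, 1))
```

`january` holds the two January rows for `dev-a`, newest first, matching the `CLUSTERING ORDER BY (recorded_at DESC)` in the table definition. A query without the partition key would have to scan every partition, which is why the CQL query always filters on `device_id`.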
Cassandra distributes data across a ring of nodes. There is no single point of failure. Writes go to multiple nodes simultaneously for fault tolerance.
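The ring idea can be sketched with consistent hashing: each node owns a position on a circle of hash values, and a partition key is stored on the first node clockwise from its own hash plus the next few nodes as replicas. This is a simplified illustration, not Cassandra's actual partitioner: the `Ring` class and node names are hypothetical, MD5 is used only because it ships with Python, and each node gets a single token.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (assumption: first 8 bytes of MD5)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    """Toy consistent-hash ring with one token per node."""

    def __init__(self, nodes: list[str], replication_factor: int = 3) -> None:
        self.replication_factor = replication_factor
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str) -> list[str]:
        """Walk clockwise from the key's token and collect replica nodes."""
        start = bisect.bisect_right(self._tokens, token(partition_key)) % len(self._ring)
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["node-1", "node-2", "node-3", "node-4", "node-5"])
owners = ring.replicas("550e8400-e29b-41d4-a716-446655440000")
```

`owners` contains three distinct nodes, and every client computes the same list for the same key, so no central coordinator is needed to locate data. Losing one of the three still leaves two copies available.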
Best use cases: IoT sensor data, time-series metrics, audit logs, messaging systems, analytics at massive scale (Netflix, Apple, Instagram use Cassandra).
Model (graph databases such as Neo4j): Data is stored as nodes (entities) and edges (relationships), each with properties. Relationships are first-class citizens, not foreign keys.
Nodes: (Alice:Person), (Bob:Person), (Python:Language), (DataCo:Company)
Edges: Alice -[FRIENDS_WITH]-> Bob
Alice -[KNOWS]-> Python
Bob -[WORKS_AT]-> DataCo
DataCo-[USES]-> Python
Cypher Query — Find Alice's friends who work at companies using Python:
MATCH (alice:Person {name: "Alice"})
      -[:FRIENDS_WITH]->(friend:Person)
      -[:WORKS_AT]->(company:Company)
      -[:USES]->(lang:Language {name: "Python"})
RETURN friend.name, company.name
// Output:
// friend.name | company.name
// ------------|-------------
// Bob         | DataCo
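The same query in SQL needs a chain of joins across several tables, one per relationship hop. Variable-depth traversals additionally require recursive queries, which can be orders of magnitude slower for deep graph queries.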
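To show why traversal is cheap when relationships are stored directly, the sketch below follows the same three hops as the Cypher pattern over an adjacency map. It is an illustration only: the `Graph` class and `friends_at_companies_using` function are hypothetical and do not reflect how any graph database is implemented.

```python
from collections import defaultdict


class Graph:
    """Toy property graph: labeled nodes and typed, directed edges."""

    def __init__(self) -> None:
        self.labels: dict[str, str] = {}                       # node name -> label
        self._out: dict[tuple[str, str], list[str]] = defaultdict(list)

    def add_node(self, name: str, label: str) -> None:
        self.labels[name] = label

    def add_edge(self, source: str, rel_type: str, target: str) -> None:
        self._out[(source, rel_type)].append(target)

    def neighbors(self, source: str, rel_type: str, label: str) -> list[str]:
        """Targets reachable over one edge type, filtered by node label."""
        return [t for t in self._out.get((source, rel_type), [])
                if self.labels.get(t) == label]


def friends_at_companies_using(graph: Graph, person: str, language: str) -> list[tuple[str, str]]:
    """Person -FRIENDS_WITH-> friend -WORKS_AT-> company -USES-> language."""
    results = []
    for friend in graph.neighbors(person, "FRIENDS_WITH", "Person"):
        for company in graph.neighbors(friend, "WORKS_AT", "Company"):
            if language in graph.neighbors(company, "USES", "Language"):
                results.append((friend, company))
    return results


g = Graph()
for name, label in [("Alice", "Person"), ("Bob", "Person"),
                    ("Python", "Language"), ("DataCo", "Company")]:
    g.add_node(name, label)
g.add_edge("Alice", "FRIENDS_WITH", "Bob")
g.add_edge("Alice", "KNOWS", "Python")
g.add_edge("Bob", "WORKS_AT", "DataCo")
g.add_edge("DataCo", "USES", "Python")
result = friends_at_companies_using(g, "Alice", "Python")
```

`result` is `[("Bob", "DataCo")]`, the same row the Cypher query returns. Each hop is a dictionary lookup from the current node, so the cost depends on how many edges are followed rather than on the total size of the dataset.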
Best use cases: Social networks, recommendation engines, fraud detection, knowledge graphs, network topology, identity and access management.
Eric Brewer's CAP Theorem states that a distributed system can guarantee at most two of these three properties simultaneously:
Consistency (C), Availability (A), and Partition Tolerance (P).
CA: Traditional RDBMS (single node) — Not partition tolerant
CP: MongoDB, HBase, Zookeeper — Sacrifices availability during partitions
AP: Cassandra, CouchDB, DynamoDB — Sacrifices consistency (eventual)
In a distributed system, network partitions will happen. You must choose between Consistency and Availability when they do.
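The choice between consistency and availability can be illustrated with a deliberately tiny simulation. The assumptions are loud ones: the hypothetical `Cluster` has exactly two replicas, a "CP" cluster rejects every write while partitioned, and an "AP" cluster accepts writes on whichever replica the client reaches. Real systems use quorums and are more nuanced.

```python
class Unavailable(Exception):
    """Raised when a CP cluster refuses a request during a partition."""


class Cluster:
    """Toy two-replica cluster that is either 'CP' or 'AP'."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = [None, None]   # one stored value per replica
        self.partitioned = False     # True when the replicas cannot talk

    def write(self, replica_index: int, value) -> None:
        if not self.partitioned:
            # Healthy network: the write reaches both replicas.
            self.values = [value, value]
        elif self.mode == "CP":
            # Consistency first: refuse rather than let the replicas diverge.
            raise Unavailable("cannot reach the other replica")
        else:
            # Availability first: accept locally, reconcile after the partition heals.
            self.values[replica_index] = value

    def read(self, replica_index: int):
        return self.values[replica_index]


ap = Cluster("AP")
ap.write(0, "v1")
ap.partitioned = True
ap.write(0, "v2")   # accepted, but replica 1 still holds "v1"
```

After the partitioned write, `ap.read(0)` returns `"v2"` while `ap.read(1)` still returns `"v1"`. The same sequence against `Cluster("CP")` raises `Unavailable` on the second write, keeping both replicas at `"v1"`.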
Most NoSQL databases follow BASE semantics instead of ACID:
| ACID | BASE |
|---|---|
| Atomic | Basically Available |
| Consistent | Soft state |
| Isolated | Eventually consistent |
| Durable | |
Eventual consistency means: if no new writes are made, all replicas will eventually converge to the same value. You might read stale data immediately after a write, but replicas typically converge within milliseconds to seconds, after which every read returns the latest value.
For a social media "like" count, this is perfectly acceptable. For a bank balance, it is not.
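Convergence can be sketched with two replicas that exchange their state in the background. This example assumes last-write-wins conflict resolution based on a timestamp attached to each write, which is one common strategy among several. `LwwReplica` and `sync` are hypothetical names.

```python
class LwwReplica:
    """Replica that keeps, per key, the value with the newest timestamp."""

    def __init__(self) -> None:
        self.store: dict[str, tuple[int, int]] = {}  # key -> (timestamp, value)

    def write(self, key: str, value: int, timestamp: int) -> None:
        current = self.store.get(key)
        # Last write wins: ignore anything older than what we already hold.
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key: str):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def sync(a: LwwReplica, b: LwwReplica) -> None:
    """Background exchange: each replica replays the other's writes."""
    for source, target in ((a, b), (b, a)):
        for key, (timestamp, value) in list(source.store.items()):
            target.write(key, value, timestamp)


replica_a, replica_b = LwwReplica(), LwwReplica()
replica_a.write("likes:post_1", 10, timestamp=1)
sync(replica_a, replica_b)

replica_a.write("likes:post_1", 11, timestamp=2)   # new write lands on A only
stale = replica_b.read("likes:post_1")             # B has not heard about it yet
sync(replica_a, replica_b)
fresh = replica_b.read("likes:post_1")
```

`stale` is 10 and `fresh` is 11: the read between the write and the sync sees old data, and once the replicas exchange state they agree. For a like counter that brief window is harmless. For a bank balance, the same window could let two withdrawals both succeed against the old value.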
| Property | RDBMS (PostgreSQL) | Document (MongoDB) | Key-Value (Redis) | Column (Cassandra) | Graph (Neo4j) |
|---|---|---|---|---|---|
| Data model | Tables & rows | JSON documents | Key → value | Wide rows/partitions | Nodes & edges |
| Schema | Rigid (defined) | Flexible | None | Flexible columns | Flexible |
| ACID support | Full | Single-document by default; multi-document transactions since 4.0 | Limited | Limited (row-level atomicity, lightweight transactions) | Full |
| Scaling | Vertical (mainly) | Horizontal | Horizontal | Horizontal | Vertical (mainly) |
| Query language | SQL | MQL / aggregation | Commands | CQL (SQL-like) | Cypher |
| Best for | Transactions, reports | Content, catalogs | Cache, sessions | Time-series, IoT | Connected data |
| Consistency | Strong | Configurable | Strong (single node) | Tunable (eventual by default) | Strong |
| Joins | Excellent | Limited ($lookup) | None | None | Native (traversal) |
Choosing a database is choosing a tradeoff. The best database is the one whose tradeoffs align with your application's actual requirements — not the most popular one.