AiTechWorlds
AiTechWorlds
A new school has 100 lockers, numbered 0 to 99. The school assigns lockers using a formula based on student ID numbers. Most students get a unique locker. But one day, two students — ID 204 and ID 304 — both compute to locker number 4. They arrive, and both try to open the same locker.
The school needs a plan. How do you handle two things that need the same spot?
This is a collision in a hash table, and every real hash table must have a strategy to deal with it.
Collision: A situation where two different keys produce the same hash value (and therefore map to the same array index).
Even with a perfect hash function, collisions are mathematically unavoidable if you have more possible keys than array slots. The birthday paradox shows that with just 23 people, there is a 50% chance two share a birthday — hash collisions happen far more frequently than intuition suggests.
# Simple demonstration of a collision with a small table
def simple_hash(key, size=10):
return hash(key) % size
print(simple_hash("listen", 10)) # Might output: 5
print(simple_hash("silent", 10)) # Also might output: 5 — collision!
When two keys land at the same index, the table must have a plan. There are several well-established strategies.
Chaining stores multiple items at the same index using a linked list (or in Python's case, a list). Every slot holds not just one item but a chain of items that all hashed to that index.
Index 4: → [("student_204", "Alice")] → [("student_304", "Bob")] → None
class ChainingHashTable:
def __init__(self, size=10):
self.size = size
self.table = [[] for _ in range(self.size)] # Each slot holds a list
def _hash(self, key):
return hash(key) % self.size
def set(self, key, value):
index = self._hash(key)
chain = self.table[index]
for i, (k, v) in enumerate(chain):
if k == key:
chain[i] = (key, value) # Update existing key
return
chain.append((key, value)) # No existing key — add new pair
def get(self, key):
index = self._hash(key)
for k, v in self.table[index]: # Search the chain at this index
if k == key:
return v
return None
Chaining pros: Simple to implement, handles high load factors gracefully.
Chaining cons: Extra memory for pointers; cache-unfriendly (linked list nodes scattered in memory).
Instead of a linked list at each slot, open addressing stores only one element per slot. When a collision occurs, the algorithm probes for the next available empty slot.
Try index + 1, then index + 2, and so on until an empty slot is found:
class OpenAddressingHashTable:
def __init__(self, size=10):
self.size = size
self.table = [None] * self.size
self.DELETED = "<DELETED>" # Tombstone marker for deleted slots
def _hash(self, key):
return hash(key) % self.size
def set(self, key, value):
index = self._hash(key)
while self.table[index] is not None and self.table[index] != self.DELETED:
if self.table[index][0] == key: # Found existing key — update
self.table[index] = (key, value)
return
index = (index + 1) % self.size # Probe next slot (wrap around)
self.table[index] = (key, value)
def get(self, key):
index = self._hash(key)
while self.table[index] is not None:
if self.table[index] != self.DELETED and self.table[index][0] == key:
return self.table[index][1]
index = (index + 1) % self.size
return None
Open addressing pros: Better cache performance (all data in one array), lower memory overhead.
Open addressing cons: Clustering — many collisions bunch up nearby, slowing future lookups. Deletions are tricky (need tombstone markers).
A smarter variant of open addressing. When inserting and a collision occurs, the new element "steals" the slot from whichever element is furthest from its ideal position (the "rich" one), and the displaced element continues probing. This balances probe lengths across all elements, keeping performance more consistent. Many modern high-performance hash tables (including Rust's HashMap) use this strategy.
Python's dict uses open addressing with pseudo-random probing — not simple linear probing. The probing sequence mixes in bits of the hash value to avoid clustering:
next_index = (5 * current_index + 1 + perturb) % table_size
Key facts about CPython dicts:
d = {}
d["banana"] = 2
d["apple"] = 5
d["cherry"] = 1
for key, value in d.items():
print(key, value)
Output:
banana 2
apple 5
cherry 1
The order is insertion order — not sorted, not random. This is guaranteed behavior in Python 3.7 and later.
A malicious attacker could craft inputs where all keys hash to the same index, turning O(1) lookups into O(n) — effectively a denial of service. Python defends against this with hash randomization (PYTHONHASHSEED), which adds a random seed to string hashes at startup, making collisions unpredictable for attackers.
A Python set is simply a hash table where each slot stores only a key — no associated value. This makes membership testing (in) run in O(1):
names = {"Alice", "Bob", "Charlie"}
print("Alice" in names) # O(1) — True
print("Diana" in names) # O(1) — False
Sets are perfect for deduplication and fast membership checks.
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
freq = {}
for word in words:
freq[word] = freq.get(word, 0) + 1 # Increment count, default 0
print(freq)
# Output: {'apple': 3, 'banana': 2, 'cherry': 1}
students = [("Alice", "Math"), ("Bob", "Science"), ("Carol", "Math"), ("Dave", "Science")]
groups = {}
for name, subject in students:
groups.setdefault(subject, []).append(name)
print(groups)
# Output: {'Math': ['Alice', 'Carol'], 'Science': ['Bob', 'Dave']}
cache = {}
def fibonacci(n):
if n in cache:
return cache[n] # O(1) lookup — already computed
if n <= 1:
return n
result = fibonacci(n-1) + fibonacci(n-2)
cache[n] = result # Store result for future calls
return result
print(fibonacci(50)) # 12586269025 — fast because of caching
# Build a dict from a list in one line
squares = {x: x**2 for x in range(6)}
print(squares)
# Output: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
# Filter while building
even_squares = {x: x**2 for x in range(10) if x % 2 == 0}
print(even_squares)
# Output: {0: 0, 2: 4, 4: 16, 6: 36, 8: 64}
Collisions are inevitable in hash tables. Chaining handles them with linked lists per slot; open addressing probes for the next free slot. Python uses a sophisticated open addressing scheme with random probing, resizes before getting too full, and has maintained insertion order since Python 3.7. Understanding these internals explains why Python dicts are so fast, why you should use them for frequency counting and caching, and why you should never use a mutable object (like a list) as a dict key.
Get this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises