Neural Networks from Scratch
Neural networks are the foundation of all deep learning — and building one from scratch is the best way to understand what's actually happening inside PyTorch or TensorFlow. Once you've built a neural network using only NumPy, frameworks become obvious rather than magical.
The Biological Inspiration (and Why It Doesn't Matter)
Yes, neural networks were inspired by the brain. But thinking of them as brain simulations is more confusing than helpful. Think of them as this:
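A neural network is a function approximator: it learns a mapping from inputs to outputs by adjusting its parameters (weights and biases) based on examples. The network built in this lesson has a few dozen parameters; large production models have billions.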
The Architecture: Layers of Neurons
Input Layer → Hidden Layer(s) → Output Layer
[x1] \
[x2] → [neuron][neuron][neuron] → [neuron][neuron] → [output]
[x3] /
Each "neuron" does two things:
- Computes a weighted sum: z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
- Applies a non-linear activation function: a = f(z)
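The two steps above can be sketched for a single neuron as follows. The input values, weights, and biases are made up for illustration, and ReLU stands in for the generic activation f.

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: weighted sum followed by an activation (ReLU here)."""
    z = np.dot(w, x) + b   # z = w1*x1 + w2*x2 + ... + wn*xn + b
    a = max(0.0, z)        # a = f(z), with f chosen as ReLU for this sketch
    return z, a

# Illustrative values, not taken from the lesson
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 1.0])

z_pos, a_pos = neuron(x, w, b=0.5)    # weighted sum is positive, passes through
z_neg, a_neg = neuron(x, w, b=-2.0)   # weighted sum is negative, clipped to 0

print(z_pos, a_pos)
print(z_neg, a_neg)
```

With these numbers the weighted sum before the bias is 1.5, so the first call gives z = 2.0 and a = 2.0, while the second gives z = -0.5 and a = 0.0.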
Without activation functions, a neural network is just a fancy linear model — it cannot learn non-linear patterns.
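A quick way to see why: two linear layers stacked without an activation can always be rewritten as one linear layer. The sketch below uses random weights and arbitrary shapes chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 samples, 3 features (arbitrary)

# Two "layers" with no activation function between them
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=(1, 1))
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping expressed as a single linear layer
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b

print(np.allclose(two_layers, one_layer))  # True
```

However many linear layers are stacked, the result is still one matrix multiply plus a bias. The activation function is what breaks this collapse.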
Activation Functions
import numpy as np

# ReLU: most common for hidden layers
def relu(x):
    return np.maximum(0, x)

# Sigmoid: for binary output (0 to 1)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Softmax: for multi-class output; expects shape (batch, classes), each row sums to 1
def softmax(x):
    # Subtract each row's max for numerical stability
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / exp_x.sum(axis=1, keepdims=True)
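ReLU is the default choice for hidden layers. It's simple, fast, and largely avoids the "vanishing gradient" problem that plagued older activation functions like sigmoid: ReLU's gradient is exactly 1 for positive inputs, whereas sigmoid's gradient never exceeds 0.25 and shrinks toward 0 for large inputs.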
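To make the gradient comparison concrete, here is a small sketch of both derivatives. The helper names `sigmoid_grad` and `relu_grad` are introduced here for illustration, and the 10-layer product is a deliberately rough model that ignores the weights.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)              # largest value is 0.25, reached at x = 0

def relu_grad(x):
    return (x > 0).astype(float)    # 1 for positive inputs, 0 otherwise

z = np.array([-4.0, -1.0, 0.5, 4.0])
print(sigmoid_grad(z))   # every entry is at most 0.25
print(relu_grad(z))      # entries are exactly 0 or 1

# Rough illustration: best-case activation gradient multiplied through 10 layers
depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth   # 0.25 ** 10, below one millionth
relu_chain = 1.0 ** depth                    # stays at 1 on the active path
print(sigmoid_chain, relu_chain)
```

The flip side is visible in the second printed array: ReLU passes no gradient at all for negative inputs, so a neuron that is never active stops learning.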
Forward Propagation
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        np.random.seed(42)
        # He initialization: well suited to ReLU
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def forward(self, X):
        # Layer 1: linear + ReLU
        self.z1 = X @ self.W1 + self.b1        # Matrix multiply
        self.a1 = np.maximum(0, self.z1)       # ReLU activation
        # Layer 2 (output): linear + sigmoid
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = 1 / (1 + np.exp(-self.z2))   # Sigmoid for binary output
        return self.a2
For a network with 3 inputs, 4 hidden neurons, and 1 output:
- W1 shape: (3, 4), so each input connects to each hidden neuron
- W2 shape: (4, 1), so each hidden neuron connects to the output
- Training means finding the right values for W1, b1, W2, b2
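The shapes above can be checked by pushing a small batch through the same two layers. This is a standalone sketch: the batch size of 5 and the random weights are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 5                                  # hypothetical batch size
X = rng.normal(size=(batch, 3))            # 3 input features

W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

a1 = np.maximum(0, X @ W1 + b1)            # (5, 3) @ (3, 4) -> (5, 4)
a2 = 1 / (1 + np.exp(-(a1 @ W2 + b2)))     # (5, 4) @ (4, 1) -> (5, 1)

n_params = W1.size + b1.size + W2.size + b2.size
print(a1.shape, a2.shape, n_params)        # (5, 4) (5, 1) 21
```

The biases have shape (1, 4) and (1, 1), and NumPy broadcasting adds them to every row of the batch. In total this network has 12 + 4 + 4 + 1 = 21 trainable parameters.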
The Loss Function
The loss function measures how wrong the predictions are:
def binary_cross_entropy(y_true, y_pred):
    # Clip to prevent log(0)
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
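To get a feel for the scale of this loss, the sketch below evaluates it on three hand-picked sets of predictions for the same two labels. The prediction values are made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([[1.0], [0.0]])

cases = {
    "confident and right": np.array([[0.9], [0.1]]),
    "unsure":              np.array([[0.5], [0.5]]),
    "confident and wrong": np.array([[0.1], [0.9]]),
}

losses = {name: binary_cross_entropy(y_true, pred) for name, pred in cases.items()}
for name, value in losses.items():
    print(f"{name}: {value:.3f}")
# confident and right: 0.105
# unsure: 0.693
# confident and wrong: 2.303
```

The loss grows slowly for hesitant predictions and sharply for confident mistakes, which is the behavior that makes it a useful training signal for a sigmoid output.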
Backpropagation
Backpropagation computes the gradient of the loss with respect to each weight — how much should each weight change to reduce the loss?
    # Add this method to the NeuralNetwork class; call forward() first so z1, a1, a2 are set
    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[0]  # Batch size

        # Output layer gradient
        dz2 = self.a2 - y                # ∂Loss/∂z2 per sample (sigmoid + cross-entropy)
        dW2 = (self.a1.T @ dz2) / m      # ∂Loss/∂W2, averaged over the batch
        db2 = dz2.mean(axis=0, keepdims=True)

        # Hidden layer gradient
        da1 = dz2 @ self.W2.T
        dz1 = da1 * (self.z1 > 0)        # ReLU derivative (1 if z>0, 0 otherwise)
        dW1 = (X.T @ dz1) / m
        db1 = dz1.mean(axis=0, keepdims=True)

        # Update weights using gradient descent
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
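A common sanity check for hand-written backpropagation is to compare the analytic gradients with finite-difference estimates. The sketch below re-implements the same formulas as standalone functions so it runs on its own. The function names, the random data, and the step size `eps` are choices made for this example and are not part of the lesson's class.

```python
import numpy as np

def forward(params, X):
    W1, b1, W2, b2 = params
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)
    a2 = 1 / (1 + np.exp(-(a1 @ W2 + b2)))
    return z1, a1, a2

def loss_fn(params, X, y):
    a2 = np.clip(forward(params, X)[2], 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))

def analytic_grads(params, X, y):
    """Same gradient formulas as the backward pass above."""
    W1, b1, W2, b2 = params
    m = X.shape[0]
    z1, a1, a2 = forward(params, X)
    dz2 = a2 - y
    dW2 = a1.T @ dz2 / m
    db2 = dz2.mean(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (z1 > 0)
    dW1 = X.T @ dz1 / m
    db1 = dz1.mean(axis=0, keepdims=True)
    return [dW1, db1, dW2, db2]

def numerical_grads(params, X, y, eps=1e-6):
    """Central differences: nudge one parameter at a time and re-evaluate the loss."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus = loss_fn(params, X, y)
            p[idx] = old - eps
            minus = loss_fn(params, X, y)
            p[idx] = old                      # restore the original value
            g[idx] = (plus - minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = rng.integers(0, 2, size=(6, 1)).astype(float)
params = [rng.normal(size=(3, 4)), rng.normal(size=(1, 4)),
          rng.normal(size=(4, 1)), rng.normal(size=(1, 1))]

analytic = analytic_grads(params, X, y)
numeric = numerical_grads(params, X, y)
max_diff = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(f"largest difference: {max_diff:.2e}")
```

If the formulas are right, the largest difference should be tiny (many orders of magnitude below the gradients themselves). A large gap usually points to a missing factor of 1/m or a wrong transpose.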
Complete Training Loop
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate non-linearly separable data (logistic regression would fail here)
X, y = make_circles(n_samples=1000, noise=0.1, random_state=42)
y = y.reshape(-1, 1)
# Fix the split seed so runs are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
nn = NeuralNetwork(input_size=2, hidden_size=8, output_size=1)
losses = []
for epoch in range(1000):
    # Forward pass
    y_pred = nn.forward(X_train)
    loss = binary_cross_entropy(y_train, y_pred)
    losses.append(loss)
    # Backward pass + update
    nn.backward(X_train, y_train, learning_rate=0.1)
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss:.4f}")

# Evaluate
y_test_pred = (nn.forward(X_test) > 0.5).astype(int)
accuracy = (y_test_pred == y_test).mean()
print(f"\nTest Accuracy: {accuracy:.3f}")
What the Network Learns
import matplotlib.pyplot as plt
# Visualize decision boundary
xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])
Z = nn.forward(grid).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdBu')
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test.ravel(), cmap='RdBu', edgecolors='k')
plt.title('Neural Network Decision Boundary')
plt.show()
The network learns a non-linear boundary, something logistic regression on the raw features cannot do, because its decision boundary is always a straight line.
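The limitation of a straight-line boundary can be checked directly. The sketch below is NumPy-only and makes several assumptions for the sake of a clear demonstration: it generates its own concentric circles (with a wider gap than the `make_circles` data above), trains a plain logistic regression by gradient descent, and measures accuracy on the training data. The helpers `two_circles`, `fit_logistic`, and `accuracy` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_circles(n_per_class=500, inner=0.5, noise=0.05):
    """Concentric circles: label 1 = inner ring, label 0 = outer ring."""
    angles = rng.uniform(0, 2 * np.pi, size=2 * n_per_class)
    radii = np.concatenate([np.full(n_per_class, inner), np.ones(n_per_class)])
    X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
    X += rng.normal(scale=noise, size=X.shape)
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return X, y

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Plain logistic regression trained with batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    return np.mean(((X @ w + b) > 0) == (y == 1))

X, y = two_circles()

# Raw coordinates: only a straight-line boundary is available
acc_raw = accuracy(X, y, *fit_logistic(X, y))

# Hand-crafted features x1^2 and x2^2 let the same linear model express a circle
X_sq = np.column_stack([X, X ** 2])
acc_sq = accuracy(X_sq, y, *fit_logistic(X_sq, y))

print(f"raw features: {acc_raw:.3f}, with squared features: {acc_sq:.3f}")
```

On the raw coordinates the linear model should stay close to chance, while the squared features should separate the rings almost perfectly. A hidden layer learns comparable features from the data instead of having them supplied by hand.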
Key Concepts to Remember
| Concept | What It Does |
|---|---|
| Forward propagation | Computes predictions from input to output |
| Loss function | Measures prediction error |
| Backpropagation | Computes gradients of loss w.r.t. weights |
| Gradient descent | Updates weights to reduce loss |
| Learning rate | How big each weight update step is |
| Activation function | Adds non-linearity so the network can learn complex patterns |
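The effect of the learning rate is easiest to see on a toy problem. The sketch below runs gradient descent on f(w) = w², a function chosen here only because its gradient, 2w, is known in closed form. The three learning rates are illustrative values.

```python
def gradient_descent(lr, steps=20, w0=5.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w      # same update rule as in training: w -= lr * gradient
    return w

results = {lr: gradient_descent(lr) for lr in (0.01, 0.1, 1.1)}
for lr, w in results.items():
    print(f"lr={lr}: w after 20 steps = {w:.4f}")
```

The minimum is at w = 0. With a small rate, w creeps toward it slowly. With a moderate rate it gets close within 20 steps. With a rate that is too large, each step overshoots and w ends up farther from 0 than where it started.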
Next lesson: Backpropagation Explained — understanding the math that makes training work.