Neural Networks from Scratch

Neural networks are the foundation of deep learning, and building one from scratch is the best way to see what is actually happening inside PyTorch or TensorFlow. Once you've built a neural network using only NumPy, the frameworks stop looking like magic.

The Biological Inspiration (and Why It Doesn't Matter)

Yes, neural networks were inspired by the brain. But thinking of them as brain simulations is more confusing than helpful. A more useful mental model:

A neural network is a function approximator: it learns a mapping from inputs to outputs by adjusting its parameters (the weights and biases, from a few dozen in this lesson to billions in large models) based on examples.

The Architecture: Layers of Neurons

Input Layer → Hidden Layer(s) → Output Layer

[x1]  \                        
[x2]  → [neuron][neuron][neuron] → [neuron][neuron] → [output]
[x3]  /

Each "neuron" does two things:

Computes a weighted sum: z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
Applies a non-linear activation function: a = f(z)
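
As a minimal sketch of those two steps, here is a single neuron evaluated by hand. The input, weight, and bias values are made up for illustration, and ReLU is assumed as the activation function f.

```python
import numpy as np

# Illustrative values only (not taken from a trained model)
x = np.array([1.0, 2.0, 3.0])      # inputs x1..x3
w = np.array([0.5, -1.0, 0.25])    # weights w1..w3
b = 0.5                            # bias

z = np.dot(w, x) + b               # step 1: weighted sum
a = np.maximum(0.0, z)             # step 2: non-linear activation (ReLU)

print(z, a)
```

With these numbers the weighted sum is negative, so ReLU clips the neuron's output to zero.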

Without activation functions, a neural network is just a fancy linear model: a stack of linear layers collapses into a single linear map, so no matter how many layers you add it cannot learn non-linear patterns.
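
One way to convince yourself of this is to compose two linear layers with no activation in between and compare the result with a single equivalent layer. The shapes and random values below are arbitrary, chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 samples, 3 features

# Two "layers" with no activation in between
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=(1, 1))
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b

print(np.allclose(two_layers, one_layer))
```

The two outputs agree up to floating-point error, so the extra layer adds no expressive power until a non-linearity is placed between the two.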

Activation Functions

import numpy as np

# ReLU — most common for hidden layers
def relu(x):
    return np.maximum(0, x)

# Sigmoid — for binary output (0 to 1)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Softmax — for multi-class output (probabilities summing to 1)
def softmax(x):
    # Subtract each row's own max for numerical stability (x has shape: batch x classes)
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / exp_x.sum(axis=1, keepdims=True)
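
A quick sanity check for a softmax over a batch, sketched under the assumption that the maximum is subtracted per row. The helper name `softmax_rows` and the logit values are hypothetical, used only for this illustration.

```python
import numpy as np

def softmax_rows(x):
    # Hypothetical helper: subtract each row's own max before exponentiating
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / exp_x.sum(axis=1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])   # a naive np.exp would overflow here
probs = softmax_rows(logits)

print(probs.sum(axis=1))                 # each row sums to 1
print(np.allclose(probs[0], probs[1]))   # adding a constant to a row leaves softmax unchanged
```

Both rows produce the same probabilities because softmax depends only on the differences between logits, which is why subtracting the max is safe.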

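ReLU is the default choice for hidden layers. It is simple and cheap to compute, and because its gradient is exactly 1 for positive inputs it largely avoids the "vanishing gradient" problem that affects saturating activation functions like sigmoid, whose gradient never exceeds 0.25.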
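
To make the vanishing-gradient point concrete, the sketch below compares the local gradients of sigmoid and ReLU at a few sample inputs. It is a simplification: it ignores the weight matrices, which also scale gradients in a real network, and the depth of 10 layers is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

z = np.array([-4.0, 0.0, 4.0])

sigmoid_grad = sigmoid(z) * (1 - sigmoid(z))   # peaks at 0.25 when z = 0
relu_grad = (z > 0).astype(float)              # exactly 1 for positive inputs

print(sigmoid_grad)
print(relu_grad)

# Backpropagating through 10 layers multiplies 10 such factors together
print(0.25 ** 10)   # best case for sigmoid
print(1.0 ** 10)    # active ReLU units
```

Even in the best case the sigmoid factor shrinks the gradient by roughly a million over ten layers, while active ReLU units pass it through unchanged.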

Forward Propagation

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        np.random.seed(42)
        # He initialization — better for ReLU
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2 / hidden_size)
        self.b2 = np.zeros((1, output_size))
    
    def forward(self, X):
        # Layer 1: linear + ReLU
        self.z1 = X @ self.W1 + self.b1      # Matrix multiply
        self.a1 = np.maximum(0, self.z1)     # ReLU activation
        
        # Layer 2 (output): linear + sigmoid
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = 1 / (1 + np.exp(-self.z2))  # Sigmoid for binary output
        return self.a2

For a network with 3 inputs, 4 hidden neurons, and 1 output:

W1 shape: (3, 4) — each input connects to each hidden neuron
W2 shape: (4, 1) — each hidden neuron connects to the output
Training means finding the right values for W1, b1, W2, b2
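
The shapes above can be checked with placeholder arrays. This sketch uses zero-filled weights and a batch of 5 samples (an arbitrary choice) purely to trace the dimensions and count the parameters.

```python
import numpy as np

input_size, hidden_size, output_size = 3, 4, 1
X = np.ones((5, input_size))                  # a batch of 5 samples (illustrative)

W1 = np.zeros((input_size, hidden_size))
b1 = np.zeros((1, hidden_size))
W2 = np.zeros((hidden_size, output_size))
b2 = np.zeros((1, output_size))

hidden = X @ W1 + b1       # (5, 3) @ (3, 4) -> (5, 4); b1 broadcasts over the batch
output = hidden @ W2 + b2  # (5, 4) @ (4, 1) -> (5, 1)

n_params = W1.size + b1.size + W2.size + b2.size
print(hidden.shape, output.shape, n_params)
```

This network has 12 + 4 + 4 + 1 = 21 trainable parameters in total.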

The Loss Function

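The loss function measures how wrong the predictions are. For a sigmoid output and 0/1 labels, the usual choice is binary cross-entropy, which penalizes confident wrong predictions most heavily: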

def binary_cross_entropy(y_true, y_pred):
    # Clip to prevent log(0)
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
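
To get a feel for the scale of this loss, the sketch below evaluates it for one positive example at three predicted probabilities. The function is repeated under the shorter name `bce` so the snippet runs on its own, and the probabilities are illustrative.

```python
import numpy as np

def bce(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([[1.0]])

loss_confident_right = bce(y_true, np.array([[0.9]]))   # -log(0.9)
loss_unsure = bce(y_true, np.array([[0.5]]))            # -log(0.5)
loss_confident_wrong = bce(y_true, np.array([[0.1]]))   # -log(0.1)

print(loss_confident_right, loss_unsure, loss_confident_wrong)
```

The loss grows sharply as the prediction moves away from the true label, so a confident mistake costs far more than an uncertain guess.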

Backpropagation

Backpropagation computes the gradient of the loss with respect to each weight — how much should each weight change to reduce the loss?
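
For a single example, the output-layer gradient has a compact closed form when a sigmoid output is paired with binary cross-entropy. A short derivation, writing a_2 and z_2 for the output activation and pre-activation as in the forward pass, and L for the per-example loss:

```latex
\begin{aligned}
L &= -\bigl[\, y \log a_2 + (1 - y)\log(1 - a_2) \,\bigr], \qquad a_2 = \sigma(z_2) \\
\frac{\partial L}{\partial a_2} &= -\frac{y}{a_2} + \frac{1 - y}{1 - a_2} = \frac{a_2 - y}{a_2 (1 - a_2)} \\
\frac{\partial a_2}{\partial z_2} &= a_2 (1 - a_2) \\
\frac{\partial L}{\partial z_2} &= \frac{\partial L}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} = a_2 - y
\end{aligned}
```

The a_2 (1 - a_2) factors cancel, which is why the code can start from a2 - y directly. Averaging over the m examples in a batch supplies the 1/m factor in the weight and bias gradients.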

# This is a method of NeuralNetwork: indent it inside the class body, below forward()
def backward(self, X, y, learning_rate=0.01):
    m = X.shape[0]  # Batch size
    
    # Output layer gradient
    dz2 = self.a2 - y                  # ∂Loss/∂z2
    dW2 = (self.a1.T @ dz2) / m       # ∂Loss/∂W2
    db2 = dz2.mean(axis=0, keepdims=True)
    
    # Hidden layer gradient
    da1 = dz2 @ self.W2.T
    dz1 = da1 * (self.z1 > 0)         # ReLU derivative (1 if z>0, 0 otherwise)
    dW1 = (X.T @ dz1) / m
    db1 = dz1.mean(axis=0, keepdims=True)
    
    # Update weights using gradient descent
    self.W2 -= learning_rate * dW2
    self.b2 -= learning_rate * db2
    self.W1 -= learning_rate * dW1
    self.b1 -= learning_rate * db1
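
A common way to test hand-written gradients is to compare them against finite differences. The sketch below does this for dW1 only, using standalone functions (`forward`, `loss_fn`, `analytic_grad_W1`) that mirror the formulas above rather than the class itself. The function names, tiny network size, and random data are assumptions made for the check.

```python
import numpy as np

def forward(params, X):
    W1, b1, W2, b2 = params
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)
    a2 = 1 / (1 + np.exp(-(a1 @ W2 + b2)))
    return z1, a1, a2

def loss_fn(params, X, y):
    a2 = np.clip(forward(params, X)[2], 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))

def analytic_grad_W1(params, X, y):
    # Same formulas as the backward pass, restricted to dW1
    W1, b1, W2, b2 = params
    z1, a1, a2 = forward(params, X)
    dz2 = a2 - y
    dz1 = (dz2 @ W2.T) * (z1 > 0)
    return (X.T @ dz1) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = rng.integers(0, 2, size=(6, 1)).astype(float)
params = [rng.normal(size=(2, 3)), rng.normal(size=(1, 3)),
          rng.normal(size=(3, 1)), np.zeros((1, 1))]

# Central finite differences: nudge one weight at a time and re-evaluate the loss
eps = 1e-5
numeric = np.zeros_like(params[0])
for i in range(numeric.shape[0]):
    for j in range(numeric.shape[1]):
        original = params[0][i, j]
        params[0][i, j] = original + eps
        loss_plus = loss_fn(params, X, y)
        params[0][i, j] = original - eps
        loss_minus = loss_fn(params, X, y)
        params[0][i, j] = original
        numeric[i, j] = (loss_plus - loss_minus) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_grad_W1(params, X, y)))
print(max_diff)
```

If the analytic gradient is right, the largest difference should be tiny compared with the gradients themselves. A large gap usually points to a missing 1/m, a transposed matrix, or a wrong activation derivative.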

Complete Training Loop

import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate non-linearly separable data (logistic regression would fail here)
X, y = make_circles(n_samples=1000, noise=0.1, random_state=42)
y = y.reshape(-1, 1)

# Fix random_state so the split (and the reported accuracy) is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
nn = NeuralNetwork(input_size=2, hidden_size=8, output_size=1)

losses = []
for epoch in range(1000):
    # Forward pass
    y_pred = nn.forward(X_train)
    loss = binary_cross_entropy(y_train, y_pred)
    losses.append(loss)
    
    # Backward pass + update
    nn.backward(X_train, y_train, learning_rate=0.1)
    
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss:.4f}")

# Evaluate
y_test_pred = (nn.forward(X_test) > 0.5).astype(int)
accuracy = (y_test_pred == y_test).mean()
print(f"\nTest Accuracy: {accuracy:.3f}")

What the Network Learns

import matplotlib.pyplot as plt

# Visualize decision boundary
xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])
Z = nn.forward(grid).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdBu')
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test.ravel(), cmap='RdBu', edgecolors='k')
plt.title('Neural Network Decision Boundary')
plt.show()

The network learns a non-linear boundary, which plain logistic regression on the raw features cannot represent because its decision boundary is always a straight line.
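
For comparison, here is a rough linear baseline on similar data. To keep the snippet NumPy-only, it builds two noisy concentric rings by hand as a stand-in for make_circles and trains logistic regression with plain gradient descent. The ring radii, noise level, learning rate, and step count are assumptions, and accuracy is measured on the training points.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # points per ring

# Hand-rolled stand-in for make_circles: outer ring (label 0), inner ring (label 1)
angles = rng.uniform(0, 2 * np.pi, size=2 * n)
radii = np.concatenate([np.full(n, 1.0), np.full(n, 0.8)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X += rng.normal(scale=0.1, size=X.shape)
y = np.concatenate([np.zeros(n), np.ones(n)]).reshape(-1, 1)

# Logistic regression is the same model with no hidden layer
w = np.zeros((2, 1))
b = np.zeros((1, 1))
for _ in range(1000):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(X)
    b -= 0.1 * np.mean(p - y, axis=0, keepdims=True)

p = 1 / (1 + np.exp(-(X @ w + b)))
accuracy = ((p > 0.5) == (y == 1)).mean()
print(f"Linear baseline accuracy: {accuracy:.3f}")
```

Because the rings are symmetric around the origin, no straight line separates them well, so the baseline should land close to chance.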

Key Concepts to Remember

Concept	What It Does
Forward propagation	Computes predictions from input to output
Loss function	Measures prediction error
Backpropagation	Computes gradients of loss w.r.t. weights
Gradient descent	Updates weights to reduce loss
Learning rate	How big each weight update step is
Activation function	Adds non-linearity so the network can learn complex patterns

Next lesson: Backpropagation Explained — understanding the math that makes training work.