Neural Networks Explained: From Perceptron to Deep Learning
Neural networks explained clearly: how they work, from the single perceptron to deep learning, with visual intuitions and the math you need to understand them.
My first attempt to understand neural networks from a textbook ended in frustration. The math notation was dense, the diagrams were abstract, and the connection between the equations and the magical pattern-recognition I'd heard about was completely unclear.
What finally made it click was a different mental model: stop thinking about neurons and brains, and think about function composition. A neural network is a chain of mathematical functions that gradually transform your input into your output — and through training, those functions get tuned to make the transformation accurate.
This guide gives you that mental model, built bottom-up from the simplest possible neural network to the deep learning systems behind modern AI. No assumed math background beyond high school algebra, though we'll reference calculus where it matters.
The Simplest Neural Network: A Single Perceptron
The perceptron, introduced by Frank Rosenblatt in 1957, is the building block from which later neural networks grew.
It takes inputs, multiplies each by a weight, sums the result, adds a bias, then passes through an activation function:
Inputs: x₁ = 0.5, x₂ = 0.3, x₃ = 0.8
Weights: w₁ = 0.4, w₂ = -0.2, w₃ = 0.7
Bias: b = 0.1
Weighted sum: z = (0.5 × 0.4) + (0.3 × -0.2) + (0.8 × 0.7) + b
z = 0.20 + (-0.06) + 0.56 + 0.10 = 0.80
Activation: output = sigmoid(0.80) = 1 / (1 + e^(-0.80)) ≈ 0.69
The output (0.69) can be interpreted as a probability or confidence. If you're classifying emails as spam or not-spam, an output of 0.69 might mean "69% confident this is spam."
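The same calculation can be sketched in a few lines of Python. This is a minimal illustration, not a training-ready implementation: the helper name `perceptron` is made up for this example, and the bias of 0.1 is the value that appears in the weighted sum above.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return z, sigmoid(z)

# Values from the worked example (bias assumed to be 0.1)
z, output = perceptron([0.5, 0.3, 0.8], [0.4, -0.2, 0.7], bias=0.1)
print(f"z = {z:.2f}, output = {output:.2f}")  # z = 0.80, output = 0.69
```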
What the Weights Represent
The weights are the learned parameters. A positive weight means "this input increases the probability of spam." A negative weight means "this input decreases the probability." The magnitude tells you how strongly.
After training on thousands of emails, a spam classifier's weights encode: "the word 'Nigerian' strongly predicts spam, but the word 'meeting' slightly predicts not-spam."
The limitation of a single perceptron: it can only learn linear decision boundaries. It can separate two classes only when a straight line (or, in higher dimensions, a flat hyperplane) divides them, not when the boundary has to bend. The XOR problem (output 1 when inputs differ, 0 when they're the same) famously cannot be solved by a single perceptron.
Solving XOR: Why We Need Layers
The XOR problem requires a non-linear decision boundary: no single straight line separates its two classes. To build non-linear boundaries, we stack perceptrons.
Input Layer → Hidden Layer → Output Layer
x₁, x₂      → [neuron_1]   → output
            → [neuron_2]   →
Each neuron in the hidden layer creates its own linear boundary. Together, they combine into a non-linear boundary. With enough hidden neurons, essentially any decision boundary can be approximated as closely as you like.
This insight — that layering simple functions creates complex functions — is the core of deep learning.
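To see layering solve XOR, here is a small sketch with two hidden neurons and one output neuron. The weights are hand-picked for illustration rather than learned, and a hard step activation is assumed in place of sigmoid to keep the arithmetic easy to follow.

```python
def step(z):
    """Hard threshold activation: fires (1) only when z is positive."""
    return 1 if z > 0 else 0

def xor_network(x1, x2):
    # Hidden neuron 1 draws the line x1 + x2 = 0.5 (behaves like OR)
    h_or = step(x1 + x2 - 0.5)
    # Hidden neuron 2 draws the line x1 + x2 = 1.5 (behaves like AND)
    h_and = step(x1 + x2 - 1.5)
    # Output neuron fires when OR is on but AND is off
    return step(h_or - h_and - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_network(x1, x2))
```

Each hidden neuron contributes one straight line; the output neuron keeps only the region between the two lines, which is exactly where the inputs differ.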
The Network as a Function Composition
A neural network with three layers applies three functions in sequence:
Input x
↓
Layer 1: a₁ = f₁(x) = activation(W₁·x + b₁)
↓
Layer 2: a₂ = f₂(a₁) = activation(W₂·a₁ + b₂)
↓
Output: ŷ = f₃(a₂) = activation(W₃·a₂ + b₃)
Final output = f₃(f₂(f₁(x)))
W are weight matrices, b are bias vectors. The "learning" is finding the W and b values that make the final output match the targets.
Activation Functions: Adding Non-Linearity
If we remove activation functions, all the layers collapse into one linear transformation, because a linear function of a linear function is still linear. Non-linear activation functions are what allow the network to learn non-linear patterns.
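The collapse is easy to check numerically. The sketch below uses plain Python lists and made-up weights (all values are illustrative assumptions): two linear layers without an activation give the same result as one combined layer, while inserting ReLU between them changes the answer.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

# Illustrative weights and input (not from any trained model)
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.5, -0.5]
W2, b2 = [[2.0, -1.0]], [0.25]
x = [3.0, -2.0]

# Two linear layers with no activation in between
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One equivalent layer: W = W2·W1, b = W2·b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

# Same two layers with ReLU in between: no longer a single linear map
with_relu = add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

print(two_layers, one_layer, with_relu)  # [-3.75] [-3.75] [-2.75]
```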
Common Activation Functions
Sigmoid:
σ(x) = 1 / (1 + e^(-x))
Range: (0, 1) — useful for output probabilities in binary classification
Problem: Vanishing gradients in deep networks — gradients shrink exponentially
as they propagate backward through many sigmoid layers
ReLU (Rectified Linear Unit) — the standard:
ReLU(x) = max(0, x)
If x > 0: output = x
If x ≤ 0: output = 0
Advantages:
- Computationally simple (fast)
- No vanishing gradient problem for positive values
- Empirically works better than sigmoid in hidden layers
GELU (Gaussian Error Linear Unit) — used in Transformers:
GELU(x) = x × Φ(x) (where Φ is the standard Gaussian CDF; implementations often use a tanh-based approximation)
Smoother than ReLU; used in BERT, GPT, and many other Transformers
Softmax — output layer for classification:
softmax(x)ᵢ = e^xᵢ / Σⱼ e^xⱼ
Converts raw scores to probabilities that sum to 1.
Used when you need to predict one of N classes.
Example: [2.0, 1.0, 0.5] → [0.63, 0.23, 0.14]
"Cat: 63%, Dog: 23%, Bird: 14%"
Forward Pass: How a Network Makes Predictions
The forward pass is the sequence of computations that transforms input into output. In code:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

class SimpleNeuralNetwork:
    def __init__(self):
        # Initialize weights randomly (small values to start)
        np.random.seed(42)
        self.W1 = np.random.randn(4, 3) * 0.01  # 4 neurons, 3 input features
        self.b1 = np.zeros((4, 1))
        self.W2 = np.random.randn(1, 4) * 0.01  # 1 output neuron
        self.b2 = np.zeros((1, 1))

    def forward(self, X):
        # Layer 1: linear transformation + ReLU
        self.Z1 = np.dot(self.W1, X) + self.b1
        self.A1 = relu(self.Z1)
        # Layer 2: linear transformation + sigmoid (binary output)
        self.Z2 = np.dot(self.W2, self.A1) + self.b2
        self.A2 = sigmoid(self.Z2)
        return self.A2  # Probability between 0 and 1

# Example
network = SimpleNeuralNetwork()
X = np.array([[1.0], [0.5], [0.3]])  # 3 features, 1 example
output = network.forward(X)
print(f"Output probability: {output[0][0]:.4f}")
Training: How the Network Learns
The Loss Function
The loss (or cost) function measures how wrong the network's predictions are. Common choices:
Binary Cross-Entropy (binary classification):
Loss = -[y × log(ŷ) + (1-y) × log(1-ŷ)]
y = true label (0 or 1)
ŷ = predicted probability
If the true label is 1 and we predict 0.9: loss = -log(0.9) ≈ 0.105 (small — good prediction)
If the true label is 1 and we predict 0.1: loss = -log(0.1) ≈ 2.30 (large — bad prediction)
Mean Squared Error (regression):
Loss = (1/n) × Σ(yᵢ - ŷᵢ)²
Average squared difference between true and predicted values.
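Both loss functions fit in a few lines of plain Python. This is a sketch for single predictions and short lists; the clamping constant `eps` is an assumption added to avoid taking log(0), and the regression numbers at the end are made up.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Loss for one example: y is the true label (0 or 1), y_hat the predicted probability."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # keep log() away from 0
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"{binary_cross_entropy(1, 0.9):.3f}")        # 0.105 (confident and right)
print(f"{binary_cross_entropy(1, 0.1):.3f}")        # 2.303 (confident and wrong)
print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))   # 2.5
```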
Gradient Descent
Gradient descent is the optimization algorithm that reduces the loss. The gradient is the direction of steepest increase in loss — by moving opposite to the gradient (steepest decrease), we reduce the loss.
Weight update rule:
w_new = w_old - learning_rate × ∂Loss/∂w
learning_rate (e.g., 0.001): how big a step to take
∂Loss/∂w: the gradient — how much loss changes when we change w
The learning rate is critical. Too large: the loss oscillates or diverges. Too small: training is extremely slow.
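The effect of the learning rate shows up even on a toy problem. The sketch below applies the update rule to a made-up one-parameter loss, (w - 3)², whose minimum is at w = 3; the loss and the three learning rates are illustrative assumptions.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimize loss(w) = (w - 3)^2 starting from w = 0."""
    for _ in range(steps):
        grad = 2 * (w - 3.0)          # d(loss)/dw
        w = w - learning_rate * grad  # w_new = w_old - learning_rate * gradient
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.4f}")
```

With lr=0.1 the result lands very close to 3. With lr=0.001 it has barely moved from the starting point, and with lr=1.1 every step overshoots by more than the last, so w swings ever further from the minimum.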
Backpropagation
Backpropagation computes the gradient of the loss with respect to every weight in the network using the chain rule. The chain rule allows us to compute how changes in early layer weights affect the final loss:
Chain rule example (simplified):
∂Loss/∂W₁ = ∂Loss/∂A₂ × ∂A₂/∂Z₂ × ∂Z₂/∂A₁ × ∂A₁/∂Z₁ × ∂Z₁/∂W₁
The "backward" in backpropagation refers to the order of computation: gradients are computed from the output layer back toward the input, so each layer reuses the terms already computed for the layer after it. Once every gradient is known, all weights are updated together.
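The chain-rule product can be verified on a tiny network. This sketch assumes one neuron per layer (ReLU, then sigmoid, with binary cross-entropy loss) and made-up values for the weights and the single training example. It multiplies the five factors from the expression above and compares the result with a finite-difference estimate of the same gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, b1, w2, b2, x, y):
    """Forward pass of a 1-1-1 network; returns every intermediate value."""
    z1 = w1 * x + b1
    a1 = max(0.0, z1)            # ReLU
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    loss = -(y * math.log(a2) + (1 - y) * math.log(1 - a2))
    return z1, a1, z2, a2, loss

# Illustrative parameters and one training example
w1, b1, w2, b2 = 0.8, 0.1, -0.5, 0.2
x, y = 1.5, 1.0
z1, a1, z2, a2, loss = forward(w1, b1, w2, b2, x, y)

# The five chain-rule factors, from the output back to W1
dloss_da2 = -(y / a2) + (1 - y) / (1 - a2)
da2_dz2 = a2 * (1 - a2)               # derivative of sigmoid
dz2_da1 = w2
da1_dz1 = 1.0 if z1 > 0 else 0.0      # derivative of ReLU
dz1_dw1 = x
grad_w1 = dloss_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Independent check: nudge w1 slightly and measure the change in loss
h = 1e-6
loss_plus = forward(w1 + h, b1, w2, b2, x, y)[-1]
loss_minus = forward(w1 - h, b1, w2, b2, x, y)[-1]
numeric_grad = (loss_plus - loss_minus) / (2 * h)

print(f"chain rule: {grad_w1:.6f}, finite difference: {numeric_grad:.6f}")
```

Comparing against a finite difference like this is a common way to catch mistakes in hand-written gradients.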
Deep Learning: Why Depth Matters
"Deep" learning refers to networks with many layers. There is no strict cutoff, though more than about three is a common rule of thumb. Deep networks can learn hierarchical representations:
Image recognition example (Convolutional Neural Network, idealized):
Layer 1: Learns edges and gradients
Layer 2: Combines edges into shapes (corners, curves)
Layer 3: Combines shapes into textures (fur, fabric, skin)
Layer 4: Combines textures into object parts (eyes, wheels, leaves)
Layer 5: Combines parts into objects (cats, cars, trees)
No one programmed these features — the network discovered them by finding patterns that minimize the classification loss across millions of images.
This hierarchical feature learning is why deep networks outperform traditional ML on complex data:
- Images: each pixel is a raw feature; deep networks learn which pixels matter and how they combine
- Text: each word is a token; deep networks learn semantic meaning and syntax
- Audio: each time-frequency point is raw; deep networks learn phonemes, words, speakers
Key Architectures in 2025
Convolutional Neural Networks (CNNs): For image data. Use convolutional layers that detect local patterns regardless of position. ResNet and EfficientNet are widely used CNN architectures; Vision Transformers, which rely on attention rather than convolution, are a common alternative for image tasks.
Recurrent Neural Networks (RNNs) and LSTMs: For sequence data. Maintain a hidden state that carries information through the sequence. Largely superseded by Transformers but still used in some settings.
Transformers: The dominant architecture for NLP and increasingly for vision and other modalities. They use attention mechanisms to weigh the importance of every input element when computing each output element. The foundation of GPT, BERT, Claude, Gemini, and most other large language models.
Graph Neural Networks (GNNs): For graph-structured data (molecular structures, social networks, knowledge graphs). Growing area of research with significant practical applications.
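To make the attention idea from the Transformer description concrete, here is a minimal sketch of scaled dot-product attention for a single query. The vectors are made-up toy values; real Transformers learn the query, key, and value projections and run many attention heads in parallel, so treat this as an illustration of the weighting step only.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax turns the scores into weights that sum to 1
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the weighted average of the value vectors
    dims = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dims)]
    return weights, output

# Toy vectors (hypothetical): the query points the same way as the second key
query = [1.0, 0.0]
keys = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attention(query, keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

The second input receives the largest weight because its key matches the query best, so its value dominates the output.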
Implementing a Neural Network with PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

class ClassificationNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.3),  # Regularization: randomly drop neurons
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_size, output_size)  # Raw logits; no softmax here
        )

    def forward(self, x):
        return self.network(x)

# Placeholder data so the loop runs end to end: 200 random examples with
# 10 features and a made-up binary label (swap in your own dataset)
torch.manual_seed(0)
X_train = torch.randn(200, 10)
y_train = (X_train[:, 0] > 0).long()  # Class indices 0 or 1

# Initialize
model = ClassificationNetwork(input_size=10, hidden_size=64, output_size=2)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()  # Applies softmax internally to the logits

# Training loop
model.train()  # Training mode, so dropout is active
for epoch in range(100):
    # Forward pass
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    # Backward pass
    optimizer.zero_grad()  # Clear gradients left over from the previous step
    loss.backward()        # Compute gradients
    optimizer.step()       # Update weights
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
Common Failure Modes
Overfitting: The network memorizes training data but doesn't generalize. Signs: training accuracy much higher than validation accuracy. Fixes: add dropout, gather more data (or augment what you have), L2 regularization, early stopping.
Underfitting: The network is too simple or undertrained. Signs: both training and validation accuracy are poor. Fixes: more layers/neurons, longer training, better features.
Vanishing gradients: In deep networks, gradients shrink exponentially as they propagate backward — early layers learn very slowly or not at all. Fixes: ReLU activation, batch normalization, residual connections.
Exploding gradients: Gradients grow exponentially — weights become very large and training diverges. Fixes: gradient clipping, careful weight initialization, smaller learning rate.
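Two of the fixes named above, early stopping and gradient clipping, are simple enough to sketch in plain Python. The helper names, the patience value, and the validation losses below are all hypothetical; deep learning frameworks ship their own utilities for both.

```python
import math

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop.

    Stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran to the end

def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# Made-up validation losses: improvement stalls after epoch 2
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.66, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print("stop at epoch", stop_epoch)

# A gradient with norm 5 is scaled down to norm 1, keeping its direction
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
print([round(g, 2) for g in clipped])
```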
Conclusion
Neural networks are function approximators. The "intelligence" emerges not from any individual neuron but from the collective effect of millions of weights learned through gradient descent across massive datasets.
Understanding the forward pass (how predictions are made) and the backward pass (how weights are updated) gives you the foundation to understand every modern architecture — CNNs, RNNs, Transformers, and whatever comes next.
The math is accessible: matrix multiplication, the chain rule, and some basic calculus. The concepts are learnable with patience and good examples. The path from understanding a perceptron to understanding GPT-4 is longer than it looks, but it's a connected path — each concept builds on the last.
For hands-on practice, see our scikit-learn tutorial for traditional ML implementation and our machine learning beginners guide for the full learning path.
Frequently Asked Questions
What is a neural network in simple terms?
A mathematical system of layered functions that learns patterns from data by adjusting millions of numerical weights. The weights encode learned patterns — a spam-detecting neural network doesn't have spam rules; it has weights that respond strongly to spam-like patterns discovered during training.
How many neurons and layers does a neural network need?
Depends on problem complexity. Start small (2–3 layers, 64–128 neurons), then add complexity if the model underfits. Deep learning for images may use 10–50+ layers; large language models stack dozens to over a hundred Transformer layers. More isn't always better: simpler models are faster and less prone to overfitting.
What activation function should I use?
ReLU for hidden layers (fast, and less prone to vanishing gradients). Sigmoid for binary classification outputs. Softmax for multi-class outputs. GELU in Transformer architectures. As a default, avoid sigmoid and tanh in hidden layers of deep networks.
What is backpropagation and why does it matter?
The algorithm that computes how much each weight contributed to the prediction error, so that gradient descent can adjust each weight in proportion to its gradient. It uses the chain rule from calculus to propagate gradients backward through the layers efficiently. Without backpropagation, computing these gradients for a deep network would be far too slow to be practical.
How is deep learning different from traditional machine learning?
Traditional ML requires manual feature engineering — you decide which features matter. Deep learning automatically learns features from raw data. Deep learning requires more data and compute but often produces better results on complex unstructured data (images, text, audio). Traditional ML often works better with small datasets, interpretability requirements, or tabular data.
AiTechWorlds Team