Your brain contains roughly 86 billion neurons. Each one is a tiny biological switch. It receives electrical signals from other neurons through branch-like dendrites, accumulates those signals in its cell body, and if the total signal exceeds a threshold, it fires — sending a pulse down its axon to the next set of neurons. This firing or not-firing, happening billions of times per second across a vast web of connections, produces thought, emotion, and consciousness.

Artificial neural networks borrow exactly this idea and strip it down to mathematics. Forget the biology. Keep the logic: receive inputs, weight them, sum them up, decide whether to activate. Chain thousands of these simple units together and something remarkable emerges — a system that can learn to recognize handwritten digits, detect tumors, translate languages, and compose music.

From Biological Neuron to Perceptron

The perceptron, invented by Frank Rosenblatt in 1958, is the mathematical neuron. It computes one thing:

output = activation_function(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)

Where:

x₁, x₂, ..., xₙ are the input features
w₁, w₂, ..., wₙ are learned weights (how important is each input?)
b is the bias (a learned offset, like the neuron's baseline threshold)
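activation_function introduces non-linearity (Rosenblatt's original perceptron used a step function: output 1 if the sum exceeds the threshold, otherwise 0)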

The weights and bias are the only things that change during training. Everything else — the architecture, the activation function, the data — is fixed by design.

The bias term is critical. Without it, the decision boundary always passes through the origin. Bias lets the model shift that boundary to wherever the data demands.
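To make the formula and the role of the bias concrete, here is a minimal sketch of a single perceptron with a step activation. The weights and bias are hand-picked, hypothetical values chosen so the unit behaves like a logical AND; nothing here is learned.

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of inputs plus bias, followed by a step activation."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Hypothetical, hand-picked parameters (not learned) that implement AND.
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(np.array(x), w, b) for x in inputs]
print(outputs)  # [0, 0, 0, 1]

# Same weights with the bias removed: the boundary w.x = 0 now passes
# through the origin, and the unit fires for (0, 1) and (1, 0) as well.
outputs_no_bias = [perceptron(np.array(x), w, 0.0) for x in inputs]
print(outputs_no_bias)  # [0, 1, 1, 1]
```

With this step rule, no bias-free choice of weights can represent AND: both weights would have to be non-positive to reject (0, 1) and (1, 0), yet their sum would have to be positive to accept (1, 1).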

Activation Functions: Why Non-Linearity Matters

If you remove the activation function, a neural network with 100 layers is still just a linear model. Every layer's output is a linear combination of the previous layer's outputs, and stacking linear operations gives you... another linear operation. No matter how deep the network is, it can only fit straight lines and flat planes, never a curve.

Activation functions break this by introducing non-linearity at every neuron.
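The collapse of stacked linear layers can be checked numerically. The sketch below uses three randomly initialized layers with made-up sizes (4, 8, 8, 2) and no activation, then builds the single equivalent layer by multiplying the weight matrices together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "layers" with no activation function: each is just W @ x + b.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)
W3, b3 = rng.normal(size=(2, 8)), rng.normal(size=2)

x = rng.normal(size=4)
deep_output = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# The same mapping collapsed into one linear layer.
W = W3 @ W2 @ W1
b = W3 @ (W2 @ b1 + b2) + b3
single_output = W @ x + b

print(np.allclose(deep_output, single_output))  # True
```

Inserting a non-linear function between the layers is what prevents this algebraic collapse.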

Sigmoid:

σ(z) = 1 / (1 + e^(-z))

Outputs between 0 and 1. Historically used for hidden layers. Now mostly reserved for output layers in binary classification. Problem: the curve flattens for large positive and large negative inputs, so the gradient there is close to zero (the "vanishing gradient" problem).

Tanh (Hyperbolic Tangent):

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

Outputs between -1 and 1. Zero-centered, which is better than sigmoid for training. Still suffers from vanishing gradients in deep networks.

ReLU (Rectified Linear Unit):

ReLU(z) = max(0, z)

The modern default for hidden layers. Simple: if the input is positive, pass it through; if it is negative, output 0. ReLU does not saturate for large positive values, so gradients flow freely through active units. This is one major reason very deep networks became practical to train. It is also cheap to compute and works extremely well in practice.
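The saturation behaviour described for the three functions can be seen by evaluating their derivatives at a few points. This is a small NumPy sketch; the helper names are illustrative and the test inputs (-10, 0, 10) are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 for positive inputs, 0 for negative ones. The value at exactly 0
    # is a convention; 0 is used here.
    return (z > 0).astype(float)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid_grad(z))  # peaks at 0.25 for z = 0, nearly 0 at both ends
print(tanh_grad(z))     # peaks at 1.0 for z = 0, nearly 0 at both ends
print(relu_grad(z))     # [0. 0. 1.]
```

The sigmoid gradient never exceeds 0.25, so multiplying many such factors across layers shrinks the signal quickly, while an active ReLU unit passes a gradient of exactly 1.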

Multi-Layer Perceptron: Stacking Layers

A Multi-Layer Perceptron (MLP) stacks neurons into layers:

Input layer: One node per feature. No computation — just passes data forward.

Hidden layer(s): Each neuron receives all outputs from the previous layer, applies weighted sum + bias + activation. You can have one hidden layer or many. Each layer can have any number of neurons (this is a hyperparameter).

Output layer: One neuron per class for multi-class classification, a single neuron for binary classification with a sigmoid output, or a single neuron for regression. The output layer activation depends on the task:

Binary classification: Sigmoid → probability between 0 and 1
Multi-class classification: Softmax → probabilities summing to 1
Regression: No activation (or linear)
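As a rough illustration of the first two output activations, the sketch below turns raw output-layer scores (logits) into probabilities. The logit values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow in exp.
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

# Binary classification: one logit -> one probability.
print(sigmoid(0.0))  # 0.5

# Multi-class classification: one logit per class -> probabilities summing to 1.
logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for 3 classes
probs = softmax(logits)
print(probs.round(3))  # [0.659 0.242 0.099]
```

For regression the raw weighted sum is used directly as the prediction, so no extra function is needed.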

A network with two hidden layers of 64 neurons each, reading 30 features and classifying into 2 classes, has this structure:

Input (30) → Hidden₁ (64, ReLU) → Hidden₂ (64, ReLU) → Output (2, Softmax)
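One way to get a feel for this structure is to count its trainable parameters. Each fully connected layer has one weight per input-output pair plus one bias per output neuron; the short sketch below applies that rule to the layer sizes above.

```python
layer_sizes = [30, 64, 64, 2]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out   # one weight per connection
    biases = n_out           # one bias per neuron in the receiving layer
    total += weights + biases
    print(f"{n_in} -> {n_out}: {weights} weights + {biases} biases")

print(f"Total trainable parameters: {total}")  # 6274
```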

The Forward Pass

Training begins with a forward pass: data flows from input to output through every layer. Each layer transforms its inputs by computing weighted sums and applying activations. At the end, the output layer produces a prediction.

You compare this prediction to the true label using a loss function (e.g., cross-entropy for classification). The loss measures how wrong the prediction is. Then learning kicks in — but that is the backpropagation lesson.
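The forward pass and loss computation can be sketched in a few lines of NumPy. This is an illustration under assumptions, not how any particular library implements it: the 30 → 64 → 64 → 2 layout is reused from the earlier example, the weights are random and untrained, and the inputs and labels are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    exp = np.exp(z - z.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

# Randomly initialized (untrained) parameters for a 30 -> 64 -> 64 -> 2 network.
sizes = [30, 64, 64, 2]
params = [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(X, params):
    a = X
    for W, b in params[:-1]:          # hidden layers: weighted sum, bias, ReLU
        a = relu(a @ W + b)
    W, b = params[-1]                 # output layer: weighted sum, bias, softmax
    return softmax(a @ W + b)

X = rng.normal(size=(5, 30))          # 5 made-up samples with 30 features
y = np.array([0, 1, 1, 0, 1])         # made-up true labels

probs = forward(X, params)
# Cross-entropy: average negative log-probability assigned to the true class.
loss = -np.mean(np.log(probs[np.arange(len(y)), y]))
print(probs.shape)  # (5, 2)
print(f"Cross-entropy loss: {loss:.4f}")
```

Because the weights are random, the loss is typically near ln(2) ≈ 0.69, the value for a two-class model that is guessing; training is what drives it down.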

Universal Approximation Theorem

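A neural network with just one hidden layer, a suitable non-linear activation, and a sufficient number of neurons can approximate any continuous function on a closed and bounded domain to any desired precision. Cybenko proved this for sigmoid activations in 1989, and Hornik generalized the result to a wider class of activations in 1991.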

This is a theoretical result. It does not tell you how many neurons you need, or how to train the network. But it answers the fundamental question: can a neural network in principle represent the function I need? For continuous functions on a bounded domain, yes.
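A small experiment can illustrate the flavour of the theorem, though it proves nothing. The sketch below approximates sin(x) on [-π, π] with a single tanh hidden layer of increasing width. To keep it short and deterministic it swaps in a simplification: the hidden weights are random and fixed, and only the output layer is fitted, by least squares rather than gradient descent. The target function, widths, and weight scales are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(-np.pi, np.pi, 400).reshape(-1, 1)
target = np.sin(x).ravel()

# One hidden layer of up to 200 tanh units with random, fixed weights.
max_units = 200
W = rng.normal(scale=2.0, size=(1, max_units))
b = rng.normal(scale=2.0, size=max_units)
hidden = np.tanh(x @ W + b)           # shape (400, 200)

errors = {}
for n_units in (5, 20, 200):
    # Use the first n_units hidden neurons plus a constant column (output bias).
    H = np.hstack([hidden[:, :n_units], np.ones((len(x), 1))])
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    errors[n_units] = np.sqrt(np.mean((H @ coef - target) ** 2))
    print(f"{n_units:>3} hidden units: RMSE = {errors[n_units]:.2e}")
```

Since each wider network contains the narrower one, the fitting error cannot get worse as units are added, and in practice it drops sharply.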

sklearn MLPClassifier Example

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score

# Load digits dataset: 1797 samples, 64 features (8x8 pixel images), 10 classes
digits = load_digits()
X, y = digits.data, digits.target
print(f"Dataset shape: {X.shape}")
print(f"Classes: {np.unique(y)}")
# Output:
# Dataset shape: (1797, 64)
# Classes: [0 1 2 3 4 5 6 7 8 9]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize (always for neural networks)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Build MLP: two hidden layers (128 neurons, 64 neurons), ReLU activation
mlp = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    activation='relu',
    solver='adam',
    max_iter=500,
    random_state=42,
    early_stopping=True,    # Stop if validation score stops improving
    validation_fraction=0.1
)

mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Training stopped at epoch: {mlp.n_iter_}")
# Output (exact values may vary with the scikit-learn version):
# Test Accuracy: 0.9778
# Training stopped at epoch: 87

print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Output shows per-digit precision/recall/F1 all above 0.96
# Overall accuracy: ~97.8%

# Architecture summary
print(f"\nNetwork architecture:")
print(f"  Input neurons:  {X_train.shape[1]}")
for i, size in enumerate(mlp.hidden_layer_sizes):
    print(f"  Hidden layer {i+1}: {size} neurons (ReLU)")
print(f"  Output neurons: {len(mlp.classes_)} (one per digit class)")
# Output:
# Network architecture:
#   Input neurons:  64
#   Hidden layer 1: 128 neurons (ReLU)
#   Hidden layer 2: 64 neurons (ReLU)
#   Output neurons: 10 (one per digit class)

97.8% accuracy on handwritten digits, with a few dozen lines of code. Standardized inputs, two modest hidden layers, and early stopping are enough to get there, even with sklearn's relatively basic MLP implementation.
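The learned weights and biases are exposed on the fitted model through the `coefs_` and `intercepts_` attributes. The self-contained sketch below refits a model with the same layer sizes purely to inspect their shapes. As simplifications, it scales the full dataset without a train/test split and uses an arbitrary small `max_iter`, so it is not a substitute for the evaluation above.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=30, random_state=42)
mlp.fit(X, y)  # may emit a ConvergenceWarning; only the shapes matter here

weight_shapes = [W.shape for W in mlp.coefs_]
bias_shapes = [b.shape for b in mlp.intercepts_]
n_params = sum(W.size for W in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)

print(weight_shapes)  # [(64, 128), (128, 64), (64, 10)]
print(bias_shapes)    # [(128,), (64,), (10,)]
print(n_params)       # 17226
```

Each weight matrix has shape (inputs, outputs) for its layer, which matches the weighted-sum formula from the perceptron section applied to a whole layer at once.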

Activation Function Comparison

Function	Range	Best For	Problem
Sigmoid	(0, 1)	Binary output layer	Vanishing gradient in deep nets
Tanh	(-1, 1)	Shallow hidden layers	Vanishing gradient (less severe)
ReLU	[0, ∞)	Hidden layers (default choice)	"Dead ReLU" (neurons stuck at 0)
Softmax	(0, 1), sums to 1	Multi-class output layer	Only for output, not hidden layers
Linear	(-∞, ∞)	Regression output layer	No non-linearity

Key Takeaways

A perceptron is the mathematical neuron: weighted inputs + bias + activation function.
Activation functions are essential. Without them, any deep network collapses to a linear model.
ReLU is the default activation for hidden layers in modern networks — fast, effective, gradient-friendly.
An MLP stacks layers: input → one or more hidden layers → output.
The Universal Approximation Theorem guarantees that a neural network can, in principle, approximate any continuous function on a bounded domain; the challenge is in training it effectively.
Always standardize inputs before feeding data into a neural network.
Start with a small architecture and grow it only if needed — more neurons add training cost and overfitting risk.
