AiTechWorlds
Your brain contains roughly 86 billion neurons. Each one is a tiny biological switch. It receives electrical signals from other neurons through branch-like dendrites, accumulates those signals in its cell body, and if the total signal exceeds a threshold, it fires — sending a pulse down its axon to the next set of neurons. This firing or not-firing, happening billions of times per second across a vast web of connections, produces thought, emotion, and consciousness.
Artificial neural networks borrow exactly this idea and strip it down to mathematics. Forget the biology. Keep the logic: receive inputs, weight them, sum them up, decide whether to activate. Chain thousands of these simple units together and something remarkable emerges — a system that can learn to recognize handwritten digits, detect tumors, translate languages, and compose music.
The perceptron, invented by Frank Rosenblatt in 1958, is the mathematical neuron. It computes one thing:
output = activation_function(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)
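Where x₁ ... xₙ are the inputs, w₁ ... wₙ are the weights (one per input), b is the bias, and activation_function turns the weighted sum into the neuron's output.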
The weights and bias are the only things that change during training. Everything else — the architecture, the activation function, the data — is fixed by design.
The bias term is critical. Without it, the decision boundary always passes through the origin. Bias lets the model shift that boundary to wherever the data demands.
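To make the formula concrete, here is a minimal sketch of a single perceptron in NumPy. The step activation and the particular weights and bias are assumptions chosen for illustration (they make the unit behave like a logical AND); they are not learned from data.

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum plus bias, followed by a step activation."""
    z = np.dot(w, x) + b          # w1*x1 + w2*x2 + ... + wn*xn + b
    return 1 if z > 0 else 0      # step activation: fire (1) or stay silent (0)

# Hypothetical, hand-picked parameters that make the unit compute logical AND
w = np.array([1.0, 1.0])
b = -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x), w, b))
# (0, 0) -> 0
# (0, 1) -> 0
# (1, 0) -> 0
# (1, 1) -> 1
```

With b fixed at 0, no choice of weights reproduces AND with this step rule: rejecting (1, 0) and (0, 1) forces both weights to be non-positive, and then (1, 1) is rejected too. The bias is what moves the boundary away from the origin.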
If you remove the activation function, a neural network with 100 layers is still just a linear model. Each layer's output is a linear combination of the previous layer's output, and stacking linear operations gives you... another linear operation. You cannot fit a curve.
Activation functions break this by introducing non-linearity at every neuron.
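The collapse of stacked linear layers can be checked numerically. The sketch below uses two randomly generated layers (the shapes and the seed are arbitrary assumptions) and shows that they are equivalent to a single linear layer whose weights and bias are computed from the originals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function (shapes chosen arbitrarily)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Pass x through both layers in sequence
stacked = W2 @ (W1 @ x + b1) + b2

# Collapse the two layers into one equivalent linear layer
W = W2 @ W1
b = W2 @ b1 + b2
single = W @ x + b

print(np.allclose(stacked, single))  # True

# With a ReLU between the layers, this algebraic collapse is no longer available
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
```

The same algebra applies to any number of layers, which is why depth alone adds nothing without a non-linearity in between.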
Sigmoid:
σ(z) = 1 / (1 + e^(-z))
Outputs between 0 and 1. Historically used for hidden layers. Now mostly reserved for output layers in binary classification. Problem: the curve flattens for large positive or large negative inputs, so the gradient there is close to zero (the "vanishing gradient" problem).
Tanh (Hyperbolic Tangent):
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
Outputs between -1 and 1. Zero-centered, which is better than sigmoid for training. Still suffers from vanishing gradients in deep networks.
ReLU (Rectified Linear Unit):
ReLU(z) = max(0, z)
The modern default for hidden layers. Simple: if the input is positive, pass it through; if negative, output 0. ReLU does not saturate for large positive values, so gradients flow freely there. This is one of the main reasons deep networks became trainable. ReLU is also faster to compute and works extremely well in practice.
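The saturation behaviour described for these three functions is easiest to see by evaluating their derivatives directly. This is a small NumPy sketch; the helper names are chosen here for illustration and the derivative formulas are the standard ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # peaks at 0.25 when z = 0

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2   # peaks at 1.0 when z = 0

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)   # 1 for positive inputs, 0 otherwise

z = np.array([-10.0, 0.0, 10.0])
print("sigmoid'(z):", sigmoid_grad(z))
print("tanh'(z):   ", tanh_grad(z))
print("relu'(z):   ", relu_grad(z))
```

At z = ±10 the sigmoid and tanh derivatives are nearly zero, while the ReLU derivative stays at exactly 1 for any positive input. Multiplying many near-zero factors together during backpropagation is what makes gradients vanish in deep sigmoid or tanh networks.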
A Multi-Layer Perceptron (MLP) stacks neurons into layers:
Input layer: One node per feature. No computation — just passes data forward.
Hidden layer(s): Each neuron receives all outputs from the previous layer, applies weighted sum + bias + activation. You can have one hidden layer or many. Each layer can have any number of neurons (this is a hyperparameter).
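Output layer: One neuron per class (classification) or one neuron (regression). The output layer activation depends on the task: sigmoid for binary classification, softmax for multi-class classification, and linear (no activation) for regression.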
A network with two hidden layers of 64 neurons each, reading 30 features and classifying into 2 classes, has this structure:
Input (30) → Hidden₁ (64, ReLU) → Hidden₂ (64, ReLU) → Output (2, Softmax)
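One way to get a feel for the size of this architecture is to count its trainable parameters. The sketch below assumes fully connected layers, so each layer has one weight per input-output pair and one bias per output neuron.

```python
# Layer widths for the structure above: 30 inputs, two hidden layers of 64, 2 outputs
layer_sizes = [30, 64, 64, 2]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out   # one weight for every input-output connection
    biases = n_out           # one bias per neuron in the receiving layer
    total += weights + biases
    print(f"{n_in:>3} -> {n_out:<3} weights: {weights:>5}  biases: {biases:>3}")

print("Total trainable parameters:", total)  # 6274
```

The input layer contributes no parameters of its own; the count comes entirely from the connections into the hidden and output layers.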
Training begins with a forward pass: data flows from input to output through every layer. Each layer transforms its inputs by computing weighted sums and applying activations. At the end, the output layer produces a prediction.
You compare this prediction to the true label using a loss function (e.g., cross-entropy for classification). The loss measures how wrong the prediction is. Then learning kicks in — but that is the backpropagation lesson.
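The forward pass and loss computation can be sketched in a few lines of NumPy. The weights here are random and untrained, and the input batch and labels are made up for illustration, so the loss value itself is not meaningful; the point is the flow of data through the 30 → 64 → 64 → 2 structure.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Randomly initialised (untrained) parameters for 30 -> 64 -> 64 -> 2
sizes = [30, 64, 64, 2]
params = [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(X, params):
    a = X
    for W, b in params[:-1]:
        a = relu(a @ W + b)                # hidden layers: weighted sum + bias + ReLU
    W, b = params[-1]
    return softmax(a @ W + b)              # output layer: class probabilities

def cross_entropy(probs, y):
    # Mean negative log-probability assigned to the true class
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

X = rng.normal(size=(5, 30))               # made-up batch of 5 samples
y = np.array([0, 1, 1, 0, 1])              # made-up labels

probs = forward(X, params)
loss = cross_entropy(probs, y)
print(probs.shape)                         # (5, 2)
print(probs.sum(axis=1))                   # each row sums to 1
print(loss)
```

Training would adjust `params` to push this loss down, which is the job of backpropagation.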
A neural network with just one hidden layer and a sufficient number of neurons can approximate any continuous function on a closed, bounded domain to any desired precision. This was proven by Cybenko in 1989 for sigmoid activations and extended to a broader class of activation functions by Hornik in 1991.
This is a theoretical result: it does not tell you how many neurons you need, or how to train the network. But it answers the fundamental question: can a neural network in principle represent the function I need? For continuous functions, yes.
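A hands-on way to see the idea in one dimension is to build a one-hidden-layer ReLU network by hand rather than train it. The sketch below is a constructive illustration, not the proof technique of the theorem: it places one ReLU neuron at each of several evenly spaced knots and chooses the output weights so the network linearly interpolates sin(x) on [0, π]. The function names are hypothetical helpers introduced for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def build_relu_net(f, lo, hi, n_hidden):
    """Return (knots, coeffs, offset) for a one-hidden-layer ReLU network
    that linearly interpolates f at n_hidden + 1 evenly spaced points."""
    knots = np.linspace(lo, hi, n_hidden + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)   # slope on each segment
    coeffs = np.diff(slopes, prepend=0.0)       # change in slope at each knot
    return knots[:-1], coeffs, values[0]

def predict(x, knots, coeffs, offset):
    # Hidden layer: relu(x - knot_i). Output layer: weighted sum plus offset.
    hidden = relu(x[:, None] - knots[None, :])
    return hidden @ coeffs + offset

def max_error(n_hidden):
    x = np.linspace(0.0, np.pi, 1000)
    net = build_relu_net(np.sin, 0.0, np.pi, n_hidden)
    return np.max(np.abs(predict(x, *net) - np.sin(x)))

for n_hidden in (5, 50, 500):
    print(f"{n_hidden:>3} hidden neurons: max error = {max_error(n_hidden):.6f}")
```

The maximum error shrinks as hidden neurons are added, which is the behaviour the theorem guarantees is possible. What the theorem does not say is that gradient-based training will find such weights.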
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score
# Load digits dataset: 1797 samples, 64 features (8x8 pixel images), 10 classes
digits = load_digits()
X, y = digits.data, digits.target
print(f"Dataset shape: {X.shape}")
print(f"Classes: {np.unique(y)}")
# Output:
# Dataset shape: (1797, 64)
# Classes: [0 1 2 3 4 5 6 7 8 9]
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardize (always for neural networks)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Build MLP: two hidden layers (128 neurons, 64 neurons), ReLU activation
mlp = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    activation='relu',
    solver='adam',
    max_iter=500,
    random_state=42,
    early_stopping=True,  # Stop if validation score stops improving
    validation_fraction=0.1
)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Training stopped at epoch: {mlp.n_iter_}")
# Output:
# Test Accuracy: 0.9778
# Training stopped at epoch: 87
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Output shows per-digit precision/recall/F1 all above 0.96
# Overall accuracy: ~97.8%
# Architecture summary
print(f"\nNetwork architecture:")
print(f" Input neurons: {X_train.shape[1]}")
for i, size in enumerate(mlp.hidden_layer_sizes):
    print(f" Hidden layer {i+1}: {size} neurons (ReLU)")
print(f" Output neurons: {len(mlp.classes_)} (one per digit class)")
# Output:
# Input neurons: 64
# Hidden layer 1: 128 neurons (ReLU)
# Hidden layer 2: 64 neurons (ReLU)
# Output neurons: 10 (one per digit class)
97.8% accuracy on handwritten digits, with a few dozen lines of code. This is what a well-designed neural network achieves even with sklearn's relatively basic MLP implementation.
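To connect the trained model back to the weights and biases discussed earlier, you can inspect the fitted network's `coefs_` and `intercepts_` attributes, which hold one weight matrix and one bias vector per layer. The sketch below refits a model with the same layer sizes on the full digits data; the reduced `max_iter=50` is an assumption to keep the run short, and it may trigger a convergence warning that does not affect the shapes.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Same layer sizes as above; fewer iterations because only the shapes matter here
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=50, random_state=42)
mlp.fit(X, y)

n_params = 0
for W, b in zip(mlp.coefs_, mlp.intercepts_):
    print(f"weights {W.shape}, biases {b.shape}")
    n_params += W.size + b.size

print("Trainable parameters:", n_params)
# weights (64, 128), biases (128,)
# weights (128, 64), biases (64,)
# weights (64, 10), biases (10,)
# Trainable parameters: 17226
```

Everything the network has learned about digits lives in these 17,226 numbers.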
| Function | Range | Best For | Problem |
|---|---|---|---|
| Sigmoid | (0, 1) | Binary output layer | Vanishing gradient in deep nets |
| Tanh | (-1, 1) | Shallow hidden layers | Vanishing gradient (less severe) |
| ReLU | [0, ∞) | Hidden layers (default choice) | "Dead ReLU" (neurons stuck at 0) |
| Softmax | (0,1), sums to 1 | Multi-class output layer | Only for output, not hidden layers |
| Linear | (-∞, ∞) | Regression output layer | No non-linearity |