In 1951, Stephen Kleene — a student of Alonzo Church and one of the founders of recursion theory — was studying the mathematical properties of neural networks. To describe the patterns of nerve activation, he needed a concise notation for sets of strings. He invented what he called regular events, notated with three operations: union, concatenation, and a starred repetition operator.

He could not have imagined that his notation would, seventy years later, be embedded in every programming language, every text editor, and every search engine on Earth.

When Google's crawlers look for email addresses, when your IDE highlights syntax errors in red, when a spam filter examines message patterns — they are all computing Kleene's "regular events." But the practical tools you use (Python's re, JavaScript's /regex/, Perl's pattern matching) differ from the formal definition in important ways. Understanding the formal definition reveals exactly what is, and is not, computable with a regex.

Formal Definition

A regular expression over alphabet Σ is defined recursively:

Basis Cases (atomic expressions)

Expression	Denotes	Language
`ε`	The empty string	`{ε}` — the set containing just the empty string
`∅`	No strings at all	`∅` — the empty language
`a` for each `a ∈ Σ`	Single symbol	`{a}` — the set containing just that character

Inductive Cases (building larger expressions)

If R₁ and R₂ are regular expressions, then so are:

Operation	Expression	Language	Read as
Union	R₁ ∪ R₂	L(R₁) ∪ L(R₂)	"R₁ or R₂"
Concatenation	R₁ · R₂	L(R₁) · L(R₂)	"R₁ followed by R₂"
Kleene star	R₁*	L(R₁)*	"zero or more copies of R₁"

That is all. Formal regular expressions have exactly three operations. No more.

The Kleene Star in Detail

R* denotes zero or more concatenations of strings from R:

R* = ε ∪ R ∪ RR ∪ RRR ∪ ...

More precisely: L(R)* = {ε} ∪ L(R) ∪ L(R)·L(R) ∪ ...

Derived operations (not primitive, but definable):

R+ = R · R* — one or more repetitions of R
R? = ε ∪ R — zero or one copy of R (optional)

These are shorthand — they do not add expressive power. Every R+ and R? can be rewritten using only ∪, ·, and *.

Building Expressions: Examples

Binary strings with even number of 0s:

The key insight: between any two consecutive 0s, there can be any number of 1s. Between the start and first 0, any number of 1s. After the last 0, any number of 1s. The 0s come in pairs.

L = (1* 0 1* 0)* 1*

Read: zero or more pairs of 0s (with 1s allowed anywhere), followed by trailing 1s.

Simplified email pattern:

[a-zA-Z0-9.]+ @ [a-zA-Z0-9.]+ \. [a-zA-Z]{2,}

(This is practical regex syntax. In formal notation, character classes like [a-z] are shorthand for a finite union: a ∪ b ∪ ... ∪ z.)

Binary strings divisible by 3:

A binary number is divisible by 3 based on its bit pattern — trackable by a DFA with 3 states (one per remainder). The corresponding regex is: 0* | (0*1(01*0)*10*)+

This is complex but exists — divisibility by 3 is a regular language.

Kleene's Theorem

The central result connecting regular expressions to automata:

Kleene's Theorem (1951): A language L is regular if and only if it can be described by a regular expression.

This theorem has two directions, each proven constructively:

Direction 1: Regex → NFA (Thompson's Construction)

Every regular expression can be converted to an NFA using Thompson's construction (as detailed in Lesson 4). The NFA is built bottom-up, matching the recursive structure of the regex.

For regex ab*c:

The ε-transitions around q3 allow zero repetitions of b (for the case b* = ε).

Direction 2: DFA → Regex (State Elimination)

Every DFA can be converted to a regular expression using the state elimination algorithm:

Add a new start state s with ε-transition to the original start state
Add a new accept state t with ε-transitions from all original accept states
Eliminate states one by one, updating transition labels to be sub-regexes
When only s and t remain, the label on the s→t transition is the regex

This proves that DFAs cannot recognize anything that regular expressions cannot describe — and vice versa.

Operation Precedence

Regular expression precedence (highest to lowest):

* (Kleene star) — binds most tightly
· (concatenation)
∪ (union) — binds least tightly

So ab* ∪ c means (a(b*)) ∪ c, not a(b*(∪c)).

Parentheses override precedence: (ab)* ∪ c means "zero or more abs, or the string c."

Formal vs Practical: Where They Diverge

Modern "regex" engines — Python re, JavaScript, Perl, Java Pattern — implement far more than formal regular expressions. Most extensions are shorthand (no extra power), but one critical extension breaks the formal model:

Feature	Formal Regex	Practical Regex	Regular?
`a\|b`, `ab`, `a*`	Core operations	Same	Yes
`a+`, `a?`	Shorthand via core	Same	Yes
`[a-z]`	Union of 26 chars	Character class	Yes
`{m,n}`	Repeated concatenation	Bounded quantifier	Yes
`.`	Union of all Σ chars	Any character	Yes
`(?=...)`	Not expressible	Lookahead	No (strictly more power)
`\1`	Not expressible	Backreference	NO — not regular!

Backreferences are the critical exception. The pattern (.+)\1 matches any string that is its own repetition — like "abab" or "hellohello." The language {ww | w ∈ Σ*} is not regular (we will prove this with the pumping lemma). Backreferences make practical regex engines capable of recognizing non-regular languages — but at the cost of worst-case exponential runtime (no efficient DFA exists for them).

This is why Google's RE2 library deliberately omits backreferences: without them, all patterns compile to DFAs with guaranteed linear-time matching.

Regular Expressions in Python: Under the Hood

import re

# Compiling a pattern builds the DFA/NFA
pattern = re.compile(r'^[0-9]{10}$')

# Matching runs the automaton on the input
result = pattern.match("5551234567")  # ACCEPT
result = pattern.match("555-123-456")  # REJECT

# The formal regex for this pattern (expanded):
# (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)...(×10)
# Shorthand: [0-9]{10}

Python's re module uses a backtracking NFA simulation (not a true DFA), which can exhibit catastrophic backtracking on pathological inputs. The re2 Python binding uses the true DFA approach.

Examples of Regular Language Descriptions

Language	Regular Expression	Notes
Strings over {0,1} with even length	`(00\|01\|10\|11)*`	Pair up all characters
Binary strings representing even numbers	`(0\|1)*0`	End in 0
Strings not containing "11"	`0(10+)1?`	Careful: no two consecutive 1s
All strings over {a,b}	`(a\|b)*`	Σ*
Empty language	`∅`	No string accepted
Only the empty string	`ε`	One string: ""
IPv4 octet (0-255)	`(25[0-5]\|2[0-4][0-9]\|[01]?[0-9]{1,2})`	Practical syntax

Why Formal Regular Expressions Matter

Understanding the formal definition of regular expressions does something the practical tools cannot: it gives you a proof of what is and is not possible.

When you know that {0ⁿ1ⁿ} is not regular, you know that no regex — no matter how clever — can validate balanced parentheses. You do not need to try. You know to use a parser (context-free grammar) instead.

When you know that backreferences break the regular model, you understand why RE2 excludes them and why Perl's regex engine can sometimes hang for minutes on pathological input.

When you know Kleene's Theorem, you know that switching between regex notation and automaton diagrams is lossless — one can always be converted to the other without losing any expressive power.

Formal theory is not a replacement for practical tools. It is the lens that shows you exactly what those tools can and cannot do.

The next lesson gives you the weapon for proving languages non-regular: the Pumping Lemma.

💬 DiscussionPowered by GitHub Discussions

📱

Get this course's notes on Telegram!

Free cheat sheets, summaries & practice exercises

Get Notes Free →

32 minLesson 6 of 16

Course Contents(16 lessons)

▾

Chapter 1: Mathematical Foundations

What Is Theory of Computation? Why It Matters20 min

Mathematical Foundations: Sets, Functions, Proofs28 min

Chapter 2: Finite Automata

Deterministic Finite Automata (DFA): States and Transitions35 min

Non-Deterministic Finite Automata (NFA)32 min

NFA to DFA: Subset Construction Algorithm30 min

Chapter 3: Regular Languages

Regular Expressions: Pattern Matching Formalized32 min

Pumping Lemma: Proving Languages Are Not Regular28 min

Closure Properties of Regular Languages25 min

Chapter 4: Context-Free Languages

Context-Free Grammars: Defining Programming Languages35 min

Pushdown Automata: Adding a Stack to Finite Automata32 min

CNF and the CYK Parsing Algorithm30 min

Pumping Lemma for Context-Free Languages25 min

Chapter 5: Computability

Turing Machines: The Universal Computer38 min

Decidability, Undecidability, and the Halting Problem32 min

Chapter 6: Computational Complexity

P vs NP: The Million-Dollar Question35 min

NP-Completeness and Polynomial Reductions38 min