AiTechWorlds
AiTechWorlds
In 1951, Stephen Kleene — a student of Alonzo Church and one of the founders of recursion theory — was studying the mathematical properties of neural networks. To describe the patterns of nerve activation, he needed a concise notation for sets of strings. He invented what he called regular events, notated with three operations: union, concatenation, and a starred repetition operator.
He could not have imagined that his notation would, seventy years later, be embedded in every programming language, every text editor, and every search engine on Earth.
When Google's crawlers look for email addresses, when your IDE highlights syntax errors in red, when a spam filter examines message patterns — they are all computing Kleene's "regular events." But the practical tools you use (Python's re, JavaScript's /regex/, Perl's pattern matching) differ from the formal definition in important ways. Understanding the formal definition reveals exactly what is, and is not, computable with a regex.
A regular expression over alphabet Σ is defined recursively:
| Expression | Denotes | Language |
|---|---|---|
ε | The empty string | {ε} — the set containing just the empty string |
∅ | No strings at all | ∅ — the empty language |
a for each a ∈ Σ | Single symbol | {a} — the set containing just that character |
If R₁ and R₂ are regular expressions, then so are:
| Operation | Expression | Language | Read as |
|---|---|---|---|
| Union | R₁ ∪ R₂ | L(R₁) ∪ L(R₂) | "R₁ or R₂" |
| Concatenation | R₁ · R₂ | L(R₁) · L(R₂) | "R₁ followed by R₂" |
| Kleene star | R₁* | L(R₁)* | "zero or more copies of R₁" |
That is all. Formal regular expressions have exactly three operations. No more.
R* denotes zero or more concatenations of strings from R:
R* = ε ∪ R ∪ RR ∪ RRR ∪ ...
More precisely: L(R)* = {ε} ∪ L(R) ∪ L(R)·L(R) ∪ ...
Derived operations (not primitive, but definable):
R+ = R · R* — one or more repetitions of RR? = ε ∪ R — zero or one copy of R (optional)These are shorthand — they do not add expressive power. Every R+ and R? can be rewritten using only ∪, ·, and *.
Binary strings with even number of 0s:
The key insight: between any two consecutive 0s, there can be any number of 1s. Between the start and first 0, any number of 1s. After the last 0, any number of 1s. The 0s come in pairs.
L = (1* 0 1* 0)* 1*
Read: zero or more pairs of 0s (with 1s allowed anywhere), followed by trailing 1s.
Simplified email pattern:
[a-zA-Z0-9.]+ @ [a-zA-Z0-9.]+ \. [a-zA-Z]{2,}
(This is practical regex syntax. In formal notation, character classes like [a-z] are shorthand for a finite union: a ∪ b ∪ ... ∪ z.)
Binary strings divisible by 3:
A binary number is divisible by 3 based on its bit pattern — trackable by a DFA with 3 states (one per remainder). The corresponding regex is:
0* | (0*1(01*0)*10*)+
This is complex but exists — divisibility by 3 is a regular language.
The central result connecting regular expressions to automata:
Kleene's Theorem (1951): A language L is regular if and only if it can be described by a regular expression.
This theorem has two directions, each proven constructively:
Every regular expression can be converted to an NFA using Thompson's construction (as detailed in Lesson 4). The NFA is built bottom-up, matching the recursive structure of the regex.
For regex ab*c:
The ε-transitions around q3 allow zero repetitions of b (for the case b* = ε).
Every DFA can be converted to a regular expression using the state elimination algorithm:
This proves that DFAs cannot recognize anything that regular expressions cannot describe — and vice versa.
Regular expression precedence (highest to lowest):
* (Kleene star) — binds most tightly· (concatenation)∪ (union) — binds least tightlySo ab* ∪ c means (a(b*)) ∪ c, not a(b*(∪c)).
Parentheses override precedence: (ab)* ∪ c means "zero or more abs, or the string c."
Modern "regex" engines — Python re, JavaScript, Perl, Java Pattern — implement far more than formal regular expressions. Most extensions are shorthand (no extra power), but one critical extension breaks the formal model:
| Feature | Formal Regex | Practical Regex | Regular? |
|---|---|---|---|
a|b, ab, a* | Core operations | Same | Yes |
a+, a? | Shorthand via core | Same | Yes |
[a-z] | Union of 26 chars | Character class | Yes |
{m,n} | Repeated concatenation | Bounded quantifier | Yes |
. | Union of all Σ chars | Any character | Yes |
(?=...) | Not expressible | Lookahead | No (strictly more power) |
\1 | Not expressible | Backreference | NO — not regular! |
Backreferences are the critical exception. The pattern (.+)\1 matches any string that is its own repetition — like "abab" or "hellohello." The language {ww | w ∈ Σ*} is not regular (we will prove this with the pumping lemma). Backreferences make practical regex engines capable of recognizing non-regular languages — but at the cost of worst-case exponential runtime (no efficient DFA exists for them).
This is why Google's RE2 library deliberately omits backreferences: without them, all patterns compile to DFAs with guaranteed linear-time matching.
import re
# Compiling a pattern builds the DFA/NFA
pattern = re.compile(r'^[0-9]{10}$')
# Matching runs the automaton on the input
result = pattern.match("5551234567") # ACCEPT
result = pattern.match("555-123-456") # REJECT
# The formal regex for this pattern (expanded):
# (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)...(×10)
# Shorthand: [0-9]{10}
Python's re module uses a backtracking NFA simulation (not a true DFA), which can exhibit catastrophic backtracking on pathological inputs. The re2 Python binding uses the true DFA approach.
| Language | Regular Expression | Notes |
|---|---|---|
| Strings over {0,1} with even length | (00|01|10|11)* | Pair up all characters |
| Binary strings representing even numbers | (0|1)*0 | End in 0 |
| Strings not containing "11" | 0*(10+)*1? | Careful: no two consecutive 1s |
| All strings over {a,b} | (a|b)* | Σ* |
| Empty language | ∅ | No string accepted |
| Only the empty string | ε | One string: "" |
| IPv4 octet (0-255) | (25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2}) | Practical syntax |
Understanding the formal definition of regular expressions does something the practical tools cannot: it gives you a proof of what is and is not possible.
When you know that {0ⁿ1ⁿ} is not regular, you know that no regex — no matter how clever — can validate balanced parentheses. You do not need to try. You know to use a parser (context-free grammar) instead.
When you know that backreferences break the regular model, you understand why RE2 excludes them and why Perl's regex engine can sometimes hang for minutes on pathological input.
When you know Kleene's Theorem, you know that switching between regex notation and automaton diagrams is lossless — one can always be converted to the other without losing any expressive power.
Formal theory is not a replacement for practical tools. It is the lens that shows you exactly what those tools can and cannot do.
The next lesson gives you the weapon for proving languages non-regular: the Pumping Lemma.
Get this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises