AiTechWorlds
AiTechWorlds
A compiler's parser faces a deceptively hard question: given a CFG with hundreds of rules and a source file with thousands of tokens, does this file belong to the language? And if so, what is its parse tree? You could try all possible derivations — but their number is astronomical. The CYK algorithm (Cocke-Younger-Kasami), developed independently in the 1960s by John Cocke, Daniel Younger, and Tadao Kasami, solves this in O(n³ × |G|) time using dynamic programming. It is the theoretical foundation behind every context-free parser. The one requirement: the grammar must first be in Chomsky Normal Form.
A CFG is in Chomsky Normal Form (CNF) if every production rule is of exactly one of these forms:
A → BC — exactly two variables on the right-hand side.A → a — exactly one terminal on the right-hand side.S → ε — permitted only for the start symbol S, and only if the empty string is in the language.No mixed rules. No long right-hand sides. No unit rules (A → B). This rigid structure is precisely what enables CYK's dynamic programming decomposition.
Every CFG can be converted to an equivalent CNF grammar. Here is the systematic procedure:
Step 1 — Eliminate ε-rules (except possibly S → ε): Find all "nullable" variables (those that can derive ε). For every rule containing a nullable variable, add a version of the rule with that variable omitted. Repeat until no new nullable variables are found. Remove all ε-rules except S → ε.
Step 2 — Eliminate unit rules (A → B): A unit rule passes control to another variable without consuming input. Find all unit pairs (A, B) where A ⇒* B via unit rules alone. For each unit pair (A, B) and each non-unit rule B → w, add A → w. Remove all unit rules.
Step 3 — Break long right-hand sides:
Replace any rule A → B₁B₂...Bₙ (n ≥ 3) with a chain of binary rules using fresh variables:
A → B₁X₁, X₁ → B₂X₂, ..., Xₙ₋₂ → Bₙ₋₁Bₙ.
Step 4 — Isolate terminals in binary rules:
Replace A → aB or A → Ba with A → TₐB (or A → BTₐ) and a new rule Tₐ → a. Terminals now appear only in rules of the form A → a.
Worked example: Convert S → aSb | ab to CNF.
S → Tₐ Y | Tₐ Tᵦ, Y → S Tᵦ, Tₐ → a, Tᵦ → b.CYK uses bottom-up dynamic programming. The central insight: if variable A derives the substring w[i..j], then some rule A → BC must exist where B derives w[i..k] and C derives w[k+1..j] for some split point k.
Input: String w = w₁w₂...wₙ, CNF grammar G = (V, Σ, R, S).
Output: Does G generate w? (Plus the parse tree if desired.)
Table structure: T[i][j] = set of variables that derive the substring w[i..j].
Algorithm CYK(w, G):
n = length(w)
// Base case: single characters
for i = 1 to n:
for each rule A → wᵢ in R:
add A to T[i][i]
// Recursive case: substrings of length 2 to n
for length = 2 to n:
for i = 1 to n - length + 1:
j = i + length - 1
for k = i to j - 1: // Try all split points
for each rule A → BC in R:
if B ∈ T[i][k] and C ∈ T[k+1][j]:
add A to T[i][j]
return (S ∈ T[1][n])
Time complexity: O(n³ × |G|) — three nested loops over string positions, times the number of grammar rules. Space complexity: O(n² × |V|) — the table has n² cells, each holding a subset of V.
Consider the CNF grammar:
S → AB | BCA → BA | aB → CC | bC → AB | aInput: w = b a a b a (indices 1 through 5).
Because S ∈ T[1][5], the grammar generates baaba. The CYK table itself encodes the full parsing structure; backtracking through it yields the parse tree.
Fill-in for T[1][2] (substring ba): split at k=1. Check all rules A → BC: does B ∈ T[1][1] and C ∈ T[2][2]? T[1][1] = {B}, T[2][2] = {A,C}. Rule S → AB: B ∈ {B}? No (need A). Rule A → BA: B ∈ {B}? Yes. A ∈ {A,C}? Yes. So A ∈ T[1][2]. Rule S → BC: B ∈ {B}? Yes. C ∈ {A,C}? Yes. So S ∈ T[1][2].
| Parsing Algorithm | Grammar Type | Time Complexity | Used In |
|---|---|---|---|
| CYK | Any CFG (CNF) | O(n³) | Natural language processing, theorem proving |
| Earley (1968) | Any CFG | O(n³) general, O(n²) unambiguous | Marpa, research parsers |
| LL(k) | Deterministic CFG subset | O(n) | Python (cpython), recursive-descent parsers |
| LALR(1) | Deterministic CFG subset | O(n) | GCC (C/C++), Java compilers, yacc/bison |
| GLR | Ambiguous CFG | O(n³) worst case | Natural language, some programming languages |
C's grammar is technically ambiguous due to the "dangling else" problem:
if (a) if (b) x = 1; else x = 2;
Does the else belong to the outer or inner if? The CFG for C has two parse trees for this. In practice, C compilers resolve this by convention (else matches the nearest if), but it means C's formal grammar is ambiguous. The C++ and Java standards handle this by specifying the ambiguity resolution rule in prose, not in the grammar itself — a pragmatic compromise between theoretical cleanliness and practical usability.
CYK is the standard algorithm for constituency parsing in natural language processing. Every time a speech assistant parses a sentence like "remind me to call John at five" into a syntactic tree, a variant of CYK (or the faster Earley algorithm) is running under the hood. In compiler theory, CYK represents the theoretical worst-case — real compilers use LL or LR variants that run in O(n) for well-designed grammars. But understanding CYK means understanding why parsing is fundamentally a substring-decomposition problem.
Sources: Sipser, M. (2013). Introduction to the Theory of Computation (3rd ed.). Cengage. | Younger, D. H. (1967). Recognition and parsing of context-free languages in time n³. Information and Control, 10(2), 189–208. | Kasami, T. (1966). An efficient recognition and syntax-analysis algorithm for context-free languages. AFCRL Technical Report 65-758.
Get this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises