A compiler's parser faces a deceptively hard question: given a CFG with hundreds of rules and a source file with thousands of tokens, does this file belong to the language? And if so, what is its parse tree? You could try all possible derivations — but their number is astronomical. The CYK algorithm (Cocke-Younger-Kasami), developed independently in the 1960s by John Cocke, Daniel Younger, and Tadao Kasami, solves this in O(n³ × |G|) time using dynamic programming. It is the theoretical foundation behind every context-free parser. The one requirement: the grammar must first be in Chomsky Normal Form.

Chomsky Normal Form: The Required Shape

A CFG is in Chomsky Normal Form (CNF) if every production rule is of exactly one of these forms:

A → BC — exactly two variables on the right-hand side.
A → a — exactly one terminal on the right-hand side.
S → ε — permitted only for the start symbol S, and only if the empty string is in the language.

No mixed rules. No long right-hand sides. No unit rules (A → B). This rigid structure is precisely what enables CYK's dynamic programming decomposition.

Converting Any CFG to CNF (Four Steps)

Every CFG can be converted to an equivalent CNF grammar. Here is the systematic procedure:

Step 1 — Eliminate ε-rules (except possibly S → ε): Find all "nullable" variables (those that can derive ε). For every rule containing a nullable variable, add a version of the rule with that variable omitted. Repeat until no new nullable variables are found. Remove all ε-rules except S → ε.

Step 2 — Eliminate unit rules (A → B): A unit rule passes control to another variable without consuming input. Find all unit pairs (A, B) where A ⇒* B via unit rules alone. For each unit pair (A, B) and each non-unit rule B → w, add A → w. Remove all unit rules.

Step 3 — Break long right-hand sides: Replace any rule A → B₁B₂...Bₙ (n ≥ 3) with a chain of binary rules using fresh variables: A → B₁X₁, X₁ → B₂X₂, ..., Xₙ₋₂ → Bₙ₋₁Bₙ.

Step 4 — Isolate terminals in binary rules: Replace A → aB or A → Ba with A → TₐB (or A → BTₐ) and a new rule Tₐ → a. Terminals now appear only in rules of the form A → a.

Worked example: Convert S → aSb | ab to CNF.

Rule S → aSb has length 3. Introduce X: S → aY where Y → Sb, then handle terminals: S → Tₐ Y, Y → S Tᵦ, Tₐ → a, Tᵦ → b.
Rule S → ab becomes S → Tₐ Tᵦ.
Final CNF: S → Tₐ Y | Tₐ Tᵦ, Y → S Tᵦ, Tₐ → a, Tᵦ → b.

The CYK Algorithm

CYK uses bottom-up dynamic programming. The central insight: if variable A derives the substring w[i..j], then some rule A → BC must exist where B derives w[i..k] and C derives w[k+1..j] for some split point k.

Input: String w = w₁w₂...wₙ, CNF grammar G = (V, Σ, R, S). Output: Does G generate w? (Plus the parse tree if desired.)

Table structure: T[i][j] = set of variables that derive the substring w[i..j].

Algorithm CYK(w, G):
  n = length(w)
  
  // Base case: single characters
  for i = 1 to n:
    for each rule A → wᵢ in R:
      add A to T[i][i]
  
  // Recursive case: substrings of length 2 to n
  for length = 2 to n:
    for i = 1 to n - length + 1:
      j = i + length - 1
      for k = i to j - 1:           // Try all split points
        for each rule A → BC in R:
          if B ∈ T[i][k] and C ∈ T[k+1][j]:
            add A to T[i][j]
  
  return (S ∈ T[1][n])

Time complexity: O(n³ × |G|) — three nested loops over string positions, times the number of grammar rules. Space complexity: O(n² × |V|) — the table has n² cells, each holding a subset of V.

Worked Example: String "baaba"

Consider the CNF grammar:

S → AB | BC
A → BA | a
B → CC | b
C → AB | a

Input: w = b a a b a (indices 1 through 5).

Because S ∈ T[1][5], the grammar generates baaba. The CYK table itself encodes the full parsing structure; backtracking through it yields the parse tree.

Fill-in for T[1][2] (substring ba): split at k=1. Check all rules A → BC: does B ∈ T[1][1] and C ∈ T[2][2]? T[1][1] = {B}, T[2][2] = {A,C}. Rule S → AB: B ∈ {B}? No (need A). Rule A → BA: B ∈ {B}? Yes. A ∈ {A,C}? Yes. So A ∈ T[1][2]. Rule S → BC: B ∈ {B}? Yes. C ∈ {A,C}? Yes. So S ∈ T[1][2].

Beyond CYK: Other Parsing Algorithms

Parsing Algorithm	Grammar Type	Time Complexity	Used In
CYK	Any CFG (CNF)	O(n³)	Natural language processing, theorem proving
Earley (1968)	Any CFG	O(n³) general, O(n²) unambiguous	Marpa, research parsers
LL(k)	Deterministic CFG subset	O(n)	Python (cpython), recursive-descent parsers
LALR(1)	Deterministic CFG subset	O(n)	GCC (C/C++), Java compilers, yacc/bison
GLR	Ambiguous CFG	O(n³) worst case	Natural language, some programming languages

The Dangling Else: Ambiguity in Real Languages

C's grammar is technically ambiguous due to the "dangling else" problem:

if (a) if (b) x = 1; else x = 2;

Does the else belong to the outer or inner if? The CFG for C has two parse trees for this. In practice, C compilers resolve this by convention (else matches the nearest if), but it means C's formal grammar is ambiguous. The C++ and Java standards handle this by specifying the ambiguity resolution rule in prose, not in the grammar itself — a pragmatic compromise between theoretical cleanliness and practical usability.

Why CYK Matters Today

CYK is the standard algorithm for constituency parsing in natural language processing. Every time a speech assistant parses a sentence like "remind me to call John at five" into a syntactic tree, a variant of CYK (or the faster Earley algorithm) is running under the hood. In compiler theory, CYK represents the theoretical worst-case — real compilers use LL or LR variants that run in O(n) for well-designed grammars. But understanding CYK means understanding why parsing is fundamentally a substring-decomposition problem.

Sources: Sipser, M. (2013). Introduction to the Theory of Computation (3rd ed.). Cengage. | Younger, D. H. (1967). Recognition and parsing of context-free languages in time n³. Information and Control, 10(2), 189–208. | Kasami, T. (1966). An efficient recognition and syntax-analysis algorithm for context-free languages. AFCRL Technical Report 65-758.

💬 DiscussionPowered by GitHub Discussions

📱

Get this course's notes on Telegram!

Free cheat sheets, summaries & practice exercises

Get Notes Free →

30 minLesson 11 of 16

Course Contents(16 lessons)

▾

Chapter 1: Mathematical Foundations

What Is Theory of Computation? Why It Matters20 min

Mathematical Foundations: Sets, Functions, Proofs28 min

Chapter 2: Finite Automata

Deterministic Finite Automata (DFA): States and Transitions35 min

Non-Deterministic Finite Automata (NFA)32 min

NFA to DFA: Subset Construction Algorithm30 min

Chapter 3: Regular Languages

Regular Expressions: Pattern Matching Formalized32 min

Pumping Lemma: Proving Languages Are Not Regular28 min

Closure Properties of Regular Languages25 min

Chapter 4: Context-Free Languages

Context-Free Grammars: Defining Programming Languages35 min

Pushdown Automata: Adding a Stack to Finite Automata32 min

CNF and the CYK Parsing Algorithm30 min

Pumping Lemma for Context-Free Languages25 min

Chapter 5: Computability

Turing Machines: The Universal Computer38 min

Decidability, Undecidability, and the Halting Problem32 min

Chapter 6: Computational Complexity

P vs NP: The Million-Dollar Question35 min

NP-Completeness and Polynomial Reductions38 min