Context-Free Grammars and Languages

Context-free grammars provide a powerful formalism for describing languages with nested or recursive structures. Unlike regular languages, context-free languages can express patterns like balanced parentheses, nested structures, and many programming language constructs. In this section, we'll explore their definition, properties, and theoretical foundations.

Formal Definitions

Definition: Context-Free Grammar

A context-free grammar (CFG) is a 4-tuple G = (V, Σ, R, S) where:

  • V is a finite set of non-terminal symbols
  • Σ is a finite set of terminal symbols, disjoint from V
  • R is a finite set of production rules of the form A → α, where A ∈ V and α ∈ (V ∪ Σ)*
  • S ∈ V is the start symbol

Definition: Derivation

Let G = (V, Σ, R, S) be a context-free grammar. For any strings α, β ∈ (V ∪ Σ)*, we say α directly derives β, written as α ⇒ β, if there exist strings γ, δ ∈ (V ∪ Σ)* and a production rule A → ω in R such that α = γAδ and β = γωδ.

We say α derives β, written as α ⇒* β, if there exists a sequence α₀, α₁, ..., αₙ such that α = α₀, β = αₙ, and αᵢ₋₁ ⇒ αᵢ for i = 1, 2, ..., n.

Definition: Parse Tree

A parse tree (or derivation tree) for a string w ∈ L(G) is a rooted, ordered tree such that:

  • The root is labeled with the start symbol S.
  • Internal nodes are labeled with non-terminals from V.
  • Leaf nodes are labeled with terminals from Σ or ε.
  • If an internal node labeled A has children labeled X₁ X₂ ... Xₖ, then there is a production rule A → X₁X₂...Xₖ in R.

The yield of the tree (concatenation of leaves left-to-right) is the derived string.

Definition: Context-Free Language

Let G = (V, Σ, R, S) be a context-free grammar. The language generated by G, denoted L(G), is the set of all strings w ∈ Σ* such that S ⇒* w.

A language L is context-free if there exists a context-free grammar G such that L = L(G).

Example: A Simple Context-Free Grammar

Let's examine the grammar for balanced parentheses:

Grammar G:

S → (S)S | ε

Language L(G):

L(G) = {strings of balanced parentheses}

Examples: ε, (), (()), ()(), (())()

Derivation of string "(())":

S
⇒ (S)S [using S → (S)S]
⇒ ((S)S)S [using S → (S)S on the first S]
⇒ (()S)S [using S → ε on the innermost S]
⇒ (())S [using S → ε on the remaining inner S]
⇒ (()) [using S → ε on the final S]

Parse Tree for "(())":

          S
    ┌───┬─┴─┬───┐
    (   S   )   S
  ┌───┬─┴─┬───┐ │
  (   S   )   S ε
      │       │
      ε       ε

Reading leaves left-to-right: ( ( ε ) ε ) ε = (())

Key insight: The recursive structure of the grammar naturally captures the nested nature of balanced parentheses, something impossible with regular languages.
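The grammar translates directly into a recognizer: each production becomes a clause of a recursive-descent parser. Below is a minimal Python sketch (the function name and error convention are ours, purely for illustration):

```python
def is_balanced(s: str) -> bool:
    """Recognize L(G) for G: S -> (S)S | epsilon, by recursive descent.

    parse_S returns the index just past the portion of s it consumed,
    or None on a parse error.
    """
    def parse_S(i: int):
        # S -> (S)S  if the next symbol is '(' ; otherwise S -> epsilon
        if i < len(s) and s[i] == '(':
            j = parse_S(i + 1)            # inner S
            if j is None or j >= len(s) or s[j] != ')':
                return None               # missing closing ')'
            return parse_S(j + 1)         # trailing S
        return i                          # S -> epsilon consumes nothing

    return parse_S(0) == len(s)

# Sanity checks against the examples in the text:
for w in ["", "()", "(())", "()()", "(())()"]:
    assert is_balanced(w)
for w in ["(", ")", ")(", "(()"]:
    assert not is_balanced(w)
```

Note how the two clauses of `parse_S` mirror the two productions of the grammar, and the recursion mirrors the nesting.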

Normal Forms

Definition: Chomsky Normal Form (CNF)

A context-free grammar G = (V, Σ, R, S) is in Chomsky Normal Form (CNF) if every production rule in R is of one of the following forms:

  • A → BC, where A, B, C ∈ V and B, C ≠ S
  • A → a, where A ∈ V and a ∈ Σ
  • S → ε, where S is the start symbol (only allowed if ε ∈ L(G))

Theorem: CNF Conversion Theorem

For any context-free grammar G, there exists an equivalent context-free grammar G′ in Chomsky Normal Form such that L(G) = L(G′) (excluding or optionally including ε).

Proof: Proof of CNF Conversion

The transformation from an arbitrary CFG G to an equivalent CNF grammar G′ involves several steps:

  1. Introduce a new start symbol S′ with production S′ → S to ensure the start symbol doesn't appear on the right-hand side of any production.
  2. Eliminate all ε-productions (except possibly S′ → ε) using the following procedure:
    • Identify all nullable non-terminals (those that can derive ε).
    • For every production A → α containing nullable non-terminals, add new productions by systematically removing one or more nullable non-terminals from α.
    • Remove all original ε-productions.
  3. Eliminate all unit productions (rules of form A → B where B is a non-terminal) by:
    • Compute the transitive closure of unit production relationships.
    • For each unit pair (A,B), add A → γ for every original production B → γ where γ is not a single non-terminal.
    • Remove all original unit productions.
  4. Replace terminals in RHS of length ≥ 2 by introducing new non-terminals:
    • For each terminal a in a production A → α with |α| ≥ 2, create a new non-terminal Xa and add production Xa → a.
    • Replace each occurrence of a in α with Xa.
  5. Convert RHS of length > 2 to binary form:
    • For each production A → X₁X₂...Xₙ with n > 2, introduce new non-terminals Y₁, Y₂, ..., Yₙ₋₂.
    • Replace the original production with:

      A → X₁Y₁

      Y₁ → X₂Y₂

      ...

      Yₙ₋₃ → Xₙ₋₂Yₙ₋₂

      Yₙ₋₂ → Xₙ₋₁Xₙ

Each step preserves the language generated by the grammar, so the resulting grammar G′ is in Chomsky Normal Form and generates L(G) (with ε handled separately via the S′ → ε production).
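Step 2 of the conversion hinges on identifying the nullable non-terminals, which is a standard fixed-point computation. A Python sketch (the grammar encoding is ours: each non-terminal maps to a list of right-hand sides, each a tuple of symbols, with the empty tuple encoding an ε-production):

```python
def nullable_nonterminals(productions):
    """Return the set of non-terminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhss in productions.items():
            if lhs in nullable:
                continue
            for rhs in rhss:
                # A is nullable if some rule A -> X1...Xn has every Xi
                # nullable (the empty RHS is vacuously all-nullable).
                if all(sym in nullable for sym in rhs):
                    nullable.add(lhs)
                    changed = True
                    break
    return nullable

# Example: S -> Ac, A -> a | eps, B -> AA; here A and B are nullable, S is not.
g = {
    "S": [("A", "c")],
    "A": [("a",), ()],
    "B": [("A", "A")],
}
assert nullable_nonterminals(g) == {"A", "B"}
```

The loop runs until no new nullable symbol is found, so indirectly nullable symbols (like B above, nullable only via A) are also caught.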

Definition: Greibach Normal Form (GNF)

A context-free grammar G = (V, Σ, R, S) is in Greibach Normal Form (GNF) if every production rule in R is of the form:

  • A → aα, where A ∈ V, a ∈ Σ, and α ∈ V*
  • S → ε, where S is the start symbol (only allowed if ε ∈ L(G))

Definition: GNF Conversion Overview

Any ε-free CFG can be converted to an equivalent grammar in GNF through the following process:

  1. Eliminate left recursion (both direct and indirect):
    • For direct left recursion (A → Aα | β), replace with A → βA′ and A′ → αA′ | ε
    • For indirect left recursion, first order all non-terminals and systematically eliminate recursion
  2. Transform the grammar to ensure each production starts with a terminal:
    • Replace A → Bα (where B is a non-terminal) with A → βα for each B → β
    • Continue this substitution process until all productions begin with terminals
  3. Apply left factoring to ensure deterministic expansion where possible

GNF is particularly important for top-down parsers (like recursive-descent parsers), as it ensures each derivation step produces one terminal followed by non-terminals, making the parsing decision deterministic based on the next input symbol.

Example: CNF vs GNF for the Same Language

Let's convert the same language L = {a^n b^n | n ≥ 0} into both normal forms.

Original Grammar:

S → aSb | ε

Chomsky Normal Form (CNF):

S → AB | AC | ε
S₁ → AB | AC
C → S₁B
A → a
B → b

Every production is of the form A → BC (two variables) or A → a (one terminal), with S → ε as the one permitted exception; note that the start symbol S no longer appears on any right-hand side.

Greibach Normal Form (GNF):

S → aSB | ε
B → b

Every production starts with a terminal: A → aα

Derivation comparison for "aabb":

CNF Derivation:

S ⇒ AC
⇒ aC (A → a)
⇒ aS₁B (C → S₁B)
⇒ aABB (S₁ → AB)
⇒ aaBB (A → a)
⇒ aabB (B → b)
⇒ aabb (B → b)

GNF Derivation:

S ⇒ aSB
⇒ aaSBB
⇒ aaBB (S → ε)
⇒ aabB
⇒ aabb

Usage comparison:

CNF Benefits:

  • Perfect for CYK parsing algorithm
  • Binary tree structure simplifies analysis
  • Uniform production format

GNF Benefits:

  • Ideal for top-down parsing
  • Each step produces one terminal
  • Predictive parsing with a single symbol of lookahead

Key insight: Both normal forms generate the same language but optimize for different parsing strategies: CNF for bottom-up, GNF for top-down.
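The top-down behaviour that GNF enables can be seen concretely: for the GNF grammar above (S → aSB | ε, B → b), a stack machine that expands the topmost non-terminal consumes exactly one input symbol per expansion, and the next input symbol alone determines which production to use. A Python sketch (the encoding is ours):

```python
def accepts_anbn(w: str) -> bool:
    """Predictive recognizer for the GNF grammar S -> aSB | eps, B -> b."""
    stack = ["S"]                    # start symbol on the stack
    i = 0
    while stack:
        top = stack.pop()
        if top == "S":
            if i < len(w) and w[i] == "a":
                i += 1                  # S -> a S B : consume 'a'
                stack += ["B", "S"]     # push reversed, so S is expanded next
            # otherwise choose S -> eps and consume nothing
        elif top == "B":
            if i < len(w) and w[i] == "b":
                i += 1                  # B -> b : consume 'b'
            else:
                return False            # B has no eps-production
    return i == len(w)                  # all input consumed, stack empty

assert all(accepts_anbn(w) for w in ["", "ab", "aabb", "aaabbb"])
assert not any(accepts_anbn(w) for w in ["a", "b", "ba", "aab", "abb"])
```

At every step the choice of production is forced by the next input symbol, which is exactly the determinism GNF is designed to provide.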

The Pumping Lemma for Context-Free Languages

Lemma: Pumping Lemma for Context-Free Languages

Let L be a context-free language. Then there exists a constant p ≥ 1 (the pumping length) such that any string s ∈ L with |s| ≥ p can be decomposed as s = uvwxy where:

  1. |vwx| ≤ p
  2. |vx| > 0 (i.e., at least one of v or x is non-empty)
  3. For all i ≥ 0, uvⁱwxⁱy ∈ L

Proof: Proof of the Pumping Lemma for CFLs

Let G be a context-free grammar for L in Chomsky Normal Form. Let k be the number of variables in G.

Let p = 2ᵏ and let s be any string in L such that |s| ≥ p.

Consider a parse tree for s and choose a longest path from the root to a leaf. Since s has at least 2ᵏ symbols and every internal node of the parse tree has at most 2 children (since G is in CNF), this path contains at least k + 1 variables.

By the pigeonhole principle, among the lowest k + 1 variables on this path some variable must repeat. Let A be such a variable, and denote its two occurrences A₁ (upper) and A₂ (lower).

The subtree rooted at A₁ contains the subtree rooted at A₂. Let w be the string derived from A₂, and let vwx be the string derived from A₁. Thus, s can be written as uvwxy where:

  • u is the part of s derived from the part of the parse tree before reaching A1
  • v is the part of s derived from the part of the parse tree between A1 and A2
  • w is the part of s derived from A2
  • x is the part of s derived from the part of the parse tree after A2 but still within A1
  • y is the part of s derived from the part of the parse tree after A1

Since A1 and A2 are the same variable, we can replace the subtree rooted at A2 with the entire subtree rooted at A1, which would insert an extra copy of vwx. Similarly, we could remove the entire subtree rooted at A2 and replace it with just w.

This gives us that uvⁱwxⁱy ∈ L for any i ≥ 0. Also, since A₁ lies on a longest path and is among the lowest k + 1 variables on it, the subtree rooted at A₁ has height at most k + 1, so |vwx| ≤ 2ᵏ = p. Finally, v and x cannot both be empty: because G is in CNF, A₁ has exactly two children, and the child whose subtree does not contain A₂ derives a non-empty string that lies in v or in x.

Proof: Using the Pumping Lemma to Prove Non-Context-Free Languages

We can use the pumping lemma to prove that certain languages are not context-free through a proof by contradiction.

  1. Assume that the language L is context-free
  2. Then the pumping lemma must apply to L with some pumping length p
  3. Choose a string s ∈ L with |s| ≥ p
  4. Show that for any decomposition s = uvwxy satisfying conditions 1 and 2, there exists some i ≥ 0 such that uvⁱwxⁱy ∉ L
  5. This contradicts condition 3 of the pumping lemma
  6. Therefore, L cannot be context-free

Example: Proving {aⁿ bⁿ cⁿ | n ≥ 0} is Not Context-Free

Proof:

Step 1: Assume, for contradiction, that L = {aⁿ bⁿ cⁿ | n ≥ 0} is context-free.

Step 2: By the pumping lemma, there exists a pumping length p ≥ 1.

Step 3: Consider the string s = aᵖ bᵖ cᵖ ∈ L.

Step 4: For any decomposition s = uvwxy with |vwx| ≤ p and |vx| > 0, we need to show that there exists some i such that uvⁱwxⁱy ∉ L.

Since |vwx| ≤ p, the string vwx can contain at most two different symbols from {a, b, c}. We consider all possible cases:

  • If vwx contains only a's, then v and x contain only a's. Pumping with i = 2 yields more a's than b's or c's.
  • If vwx contains only a's and b's, pumping with i = 2 increases the number of a's and/or b's but leaves the number of c's unchanged, so the counts become unequal.
  • Symmetrically, if vwx contains only b's, only b's and c's, or only c's, pumping disrupts the balance. (Note that vwx cannot contain both a's and c's, since |vwx| ≤ p and the block of b's has length p.)

Step 5: For any decomposition, pumping with i = 2 gives a string that is not in L, contradicting the pumping lemma.

Step 6: Therefore, L is not context-free.
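For a small value of p, the case analysis above can be checked exhaustively: every decomposition of s = aᵖbᵖcᵖ satisfying conditions 1 and 2 pumps out of the language at i = 2. A Python sketch (the membership test and the enumeration loop are ours):

```python
def in_L(s: str) -> bool:
    """Membership in {a^n b^n c^n | n >= 0}."""
    n = len(s) // 3
    return s == "a" * n + "b" * n + "c" * n

p = 4
s = "a" * p + "b" * p + "c" * p

for start in range(len(s) + 1):                  # where vwx begins
    for vwx_len in range(1, p + 1):              # condition 1: |vwx| <= p
        end = start + vwx_len
        if end > len(s):
            break
        u, vwx, y = s[:start], s[start:end], s[end:]
        for i in range(len(vwx) + 1):            # split vwx into v, w, x
            for j in range(i, len(vwx) + 1):
                v, w, x = vwx[:i], vwx[i:j], vwx[j:]
                if not (v or x):                 # condition 2: |vx| > 0
                    continue
                pumped = u + v * 2 + w + x * 2 + y   # pump with i = 2
                assert not in_L(pumped)
print("no decomposition survives pumping with i = 2")
```

This is not a proof (it checks one p), but it makes the adversarial structure of the argument tangible: the contradiction must hold for every decomposition, not just a convenient one.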

Closure Properties

Theorem: Closure Properties of Context-Free Languages

Context-free languages are closed under the following operations:

  • Union: If L1 and L2 are CFLs, then L1 ∪ L2 is a CFL.
  • Concatenation: If L1 and L2 are CFLs, then L1L2 is a CFL.
  • Kleene Star: If L is a CFL, then L* is a CFL.
  • Substitution: If L is a CFL and each symbol is replaced by a CFL, the result is a CFL.
  • Homomorphism: If L is a CFL and h is a homomorphism, then h(L) is a CFL.
  • Inverse homomorphism: If L is a CFL and h is a homomorphism, then h⁻¹(L) is a CFL.

Context-free languages are not closed under the following operations:

  • Intersection: There exist CFLs L1 and L2 such that L1 ∩ L2 is not a CFL.
  • Complement: There exists a CFL L such that Σ* - L is not a CFL.
  • Set difference: There exist CFLs L1 and L2 such that L1 - L2 is not a CFL.

Proof: Proving Closure Under Union

If L₁ and L₂ are context-free languages with corresponding grammars G₁ = (V₁, Σ, R₁, S₁) and G₂ = (V₂, Σ, R₂, S₂), where V₁ and V₂ are disjoint (rename variables if necessary), we can construct a grammar G = (V, Σ, R, S) for L₁ ∪ L₂ as follows:

  • V = V₁ ∪ V₂ ∪ {S} where S is a new non-terminal not in V₁ or V₂
  • R = R₁ ∪ R₂ ∪ {S → S₁, S → S₂}

This grammar generates precisely L₁ ∪ L₂, proving that the union of context-free languages is context-free.
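The construction is mechanical, including the renaming needed to make the variable sets disjoint. A Python sketch (the grammar encoding and tagging scheme are ours: a dict from variable to right-hand-side tuples, with symbols absent from the dict treated as terminals):

```python
def union_grammar(g1, s1, g2, s2, start="S"):
    """Build a grammar for L(g1) | L(g2); s1, s2 are the start symbols."""
    def tag(g, suffix):
        def t(sym):
            return sym + suffix if sym in g else sym   # rename variables only
        return {lhs + suffix: [tuple(t(x) for x in rhs) for rhs in rhss]
                for lhs, rhss in g.items()}
    g = tag(g1, "_1")            # suffixes guarantee disjoint variable sets
    g.update(tag(g2, "_2"))
    g[start] = [(s1 + "_1",), (s2 + "_2",)]            # S -> S1 | S2
    return g

# L1 = {a^n b^n}, L2 = {b^m c^m}, both with start symbol "S"
g1 = {"S": [("a", "S", "b"), ()]}
g2 = {"S": [("b", "S", "c"), ()]}
g = union_grammar(g1, "S", g2, "S")
assert g["S"] == [("S_1",), ("S_2",)]
assert g["S_1"] == [("a", "S_1", "b"), ()]
assert g["S_2"] == [("b", "S_2", "c"), ()]
```

Note that the new start symbol must itself be fresh; here the tagging guarantees it never collides with a renamed variable.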

Proof: Non-Closure Under Intersection

To prove that context-free languages are not closed under intersection, we can show a counterexample:

Let L₁ = {aⁿ bⁿ cᵐ | n, m ≥ 0} and L₂ = {aᵐ bⁿ cⁿ | n, m ≥ 0}

Both L₁ and L₂ are context-free, but their intersection is:

L₁ ∩ L₂ = {aⁿ bⁿ cⁿ | n ≥ 0}

Which we've proven is not context-free using the pumping lemma.

Example: Closure Properties in Action

Let's demonstrate closure properties with concrete examples using two CFLs.

Given CFLs:

L₁ = {a^n b^n | n ≥ 0}

S₁ → aS₁b | ε

Examples: ε, ab, aabb, aaabbb

L₂ = {b^m c^m | m ≥ 0}

S₂ → bS₂c | ε

Examples: ε, bc, bbcc, bbbccc

Union (L₁ ∪ L₂) - Closed ✓

S → S₁ | S₂
S₁ → aS₁b | ε
S₂ → bS₂c | ε

L₁ ∪ L₂ = {a^n b^n | n ≥ 0} ∪ {b^m c^m | m ≥ 0} = {ε, ab, bc, aabb, bbcc, ...}

Concatenation (L₁L₂) - Closed ✓

S → S₁S₂
S₁ → aS₁b | ε
S₂ → bS₂c | ε

L₁L₂ = {a^n b^n b^m c^m | n,m ≥ 0} = {ε, ab, bc, aabb, abbc, bbcc, aabbbc, ...}

Kleene Star (L₁*) - Closed ✓

S → S₁S | ε
S₁ → aS₁b | ε

L₁* = {ε, ab, abab, aabbaabb, aabbab, ...} (concatenations of strings from L₁)

Intersection (L₁ ∩ L₂) - Not Closed ✗

L₁ ∩ L₂ = {strings that are both a^n b^n and b^m c^m}

Only common string: ε (when n=0 and m=0)

So L₁ ∩ L₂ = {ε}, which is finite and thus regular, not demonstrating non-closure.

Better counterexample:

L₃ = {a^n b^n c^m | n,m ≥ 0} (context-free)
L₄ = {a^m b^n c^n | n,m ≥ 0} (context-free)
L₃ ∩ L₄ = {a^n b^n c^n | n ≥ 0} (not context-free!)

Intersection with Regular Language - Closed ✓

L₁ ∩ R where R = {strings ending with 'bb'} (regular)

Result: {a^n b^n | n ≥ 2} = {aabb, aaabbb, aaaabbbb, ...}

S → aSb | aabb

Still context-free! (Bar-Hillel theorem in action)

Key insight: CFLs are closed under operations that preserve their "stack-like" structure but not under operations that require comparing multiple independent constraints.

Ambiguity in Context-Free Grammars

Definition: Ambiguous Grammar

A context-free grammar G is ambiguous if there exists at least one string w ∈ L(G) that has multiple different leftmost derivations (or equivalently, multiple parse trees).

Example: Ambiguous Arithmetic Expressions

Consider the grammar:

E → E + E | E * E | (E) | a

The string a + a * a has two different parse trees, depending on whether + or * is applied first:

  • Derivation 1: E ⇒ E + E ⇒ a + E ⇒ a + E * E ⇒ a + a * E ⇒ a + a * a
  • Derivation 2: E ⇒ E * E ⇒ E + E * E ⇒ a + E * E ⇒ a + a * E ⇒ a + a * a

A common way to resolve this ambiguity is to use precedence rules by introducing additional non-terminals:

E → E + T | T
T → T * F | F
F → (E) | a

This unambiguous grammar enforces operator precedence: * has higher precedence than +.

Theorem: Inherent Ambiguity

There exist context-free languages that cannot be generated by any unambiguous context-free grammar. Such languages are called inherently ambiguous.

Proof: Proof of Inherent Ambiguity

We'll prove that L = {aᶦ bʲ cᵏ | i = j or j = k, where i, j, k ≥ 0} is inherently ambiguous.

Suppose, for contradiction, that L is not inherently ambiguous. Then there exists an unambiguous CFG G such that L(G) = L.

We can decompose L as L = L₁ ∪ L₂, where:

  • L₁ = {aⁿ bⁿ cᵐ | n, m ≥ 0}
  • L₂ = {aᵐ bⁿ cⁿ | n, m ≥ 0}

Both L₁ and L₂ are context-free. Strings of the form aⁿbⁿcⁿ lie in both parts of the union, and this overlap is the source of the ambiguity: intuitively, any grammar for L must be able to parse such strings both "as members of L₁" (matching the a's against the b's) and "as members of L₂" (matching the b's against the c's), and these two parses cannot coincide.

Making this rigorous requires more than the plain pumping lemma. The standard argument applies Ogden's lemma (introduced later in this section) to strings such as a^n b^n c^(n+n!) and a^(n+n!) b^n c^n: in any grammar for L, suitable markings force two structurally distinct parse trees for the single string a^(n+n!) b^(n+n!) c^(n+n!), one inherited from each part of the union. Hence no grammar for L can be unambiguous, and L is inherently ambiguous.

Example: Inherently Ambiguous Language

The language L = {aᶦ bʲ cᵏ | i = j or j = k, where i, j, k ≥ 0} is inherently ambiguous.

This means there is no unambiguous grammar that generates exactly this language.

Ogden's Lemma

Ogden's Lemma is a strengthened version of the pumping lemma that provides more precise control over which positions in a string can be "pumped." It's particularly powerful for proving languages are not context-free when the standard pumping lemma fails.

Lemma: Ogden's Lemma for Context-Free Languages

Let L be a context-free language. Then there exists a constant p ≥ 1 such that for any string s ∈ L with at least p marked positions, s can be decomposed as s = uvwxy where:

  1. vwx contains at most p marked positions
  2. vx contains at least one marked position
  3. For all i ≥ 0, uvⁱwxⁱy ∈ L

Proof: Proof of Ogden's Lemma

Let G be a CFG for L in Chomsky Normal Form with k variables. Set p = 2ᵏ⁺¹.

Consider any string s ∈ L with at least p marked positions, and let T be a parse tree for s.

Key insight: Call a node of T a branch point if both of its children have subtrees containing at least one marked position. Build a path from the root by always descending, at a branch point, into the child whose subtree contains more marked positions (at other nodes, descend into the child whose subtree contains the marked positions). Passing a branch point at most halves the number of marked positions below, so a path that starts above at least 2ᵏ⁺¹ marked positions must pass through at least k + 1 branch points before reaching a leaf.

Among the lowest k + 1 branch points on this path, by the pigeonhole principle, some variable A appears at least twice. Let the two occurrences be A₁ (upper) and A₂ (lower).

Define the decomposition s = uvwxy where:

  • vwx is the yield of the subtree rooted at A₁
  • w is the yield of the subtree rooted at A₂
  • v and x are the parts of vwx not in w

Moving back up the path, the marked count at most doubles at each branch point, and there are at most k + 1 branch points from the leaf up to A₁; hence the subtree rooted at A₁ contains at most 2ᵏ⁺¹ = p marked positions, so vwx contains at most p marked positions.

Because A₁ is itself a branch point, the child of A₁ that the path does not follow contributes at least one marked position, and that position lies in v or in x; hence vx contains at least one marked position. Pumping works exactly as in the standard pumping lemma: replacing the subtree rooted at A₂ with a copy of the subtree rooted at A₁ (or conversely, replacing the A₁ subtree with the A₂ subtree) shows that uvⁱwxⁱy ∈ L for all i ≥ 0.

Example: Using Ogden's Lemma

Let's prove that L = {a^i b^j c^k d^l | i = 0 or j = k = l} is not context-free. This language resists the standard pumping lemma but succumbs to Ogden's Lemma.

Why the standard pumping lemma fails:

For s = aᵖbᵖcᵖdᵖ, the decomposition that places v and x inside the a-block satisfies all three conditions: pumping the a's never violates "i = 0 or j = k = l". Since the standard lemma only guarantees that some decomposition pumps within L, no contradiction arises.

Ogden's Lemma approach:

  1. Consider the string s = a bᵖ cᵖ dᵖ ∈ L (since i = 1 ≠ 0, membership forces j = k = l)
  2. Mark exactly the b, c, and d positions: 3p marked positions
  3. In any decomposition s = uvwxy, vwx contains at most p marked positions, so its marked positions span at most 2 of the 3 blocks bᵖ, cᵖ, dᵖ
  4. Since vx contains at least one marked position, pumping with i = 2 lengthens at most two of the three blocks while i remains non-zero (the single a is either untouched or duplicated), so uv²wx²y violates j = k = l and lies outside L

Detailed case analysis:

• If the marked positions of vwx lie within bᵖcᵖ: pumping increases the b and/or c counts, but not the d count
• If they lie within cᵖdᵖ: pumping increases the c and/or d counts, but not the b count
• vwx cannot touch both the b-block and the d-block: that would require it to contain all p marked c's plus more, exceeding p marked positions

Key insight: Ogden's Lemma forces the pumpable region to be localized, preventing the adversary from avoiding constraint violations by pumping in "safe" regions.

Parikh's Theorem

Parikh's Theorem establishes a fundamental connection between context-free languages and regular languages in terms of their "letter-counting" properties. It shows that CFLs and regular languages are indistinguishable when we only care about how many times each symbol appears.

Definition: Parikh Vector and Semi-linear Sets

For an alphabet Σ = {a₁, a₂, ..., aₖ} and a string w ∈ Σ*, the Parikh vector of w is Ψ(w) = (|w|a₁, |w|a₂, ..., |w|aₖ), where |w|aᵢ is the number of occurrences of aᵢ in w.

For a language L, define Ψ(L) = {Ψ(w) | w ∈ L}.

A set S ⊆ ℕᵏ is linear if S = {c + n₁p₁ + n₂p₂ + ... + nᵣpᵣ | n₁, n₂, ..., nᵣ ≥ 0} for some vectors c, p₁, p₂, ..., pᵣ ∈ ℕᵏ.

A set is semi-linear if it's a finite union of linear sets.

Theorem: Parikh's Theorem

For every context-free language L, the set Ψ(L) is semi-linear.

Moreover, there exists a regular language R such that Ψ(L) = Ψ(R).

Proof: Proof of Parikh's Theorem

Let G = (V, Σ, R, S) be a CFG for L. We sketch the two halves of the argument; a complete proof is longer (see the suggested reading).

Semi-linearity of Ψ(L): The key combinatorial fact is that every w ∈ L has a derivation tree assembled from a bounded "skeleton" tree together with repeated insertions of "pump trees" (subtrees whose root variable reappears at a leaf, exactly as in the pumping lemma). For a size bound depending only on G, there are finitely many skeletons and finitely many pump trees. If a skeleton yields a terminal string with Parikh vector c, and the pump trees insertable into it yield terminal strings with Parikh vectors p₁, ..., pᵣ, then the strings obtainable from that skeleton have Parikh image {c + n₁p₁ + ... + nᵣpᵣ | n₁, ..., nᵣ ≥ 0}, a linear set. Ψ(L) is the union of the finitely many linear sets arising this way, hence semi-linear.

From semi-linear sets to regular languages: Given a linear set with base c and periods p₁, ..., pᵣ, choose any strings w₀, w₁, ..., wᵣ with Parikh vectors c, p₁, ..., pᵣ (for instance, list the required number of copies of each letter in a fixed order). The regular language w₀(w₁)*(w₂)*...(wᵣ)* then has exactly this linear set as its Parikh image. A semi-linear set is a finite union of linear sets, so it is the Parikh image of a finite union of such regular languages, which is again regular. This yields a regular language R with Ψ(R) = Ψ(L).

Example: Applying Parikh's Theorem

Consider L = {a^n b^n | n ≥ 0}. Even though this CFL has a complex nested structure, its Parikh image has the same pattern as a simple regular language.

Parikh image of L:

Ψ(L) = {(n, n) | n ≥ 0} = linear set with base (0, 0) and period (1, 1)

Equivalent regular language:

R = (ab)* has Ψ(R) = {(n, n) | n ≥ 0}

Same Parikh image despite completely different structure!

Key insight: This shows that the "counting complexity" of CFLs is no greater than that of regular languages, even though their structural complexity is much greater.
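The example is easy to verify computationally: compute Parikh vectors for both languages and compare the images for small n. A Python sketch (the `Counter`-based encoding is ours):

```python
from collections import Counter

def parikh(w, alphabet=("a", "b")):
    """Parikh vector of w: occurrence counts in a fixed alphabet order."""
    c = Counter(w)
    return tuple(c[x] for x in alphabet)

# Psi images of {a^n b^n} and of the regular language (ab)*, for n < 6
cfl_image = {parikh("a" * n + "b" * n) for n in range(6)}
reg_image = {parikh("ab" * n) for n in range(6)}
assert cfl_image == reg_image == {(n, n) for n in range(6)}
```

The strings aⁿbⁿ and (ab)ⁿ look nothing alike structurally, yet symbol-by-symbol counting cannot tell the two languages apart.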

Bar-Hillel Theorem

The Bar-Hillel Theorem establishes that context-free languages are closed under intersection with regular languages. This is a fundamental result that shows the boundary of what operations preserve context-freeness.

Theorem: Bar-Hillel Theorem

If L is a context-free language and R is a regular language, then L ∩ R is a context-free language.

Proof: Proof of Bar-Hillel Theorem

We prove this by constructing a context-free grammar for L ∩ R from a CFG for L and a finite automaton for R.

Let G = (V, Σ, P, S) be a CFG for L and M = (Q, Σ, δ, q₀, F) be a DFA for R.

Construction: We build a new CFG G' = (V', Σ, P', S') where:

  • V' = {[A, q, q'] | A ∈ V, q, q' ∈ Q} ∪ {S'}, where S' is a new start symbol
  • For each q ∈ F, add the start production S' → [S, q₀, q]

The non-terminal [A, q, q'] represents "strings derivable from A that take the DFA from state q to state q'."

Productions in P':

  • For each production A → a in P and each q ∈ Q:
    Add [A, q, δ(q, a)] → a to P'
  • For each production A → B₁B₂...Bₖ in P and all states r₀, r₁, ..., rₖ ∈ Q:
    Add [A, r₀, rₖ] → [B₁, r₀, r₁][B₂, r₁, r₂]...[Bₖ, rₖ₋₁, rₖ] to P'

Correctness: By induction on derivation length, we can show that [A, q, q'] ⇒* w in G' if and only if A ⇒* w in G and δ*(q, w) = q' in M.

Therefore, L(G') = L ∩ R, proving the theorem.
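The triple construction can be carried out mechanically. Below is a minimal Python sketch (all encodings are ours: product variables are tuples (A, q, q'), a CNF grammar maps variables to lists of bodies); it intersects a CNF grammar for {a^n b^n | n ≥ 1} with the DFA for "even number of a's" and verifies the result by exhaustive derivation up to length 8:

```python
from itertools import product

def intersect_cnf_dfa(prods, states, delta, q0, finals, start="S"):
    """Triple construction [A, q, q'] for a CNF grammar and a DFA.

    prods: dict variable -> list of bodies, each ("a",) or ("B", "C")
    delta: dict (state, terminal) -> state
    """
    new = {}
    for A, bodies in prods.items():
        for body in bodies:
            if len(body) == 1:                        # A -> a
                a = body[0]
                for q in states:
                    new.setdefault((A, q, delta[(q, a)]), []).append((a,))
            else:                                     # A -> B C
                B, C = body
                for q, r, s in product(states, repeat=3):
                    new.setdefault((A, q, s), []).append(((B, q, r), (C, r, s)))
    return new, [(start, q0, qf) for qf in finals]

def derivable(new, start_syms, maxlen):
    """All terminal strings of length <= maxlen derivable from a start
    variable (exhaustive leftmost expansion; sound because CNF sentential
    forms never shrink, so forms longer than maxlen can be pruned)."""
    results, seen = set(), set()
    stack = [(s,) for s in start_syms]
    while stack:
        form = stack.pop()
        if form in seen or len(form) > maxlen:
            continue
        seen.add(form)
        nts = [i for i, x in enumerate(form) if isinstance(x, tuple)]
        if not nts:
            results.add("".join(form))
        elif form[nts[0]] in new:                     # dead variables pruned
            i = nts[0]
            for body in new[form[i]]:
                stack.append(form[:i] + body + form[i + 1:])
    return results

# CNF grammar for {a^n b^n | n >= 1}: S -> AB | AC, C -> SB, A -> a, B -> b
prods = {"S": [("A", "B"), ("A", "C")], "C": [("S", "B")],
         "A": [("a",)], "B": [("b",)]}
# DFA for R = strings with an even number of a's
states, q0, finals = [0, 1], 0, [0]
delta = {}
for q in states:
    delta[(q, "a")] = 1 - q
    delta[(q, "b")] = q

new, starts = intersect_cnf_dfa(prods, states, delta, q0, finals)
assert derivable(new, starts, maxlen=8) == {"aabb", "aaaabbbb"}
```

The check confirms that the product grammar generates exactly the strings aⁿbⁿ with n even (and n ≥ 1) up to the length bound, i.e., L ∩ R.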

Example: Bar-Hillel Theorem Application

Let L = {a^n b^n c^m | n, m ≥ 0} (context-free) and R = {strings ending with 'bc'} (regular). Find L ∩ R.

Result:

L ∩ R = {a^n b^n c | n ≥ 1}

A string aⁿbⁿcᵐ ends with 'bc' exactly when n ≥ 1 and m = 1. This is context-free: S → Ac, A → aAb | ab

Key insight: Regular languages act as "well-behaved filters" for CFLs - they can restrict without destroying the context-free structure.

Advanced Ambiguity Theory

Beyond simple ambiguous vs. unambiguous grammars lies a rich hierarchy of ambiguity degrees. Some languages have bounded ambiguity, others have polynomial ambiguity, and some have exponential ambiguity. This classification provides deep insights into parsing complexity.

Definition: Degrees of Ambiguity

For a CFG G and a string w ∈ L(G), let ambG(w) denote the number of distinct parse trees for w in G.

A CFG G has:

  • Finite ambiguity if there exists k such that ambG(w) ≤ k for all w ∈ L(G)
  • Polynomial ambiguity if ambG(w) = O(|w|ᵏ) for some k
  • Exponential ambiguity if ambG(w) = 2^Ω(|w|)

Example: Analyzing Ambiguity Degree

Consider the grammar: S → SS | (S) | ε. Let's analyze its ambiguity degree.

Language generated:

L = {strings of balanced parentheses}

General pattern:

• Strictly speaking, every string here has infinitely many parse trees: S ⇒ SS with one S ⇒ ε can pad any tree indefinitely
• Counting only padding-free trees (equivalently, trees over the variant S → SS | (S) | ()), a string of n concatenated pairs already has on the order of Cₙ parse trees, where Cₙ is the n-th Catalan number, Cₙ ≈ 4ⁿ/(√π · n^(3/2))
• This is exponential growth: Cₙ = Θ(4ⁿ/n^(3/2))

Classification:

This grammar is exponentially ambiguous: the number of parse trees grows (at least) exponentially with the length of the input string.

Key insight: The same language can have grammars with different ambiguity degrees. The language itself is unambiguous (has an unambiguous grammar), but some grammars for it are exponentially ambiguous.

Parsing Algorithms

Definition: CYK Algorithm

The Cocke-Younger-Kasami (CYK) algorithm is a dynamic programming algorithm that determines whether a string can be derived from a context-free grammar in Chomsky Normal Form.

For a string w of length n and a grammar G in CNF, the algorithm constructs a table T where T[l, i] contains the set of non-terminals that can derive the substring of w of length l starting at position i.

CYK Algorithm Steps

For a string w = a₁a₂...aₙ and a grammar G in CNF:

  1. Initialize an n × n table T
  2. For i from 1 to n (fill the diagonal):
    • T[1, i] = {A | A → aᵢ is a production in G}
  3. For l from 2 to n (length of substring):
    • For i from 1 to n − l + 1 (starting position):
    • j = i + l − 1 (ending position)
    • T[l, i] = {A | A → BC is a production, B ∈ T[k, i], C ∈ T[l − k, i + k] for some 1 ≤ k < l}
  4. The string is in the language if the start symbol S ∈ T[n, 1]
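The steps above translate directly into code. A Python sketch (0-based positions, unlike the 1-based description above; the grammar encoding is ours):

```python
def cyk(w, prods, start="S"):
    """CYK membership test.

    prods: dict variable -> list of bodies, each ("a",) or ("B", "C"),
    i.e., a grammar in Chomsky Normal Form.
    """
    n = len(w)
    if n == 0:
        return () in prods.get(start, [])        # only S -> eps can apply
    # T[l][i] = variables deriving the substring of length l starting at i
    T = [[set() for _ in range(n)] for _ in range(n + 1)]
    for i in range(n):                            # substrings of length 1
        for A, bodies in prods.items():
            if (w[i],) in bodies:
                T[1][i].add(A)
    for l in range(2, n + 1):                     # substring length
        for i in range(n - l + 1):                # starting position
            for k in range(1, l):                 # split point
                for A, bodies in prods.items():
                    for body in bodies:
                        if len(body) == 2:
                            B, C = body
                            if B in T[k][i] and C in T[l - k][i + k]:
                                T[l][i].add(A)
    return start in T[n][0]

# CNF grammar for {a^n b^n | n >= 1}: S -> AB | AC, C -> SB, A -> a, B -> b
g = {"S": [("A", "B"), ("A", "C")], "C": [("S", "B")],
     "A": [("a",)], "B": [("b",)]}
assert cyk("ab", g) and cyk("aabb", g)
assert not cyk("aab", g) and not cyk("ba", g)
```

The three nested loops over length, start, and split point give the O(n³) running time (times a factor for the grammar size).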

Definition: Earley Parser

The Earley parser is a chart parsing algorithm that can handle any context-free grammar, including ambiguous ones, without requiring conversion to a normal form.

It runs in O(n³) time for general context-free grammars and O(n²) time for unambiguous grammars.

Decidability Results

Decidable Problems for Context-Free Grammars

The following problems are decidable for context-free grammars and languages:

  • Emptiness: Is L(G) = ∅?
  • Membership: Is w ∈ L(G)?
  • Finiteness: Is L(G) finite?
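Emptiness, for instance, is decided by a fixed-point computation of the "generating" variables, those that derive at least one terminal string: L(G) = ∅ exactly when the start symbol is not generating. A Python sketch (grammar encoding is ours: variable → list of right-hand-side tuples; symbols absent from the dict are terminals):

```python
def is_empty(prods, start="S"):
    """Decide whether the grammar generates no terminal string at all."""
    generating = set()
    changed = True
    while changed:
        changed = False
        for A, bodies in prods.items():
            if A in generating:
                continue
            for body in bodies:
                # A generates a string if some body consists only of
                # terminals and already-generating variables.
                if all(x in generating or x not in prods for x in body):
                    generating.add(A)
                    changed = True
                    break
    return start not in generating

# B's only production loops on itself, so B generates nothing and L(G) = {}
g = {"S": [("A", "B")], "A": [("a",)], "B": [("B", "b")]}
assert is_empty(g)
g["B"].append(("b",))      # now B -> Bb | b, so L(G) = {a b^n | n >= 1}
assert not is_empty(g)
```

Membership is decided by CYK, and finiteness by checking for a cycle among the useful variables; all three reduce to finite computations on the grammar itself.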

Undecidable Problems for Context-Free Grammars

The following problems are undecidable for context-free grammars and languages:

  • Equivalence: Is L(G₁) = L(G₂)?
  • Containment: Is L(G₁) ⊆ L(G₂)?
  • Universality: Is L(G) = Σ*?
  • Ambiguity: Is G ambiguous?
  • Inherent Ambiguity: Is L(G) inherently ambiguous?

Proof: Sketch of CFG Equivalence Undecidability

We prove that the problem of determining whether two context-free grammars generate the same language (the equivalence problem) is undecidable.

Suppose, for contradiction, that there exists an algorithm A that decides whether L(G₁) = L(G₂) for any two context-free grammars G₁ and G₂.

Consider the universality problem: determining whether L(G) = Σ* for a context-free grammar G. We can reduce this problem to the equivalence problem by constructing a grammar G₀ that generates Σ* (for instance, with the productions S → aS for each a ∈ Σ, together with S → ε).

Then, L(G) = Σ* if and only if L(G) = L(G₀). Using our assumed algorithm A, we could decide whether L(G) = L(G₀), and thus whether L(G) = Σ*.

However, the universality problem for context-free grammars is known to be undecidable (this can be proven by reduction from the Post Correspondence Problem). Therefore, our assumption that algorithm A exists must be false, and the equivalence problem for context-free grammars is undecidable.

Proof: Sketch of Ambiguity Undecidability

We prove that the problem of determining whether a context-free grammar is ambiguous is undecidable.

Suppose, for contradiction, that there exists an algorithm B that decides whether a given context-free grammar G is ambiguous.

We can use this to construct an algorithm for the Post Correspondence Problem (PCP) as follows. Given a PCP instance with pairs (u₁, v₁), ..., (uₙ, vₙ) over Σ, introduce n fresh index symbols 1, 2, ..., n and construct the CFG G:

S → S₁ | S₂
S₁ → uᵢS₁i | uᵢi (for each i = 1, ..., n)
S₂ → vᵢS₂i | vᵢi (for each i = 1, ..., n)

Here S₁ generates exactly the strings u_{i₁}u_{i₂}...u_{iₘ} iₘ...i₂i₁, and S₂ the analogous strings built from the vᵢ. Each of S₁ and S₂ is unambiguous on its own, so a string has two parse trees precisely when it is derivable from both, i.e., when some index sequence makes the u-concatenation equal the v-concatenation. Thus G is ambiguous if and only if the PCP instance has a solution.

Since the PCP is undecidable, our assumption that algorithm B exists must be false, and the ambiguity problem for context-free grammars is undecidable.

Summary

  • Context-Free Grammar (CFG): A 4-tuple G = (V, Σ, R, S) defining rules for generating strings
  • Context-Free Language (CFL): A language generated by some CFG
  • Parse Tree: A tree representation of a derivation, showing the hierarchical structure
  • Ambiguity: When a string has multiple parse trees in a grammar
  • Inherent Ambiguity: When a language cannot be represented by any unambiguous grammar
  • Chomsky Normal Form (CNF): A standardized form where rules are either A → BC or A → a
  • Greibach Normal Form (GNF): A form where rules start with a terminal: A → aα
  • Pumping Lemma: A property of all CFLs used to prove languages are not context-free
  • CYK Algorithm: A parsing algorithm for CFGs in CNF running in O(n³ · |G|) time, where |G| is the size of the grammar
  • Closure Properties:
    CFLs are closed under: union, concatenation, Kleene star, substitution, homomorphism
    CFLs are not closed under: intersection, complement, difference
  • Decidability:
    Decidable: emptiness, membership, finiteness
    Undecidable: equivalence, containment, universality, ambiguity

Suggested Reading

  • Sipser, Introduction to the Theory of Computation – Chapter 2.1: Context-Free Grammars
  • Hopcroft, Motwani, and Ullman, Introduction to Automata Theory, Languages, and Computation – Chapter 5: Context-Free Grammars and Languages
  • Kozen, Automata and Computability – Lecture 19: Context-Free Grammars and Languages
  • Kozen, Automata and Computability – Lecture 20: Balanced Parentheses
  • Kozen, Automata and Computability – Lecture 21: Normal Forms
  • Kozen, Automata and Computability – Lecture 22: The Pumping Lemma for CFLs