Let G
be a context-free grammar for L
in Chomsky Normal Form. Let k
be the number of variables in G
.
Let p = 2k
and let s
be any string in L
such that |s| ≥ p
.
Consider a parse tree for s
. Since s
has at least 2k
symbols and every internal node in the parse tree has at most 2 children (since G
is in CNF), the height of the parse tree must be at least k+1
. Therefore, there must be some path from the root to a leaf with at least k+1
internal nodes.
By the pigeonhole principle, this path must contain at least one repeated variable. Let A
be a variable that appears at least twice in this path, and let's denote the two occurrences as A1
and A2
.
The subtree rooted at A1
contains the subtree rooted at A2
. Let v
be the string derived from A2
, and let vwx
be the string derived from A1
. Thus, s
can be written as uvwxy
where:
u
is the part of s
derived from the part of the parse tree before reaching A1
v
is the part of s
derived from the part of the parse tree between A1
and A2
w
is the part of s
derived from A2
x
is the part of s
derived from the part of the parse tree after A2
but still within A1
y
is the part of s
derived from the part of the parse tree after A1
Since A1
and A2
are the same variable, we can replace the subtree rooted at A2
with the entire subtree rooted at A1
, which would insert an extra copy of vwx
. Similarly, we could remove the entire subtree rooted at A2
and replace it with just w
.
This gives us that uviwxiy ∈ L
for any i ≥ 0
. Also, since the height of the parse tree is bounded by 2|s|
, we know that |vwx| ≤ p
. Finally, v
and x
can't both be empty since that would mean A1
and A2
derive the same string, which contradicts the assumption that they are distinct occurrences.