Regular expressions provide a powerful, algebraic way to specify patterns in strings. They form a precise mathematical language that can describe exactly the same set of languages as finite automata, and are widely used in practical applications from text processing to compiler design.
Definition: Regular Expression
Let Σ be an alphabet. The set of regular expressions over Σ is defined recursively as follows:
∅ is a regular expression (the empty set)
ε is a regular expression (the empty string)
For each a ∈ Σ, a is a regular expression
If r and s are regular expressions, then (r|s) is a regular expression (union)
If r and s are regular expressions, then (rs) is a regular expression (concatenation)
If r is a regular expression, then (r*) is a regular expression (Kleene star)
Nothing else is a regular expression
Each regular expression r denotes a language L(r) defined recursively:
L(∅) = ∅
L(ε) = {ε}
L(a) = {a} for a ∈ Σ
L(r|s) = L(r) ∪ L(s)
L(rs) = L(r)·L(s) = {xy | x ∈ L(r) and y ∈ L(s)}
L(r*) = L(r)* = ∪i≥0L(r)i
Regular Expression Operations
Regular expressions are built using three fundamental operations:
1. Union |
The union of two regular expressions r and s, written as (r|s), matches any string that matches either r or s.
Example:a|b matches either the string a or the string b.
2. Concatenation
The concatenation of two regular expressions r and s, written as rs, matches any string that can be split into two parts, where the first part matches r and the second part matches s.
Example:ab matches the string ab.
3. Kleene Star *
The Kleene star of a regular expression r, written as r*, matches zero or more repetitions of strings that match r.
Example:a* matches ε, a, aa, aaa, and so on.
Precedence and Parentheses
To avoid ambiguity, regular expressions follow a precedence order:
Kleene star * has the highest precedence
Concatenation is next
Union | has the lowest precedence
Parentheses can be used to override this precedence. For example:
ab|c means (ab)|c
a|bc means a|(bc)
a*b means (a*)b
(a|b)* means the Kleene star applies to the expression (a|b)
Theorem: Kleene's Theorem
A language L is regular if and only if there exists a regular expression r such that L = L(r).
In other words, the class of languages that can be described by regular expressions is exactly the class of regular languages.
Proof: Proof of Kleene's Theorem
We need to prove that a language L is regular if and only if there exists a regular expression r such that L = L(r).
Part 1: If L = L(r) for some regular expression r, then L is regular
We prove this direction by induction on the structure of the regular expression r:
Base cases:
Case r = ∅:L(∅) = ∅, which is recognized by an NFA with a non-accepting start state and no transitions.
Case r = ε:L(ε) = ε, which is recognized by an NFA with an accepting start state and no transitions.
Case r = a for a ∈ Σ:L(a) = {a}, which is recognized by an NFA with a start state, an accepting state, and a transition on a from the start state to the accepting state.
Inductive step:
Assume that for regular expressions s and t, there exist NFAs Ms and Mt such that L(Ms) = L(s) and L(Mt) = L(t).
Case r = (s|t): We construct an NFA Mr with a new start state q0 and ε-transitions from q0 to the start states of Ms and Mt. Thus L(Mr) = L(s) ∪ L(t) = L(s|t).
Case r = st: We construct an NFA Mr by connecting every accepting state of Ms to the start state of Mt with ε-transitions, making the accepting states of Ms non-accepting, and designating the start state of Ms as the start state of Mr. Thus L(Mr) = L(s)·L(t) = L(st).
Case r = s*: We construct an NFA Mr by adding a new start state q0 that is also an accepting state, adding an ε-transition from q0 to the start state of Ms, and adding ε-transitions from each accepting state of Ms back to the start state of Ms. Thus L(Mr) = L(s)* = L(s*).
By the principle of induction, for any regular expression r, there exists an NFA M such that L(M) = L(r). Since every NFA can be converted to a DFA by the subset construction, and DFAs recognize exactly the regular languages, L(r) is regular.
Part 2: If L is regular, then L = L(r) for some regular expression r
If L is regular, then there exists a DFA M = (Q, Σ, δ, q0, F) such that L(M) = L.
Let Q = {q0, q1, ..., qn} be the set of states of M.
We define Rijk as the set of strings that take the DFA from state qi to state qj without going through any state qm where m > k.
We construct regular expressions rijk for Rijk recursively:
Base case (k = -1):
rij-1 = a1|a2|...|am if there are transitions from qi to qj on symbols a1, a2, ..., am
rij-1 = ∅ if there is no direct transition from qi to qj
rii-1 = ε|rii-1 to account for staying in state qi with the empty string
Inductive step:
rijk = rijk-1 | (rikk-1(rkkk-1)*rkjk-1)
The final regular expression r for the language L(M) is the union of r0jn for all accepting states qj ∈ F:
r = r0j1n | r0j2n | ... | r0jmn, where {qj1, qj2, ..., qjm} = F
By the construction of r, L(r) = L(M) = L, which completes the proof.
Converting Between Regular Expressions and Finite Automata
From Regular Expression to NFA: Thompson's Construction
Thompson's construction algorithm provides a systematic way to convert any regular expression into an equivalent NFA. The algorithm works recursively on the structure of the regular expression:
Base cases: Create simple NFAs for ∅, ε, and individual symbols
Inductive cases: Combine smaller NFAs according to the regex operations (union, concatenation, Kleene star)
The resulting NFA may have ε-transitions but will always have exactly one accept state and no transitions leading out of this state.
Thompson's Construction Rules:
For ε: Create an NFA with two states, a start state and an accept state, connected by an ε-transition.
For a symbol a: Create an NFA with two states, a start state and an accept state, connected by an a-transition.
For union (r|s): Create a new start state with ε-transitions to the start states of the NFAs for r and s. Create a new accept state with ε-transitions from the accept states of the NFAs for r and s.
For concatenation (rs): Connect the accept state of the NFA for r to the start state of the NFA for s with an ε-transition.
For Kleene star (r*): Create a new start state with an ε-transition to the start state of the NFA for r. Create a new accept state with an ε-transition from the accept state of the NFA for r. Add an ε-transition from the new start state to the new accept state, and an ε-transition from the accept state of the NFA for r back to its start state.
From NFA to Regular Expression: State Elimination Method
Converting an NFA to a regular expression is typically done using the state elimination method, which systematically removes states from the NFA while preserving the recognized language using regular expressions as transition labels.
Convert the NFA into a generalized NFA with regular expressions as transition labels
Add a new start state with an ε-transition to the original start state
Add a new accept state with ε-transitions from all original accept states
Eliminate states one by one (except the new start and accept states)
For each eliminated state, update transitions between its neighbors
The final regular expression is the label on the transition from the new start state to the new accept state
State Elimination Rules:
When eliminating a state qk, for every pair of states qi and qj with transitions:
From qi to qk with label R1
From qk to itself with label R2
From qk to qj with label R3
Add a new transition from qi to qj with label R1R2*R3, or update the existing transition if one already exists.
Algebraic Properties of Regular Expressions
Regular expressions follow a rich set of algebraic laws that allow for simplification, manipulation, and reasoning about pattern equivalence. Two regular expressions are equivalent if they denote the same language.
Fundamental Regular Expression Identities:
Property
Identity
Category
Empty set identity (union)
r|∅ = r
Union
Idempotence (union)
r|r = r
Union
Commutativity (union)
r|s = s|r
Union
Associativity (union)
(r|s)|t = r|(s|t)
Union
Empty string identity
rε = εr = r
Concatenation
Empty set annihilation
r∅ = ∅r = ∅
Concatenation
Associativity (concatenation)
(rs)t = r(st)
Concatenation
Left distributivity
r(s|t) = rs|rt
Mixed
Right distributivity
(s|t)r = sr|tr
Mixed
Star of empty set
∅* = ε
Kleene Star
Star of empty string
ε* = ε
Kleene Star
Star idempotence
(r*)* = r*
Kleene Star
Star expansion
r* = ε|rr*
Kleene Star
Alternative star expansion
r* = ε|r*r
Kleene Star
Proof: Proofs of Selected Identities
Proof of r|∅ = r
To prove that r|∅ = r, we need to show that L(r|∅) = L(r).
By definition of the union operation:
L(r|∅) = L(r) ∪ L(∅) = L(r) ∪ ∅ = L(r)
Since the union of any set with the empty set is the set itself.
Proof of r∅ = ∅
To prove that r∅ = ∅, we need to show that L(r∅) = L(∅).
By definition of the concatenation operation:
L(r∅) = L(r)·L(∅) = L(r)·∅ = ∅
Since the concatenation of any set with the empty set is the empty set.
Proof of r|r = r
To prove that r|r = r, we need to show that L(r|r) = L(r).
By definition of the union operation:
L(r|r) = L(r) ∪ L(r) = L(r)
Since the union of a set with itself is the set itself.
Proof of rε = r
To prove that rε = r, we need to show that L(rε) = L(r).
By definition of the concatenation operation:
L(rε) = L(r)·L(ε) = L(r)·{ε} = {xε | x ∈ L(r)} = {x | x ∈ L(r)} = L(r)
Since concatenating the empty string to any string leaves it unchanged.
Proof of (r*)* = r*
To prove that (r*)* = r*, we need to show that L((r*)*) = L(r*).
First, L(r*) = {ε} ∪ L(r) ∪ L(r)² ∪ L(r)³ ∪ ... = the set of all strings formed by concatenating zero or more strings from L(r).
Now, L((r*)*) = {ε} ∪ L(r*) ∪ L(r*)² ∪ L(r*)³ ∪ ... = the set of all strings formed by concatenating zero or more strings from L(r*).
Since L(r*) already includes all possible ways to concatenate strings from L(r) (including zero times), concatenating multiple strings from L(r*) doesn't add any new strings that aren't already in L(r*).
Therefore, L((r*)*) = L(r*), which proves (r*)* = r*.
Properties of Regular Expressions
Regular expressions have several important algebraic properties:
Algebraic Laws for Regular Expressions:
For all regular expressions r, s, and t:
Union:
r|s = s|r (commutativity)
(r|s)|t = r|(s|t) (associativity)
r|r = r (idempotence)
r|∅ = r (identity)
Concatenation:
(rs)t = r(st) (associativity)
r(s|t) = rs|rt (distributivity)
(s|t)r = sr|tr (distributivity)
∅r = r∅ = ∅ (annihilation)
εr = rε = r (identity)
Kleene Star:
r** = r* (idempotence)
∅* = ε (special case)
ε* = ε (special case)
Extended Regular Expression Formalisms
While standard regular expressions use union, concatenation, and Kleene star, various extensions have been proposed that add expressive power or computational convenience. Understanding these extensions and their theoretical properties is crucial for both practical applications and theoretical analysis.
In addition to the standard operations, generalized regular expressions may include:
Intersection &:L(r & s) = L(r) ∩ L(s)
Complement ¬:L(¬r) = Σ* \ L(r)
Squaring:L(r²) = L(r)·L(r)
These operations preserve regularity but may lead to exponential size increases when converted to standard regular expressions.
Theorem: Succinctness of Extended Operations
There exist families of regular languages such that:
Languages expressible with intersection and complement using O(n) symbols require 2Ω(n) symbols in standard regular expressions
The squaring operation can provide exponential succinctness over standard operations
Converting extended regular expressions to standard form may require non-elementary increases in size
Proof: Example: Intersection Succinctness
Consider the language Ln of strings over {a,b} where:
The n-th symbol from the beginning is a
The n-th symbol from the end is b
With intersection:Ln = L1 & L2 where:
L1 = (a|b)n-1a(a|b)*
L2 = (a|b)*b(a|b)n-1
Total size: O(n)
Without intersection: Must enumerate all strings of length ≥ 2n-1with the required properties, leading to size 2Θ(n).
Counting Constraints
Definition: Bounded Repetition
Bounded repetition extends the Kleene star with numerical constraints:
r{n}: exactly n repetitions of r
r{n,m}: between n and m repetitions of r
r{n,}: at least n repetitions of r
These can be expressed using standard operations: r{n,m} = rn | rn+1 | ... | rm
However, the expansion may be exponential in the encoding of the bounds.
Definition: Numerical Predicates
More general counting constraints include:
Modular constraints:|w|a ≡ k (mod m)
Comparison constraints:|w|a ≥ |w|b
Arithmetic constraints:|w|a + 2|w|b = n
Where |w|a denotes the number of occurrences of symbol a in string w.
Many such constraints go beyond regular languages and require more powerful models.
Theorem: Expressiveness of Counting Constraints
The following hierarchy holds for expressiveness:
Regular expressions can express modular counting constraints
Regular expressions with intersection can express more complex modular relationships
Context-free languages are needed for balanced counting (e.g., equal a and b)
Context-sensitive languages are needed for general arithmetic constraints
The satisfiability problem becomes increasingly difficult as we move up this hierarchy.
Example: Extended Operations in Action
Intersection Example
Consider strings over {a,b} that are both:
Even length: ((a|b)(a|b))*
Start with a: a(a|b)*
With intersection:((a|b)(a|b))* & a(a|b)* Without intersection:a(a|b)a(a|b)a(a|b)a(a|b)...|aa(a|b)aa(a|b)aa(a|b)...|ab(a|b)ab(a|b)ab(a|b)...|... The intersection version is 2 simple patterns; the expanded version requires enumerating exponentially many cases!
Much longer when written with standard operations!
Parsing and Ambiguity Theory
Regular expressions, as syntactic objects, must themselves be parsed and interpreted. Understanding the formal grammar of regular expressions and the ambiguities that arise in matching provides crucial insight into both theoretical properties and practical implementation.
Regular Expression Grammars
Definition: Context-Free Grammar for Regular Expressions
A formal grammar for regular expressions over alphabet Σ:
E → E | T (union)
T → T F (concatenation)
F → F * (Kleene star)
F → ( E ) (grouping)
F → a for each a ∈ Σ
F → ε | ∅ (base cases)
This grammar captures the standard precedence: Kleene star binds tightest, followed by concatenation, then union.
Definition: Parse Tree Structure
The parse tree of a regular expression r directly corresponds to its recursive structure:
Leaf nodes: symbols from Σ ∪ {ε, ∅}
Internal nodes: operators |, concatenation, *
Tree evaluation: bottom-up computation of L(r)
The parse tree provides a canonical representation that makes structural induction proofs straightforward and enables systematic algorithm design.
Theorem: Unique Parse Trees Under Precedence
Given the standard precedence rules (Kleene star > concatenation > union), every syntactically valid regular expression has a unique parse tree.
This eliminates structural ambiguity in regular expression parsing, though semantic ambiguity in string matching may still occur.
Ambiguous Regular Expressions
Definition: Matching Ambiguity
A regular expression r is ambiguous for string wif there exist multiple distinct ways to parse w according to r.
Formally, if w ∈ L(r) and there are multiple parse trees (derivation trees) showing how w matches r, then r is ambiguous for w.
Example: (a|a)* is ambiguous for a - it can match as the first a or the second a.
Definition: Disambiguation Strategies
Common disambiguation strategies for ambiguous regular expressions:
Greedy matching:r* matches as many repetitions as possible
Reluctant (lazy) matching:r*? matches as few repetitions as possible
Possessive matching:r*+ matches greedily without backtracking
Left-most longest: Among all matches, choose the one starting earliest, then the longest among those
These strategies provide deterministic semantics for regex engines but may not correspond to the mathematical definition of regular languages.
Theorem: Ambiguity and Computational Complexity
The matching problem for ambiguous regular expressions has different complexity characteristics:
Membership testing:O(n) time (same as unambiguous case)
Counting matches: Can be exponential in the input size
Enumerating all matches: May require exponential time and space
Practical regex engines often use backtracking, leading to worst-case exponential behavior on pathological inputs.
Example: Parsing and Disambiguating a|b*c
Step 1: Parse Tree Construction
Expression: a|b*c with precedence (* > concatenation > |)
|
/ \
a concat
/ \
* c
/
b
Interpretation: (a|(b*c)), not ((a|b)*c)
Step 2: Ambiguous Matching Example
Consider pattern (a*)(a*) matching string aaa
Match 1: first a* takes ε, second takes aaa
Match 2: first a* takes a, second takes aa
Match 3: first a* takes aa, second takes a
Match 4: first a* takes aaa, second takes ε
Step 3: Disambiguation Strategies
Pattern (.*)(world) matching hello world:
Greedy:.* takes hello wor, leaving ld for world → fails
Reluctant:.*? takes hello , leaving world → succeeds
Left-most longest: Find leftmost match, then longest among those
Different strategies can lead to completely different matching behavior!
Star Height Problem
The star height of a regular expression measures the maximum nesting depth of Kleene stars. This seemingly simple measure leads to one of the most complex decidability results in regular language theory, with deep connections to algebra and automata theory.
Definition: Star Height
The star heighth(r) of a regular expression r is defined recursively:
h(∅) = h(ε) = h(a) = 0 for a ∈ Σ
h(r|s) = h(rs) = max(h(r), h(s))
h(r*) = h(r) + 1
The star height of a regular language L is:
h(L) = min{h(r) : L(r) = L}
The star height problem asks: given a regular expression r, what is h(L(r))?
Theorem: Decidability of Star Height
Theorem (Hashiguchi, 1988): The star height problem is decidable. That is, there exists an algorithm that, given a regular expression r, computes h(L(r)).
The algorithm has non-elementary complexity - the running time is not bounded by any tower of exponentials of fixed height.
Definition: Examples of Star Height
Classical examples demonstrating different star heights:
Star height 0: All finite languages, (a|b)*
Star height 1:L1 = (a*b*)*
Star height 2: The canonical example is the language of strings over {a,b}such that between any two consecutive a's, the number of b's is even
Proving that a language requires a specific star height involves showing both an upper bound (by construction) and a lower bound (by algebraic techniques).
Proof: Sketch: Lower Bound Techniques
Lower bounds for star height typically use algebraic techniques:
Syntactic monoids: Associate with each regular language La finite monoid M(L) that captures its structural properties
Aperiodicity: A language has star height 0 iff its syntactic monoid is aperiodic
Nested aperiodicity: Higher star heights correspond to nested levels of non-aperiodicity in the monoid structure
Variety theory: Use Eilenberg's correspondence between varieties of languages and varieties of monoids to characterize star height levels
These techniques reduce the star height problem to questions about finite algebra, though the resulting decision procedures remain extremely complex.
Expression Size and Minimization
Understanding the complexity of regular expressions requires precise measures of their size and analysis of the computational difficulty of finding equivalent expressions of minimal size. These questions connect regular expression theory to fundamental problems in computational complexity.
Definition: Size Measures for Regular Expressions
Several measures quantify the complexity of regular expressions:
Alphabetic width:alph(r) = number of symbol occurrences in r
Reverse Polish length:rp(r) = length of postfix representation
Syntax tree size:tree(r) = number of nodes in parse tree
Description length:desc(r) = total string length including operators
These measures are polynomially related: for any regular expression r,alph(r) ≤ rp(r) ≤ tree(r) ≤ desc(r) ≤ c · alph(r) for some constant c.
Definition: Regular Expression Minimization Problem
Input: Regular expression r and size measure μ
Output: Regular expression s such that L(r) = L(s)and μ(s) is minimized
Decision version: Given r and integer k, does there exist s with L(r) = L(s) and μ(s) ≤ k?
This problem asks for the shortest regular expression equivalent to a given one.
Theorem: Intractability of Regular Expression Minimization
Theorem: Regular expression minimization is PSPACE-complete for all standard size measures.
This holds for alphabetic width, reverse Polish length, syntax tree size, and description length. The problem remains PSPACE-complete even when restricted to specific operator sets.
Corollary: Unless P = PSPACE, there is no polynomial-time algorithm for finding minimal regular expressions.
Proof: PSPACE-hardness of Minimization
Proof sketch: Reduction from QBF (Quantified Boolean Formula) satisfiability.
Given QBF φ = ∃x1∀x2...Q xn ψ(x1,...,xn)
Construct regular expressions rtrue and rfalse such that:
If φ is true, then L(rtrue) = L(rfalse)
If φ is false, then L(rtrue) ≠ L(rfalse)
The construction ensures that when φ is true,rtrue | rfalse can be simplified to a much shorter expression
When φ is false, no such simplification is possible
Testing if the minimal size is below a threshold thus solves QBF
The construction uses the expressive power of regular expressions to encode the logical structure of the QBF formula.
Definition: Heuristic Optimization Techniques
Since exact minimization is intractable, practical approaches use heuristics:
Local transformations: Apply algebraic identities like r|r = r,r∅ = ∅, (r*)* = r*
Common subexpression elimination: Factor out repeated subpatterns
Automaton-based optimization: Convert to minimal DFA, then back to regex
Genetic algorithms: Evolve populations of equivalent expressions
These methods provide good approximations but cannot guarantee optimality due to the underlying computational complexity.
Consider expressions describing strings with an even number ofa:
• ((b*ab*a)*b*)* (length 15)
• (b|(ab*ab*))* (length 13)
• b*((ab*a)*b*)* (length 16)
Which is shortest? Finding the absolute minimum requires checking potentially exponentially many equivalent expressions - hence the PSPACE-completeness result.
Automaton-Based Optimization Example
Original:(a|ε)(b|ε)
Convert to NFA: 4 states, ε-transitions
Convert to minimal DFA: 4 states, no ε-transitions
Convert back to regex:ε|a|b|ab
Sometimes this gives a shorter expression, sometimes longer!
Limitations of Regular Expressions
While regular expressions are powerful, they cannot describe all possible languages. Some common examples of non-regular languages include:
The language of balanced parentheses: {(ⁿ)ⁿ | n ≥ 0}
The language of strings with equal numbers of a's and b's: {w | #₍ₐ₎(w) = #₍ᵦ₎(w)}
The language of prime-length strings
These limitations are inherent to regular languages and can be proven using the pumping lemma for regular languages.
Applications of Regular Expressions
Regular expressions are widely used in various computing contexts:
Text Processing: Pattern matching, search, and replace in text editors
Lexical Analysis: Tokenizing source code in compilers
Form Validation: Checking if user input matches expected patterns
Data Extraction: Parsing structured data from text
URL Routing: Matching URL patterns in web frameworks
File Globbing: Matching file names in command-line interfaces
While practical implementations of regular expressions in programming languages often include extensions beyond the formal definition (like backreferences), the core concepts remain based on the theoretical foundations.
Summary
Regular expressions are a formal language for describing patterns in strings