A framework for understanding

Regular Expressions

Regular expressions provide a powerful, algebraic way to specify patterns in strings. They form a precise mathematical language that can describe exactly the same set of languages as finite automata, and are widely used in practical applications from text processing to compiler design.

Definition: Regular Expression

Let Σ be an alphabet. The set of regular expressions over Σ is defined recursively as follows:

  • is a regular expression (the empty set)
  • ε is a regular expression (the empty string)
  • For each a ∈ Σ, a is a regular expression
  • If r and s are regular expressions, then (r|s) is a regular expression (union)
  • If r and s are regular expressions, then (rs) is a regular expression (concatenation)
  • If r is a regular expression, then (r*) is a regular expression (Kleene star)
  • Nothing else is a regular expression

Each regular expression r denotes a language L(r) defined recursively:

  • L(∅) = ∅
  • L(ε) = {ε}
  • L(a) = {a} for a ∈ Σ
  • L(r|s) = L(r) ∪ L(s)
  • L(rs) = L(r)·L(s) = {xy | x ∈ L(r) and y ∈ L(s)}
  • L(r*) = L(r)* = ∪i≥0L(r)i

Regular Expression Operations

Regular expressions are built using three fundamental operations:

1. Union |

The union of two regular expressions r and s, written as (r|s), matches any string that matches either r or s.

Example: a|b matches either the string a or the string b.

2. Concatenation

The concatenation of two regular expressions r and s, written as rs, matches any string that can be split into two parts, where the first part matches r and the second part matches s.

Example: ab matches the string ab.

3. Kleene Star *

The Kleene star of a regular expression r, written as r*, matches zero or more repetitions of strings that match r.

Example: a* matches ε, a, aa, aaa, and so on.

Precedence and Parentheses

To avoid ambiguity, regular expressions follow a precedence order:

  1. Kleene star * has the highest precedence
  2. Concatenation is next
  3. Union | has the lowest precedence

Parentheses can be used to override this precedence. For example:

  • ab|c means (ab)|c
  • a|bc means a|(bc)
  • a*b means (a*)b
  • (a|b)* means the Kleene star applies to the expression (a|b)

Theorem: Kleene's Theorem

A language L is regular if and only if there exists a regular expression r such that L = L(r).

In other words, the class of languages that can be described by regular expressions is exactly the class of regular languages.

Proof: Proof of Kleene's Theorem

We need to prove that a language L is regular if and only if there exists a regular expression r such that L = L(r).

Part 1: If L = L(r) for some regular expression r, then L is regular

We prove this direction by induction on the structure of the regular expression r:

Base cases:

  • Case r = ∅: L(∅) = , which is recognized by an NFA with a non-accepting start state and no transitions.
  • Case r = ε: L(ε) = ε, which is recognized by an NFA with an accepting start state and no transitions.
  • Case r = a for a ∈ Σ: L(a) = {a}, which is recognized by an NFA with a start state, an accepting state, and a transition on a from the start state to the accepting state.

Inductive step:

Assume that for regular expressions s and t, there exist NFAs Ms and Mt such that L(Ms) = L(s) and L(Mt) = L(t).

  • Case r = (s|t): We construct an NFA Mr with a new start state q0 and ε-transitions from q0 to the start states of Ms and Mt. Thus L(Mr) = L(s) ∪ L(t) = L(s|t).
  • Case r = st: We construct an NFA Mr by connecting every accepting state of Ms to the start state of Mt with ε-transitions, making the accepting states of Ms non-accepting, and designating the start state of Ms as the start state of Mr. Thus L(Mr) = L(s)·L(t) = L(st).
  • Case r = s*: We construct an NFA Mr by adding a new start state q0 that is also an accepting state, adding an ε-transition from q0 to the start state of Ms, and adding ε-transitions from each accepting state of Ms back to the start state of Ms. Thus L(Mr) = L(s)* = L(s*).

By the principle of induction, for any regular expression r, there exists an NFA M such that L(M) = L(r). Since every NFA can be converted to a DFA by the subset construction, and DFAs recognize exactly the regular languages, L(r) is regular.

Part 2: If L is regular, then L = L(r) for some regular expression r

If L is regular, then there exists a DFA M = (Q, Σ, δ, q0, F) such that L(M) = L.

Let Q = {q0, q1, ..., qn} be the set of states of M.

We define Rijk as the set of strings that take the DFA from state qi to state qj without going through any state qm where m > k.

We construct regular expressions rijk for Rijk recursively:

  • Base case (k = -1):
    • rij-1 = a1|a2|...|am if there are transitions from qi to qj on symbols a1, a2, ..., am
    • rij-1 = ∅ if there is no direct transition from qi to qj
    • rii-1 = ε|rii-1 to account for staying in state qi with the empty string
  • Inductive step:
    • rijk = rijk-1 | (rikk-1(rkkk-1)*rkjk-1)

The final regular expression r for the language L(M) is the union of r0jn for all accepting states qj ∈ F:

r = r0j1n | r0j2n | ... | r0jmn, where {qj1, qj2, ..., qjm} = F

By the construction of r, L(r) = L(M) = L, which completes the proof.

Converting Between Regular Expressions and Finite Automata

From Regular Expression to NFA: Thompson's Construction

Thompson's construction algorithm provides a systematic way to convert any regular expression into an equivalent NFA. The algorithm works recursively on the structure of the regular expression:

  1. Base cases: Create simple NFAs for , ε, and individual symbols
  2. Inductive cases: Combine smaller NFAs according to the regex operations (union, concatenation, Kleene star)

The resulting NFA may have ε-transitions but will always have exactly one accept state and no transitions leading out of this state.

Thompson's Construction Rules:

  1. For ε: Create an NFA with two states, a start state and an accept state, connected by an ε-transition.
  2. For a symbol a: Create an NFA with two states, a start state and an accept state, connected by an a-transition.
  3. For union (r|s): Create a new start state with ε-transitions to the start states of the NFAs for r and s. Create a new accept state with ε-transitions from the accept states of the NFAs for r and s.
  4. For concatenation (rs): Connect the accept state of the NFA for r to the start state of the NFA for s with an ε-transition.
  5. For Kleene star (r*): Create a new start state with an ε-transition to the start state of the NFA for r. Create a new accept state with an ε-transition from the accept state of the NFA for r. Add an ε-transition from the new start state to the new accept state, and an ε-transition from the accept state of the NFA for r back to its start state.

From NFA to Regular Expression: State Elimination Method

Converting an NFA to a regular expression is typically done using the state elimination method, which systematically removes states from the NFA while preserving the recognized language using regular expressions as transition labels.

  1. Convert the NFA into a generalized NFA with regular expressions as transition labels
  2. Add a new start state with an ε-transition to the original start state
  3. Add a new accept state with ε-transitions from all original accept states
  4. Eliminate states one by one (except the new start and accept states)
  5. For each eliminated state, update transitions between its neighbors
  6. The final regular expression is the label on the transition from the new start state to the new accept state

State Elimination Rules:

When eliminating a state qk, for every pair of states qi and qj with transitions:

  • From qi to qk with label R1
  • From qk to itself with label R2
  • From qk to qj with label R3

Add a new transition from qi to qj with label R1R2*R3, or update the existing transition if one already exists.

Algebraic Properties of Regular Expressions

Regular expressions follow a rich set of algebraic laws that allow for simplification, manipulation, and reasoning about pattern equivalence. Two regular expressions are equivalent if they denote the same language.

Fundamental Regular Expression Identities:

PropertyIdentityCategory
Empty set identity (union)r|∅ = rUnion
Idempotence (union)r|r = rUnion
Commutativity (union)r|s = s|rUnion
Associativity (union)(r|s)|t = r|(s|t)Union
Empty string identityrε = εr = rConcatenation
Empty set annihilationr∅ = ∅r = ∅Concatenation
Associativity (concatenation)(rs)t = r(st)Concatenation
Left distributivityr(s|t) = rs|rtMixed
Right distributivity(s|t)r = sr|trMixed
Star of empty set∅* = εKleene Star
Star of empty stringε* = εKleene Star
Star idempotence(r*)* = r*Kleene Star
Star expansionr* = ε|rr*Kleene Star
Alternative star expansionr* = ε|r*rKleene Star

Proof: Proofs of Selected Identities

Proof of r|∅ = r

To prove that r|∅ = r, we need to show that L(r|∅) = L(r).

By definition of the union operation:

L(r|∅) = L(r) ∪ L(∅) = L(r) ∪ ∅ = L(r)

Since the union of any set with the empty set is the set itself.

Proof of r∅ = ∅

To prove that r∅ = ∅, we need to show that L(r∅) = L(∅).

By definition of the concatenation operation:

L(r∅) = L(r)·L(∅) = L(r)·∅ = ∅

Since the concatenation of any set with the empty set is the empty set.

Proof of r|r = r

To prove that r|r = r, we need to show that L(r|r) = L(r).

By definition of the union operation:

L(r|r) = L(r) ∪ L(r) = L(r)

Since the union of a set with itself is the set itself.

Proof of rε = r

To prove that rε = r, we need to show that L(rε) = L(r).

By definition of the concatenation operation:

L(rε) = L(r)·L(ε) = L(r)·{ε} = {xε | x ∈ L(r)} = {x | x ∈ L(r)} = L(r)

Since concatenating the empty string to any string leaves it unchanged.

Proof of (r*)* = r*

To prove that (r*)* = r*, we need to show that L((r*)*) = L(r*).

First, L(r*) = {ε} ∪ L(r) ∪ L(r)² ∪ L(r)³ ∪ ... = the set of all strings formed by concatenating zero or more strings from L(r).

Now, L((r*)*) = {ε} ∪ L(r*) ∪ L(r*)² ∪ L(r*)³ ∪ ... = the set of all strings formed by concatenating zero or more strings from L(r*).

Since L(r*) already includes all possible ways to concatenate strings from L(r) (including zero times), concatenating multiple strings from L(r*) doesn't add any new strings that aren't already in L(r*).

Therefore, L((r*)*) = L(r*), which proves (r*)* = r*.

Properties of Regular Expressions

Regular expressions have several important algebraic properties:

Algebraic Laws for Regular Expressions:

For all regular expressions r, s, and t:

  • Union:
    • r|s = s|r (commutativity)
    • (r|s)|t = r|(s|t) (associativity)
    • r|r = r (idempotence)
    • r|∅ = r (identity)

  • Concatenation:
    • (rs)t = r(st) (associativity)
    • r(s|t) = rs|rt (distributivity)
    • (s|t)r = sr|tr (distributivity)
    • ∅r = r∅ = ∅ (annihilation)
    • εr = rε = r (identity)

  • Kleene Star:
    • r** = r* (idempotence)
    • ∅* = ε (special case)
    • ε* = ε (special case)

Extended Regular Expression Formalisms

While standard regular expressions use union, concatenation, and Kleene star, various extensions have been proposed that add expressive power or computational convenience. Understanding these extensions and their theoretical properties is crucial for both practical applications and theoretical analysis.

Generalized Regular Expressions

Definition: Extended Regular Expression Operations

In addition to the standard operations, generalized regular expressions may include:

  • Intersection &: L(r & s) = L(r) ∩ L(s)
  • Complement ¬: L(¬r) = Σ* \ L(r)
  • Squaring: L(r²) = L(r)·L(r)

These operations preserve regularity but may lead to exponential size increases when converted to standard regular expressions.

Theorem: Succinctness of Extended Operations

There exist families of regular languages such that:

  1. Languages expressible with intersection and complement using O(n) symbols require 2Ω(n) symbols in standard regular expressions
  2. The squaring operation can provide exponential succinctness over standard operations
  3. Converting extended regular expressions to standard form may require non-elementary increases in size

Proof: Example: Intersection Succinctness

Consider the language Ln of strings over {a,b} where:

  • The n-th symbol from the beginning is a
  • The n-th symbol from the end is b

With intersection: Ln = L1 & L2 where:

  • L1 = (a|b)n-1a(a|b)*
  • L2 = (a|b)*b(a|b)n-1

Total size: O(n)

Without intersection: Must enumerate all strings of length ≥ 2n-1with the required properties, leading to size 2Θ(n).

Counting Constraints

Definition: Bounded Repetition

Bounded repetition extends the Kleene star with numerical constraints:

  • r{n}: exactly n repetitions of r
  • r{n,m}: between n and m repetitions of r
  • r{n,}: at least n repetitions of r

These can be expressed using standard operations: r{n,m} = rn | rn+1 | ... | rm

However, the expansion may be exponential in the encoding of the bounds.

Definition: Numerical Predicates

More general counting constraints include:

  • Modular constraints: |w|a ≡ k (mod m)
  • Comparison constraints: |w|a ≥ |w|b
  • Arithmetic constraints: |w|a + 2|w|b = n

Where |w|a denotes the number of occurrences of symbol a in string w.

Many such constraints go beyond regular languages and require more powerful models.

Theorem: Expressiveness of Counting Constraints

The following hierarchy holds for expressiveness:

  1. Regular expressions can express modular counting constraints
  2. Regular expressions with intersection can express more complex modular relationships
  3. Context-free languages are needed for balanced counting (e.g., equal a and b)
  4. Context-sensitive languages are needed for general arithmetic constraints

The satisfiability problem becomes increasingly difficult as we move up this hierarchy.

Example: Extended Operations in Action

Intersection Example

Consider strings over {a,b} that are both:

  • Even length: ((a|b)(a|b))*
  • Start with a: a(a|b)*
With intersection: ((a|b)(a|b))* & a(a|b)*
Without intersection: a(a|b)a(a|b)a(a|b)a(a|b)...|aa(a|b)aa(a|b)aa(a|b)...|ab(a|b)ab(a|b)ab(a|b)...|...
The intersection version is 2 simple patterns; the expanded version requires enumerating exponentially many cases!

Counting Constraints Example

Pattern: a{2,4}b{1,3} matches:
aabaaabaaaabaabbaaabbaaaabbaabbbaaabbaaaabbb

Expanded form: aa(b|bb|bbb)|aaa(b|bb|bbb)|aaaa(b|bb|bbb)

Much longer when written with standard operations!

Parsing and Ambiguity Theory

Regular expressions, as syntactic objects, must themselves be parsed and interpreted. Understanding the formal grammar of regular expressions and the ambiguities that arise in matching provides crucial insight into both theoretical properties and practical implementation.

Regular Expression Grammars

Definition: Context-Free Grammar for Regular Expressions

A formal grammar for regular expressions over alphabet Σ:

E → E | T (union)

T → T F (concatenation)

F → F * (Kleene star)

F → ( E ) (grouping)

F → a for each a ∈ Σ

F → ε | ∅ (base cases)

This grammar captures the standard precedence: Kleene star binds tightest, followed by concatenation, then union.

Definition: Parse Tree Structure

The parse tree of a regular expression r directly corresponds to its recursive structure:

  • Leaf nodes: symbols from Σ ∪ {ε, ∅}
  • Internal nodes: operators |, concatenation, *
  • Tree evaluation: bottom-up computation of L(r)

The parse tree provides a canonical representation that makes structural induction proofs straightforward and enables systematic algorithm design.

Theorem: Unique Parse Trees Under Precedence

Given the standard precedence rules (Kleene star > concatenation > union), every syntactically valid regular expression has a unique parse tree.

This eliminates structural ambiguity in regular expression parsing, though semantic ambiguity in string matching may still occur.

Ambiguous Regular Expressions

Definition: Matching Ambiguity

A regular expression r is ambiguous for string wif there exist multiple distinct ways to parse w according to r.

Formally, if w ∈ L(r) and there are multiple parse trees (derivation trees) showing how w matches r, then r is ambiguous for w.

Example: (a|a)* is ambiguous for a - it can match as the first a or the second a.

Definition: Disambiguation Strategies

Common disambiguation strategies for ambiguous regular expressions:

  • Greedy matching: r* matches as many repetitions as possible
  • Reluctant (lazy) matching: r*? matches as few repetitions as possible
  • Possessive matching: r*+ matches greedily without backtracking
  • Left-most longest: Among all matches, choose the one starting earliest, then the longest among those

These strategies provide deterministic semantics for regex engines but may not correspond to the mathematical definition of regular languages.

Theorem: Ambiguity and Computational Complexity

The matching problem for ambiguous regular expressions has different complexity characteristics:

  1. Membership testing: O(n) time (same as unambiguous case)
  2. Counting matches: Can be exponential in the input size
  3. Enumerating all matches: May require exponential time and space

Practical regex engines often use backtracking, leading to worst-case exponential behavior on pathological inputs.

Example: Parsing and Disambiguating a|b*c

Step 1: Parse Tree Construction

Expression: a|b*c with precedence (* > concatenation > |)

   |
/  \
a   concat
/      \
*        c
/
b

Interpretation: (a|(b*c)), not ((a|b)*c)

Step 2: Ambiguous Matching Example

Consider pattern (a*)(a*) matching string aaa

Match 1: first a* takes ε, second takes aaa
Match 2: first a* takes a, second takes aa
Match 3: first a* takes aa, second takes a
Match 4: first a* takes aaa, second takes ε

Step 3: Disambiguation Strategies

Pattern (.*)(world) matching hello world:

Greedy: .* takes hello wor, leaving ld for world → fails
Reluctant: .*? takes hello , leaving world → succeeds
Left-most longest: Find leftmost match, then longest among those

Different strategies can lead to completely different matching behavior!

Star Height Problem

The star height of a regular expression measures the maximum nesting depth of Kleene stars. This seemingly simple measure leads to one of the most complex decidability results in regular language theory, with deep connections to algebra and automata theory.

Definition: Star Height

The star height h(r) of a regular expression r is defined recursively:

  • h(∅) = h(ε) = h(a) = 0 for a ∈ Σ
  • h(r|s) = h(rs) = max(h(r), h(s))
  • h(r*) = h(r) + 1

The star height of a regular language L is:

h(L) = min{h(r) : L(r) = L}

The star height problem asks: given a regular expression r, what is h(L(r))?

Theorem: Decidability of Star Height

Theorem (Hashiguchi, 1988): The star height problem is decidable. That is, there exists an algorithm that, given a regular expression r, computes h(L(r)).

The algorithm has non-elementary complexity - the running time is not bounded by any tower of exponentials of fixed height.

Definition: Examples of Star Height

Classical examples demonstrating different star heights:

  • Star height 0: All finite languages, (a|b)*
  • Star height 1: L1 = (a*b*)*
  • Star height 2: The canonical example is the language of strings over {a,b}such that between any two consecutive a's, the number of b's is even

Proving that a language requires a specific star height involves showing both an upper bound (by construction) and a lower bound (by algebraic techniques).

Proof: Sketch: Lower Bound Techniques

Lower bounds for star height typically use algebraic techniques:

  1. Syntactic monoids: Associate with each regular language La finite monoid M(L) that captures its structural properties
  2. Aperiodicity: A language has star height 0 iff its syntactic monoid is aperiodic
  3. Nested aperiodicity: Higher star heights correspond to nested levels of non-aperiodicity in the monoid structure
  4. Variety theory: Use Eilenberg's correspondence between varieties of languages and varieties of monoids to characterize star height levels

These techniques reduce the star height problem to questions about finite algebra, though the resulting decision procedures remain extremely complex.

Expression Size and Minimization

Understanding the complexity of regular expressions requires precise measures of their size and analysis of the computational difficulty of finding equivalent expressions of minimal size. These questions connect regular expression theory to fundamental problems in computational complexity.

Definition: Size Measures for Regular Expressions

Several measures quantify the complexity of regular expressions:

  • Alphabetic width: alph(r) = number of symbol occurrences in r
  • Reverse Polish length: rp(r) = length of postfix representation
  • Syntax tree size: tree(r) = number of nodes in parse tree
  • Description length: desc(r) = total string length including operators

These measures are polynomially related: for any regular expression r,alph(r) ≤ rp(r) ≤ tree(r) ≤ desc(r) ≤ c · alph(r) for some constant c.

Definition: Regular Expression Minimization Problem

Input: Regular expression r and size measure μ

Output: Regular expression s such that L(r) = L(s)and μ(s) is minimized

Decision version: Given r and integer k, does there exist s with L(r) = L(s) and μ(s) ≤ k?

This problem asks for the shortest regular expression equivalent to a given one.

Theorem: Intractability of Regular Expression Minimization

Theorem: Regular expression minimization is PSPACE-complete for all standard size measures.

This holds for alphabetic width, reverse Polish length, syntax tree size, and description length. The problem remains PSPACE-complete even when restricted to specific operator sets.

Corollary: Unless P = PSPACE, there is no polynomial-time algorithm for finding minimal regular expressions.

Proof: PSPACE-hardness of Minimization

Proof sketch: Reduction from QBF (Quantified Boolean Formula) satisfiability.

  1. Given QBF φ = ∃x1∀x2...Q xn ψ(x1,...,xn)
  2. Construct regular expressions rtrue and rfalse such that:
    • If φ is true, then L(rtrue) = L(rfalse)
    • If φ is false, then L(rtrue) ≠ L(rfalse)
  3. The construction ensures that when φ is true,rtrue | rfalse can be simplified to a much shorter expression
  4. When φ is false, no such simplification is possible
  5. Testing if the minimal size is below a threshold thus solves QBF

The construction uses the expressive power of regular expressions to encode the logical structure of the QBF formula.

Definition: Heuristic Optimization Techniques

Since exact minimization is intractable, practical approaches use heuristics:

  • Local transformations: Apply algebraic identities like r|r = r,r∅ = ∅, (r*)* = r*
  • Common subexpression elimination: Factor out repeated subpatterns
  • Automaton-based optimization: Convert to minimal DFA, then back to regex
  • Genetic algorithms: Evolve populations of equivalent expressions

These methods provide good approximations but cannot guarantee optimality due to the underlying computational complexity.

Example: Size Measures and Optimization Attempts

Comparing Size Measures

Expression: (a|b)*a(a|b)*

Alphabetic width: 4
(count a, b, a, a, b symbols)
Description length: 13
(count all characters)
Tree size: 9
(count all parse tree nodes)
Reverse Polish: 8
(postfix: a b | * a a b | * ·)

Heuristic Optimization Trace

Starting expression: (a|a)*b(b|b)*

Step 1 - Idempotency: (a|a)*a*, (b|b)*b*
Result: a*b b*
Step 2 - Concatenation: Combine b b*b b* = b+
Result: a*b+ (if + operator allowed)
Step 3 - Standard form: b+bb*
Final: a*bb*

Size reduction: 12 symbols → 5 symbols (58% reduction)

Why Minimization is Hard: A Glimpse

Consider expressions describing strings with an even number ofa:

((b*ab*a)*b*)* (length 15)
(b|(ab*ab*))* (length 13)
b*((ab*a)*b*)* (length 16)

Which is shortest? Finding the absolute minimum requires checking potentially exponentially many equivalent expressions - hence the PSPACE-completeness result.

Automaton-Based Optimization Example

Original: (a|ε)(b|ε)
Convert to NFA: 4 states, ε-transitions
Convert to minimal DFA: 4 states, no ε-transitions
Convert back to regex: ε|a|b|ab
Sometimes this gives a shorter expression, sometimes longer!

Limitations of Regular Expressions

While regular expressions are powerful, they cannot describe all possible languages. Some common examples of non-regular languages include:

  • The language of balanced parentheses: {(ⁿ)ⁿ | n ≥ 0}
  • The language of strings with equal numbers of a's and b's: {w | #₍ₐ₎(w) = #₍ᵦ₎(w)}
  • The language of prime-length strings

These limitations are inherent to regular languages and can be proven using the pumping lemma for regular languages.

Applications of Regular Expressions

Regular expressions are widely used in various computing contexts:

  • Text Processing: Pattern matching, search, and replace in text editors
  • Lexical Analysis: Tokenizing source code in compilers
  • Form Validation: Checking if user input matches expected patterns
  • Data Extraction: Parsing structured data from text
  • URL Routing: Matching URL patterns in web frameworks
  • File Globbing: Matching file names in command-line interfaces

While practical implementations of regular expressions in programming languages often include extensions beyond the formal definition (like backreferences), the core concepts remain based on the theoretical foundations.

Summary

  • Regular expressions are a formal language for describing patterns in strings
  • Basic elements: (empty set), ε (empty string), and literal characters
  • Operations: Union |, Concatenation, and Kleene Star *
  • Precedence: Kleene Star > Concatenation > Union
  • Kleene's Theorem: Regular expressions and finite automata recognize exactly the same class of languages
  • Thompson's Construction: Algorithm to convert a regular expression to an NFA
  • State Elimination: Method to convert an NFA to a regular expression
  • Algebraic Properties: Regular expressions follow a rich set of identities and simplification rules
  • Limitations: Cannot describe languages with balanced/nested structures or counting-based patterns
  • Applications: Text processing, lexical analysis, form validation, pattern matching

Suggested Reading

  • Hopcroft, Motwani, and Ullman, Introduction to Automata Theory, Languages, and Computation – Chapter 3: Regular Expressions and Languages
  • Sipser, Introduction to the Theory of Computation – Chapter 1.3: Regular Expressions
  • Kozen, Automata and Computability – Chapter 8: Pattern Matching and Regular Expressions