Mastering Regular Expressions, 3rd Edition

Book description

Regular expressions are an extremely powerful tool for manipulating text and data. They are now standard features in a wide range of languages and popular tools, including Perl, Python, Ruby, Java, VB.NET and C# (and any language using the .NET Framework), PHP, and MySQL.

If you don't use regular expressions yet, you will discover in this book a whole new world of mastery over your data. If you already use them, you'll appreciate this book's unprecedented detail and breadth of coverage. If you think you know all you need to know about regular expressions, this book is a stunning eye-opener.

As this book shows, a command of regular expressions is an invaluable skill. Regular expressions allow you to code complex and subtle text processing that you never imagined could be automated. Regular expressions can save you time and aggravation. They can be used to craft elegant solutions to a wide range of problems. Once you've mastered regular expressions, they'll become an indispensable part of your toolkit. You will wonder how you ever got by without them.

Despite their wide availability, flexibility, and unparalleled power, regular expressions are frequently underutilized. Yet what is power in the hands of an expert can be fraught with peril for the unwary. Mastering Regular Expressions will help you navigate the minefield on the way to becoming an expert and help you optimize your use of regular expressions.

Mastering Regular Expressions, Third Edition, now includes a full chapter devoted to PHP and its powerful and expressive suite of regular expression functions, in addition to enhanced PHP coverage in the central "core" chapters. Furthermore, this edition has been updated throughout to reflect advances in other languages, including expanded in-depth coverage of Sun's java.util.regex package, which has emerged as the standard Java regex implementation.

Topics include:

  • A comparison of features among different versions of many languages and tools

  • How the regular expression engine works

  • Optimization (major savings available here!)

  • Matching just what you want, but not what you don't want

  • Sections and chapters on individual languages

Written in the lucid, entertaining tone that makes a complex, dry topic crystal-clear to programmers, and sprinkled with solutions to complex real-world problems, Mastering Regular Expressions, Third Edition offers a wealth of information that you can put to immediate use.

Reviews of this new edition and the second edition:

"There isn't a better (or more useful) book available on regular expressions."

--Zak Greant, Managing Director, eZ Systems

"A real tour-de-force of a book which not only covers the mechanics of regexes in extraordinary detail but also talks about efficiency and the use of regexes in Perl, Java, and .NET...If you use regular expressions as part of your professional work (even if you already have a good book on whatever language you're programming in) I would strongly recommend this book to you."

--Dr. Chris Brown, Linux Format

"The author does an outstanding job leading the reader from regex novice to master. The book is extremely easy to read and chock full of useful and relevant examples...Regular expressions are valuable tools that every developer should have in their toolbox. Mastering Regular Expressions is the definitive guide to the subject, and an outstanding resource that belongs on every programmer's bookshelf. Ten out of Ten Horseshoes."

--Jason Menard, Java Ranch

Table of Contents

  1. Cover Page
  2. Title Page
  3. Copyright Page
  4. Dedication
  5. Table of Contents
  6. Preface
  7. 1: Introduction to Regular Expressions
    1. Solving Real Problems
    2. Regular Expressions as a Language
      1. The Filename Analogy
      2. The Language Analogy
    3. The Regular-Expression Frame of Mind
      1. If You Have Some Regular-Expression Experience
      2. Searching Text Files: Egrep
    4. Egrep Metacharacters
      1. Start and End of the Line
      2. Character Classes
      3. Matching Any Character with Dot
      4. Alternation
      5. Ignoring Differences in Capitalization
      6. Word Boundaries
      7. In a Nutshell
      8. Optional Items
      9. Other Quantifiers: Repetition
      10. Parentheses and Backreferences
      11. The Great Escape
    5. Expanding the Foundation
      1. Linguistic Diversification
      2. The Goal of a Regular Expression
      3. A Few More Examples
      4. Regular Expression Nomenclature
      5. Improving on the Status Quo
      6. Summary
    6. Personal Glimpses
  8. 2: Extended Introductory Examples
    1. About the Examples
      1. A Short Introduction to Perl
    2. Matching Text with Regular Expressions
      1. Toward a More Real-World Example
      2. Side Effects of a Successful Match
      3. Intertwined Regular Expressions
      4. Intermission
    3. Modifying Text with Regular Expressions
      1. Example: Form Letter
      2. Example: Prettifying a Stock Price
      3. Automated Editing
      4. A Small Mail Utility
      5. Adding Commas to a Number with Lookaround
      6. Text-to-HTML Conversion
      7. That Doubled-Word Thing
  9. 3: Overview of Regular Expression Features and Flavors
    1. A Casual Stroll Across the Regex Landscape
      1. The Origins of Regular Expressions
      2. At a Glance
    2. Care and Handling of Regular Expressions
      1. Integrated Handling
      2. Procedural and Object-Oriented Handling
      3. A Search-and-Replace Example
      4. Search and Replace in Other Languages
      5. Care and Handling: Summary
    3. Strings, Character Encodings, and Modes
      1. Strings as Regular Expressions
      2. Character-Encoding Issues
      3. Unicode
      4. Regex Modes and Match Modes
    4. Common Metacharacters and Features
      1. Character Representations
      2. Character Classes and Class-Like Constructs
      3. Anchors and Other “Zero-Width Assertions”
      4. Comments and Mode Modifiers
      5. Grouping, Capturing, Conditionals, and Control
    5. Guide to the Advanced Chapters
  10. 4: The Mechanics of Expression Processing
    1. Start Your Engines!
      1. Two Kinds of Engines
      2. New Standards
      3. Regex Engine Types
      4. From the Department of Redundancy Department
      5. Testing the Engine Type
    2. Match Basics
      1. About the Examples
      2. Rule 1: The Match That Begins Earliest Wins
      3. Engine Pieces and Parts
      4. Rule 2: The Standard Quantifiers Are Greedy
    3. Regex-Directed Versus Text-Directed
      1. NFA Engine: Regex-Directed
      2. DFA Engine: Text-Directed
      3. First Thoughts: NFA and DFA in Comparison
    4. Backtracking
      1. A Really Crummy Analogy
      2. Two Important Points on Backtracking
      3. Saved States
      4. Backtracking and Greediness
    5. More About Greediness and Backtracking
      1. Problems of Greediness
      2. Multi-Character “Quotes”
      3. Using Lazy Quantifiers
      4. Greediness and Laziness Always Favor a Match
      5. The Essence of Greediness, Laziness, and Backtracking
      6. Possessive Quantifiers and Atomic Grouping
      7. Possessive Quantifiers, ?+, *+, ++, and {m,n}+
      8. The Backtracking of Lookaround
      9. Is Alternation Greedy?
      10. Taking Advantage of Ordered Alternation
    6. NFA, DFA, and POSIX
      1. “The Longest-Leftmost”
      2. POSIX and the Longest-Leftmost Rule
      3. Speed and Efficiency
      4. Summary: NFA and DFA in Comparison
    7. Summary
  11. 5: Practical Regex Techniques
    1. Regex Balancing Act
    2. A Few Short Examples
      1. Continuing with Continuation Lines
      2. Matching an IP Address
      3. Working with Filenames
      4. Matching Balanced Sets of Parentheses
      5. Watching Out for Unwanted Matches
      6. Matching Delimited Text
      7. Knowing Your Data and Making Assumptions
      8. Stripping Leading and Trailing Whitespace
    3. HTML-Related Examples
      1. Matching an HTML Tag
      2. Matching an HTML Link
      3. Examining an HTTP URL
      4. Validating a Hostname
      5. Plucking Out a URL in the Real World
    4. Extended Examples
      1. Keeping in Sync with Your Data
      2. Parsing CSV Files
  12. 6: Crafting an Efficient Expression
    1. A Sobering Example
      1. A Simple Change—Placing Your Best Foot Forward
      2. Efficiency Versus Correctness
      3. Advancing Further—Localizing the Greediness
      4. Reality Check
    2. A Global View of Backtracking
      1. More Work for a POSIX NFA
      2. Work Required During a Non-Match
      3. Being More Specific
      4. Alternation Can Be Expensive
    3. Benchmarking
      1. Know What You’re Measuring
      2. Benchmarking with PHP
      3. Benchmarking with Java
      4. Benchmarking with VB.NET
      5. Benchmarking with Ruby
      6. Benchmarking with Python
      7. Benchmarking with Tcl
    4. Common Optimizations
      1. No Free Lunch
      2. Everyone’s Lunch is Different
      3. The Mechanics of Regex Application
      4. Pre-Application Optimizations
      5. Optimizations with the Transmission
      6. Optimizations of the Regex Itself
    5. Techniques for Faster Expressions
      1. Common Sense Techniques
      2. Expose Literal Text
      3. Expose Anchors
      4. Lazy Versus Greedy: Be Specific
      5. Split Into Multiple Regular Expressions
      6. Mimic Initial-Character Discrimination
      7. Use Atomic Grouping and Possessive Quantifiers
      8. Lead the Engine to a Match
    6. Unrolling the Loop
      1. Method 1: Building a Regex From Past Experiences
      2. The Real “Unrolling-the-Loop” Pattern
      3. Method 2: A Top-Down View
      4. Method 3: An Internet Hostname
      5. Observations
      6. Using Atomic Grouping and Possessive Quantifiers
      7. Short Unrolling Examples
      8. Unrolling C Comments
    7. The Freeflowing Regex
      1. A Helping Hand to Guide the Match
      2. A Well-Guided Regex is a Fast Regex
      3. Wrapup
    8. In Summary: Think!
  13. 7: Perl
    1. Regular Expressions as a Language Component
      1. Perl’s Greatest Strength
      2. Perl’s Greatest Weakness
    2. Perl’s Regex Flavor
      1. Regex Operands and Regex Literals
      2. How Regex Literals Are Parsed
      3. Regex Modifiers
    3. Regex-Related Perlisms
      1. Expression Context
      2. Dynamic Scope and Regex Match Effects
      3. Special Variables Modified by a Match
    4. The qr/···/ Operator and Regex Objects
      1. Building and Using Regex Objects
      2. Viewing Regex Objects
      3. Using Regex Objects for Efficiency
    5. The Match Operator
      1. Match’s Regex Operand
      2. Specifying the Match Target Operand
      3. Different Uses of the Match Operator
      4. Iterative Matching: Scalar Context, with /g
      5. The Match Operator’s Environmental Relations
    6. The Substitution Operator
      1. The Replacement Operand
      2. The /e Modifier
      3. Context and Return Value
    7. The Split Operator
      1. Basic Split
      2. Returning Empty Elements
      3. Split’s Special Regex Operands
      4. Split’s Match Operand with Capturing Parentheses
    8. Fun with Perl Enhancements
      1. Using a Dynamic Regex to Match Nested Pairs
      2. Using the Embedded-Code Construct
      3. Using local in an Embedded-Code Construct
      4. A Warning About Embedded Code and my Variables
      5. Matching Nested Constructs with Embedded Code
      6. Overloading Regex Literals
      7. Problems with Regex-Literal Overloading
      8. Mimicking Named Capture
    9. Perl Efficiency Issues
      1. “There’s More Than One Way to Do It”
      2. Regex Compilation, the /o Modifier, qr/···/, and Efficiency
      3. Understanding the “Pre-Match” Copy
      4. The Study Function
      5. Benchmarking
      6. Regex Debugging Information
    10. Final Comments
  14. 8: Java
    1. Java’s Regex Flavor
      1. Java Support for \p{···} and \P{···}
      2. Unicode Line Terminators
    2. Using java.util.regex
    3. The Pattern.compile() Factory
      1. Pattern’s matcher method
    4. The Matcher Object
      1. Applying the Regex
      2. Querying Match Results
      3. Simple Search and Replace
      4. Advanced Search and Replace
      5. In-Place Search and Replace
      6. The Matcher’s Region
      7. Method Chaining
      8. Methods for Building a Scanner
      9. Other Matcher Methods
    5. Other Pattern Methods
      1. Pattern’s split Method, with One Argument
      2. Pattern’s split Method, with Two Arguments
    6. Additional Examples
      1. Adding Width and Height Attributes to Image Tags
      2. Validating HTML with Multiple Patterns Per Matcher
      3. Parsing Comma-Separated Values (CSV) Text
    7. Java Version Differences
      1. Differences Between 1.4.2 and 1.5.0
      2. Differences Between 1.5.0 and 1.6
  15. 9: .NET
    1. .NET’s Regex Flavor
      1. Additional Comments on the Flavor
    2. Using .NET Regular Expressions
      1. Regex Quickstart
      2. Package Overview
      3. Core Object Overview
    3. Core Object Details
      1. Creating Regex Objects
      2. Using Regex Objects
      3. Using Match Objects
      4. Using Group Objects
    4. Static “Convenience” Functions
      1. Regex Caching
    5. Support Functions
    6. Advanced .NET
      1. Regex Assemblies
      2. Matching Nested Constructs
      3. Capture Objects
  16. 10: PHP
    1. PHP’s Regex Flavor
    2. The Preg Function Interface
      1. “Pattern” Arguments
    3. The Preg Functions
      1. preg_match
      2. preg_match_all
      3. preg_replace
      4. preg_replace_callback
      5. preg_split
      6. preg_grep
      7. preg_quote
    4. “Missing” Preg Functions
      1. preg_regex_to_pattern
      2. Syntax-Checking an Unknown Pattern Argument
      3. Syntax-Checking an Unknown Regex
    5. Recursive Expressions
      1. Matching Text with Nested Parentheses
      2. No Backtracking Into Recursion
      3. Matching a Set of Nested Parentheses
    6. PHP Efficiency Issues
      1. The S Pattern Modifier: “Study”
    7. Extended Examples
      1. CSV Parsing with PHP
      2. Checking Tagged Data for Proper Nesting
  17. Index
  18. About the Author
  19. Colophon
  20. Footnotes
    1. Chapter 1
    2. Chapter 2
    3. Chapter 3
    4. Chapter 4
    5. Chapter 5
    6. Chapter 6
    7. Chapter 7
    8. Chapter 8
    9. Chapter 9
    10. Chapter 10