Mastering Regular Expressions, 3rd Edition

By Jeffrey E.F. Friedl. Published by O'Reilly Media, Inc.

Perl Efficiency Issues

For the most part, efficiency with Perl regular expressions is achieved in the same way as with any tool that uses a Traditional NFA. The techniques discussed in Chapter 6 (the internal optimizations, the unrolling methods, the “Think” section) all apply to Perl.

There are, of course, Perl-specific issues as well, and in this section, we’ll look at the following topics:

  • “There’s More Than One Way To Do It”: Perl is a toolbox offering many approaches to a solution. Knowing which problems are nails comes with understanding The Perl Way, and knowing which hammer to use for any particular nail goes a long way toward making more efficient and more understandable programs. Sometimes efficiency and understandability seem to be mutually exclusive, but a better understanding allows you to make better choices.
  • Regex Compilation, qr/···/, the /o Modifier, and Efficiency: The interpolation and compilation of regex operands are fertile ground for saving time. The /o modifier, which I haven’t discussed much yet, along with regex objects (qr/···/), gives you some control over when the costly re-compilation takes place.
  • The $& Penalty: The three match side-effect variables, $`, $&, and $', can be convenient, but there’s a hidden efficiency gotcha waiting in store for any script that uses them, even once, anywhere. Heck, you don’t even have to use them: the entire script is penalized if one of these variables even appears in the script.
  • The Study Function: Since ages ...
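
To make two of the topics above concrete, here is a minimal sketch, not code from the book. It assumes a pattern that is known only at run time and a hypothetical helper named find_matches. The pattern is interpolated and compiled once with qr/···/, outside the loop, and the matched text is picked up through capturing parentheses and $1, so the script never mentions $`, $&, or $'.

```perl
use strict;
use warnings;

# Hypothetical helper (an illustration, not from the book): returns every
# word in @$lines that matches $word_pattern, a pattern string that is
# known only at run time.
sub find_matches {
    my ($word_pattern, $lines) = @_;

    # Interpolate and compile once, before the loop, into a regex object.
    # The capturing parentheses stand in for $&, so none of the three
    # match side-effect variables appears anywhere in this script.
    my $regex = qr/\b($word_pattern)\b/i;

    my @found;
    for my $line (@$lines) {
        # The regex object is the entire regex operand here, so the
        # pattern text is not rebuilt from $word_pattern on each pass.
        while ($line =~ /$regex/g) {
            push @found, $1;    # $1 holds the matched word
        }
    }
    return @found;
}

my @hits = find_matches('colou?r',
                        ['Color me red', 'the colour ran', 'no match here']);
print join(',', @hits), "\n";
```

The extra set of capturing parentheses has its own small cost, but that cost is paid only by this one regex rather than by the whole script. Whether either change is measurable in a given program is something to benchmark rather than assume.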