Programming PHP, 3rd Edition

Cover of Programming PHP, 3rd Edition by Kevin Tatroe... Published by O'Reilly Media, Inc.

Regular Expressions

If you need more complex searching functionality than the previous methods provide, you can use regular expressions. A regular expression is a string that represents a pattern. The regular expression functions compare that pattern to another string and see if any of the string matches the pattern. Some functions tell you whether there was a match, while others make changes to the string.

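There are three uses for regular expressions: matching, which can also be used to extract information from a string; substituting new text for matching text; and splitting a string into an array of smaller chunks. PHP has functions for all three. For instance, preg_match() performs a regular expression match; it returns 1 if the pattern matches, 0 if it does not, and false if an error occurs. The comments in this section's examples describe those results loosely as "returns true" and "returns false".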
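As a rough sketch of the three uses, the snippet below calls preg_match(), preg_replace(), and preg_split() on a made-up string. The variable names and sample text are illustrative only.

```php
<?php
$text = "apples, pears,bananas";

// Matching: does the string contain "pears"? Yields 1, 0, or false.
$found = preg_match('/pears/', $text);

// Substituting: replace each comma plus optional whitespace with " / ".
$replaced = preg_replace('/,\s*/', ' / ', $text);

// Splitting: break the string apart at the same separators.
$parts = preg_split('/,\s*/', $text);

var_dump($found);      // int(1)
echo $replaced, "\n";  // apples / pears / bananas
print_r($parts);       // apples, pears, bananas as three elements
```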

Perl has long been considered the benchmark for powerful regular expressions. PHP uses a C library called PCRE (Perl Compatible Regular Expressions) to provide almost complete support for Perl’s arsenal of regular expression features. Perl regular expressions act on arbitrary binary data, so you can safely match with patterns or strings that contain the NUL-byte (\x00).

The Basics

Most characters in a regular expression are literal characters, meaning that they match only themselves. For instance, if you search for the regular expression "/cow/" in the string "Dave was a cowhand", you get a match because "cow" occurs in that string.

Some characters have special meanings in regular expressions. For instance, a caret (^) at the beginning of a regular expression indicates that it must match the beginning of the string (or, more precisely, anchors the regular expression to the beginning of the string):

preg_match("/^cow/", "Dave was a cowhand"); // returns false
preg_match("/^cow/", "cowabunga!");         // returns true

Similarly, a dollar sign ($) at the end of a regular expression means that it must match the end of the string (i.e., anchors the regular expression to the end of the string):

preg_match("/cow$/", "Dave was a cowhand"); // returns false
preg_match("/cow$/", "Don't have a cow");   // returns true

A period (.) in a regular expression matches any single character:

preg_match("/c.t/", "cat"); // returns true
preg_match("/c.t/", "cut"); // returns true
preg_match("/c.t/", "c t"); // returns true
preg_match("/c.t/", "bat"); // returns false
preg_match("/c.t/", "ct");  // returns false

If you want to match one of these special characters (called a metacharacter), you have to escape it with a backslash:

preg_match('/\$5\.00/', 'Your bill is $5.00 exactly'); // returns true
preg_match('/$5.00/', 'Your bill is $5.00 exactly');   // returns false
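When the text to match comes from a variable, escaping metacharacters by hand is error-prone. The following is a minimal sketch using PHP's preg_quote() function, which backslash-escapes regular expression metacharacters and, through its optional second argument, the delimiter as well. The $price value is sample data.

```php
<?php
$price = '$5.00';

// preg_quote() escapes the $ and the . so that they match literally.
$pattern = '/' . preg_quote($price, '/') . '/';
echo $pattern, "\n";   // /\$5\.00/

$hit  = preg_match($pattern, 'Your bill is $5.00 exactly');  // 1
$miss = preg_match($pattern, 'Your bill is $5x00 exactly');  // 0
```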

Regular expressions are case-sensitive by default, so the regular expression "/cow/" doesn’t match the string "COW". If you want to perform a case-insensitive match, you specify a flag to indicate a case-insensitive match (as you’ll see later in this chapter).

So far, we haven’t done anything we couldn’t have done with the string functions we’ve already seen, like strstr(). The real power of regular expressions comes from their ability to specify abstract patterns that can match many different character sequences. You can specify three basic types of abstract patterns in a regular expression:

  • A set of acceptable characters that can appear in the string (e.g., alphabetic characters, numeric characters, specific punctuation characters)

  • A set of alternatives for the string (e.g., "com", "edu", "net", or "org")

  • A repeating sequence in the string (e.g., at least one but not more than five numeric characters)

These three kinds of patterns can be combined in countless ways to create regular expressions that match such things as valid phone numbers and URLs.

Character Classes

To specify a set of acceptable characters in your pattern, you can either build a character class yourself or use a predefined one. You can build your own character class by enclosing the acceptable characters in square brackets:

preg_match("/c[aeiou]t/", "I cut my hand");     // returns true
preg_match("/c[aeiou]t/", "This crusty cat");   // returns true
preg_match("/c[aeiou]t/", "What cart?");        // returns false
preg_match("/c[aeiou]t/", "14ct gold");         // returns false

The regular expression engine finds a "c", then checks that the next character is one of "a", "e", "i", "o", or "u". If it isn’t a vowel, the match fails and the engine goes back to looking for another "c". If a vowel is found, the engine checks that the next character is a "t". If it is, the engine is at the end of the match and returns true. If the next character isn’t a "t", the engine goes back to looking for another "c".

You can negate a character class with a caret (^) at the start:

preg_match("/c[^aeiou]t/", "I cut my hand");   // returns false
preg_match("/c[^aeiou]t/", "Reboot chthon");   // returns true
preg_match("/c[^aeiou]t/", "14ct gold");       // returns false

In this case, the regular expression engine is looking for a "c" followed by a character that isn’t a vowel, followed by a "t".

You can define a range of characters with a hyphen (-). This simplifies character classes like “all letters” and “all digits”:

preg_match("/[0-9]%/", "we are 25% complete");          // returns true
preg_match("/[0123456789]%/", "we are 25% complete");   // returns true
preg_match("/[a-z]t/", "11th");                         // returns false
preg_match("/[a-z]t/", "cat");                          // returns true
preg_match("/[a-z]t/", "PIT");                          // returns false
preg_match("/[a-zA-Z]!/", "11!");                       // returns false
preg_match("/[a-zA-Z]!/", "stop!");                     // returns true

When you are specifying a character class, some special characters lose their meaning while others take on new meanings. In particular, the $ anchor and the period lose their meaning in a character class, while the ^ character is no longer an anchor but negates the character class if it is the first character after the open bracket. For instance, [^\]] matches any nonclosing bracket character, while [$.^] matches any dollar sign, period, or caret.
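A short sketch of both points, using made-up sample strings: inside brackets the dollar sign, period, and caret are ordinary characters, and a backslash-escaped closing bracket can itself be a class member.

```php
<?php
// Inside brackets, $ and . are literal, and so is ^ when it is not first.
$count = preg_match_all('/[$.^]/', 'cost: $5.00 ^up', $m);
print_r($m[0]);   // the three characters $ . ^

// [^\]] matches any single character except a closing bracket,
// so deleting every match leaves only the brackets behind.
$onlyBrackets = preg_replace('/[^\]]/', '', 'a]b]c');
echo $onlyBrackets, "\n";   // ]]
```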

The various regular expression libraries define shortcuts for character classes, including digits, alphabetic characters, and whitespace.

Alternatives

You can use the vertical pipe (|) character to specify alternatives in a regular expression:

preg_match("/cat|dog/", "the cat rubbed my legs");      // returns true
preg_match("/cat|dog/", "the dog rubbed my legs");      // returns true
preg_match("/cat|dog/", "the rabbit rubbed my legs");   // returns false

The precedence of alternation can be a surprise: "/^cat|dog$/" selects from "^cat" and "dog$", meaning that it matches a line that either starts with "cat" or ends with "dog". If you want a line that contains just "cat" or "dog", you need to use the regular expression "/^(cat|dog)$/".
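The difference is easy to see side by side. This sketch runs both patterns over a few sample words (the word list and variable names are illustrative):

```php
<?php
$loose  = '/^cat|dog$/';    // "^cat" OR "dog$"
$strict = '/^(cat|dog)$/';  // exactly "cat" or exactly "dog"

foreach (['cat', 'catalog', 'hotdog', 'dog'] as $word) {
    printf("%-8s loose=%d strict=%d\n",
        $word, preg_match($loose, $word), preg_match($strict, $word));
}
// "catalog" and "hotdog" satisfy the loose pattern but not the strict one.
```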

You can combine character classes and alternation to, for example, check for strings that don’t start with a capital letter:

preg_match("/^([a-z]|[0-9])/", "The quick brown fox");   // returns false
preg_match("/^([a-z]|[0-9])/", "jumped over");           // returns true
preg_match("/^([a-z]|[0-9])/", "10 lazy dogs");          // returns true

Repeating Sequences

To specify a repeating pattern, you use something called a quantifier. The quantifier goes after the pattern that’s repeated and says how many times to repeat that pattern. Table 4-6 shows the quantifiers that are supported by PHP’s regular expressions.

Table 4-6. Regular expression quantifiers

Quantifier   Meaning
?            0 or 1
*            0 or more
+            1 or more
{n}          Exactly n times
{n,m}        At least n, no more than m times
{n,}         At least n times

To repeat a single character, simply put the quantifier after the character:

preg_match("/ca+t/", "caaaaaaat");   // returns true
preg_match("/ca+t/", "ct");          // returns false
preg_match("/ca?t/", "caaaaaaat");   // returns false
preg_match("/ca*t/", "ct");          // returns true

With quantifiers and character classes, we can actually do something useful, like matching valid U.S. telephone numbers:

preg_match("/[0-9]{3}-[0-9]{3}-[0-9]{4}/", "303-555-1212");    // returns true
preg_match("/[0-9]{3}-[0-9]{3}-[0-9]{4}/", "64-9-555-1234");   // returns false
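Note that the pattern above is unanchored, so it also accepts longer strings that merely contain something shaped like a phone number. A sketch of an anchored variant, with illustrative variable names and sample input:

```php
<?php
$unanchored = '/[0-9]{3}-[0-9]{3}-[0-9]{4}/';
$anchored   = '/^[0-9]{3}-[0-9]{3}-[0-9]{4}$/';

$input = '1303-555-121299';   // extra digits at both ends

$a = preg_match($unanchored, $input);        // 1: "303-555-1212" is found inside
$b = preg_match($anchored, $input);          // 0: the whole string must fit
$c = preg_match($anchored, '303-555-1212');  // 1
```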

Subpatterns

You can use parentheses to group bits of a regular expression together to be treated as a single unit called a subpattern:

preg_match("/a (very )+big dog/", "it was a very very big dog");   // returns true
preg_match("/^(cat|dog)$/", "cat");                                // returns true
preg_match("/^(cat|dog)$/", "dog");                                // returns true

The parentheses also cause the substring that matches the subpattern to be captured. If you pass an array as the third argument to a match function, the array is populated with any captured substrings:

preg_match("/([0-9]+)/", "You have 42 magic beans", $captured);
// returns true and populates $captured

The zeroth element of the array is set to the text that matched the entire pattern (here, "42"). The first element is the substring that matched the first subpattern (if there is one), the second element is the substring that matched the second subpattern, and so on.
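To make the layout of the array concrete, here is a small sketch with two subpatterns; the pattern is an illustrative extension of the example above.

```php
<?php
preg_match('/([0-9]+) magic (\w+)/', 'You have 42 magic beans', $captured);
print_r($captured);
// $captured[0] is "42 magic beans" (the text matched by the whole pattern)
// $captured[1] is "42"
// $captured[2] is "beans"
```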

Delimiters

Perl-style regular expressions emulate the Perl syntax for patterns, which means that each pattern must be enclosed in a pair of delimiters. Traditionally, the slash (/) character is used; for example, /pattern/. However, any nonalphanumeric, nonwhitespace character other than the backslash character (\) can be used to delimit a Perl-style pattern. This is useful when matching strings containing slashes, such as filenames. For example, the following are equivalent:

preg_match("/\/usr\/local\//", "/usr/local/bin/perl");   // returns true
preg_match("#/usr/local/#", "/usr/local/bin/perl");      // returns true

Parentheses (()), curly braces ({}), square brackets ([]), and angle brackets (<>) can be used as pattern delimiters:

preg_match("{/usr/local/}", "/usr/local/bin/perl");      // returns true

The section Trailing Options discusses the single-character modifiers you can put after the closing delimiter to modify the behavior of the regular expression engine. A very useful one is x, which makes the regular expression engine strip whitespace and #-marked comments from the regular expression before matching. These two patterns are the same, but one is much easier to read:

'/([[:alpha:]]+)\s+\1/'
'/(            # start capture
  [[:alpha:]]+ #   a word
 )             # end capture
 \s+           # whitespace
 \1            # the same word again
/x'
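One way to convince yourself that a commented pattern still means the same thing is to run both forms against the same input. A sketch under that assumption, with illustrative variable names and subject string:

```php
<?php
$compact = '/([[:alpha:]]+)\s+\1/';
$verbose = '/
    ( [[:alpha:]]+ )  # capture a word
    \s+               # whitespace
    \1                # the same word again
/x';

$subject = 'Paris in the the spring';
preg_match($compact, $subject, $m1);
preg_match($verbose, $subject, $m2);

var_dump($m1 === $m2);   // bool(true): both find "the the" and capture "the"
```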

Match Behavior

The period (.) matches any character except for a newline (\n). The dollar sign ($) matches at the end of the string or, if the string ends with a newline, just before that newline:

preg_match("/is (.*)$/", "the key is in my pants", $captured);
// $captured[1] is 'in my pants'
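The trailing-newline rule is easiest to see with a subject that actually ends in one. In this sketch the \z anchor, which matches only at the very end of the string, is used for contrast; the sample string is made up.

```php
<?php
$line = "the key is in my pants\n";   // note the trailing newline

// $ matches just before the final newline, and . stops there as well.
preg_match('/is (.*)$/', $line, $captured);
var_dump($captured[1]);               // string(11) "in my pants"

// \z insists on the very end of the string, so this does not match.
$strictEnd = preg_match('/pants\z/', $line);   // 0
```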

Character Classes

As shown in Table 4-7, Perl-compatible regular expressions define a number of named sets of characters that you can use in character classes. The expansions in Table 4-7 are for English. The actual letters vary from locale to locale.

Each [:something:] class can be used in place of a character in a character class. For instance, to find any character that’s a digit, an uppercase letter, or an “at” sign (@), use the following regular expression:

[@[:digit:][:upper:]]
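As a small illustration, the same class can be negated to strip everything else from a string. The input below is sample data.

```php
<?php
// Delete every character that is not "@", a digit, or an uppercase letter.
$kept = preg_replace('/[^@[:digit:][:upper:]]/', '', 'room 12B @ HQ');
echo $kept, "\n";   // 12B@HQ
```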

However, you can’t use a character class as the endpoint of a range:

preg_match("/[A-[:lower:]]/", "string");// invalid regular expression

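Some locales treat certain character sequences as if they were a single character; these are called collating sequences. POSIX regular expression syntax matches one of these multicharacter sequences in a character class by enclosing it in [. and .]. Be aware that PCRE, the library behind the preg functions, recognizes this syntax but does not support it, and reports a compilation error if it appears in a pattern. In POSIX syntax, if your locale has the collating sequence ch, you would match s, t, or ch with this character class: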

[st[.ch.]]

The final POSIX extension to character classes is the equivalence class, specified by enclosing the character in [= and =]. Equivalence classes match characters that have the same collating order, as defined in the current locale. For example, a locale may define a, á, and ä as having the same sorting precedence. To match any one of them, the equivalence class is [=a=]. PCRE recognizes this syntax but does not support it, so a pattern containing an equivalence class fails to compile.

Table 4-7. Character classes

Class        Description                                                                  Expansion
[:alnum:]    Alphanumeric characters                                                      [0-9a-zA-Z]
[:alpha:]    Alphabetic characters (letters)                                              [a-zA-Z]
[:ascii:]    7-bit ASCII                                                                  [\x00-\x7F]
[:blank:]    Horizontal whitespace (space, tab)                                           [ \t]
[:cntrl:]    Control characters                                                           [\x00-\x1F\x7F]
[:digit:]    Digits                                                                       [0-9]
[:graph:]    Characters that use ink to print (nonspace, noncontrol)                      [\x21-\x7E]
[:lower:]    Lowercase letter                                                             [a-z]
[:print:]    Printable character (graph class plus space)                                 [\x20-\x7E]
[:punct:]    Any punctuation character, such as the period (.) and the semicolon (;)      [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
[:space:]    Whitespace (newline, carriage return, tab, space, vertical tab, form feed)   [\n\r\t \x0B\x0C]
[:upper:]    Uppercase letter                                                             [A-Z]
[:xdigit:]   Hexadecimal digit                                                            [0-9a-fA-F]
\s           Whitespace (also matches vertical tab as of PCRE 8.34)                       [\r\n \t\f]
\S           Nonwhitespace                                                                [^\r\n \t\f]
\w           Word (identifier) character                                                  [0-9A-Za-z_]
\W           Nonword (identifier) character                                               [^0-9A-Za-z_]
\d           Digit                                                                        [0-9]
\D           Nondigit                                                                     [^0-9]

Anchors

An anchor limits a match to a particular location in the string (anchors do not match actual characters in the target string). Table 4-8 lists the anchors supported by regular expressions.

Table 4-8. Anchors

Anchor     Matches
^          Start of string (or after any \n if the m flag is enabled)
$          End of string or before \n at end (or before any \n if the m flag is enabled)
[[:<:]]    Start of word
[[:>:]]    End of word
\b         Word boundary (between \w and \W or at start or end of string)
\B         Nonword boundary (between \w and \w, or \W and \W)
\A         Beginning of string
\Z         End of string or before \n at end
\z         End of string

A word boundary is defined as the point between an identifier (alphanumeric or underscore) character and a nonidentifier character, such as whitespace or punctuation:

preg_match("/[[:<:]]gun[[:>:]]/", "the Burgundy exploded");   // returns false
preg_match("/gun/", "the Burgundy exploded");                 // returns true

Note that the beginning and end of a string also qualify as word boundaries.
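The \b anchor from Table 4-8 expresses the same idea more compactly. A brief sketch, with made-up subject strings:

```php
<?php
$subject = 'the Burgundy gun exploded';

// \bgun\b matches only the standalone word, not the "gun" inside "Burgundy".
preg_match('/\bgun\b/', $subject, $m, PREG_OFFSET_CAPTURE);
echo $m[0][1], "\n";   // 13, the offset of the standalone "gun"

// The start and end of the string count as boundaries too.
$edge = preg_match('/\bgun\b/', 'gun');                       // 1
$none = preg_match('/\bgun\b/', 'the Burgundy exploded');     // 0
```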

Quantifiers and Greed

Regular expression quantifiers are typically greedy. That is, when faced with a quantifier, the engine matches as much as it can while still satisfying the rest of the pattern. For instance:

preg_match("/(<.*>)/", "do <b>not</b> press the button", $match);
// $match[1] is '<b>not</b>'

The regular expression matches from the first less-than sign to the last greater-than sign. In effect, the .* matches everything after the first less-than sign, and the engine backtracks to make it match less and less until finally there’s a greater-than sign to be matched.

This greediness can be a problem. Sometimes you need minimal (nongreedy) matching—that is, quantifiers that match as few times as possible to satisfy the rest of the pattern. Perl provides a parallel set of quantifiers that match minimally. They’re easy to remember, because they’re the same as the greedy quantifiers, but with a question mark (?) appended. Table 4-9 shows the corresponding greedy and nongreedy quantifiers supported by Perl-style regular expressions.

Table 4-9. Greedy and nongreedy quantifiers in Perl-compatible regular expressions

Greedy quantifier   Nongreedy quantifier
?                   ??
*                   *?
+                   +?
{m}                 {m}?
{m,}                {m,}?
{m,n}               {m,n}?

Here’s how to match a tag using a nongreedy quantifier:

preg_match("/(<.*?>)/", "do <b>not</b> press the button", $match);
// $match[1] is "<b>"

Another, faster way is to use a character class to match every non-greater-than character up to the next greater-than sign:

preg_match("/(<[^>]*>)/", "do <b>not</b> press the button", $match);
// $match[1] is '<b>'
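Collecting every match with preg_match_all() makes the three behaviors easy to compare. This is only a sketch; the variable names are illustrative.

```php
<?php
$html = 'do <b>not</b> press the button';

preg_match_all('/<.*>/',    $html, $greedy);  // one match: <b>not</b>
preg_match_all('/<.*?>/',   $html, $lazy);    // two matches: <b> and </b>
preg_match_all('/<[^>]*>/', $html, $class);   // the same two matches

print_r($greedy[0]);
print_r($lazy[0]);
print_r($class[0]);
```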

Noncapturing Groups

If you enclose a part of a pattern in parentheses, the text that matches that subpattern is captured and can be accessed later. Sometimes, though, you want to create a subpattern without capturing the matching text. In Perl-compatible regular expressions, you can do this using the (?:subpattern) construct:

preg_match("/(?:ello)(.*)/", "jello biafra", $match);
// $match[1] is " biafra"
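Comparing the capture arrays of the capturing and noncapturing forms shows what changes. A sketch with illustrative variable names:

```php
<?php
preg_match('/(ello)(.*)/',   'jello biafra', $withCapture);
preg_match('/(?:ello)(.*)/', 'jello biafra', $without);

print_r($withCapture);  // "ello biafra", "ello", " biafra"
print_r($without);      // "ello biafra", " biafra"
```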

Backreferences

You can refer to text captured earlier in a pattern with a backreference: \1 refers to the contents of the first subpattern, \2 refers to the second, and so on. If you nest subpatterns, the first begins with the first opening parenthesis, the second begins with the second opening parenthesis, and so on.

For instance, this identifies doubled words:

preg_match('/([[:alpha:]]+)\s+\1/', "Paris in the the spring", $m);
// returns true and $m[1] is "the"
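The numbering of nested subpatterns can be checked directly. In this sketch the patterns and sample strings are made up. Note the single-quoted pattern strings: inside double quotes PHP would treat \1 as an octal character escape before the regular expression engine ever saw it.

```php
<?php
// Groups are numbered by the position of their opening parenthesis:
// group 1 is year-month, group 2 the year, group 3 the month.
preg_match('/((\d{4})-(\d{2}))-\d{2}/', 'due 2013-02-14', $m);
print_r($m);   // "2013-02-14", "2013-02", "2013", "02"

// Backreferences follow the same numbering: \2 is "a", \1 is "ab".
$nested = preg_match('/((a)b)\2\1/', 'abaab');   // 1
```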

The preg_match() function captures at most 99 subpatterns; subpatterns after the 99th are ignored.

Trailing Options

Perl-style regular expressions let you put single-letter options (flags) after the regular expression pattern to modify the interpretation, or behavior, of the match. For instance, to match case-insensitively, simply use the i flag:

preg_match("/cat/i", "Stop, Catherine!"); // returns true

Table 4-10 shows the modifiers from Perl that are supported in Perl-compatible regular expressions.

Table 4-10. Perl flags

Modifier     Meaning
/regexp/i    Match case-insensitively
/regexp/s    Make period (.) match any character, including newline (\n)
/regexp/x    Remove whitespace and comments from the pattern
/regexp/m    Make caret (^) match after, and dollar sign ($) match before, internal newlines (\n)
/regexp/e    If the replacement string is PHP code, eval() it to get the actual replacement string (preg_replace() only; deprecated in PHP 5.5 and removed in PHP 7.0, where preg_replace_callback() takes its place)

PHP’s Perl-compatible regular expression functions also support other modifiers that aren’t supported by Perl, as listed in Table 4-11.

Table 4-11. Additional PHP flags

Modifier     Meaning
/regexp/U    Reverses the greediness of the subpattern; * and + now match as little as possible, instead of as much as possible
/regexp/u    Causes pattern and subject strings to be treated as UTF-8
/regexp/X    Causes a backslash followed by a character with no special meaning to emit an error
/regexp/A    Causes the beginning of the string to be anchored as if the first character of the pattern were ^
/regexp/D    Causes the $ character to match only at the very end of the string, not before a trailing newline (ignored if the m flag is set)
/regexp/S    Causes the expression parser to more carefully examine the structure of the pattern, so it may run slightly faster the next time (such as in a loop)

It’s possible to use more than one option in a single pattern, as demonstrated in the following example:

$message = <<< END
To: you@youcorp
From: me@mecorp
Subject: pay up

Pay me or else!
END;

preg_match("/^subject: (.*)/im", $message, $match);
print_r($match);

Array
(
    [0] => Subject: pay up
    [1] => pay up
)
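The s and m flags are easy to confuse, so here is a small sketch that isolates each one. The two-line sample string and the variable names are illustrative.

```php
<?php
$text = "first line\nsecond line";

$noS   = preg_match('/first.*second/',  $text);  // 0: . stops at the newline
$withS = preg_match('/first.*second/s', $text);  // 1: s lets . cross it

$noM   = preg_match('/^second/',  $text);        // 0: ^ matches only at the start
$withM = preg_match('/^second/m', $text);        // 1: m lets ^ match after \n
```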

Inline Options

In addition to specifying pattern-wide options after the closing pattern delimiter, you can specify options within a pattern to have them apply only to part of the pattern. The syntax for this is:

(?flags:subpattern)

For example, only the word “PHP” is case-insensitive in this example:

preg_match('/I like (?i:PHP)/', 'I like pHp');  // returns true

The i, m, s, U, x, and X options can be applied internally in this fashion. You can use multiple options at once:

preg_match('/eat (?ix:foo   d)/', 'eat FoOD'); // returns true

Prefix an option with a hyphen (-) to turn it off:

preg_match('/(?-i:I like) PHP/i', 'I like pHp');   // returns true

An alternative form enables or disables the flags until the end of the enclosing subpattern or pattern:

preg_match('/I like (?i)PHP/', 'I like pHp');  // returns true
preg_match('/I (like (?i)PHP) a lot/', 'I like pHp a lot', $match);
// $match[1] is 'like pHp'

Inline flags do not enable capturing. You need an additional set of capturing parentheses to do that.

Lookahead and Lookbehind

In patterns it’s sometimes useful to be able to say “match here if this is next.” This is particularly common when you are splitting a string. The regular expression describes the separator, which is not returned. You can use lookahead to make sure (without matching it, thus preventing it from being returned) that there’s more data after the separator. Similarly, lookbehind checks the preceding text.

Lookahead and lookbehind come in two forms: positive and negative. A positive lookahead or lookbehind says “the next/preceding text must be like this.” A negative lookahead or lookbehind indicates “the next/preceding text must not be like this.” Table 4-12 shows the four constructs you can use in Perl-compatible patterns. None of the constructs captures text.

Table 4-12. Lookahead and lookbehind assertions

Construct          Meaning
(?=subpattern)     Positive lookahead
(?!subpattern)     Negative lookahead
(?<=subpattern)    Positive lookbehind
(?<!subpattern)    Negative lookbehind

A simple use of positive lookahead is splitting a Unix mbox mail file into individual messages. The word "From" starting a line by itself indicates the start of a new message, so you can split the mailbox into messages by specifying the separator as the point where the next text is "From" at the start of a line:

$messages = preg_split('/(?=^From )/m', $mailbox);
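Here is the same split as a runnable sketch on a tiny stand-in for a mailbox. The sample data is made up, and the PREG_SPLIT_NO_EMPTY flag is an addition that drops the empty piece produced by the zero-length match at the very start of the string.

```php
<?php
// Sample data standing in for an mbox file.
$mailbox = "From alice Mon Jan 1\nSubject: hi\n\nHello\n"
         . "From bob Tue Jan 2\nSubject: re\n\nThanks\n";

// Split wherever a line begins with "From ". The lookahead consumes
// nothing, so every message keeps its own "From " line.
$messages = preg_split('/(?=^From )/m', $mailbox, -1, PREG_SPLIT_NO_EMPTY);

echo count($messages), "\n";   // 2
echo $messages[1];             // the message from bob, "From " line included
```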

A simple use of negative lookbehind is to extract quoted strings that contain quoted delimiters. For instance, here’s how to extract a single-quoted string (note that the regular expression is commented using the x modifier):

$input = <<< END
name = 'Tim O\'Reilly';
END;

$pattern = <<< END
'             # opening quote
(             # begin capturing
  .*?         # the string
  (?<! \\\\ ) # skip escaped quotes
)             # end capturing
'             # closing quote
END;
preg_match( "($pattern)x", $input, $match);
echo $match[1];
Tim O\'Reilly

The only tricky part is that to get a pattern that looks behind to see if the last character was a backslash, we need to escape the backslash to prevent the regular expression engine from seeing \), which would mean a literal close parenthesis. In other words, we have to backslash that backslash: \\). But PHP’s string-quoting rules say that \\ produces a literal single backslash, so we end up requiring four backslashes to get one through the regular expression! This is why regular expressions have a reputation for being hard to read.

Perl limits lookbehind to constant-width expressions. That is, the expressions cannot contain quantifiers, and if you use alternation, all the choices must be the same length. The Perl-compatible regular expression engine also forbids quantifiers in lookbehind, but does permit alternatives of different lengths.

Cut

The rarely used once-only subpattern, or cut, prevents worst-case behavior by the regular expression engine on some kinds of patterns. The subpattern is never backed out of once matched.

The common use for the once-only subpattern is when you have a repeated expression that may itself be repeated:

/(a+|b+)*\.+/

This code snippet takes several seconds to report failure:

$p = '/(a+|b+)*\.+$/';
$s = 'abababababbabbbabbaaaaaabbbbabbababababababbba..!';

if (preg_match($p, $s)) {
  echo "Y";
}
else {
  echo "N";
}

This is because the regular expression engine tries all the different places to start the match, but has to backtrack out of each one, which takes time. If you know that once something is matched it should never be backed out of, you should mark it with (?> subpattern ):

$p = '/(?>a+|b+)*\.+$/';

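In this example, the cut does not change the outcome of the match; it simply makes it fail faster. Be aware that in other patterns a cut can turn a match into a failure, because the engine is no longer allowed to give back text that the subpattern has already consumed.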
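As a rough sketch, the following reuses the once-only pattern and the subject string from above (the variable names are ours). It checks that the pattern reports failure on the problem string and still matches a string that really does end in periods.

```php
<?php
// Pattern from the text, with the once-only subpattern (cut) applied.
$cut = '/(?>a+|b+)*\.+$/';
$subject = 'abababababbabbbabbaaaaaabbbbabbababababababbba..!';

// The trailing "!" means the pattern cannot match; the cut lets the engine
// give up without retrying every way of dividing up the runs of a's and b's.
$failed = preg_match($cut, $subject);

// A subject that does end in periods still matches.
$matched = preg_match($cut, 'abab..');

var_dump($failed, $matched);   // int(0) and int(1)
```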

Conditional Expressions

A conditional expression is like an if statement in a regular expression. The general form is:

(?(condition)yespattern)
(?(condition)yespattern|nopattern)

If the condition succeeds, the regular expression engine matches the yespattern. With the second form, if the condition doesn’t succeed, the regular expression engine skips the yespattern and tries to match the nopattern.

The condition can be one of two types: either a backreference, or a lookahead or lookbehind assertion. To test a previously matched subpattern, the condition is a number from 1–99 (the most backreferences available); the yespattern is used only if the subpattern with that number has matched. If the condition is not a number, it must be a positive or negative lookahead or lookbehind assertion.
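Here is a minimal sketch of both kinds of condition. The patterns and test strings are our own illustrations, not taken from a particular application: the first requires a closing parenthesis only when an opening one was captured, and the second uses a lookahead to choose between two formats.

```php
<?php
// Group 1 is an optional "("; the conditional (?(1)\)) demands a closing ")"
// only when group 1 took part in the match.
$balanced = '/^(\()?\d+(?(1)\))$/';

$withParens   = preg_match($balanced, '(123)');  // 1
$withoutAny   = preg_match($balanced, '123');    // 1
$missingClose = preg_match($balanced, '(123');   // 0
$missingOpen  = preg_match($balanced, '123)');   // 0

// A lookahead as the condition: if the next character is a digit, require
// exactly five digits; otherwise require two uppercase letters.
$code = '/^(?(?=\d)\d{5}|[A-Z]{2})$/';

$fiveDigits = preg_match($code, '12345');  // 1
$twoLetters = preg_match($code, 'AB');     // 1
$twoDigits  = preg_match($code, '12');     // 0

var_dump($withParens, $withoutAny, $missingClose, $missingOpen);
var_dump($fiveDigits, $twoLetters, $twoDigits);
```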

Functions

There are five classes of functions that work with Perl-compatible regular expressions: matching, replacing, splitting, filtering, and a utility function for quoting text.

Matching

The preg_match() function performs Perl-style pattern matching on a string. It’s the equivalent of the m// operator in Perl. It returns 1 if the pattern matches and 0 if it doesn’t, and it optionally fills an array with the captured substrings:

$found = preg_match(pattern, string [, captured ]);

For example:

preg_match('/y.*e$/', 'Sylvie');         // returns 1
preg_match('/y(.*)e$/', 'Sylvie', $m);   // $m is array('ylvie', 'lvi')

There’s no separate function for matching case-insensitively. Instead, use the i flag on the pattern:

preg_match('/y.*e$/i', 'SyLvIe');   // returns 1

The preg_match_all() function repeatedly matches from where the last match ended, until no more matches can be made:

$found = preg_match_all(pattern, string, matches [, order ]);

The order value, either PREG_PATTERN_ORDER or PREG_SET_ORDER, determines the layout of matches. We’ll look at both, using this code as a guide:

$string = <<< END
13 dogs
12 rabbits
8 cows
1 goat
END;
preg_match_all('/(\d+) (\S+)/', $string, $m1, PREG_PATTERN_ORDER);
preg_match_all('/(\d+) (\S+)/', $string, $m2, PREG_SET_ORDER);

With PREG_PATTERN_ORDER (the default), each element of the array corresponds to a particular capturing subpattern. So $m1[0] is an array of all the substrings that matched the pattern, $m1[1] is an array of all the substrings that matched the first subpattern (the numbers), and $m1[2] is an array of all the substrings that matched the second subpattern (the words). The array $m1 has one more element than there are subpatterns.

With PREG_SET_ORDER, each element of the array corresponds to the next attempt to match the whole pattern. So $m2[0] is an array of the first set of matches ('13 dogs', '13', 'dogs'), $m2[1] is an array of the second set of matches ('12 rabbits', '12', 'rabbits'), and so on. The array $m2 has as many elements as there were successful matches of the entire pattern.
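The difference between the two layouts is easiest to see when you loop over the results. This sketch uses the same pattern and data as above (written as a double-quoted string rather than a heredoc) and prints the same lines from each array.

```php
<?php
$string = "13 dogs\n12 rabbits\n8 cows\n1 goat";

preg_match_all('/(\d+) (\S+)/', $string, $m1, PREG_PATTERN_ORDER);
preg_match_all('/(\d+) (\S+)/', $string, $m2, PREG_SET_ORDER);

// PREG_PATTERN_ORDER: one row per subpattern, so pair the columns by index.
foreach ($m1[2] as $i => $animal) {
  echo "{$animal}: {$m1[1][$i]}\n";
}

// PREG_SET_ORDER: one row per match, which suits a plain foreach loop.
foreach ($m2 as $set) {
  echo "{$set[2]}: {$set[1]}\n";
}
```

Both loops print the same four lines. Which order is more convenient depends on whether you want to work with one subpattern across all matches or with one match at a time.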

Example 4-1 fetches the HTML at a particular web address into a string and extracts the URLs from that HTML. For each URL, it generates a link back to the program that will display the URLs at that address.

Example 4-1. Extracting URLs from an HTML page
<?php
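// default to an empty string so the form also works on the first visit
if ($_SERVER['REQUEST_METHOD'] == 'POST') {
  $url = isset($_POST['url']) ? $_POST['url'] : '';
}
else {
  $url = isset($_GET['url']) ? $_GET['url'] : '';
}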
?>

<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="POST">
  <p>URL: <input type="text" name="url" value="<?php echo htmlspecialchars($url); ?>" /><br />
  <input type="submit">
</form>

<?php
if ($url) {
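  $remote = fopen($url, 'r');
  // a single fread() on a network stream may return only one chunk,
  // so read until 1 MB of HTML or the end of the stream
  $html = stream_get_contents($remote, 1048576);
  fclose($remote);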

  $urls = '(http|telnet|gopher|file|wais|ftp)';
  $ltrs = '\w';
  $gunk = '/#~:.?+=&%@!\-';
  $punc = '.:?\-';
  $any = "{$ltrs}{$gunk}{$punc}";

  preg_match_all("{
    \b          # start at word boundary
    {$urls}:    # need resource and a colon
    [{$any}] +? # followed by one or more of any valid
                # characters—but be conservative
                # and take only what you need
    (?=         # the match ends at
    [{$punc}]*  # punctuation
    [^{$any}]   # followed by a non-URL character
    |           # or
    \$          # the end of the string
    )
  }x", $html, $matches);

  printf("I found %d URLs<P>\n", sizeof($matches[0]));

  foreach ($matches[0] as $u) {
    $link = $_SERVER['PHP_SELF'] . '?url=' . urlencode($u);
    echo "<a href=\"{$link}\">{$u}</a><br />\n";
  }
}

Replacing

The preg_replace() function behaves like the search-and-replace operation in your text editor. It finds all occurrences of a pattern in a string and changes those occurrences to something else:

$new = preg_replace(pattern, replacement, subject [, limit ]);

In the most common usage, the pattern, replacement, and subject are all strings, and limit is an integer. The limit is the maximum number of occurrences of the pattern to replace (the default, -1, means all occurrences):

$better = preg_replace('/<.*?>/', '!', 'do <b>not</b> press the button');
// $better is 'do !not! press the button'

Pass an array of strings as subject to make the substitution on all of them. The new strings are returned from preg_replace():

$names = array('Fred Flintstone',
  'Barney Rubble',
  'Wilma Flintstone',
  'Betty Rubble');
$tidy  = preg_replace('/(\w)\w* (\w+)/', '\1 \2', $names);
// $tidy is array ('F Flintstone', 'B Rubble', 'W Flintstone', 'B Rubble')

To perform multiple substitutions on the same string or array of strings with one call to preg_replace(), pass arrays of patterns and replacements:

$contractions = array("/don't/i", "/won't/i", "/can't/i");
$expansions = array('do not', 'will not', 'can not');
$string = "Please don't yell—I can't jump while you won't speak";
$longer = preg_replace($contractions, $expansions, $string);
// $longer is 'Please do not yell—I can not jump while you will not speak';

If you give fewer replacements than patterns, text matching the extra patterns is deleted. This is a handy way to delete a lot of things at once:

$htmlGunk = array('/<.*?>/', '/&.*?;/');
$html = '&eacute; : <b>very</b> cute';
$stripped  = preg_replace($htmlGunk, array(), $html);
// $stripped is ' : very cute'

If you give an array of patterns but a single string replacement, the same replacement is used for every pattern:

$stripped = preg_replace($htmlGunk, '', $html);

The replacement can use backreferences. Unlike backreferences in patterns, though, the preferred syntax for backreferences in replacements is $1, $2, $3, etc. For example:

echo preg_replace('/(\w)\w+\s+(\w+)/', '$2, $1.', 'Fred Flintstone');
Flintstone, F.

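The /e modifier makes preg_replace() treat the replacement string as PHP code that returns the actual string to use in the replacement. Note that /e was deprecated in PHP 5.5 and removed in PHP 7.0, so this technique works only on older versions; preg_replace_callback(), described later in this section, is the supported way to compute replacements. For example, this converts every Celsius temperature to Fahrenheit: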

$string  = 'It was 5C outside, 20C inside';
echo preg_replace('/(\d+)C\b/e', '$1*9/5+32', $string);
It was 41 outside, 68 inside

This more complex example expands variables in a string:

$name = 'Fred';
$age  = 35;
$string = '$name is $age';
preg_replace('/\$(\w+)/e', '$$1', $string);

Each match isolates the name of a variable ($name, $age). The $1 in the replacement refers to those names, so the PHP code actually executed is $name and $age. That code evaluates to the value of the variable, which is what’s used as the replacement. Whew!
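On PHP versions where /e is not available, the same two substitutions can be sketched with preg_replace_callback() and anonymous functions (PHP 5.3 or later). This is one possible rewrite, not the only one. The $vars array is our own stand-in for the variables being expanded; looking names up in an explicit array avoids evaluating arbitrary variable names.

```php
<?php
// Celsius to Fahrenheit without /e: the closure receives the match array.
$string = 'It was 5C outside, 20C inside';
$fahrenheit = preg_replace_callback(
  '/(\d+)C\b/',
  function ($m) { return (string) ($m[1] * 9 / 5 + 32); },
  $string
);
echo $fahrenheit, "\n";   // It was 41 outside, 68 inside

// Variable expansion from an explicit array of allowed names.
$vars = array('name' => 'Fred', 'age' => 35);
$expanded = preg_replace_callback(
  '/\$(\w+)/',
  function ($m) use ($vars) {
    // Leave unknown placeholders untouched.
    return isset($vars[$m[1]]) ? (string) $vars[$m[1]] : $m[0];
  },
  '$name is $age'
);
echo $expanded, "\n";     // Fred is 35
```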

A variation on preg_replace() is preg_replace_callback(). This calls a function to get the replacement string. The function is passed an array of matches (the zeroth element is all the text that matched the pattern, the first is the contents of the first captured subpattern, and so on). For example:

function titlecase($s)
{
  return ucfirst(strtolower($s[0]));
}

$string = 'goodbye cruel world';
$new = preg_replace_callback('/\w+/', 'titlecase', $string);
echo $new;

Goodbye Cruel World

Splitting

Whereas you use preg_match_all() to extract chunks of a string when you know what those chunks are, use preg_split() to extract chunks when you know what separates the chunks from each other:

$chunks = preg_split(pattern, string [, limit [, flags ]]);

The pattern matches a separator between two chunks. By default, the separators are not returned. The optional limit specifies the maximum number of chunks to return (−1 is the default, which means all chunks). The flags argument is a bitwise OR combination of the flags PREG_SPLIT_NO_EMPTY (empty chunks are not returned) and PREG_SPLIT_DELIM_CAPTURE (parts of the string captured in the pattern are returned).

For example, to extract just the operands from a simple numeric expression, use:

$ops = preg_split('{[+*/-]}', '3+5*9/2');
// $ops is array('3', '5', '9', '2')

To extract the operands and the operators, use:

$ops = preg_split('{([+*/-])}', '3+5*9/2', -1, PREG_SPLIT_DELIM_CAPTURE);
// $ops is array('3', '+', '5', '*', '9', '/', '2')
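The limit argument and the PREG_SPLIT_NO_EMPTY flag are easier to judge side by side. The strings in this sketch are invented for illustration.

```php
<?php
// Without flags, the trailing comma leaves an empty chunk at the end.
$raw = preg_split('/[\s,]+/', 'a, b,  c,');
// $raw is array('a', 'b', 'c', '')

// PREG_SPLIT_NO_EMPTY drops it; -1 keeps the default of "no limit".
$clean = preg_split('/[\s,]+/', 'a, b,  c,', -1, PREG_SPLIT_NO_EMPTY);
// $clean is array('a', 'b', 'c')

// A limit of 2 stops after the first separator; the rest stays in one chunk.
$pair = preg_split('/:\s*/', 'Subject: Re: lunch', 2);
// $pair is array('Subject', 'Re: lunch')

print_r($raw);
print_r($clean);
print_r($pair);
```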

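An empty pattern matches at every boundary between characters in the string, including the very start and the very end. Combined with PREG_SPLIT_NO_EMPTY, which discards the empty chunks at both ends, this lets you split a string into an array of characters:

$array = preg_split('//', $string, -1, PREG_SPLIT_NO_EMPTY);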

Filtering an array with a regular expression

The preg_grep() function returns those elements of an array that match a given pattern:

$matching = preg_grep(pattern, array);

For instance, to get only the filenames that end in .txt, use:

$textfiles = preg_grep('/\.txt$/', $filenames);
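Two details are worth seeing in a small sketch: preg_grep() preserves the keys of the original array, and the optional PREG_GREP_INVERT flag returns the elements that do not match. The filenames here are made up.

```php
<?php
$filenames = array('notes.txt', 'index.php', 'todo.txt', 'logo.png');

// Matching elements keep their original keys (0 and 2 here).
$textfiles = preg_grep('/\.txt$/', $filenames);

// PREG_GREP_INVERT returns the elements that do NOT match.
$others = preg_grep('/\.txt$/', $filenames, PREG_GREP_INVERT);

// array_values() reindexes the result when consecutive keys are wanted.
print_r(array_values($textfiles));
print_r($others);
```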

Quoting for regular expressions

The preg_quote() function creates a regular expression that matches only a given string:

$re = preg_quote(string [, delimiter ]);

Every character in string that has special meaning inside a regular expression (e.g., * or $) is prefaced with a backslash:

echo preg_quote('$5.00 (five bucks)');
\$5\.00 \(five bucks\)

The optional second argument is an extra character to be quoted. Usually, you pass your regular expression delimiter here:

$toFind = '/usr/local/etc/rsync.conf';
$re = preg_quote($toFind, '/');

if (preg_match("/{$re}/", $filename)) {
  // found it!
}
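Quoting matters most when the text to find comes from somewhere you don't control, such as a search box. As a sketch, the hypothetical helper below (highlightTerm() is our name, not a PHP built-in) wraps every literal occurrence of a term in bold tags.

```php
<?php
// highlightTerm() is a hypothetical helper written for this example.
function highlightTerm($text, $term)
{
  // Quote the term (and the "/" delimiter) so it is matched literally.
  $re = '/' . preg_quote($term, '/') . '/i';
  // $0 in the replacement stands for the whole matched text.
  return preg_replace($re, '<b>$0</b>', $text);
}

echo highlightTerm('Total: $5.00 (five bucks)', '$5.00'), "\n";
// Total: <b>$5.00</b> (five bucks)
```

Without preg_quote(), the $ and . in the term would be treated as regular expression metacharacters and the literal text would not be found.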

Differences from Perl Regular Expressions

Although very similar, PHP’s implementation of Perl-style regular expressions has a few minor differences from actual Perl regular expressions:

  • The NULL character (ASCII 0) is not allowed as a literal character within a pattern string. You can reference it in other ways, however (\000, \x00, etc.).

  • The \E, \G, \L, \l, \Q, \u, and \U options are not supported.

  • The (?{ some perl code }) construct is not supported.

  • The /D, /G, /U, /u, /A, and /X modifiers are supported.

  • The vertical tab \v counts as a whitespace character.

  • Lookahead and lookbehind assertions cannot be repeated using *, +, or ?.

  • Parenthesized submatches within negative assertions are not remembered.

  • Alternation branches within a lookbehind assertion can be of different lengths.
