4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”

Problem

You want to convert people’s names from the “FirstName LastName” format to “LastName, FirstName” for use in an alphabetical listing. You additionally want to account for other name parts, so that you can, say, convert “FirstName MiddleNames Particles LastName Suffix” to “LastName, FirstName MiddleNames Particles Suffix.”

Solution

Unfortunately, it isn’t possible to reliably parse names using a regular expression. Regular expressions are rigid, whereas names are so flexible that even humans get them wrong. Determining the structure of a name or how it should be listed alphabetically often requires taking traditional and national conventions, or even personal preferences, into account. Nevertheless, if you’re willing to make certain assumptions about your data and can handle a moderate level of error, a regular expression can provide a quick solution.

The following regular expression has intentionally been kept simple, rather than trying to account for edge cases.

Regular expression

^(.+?)●([^\s,]+)(,?●(?:[JS]r\.?|III?|IV))?$

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Replacement

$2,●$1$3

Replacement text flavors: .NET, Java, JavaScript, Perl, PHP

\2,●\1\3

Replacement text flavors: Python, Ruby

JavaScript example

function formatName(name) {
    return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i,
                        "$2, $1$3");
}

Recipe 3.15 has code listings that will help you add this regex search-and-replace to programs written in other languages. Recipe 3.4 shows how to set the “case insensitive” option used here.

Discussion

First, let’s take a look at this regular expression piece by piece. Higher-level comments are provided afterward to help explain which parts of a name are being matched by various segments of the regex. Since the regex is written here in free-spacing mode, the literal space characters have been escaped with backslashes:

^              # Assert position at the beginning of the string.
(              # Capture the enclosed match to backreference 1:
  .+?          #   Match one or more characters, as few times as possible.
)              # End the capturing group.
\              # Match a literal space character.
(              # Capture the enclosed match to backreference 2:
  [^\s,]+      #   Match one or more non-whitespace/comma characters.
)              # End the capturing group.
(              # Capture the enclosed match to backreference 3:
  ,?\          #   Match ", " or " ".
  (?:          #   Group but don't capture:
    [JS]r\.?   #     Match "Jr", "Jr.", "Sr", or "Sr.".
   |           #    Or:
    III?       #     Match "II" or "III".
   |           #    Or:
    IV         #     Match "IV".
  )            #   End the noncapturing group.
)?             # Make the group optional.
$              # Assert position at the end of the string.

Regex options: Case insensitive, free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
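Standard JavaScript has no free-spacing option, which is why XRegExp rather than JavaScript appears in the flavor list above. If you want the same kind of annotated layout without an extra library, one possible approach is to assemble the pattern from commented string fragments and pass it to the RegExp constructor. The following is only a sketch: the variable name namePattern is made up for this illustration, and backslashes must be doubled because the fragments are string literals.

```javascript
// Build the pattern from commented fragments. The assembled pattern is
// meant to be equivalent to the one-line regex shown in the Solution.
const namePattern = new RegExp(
    "^" +                            // Start of the string
    "(.+?)" +                        // Group 1: first name, middle names, particles
    " " +                            // Literal space
    "([^\\s,]+)" +                   // Group 2: last name
    "(,? (?:[JS]r\\.?|III?|IV))?" +  // Group 3: optional suffix
    "$",                             // End of the string
    "i"                              // Case insensitive
);

function formatName(name) {
    return name.replace(namePattern, "$2, $1$3");
}

console.log(formatName("John F. Kennedy"));  // Kennedy, John F.
```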

This regular expression makes the following assumptions about the subject data:

It contains at least one first name and one last name (other name parts are optional).
The first name is listed before the last name (not the norm with some national conventions).
If the name contains a suffix, it is one of the values “Jr”, “Jr.”, “Sr”, “Sr.”, “II”, “III”, or “IV”, with an optional preceding comma.

A few more issues to consider:

The regular expression cannot identify compound surnames that don’t use hyphens. For example, Sacha Baron Cohen would be replaced with Cohen, Sacha Baron, rather than the correct listing, Baron Cohen, Sacha.
It does not keep particles in front of the family name, although this is sometimes called for by convention or personal preference (for example, the correct alphabetical listing of “Charles de Gaulle” is “de Gaulle, Charles” according to the Chicago Manual of Style, 16th Edition, which contradicts Merriam-Webster’s Biographical Dictionary on this particular name).
Because of the ‹^› and ‹$› anchors that bind the match to the beginning and end of the string, no replacement can be made if the entire subject text does not fit the pattern. Hence, if no suitable match is found (for example, if the subject text contains only one name), the name is left unaltered.
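The limitations just listed are easy to see by running a few names through the formatName function from the Solution. This short sketch repeats that function so it can be run on its own. The single-name input “Cher” is an example chosen here to illustrate the no-match case.

```javascript
function formatName(name) {
    return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i,
                        "$2, $1$3");
}

// Compound surname without a hyphen: only the final word is treated
// as the last name.
console.log(formatName("Sacha Baron Cohen"));  // Cohen, Sacha Baron

// Surname particle: "de" stays with the first name.
console.log(formatName("Charles de Gaulle"));  // Gaulle, Charles de

// Only one name: the regex cannot match, so the input is returned as is.
console.log(formatName("Cher"));               // Cher
```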

As for how the regular expression works, it uses three capturing groups to split up the name. The pieces are then reassembled in the desired order via backreferences in the replacement string. Capturing group 1 uses the maximally flexible ‹.+?› pattern to grab the first name along with any number of middle names and surname particles, such as the German “von” or the French, Portuguese, and Spanish “de.” These name parts are handled together because they are listed sequentially in the output. Lumping the first and middle names together also helps avoid errors, because the regular expression cannot distinguish between a compound first name, such as “Mary Lou” or “Norma Jeane,” and a first name plus middle name. Even humans cannot accurately make the distinction just by visual examination.

Capturing group 2 matches the last name using ‹[^\s,]+›. Like the dot used in capturing group 1, the flexibility of this character class allows it to match accented letters and characters from non-Latin scripts. Capturing group 3 matches an optional suffix, such as “Jr.” or “III,” from a predefined list of possible values. The suffix is handled separately from the last name because it should continue to appear at the end of the reformatted name.
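To see what each capturing group actually holds, it can help to inspect the match directly rather than going straight to the replacement. The following sketch does that with a hypothetical helper called splitName. The helper and its property names (leading, last, suffix) are invented for this illustration and are not part of the recipe.

```javascript
const nameRegex = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

// Hypothetical helper: returns the three captured pieces, or null
// when the name does not fit the pattern.
function splitName(name) {
    const m = nameRegex.exec(name);
    if (m === null) {
        return null;
    }
    return {
        leading: m[1],       // Group 1: first name, middle names, particles
        last: m[2],          // Group 2: last name
        suffix: m[3] || ""   // Group 3: undefined when there is no suffix
    };
}

const king = splitName("Martin Luther King, Jr.");
console.log(king.leading);  // Martin Luther
console.log(king.last);     // King
console.log(king.suffix);   // , Jr.

const beethoven = splitName("Ludwig van Beethoven");
console.log(beethoven.leading);  // Ludwig van
console.log(beethoven.last);     // Beethoven
```

Note that group 3 includes the separator in front of the suffix (a space, optionally preceded by a comma), which is why the replacement text can append it directly after group 1.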

Let’s go back for a minute to capturing group 1. Why was the dot within group 1 followed by the lazy ‹+?› quantifier, whereas the character class in group 2 was followed by the greedy ‹+› quantifier? If group 1 (which handles a variable number of elements and therefore needs to go as far as it can into the name) used a greedy quantifier, capturing group 3 (which attempts to match a suffix) wouldn’t have a shot at participating in the match. The dot from group 1 would match until the end of the string, and since capturing group 3 is optional, the regex engine would only backtrack enough to find a match for group 2 before declaring success. Capturing group 2 can use a greedy quantifier because its more restrictive character class only allows it to match one name.
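The effect of the lazy quantifier can be checked by comparing the regex against a variant in which group 1 is greedy. In this sketch, the variable names lazy and greedy are chosen for illustration, and the greedy variant is shown only to demonstrate the failure described above.

```javascript
const lazy   = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;
const greedy = /^(.+) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

const name = "Robert Downey, Jr.";

// Lazy group 1 stops early, leaving the suffix for group 3.
console.log(name.replace(lazy, "$2, $1$3"));    // Downey, Robert, Jr.

// Greedy group 1 backtracks only to the last space, so group 2 takes
// "Jr." as the last name and group 3 never participates.
console.log(name.replace(greedy, "$2, $1$3"));  // Jr., Robert Downey,
```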

Table 4-2 shows some examples of how names are formatted using this regular expression and replacement string.

Table 4-2. Formatted names

Input	Output
`Robert Downey, Jr.`	`Downey, Robert, Jr.`
`John F. Kennedy`	`Kennedy, John F.`
`Scarlett O’Hara`	`O’Hara, Scarlett`
`Pepé Le Pew`	`Pew, Pepé Le`
`J.R.R. Tolkien`	`Tolkien, J.R.R.`
`Catherine Zeta-Jones`	`Zeta-Jones, Catherine`

Variations

List surname particles at the beginning of the name

An added segment in the following regular expression allows you to output surname particles from a predefined list in front of the last name. Specifically, this regular expression accounts for the values “de”, “du”, “la”, “le”, “St”, “St.”, “Ste”, “Ste.”, “van”, and “von”. Any number of these values are allowed in sequence (for example, “de la”):

^(.+?)●((?:(?:d[eu]|l[ae]|Ste?\.?|v[ao]n)●)*[^\s,]+)↵
(,?●(?:[JS]r\.?|III?|IV))?$