
Regular Expressions Cookbook by Steven Levithan, Jan Goyvaerts

4.13. Validate ISBNs

Problem

You need to check the validity of an International Standard Book Number (ISBN), which can be in either the older ISBN-10 or the current ISBN-13 format. You want to allow a leading ISBN identifier, and ISBN parts can optionally be separated by hyphens or spaces. ISBN 978-0-596-52068-7, ISBN-13: 978-0-596-52068-7, 978 0 596 52068 7, 9780596520687, ISBN-10 0-596-52068-9, and 0-596-52068-9 are all examples of valid input.

Solution

You cannot validate an ISBN using a regex alone, because the last digit is computed using a checksum algorithm. The regular expressions in this section validate the format of an ISBN, whereas the subsequent code examples include a validity check for the final digit.

Regular expressions

ISBN-10:

^(?:ISBN(?:-10)?:? )?(?=[-0-9X ]{13}$|[0-9X]{10}$)[0-9]{1,5}[- ]?↵
(?:[0-9]+[- ]?){2}[0-9X]$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

ISBN-13:

^(?:ISBN(?:-13)?:? )?(?=[-0-9 ]{17}$|[0-9]{13}$)97[89][- ]?[0-9]{1,5}↵
[- ]?(?:[0-9]+[- ]?){2}[0-9]$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

ISBN-10 or ISBN-13:

^(?:ISBN(?:-1[03])?:? )?(?=[-0-9 ]{17}$|[-0-9X ]{13}$|[0-9X]{10}$)↵
(?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

JavaScript

// `regex` checks for ISBN-10 or ISBN-13 format
var regex = /^(?:ISBN(?:-1[03])?:? )?(?=[-0-9 ]{17}$|[-0-9X ]{13}$|↵
[0-9X]{10}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$/;

if (regex.test(subject)) {
  // Remove non ISBN digits, then split into an array
  var chars = subject.replace(/[^0-9X]/g, "").split("");
  // Remove the final ISBN digit from `chars`, and assign it to `last`
  var last  = chars.pop();
  var sum   = 0;
  var digit = 10;
  var check;

  if (chars.length == 9) {
    // Compute the ISBN-10 check digit
    for (var i = 0; i < chars.length; i++) {
      sum += digit * parseInt(chars[i], 10);
      digit -= 1;
    }
    check = 11 - (sum % 11);
    if (check == 10) {
      check = "X";
    } else if (check == 11) {
      check = "0";
    }
  } else {
    // Compute the ISBN-13 check digit
    for (var i = 0; i < chars.length; i++) {
      sum += (i % 2 * 2 + 1) * parseInt(chars[i], 10);
    }
    check = 10 - (sum % 10);
    if (check == 10) {
      check = "0";
    }
  }

  if (check == last) {
    alert("Valid ISBN");
  } else {
    alert("Invalid ISBN check digit");
  }
} else {
  alert("Invalid ISBN");
}

Python

import re
import sys

# `regex` checks for ISBN-10 or ISBN-13 format
regex = re.compile(
  r"^(?:ISBN(?:-1[03])?:? )?(?=[-0-9 ]{17}$|"
  r"[-0-9X ]{13}$|[0-9X]{10}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?"
  r"(?:[0-9]+[- ]?){2}[0-9X]$")

subject = sys.argv[1]

if regex.search(subject):
  # Remove non ISBN digits, then split into a list of characters
  chars = list(re.sub("[^0-9X]", "", subject))
  # Remove the final ISBN digit from `chars`, and assign it to `last`
  last  = chars.pop()

  if len(chars) == 9:
    # Compute the ISBN-10 check digit
    val = sum((x + 2) * int(y) for x,y in enumerate(reversed(chars)))
    check = 11 - (val % 11)
    if check == 10:
      check = "X"
    elif check == 11:
      check = "0"
  else:
    # Compute the ISBN-13 check digit
    val = sum((x % 2 * 2 + 1) * int(y) for x,y in enumerate(chars))
    check = 10 - (val % 10)
    if check == 10:
      check = "0"

  if str(check) == last:
    print("Valid ISBN")
  else:
    print("Invalid ISBN check digit")
else:
  print("Invalid ISBN")

Other programming languages

See Recipe 3.5 for help with implementing these regular expressions in other programming languages.

Discussion

An ISBN is a unique identifier for commercial books and book-like products. The 10-digit ISBN format was published as an international standard, ISO 2108, in 1970. All ISBNs assigned since January 1, 2007 are 13 digits.

ISBN-10 and ISBN-13 numbers are divided into four or five elements, respectively. Three of the elements are of variable length; the remaining one or two elements are of fixed length. The elements are usually separated with hyphens or spaces. A brief description of each element follows:

  • 13-digit ISBNs start with the prefix 978 or 979.

  • The group identifier identifies the language-sharing country group. It ranges from one to five digits long.

  • The publisher identifier varies in length and is assigned by the national ISBN agency.

  • The title identifier also varies in length and is selected by the publisher.

  • The final character is called the check digit, and is computed using a checksum algorithm. An ISBN-10 check digit can be either a number from 0 to 9 or the letter X (Roman numeral for 10), while an ISBN-13 check digit ranges from 0 to 9. The allowed characters are different because the two ISBN types use different checksum algorithms.
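
As a rough illustration of the element structure just described, the following Python sketch splits a separated ISBN into labeled parts. The function name split_isbn_elements and the dictionary keys are invented for this example. The sketch assumes that every element is already separated by a hyphen or space; it does not attempt to handle unseparated input, and it performs no checksum validation.

```python
import re

def split_isbn_elements(isbn):
    """Split a hyphen- or space-separated ISBN into labeled elements.

    Hypothetical helper for illustration only: it assumes a separator
    between every element and does not validate the check digit.
    """
    parts = re.split(r"[- ]", isbn)
    if len(parts) == 5:
        # ISBN-13: the fixed-length 978/979 prefix comes first
        names = ["prefix", "group", "publisher", "title", "check_digit"]
    elif len(parts) == 4:
        # ISBN-10: no prefix element
        names = ["group", "publisher", "title", "check_digit"]
    else:
        raise ValueError("expected 4 or 5 separated elements")
    return dict(zip(names, parts))

print(split_isbn_elements("978-0-596-52068-7"))
print(split_isbn_elements("0 596 52068 9"))
```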

The parts of the “ISBN-10 or ISBN-13” regex are shown in the following breakdown. Because this regex is written in free-spacing mode, the literal space characters in the regex have been escaped with backslashes. Java requires that even spaces within character classes be escaped in free-spacing mode:

^                 # Assert position at the beginning of the string.
(?:               # Group but don't capture...
  ISBN            #   Match the text "ISBN".
  (?:-1[03])?     #   Optionally match the text "-10" or "-13".
  :?              #   Optionally match a literal ":".
  \               #   Match a space character (escaped).
)?                # Repeat the group between zero and one time.
(?=               # Assert that the following can be matched here...
  [-0-9\ ]{17}$   #   Match 17 hyphens, digits, and spaces, then the end
 |                #     of the string. Or...
  [-0-9X\ ]{13}$  #   Match 13 hyphens, digits, Xs, and spaces, then the
 |                #     end of the string. Or...
  [0-9X]{10}$     #   Match 10 digits and Xs, then the end of the string.
)                 # End the positive lookahead.
(?:               # Group but don't capture...
  97[89]          #   Match the text "978" or "979".
  [-\ ]?          #   Optionally match a hyphen or space.
)?                # Repeat the group between zero and one time.
[0-9]{1,5}        # Match a digit between one and five times.
[-\ ]?            # Optionally match a hyphen or space.
(?:               # Group but don't capture...
  [0-9]+          #   Match a digit between one and unlimited times.
  [-\ ]?          #   Optionally match a hyphen or space.
){2}              # Repeat the group exactly two times.
[0-9X]            # Match a digit or "X".
$                 # Assert position at the end of the string.
Regex options: Free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

The leading (?:ISBN(?:-1[03])?:? )? has three optional elements, allowing it to match any one of the following seven strings (all except the empty-string option include a space character at the end):

  • ISBN

  • ISBN-10

  • ISBN-13

  • ISBN:

  • ISBN-10:

  • ISBN-13:

  • The empty string (no prefix)

Next, the positive lookahead (?=[-0-9 ]{17}$|[-0-9X ]{13}$|[0-9X]{10}$) enforces one of three options (separated by the | alternation operator) for the length and character set of the rest of the match. All three options (shown next) end with the $ anchor, which ensures that there cannot be any trailing text that doesn’t fit into one of the patterns:

[-0-9 ]{17}$

Allows an ISBN-13 with four separators (17 total characters)

[-0-9X ]{13}$

Allows an ISBN-13 with no separators or an ISBN-10 with three separators (13 total characters)

[0-9X]{10}$

Allows an ISBN-10 with no separators (10 total characters)

After the positive lookahead validates the length and character set, we can match the individual elements of the ISBN without worrying about their combined length. (?:97[89][- ]?)? matches the “978” or “979” prefix required by an ISBN-13. The noncapturing group is optional because it will not match within an ISBN-10 subject string. [0-9]{1,5}[- ]? matches the one to five digit group identifier and an optional, following separator. (?:[0-9]+[- ]?){2} matches the variable-length publisher and title identifiers, along with their optional separators. Finally, [0-9X]$ matches the check digit at the end of the string.
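
To see how the prefix, lookahead, and element patterns work together, here is a small Python sketch that runs the format check against the sample inputs from the Problem statement, plus two strings whose lengths fit none of the lookahead options. It uses the space-aware form of the regex from the JavaScript and Python solutions. The names ISBN_FORMAT and has_isbn_format are made up for this example, and the check covers the format only, not the checksum.

```python
import re

# Space-aware "ISBN-10 or ISBN-13" format regex, split across adjacent
# string literals for readability.
ISBN_FORMAT = re.compile(
    r"^(?:ISBN(?:-1[03])?:? )?"
    r"(?=[-0-9 ]{17}$|[-0-9X ]{13}$|[0-9X]{10}$)"
    r"(?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$"
)

def has_isbn_format(text):
    """Return True if text has a valid ISBN format (checksum not verified)."""
    return ISBN_FORMAT.search(text) is not None

samples = [
    "ISBN 978-0-596-52068-7",
    "ISBN-13: 978-0-596-52068-7",
    "978 0 596 52068 7",
    "9780596520687",
    "ISBN-10 0-596-52068-9",
    "0-596-52068-9",
    "0-596-52068",         # 11 characters: fits no lookahead option
    "978-0-596-52068-77",  # 18 characters: fits no lookahead option
]
for sample in samples:
    print(sample, "->", has_isbn_format(sample))
```

The first six samples pass the format check; the last two are rejected by the lookahead before any of the element patterns are tried.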

Although a regular expression can check that the final digit uses a valid character (a digit or X), it cannot determine whether it’s correct for the ISBN’s checksum. One of two checksum algorithms (determined by whether you’re working with an ISBN-10 or ISBN-13 number) is used to provide some level of assurance that the ISBN digits haven’t been accidentally transposed or otherwise entered incorrectly. The JavaScript and Python example code shown earlier implemented both algorithms. The following sections describe the checksum rules in order to help you implement these algorithms with other programming languages.

ISBN-10 checksum

The check digit for an ISBN-10 number ranges from 0 to 10 (with the Roman numeral X used instead of 10). It is computed as follows:

  1. Multiply each of the first 9 digits by a number in the descending sequence from 10 to 2, and sum the results.

  2. Divide the sum by 11.

  3. Subtract the remainder (not the quotient) from 11.

  4. If the result is 11, use the number 0; if 10, use the letter X.

Here’s an example of how to derive the ISBN-10 check digit for 0-596-52068-?:

Step 1:
sum = 10×0 + 9×5 + 8×9 + 7×6 + 6×5 + 5×2 + 4×0 + 3×6 + 2×8
    =    0 +  45 +  72 +  42 +  30 +  10 +   0 +  18 +  16
    = 233
Step 2:
    233 ÷ 11 = 21, remainder 2
Step 3:
    11 − 2 = 9
Step 4:
    9 [no substitution required]

The check digit is 9, so the complete sequence is ISBN 0-596-52068-9.
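
The four steps above can be sketched as a standalone Python function. The name isbn10_check_digit is invented for this example, and the function assumes its argument is the first nine digits with any separators already removed.

```python
import re

def isbn10_check_digit(first_nine):
    """Compute the ISBN-10 check character for a string of nine digits.

    Hypothetical helper; assumes separators have already been stripped.
    """
    if re.fullmatch(r"[0-9]{9}", first_nine) is None:
        raise ValueError("expected exactly 9 digits")
    # Step 1: multiply by the descending weights 10 down to 2, then sum.
    weights = range(10, 1, -1)
    total = sum(w * int(d) for w, d in zip(weights, first_nine))
    # Steps 2 and 3: subtract the remainder of division by 11 from 11.
    result = 11 - (total % 11)
    # Step 4: 11 becomes "0" and 10 becomes "X".
    if result == 11:
        return "0"
    if result == 10:
        return "X"
    return str(result)

print(isbn10_check_digit("059652068"))  # the worked example above gives "9"
```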

ISBN-13 checksum

An ISBN-13 check digit ranges from 0 to 9, and is computed using similar steps.

  1. Multiply each of the first 12 digits by 1 or 3, alternating as you move from left to right, and sum the results.

  2. Divide the sum by 10.

  3. Subtract the remainder (not the quotient) from 10.

  4. If the result is 10, use the number 0.

For example, the ISBN-13 check digit for 978-0-596-52068-? is calculated as follows:

Step 1:
sum = 1×9 + 3×7 + 1×8 + 3×0 + 1×5 + 3×9 + 1×6 + 3×5 + 1×2 + 3×0 + 1×6 + 3×8
    =   9 +  21 +   8 +   0 +   5 +  27 +   6 +  15 +   2 +   0 +   6 +  24
    = 123
Step 2:
    123 ÷ 10 = 12, remainder 3
Step 3:
    10 − 3 = 7
Step 4:
    7 [no substitution required]

The check digit is 7, and the complete sequence is ISBN 978-0-596-52068-7.
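
The same steps can be sketched in Python for ISBN-13. As before, the function name isbn13_check_digit is hypothetical, and the input is assumed to be the first 12 digits with separators removed.

```python
import re

def isbn13_check_digit(first_twelve):
    """Compute the ISBN-13 check digit for a string of twelve digits.

    Hypothetical helper; assumes separators have already been stripped.
    """
    if re.fullmatch(r"[0-9]{12}", first_twelve) is None:
        raise ValueError("expected exactly 12 digits")
    # Step 1: weights alternate 1, 3, 1, 3, ... from left to right.
    total = sum((3 if i % 2 else 1) * int(d)
                for i, d in enumerate(first_twelve))
    # Steps 2 and 3: subtract the remainder of division by 10 from 10.
    result = 10 - (total % 10)
    # Step 4: a result of 10 becomes 0.
    return "0" if result == 10 else str(result)

print(isbn13_check_digit("978059652068"))  # the worked example above gives "7"
```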

Variations

Find ISBNs in documents

This version of the “ISBN-10 or ISBN-13” regex uses word boundaries instead of anchors to help you find ISBNs within longer text while ensuring that they stand on their own. The “ISBN” identifier has also been made a required string in this version, for two reasons. First, requiring it helps eliminate false positives (without it, the regex could potentially match any 10- or 13-digit number), and second, ISBNs are officially required to use this identifier when printed:

\bISBN(?:-1[03])?:? (?=[-0-9 ]{17}\b|[-0-9X ]{13}\b|[0-9X]{10}\b)↵
(?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
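
Here is a minimal Python sketch of searching a longer text with a regex of this kind. One assumption is worth flagging: the sketch ends each lookahead option with \b rather than $, because a $ inside the lookahead would tie every match to the end of the subject text. The names ISBN_IN_TEXT and find_isbns are invented for this example, and the matches are checked for format only.

```python
import re

# Word-boundary variant of the "ISBN-10 or ISBN-13" regex. Assumption:
# the lookahead options end with \b (not $) so matches can occur
# anywhere in the text.
ISBN_IN_TEXT = re.compile(
    r"\bISBN(?:-1[03])?:? "
    r"(?=[-0-9 ]{17}\b|[-0-9X ]{13}\b|[0-9X]{10}\b)"
    r"(?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]\b"
)

def find_isbns(text):
    """Return every ISBN-like match in text, including its identifier."""
    return [m.group(0) for m in ISBN_IN_TEXT.finditer(text)]

text = ("Two editions: ISBN 978-0-596-52068-7 (paperback) and "
        "ISBN-10: 0-596-52068-9. Order no. 1234567890 is not an ISBN.")
print(find_isbns(text))
```

Because the “ISBN” identifier is required, the bare order number in the sample text is not reported. Each match would still need its check digit verified separately.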

Eliminate incorrect ISBN identifiers

A limitation of the previous regexes is that they allow matching an ISBN-10 number preceded by the “ISBN-13” identifier, and vice versa. The following regex uses regex conditionals (see Recipe 2.17) to ensure that an “ISBN-10” or “ISBN-13” identifier is followed by the appropriate ISBN type. It allows both ISBN-10 and ISBN-13 numbers when the type is not explicitly specified. This regex is overkill in most circumstances because the same result could be achieved more manageably using the ISBN-10 and ISBN-13 specific regexes that were shown earlier, one at a time. It’s included here merely to demonstrate an interesting use of regular expressions:

^
(?:ISBN(-1(?:(0)|3))?:?\ )?
(?(1)
  (?(2)
    (?=[-0-9X ]{13}$|[0-9X]{10}$)
    [0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$
   |
    (?=[-0-9 ]{17}$|[0-9]{13}$)
    97[89][- ]?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9]$
  )
 |
  (?=[-0-9 ]{17}$|[-0-9X ]{13}$|[0-9X]{10}$)
  (?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$
)
$
Regex options: Free-spacing
Regex flavors: .NET, PCRE, Perl, Python
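
As a rough check of the conditional behavior, the following Python sketch compiles the regex above with re.VERBOSE (Python's free-spacing option) and tries it on matching and mismatching identifier/number pairs. The name STRICT_ISBN and the sample strings are chosen for this example only.

```python
import re

# The conditional regex from above. In Python's verbose mode, spaces
# inside character classes are kept as literal spaces.
STRICT_ISBN = re.compile(r"""
    ^
    (?:ISBN(-1(?:(0)|3))?:?\ )?
    (?(1)
      (?(2)
        (?=[-0-9X ]{13}$|[0-9X]{10}$)
        [0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$
       |
        (?=[-0-9 ]{17}$|[0-9]{13}$)
        97[89][- ]?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9]$
      )
     |
      (?=[-0-9 ]{17}$|[-0-9X ]{13}$|[0-9X]{10}$)
      (?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$
    )
    $
    """, re.VERBOSE)

samples = [
    "ISBN-10: 0-596-52068-9",      # identifier and number agree
    "ISBN-13: 978-0-596-52068-7",  # identifier and number agree
    "ISBN-13: 0-596-52068-9",      # ISBN-10 number after "ISBN-13"
    "ISBN-10: 978-0-596-52068-7",  # ISBN-13 number after "ISBN-10"
    "ISBN 9780596520687",          # no type given: either is allowed
]
for sample in samples:
    print(sample, "->", STRICT_ISBN.search(sample) is not None)
```

The two mismatched samples are rejected, while the untyped “ISBN” identifier accepts either format.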

See Also

The most up-to-date version of the ISBN Users’ Manual can be found on the International ISBN Agency’s website at http://www.isbn-international.org.

The official Numerical List of Group Identifiers at http://www.isbn-international.org/en/identifiers/allidentifiers.html can help you identify a book’s originating country or area based on the first 1 to 5 digits of its ISBN.
