1.10. Parsing Fixed-Width Delimited Data

Problem

You need to break apart fixed-width records in strings.

Solution

Use substr( ) :

$fp = fopen('fixed-width-records.txt','r') or die ("can't open file");
while (($s = fgets($fp,1024)) !== false) {
    $fields[1] = substr($s,0,10);  // first field:  first 10 characters of the line
    $fields[2] = substr($s,10,5);  // second field: next 5 characters of the line
    $fields[3] = substr($s,15,12); // third field:  next 12 characters of the line
    // a function to do something with the fields
    process_fields($fields);
}
fclose($fp) or die("can't close file");

Or unpack( ) :

$fp = fopen('fixed-width-records.txt','r') or die ("can't open file");
while (($s = fgets($fp,1024)) !== false) {
    // an associative array with keys "title", "author", and "publication_year"
    $fields = unpack('A25title/A14author/A4publication_year',$s);
    // a function to do something with the fields
    process_fields($fields);
}
fclose($fp) or die("can't close file");

Discussion

Data in which each field is allotted a fixed number of characters per line may look like this list of book titles, authors, and publication years:

$booklist=<<<END
Elmer Gantry             Sinclair Lewis1927
The Scarlatti InheritanceRobert Ludlum 1971
The Parsifal Mosaic      Robert Ludlum 1982
Sophie's Choice          William Styron1979
END;

In each line, the title occupies the first 25 characters, the author’s name the next 14 characters, and the publication year the next 4 characters. Knowing those field widths, it’s straightforward to use substr( ) to parse the fields into an array:

$books = explode("\n",$booklist);

for($i = 0, $j = count($books); $i < $j; $i++) {
  $book_array[$i]['title'] = substr($books[$i],0,25);
  $book_array[$i]['author'] = substr($books[$i],25,14);
  $book_array[$i]['publication_year'] = substr($books[$i],39,4);
}

Exploding $booklist into an array of lines makes the looping code the same whether it’s operating over a string or a series of lines read in from a file.
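As a rough illustration of that point, the sketch below builds the same array of lines twice: once with explode( ) on a string and once with file( ) on a file. The temporary file and the two-line sample are assumptions made only for this example; file( ) with FILE_IGNORE_NEW_LINES is one of several ways to read lines without their line endings.

```php
<?php
// Two sample records; str_pad() guarantees the 25/14/4 field widths.
$booklist = str_pad('Elmer Gantry', 25) . str_pad('Sinclair Lewis', 14) . "1927\n"
          . str_pad('The Parsifal Mosaic', 25) . str_pad('Robert Ludlum', 14) . '1982';

// Lines taken from a string.
$from_string = explode("\n", $booklist);

// The same lines taken from a file. The temporary file stands in for a
// real data file and is removed again right away.
$path = tempnam(sys_get_temp_dir(), 'fw');
file_put_contents($path, $booklist . "\n");
// FILE_IGNORE_NEW_LINES strips the line ending from each element.
$from_file = file($path, FILE_IGNORE_NEW_LINES);
unlink($path);

// Both arrays hold identical lines, so one parsing loop serves both.
var_dump($from_string === $from_file);
```

Either array can then be passed to the same field-extraction loop.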

The loop can be made more flexible by specifying the field names and widths in a separate array that can be passed to a parsing function, as shown in the pc_fixed_width_substr( ) function in Example 1-3.

Example 1-3. pc_fixed_width_substr( )

function pc_fixed_width_substr($fields,$data) {
  $r = array();
  for ($i = 0, $j = count($data); $i < $j; $i++) {
    $line_pos = 0;
    foreach($fields as $field_name => $field_length) {
      $r[$i][$field_name] = rtrim(substr($data[$i],$line_pos,$field_length));
      $line_pos += $field_length;
    }
  }
  return $r;
}

$book_fields = array('title' => 25,
                     'author' => 14,
                     'publication_year' => 4);

$book_array = pc_fixed_width_substr($book_fields,$books);

The variable $line_pos keeps track of the start of each field, and is advanced by the previous field’s width as the code moves through each line. Use rtrim( ) to remove trailing whitespace from each field.
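To make the role of $line_pos concrete, here is a small trace of the inner loop for a single line. It is only a sketch: the $offsets array is an addition for illustration and is not part of pc_fixed_width_substr( ).

```php
<?php
// Field widths from the book list.
$book_fields = array('title' => 25,
                     'author' => 14,
                     'publication_year' => 4);

// One sample line; str_pad() guarantees the exact field widths.
$line = str_pad('Elmer Gantry', 25) . str_pad('Sinclair Lewis', 14) . '1927';

$line_pos = 0;
$offsets = array();   // illustration only: records where each field starts
$row = array();
foreach ($book_fields as $field_name => $field_length) {
    $offsets[$field_name] = $line_pos;
    $row[$field_name] = rtrim(substr($line, $line_pos, $field_length));
    $line_pos += $field_length;   // start of the next field
}

print_r($offsets);  // title => 0, author => 25, publication_year => 39
print_r($row);      // title => Elmer Gantry, author => Sinclair Lewis, publication_year => 1927
```

After the last field, $line_pos equals the total record width (43 here).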

You can use unpack( ) as a substitute for substr( ) to extract fields. Instead of specifying the field names and widths as an associative array, create a format string for unpack( ). A fixed-width field extractor using unpack( ) looks like the pc_fixed_width_unpack( ) function shown in Example 1-4.

Example 1-4. pc_fixed_width_unpack( )

function pc_fixed_width_unpack($format_string,$data) {
  $r = array();
  for ($i = 0, $j = count($data); $i < $j; $i++) {
    $r[$i] = unpack($format_string,$data[$i]);
  }
  return $r;
}

$book_array = pc_fixed_width_unpack('A25title/A14author/A4publication_year',
                                    $books);

Because the A format to unpack( ) means “space padded string,” there’s no need to rtrim( ) off the trailing spaces.
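If you would rather keep a single associative array of field widths for both approaches, the format string can be generated from it. The fixed_width_format( ) helper below is hypothetical and not part of the recipe; it also makes it easy to compare the A code with the a code, which leaves the space padding in place.

```php
<?php
// Hypothetical helper: converts array('title' => 25, ...) into an unpack()
// format such as 'A25title/A14author/A4publication_year'.
// Assumes field names do not start with a digit, because unpack() would
// read a leading digit as part of the width.
function fixed_width_format($fields, $code = 'A') {
    $parts = array();
    foreach ($fields as $field_name => $field_length) {
        $parts[] = $code . $field_length . $field_name;
    }
    return join('/', $parts);
}

$book_fields = array('title' => 25,
                     'author' => 14,
                     'publication_year' => 4);
$line = str_pad('Elmer Gantry', 25) . str_pad('Sinclair Lewis', 14) . '1927';

$trimmed = unpack(fixed_width_format($book_fields), $line);       // A codes
$padded  = unpack(fixed_width_format($book_fields, 'a'), $line);  // a codes

var_dump(strlen($trimmed['title'])); // int(12): trailing spaces removed
var_dump(strlen($padded['title']));  // int(25): padding kept
```

Note that unpack( ) returns false (with a warning) when a line is shorter than the format requires, so blank or truncated lines are worth filtering out before parsing.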

Once the fields have been parsed into $book_array by either function, the data can be printed as an HTML table, for example:

$book_array = pc_fixed_width_unpack('A25title/A14author/A4publication_year',
                                    $books);
print "<table>\n";
// print a header row
print '<tr><td>';
print join('</td><td>',array_keys($book_array[0]));
print "</td></tr>\n";
// print each data row
foreach ($book_array as $row) {
    print '<tr><td>';
    print join('</td><td>',array_values($row));
    print "</td></tr>\n";
}
print "</table>\n";

Joining data on </td><td> produces a table row that is missing its first <td> and last </td>. We produce a complete table row by printing out <tr><td> before the joined data and </td></tr> after the joined data.
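The same row-building idea can be wrapped in a small function. The table_row( ) helper below is hypothetical, and the call to htmlspecialchars( ) is an added assumption: it is useful when field values might contain characters such as & or <, which the recipe's sample data does not.

```php
<?php
// Hypothetical helper: escapes each cell, joins the cells on '</td><td>',
// and adds the opening '<tr><td>' and closing '</td></tr>' that join()
// alone leaves out.
function table_row($cells) {
    $escaped = array_map('htmlspecialchars', $cells);
    return '<tr><td>' . join('</td><td>', $escaped) . "</td></tr>\n";
}

$row = array('title' => 'Elmer Gantry',
             'author' => 'Sinclair Lewis',
             'publication_year' => '1927');

print "<table>\n";
print table_row(array_keys($row));   // header row
print table_row($row);               // data row
print "</table>\n";
```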

substr( ) and unpack( ) are equally capable when the fixed-width fields are all strings, but unpack( ) is the better choice when some fields hold other kinds of data, such as binary-packed integers, which its format codes can decode directly.
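As a sketch of that difference, consider a record that mixes a text field with binary integers. The layout here (a 10-character name, a 16-bit big-endian quantity, and a 32-bit big-endian ID) is invented for illustration, and pack( ) is used only to produce sample data.

```php
<?php
// Build one sample record: A10 = space-padded string, n = unsigned 16-bit
// big-endian integer, N = unsigned 32-bit big-endian integer.
$record = pack('A10nN', 'widget', 513, 70000);

// unpack() decodes the string and both integers in one call.
$fields = unpack('A10name/nquantity/Nid', $record);
print_r($fields);   // name => widget, quantity => 513, id => 70000

// substr() can only return the raw bytes of the quantity field;
// turning them into a number would take extra work.
$raw = substr($record, 10, 2);
print bin2hex($raw) . "\n";   // 0201
```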

See Also

For more information about unpack( ), see Recipe 1.14 and http://www.php.net/unpack; Recipe 4.9 discusses join( ).
