Chapter 4. Text Processing and File Management

Ruby fills a lot of the same roles that languages such as Perl and Python do. Because of this, you can expect to find first-rate support for text processing and file management. Whether it’s parsing a text file with some regular expressions or building some *nix-style filter applications, Ruby can help make life easier.

However, many of Ruby’s I/O facilities are tersely documented at best. It is also relatively hard to find good resources that show you general strategies for attacking common text-processing tasks. This chapter aims to expose you to some good tricks that you can use to simplify your text-processing needs, as well as to sharpen your skills when it comes to interacting with and managing files on your system.

As in other chapters, we’ll start off by looking at some real open source code—this time, a simple parser for an Adobe Font Metrics (AFM) file. This example will show you text processing in a real-world setting. We’ll then follow up with a number of detailed sections that look at different practices that will help you master basic I/O skills. Armed with these techniques, you’ll be able to take on all sorts of text-processing and file-management tasks with ease.

Line-Based File Processing with State Tracking

Processing a text document line by line does not mean that we’re limited to extracting content in a uniform way, treating each line identically. Some files have more structure than that, but can still benefit from being processed linearly. We’re now going to look over a small parser that illustrates this general idea by selecting different ways to extract our data based on what section of a file we are in.

The code in this section was written by James Edward Gray II as part of Prawn’s AFM support. Though the example itself is domain-specific, we won’t get hung up on the particular details of this parser. Instead, we’ll be taking a look at the general approach for how to build a state-aware parser that operates on an efficient line-by-line basis. Along the way, you’ll pick up some basic I/O tips and tricks, as well as see the important role that regular expressions often play in this sort of task.

Before we look at the actual parser, let’s glance at the sort of data we’re dealing with. AFM files are essentially font glyph measurements and specifications, so they tend to look a bit like configuration files. Some entries are simple key/value pairs, such as:

CapHeight 718
XHeight 523
Ascender 718
Descender -207

Others are organized sets of values within a section, as in the following example:

StartCharMetrics 315
C 32 ; WX 278 ; N space ; B 0 0 0 0 ;
C 33 ; WX 278 ; N exclam ; B 90 0 187 718 ;
C 34 ; WX 355 ; N quotedbl ; B 70 463 285 718 ;
C 35 ; WX 556 ; N numbersign ; B 28 0 529 688 ;
C 36 ; WX 556 ; N dollar ; B 32 -115 520 775 ;
....
EndCharMetrics

Sections can be nested within each other, making things more interesting. The data across the file does not fit a uniform format, as each section represents a different sort of thing. However, we can come up with patterns to parse data in each section that we’re interested in, because they are consistent within their sections. We also are interested in only a subset of the sections, so we can safely ignore some of them. This is the essence of the task we needed to accomplish, but as you may have noticed, it’s a fairly abstract pattern that we can reuse. Many documents with a simple section-based structure can be worked with using the approach shown here.

The code that follows is essentially a simple finite state machine that keeps track of what section the current line appears in. It attempts to parse the opening or closing of a section first, and then it uses this information to determine a parsing strategy for the current line. We simply skip the sections that we’re not interested in parsing.

We end up with a very straightforward solution: the whole parser is reduced to a simple iteration over each line of the file that maintains a stack of nested sections while determining whether and how to parse the current line.

We’ll see the parts in more detail in just a moment, but here is the whole AFM parser that extracts all the information we need to properly render Adobe fonts in Prawn:

def parse_afm(file_name)
  section = []

  File.foreach(file_name) do |line|
    case line
    when /^Start(\w+)/
      section.push $1
      next
    when /^End(\w+)/
      section.pop
      next
    end

    case section
    when ["FontMetrics", "CharMetrics"]
      next unless line =~ /^CH?\s/

      name                  = line[/\bN\s+(\.?\w+)\s*;/, 1]
      @glyph_widths[name]   = line[/\bWX\s+(\d+)\s*;/, 1].to_i
      @bounding_boxes[name] = line[/\bB\s+([^;]+);/, 1].to_s.rstrip
    when ["FontMetrics", "KernData", "KernPairs"]
      next unless line =~ /^KPX\s+(\.?\w+)\s+(\.?\w+)\s+(-?\d+)/
      @kern_pairs[[$1, $2]] = $3.to_i
    when ["FontMetrics", "KernData", "TrackKern"], ["FontMetrics", "Composites"]
      next
    else
      parse_generic_afm_attribute(line)
    end
  end
end

You could try to understand the particular details if you’d like, but it’s also fine to black-box the expressions used here so that you can get a sense of the overall structure of the parser. Here’s what the code looks like if we do that for all but the patterns that determine the section nesting:

def parse_afm(file_name)
  section = []

  File.foreach(file_name) do |line|
    case line
    when /^Start(\w+)/
      section.push $1
      next
    when /^End(\w+)/
      section.pop
      next
    end

    case section
    when ["FontMetrics", "CharMetrics"]
      parse_char_metrics(line)
    when ["FontMetrics", "KernData", "KernPairs"]
      parse_kern_pairs(line)
    when ["FontMetrics", "KernData", "TrackKern"], ["FontMetrics", "Composites"]
      next
    else
      parse_generic_afm_attribute(line)
    end
  end
end

With these simplifications, it’s very clear that we’re looking at an ordinary finite state machine that is acting upon the lines of the file. It also makes it easier to notice what’s actually going on.

The first case statement is just a simple way to check which section we’re currently looking at, updating the stack as necessary as we move in and out of sections:

case line
when /^Start(\w+)/
  section.push $1
  next
when /^End(\w+)/
  section.pop
  next
end

If we find a section beginning or end, we skip to the next line, as we know there is nothing else to parse. Otherwise, we know that we have to do some real work, which is done in the second case statement:

case section
when ["FontMetrics", "CharMetrics"]
  next unless line =~ /^CH?\s/

  name                  = line[/\bN\s+(\.?\w+)\s*;/, 1]
  @glyph_widths[name]   = line[/\bWX\s+(\d+)\s*;/, 1].to_i
  @bounding_boxes[name] = line[/\bB\s+([^;]+);/, 1].to_s.rstrip
when ["FontMetrics", "KernData", "KernPairs"]
  next unless line =~ /^KPX\s+(\.?\w+)\s+(\.?\w+)\s+(-?\d+)/
  @kern_pairs[[$1, $2]] = $3.to_i
when ["FontMetrics", "KernData", "TrackKern"], ["FontMetrics", "Composites"]
  next
else
  parse_generic_afm_attribute(line)
end

Here, we’ve got four different ways to handle our line of text. In the first two cases, we process the lines that we need to as we walk through the section, extracting the bits of information we need and ignoring the information we’re not interested in.

In the third case, we identify certain sections to skip and simply resume processing the next line if we are currently within that section.

Finally, if the other cases fail to match, our final case assumes that we’re dealing with a simple key/value pair, which is handled by a private helper method in Prawn. Because it doesn’t demonstrate anything that the first two branches of this case statement haven’t already shown, we can safely skip its implementation without missing anything important.
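Purely for illustration, here is one plausible shape such a helper might take. Keep in mind that Prawn’s actual parse_generic_afm_attribute is a private method with its own behavior, so treat this as a hypothetical sketch, not the library’s code:

```ruby
# Hypothetical sketch of a key/value attribute parser. The method name
# matches Prawn's helper, but the body is invented for illustration.
@attributes = {}

def parse_generic_afm_attribute(line)
  # Split a line like "CapHeight 718" into a key and the rest of the line.
  key, value = line.split(/\s+/, 2)
  @attributes[key] = value.to_s.strip if key
end

parse_generic_afm_attribute("CapHeight 718\n")
@attributes["CapHeight"] # => "718"
```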

However, the interesting thing that you might have noticed is that the first case and the second case use two different ways of extracting values. The code that processes CharMetrics uses String#[], whereas the code handling KernPairs uses Perl-style global match variables. The reason for this is largely convenience. The following two lines of code are equivalent:

name = line[/\bN\s+(\.?\w+)\s*;/, 1]
name = line =~ /\bN\s+(\.?\w+)\s*;/ && $1
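To see that equivalence concretely, here is a tiny runnable sketch that applies both forms to a sample line from the CharMetrics data shown earlier:

```ruby
# A sample CharMetrics line from the AFM data shown earlier.
line = "C 32 ; WX 278 ; N space ; B 0 0 0 0 ;\n"

# String#[] with a capture group index: terse, and ideal when you
# only need a single capture.
name_via_index = line[/\bN\s+(\.?\w+)\s*;/, 1]

# =~ combined with the Perl-style $1 global: slightly more verbose
# here, but it scales to multiple captures.
name_via_match = line =~ /\bN\s+(\.?\w+)\s*;/ && $1

name_via_index # => "space"
name_via_match # => "space"
```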

There are still other ways to handle your captured matches (such as MatchData via String#match), but we’ll get into those later. For now, it’s simply worth knowing that when you’re trying to extract a single matched capture, String#[] does the job well, but if you need to deal with more than one, you need to use another approach. We see this clearly in the second case:

next unless line =~ /^KPX\s+(\.?\w+)\s+(\.?\w+)\s+(-?\d+)/
@kern_pairs[[$1, $2]] = $3.to_i

This code is a bit clever, as the line that assigns the values to @kern_pairs gets executed only when there is a successful match. When the match fails, it will return nil, causing the parser to skip to the next line for processing.
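A minimal sketch of this guard-clause idiom in isolation: because =~ returns nil for nonmatching input, the assignment line simply never runs for lines we don’t care about. The sample lines here are invented for illustration:

```ruby
kern_pairs = {}

# One matching line and one nonmatching line, for demonstration.
["KPX A V -80\n", "Comment ignore me\n"].each do |line|
  # =~ returns the match position on success and nil on failure,
  # so nonmatching lines skip straight to the next iteration.
  next unless line =~ /^KPX\s+(\.?\w+)\s+(\.?\w+)\s+(-?\d+)/
  kern_pairs[[$1, $2]] = $3.to_i
end

kern_pairs # => {["A", "V"] => -80}
```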

We could continue studying this example, but we’d then be delving into the specifics, and those details aren’t important for remembering this simple general pattern.

When dealing with a structured document that can be processed by discrete rules for each section, the general approach is simple and does not typically require pulling the entire document into memory or doing multiple passes through the data.

Instead, you can do the following:

  • Identify the beginning and end markers of sections with a pattern.

  • If sections are nested, maintain a stack that you update before further processing of each line.

  • Break up your extraction code into different cases and select the right one based on the current section you are in.

  • When a line cannot be processed, skip to the next one as soon as possible, using the next keyword.

  • Maintain state as you normally would, processing whatever data you need.

By following these basic guidelines, you can avoid overthinking your problem, while still saving clock cycles and keeping your memory footprint low. Although the code here solves a particular problem, it can easily be adapted to fit a wide range of basic document processing needs.
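To show how readily the pattern transfers, here is a minimal sketch that applies the same guidelines to a hypothetical document format with BEGIN/END section markers. The section names and sample data are invented for illustration:

```ruby
# A hypothetical sectioned document; only Settings is of interest.
doc = <<~TEXT
  BEGIN Settings
  color red
  size 42
  END Settings
  BEGIN Notes
  this section is skipped entirely
  END Notes
TEXT

settings = {}
section  = []

doc.each_line do |line|
  # Update the section stack before any other processing.
  case line
  when /^BEGIN (\w+)/ then section.push($1); next
  when /^END (\w+)/   then section.pop;      next
  end

  # Select an extraction strategy based on the current section.
  case section
  when ["Settings"]
    key, value = line.split(/\s+/, 2)
    settings[key] = value.strip
  else
    next # skip sections we aren't interested in
  end
end

settings # => {"color" => "red", "size" => "42"}
```

In a real program you would typically read with File.foreach rather than a string, which keeps the line-at-a-time, low-memory behavior described above.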

This introduction has hopefully provided a taste of what text processing in Ruby is all about. The rest of the chapter will provide many more tips and tricks, with a greater focus on the particular topics. Feel free to jump around to the things that interest you most, but I’m hoping all of the sections have something interesting to offer—even to seasoned Rubyists.
