Chapter 4. Arrays

Works of art, in my opinion, are the only objects in the material universe to possess internal order, and that is why, though I don’t believe that only art matters, I do believe in Art for Art’s sake.

E.M. Forster

Introduction

If you are asked about the contents of your pockets, or the names of the first three Greek letters, or how to get to the highway, you recite a list: you name one thing after another in a particular order. Lists are part of your conception of the world. With Perl’s powerful list- and array-handling primitives, you can translate this world view directly into code.

In this chapter, we’ll use the terms list and array as the Perl language thinks of them. Take ("alpha", "beta", "gamma"); that’s a list of the names of the first three Greek letters, in order. To store that list into a variable, use an array, as in @greeks = ("alpha", "beta", "gamma"). Both are ordered groups of scalar values; the difference is that an array is a named variable, one whose array length can be directly changed, whereas a list is a more ephemeral notion. You might think of an array as a variable and a list as the values it contains.

This distinction may seem arbitrary, but operations that modify the length of these groupings (like push and pop) require a proper array and not merely a list. Think of the difference between $a and 4. You can say $a++ but not 4++. Likewise, you can say pop(@a) but not pop (1,2,3).

The most important thing to glean from this is that Perl’s lists and arrays are both ordered groupings of scalars. Operators and functions that work on lists or arrays are designed to provide faster or more convenient access to the elements than manual access would provide. Since few actually deal with modifying the array’s length, you can usually use arrays and lists interchangeably.

You can’t use nested parentheses to create a list of lists. If you try that in Perl, your lists get flattened, meaning that both these lines are equivalent:

@nested = ("this", "that", "the", "other");
@nested = ("this", "that", ("the", "other"));

Why doesn’t Perl (usefully) just support nested lists directly? Although partially for historical reasons, this easily allows for operations (like print or sort) that work on arbitrarily long lists of arbitrary contents.

What happens if you want a more complex data structure, such as an array of arrays or an array of hashes? Remember that scalars aren’t restricted to containing just numbers or strings; they can also hold references. Complex (multilevel) data structures in Perl are always put together using references. Therefore, what appear to be “two-dimensional arrays” or “arrays of arrays” are always implemented as arrays of array references, in the same way that two-dimensional arrays in C can be arrays of pointers to arrays.

Most recipes in this chapter don’t care what you keep in your arrays; for example, the problem of merging two arrays is the same whether the arrays contains strings, numbers, or references. Some problems are intrinsically tied to the contents of your arrays; recipes for those are in Chapter 11. This chapter’s recipes deal with generic arrays.

Let’s have some more terminology. The scalar items in an array or list are called elements, which you access by specifying their position, or index. Indices in Perl start at 0. So, given this list:

@greeks = ( "alpha", "beta", "gamma" );

"alpha" is in the first position, but you’d access it as $greeks[0]. "beta" is in the second position, but you’d access it as $greeks[1]. This structure is doubly justified: the contrariness of computers, whose first representable number is 0, and the contrariness of language designers, who chose 0 because it is an offset into the array, not the ordinal number of the element.

4.1. Specifying a List in Your Program

Problem

You want to include a list in your program. This is how you initialize arrays.

Solution

You can write out a comma-separated list of elements:

@a = ("quick", "brown", "fox");

If you have a lot of single-word elements, use the qw( ) operator:

@a = qw(Meddle not in the affairs of wizards.);

If you have a lot of multiword elements, use a here document and extract lines:

@lines = (<< "END_OF_HERE_DOC" =~ /^\s*(.+)/gm);
    I sit beside the fire and think
    of all that I have seen,
    of meadow-flowers and butterflies 
    and summers that have been;
END_OF_HERE_DOC

Discussion

The first technique is the one most commonly used, often because only small arrays are normally initialized as program literals. Initializing a large array would fill your program with values and make it hard to read, so such arrays either tend to be initialized in a separate library file (see Chapter 12), or else have their values read in from a file:

@bigarray = ( );
open(FH, "<", "myinfo")   or die "Couldn't open myinfo: $!";
while (<FH>) {
    chomp;
    push(@bigarray, $_);
}
close(FH);

The second technique uses qw( ), one of several pseudo-functions in Perl used for quoting without having to resort to actual quotation marks. This one splits its string argument on whitespace to produce a list of words, where “words” in this instance means strings that don’t contain any whitespace. The initial argument is not subject to interpolation of variables or (most) backslash escape sequences.

@banner = ('Costs', 'only', '$4.95');
@banner = qw(Costs only $4.95);
@banner = split(' ', 'Costs only $4.95');

You can use qw( ) only when each whitespace-separated argument is to be a distinct element in the return list. Be careful not to give Columbus four ships instead of three:

@ships  = qw(Niña Pinta Santa María);               # WRONG
@ships  = ('Niña', 'Pinta', 'Santa María');         # right

The third solution takes a here document, which is a single, multiline string, and applies a global pattern match to that string. The pattern /^\s*(.+)/ says to skip any whitespace at the start of the line, then capture everything through the end of each line. The /g modifier means to apply that match globally, and the /m modifier says to permit ^ to match not just at the beginning of the string, but also immediately after a newline, which, in a multiline string, is just what you need. Applying that technique to the ships example yields:

@ships = ( << "END_OF_FLOTILLA" =~ /^\s*(.+)/gm);
              Niña
              Pinta 
              Santa María
END_OF_FLOTILLA

See Also

The “List Value Constructors” section of perldata(1); the “List Values and Arrays” section of Chapter 2 of Programming Perl; the “Quote and Quote-Like Operators” section of perlop(1); the s/// operator in perlop(1) and Chapter 5 of Programming Perl

4.2. Printing a List with Commas

Problem

You’d like to print out a list containing an unknown number of elements, placing an “and” before the last element and commas between each element if there are more than two.

Solution

Use this function, which returns the formatted string:

sub commify_series {
    (@_ =  = 0) ? ''                                      :
    (@_ =  = 1) ? $_[0]                                   :
    (@_ =  = 2) ? join(" and ", @_)                       :
                join(", ", @_[0 .. ($#_-1)], "and $_[-1]");
}

Discussion

It often looks odd to print out arrays:

@array = ("red", "yellow", "green");
print "I have ", @array, " marbles.\n";
print "I have @array marbles.\n";
I have redyellowgreen marbles
               I have red yellow green marbles

What you really want it to say is, "I have red, yellow, and green marbles“. The function given in the solution generates strings in that format. The word "and" is placed between the last two list elements. If there are more than two elements in the list, a comma is placed between every element.

Example 4-1 gives a complete demonstration of the function, with one addition: if any element in the list already contains a comma, a semicolon is used for the separator character instead.

Example 4-1. commify_series
  #!/usr/bin/perl -w
  # commify_series - show proper comma insertion in list output
  
  # @lists is an array of (references to anonymous) arrays
  @lists = (
      [ 'just one thing' ],
      [ qw(Mutt Jeff) ],
      [ qw(Peter Paul Mary) ],
      [ 'To our parents', 'Mother Theresa', 'God' ],
      [ 'pastrami', 'ham and cheese', 'peanut butter and jelly', 'tuna' ],
      [ 'recycle tired, old phrases', 'ponder big, happy thoughts' ],
      [ 'recycle tired, old phrases', 
        'ponder big, happy thoughts', 
        'sleep and dream peacefully' ],
      );
  
  foreach $aref (@lists) {
      print "The list is: " . commify_series(@$aref) . ".\n";
  } 
  # demo for single list
  @list = qw(one two three);
  print "The last list is: " . commify_series(@list) . ".\n";
  
  sub commify_series {
      my $sepchar = grep(/,/ => @_) ? ";" : ",";
      (@_ =  = 0) ? ''                                      :
      (@_ =  = 1) ? $_[0]                                   :
      (@_ =  = 2) ? join(" and ", @_)                       :
                  join("$sepchar ", @_[0 .. ($#_-1)], "and $_[-1]");
  }

Here’s the output from the program:

               The list is: just one thing
               The list is: Mutt and Jeff
               The list is: Peter, Paul, and Mary
               The list is: To our parents, Mother Theresa, and God
               The list is: pastrami, ham and cheese, peanut butter and jelly, and tuna
               The list is: recycle tired, old phrases and ponder big, happy thoughts
               The list is: recycle tired, old phrases; ponder
               big, happy thoughts; and sleep and dream peacefully
               The last list is: one, two, and three

As you see, we don’t follow the ill-advised practice of omitting the final comma from a series under any circumstances. To do so introduces unfortunate ambiguities and unjustifiable exceptions. The examples shown would have claimed that we were the offspring of Mother Teresa and God, and would have had us eating sandwiches made of jelly and tuna fish mixed together atop the peanut butter.

See Also

Fowler’s Modern English Usage; we explain the nested list syntax in Recipe 11.1; the grep function in perlfunc(1) and Chapter 29 of Programming Perl; the conditional operator ("?:") is discussed in perlop(1) and in the “Conditional Operator” section of Chapter 3 of Programming Perl

4.3. Changing Array Size

Problem

You want to enlarge or truncate an array. For example, you might truncate an array of employees that’s already sorted by salary to list the five highest-paid employees. Or, if you know how big your array will get and that it will grow piecemeal, it’s more efficient to grab memory for it in one step by enlarging just once than to keep pushing values onto the end.

Solution

Assign to $#ARRAY :

# grow or shrink @ARRAY
$#ARRAY = $NEW_LAST_ELEMENT_INDEX_NUMBER;

Assigning to an element past the end automatically extends the array:

$ARRAY[$NEW_LAST_ELEMENT_INDEX_NUMBER] = $VALUE;

Discussion

$#ARRAY is the number of the last valid index in @ARRAY. If we assign it a number smaller than its current value, we truncate the array. Truncated elements are lost forever. If we assign $#ARRAY a number larger than its current value, the array grows. New elements have the undefined value.

$#ARRAY is not @ARRAY, though. Although $#ARRAY is the last valid index in the array, @ARRAY (in scalar context, as when treated as a number) is the number of elements. $#ARRAY is one less than @ARRAY because array indices start at 0.

Here’s some code that uses both. We have to say scalar @array in the print because Perl gives list context to (most) functions’ arguments, but we want @array in scalar context.

sub what_about_that_array {
    print "The array now has ", scalar(@people), " elements.\n";
    print "The index of the last element is $#people.\n";
    print "Element #3 is `$people[3]'.\n";
}

@people = qw(Crosby Stills Nash Young);
what_about_that_array( );

prints:

               The array now has 4 elements
               The index of the last element is 3
               Element #3 is `Young'

whereas:

$#people--;
what_about_that_array( );

prints:

               The array now has 3 elements
               The index of the last element is 2
               Element #3 is `'

Element #3 disappeared when we shortened the array. If we’d turned on warnings (either with the -w command-line option to Perl or with use warnings inside the program), Perl would also have warned “use of uninitialized value” because $people[3] is undefined.

$#people = 10_000;
what_about_that_array( );

prints:

               The array now has 10001 elements
               The index of the last element is 10000
               Element #3 is `'

The "Young" element is now gone forever. Instead of assigning to $#people, we could have said:

$people[10_000] = undef;

although this isn’t exactly the same. If you have a three-element array, as in:

@colors = qw(red blue green);

and you undef its last element:

undef $colors[2];        # green is gone

you still have a three-element array; its last element is just undefined. If you pop the array, either via the function or manually by changing $#colors:

$last_color = $colors[ $#colors-- ];

then the array grows shorter by one element.

Perl arrays are not sparse. In other words, if you have a 10,000th element, you must have the 9,999 other elements, too. They may be undefined, but they still take up memory. For this reason, $array[time( )], or any other construct that uses a very large integer as an array index, is a really bad idea. Use a hash instead.

See Also

The discussion of the $#ARRAY notation in perldata(1), also explained in the “List Values and Arrays” section of Chapter 2 of Programming Perl

4.4. Implementing a Sparse Array

Problem

An array with large, unoccupied expanses between occupied elements wastes memory. How do you reduce that overhead?

Solution

Use a hash instead of an array.

Discussion

If you assign to the millionth element of an array, Perl allocates a million and one slots to store scalars. Only the last element contains interesting data, leaving earlier ones each set to undef at a cost of four (or more) bytes per unoccupied slot.

In recent versions of Perl, if you grow an array by assigning either past the end or directly to $#ARRAY, you can distinguish these implicit undefs from those that would result from assigning undef there by using exists instead of defined, just as you would with a hash.

$#foo = 5;
@bar = ( (undef) x 5 ) ;

printf "foo element 3 is%s defined\n",
        defined $foo[3] ? "" : "n't";
printf "foo element 3 does%s exist\n",
        exists $foo[3] ? "" : "n't";
printf "bar element 3 is%s defined\n",
        defined $bar[3] ? "" : "n't";
printf "bar element 3 does%s exist\n",
        exists $bar[3] ? "" : "n't";

foo element 3 isn't defined
               foo element 3 doesn't exist
               bar element 3 isn't defined
               bar element 3 does exist

However, you still waste a lot of space. That’s because Perl’s array implementation reserves a contiguous vector, one for each element up to the highest occupied position.

$real_array[ 1_000_000 ] = 1;       # costs 4+ megabytes

A hash works differently: you pay only for what you really use, not for unoccupied positions. Although a hash element costs somewhat more than an array element because you need to store both the value and its key, with sparse arrays, the savings can be astonishing.

$fake_array{ 1_000_000 } = 1;       # costs 28 bytes

What’s the trade-off? Because a hash’s keys aren’t ordered, a little more work is needed to sort the numeric keys so you can handle their values in the same order as you would if they were stored as a real array. With an array, you’d just do this to process elements in index order:

foreach $element ( @real_array ) {
    # do something with $element
}

or this to process indices in ascending order:

foreach $idx ( 0 .. $#real_array ) {
    # do something with $real_array[$idx]
}

Using a hash representation, you should instead do either this to process elements in index order:

foreach $element ( @fake_array{ sort {$a <=> $b} keys %fake_array } ) {
    # do something with $element
}

or this to process indices in ascending order:

foreach $idx ( sort {$a <=> $b} keys %fake_array ) {
    # do something with $fake_array{$idx}
}

If you don’t care about handling elements in a particular order, however, you don’t need to go through all that. Just process the values according to their internal order, either like this:

foreach $element ( values %fake_array ) {
    # do something with $element
}

or like this:

# process indices in internal hash order
foreach $idx ( keys %fake_array ) {
    # do something with $fake_array{$idx}
}

If you’re determined to use an array, two fairly specialized cases occasionally arise in which you can save substantial amounts of memory by using an alternate storage scheme. Both cases also apply to arrays that are densely populated, not just those that are mostly empty.

The first case shows up when you grow an array by repeatedly appending new elements until its subscripts become large. Because of how Perl reallocates memory for growing arrays, this can use up to four times the memory you really need. If you happen to know how big the array will (or might) eventually become, you can avoid this reallocation overhead either by storing the large subscripts first instead of the small ones:

for ($i = 10_000; $i >= 0; $i--) { $real_array[$i] = 1 }

or by presizing the array by assigning to the special $#ARRAY notation:

$#real_array = 10_000;

The second special case comes up when each array element holds nothing but a single one-bit value—essentially either a true or a false. For example, suppose you are keeping track of numbered USENET news articles, and you only need to know whether a given article number has been read. For situations like this, use a bit vector instead of a real array:

my $have_read = '';
for ($i = 10_000; $i >= 0; $i--) { vec($have_read, $i, 1) = 1 }

Then you can check to see whether a given article has been read this way:

if (vec($have_read, $artno, 1)) { .... }

See Also

The vec function in perlfunc(1) and in Chapter 29 of Programming Perl

4.5. Iterating Over an Array

Problem

You want to repeat a procedure for every element in a list.

Often you use an array to collect information you’re interested in; for instance, login names of users who have exceeded their disk quota. When you finish collecting the information, you want to process it by doing something with every element in the array. In the disk quota example, you might send each user a stern mail message.

Solution

Use a foreach loop:

foreach $item (LIST) {
    # do something with $item
}

Discussion

Let’s say we’ve used @bad_users to compile a list of users who are over their allotted disk quotas. To call some complain subroutine for each user, we’d use:

foreach $user (@bad_users) {
        complain($user);
}

Rarely is this recipe so simply applied. Instead, we often use functions to generate the list:

foreach $var (sort keys %ENV) {
    print "$var=$ENV{$var}\n";
}

Here we’re using sort and keys to build a sorted list of environment variable names. If you use the list more than once, you’ll obviously keep it around by saving in an array. But for one-shot processing, it’s often tidier to process the list directly.

Not only can we add complexity to this formula by building up the list in the foreach, we can also add complexity by doing more work inside the code block. A common application of foreach is to gather information on every element of a list and then, based on that information, decide whether to do something. For instance, returning to the disk quota example:

foreach $user (@all_users) {
    $disk_space = get_usage($user);     # find out how much disk space in use
    if ($disk_space > $MAX_QUOTA) {     # if it's more than we want ...
        complain($user);                # ... then object vociferously
    }
}

More complicated program flow is possible. The code can call last to jump out of the loop, next to move on to the next element, or redo to jump back to the first statement inside the block. Use these to say “no point continuing with this one, I know it’s not what I’m looking for” (next), “I’ve found what I’m looking for, there’s no point in my checking the rest” (last), or “I’ve changed some things, I’d better run this loop’s calculations again” (redo).

The variable set to each value in the list is called a loop variable or iterator variable. If no iterator variable is supplied, the global variable $_ is used. $_ is the default variable for many of Perl’s string, list, and file functions. In brief code blocks, omitting $_ improves readability. (In long ones, though, too much implicit use hampers readability.) For example:

foreach (`who`) {
    if (/tchrist/) {
        print;
    }
}

or combining with a while loop:

while (<FH>) {              # $_ is set to the line just read
    chomp;                  # $_ has a trailing \n removed, if it had one
    foreach (split) {       # $_ is split on whitespace, into @_
                            # then $_ is set to each chunk in turn
        $_ = reverse;       # the characters in $_ are reversed
        print;              # $_ is printed
    }
}

Perhaps all these uses of $_ are starting to make you nervous. In particular, the foreach and the while both give values to $_. You might fear that at the end of the foreach, the full line as read into $_ with <FH> would be forever gone.

Fortunately, your fears would be unfounded, at least in this case. Perl won’t permanently clobber $_’s old value, because the foreach’s iterator variable (here, $_) is automatically preserved during the loop. It saves away any old value on entry and restores it upon exit.

However, there is some cause for concern. If the while had been the inner loop and the foreach the outer one, your fears would have been realized. Unlike a foreach loop, the while (<FH>) construct clobbers the value of the global $_ without first localizing it! So any routine—or block for that matter—that uses this construct with $_ should declare local $_.

If a lexical variable (one declared with my) is in scope, the temporary variable will be lexically scoped, private to that loop. Otherwise, it will be a dynamically scoped global variable. To avoid strange magic at a distance, write this more obviously and more clearly as:

foreach my $item (@array) {
    print "i = $item\n";
}

The foreach looping construct has another feature: each time through the loop, the iterator variable becomes not a copy of but rather an alias for the current element. This means that when you change that iterator variable, you really change each element in the list:

@array = (1,2,3);
foreach $item (@array) {
    $item--;
}
print "@array\n";
0 1 2

# multiply everything in @a and @b by seven
@a = ( .5, 3 ); @b = ( 0, 1 );
foreach $item (@a, @b) {
    $item *= 7;
}
print "@a @b\n";
3.5 21 0 7

You can’t change a constant, though, so this is illegal:

foreach $n (1, 2, 3) {
    $n **= 2;
}

This aliasing means that using a foreach loop to modify list values is both more readable and faster than the equivalent code using a three-part for loop and explicit indexing would be. This behavior is a feature, not a bug, that was introduced by design. If you didn’t know about it, you might accidentally change something. Now you know about it.

For example, to trim leading and trailing whitespace in a hash, we take advantage of how the values function works: the elements of its return list really are the values of the hash, and changing these changes the original hash. Because we use s/// directly on the list returned by the values function without copying these into a variable, we change the real hash itself.

# trim whitespace in the scalar, the array, and in all 
# the values in the hash
foreach ($scalar, @array, values %hash) {
    s/^\s+//;
    s/\s+$//;
}

For reasons hearkening back to the equivalent construct in the Unix Bourne shell, the for and foreach keywords are interchangeable:

for $item (@array) {  # same as foreach $item (@array)
    # do something
}

for (@array)      {   # same as foreach $_ (@array)
    # do something
}

This style often indicates that its author writes or maintains shell scripts, perhaps for Unix system administration. As such, their life is probably hard enough, so don’t speak too harshly of them. Remember, TMTOWTDI. This is just one of those ways.

If you aren’t fluent in Bourne shell, you might find it clearer to express “for each $thing in this @list" by saying foreach, to make your code look less like the shell and more like English. (But don’t try to make your English look like your code!)

See Also

The “For Loops,” “Foreach Loops,” and “Loop Control” sections of perlsyn(1) and Chapter 4 of Programming Perl; the “Temporary Values via local” section of perlsub(1); the “Scoped Declarations” section of Chapter 4 of Programming Perl; we talk about local in Recipe 10.13; we talk about my in Recipe 10.2

4.6. Iterating Over an Array by Reference

Problem

You have a reference to an array, and you want to use a loop to work with the array’s elements.

Solution

Use foreach or for to loop over the dereferenced array:

# iterate over elements of array in $ARRAYREF
foreach $item (@$ARRAYREF) {
    # do something with $item
}

for ($i = 0; $i <= $#$ARRAYREF; $i++) {
    # do something with $ARRAYREF->[$i]
}

Discussion

The solutions assume you have a scalar variable containing the array reference. This lets you do things like this:

@fruits = ( "Apple", "Blackberry" );
$fruit_ref = \@fruits;
foreach $fruit (@$fruit_ref) {
    print "$fruit tastes good in a pie.\n";
}
Apple tastes good in a pie
               Blackberry tastes good in a pie

We could have rewritten the foreach loop as a for loop like this:

for ($i=0; $i <= $#$fruit_ref; $i++) {
    print "$fruit_ref->[$i] tastes good in a pie.\n";
}

Frequently, though, the array reference is the result of a more complex expression. Use the @{ EXPR } notation to turn the result of the expression back into an array:

$namelist{felines} = \@rogue_cats;
foreach $cat ( @{ $namelist{felines} } ) {
    print "$cat purrs hypnotically..\n";
}
print "--More--\nYou are controlled.\n";

Again, we can replace the foreach with a for loop:

for ($i=0; $i <= $#{ $namelist{felines} }; $i++) {
    print "$namelist{felines}[$i] purrs hypnotically.\n";
}

See Also

perlref(1) and perllol(1); Chapter 8 of Programming Perl; Recipe 11.1; Recipe 4.5

4.7. Extracting Unique Elements from a List

Problem

You want to eliminate duplicate values from a list, such as when you build the list from a file or from the output of another command. This recipe is equally applicable to removing duplicates as they occur in input and to removing duplicates from an array you’ve already populated.

Solution

Use a hash to record which items have been seen, then keys to extract them. You can use Perl’s idea of truth to shorten and speed up your code.

Straightforward

%seen = ( );
@uniq = ( );
foreach $item (@list) {
    unless ($seen{$item}) {
        # if we get here, we have not seen it before
        $seen{$item} = 1;
        push(@uniq, $item);
    }
}

Faster

%seen = ( );
foreach $item (@list) {
    push(@uniq, $item) unless $seen{$item}++;
}

Similar but with user function

%seen = ( );
foreach $item (@list) {
    some_func($item) unless $seen{$item}++;
}

Faster but different

%seen = ( );
foreach $item (@list) {
    $seen{$item}++;
}
@uniq = keys %seen;

Faster and even more different

%seen = ( );
@uniq = grep { ! $seen{$_} ++ } @list;

Discussion

The question at the heart of the matter is “Have I seen this element before?” Hashes are ideally suited to such lookups. The first technique (Recipe 4.7.2.1) builds up the array of unique values as we go along, using a hash to record whether something is already in the array.

The second technique (Recipe 4.7.2.2) is the most natural way to write this sort of thing in Perl. It creates a new entry in the hash every time it sees an element that hasn’t been seen before, using the ++ operator. This has the side effect of making the hash record the number of times the element was seen. This time we only use the hash for its property of working like a set.

The third example (Recipe 4.7.2.3) is similar to the second but rather than storing the item away, we call some user-defined function with that item as its argument. If that’s all we’re doing, keeping a spare array of those unique values is unnecessary.

The next mechanism (Recipe 4.7.2.4) waits until it’s done processing the list to extract the unique keys from the %seen hash. This may be convenient, but the original order has been lost.

The final approach (Recipe 4.7.2.5) merges the construction of the %seen hash with the extraction of unique elements. This preserves the original order of elements.

Using a hash to record the values has two side effects: processing long lists can take a lot of memory, and the list returned by keys is unordered.

Here’s an example of processing input as it is read. We use `who` to gather information on the current user list, then extract the username from each line before updating the hash:

# generate a list of users logged in, removing duplicates
%ucnt = ( );
for (`who`) {
    s/\s.*\n//;   # kill from first space till end-of-line, yielding username
    $ucnt{$_}++;  # record the presence of this user
}
# extract and print unique keys
@users = sort keys %ucnt;
print "users logged in: @users\n";

See Also

The “Foreach Loops” section of perlsyn(1) and Chapter 4 of Programming Perl; the keys function in perlfunc(1) and Chapter 29 of Programming Perl; the “Hashes” section of Chapter 2 of Programming Perl; Chapter 5; we use hashes in a similar fashion in Recipe 4.8 and Recipe 4.9

4.8. Finding Elements in One Array but Not Another

Problem

You want to find elements that are in one array but not another.

Solution

You want to find elements in @A that aren’t in @B. Build a hash of the keys of @B to use as a lookup table. Then check each element in @A to see whether it is in @B.

Straightforward implementation

# assume @A and @B are already loaded
%seen = ( );                    # lookup table to test membership of B
@aonly = ( );                   # answer

# build lookup table
foreach $item (@B) { $seen{$item} = 1 }

# find only elements in @A and not in @B
foreach $item (@A) {
    unless ($seen{$item}) {
        # it's not in %seen, so add to @aonly
        push(@aonly, $item);
    }
}

More idiomatic version

my %seen;     # lookup table
my @aonly;    # answer

# build lookup table
@seen{@B} = ( );

foreach $item (@A) {
    push(@aonly, $item) unless exists $seen{$item};
}

Loopless version

my @A = ...;
my @B = ...;

my %seen;
@seen {@A} = ( );
delete @seen {@B};

my @aonly = keys %seen;

Discussion

As with nearly any problem in Perl that asks whether a scalar is in one list or another, this one uses a hash. First, process @B so that the %seen hash records each element from @B by setting its value to 1. Then process @A one element at a time, checking whether that particular element had been in @B by consulting the %seen hash.

The given code retains duplicate elements in @A. This can be fixed easily by adding the elements of @A to %seen as they are processed:

foreach $item (@A) {
    push(@aonly, $item) unless $seen{$item};
    $seen{$item} = 1;                       # mark as seen
}

The first two solutions differ mainly in how they build the hash. The first iterates through @B. The second uses a hash slice to initialize the hash. A hash slice is easiest illustrated by this example:

$hash{"key1"} = 1;
$hash{"key2"} = 2;

which is equivalent to:

@hash{"key1", "key2"} = (1,2);

The list in the curly braces holds the keys; the list on the right holds the values. We initialize %seen in the first solution by looping over each element in @B and setting the appropriate value of %seen to 1. In the second, we simply say:

@seen{@B} = ( );

This uses items in @B as keys for %seen, setting each corresponding value to undef, because there are fewer values on the right than places to put them. This works out here because we check for existence of the key, not logical truth or definedness of the value. If we needed true values, a slice could still shorten our code:

@seen{@B} = (1) x @B;

In the third solution, we make use of this property even further and avoid explicit loops altogether. (Not that avoiding loops should be construed as being particularly virtuous; we’re just showing you that there’s more than one way to do it.) The slice assignment makes any element that was in @A a key, and the slice deletion removes from the hash any keys that were elements of @B, leaving those that were only in @A.

A fairly common situation where this might arise is when you have two files and would like to know which lines from the second file either were or weren’t in the first. Here’s a simple solution based on this recipe:

open(OLD, $path1)        || die "can't open $path1: $!";
@seen{ <OLD> } = ( );
open(NEW, $path2)        || die "can't open $path2: $!";
while (<NEW>) {
    print if exists $seen{$_};  
}

This shows the lines in the second file that were already seen in the first one. Use unless instead of if to show the lines in the second file that were not in the first.

Imagine two files, the first containing the lines:

red
yellow
green
blue

and the second containing:

green
orange
purple
black
yellow

The output using if would be:

green 
yellow

and the output using unless would be:

orange 
purple
black

You could even do this from the command line; given a suitable cat(1) program, it’s easy:

% perl -e '@s{`cat OLD`}=( ); exists $s{$_} && print for `cat NEW`'
% perl -e '@s{`cat OLD`}=( ); exists $s{$_} || print for `cat NEW`'

You’d find that you just emulated these calls to the Unix fgrep(1) program:

% fgrep -Ff  OLD NEW
% fgrep -vFf OLD NEW

See Also

Hash slices are explained in perldata(1) and the “Variables” section of Chapter 2 of Programming Perl; Chapter 5; we use hashes in a similar fashion in Recipe 4.7 and Recipe 4.9

4.9. Computing Union, Intersection, or Difference of Unique Lists

Problem

You have a pair of lists, each holding unduplicated items. You’d like to find out which items are in both lists (intersection), one but not the other (difference), or either (union).

Solution

The following solutions need the listed initializations:

@a = (1, 3, 5, 6, 7, 8);
@b = (2, 3, 5, 7, 9);

@union = @isect = @diff = ( );
%union = %isect = ( );
%count = ( );

Simple solution for union and intersection

foreach $e (@a) { $union{$e} = 1 }

foreach $e (@b) {
    if ( $union{$e} ) { $isect{$e} = 1 }
    $union{$e} = 1;
}
@union = keys %union;
@isect = keys %isect;

More idiomatic version

foreach $e (@a, @b) { $union{$e}++ && $isect{$e}++ }

@union = keys %union;
@isect = keys %isect;

Union, intersection, and symmetric difference

foreach $e (@a, @b) { $count{$e}++ }

@union = keys %count;
foreach $e (keys %count) {
    if ($count{$e} =  = 2) {
        push @isect, $e;
    } else {
        push @diff, $e;
    }
}

Indirect solution

@isect = @diff = @union = ( );

foreach $e (@a, @b) { $count{$e}++ }

@union = keys %count;
foreach $e (keys %count) {
    push @{ $count{$e} =  = 2 ? \@isect : \@diff }, $e;
}

Discussion

The first solution most directly computes the union and intersection of two lists, neither containing duplicates. Two hashes are used to record whether a particular item goes in the union or the intersection. We put every element of the first array in the union hash, giving it a true value. Then, processing each element of the second array, we check whether that element is already present in the union. If it is, we put it in the intersection as well. In any event, it goes into the union. When we’re done, we extract the keys of both the union and intersection hashes. The values aren’t needed.

The second solution (Recipe 4.8.2.2) is essentially the same but relies on familiarity with the Perl (and awk, C, C++, and Java) ++ and && operators. By placing the ++ after the variable, we first look at its old value before incrementing it. The first time through it won’t be in the union, which makes the first part of the && false, so the second part is consequently ignored. The second time that we encounter the same element, it’s already present in the union, so we put it in the intersection.

The third solution uses just one hash to track how many times each element is seen. Once both arrays have their elements recorded in the hash, we grab those keys and put them in the union. Then we process those hash keys one at a time. Keys whose values are 2 were in both arrays, so they go in the intersection array. Keys whose values are 1 were in just one of the two arrays, so they go in the difference array. Elements of the output arrays are not in the same order as those in the input arrays.

The last solution, like the previous one, uses just one hash to count how many times each element is encountered. Here, though, we dynamically select one of two possible arrays by placing within the @{...} array-dereferencing block an expression whose evaluation yields a reference to whichever array is demanded by the situation.

In this recipe we compute the symmetric difference, not the simple difference. These are terms from set theory. A symmetric difference is the set of all elements that are members of either @A or @B, but not both. A simple difference is the set of members of @A but not @B, which we calculated in Recipe 4.8.

See Also

The “Hashes” section of Chapter 2 of Programming Perl; Chapter 5; we use hashes in a similar fashion in Recipe 4.7 and Recipe 4.8

4.10. Appending One Array to Another

Problem

You want to join two arrays by appending all elements of one to the other.

Solution

Use push:

# push
push(@ARRAY1, @ARRAY2);

Discussion

The push function is optimized for appending a list to the end of an array. You can take advantage of Perl’s list flattening to join two arrays, but this results in significantly more copying than push:

@ARRAY1 = (@ARRAY1, @ARRAY2);

Here’s an example of push in action:

@members = ("Time", "Flies");
@initiates = ("An", "Arrow");
push(@members, @initiates);
# @members is now ("Time", "Flies", "An", "Arrow")

To insert the elements of one array into the middle of another, use the splice function:

splice(@members, 2, 0, "Like", @initiates);
print "@members\n";
splice(@members, 0, 1, "Fruit");
splice(@members, -2, 2, "A", "Banana");
print "@members\n";

This is the output:

               Time Flies Like An Arrow
               Fruit Flies Like A Banana

See Also

The splice and push functions in perlfunc(1) and Chapter 29 of Programming Perl; the “List Values and Arrays” section of Chapter 2 of Programming Perl; the “List Value Constructors” section of perldata(1)

4.11. Reversing an Array

Problem

You want to reverse an array.

Solution

Use the reverse function:

# reverse @ARRAY into @REVERSED
@REVERSED = reverse @ARRAY;

Or process with a foreach loop on a reversed list:

foreach $element (reverse @ARRAY) {
    # do something with $element
}

Or use a for loop, starting with the index of the last element and working your way down:

for ($i = $#ARRAY; $i >= 0; $i--) {
    # do something with $ARRAY[$i]
}

Discussion

Called in list context, the reverse function reverses elements of its argument list. You can save a copy of that reversed list into an array, or just use foreach to walk through it directly if that’s all you need. The for loop processes the array elements in reverse order by using explicit indices. If you don’t need a reversed copy of the array, the for loop can save memory and time on very large arrays.

If you’re using reverse to reverse a list that you just sorted, you should have sorted it in the correct order to begin with. For example:

# two-step: sort then reverse
@ascending = sort { $a cmp $b } @users;
@descending = reverse @ascending;

# one-step: sort with reverse comparison
@descending = sort { $b cmp $a } @users;

See Also

The reverse function in perlfunc(1) and Chapter 29 of Programming Perl; we use reverse in Recipe 1.7

4.12. Processing Multiple Elements of an Array

Problem

You want to pop or shift multiple elements at a time.

Solution

Use splice:

# remove $N elements from front of @ARRAY (shift $N)
@FRONT = splice(@ARRAY, 0, $N);

# remove $N elements from the end of the array (pop $N)
@END = splice(@ARRAY, -$N);

Discussion

The splice function allows you to add elements, delete elements, or both, at any point in an array, not just at the ends. All other operations that modify an array’s length can also be written as a splice:

Direct method

Splice equivalent

push(@a, $x, $y)
splice(@a, @a, 0, $x, $y)
pop(@a)
splice(@a, -1)
shift(@a)
splice(@a, 0, 1)
unshift(@a, $x, $y)
splice(@a, 0, 0, $x, $y)
$a[$x] = $y
splice(@a, $x, 1, $y)
(@a, @a = ( ))
splice(@a)

Unlike pop and unshift, though, which always delete and return just one element at a time—and from the ends only—splice lets you specify the number of elements. This leads to code like the examples in the Solution.

It’s often convenient to wrap these splices as functions:

sub shift2 (\@) {
    return splice(@{$_[0]}, 0, 2);
}

sub pop2 (\@) {
    return splice(@{$_[0]}, -2);
}

This makes their behavior more apparent when you use them:

@friends = qw(Peter Paul Mary Jim Tim);
($this, $that) = shift2(@friends);
# $this contains Peter, $that has Paul, and
# @friends has Mary, Jim, and Tim

@beverages = qw(Dew Jolt Cola Sprite Fresca);
@pair = pop2(@beverages);
# $pair[0] contains Sprite, $pair[1] has Fresca,
# and @beverages has (Dew, Jolt, Cola)

The splice function returns the elements it removed from the array, so shift2 replaces the first two elements in @ARRAY with nothing (i.e., deletes them) and returns the two elements deleted. In pop2, the two elements at end of the array are removed and returned.

These two functions are prototyped to take an array reference as their argument to better mimic the built-in shift and pop functions. The caller doesn’t pass in an explicit reference using a backslash. Instead, the compiler, having seen the array reference prototype, arranges to pass the array by reference anyway. Advantages to this approach include efficiency, transparency, and compile-time parameter checking. One disadvantage is that the thing passed in must look like a real array with a leading @ sign, not just a scalar containing an array reference. If it did, you’d have to prepend an @, making it less transparent:

$line[5] = \@list;
@got = pop2( @{ $line[5] } );

This is another example of where a proper array and not a mere list is called for. The \@ prototype requires that whatever goes in that argument slot be an array. $line[5] isn’t an array, but an array reference. That’s why we need the “extra” @ sign.

See Also

The splice function in perlfunc(1) and Chapter 29 of Programming Perl; the “Prototypes” sections of perlsub(1) and Chapter 6 of Programming Perl; we use splice in Recipe 4.10

4.13. Finding the First List Element That Passes a Test

Problem

You want the first element in the list (or its index) that passes a test. Alternatively, you want to know whether any element passes the test. The test can be simple identity (“Is this element in the list?”)[1] or more complex (“I have a list of Employee objects, sorted from highest salary to lowest. Which manager has the highest salary?”). Simple cases normally require only the value of the element, but when the array itself will be altered, you probably need to know the index number of the first matching element.

Solution

To find a matching value, use foreach to loop over every element, and call last as soon as you find a match:

my ($match, $found, $item);
foreach $item (@array) {
    if (CRITERION) {
        $match = $item;  # must save
        $found = 1;
        last;
    }
}
if ($found) {
    ## do something with $match
} else {
    ## unfound
}

To find a matching index, use for to loop a variable over every array index, and call last as soon as you find a match:

my ($i, $match_idx);
for ($i = 0; $i < @array; $i++) {
    if (CRITERION) {
        $match_idx = $i;    # save the index
        last;
    }
}

if (defined $match_idx) {
    ## found in $array[$match_idx]
} else {
    ## unfound
}

The List::Util module, shipped standard with Perl as of v5.8 but available on CPAN for earlier versions, provides an even easier approach:

use List::Util qw(first);
$match = first { CRITERION } @list

Discussion

Lacking (until recently) a built-in mechanism to do this, we must write our own code to go through the list and test each element. We use foreach and for, and call last to ensure that we stop as soon as we find a match. Before we use last to stop looking, though, we save the value or index.

A common approach is to try to use grep here. But grep always tests all elements and finds all matches, so it’s inefficient if you want only the first match. However, grep might still be faster. That’s because there will be less source code if you use grep rather than writing your own loop. That means fewer internal Perl operations, and it is these that in practice often dominate runtimes.

Beyond a certain size of your data set, a loop that terminates early will still be faster—assuming it has the chance to do so. Empirical evidence suggests that for will be faster as long as you can exit before the first two-thirds of the list has been examined. It’s worthwhile to know how to do that.

We have to set $match when we want the value of the first matching element. We can’t just test $item at the end of the loop, because foreach automatically localizes the iterator variable and thereby prevents us from accessing the final loop value after the loop ends. See Recipe 4.5.

Here’s an example. Assume that @all_emps holds a list of Employee objects, sorted in descending order by salary. We wish to find the highest paid engineer, who will be the first engineer in the array. We only want to print the engineer’s name, so we want the value, not the index.

foreach $employee (@all_emps) {
    if ( $employee->category( ) eq 'engineer' ) {
        $top_engr = $employee;
        last;
    }
}
print "Highest paid engineer is: ", $highest_engineer->name( ), "\n";

When we’re searching and want only the index, we can save some code by remembering that $i will not be an acceptable array index if we don’t find a match. This mainly saves us code space, as not doing an assignment doesn’t really win much compared to the time spent testing list elements. It’s more obscure, because it tests if ($i < @ARRAY) to check whether we found a match, instead of the more obvious defined test in the previous solution.

for ($i = 0; $i < @ARRAY; $i++) {
    last if CRITERION;
}
if ($i < @ARRAY) {
    ## found and $i is the index
} else {
    ## not found
}

The first function from List::Util encapsulates the logic from an entire loop into a convenient, easy-to-use function. It acts just like a short-circuiting form of the built-in grep function that stops as soon as a match is found. While running, each list element is in a localized $_ variable. For example:

$first_odd = first { $_ % 2 =  = 1 } @ARRAY;

Or rewriting the previous employee loop:

$top_engr = first { $_->category( ) eq 'engineer' } @all_emps;

See Also

The “For Loops,” “Foreach Loops,” and “Loop Control” sections of perlsyn(1) and Chapter 4 of Programming Perl; the grep function in perlfunc(1) and Chapter 29 of Programming Perl

4.14. Finding All Elements in an Array Matching Certain Criteria

Problem

From a list, you want only the elements that match certain criteria.

This notion of extracting a subset of a larger list is common. It’s how you find all engineers in a list of employees, all users in the “staff” group, or all the filenames you’re interested in.

Solution

Use grep to apply a condition to all elements in the list and return only those for which the condition was true:

@MATCHING = grep { TEST ($_) } @LIST;

Discussion

This could also be accomplished with a foreach loop:

@matching = ( );
foreach (@list) {
    push(@matching, $_) if TEST ($_);
}

The Perl grep function is shorthand for all that looping and mucking about. It’s not really like the Unix grep command; it doesn’t have options to return line numbers or to negate the test, and it isn’t limited to regular-expression tests. For example, to filter out just the large numbers from an array or to find out which keys in a hash have very large values:

@bigs = grep { $_ > 1_000_000 } @nums;
@pigs = grep { $users{$_} > 1e7 } keys %users;

Here’s something that sets @matching to lines from the who command that start with "gnat “:

@matching = grep { /^gnat / } `who`;

Here’s another example:

@engineers = grep { $_->position( ) eq "Engineer" } @employees;

It extracts only those objects from the array @employees whose position method returns the string Engineer.

You could have even more complex tests in a grep:

@secondary_assistance = grep { $_->income >= 26_000 &&
                               $_->income <  30_000 }
                        @applicants;

But at that point you may decide it would be more legible to write a proper loop instead.

See Also

The “For Loops,” “Foreach Loops,” and “Loop Control” sections of perlsyn(1) and Chapter 4 of Programming Perl; the grep function in perlfunc(1) and Chapter 29 of Programming Perl; your system’s who(1) manpage, if it exists; Recipe 4.13

4.15. Sorting an Array Numerically

Problem

You want to sort a list of numbers, but Perl’s sort (by default) sorts in ASCII order.

Solution

Use Perl’s sort function and the <=> numerical comparison operator:

@sorted = sort { $a <=> $b } @unsorted;

Discussion

The sort function takes an optional code block, which lets you replace the default alphabetic comparison with your own subroutine. This comparison function is called each time sort has to compare two values. The values to compare are loaded into the special package variables $a and $b, which are automatically localized.

The comparison function should return a negative number if $a ought to appear before $b in the output list, 0 if they’re the same and their order doesn’t matter, or a positive number if $a ought to appear after $b. Perl has two operators that behave this way: <=> for sorting numbers in ascending numeric order, and cmp for sorting strings in ascending alphabetic order. By default, sort uses cmp-style comparisons.

Here’s code that sorts the list of PIDs in @pids, lets the user select one, then sends it a TERM signal followed by a KILL signal. We use a code block that compares $a to $b with <=> to sort numerically:

# @pids is an unsorted array of process IDs
foreach my $pid (sort { $a <=> $b } @pids) {
    print "$pid\n";
}
print "Select a process ID to kill:\n";
chomp ($pid = <>);
die "Exiting ... \n" unless $pid && $pid =~ /^\d+$/;
kill('TERM',$pid);
sleep 2;
kill('KILL',$pid);

If you use $a <=> $b or $a cmp $b, the list will be sorted in ascending order. For a descending sort, all we have to do is swap $a and $b in the sort subroutine:

@descending = sort { $b <=> $a } @unsorted;

Comparison routines must be consistent; that is, they should always return the same answer when called with the same values. Inconsistent comparison routines lead to infinite loops or core dumps, especially in older releases of Perl.

You can also say sort SUBNAME LIST where SUBNAME is the name of a comparison subroutine returning -1, 0, or +1. In the interests of speed, the normal calling conventions are bypassed, and the values to be compared magically appear for the duration of the subroutine in the global package variables $a and $b. Because of the odd way Perl calls this subroutine, it may not be recursive.

A word of warning: $a and $b are set in the package active in the call to sort, which may not be the same as the one that the SUBNAME function passed to sort was compiled in! For example:

package Sort_Subs;
sub revnum { $b <=> $a }

package Other_Pack;
@all = sort Sort_Subs::revnum 4, 19, 8, 3;

This will silently fail (unless you have -w in effect, in which case it will vocally fail) because the sort call sets the package variables $a and $b in its own package, Other_Pack, but the revnum function uses its own package’s versions. This is another reason why in-lining sort functions is easier, as in:

@all = sort { $b <=> $a } 4, 19, 8, 3;

For more on packages, see Chapter 10.

See Also

The cmp and <=> operators in perlop(1) and Chapter 3 of Programming Perl; the kill, sort, and sleep functions in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 4.16

4.16. Sorting a List by Computable Field

Problem

You want to sort a list by something more complex than a simple string or numeric comparison.

This is common when working with objects (“sort by the employee’s salary”) or complex data structures (“sort by the third element in the array that this is a reference to”). It’s also applicable when you want to sort by more than one key; for instance, sorting by birthday and then by name when multiple people share the same birthday.

Solution

Use the customizable comparison routine in sort:

@ordered = sort { compare( ) } @unordered;

You can speed this up by precomputing the field.

@precomputed = map { [compute( ),$_] } @unordered;
@ordered_precomputed = sort { $a->[0] <=> $b->[0] } @precomputed;
@ordered = map { $_->[1] } @ordered_precomputed;

And, finally, you can combine the three steps:

@ordered = map { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map { [compute( ), $_] }
           @unordered;

Discussion

The use of a comparison routine was explained in Recipe 4.15. As well as using built-in operators like <=>, you can construct more complex tests:

@ordered = sort { $a->name cmp $b->name } @employees;

You often see sort used like this in part of a foreach loop:

foreach $employee (sort { $a->name cmp $b->name } @employees) {
    print $employee->name, " earns \$", $employee->salary, "\n";
}

If you’re going to do a lot of work with elements in a particular order, it’s more efficient to sort once and work from that:

@sorted_employees = sort { $a->name cmp $b->name } @employees;
foreach $employee (@sorted_employees) {
    print $employee->name, " earns \$", $employee->salary, "\n";
}
# load %bonus
foreach $employee (@sorted_employees) {
    if ( $bonus{ $employee->ssn } ) {
      print $employee->name, " got a bonus!\n";
    }
}

We can put multiple comparisons in the routine and separate them with ||. || is a short-circuit operator: it returns the first true value it finds. This means we can sort by one kind of comparison, but if the elements are equal (the comparison returns 0), we can sort by another. This has the effect of a sort within a sort:

@sorted = sort { $a->name cmp $b->name
                           ||
                  $b->age <=> $a->age } @employees;

This first considers the names of the two employees to be compared. If they’re not equal, || stops and returns the result of the cmp (effectively sorting them in ascending order by name). If the names are equal, though, || keeps testing and returns the result of the <=> (sorting them in descending order by age). The result is a list that is sorted by name and by age within groups of the same name.

Let’s look at a real-life example of sorting. First we fetch all system users, represented as User::pwent objects. Then we sort them by name and print the sorted list:

use User::pwent qw(getpwent);
@users = ( );
# fetch all users
while (defined($user = getpwent)) {
    push(@users, $user);
}
@users = sort { $a->name cmp $b->name } @users;
foreach $user (@users) {
    print $user->name, "\n";
}

We can have more than simple comparisons, or combinations of simple comparisons. This code sorts a list of names by comparing the second letters of the names. It gets the second letters by using substr:

@sorted = sort { substr($a,1,1) cmp substr($b,1,1) } @names;

and here we sort by string length:

@sorted = sort { length $a <=> length $b } @strings;

The sort function calls the code block each time it needs to compare two elements, so the number of comparisons grows dramatically with the number of elements we’re sorting. Sorting 10 elements requires (on average) 46 comparisons, but sorting 1,000 elements requires 14,000 comparisons. A time-consuming operation like a split or a subroutine call for each comparison can easily make your program crawl.

Fortunately, we can remove this bottleneck by running the operation once per element prior to the sort. Use map to store the results of the operation in an array whose elements are anonymous arrays containing both the computed field and the original field. Then we sort this array of arrays on the precomputed field and use map to get the sorted original data. This map-sort-map concept is useful and common, so let’s look at it in depth.

Let’s apply map-sort-map to the sorting by string length example:

@temp   = map  {  [ length $_, $_ ]  } @strings;
@temp   = sort { $a->[0] <=> $b->[0] } @temp;
@sorted = map  {        $_->[1]      } @temp;

The first line creates a temporary array of strings and their lengths, using map. The second line sorts the temporary array by comparing the precomputed lengths. The third line turns the sorted temporary array of strings and lengths back into a sorted array of strings. This way, we calculate the length of each string only once.

Because the input to each line is the output of the previous line (the @temp array we make in line 1 is fed to sort in line 2, and that output is fed to map in line 3), we can combine it into one statement and eliminate the temporary array:

@sorted = map  { $_->[1] }
          sort { $a->[0] <=> $b->[0] }
          map  { [ length $_, $_ ] }
          @strings;

The operations now appear in reverse order. When you meet a map-sort-map, you should read it from the bottom up to determine the function:

@strings

The last part is the data to be sorted. Here it’s just an array, but later we’ll see that this can be a subroutine or even backticks. Anything that returns a list is fair game.

map

The map closer to the bottom builds the temporary list of anonymous arrays. This list contains the precomputed fields (length $_) and also records the original element ($_) by storing both in an anonymous array. Look at this map line to find out how the fields are computed.

sort

The sort line sorts the list of anonymous arrays by comparing the precomputed fields. It won’t tell you much, other than whether the list is sorted in ascending or descending order.

map

The map at the top of the statement turns the sorted list of anonymous arrays back into a list of the sorted original elements. It will generally be the same for every map-sort-map.

Here’s a more complicated example, which sorts by the first number that appears on each line in @fields:

@temp          = map  {    [ /(\d+)/, $_ ]  } @fields;
@sorted_temp   = sort { $a->[0] <=> $b->[0] } @temp;
@sorted_fields = map  {       $_->[1]       } @sorted_temp;

The regular expression mumbo jumbo in the first line extracts the first number from the line being processed by map. We use the regular expression /(\d+)/ in a list context to extract the number.

We can remove the temporary arrays in that code, giving us:

@sorted_fields = map  { $_->[1] }
                 sort { $a->[0] <=> $b->[0] }
                 map  { [ /(\d+)/, $_ ] }
                 @fields;

This final example compactly sorts colon-separated data, as from Unix’s passwd file. It sorts the file numerically by the fourth field (group id), then numerically by the third field (user id), and then alphabetically by the first field (username).

print map  { $_->[0] }             # whole line
      sort {
              $a->[1] <=> $b->[1]  # gid
                      ||
              $a->[2] <=> $b->[2]  # uid
                      ||
              $a->[3] cmp $b->[3]  # login
      }
      map  { [ $_, (split /:/)[3,2,0] ] }
      `cat /etc/passwd`;

See Also

The sort function in perlfunc(1) and Chapter 29 of Programming Perl; the cmp and <=> operators in perlop(1) and Chapter 3 of Programming Perl; Recipe 4.15

4.17. Implementing a Circular List

Problem

You want to create and manipulate a circular list.

Solution

Use unshift and pop (or push and shift) on a normal array.

Procedure

unshift(@circular, pop(@circular));  # the last shall be first
push(@circular, shift(@circular));   # and vice versa

Discussion

Circular lists are commonly used to repeatedly process things in order; for example, connections to a server. The code shown previously isn’t a true computer science circular list, with pointers and true circularity. Instead, the operations provide for moving the last element to the first position, and vice versa.

sub grab_and_rotate ( \@ ) {
    my $listref = shift;
    my $element = $listref->[0];
    push(@$listref, shift @$listref);
    return $element;
}

@processes = ( 1, 2, 3, 4, 5 );
while (1) {
    $process = grab_and_rotate(@processes);
    print "Handling process $process\n";
    sleep 1;
}

See Also

The unshift and push functions in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 13.13

4.18. Randomizing an Array

Problem

You want to randomly shuffle the elements of an array. The obvious application is writing a card game, where you must shuffle a deck of cards, but it is equally applicable to any situation where you want to treat elements of an array in a random order.

Solution

Use the shuffle function from the standard List::Util module, which returns the elements of its input list in a random order.

use List::Util qw(shuffle);
@array = shuffle(@array);

Discussion

Shuffling is a surprisingly tricky process. It’s easy to write a bad shuffle:

sub naive_shuffle {                             # DON'T DO THIS
    for (my $i = 0; $i < @_; $i++) {
        my $j = int rand @_;                    # pick random element
        ($_[$i], $_[$j]) = ($_[$j], $_[$i]);    # swap 'em
    }
}

This algorithm is biased; the list’s possible permutations don’t all have the same probability of being generated. The proof of this is simple: take the case where we’re passed a three-element list. We generate three random numbers, each of which can have three possible values, yielding 27 possible outcomes. There are only six permutations of the three-element list, though. Because 27 isn’t evenly divisible by 6, some outcomes are more likely than others.

The List::Util module’s shuffle function avoids this bias to produce a more randomly shuffled result.

If all you want to do is pick one random element from the array, use:

$value = $array[ int(rand(@array)) ];

See Also

The rand function in perlfunc(1) and Chapter 29 of Programming Perl; for more on random numbers, see Recipe 2.6, Recipe 2.7, and Recipe 2.8; Recipe 4.20 provides another way to select a random permutation

4.19. Program: words

Have you ever wondered how programs like ls generate columns of sorted output that you read down the columns instead of across the rows? For example:

awk      cp       ed       login    mount    rmdir    sum
basename csh      egrep    ls       mt       sed      sync
cat      date     fgrep    mail     mv       sh       tar
chgrp    dd       grep     mkdir    ps       sort     touch
chmod    df       kill     mknod    pwd      stty     vi
chown    echo     ln       more     rm       su

Example 4-2 does this.

Example 4-2. words
  #!/usr/bin/perl -w
  # words - gather lines, present in columns
  
  use strict;
  
  my ($item, $cols, $rows, $maxlen);
  my ($xpixel, $ypixel, $mask, @data);
  
  getwinsize( );
  
  # first gather up every line of input,
  # remembering the longest line length seen
  $maxlen = 1;        
  while (<>) {
      my $mylen;
      s/\s+$//;
      $maxlen = $mylen if (($mylen = length) > $maxlen);
      push(@data, $_);
  }
  
  $maxlen += 1;               # to make extra space
  
  # determine boundaries of screen
  $cols = int($cols / $maxlen) || 1;
  $rows = int(($#data+$cols) / $cols);
  
  # pre-create mask for faster computation
  $mask = sprintf("%%-%ds ", $maxlen-1);
  
  # subroutine to check whether at last item on line
  sub EOL { ($item+1) % $cols =  = 0 }  
  
  # now process each item, picking out proper piece for this position
  for ($item = 0; $item < $rows * $cols; $item++) {
      my $target =  ($item % $cols) * $rows + int($item/$cols);
      my $piece = sprintf($mask, $target < @data ? $data[$target] : "");
      $piece =~ s/\s+$// if EOL( );  # don't blank-pad to EOL
      print $piece;
      print "\n" if EOL( );
  }
  
  # finish up if needed
  print "\n" if EOL( );
  
  # not portable -- linux only
  sub getwinsize {
      my $winsize = "\0" x 8;
      my $TIOCGWINSZ = 0x40087468;
      if (ioctl(STDOUT, $TIOCGWINSZ, $winsize)) {
          ($rows, $cols, $xpixel, $ypixel) = unpack('S4', $winsize);
      } else {
          $cols = 80;
      }
  }

The most obvious way to print out a sorted list in columns is to print each element of the list, one at a time, padded out to a particular width. Then when you’re about to hit the end of the line, generate a newline. But that only works if you’re planning on reading each row from left to right. If you instead expect to read it down each column, this approach won’t do.

The words program is a filter that generates output going down the columns. It reads all input, keeping track of the length of the longest line seen. Once everything has been read in, it divides the screen width by the length of the longest input record seen, yielding the expected number of columns.

Then the program goes into a loop that executes once per input record, but the output order isn’t in the obvious order. Imagine you had a list of nine items:

Wrong       Right
-----       -----
1 2 3       1 4 7
4 5 6       2 5 8
7 8 9       3 6 9

The words program does the necessary calculations to print out elements (1,4,7) on one line, (2,5,8) on the next, and (3,6,9) on the last.

To figure out the current window size, this program does an ioctl call. This works fine—on the system it was written for. On any other system, it won’t work. If that’s good enough for you, then good for you. Recipe 12.17 shows how to find this on your system using the ioctl.ph file, or with a C program. Recipe 15.4 shows a more portable solution, but that requires installing a CPAN module.

See Also

Recipe 15.4

4.20. Program: permute

Have you ever wanted to generate all possible permutations of an array or to execute some code for every possible permutation? For example:

% echo man bites dog | permute
dog bites man
            bites dog man
            dog man bites
            man dog bites
            bites man dog
            man bites dog

The number of permutations of a set is the factorial of the size of the set. This number grows extremely fast, so you don’t want to run it on many permutations:

Set Size            Permutations
1                   1
2                   2
3                   6
4                   24
5                   120
6                   720
7                   5040
8                   40320
9                   362880
10                  3628800
11                  39916800
12                  479001600
13                  6227020800
14                  87178291200
15                  1307674368000

Doing something for each alternative takes a correspondingly large amount of time. In fact, factorial algorithms exceed the number of particles in the universe with very small inputs. The factorial of 500 is greater than ten raised to the thousandth power!

use Math::BigInt;
sub factorial {
    my $n = shift;
    my $s = 1;
    $s *= $n-- while $n > 0;
    return $s;
}
print factorial(Math::BigInt->new("500"));
+1220136... (1035 digits total)

The two solutions that follow differ in the order of the permutations they return.

The solution in Example 4-3 uses a classic list permutation algorithm used by Lisp hackers. It’s relatively straightforward but makes unnecessary copies. It’s also hardwired to do nothing but print out its permutations.

Example 4-3. tsc-permute
  #!/usr/bin/perl -n
  # tsc_permute: permute each word of input
  permute([split], [  ]);
  sub permute {
      my @items = @{ $_[0] };
      my @perms = @{ $_[1] };
      unless (@items) {
          print "@perms\n";
      } else {
          my (@newitems,@newperms,$i);
          foreach $i (0 .. $#items) {
              @newitems = @items;
              @newperms = @perms;
              unshift(@newperms, splice(@newitems, $i, 1));
              permute( \@newitems, \@newperms);
          }
      }
  }

The solution in Example 4-4, provided by Mark-Jason Dominus, is faster (by around 25%) and more elegant. Rather than precalculate all permutations, his code generates the n th particular permutation. It is elegant in two ways. First, it avoids recursion except to calculate the factorial, which the permutation algorithm proper does not use. Second, it generates a permutation of integers rather than permute the actual data set.

He also uses a time-saving technique called memoizing. The idea is that a function that always returns a particular answer when called with a particular argument memorizes that answer. That way, the next time it’s called with the same argument, no further calculations are required. The factorial function uses a private array @fact to remember previously calculated factorial values as described in Recipe 10.3. This technique is so useful that there’s a standard module that will handle the value caching for you. If you just had a regular factorial function that didn’t have its own caching, you could add caching to the existing function this way:

use Memoize;
memoize("factorial");

You call n2perm with two arguments: the permutation number to generate (from 0 to factorial(N), where N is the size of your array) and the subscript of the array’s last element. The n2perm function calculates directions for the permutation in the n2pat subroutine. Then it converts those directions into a permutation of integers in the pat2perm subroutine. The directions are a list like (0 2 0 1 0), which means: “Splice out the 0th element, then the second element from the remaining list, then the 0th element, then the first, then the 0th.”

Example 4-4. mjd-permute
  #!/usr/bin/perl -w
  # mjd_permute: permute each word of input
  use strict;
  sub factorial($);  # forward reference to declare prototype
  
  while (<>) {
      my @data = split;
      my $num_permutations = factorial(scalar @data);
      for (my $i=0; $i < $num_permutations; $i++) {
          my @permutation = @data[n2perm($i, $#data)];
          print "@permutation\n";
      }
  }
  
  # Utility function: factorial with memoizing
  BEGIN {
    my @fact = (1);
    sub factorial($) {
        my $n = shift;
        return $fact[$n] if defined $fact[$n];
        $fact[$n] = $n * factorial($n - 1);
    }
  }
  
  # n2pat($N, $len) : produce the $N-th pattern of length $len
  sub n2pat {
      my $i   = 1;
      my $N   = shift;
      my $len = shift;
      my @pat;
      while ($i <= $len + 1) {   # Should really be just while ($N) { ...
          push @pat, $N % $i;
          $N = int($N/$i);
          $i++;
      }
      return @pat;
  }
  
  # pat2perm(@pat) : turn pattern returned by n2pat( ) into
  # permutation of integers.  XXX: splice is already O(N)
  sub pat2perm {
      my @pat    = @_;
      my @source = (0 .. $#pat);
      my @perm;
      push @perm, splice(@source, (pop @pat), 1) while @pat;
      return @perm;
  }
  
  # n2perm($N, $len) : generate the Nth permutation of $len objects
  sub n2perm {
      pat2perm(n2pat(@_));
  }

See Also

unshift and splice in perlfunc(1) or Chapter 29 of Programming Perl; the sections discussing closures in perlsub(1) and perlref(1) and Chapter 8 of Programming Perl; Recipe 2.6; Recipe 10.3



[1] But why didn’t you use a hash then?

Get Perl Cookbook, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.