Chapter 4. Sequence Formats and Inheritance

This chapter applies concepts and techniques from previous chapters to a concrete project: handling sequence files. The chapter also introduces a few new techniques including a very important one called class inheritance. The code developed in this chapter will also be incorporated into later chapters.

Class inheritance allows you to define a new class by inheriting from other classes—altering or making additions as needed. It’s a style of software reuse that is particular to object-oriented design.

The first class developed in this chapter is a simple one: reading and writing files. Using inheritance, you can extend that class to a new one that can recognize, read, and write data in several different biological sequence datafile formats.

The goal is, as always, to learn enough about Perl to develop software for your own needs. The code in this chapter is designed with this goal in mind. In particular, the exercises at the end of the chapter will ask you to extend and improve the code in various ways.

Inheritance

You’ve seen the use of modules and how a Perl program can use all the code in a module by simply loading it with the use command. This is a simple and powerful method of software reuse , by which software can be written once but used many times by different programs.

You’ve also seen how object-oriented Perl defines classes in terms of modules, and how the use of classes, methods, and objects provides a more structured way to reuse Perl software.

There’s another way to reuse Perl classes. It’s possible for a class to inherit all the code and definitions of another base class . (This base class is sometimes called a superclass or a parent class .) The new derived class (a.k.a. subclass) can add more definitions or redefine certain definitions. Perl then automatically uses the definitions of everything in the old class (but treats them as if they were defined in this new derived class), unless it finds them first in the derived class.

In this chapter, I’ll first develop a class FileIO.pm, and then use the technique of inheritance to develop another class SeqFileIO.pm that inherits from FileIO.pm. This way of reusing software by inheritance is extremely convenient when writing object-oriented software. For instance, I make SeqFileIO do a lot of its work simply by inheriting the base class FileIO and then adding methods that handle sequence file formats. I could use the same base class FileIO to write a new class that specializes in handling HTML files, microarray datafiles, SNP database files, and so on. (See the exercises at the end of the chapter.)

When inheriting a class, it is sometimes necessary to do a bit more than just add new methods. In the SeqFileIO class, I add some attributes to the object, and as a result the hash %_attribute_properties also has to be changed. So in the new class I define a new hash with that name, and as a result the old definition from the base class is forgotten and the new, redefined hash is used. As you read the new class, compare it with the base class FileIO. Make note of what is new in the class (e.g., the various put methods), what is being redefined from the base class (e.g., the hash just mentioned), and what is being inherited from the base class (e.g., the new constructor.) This can help prepare you to write your own class that uses inheritance.

Occasionally, you want to invoke a method from a base class that has been overridden. You can use the special SUPER class for that purpose. I don’t use that in the code for this chapter, but you should be aware that it is possible to do.

FileIO.pm: A Class to Read and Write Files

Even though you can easily obtain excellent modules for reading and writing files, this chapter shows you how to build a simple one from scratch. One reason for doing this is to better understand the issues every bioinformatics programmer needs to face, such as how to organize files and keep track of their contents. Another reason is so you can see how to extend the class to deal with the multiple file format problem that is peculiar to bioinformatics.

It’s not uncommon for a biologist to use several different types of formats of files containing DNA or protein sequence data and translate from one format to another. Doing these translations by hand is very tedious. It’s also tedious to save alternate forms of the same sequence data in differently formatted files. You’ll see how to alleviate some of this pain by automating some of these tasks in a new class called SeqFileIO.pm.

Class inheritance is one of the main reasons why object-oriented software is so reusable. In order to see clearly how it works, let’s start with the simple class FileIO.pm and later use it to define a more complex class, SeqFileIO.pm.

FileIO is a simple class that reads and writes files, and stores simple information such as the file contents, date, and write permissions.

You know that it’s often possible to modify existing code to create your own program. When I wrote FileIO.pm, I simply made a copy of the Gene.pm module from Chapter 3 and modified it.

On my Linux system, I started by copying FileIO.pm from Gene.pm and giving it a new name:

cp Gene.pm FileIO.pm

I then edited the new file FileIO.pm changing the line near the top that says:

package Gene;

to:

package FileIO;

The filename must be the same as the class name, with an additional .pm.

Though I now needed to modify the module to do what I want, a surprising amount of the overall framework of the code—its constructor, accessor and mutator methods, and its basic data structures—remains the same. Gene.pm already contained such useful parts as a new constructor, a hash-based object data structure, accessor methods to retrieve values of the attributes of the object, and mutator methods to alter attribute values. These are likely to be needed by most classes that you’ll write in your own software projects.

Analysis of FileIO

Following is the code for FileIO, with commentary interspersed:

package FileIO;

#
# A simple IO class for sequence data files
#

use strict;
use warnings;
our $AUTOLOAD; # before Perl 5.6.0 say "use vars '$AUTOLOAD';"
use Carp;

# Class data and methods
{
    # A list of all attributes with defaults and read/write/required/noinit properties
    my %_attribute_properties = (
        _filename    => [ '',        'read.write.required'],
        _filedata    => [ [ ],       'read.write.noinit'],
        _date        => [ '',         'read.write.noinit'],
        _writemode   => [ '>',        'read.write.noinit'],
    );
        
    # Global variable to keep count of existing objects
    my $_count = 0;

    # Return a list of all attributes
    sub _all_attributes {
            keys %_attribute_properties;
    }

    # Check if a given property is set for a given attribute
    sub _permissions {
        my($self, $attribute, $permissions) = @_;
        $_attribute_properties{$attribute}[1] =~ /$permissions/;
    }

    # Return the default value for a given attribute
    sub _attribute_default {
            my($self, $attribute) = @_;
        $_attribute_properties{$attribute}[0];
    }

    # Manage the count of existing objects
    sub get_count {
        $_count;
    }
    sub _incr_count {
        ++$_count;
    }
    sub _decr_count {
        --$_count;
    }
}

In this first part of FileIO.pm file, the headers are exactly the same as in Gene.pm.

The opening block, which contains the class data and methods, also remains the same except for the hash %_attribute_properties. This new version of the hash has different attributes (the filename, the file data, the last modification date of the file, and the mode to use in writing a file) tailored to the needs of reading and writing files.

In addition to the read, write, and required properties, there is also a new “no initialization” (or noinit) property. An attribute with the noinit property may not be given an initial value when an object is created with a call to the new constructor. In this module, attributes such as data from the file, or the date on the file, are set only when the file is read or written. You may also have noticed that the default value for the filedata attribute is an anonymous array.

Note that each attribute has both read and write properties. This being the case, you can simply omit the listing of the properties. However, in the interest of future modification, when I may want to add some attribute that won’t have both properties, I’ve left in the specification of the two read and write properties. (Note that one method name, get_count, doesn’t start with an underscore; this encourages you to call this method to get a count of how many objects currently exist.)

The constructor method

You’ll notice in the following code that I have cut the new constructor down to the bare bones.

# The constructor method
# Called from class, e.g. $obj = FileIO->new(  );
sub new {
    my ($class, %arg) = @_;

    # Create a new object
    my $self = bless {  }, $class;

    $class->_incr_count(  );
    return $self;
}

Why did I do so? Read on.

The read method

The code continues with the read method:

# Called from object, e.g. $obj->read(  );
sub read {
    my ($self, %arg) = @_;

    # Set attributes
    foreach my $attribute ($self->_all_attributes(  )) {
        # E.g. attribute = "_filename",  argument = "filename"
        my($argument) = ($attribute =~ /^_(.*)/);

        # If explicitly given
        if (exists $arg{$argument}) {
            # If initialization is not allowed
            if($self->_permissions($attribute, 'noinit')) {
                croak("Cannot set $argument from read: use set_$argument");
            }
            $self->{$attribute} = $arg{$argument};
        # If not given, but required
        }elsif($self->_permissions($attribute, 'required')) {
            croak("No $argument attribute as required");
        # Set to the default
        }else{
            $self->{$attribute} = $self->_attribute_default($attribute);
        }
    }

    # Read file data
    unless( open( FileIOFH, $self->{_filename} ) ) {
        croak("Cannot open file " .  $self->{_filename} );
    }
    $self->{'_filedata'} = [ <FileIOFH> ];
    $self->{'_date'} = localtime((stat FileIOFH)[9]);
    close(FileIOFH);

}

This new read method has two parts. The first includes the initialization of the object’s attributes from the arguments and the defaults as specified in the %_attribute_properties hash. The second includes the reading of the file and the setting of the _filedata and _date attributes from the file’s contents and its last modification time.

The first loop in the program initializes the attributes. If an attribute is specified as an argument, the first test is to see if the noinit property is set. This forbids initializing the attribute, in which case the program croaks. Otherwise, the attribute is set.

If the attribute isn’t passed as an argument but has a required property (only the _filename attribute has the required property), the program croaks.

Finally, if the argument isn’t given and not required, the attribute is set to the default value.

After performing those initializations, the read method reads in the specified file. If it can’t open the file, the program croaks. (See the exercises for a discussion of this use of croak.)

The file is read by the line:

$self->{'_filedata'} = [ <FileIOFH> ];

In list context, the input operator on the opened filehandle, which is given by <FileIOFH> reads in the entire file. This is done within an anonymous array, as determined by the square brackets around the input operator angle brackets. A reference to this anonymous array containing the file’s contents is then assigned to the _filedata attribute.

stat and localtime functions

Finally, the Perl stat and localtime functions are called to generate a string with the file’s last modification time, which is assigned to the object attribute _date.

This method of reading a file makes many choices. For instance, the stat command returns an array with many more items of interest about a file, such as its size, owner, access permission modes, and so on (the tenth item of which is the modification time). As you develop your programs, you should be paying attention to details such as whether you need to save some of these additional attributes of a file, the last modification date, or notes about the kind of data in the file.

The next line of code in the program:

#
# N.B. no "clone" method is necessary
#

is yet another choice to think about. Are there occasions when cloning a file object makes sense? Maybe I’d like to clone a file object, make some small change to the data, give it a new filename, and write it out. Why have I left this out?

The write method

The code continues:

# Write files
# Called from object, e.g. $obj->write(  );
sub write {
    my ($self, %arg) = @_;

    foreach my $attribute ($self->_all_attributes(  )) {
        # E.g. attribute = "_filename",  argument = "filename"
        my($argument) = ($attribute =~ /^_(.*)/);

        # If explicitly given
        if (exists $arg{$argument}) {
            $self->{$attribute} = $arg{$argument};
        }
    }
    
    unless( open( FileIOFH, $self->get_writemode . $self->get_filename ) ) {
        croak("Cannot write to file " .  $self->get_filename);
    }
    unless( print FileIOFH $self->get_filedata ) {
        croak("Cannot write to file " .  $self->get_filename);
    }
    $self->set_date(scalar localtime((stat FileIOFH)[9]));
    close(FileIOFH);

    return 1;
}

The write method handles writing a file object out to an actual file. First, all arguments corresponding to attributes are set as requested. The file is then opened for writing, using the _writemode attribute to specify; for example, > for truncating the file before writing or >> for appending to the file. The print FileIOFH statement actually does the writing to the opened FileIOFH filehandle, retrieving the file data from the object with the get_filedata method defined by means of AUTOLOAD. Finally, the object’s _date attribute is reset to the new modification time.

AUTOLOAD

The next section of code is the AUTOLOAD method itself:

# This takes the place of such accessor definitions as:
#  sub get_attribute { ... }
# and of such mutator definitions as:
#  sub set_attribute { ... }
sub AUTOLOAD {
    my ($self, $newvalue) = @_;

    my ($operation, $attribute) = ($AUTOLOAD =~ /(get|set)(_\w+)$/);
    
    # Is this a legal method name?
    unless($operation && $attribute) {
        croak "Method name '$AUTOLOAD' is not in the recognized form\n";
    }
    unless(exists $self->{$attribute}) {
        croak "No such attribute '$attribute' exists in the class ", ref($self);
    }

    # AUTOLOAD accessors
    if($operation eq 'get') {
        unless($self->_permissions($attribute, 'read')) {
            croak "$attribute does not have read permission";
        }

        # Turn off strict references to enable symbol table manipulation
        no strict "refs";
        # Install this accessor definition in the symbol table
        *{$AUTOLOAD} = sub {
            my ($self) = @_;
            unless($self->_permissions($attribute, 'read')) {
                croak "$attribute does not have read permission";
            }
            if(ref($self->{$attribute}) eq 'ARRAY') {
                return @{$self->{$attribute}};
            }else{
                return $self->{$attribute};
            }
        };
        # Turn strict references back on
        no strict "refs";

        # Return the attribute value
        # The attribute could be a scalar or a reference to an array
        if(ref($self->{$attribute}) eq 'ARRAY') {
            return @{$self->{$attribute}};
        }else{
            return $self->{$attribute};
        }

    # AUTOLOAD mutators
    }elsif($operation eq 'set') {
        unless($self->_permissions($attribute, 'write')) {
            croak "$attribute does not have write permission";
        }

        # Turn off strict references to enable symbol table manipulation
        no strict "refs";
        # Install this mutator definition in the symbol table
        *{$AUTOLOAD} = sub {
               my ($self, $newvalue) = @_;
            unless($self->_permissions($attribute, 'write')) {
                croak "$attribute does not have write permission";
            }
            $self->{$attribute} = $newvalue;
        };
        # Turn strict references back on
        no strict "refs";

        # Set and return the attribute value
        $self->{$attribute} = $newvalue;
        return $self->{$attribute};
    }
}

This AUTOLOAD method has grown! There’s only one difference, however, between this code and the AUTOLOAD code for the Gene.pm class. The new set of attributes for FileIO.pm don’t all take simple scalar values, as was the case with Gene.pm. Another attribute, _filedata, is a reference to an anonymous array. In order for the accessors to return the correct data, they must check to see if an attribute is a scalar or a reference to an array; the accessors can then dereference and return the data from the method call.

So the accessors, and the definitions of them installed into the symbol table, test for an array reference and dereference it accordingly. Other than that, this AUTOLOAD method is exactly the same as that defined for Gene.pm.

You may also have noticed that sections of code in the AUTOLOAD method are almost identical to each other. Recall that AUTOLOAD is invoked when a method with no subroutine defining it is called. AUTOLOAD must do two things. First, it performs whatever method is requested; for example, if an accessor method is requested, it returns the appropriate value. Second, it defines the subroutine that implements the requested method and installs it in the symbol table so the next time the method is called, AUTOLOAD and its considerable overhead won’t be necessary. Because of these parameters, the code AUTOLOAD executes to handle the requested method is nearly identical with the method that AUTOLOAD also defines.

Finally, here are the last sections of the FileIO.pm program:

# When an object is no longer being used, this will be automatically called
# and will adjust the count of existing objects
sub DESTROY {
    my($self) = @_;
    $self->_decr_count(  );
}

# Other methods. They do not fall into the same form as the majority handled by AUTOLOAD
#

1;

The only change here is that there are no other methods (Gene.pm had a citation method).

Finishing FileIO

To finish FileIO.pm, here’s some very terse (too terse for anything but a textbook) POD documentation:

=head1 FileIO

FileIO: read and write file data

=head1 Synopsis

    use FileIO;

    my $obj = RawfileIO->read(
        filename => 'jkl'
    );

    print $obj->get_filename, "\n";
    print $obj->get_filedata;

    $obj->set_date('today');
    print $obj->get_date, "\n";

    print $obj->get_writemode, "\n";

    my @newdata = ("line1\n", "line2\n");
    $obj->set_filedata( \@newdata );

    $obj->write(filename => 'lkj');
    $obj->write(filename => 'lkj', writemode => '>>');

    my $o = RawfileIO->read(filename => 'lkj');
    print $o->get_filename, "\n";
    print $o->get_filedata;

    my $gene1 = Gene->new(
        name => 'biggene',
        organism => 'Mus musculus',
        chromosome => '2p',
        pdbref => 'pdb5775.ent',
        author => 'L.G.Jeho',
        date => 'August 23, 1989',
    );

    print "Gene name is ", $gene1->get_name(  );
    print "Gene organism is ", $gene1->_get_organism(  );
    print "Gene chromosome is ", $gene1->_get_chromosome(  );
    print "Gene pdbref is ", $gene1->_get_pdbref(  );
    print "Gene author is ", $gene1->_get_author(  );
    print "Gene date is ", $gene1->_get_date(  );

    $clone = $gene1->clone(name => 'biggeneclone');

    $gene1-> set_chromosome('2q');
    $gene1-> set_pdbref('pdb7557.ent');
    $gene1-> set_author('G.Mendel');
    $gene1-> set_date('May 25, 1865');

    $clone->citation('T. Morgan', 'October 3, 1912');

    print "Clone citation is ", $clone->citation;

=head1 AUTHOR

James Tisdall

=head1 COPYRIGHT

Copyright (c) 2003, James Tisdall

=cut

Testing the FileIO Class Module

Now that we’ve got a class module, complete with examples of its use, let’s write a small test program and see how it works. Since the examples in the documentation are, in effect, a small test program, let’s try running it. We’ll use the file file1.txt I created with my text editor that contains:

> sample dna  (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
acacctgagccactctcagatgaggaccta

I’ll take the code from the documentation pretty much as is, just adding strict and warnings. I’ll also include a use lib directive that adds my development library directory to the list of directories in @INC, which tells my computer’s Perl where to look for modules. (Recall that you can either edit this line, override it with the PERL5LIB environmental variable, or give your own directory on the command line.) I also add a few print statements to make the output easier to read:

#!/usr/bin/perl

use strict;
use warnings;
use lib "/home/tisdall/MasteringPerlBio/development/lib";

use FileIO;

my $obj = FileIO->new(  );

$obj->read(
  filename => 'file1.txt'
);

print "The file name is ", $obj->get_filename, "\n";
print "The contents of the file are:\n", $obj->get_filedata;
print "\nThe date of the file is ", $obj->get_date, "\n";

$obj->set_date('today');
print "The reset date of the file is ", $obj->get_date, "\n";

print "The write mode of the file is ", $obj->get_writemode, "\n";

print "\nResetting the data and filename\n";
my @newdata = ("line1\n", "line2\n");
$obj->set_filedata( \@newdata );

print "Writing a new file \"file2\"\n";
$obj->write(filename => 'file2');

print "Appending to the new file \"file2\"\n";
$obj->write(filename => 'file2', writemode => '>>');

print "Reading and printing the data from \"file2\":\n";

my $file2 = FileIO->new(  );

$file2->read(
  filename => 'file2'
);

print "The file name is ", $file2->get_filename, "\n";
print "The contents of the file are:\n", $file2->get_filedata;

I finally run the test program to get the following output:

The file name is file1.txt
The contents of the file are:
> sample dna  (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
acacctgagccactctcagatgaggaccta

The date of the file is Thu Dec  5 11:22:56 2002
The reset date of the file is today
The write mode of the file is >

Resetting the data and filename
Writing a new file "file2"
Appending to the new file "file2"
Reading and printing the data from "file2":
The file name is file2
The contents of the file are:
line1
line2
line1
line2

The module seems to be performing as hoped. So, now we have a simple module that reads and writes files and provides a few options for the write mode.

But, frankly, this isn’t too impressive. You’ve already been reading and writing files in Perl without the overhead of this FileIO module. The interface to the code is nice, and it’s good to have objects that contain the file data, but what has really been accomplished?

The real power of this approach is coming up next. Using class inheritance, this simple module can be extended relatively easily in a very useful direction.

It’s another case of the basic software engineering approach of making small, simple, generally useful tools, and then combining them into more powerful and specific applications. So, next, I’ll take my simple FileIO class and use it as a base class for a bioinformatics-specific class.

SeqFileIO.pm: Sequence File Formats

Our primary interest is bioinformatics.Can we extend the FileIO class to handle biological sequence datafiles? For example, can a class be written that takes a GenBank file and writes the sequence out in FASTA format?

Using the technique of inheritance, in this section I present a module for a new class SeqFileIO that performs several basic functions on sequence files of various formats. When you call this module’s read method, in addition to reading the file’s contents and setting the name, date, and write mode of the file, it automatically determines the format of the sequence file, extracts the sequence, and when available, extracts the annotation, ID, and accession number. In addition, a set of put methods makes it easy to present the sequence and annotation in other formats.[11]

Analysis of SeqFileIO.pm

The first part of the module SeqFileIO.pm contains the block with definitions of the new, or revised, class data and methods.

The first thing you should notice is the use command:

use base ( "FileIO" );

This Perl command tells the current package SeqFileIO it’s inheriting from the base class FileIO. Here’s another statement that’s often used for this purpose:

@ISA = ( "FileIO" );

The @ISA predefined variable tells a package that it “is a” version of some base class; it then can inherit methods from that base class. The use base directive sets the @ISA array to the base class(es), plus a little else besides. (Check perldoc base for the whole story.) Without getting bogged down in details use base works a little more robustly than just setting the @ISA array, so that’s what I’ll use here:

package SeqFileIO;

use base ( "FileIO" );

use strict;
use warnings;
#use vars '$AUTOLOAD';
use Carp;

# Class data and methods
{
    # A list of all attributes with defaults and read/write/required/noinit properties
    my %_attribute_properties = (
        _filename    => [ '',   'read.write.required'],
        _filedata    => [ [ ],  'read.write.noinit'],
        _date        => [ '',   'read.write.noinit'],
        _writemode   => [ '>',  'read.write.noinit'],
        _format      => [ '',   'read.write'],
        _sequence    => [ '',   'read.write'],
        _header      => [ '',   'read.write'],
        _id          => [ '',   'read.write'],
        _accession   => [ '',   'read.write'],
    );
        
    # Return a list of all attributes
    sub _all_attributes {
            keys %_attribute_properties;
    }

    # Check if a given property is set for a given attribute
    sub _permissions {
        my($self, $attribute, $permissions) = @_;
        $_attribute_properties{$attribute}[1] =~ /$permissions/;
    }

    # Return the default value for a given attribute
    sub _attribute_default {
            my($self, $attribute) = @_;
        $_attribute_properties{$attribute}[0];
    }

    my @_seqfileformats = qw(
        _raw
        _embl
        _fasta
        _gcg
        _genbank
        _pir
        _staden
    );
    sub isformat {
         my($self) = @_;

        for my $format (@_seqfileformats) {
            my $is_format = "is$format";

            if($self->$is_format) {
                return $format;
            }
        }
        return 'unknown';
    }

}

The power of inheritance

A comparison with the opening block in the base class FileIO is instructive. You’ll see that I’m redefining the %_attribute_properties hash and the methods in the block that access the hash as closures. This is necessary to extend the new class to handle new attributes that relate specifically to sequence datafiles:

_format

The format of the sequence datafile, such as FASTA or GenBank

_sequence

The sequence data extracted from the sequence datafile as a scalar string

_header

The header part of the annotation; defined somewhat loosely in this module

_id

The ID of the sequence datafile, such as a gene name or other identifier

_accession

The accession number of the sequence datafile, when provided

You’ll notice, in comparing this opening block with the opening block of the base class FileIO, that the methods and variables relating to counting the number of objects has been omitted in this new class.

Here’s where you see the power of inheritance for the first time. When the code in the new SeqFileIO.pm module tries to call, say, the get_count method, and Perl sees that it’s not defined in the module, it will go on a hunt for the method in the base class and use the definition it finds there. On the other hand, if it finds the method in SeqFileIO.pm, it just uses that; if the get_count method appeared in the base class but is redefined in SeqFileIO.pm, it uses that redefinition and ignores the older definition in the base class.

So, you don’t have to rewrite methods in the base class(es); you can just call them. Perl not only finds them, it also calls them as if they were defined in the new class. For example, ref($self) returns SeqFileIO, not FileIO, without regard to whether the method is defined in the new class or inherited from the old class.

Finally, there are some new definitions in this new version of the opening block. An array @_seqfileformats lists the sequence file formats the module knows about, and a method isformat calls the methods associated with each format (such as is_fasta defined later in the module) until it either finds what format the file is in or returns unknown.

The @_seqfileformats array uses the qw operator. This splits the words on whitespace (including newlines) and returns a list of quoted words. It’s a convenience for giving lists of words without having to quote each one or separate them by commas. (Check the perlop manpage for the section on “RegExp Quote-Like Operators” for all the variations and alternative quoting operators.)

Following the opening block, notice that there is no new method defined in this class. The simple, bare-bones new method from the FileIO class serves just as well for this new class; thanks to the inheritance mechanism, there’s no need to write a new one. A program that calls SeqFileIO->new will use the same method from the FileIO class, but the class name will be properly set to SeqFileIO for the new object created.

A new read method

The next part of the program has a rewrite of the read method. As a result, the read method from the parent FileIO class is being overridden:

# Called from object, e.g. $obj->read(  );
sub read {
    my ($self, %arg) = @_;

    # Set attributes
    foreach my $attribute ($self->_all_attributes(  )) {
        # E.g. attribute = "_filename",  argument = "filename"
        my($argument) = ($attribute =~ /^_(.*)/);

        # If explicitly given
        if (exists $arg{$argument}) {
            # If initialization is not allowed
            if($self->_permissions($attribute, 'noinit')) {
                croak("Cannot set $argument from read: use set_$argument");
            }
            $self->{$attribute} = $arg{$argument};
        # If not given, but required
        }elsif($self->_permissions($attribute, 'required')) {
            croak("No $argument attribute as required");
        # Set to the default
        }else{
            $self->{$attribute} = $self->_attribute_default($attribute);
        }
    }

    # Read file data
    unless( open( FileIOFH, $self->{_filename} ) ) {
        croak("Cannot open file " .  $self->{_filename} );
    }
    $self->{'_filedata'} = [ <FileIOFH> ];
    $self->{'_date'} = localtime((stat FileIOFH)[9]);
    $self->{'_format'} = $self->isformat;
    my $parsemethod = 'parse' . $self->{'_format'};
    $self->$parsemethod;

    close(FileIOFH);
}

The new read method just shown is almost exactly the same as the read function in the FileIO class. There are only three new lines of code, just before the end of the subroutine, preceding the call to the close function:

$self->{'_format'} = $self->isformat;
my $parsemethod = 'parse' . $self->{'_format'};
$self->$parsemethod;

These three new lines initialize the new attributes that were defined for this class. First, a call to isformat determines the format of the file and sets the _format attribute appropriately. The appropriate parse_ method name is then constructed, and finally, a call to that method is made. As you’re about to see, the parse_ methods extract the sequence, header, ID, and accession number (or as many of these as possible) from the file data and set the object’s attributes accordingly.

So, by the end of the read method, the file data has been read into the object, the format determined, and the interesting parts of the data (such as the sequence data) parsed out.

New Methods: is, parse, and put

The rest of the module consists of three groups of methods that handle the different sequence file formats. These methods didn’t appear in the more simple and generic FileIO module and must be defined here:

is_

Tests to see if data is in a particular sequence file format

parse_

Parses out the sequence, and when possible the header, ID, and accession number

put_

Takes the sequence attribute, and when available the header, ID, and accession number attributes, and writes them in the sequence file format named in the format attribute

is_ methods

The first group of methods tests an object’s file data to see if it conforms to a particular sequence file format:

sub is_raw {
  my($self) = @_;

  my $seq = join('', @{$self->{_filedata}} );
  ($seq =~ /^[ACGNT\s]+$/) ? return 'raw' : 0;
}

sub is_embl {
  my($self) = @_;

  my($begin,$seq,$end) = (0,0,0);

  foreach( @{$self->{_filedata}} ) {
    /^ID\s/ && $begin++;
    /^SQ\s/ && $seq++;
    /^\/\// && $end++;
    (($begin =  = 1) && ($seq =  = 1) && ($end =  = 1)) && return 'embl';
  }
  return;
}

sub is_fasta {
  my($self) = @_;

  my($flag) = 0;

  for(@{$self->{_filedata}}) {
    #This to avoid confusion with Primer, which can have input beginning ">"
    /^\*seq.*:/i && ($flag = 0) && last;
    if( /^>/ && $flag =  = 1) {
      last;
    }elsif( /^>/ && $flag =  = 0) {
      $flag = 1;
    }elsif( (! /^>/) && $flag =  = 0) { #first line must start with ">"
      last;
    }
  }
  $flag ? return 'fasta' : return;
}

sub is_gcg {
  my($self) = @_;

  my($i,$j) = (0,0);

  for(@{$self->{_filedata}}) {
    /^\s*$/ && next;
    /Length:.*Check:/ && ($i += 1);
    /^\s*\d+\s*[a-zA-Z\s]+/ && ($j += 1);
    ($i =  = 1) && ($j =  = 1) && return('gcg');
  }
  return;
}


sub is_genbank {
  my($self) = @_;

  my $Features = 0;

  for(@{$self->{_filedata}}) {
    /^LOCUS/ && ($Features += 1);
    /^DEFINITION/ && ($Features += 2);
    /^ACCESSION/ && ($Features += 4);
    /^ORIGIN/ &&  ($Features += 8);
    /^\/\// && ($Features += 16);
    ($Features =  = 31) && return 'genbank';
  }
  return;
}

sub is_pir {
  my($self) = @_;

  my($ent,$ti,$date,$org,$ref,$sum,$seq,$end) = (0,0,0,0,0,0,0,0);

  for(@{$self->{_filedata}}) {
    /ENTRY/ && $ent++;
    /TITLE/ && $ti++;
    /DATE/ && $date++;
    /ORGANISM/ && $org++;
    /REFERENCE/ && $ref++;
    /SUMMARY/ && $sum++;
    /SEQUENCE/ && $seq++;
    /\/\/\// && $end++;
    $ent =  = 1 && $ti =  = 1 && $date >= 1 && $org >= 1
      && $ref >= 1 && $sum =  = 1 && $seq =  = 1 && $end =  = 1
      && return 'pir';
  }
  return;
}

sub is_staden {
  my($self) = @_;
  for(@{$self->{_filedata}}) {
    /<-+([^-]*)-+>/ && return 'staden';
  }
  0;
}

is_ methods are designed to be fast. They don’t check to see if the file conforms to the official file format in every detail. Instead, they look to see if, given that the file is supposed to be a sequence file, there is a good indication that it is a particular file format. In other words, these methods perform heuristic, not rigorous, tests. If they are well conceived, these methods correctly identify the different formats without confusion and with a minimum of code and computation. However, they don’t ensure that a format is conforming in every respect to the official format definition. (See the exercises for other approaches to file format recognition.)

Also, note that these are fairly simple tests; for example, the is_raw method doesn’t check for the IUB ambiguity codes but only the four bases A, C, G, T, plus N for an undetermined base. Furthermore, some software systems have more than one format, for which only one format is shown here, such as PIR and GCG. The bottom line is that this code does what it is intended to do, but not more. It is, however, fairly short and easy to modify and extend, and I encourage you to try your hand at doing so in the exercises.

An interesting Perl note: the “leaning toothpick” regular expression notation for three forward slashes /// is a bit forbidding because each forward slash needs a backslash escape in front of it:

    /\/\/\// && $end++;

An alternative would be to use m and a separator other than /, which leads to more readable code:

    m!///! && $end++;

One more interesting Perl note: in this code, I return a false value from a subroutine like so:

return;

This is a bit confusing because unless you already know or check in the documentation for return, it’s not clear what’s happening. In a scalar context, return; returns the undefined value undef; in a list context, return; returns an empty list. So, no matter what context the subroutine is called from, a valid value is returned.

put_ methods

The next group of put_ methods returns an object’s sequence and annotation data in a particular sequence file format:

sub put_raw {
  my($self) = @_;

  my($out);
  ($out = $self->{_sequence}) =~ tr/a-z/A-Z/;
  return($out);
}

sub put_embl {
  my($self) = @_;

  my(@out,$tmp,$len,$i,$j,$a,$c,$g,$t,$o);

  $len = length($self->{_sequence});
  $a=($self->{_sequence} =~ tr/Aa//);
  $c=($self->{_sequence} =~ tr/Cc//);
  $g=($self->{_sequence} =~ tr/Gg//);
  $t=($self->{_sequence} =~ tr/Tt//);
  $o=($len - $a - $c - $g - $t);
  $i=0;
  $out[$i++] = sprintf("ID   %s %s\n",$self->{_header}, $self->{_id} );
  $out[$i++] = "XX\n";
  $out[$i++] = sprintf("SQ   sequence %d BP; %d A; %d C; %d G; %d T; %d other;\n",
              $len, $a, $c, $g, $t, $o);
  for($j = 0 ; $j < $len ; ) {
      $out[$i] .= sprintf("%s",substr($self->{_sequence},$j,10));
      $j += 10;
      if( $j < $len && $j % 60 != 0 ) {
        $out[$i] .= " ";
      }elsif ($j % 60 =  = 0 ) {
        $out[$i++] .= "\n";
      }
  }
  if($j % 60 != 0 ) {
    $out[$i++] .= "\n";
  }
  $out[$i] = "//\n";
  return @out;
}

sub put_fasta {
  my($self) = @_;

  my(@out,$len,$i,$j);

  $len = length($self->{_sequence});
  $i = 0;
  $out[$i++] = "> " . $self->{_header} . "\n";
  for($j=0; $j<$len ; $j += 50) {
    $out[$i++]=sprintf("%.50s\n",substr($self->{_sequence},$j,50));
  }
  return @out;
}

sub put_gcg {
  my($self) = @_;

  my(@out,$len,$i,$j,$cnt,$sum);
  $len = length($self->{_sequence});

  #calculate Checksum
  for($i=0; $i<$len ;$i++) {
    $cnt++;
    $sum += $cnt * ord(substr($self->{_sequence},$i,1));
    ($cnt =  = 57)&& ($cnt=0);
  }
  $sum %= 10000;

  $i = 0;             
  $out[$i++] = sprintf("%s\n",$self->{_header});
  $out[$i++] = sprintf("    %s Length: %d (today)  Check: %d  ..\n", $self->{_id},
                      $len, $sum);
  for($j = 0 ; $j < $len ; ) {
      if( $j % 50 =  = 0) {
        $out[$i] = sprintf("%8d  ",$j+1);
      }
      $out[$i] .= sprintf("%s",substr($self->{_sequence},$j,10));
      $j += 10;
      if( $j < $len && $j % 50 != 0 ) {
        $out[$i] .= " ";
      }elsif ($j % 50 =  = 0 ) {
        $out[$i++] .= "\n";
      }
  }
  if($j % 50 != 0 ) {
    $out[$i] .= "\n";
  }
  return @out;
}

sub put_genbank {
  my($self) = @_;

  my(@out,$len,$i,$j,$cnt,$sum);
  my($seq) = $self->{_sequence};

  $seq =~ tr/A-Z/a-z/;
  $len = length($seq);
  for($i=0; $i<$len ;$i++) {
    $cnt++;
    $sum += $cnt * ord(substr($seq,$i,1));
    ($cnt =  = 57) && ($cnt=0);
  }
  $sum %= 10000;
  $i = 0;
  $out[$i++] = sprintf("LOCUS       %s       %d bp\n",$self->{_id}, $len);
  $out[$i++] = sprintf("DEFINITION  %s , %d bases, %d sum.\n", $self->{_header},
                       $len, $sum);
  $out[$i++] = sprintf("ACCESSION  %s\n", $self->{_accession}, );
  $out[$i++] = sprintf("ORIGIN\n");
  for($j = 0 ; $j < $len ; ) {
      if( $j % 60 =  = 0) {
        $out[$i] = sprintf("%8d  ",$j+1);
      }
      $out[$i] .= sprintf("%s",substr($seq,$j,10));
      $j += 10;
      if( $j < $len && $j % 60 != 0 ) {
        $out[$i] .= " ";
      }elsif($j % 60 =  = 0 ) {
        $out[$i++] .= "\n";
      }
  }
  if($j % 60 != 0 ) {
    $out[$i] .= "\n";
    ++i;
  }
  $out[$i] = "//\n";
  return @out;
}

sub put_pir {
  my($self) = @_;

  my($seq) = $self->{_sequence};
  my(@out,$len,$i,$j,$cnt,$sum);
  $len = length($seq);
  for($i=0; $i<$len ;$i++) {
    $cnt++;
    $sum += $cnt * ord(substr($seq,$i,1));
    ($cnt=  =57) && ($cnt=0);
  }
  $sum %= 10000;
  $i = 0;
  $out[$i++] = sprintf("ENTRY           %s\n",$self->{_id});
  $out[$i++] = sprintf("TITLE           %s\n",$self->{_header});
  #JDT ACCESSION out if defined
  $out[$i++] = sprintf("DATE            %s\n",'');
  $out[$i++] = sprintf("REFERENCE       %s\n",'');
  $out[$i++] = sprintf("SUMMARY #Molecular-weight %d  #Length %d  #Checksum %d\
                      n",0,$len,$sum);
  $out[$i++] = sprintf("SEQUENCE\n");
  $out[$i++] = sprintf("              5      10      15      20      25      30\n");
  for($j=1; $seq && $j < $len ; $j += 30) {
    $out[$i++] = sprintf("%7d ",$j);
    $out[$i++] = sprintf("%s\n", join(' ',split(//,substr($seq, $j - 1,length($seq) 
                      < 30 ? length($seq) : 30))) );
  }
  $out[$i++] = sprintf("///\n");
  return @out;
}


sub put_staden {
  my($self) = @_;

  my($seq) = $self->{_sequence};
  my($i,$j,$len,@out);

  $i = 0;
  $len = length($self->{_sequence});
  $out[$i] = ";\<------------------\>\n";
  substr($out[$i],int((20-length($self->{_id}))/2),length($self->{_id})) = $self->
                          {_id};
  $i++;
  for($j=0; $j<$len ; $j+=60) {
    $out[$i++]=sprintf("%s\n",substr($self->{_sequence},$j,60));
  }
  return @out;
}

The put_ methods are, by necessity, more cognizant of the detailed rules of a particular file format.

Note the Perl function ord in put_gcg. This built-in function gives the numeric value of a character. Other Perl functions such as sprintf and substr can be reviewed as necessary by typing perldoc -f substr (for example) or by visiting the http://www.perdoc.com web page.

parse_ methods

parse_ , the third and final group of methods, parses the contents of a file in a particular format, extracting the sequence and, if possible, some additional header information.

sub parse_raw {
  my($self) = @_;

## Header and ID should be set in calling program after this 
  my($seq);
  
  $seq = join('',@{$self->{_filedata}});
  if( ($seq =~ /^([acgntACGNT\-\s]+)$/)) {
    ($self->{_sequence} = $seq) =~ s/\s//g;
  }else{
    carp("parse_raw failed");
  }
}


sub parse_embl {
  my($self) = @_;

  my($begin,$seq,$end,$count) = (0,0,0,0);
  my($sequence,$head,$acc,$id);

  for(@{$self->{_filedata}}) {
    ++$count;
    if(/^ID/) {
      $begin++;
      /^ID\s*(.*\S)\s*/ && ($id = ($head = $1)) =~ s/\s.*//;
    }elsif(/^SQ\s/){
      $seq++;
    }elsif(/^\/\//){
      $end++;
    }elsif($seq =  = 1){
      $sequence .= $_;
    }elsif(/^AC\s*(.*(;|\S)).*/){ #put this here - AC could be sequence
      $acc .= $1;
    }
    if($begin =  = 1 && $seq =  = 1 && $end =  = 1) {
      $sequence =~ tr/a-zA-Z//cd;
      $sequence =~ tr/a-z/A-Z/;
      $self->{_sequence} = $sequence;
      $self->{_header} = $head; 
      $self->{_id} = $id;
      $self->{_accession} = $acc;
      return 1;
    }
  }
  return;
}

sub parse_fasta {
  my($self) = @_;

  my($flag,$count) = (0,0);
  my($seq,$head,$id);

  for(@{$self->{_filedata}}) {
    #avoid confusion with Primer, which can have input beginning ">"
    /^\*seq.*:/i && ($flag =  = 0) && last;

    if(/^>/ && $flag =  = 1) { 
      last;
    }elsif(/^>/ && $flag =  = 0){
      /^>\s*(.*\S)\s*/ && ($id=($head=$1)) =~ s/\s.*//;
      $flag=1;
    }elsif( (! /^>/) && $flag =  = 1) { 
      $seq .= $_;
    }elsif( (! /^>/) && $flag =  = 0) {
       last;
    }
    ++$count;
  }
  if($flag) {
    $seq =~ tr/a-zA-Z-//cd;
    $seq =~ tr/a-z/A-Z/;

    $self->{_sequence} = $seq;
    $self->{_header} = $head;
    $self->{_id} = $id;
  }
}

sub parse_gcg {
  my($self) = @_;

  my($seq,$head,$id);
  my($count,$flag) = (0,0);

  for(@{$self->{_filedata}}) {
    if(/^\s*$/) { 
      ;
    }elsif($flag =  = 0 && /Length:.*Check:/){
      /^\s*(\S+).*Length:.*Check:/;
      $flag=1;
      ($id=$1) =~ s/\s.*//;
    }elsif($flag =  = 0 && /^\S/) {
      ($head = $_) =~ s/\n//; 
    }elsif($flag =  = 1 && /^\s*[^\d\s]/) {
      last;
    }elsif($flag =  = 1 && /^\s*\d+\s*[a-zA-Z \t]+$/) {
       $seq .= $_;
    }
    $count++;
  }
  $seq =~ tr/a-zA-Z//cd;
  $seq =~ tr/a-z/A-Z/;
  $head = $id unless $head;

  $self->{_sequence} = $seq;
  $self->{_header} = $head;
  $self->{_id} = $id;

  return 1;
}
#
#
sub parse_genbank {
  my($self) = @_;

  my($count,$features,$flag,$seqflag) = (0,0,0,0);
  my($seq,$head,$id,$acc);

  for(@{$self->{_filedata}}) {
    if( /^LOCUS/ && $flag =  = 0 ) {
      /^LOCUS\s*(.*\S)\s*$/;
      ($id=($head=$1)) =~ s/\s.*//;
      $features += 1;
      $flag = 1;
    }elsif( /^DEFINITION\s*(.*)/ && $flag =  = 1) {
      $head .= " $1";
      $features += 2;
    }elsif( /^ACCESSION/ && $flag =  = 1 ) {
      /^ACCESSION\s*(.*\S)\s*$/;
      $head .= " ".($acc=$1);
      $features += 4;
    }elsif( /^ORIGIN/ ) {
      $seqflag = 1;
      $features += 8;
    }elsif( /^\/\// ) {
      $features += 16;
    }elsif( $seqflag =  = 0 ) { 
      ; 
    }elsif($seqflag =  = 1) { 
      $seq .= $_;
    }
    ++$count;
    if($features =  = 31) {
      $seq =~ tr/a-zA-Z//cd;
      $seq =~ tr/a-z/A-Z/;

      $self->{_sequence} = $seq;
      $self->{_header} = $head;
      $self->{_id} = $id;
      $self->{_accession} = $acc;

      return 1;
    }
  }
  return;
}

sub parse_pir {
  my($self) = @_;

  my($begin,$tit,$date,$organism,$ref,$summary,$seq,$end,$count) = 
                     (0,0,0,0,0,0,0,0,0);
  my($flag,$seqflag) = (0,0);
  my($sequence,$header,$id,$acc);

  for(@{$self->{_filedata}}) {
    ++$count;
    if( /^ENTRY\s*(.*\S)\s*$/ && $flag =  = 0 ) {
      $header=$1;
      $flag=1;
      $begin++;
    }elsif( /^TITLE\s*(.*\S)\s*$/ && $flag =  = 1 ) {
      $header .= $1;
      $tit++;
    }elsif( /ORGANISM/ ) {
      $organism++;
    }elsif( /^ACCESSIONS\s*(.*\S)\s*$/ && $flag =  = 1 ) {
      ($id=($acc=$1)) =~ s/\s*//;
    }elsif( /DATE/ ) {
      $date++;
    }elsif( /REFERENCE/ ) {
      $ref++;
    }elsif( /SUMMARY/ ) {
      $summary++;
    }elsif( /^SEQUENCE/ ) {
      $seqflag = 1;
      $seq++;
    }elsif( /^\/\/\// && $flag =  = 1 ) {
      $end++;
    }elsif( $seqflag =  = 0) {
      next;
    }elsif( $seqflag =  = 1 && $flag =  = 1) {
      $sequence .= $_;
    }
    if( $begin =  = 1 && $tit =  = 1 && $date >= 1 && $organism >= 1
        && $ref >= 1 && $summary =  = 1 && $seq =  = 1 && $end =  = 1 
      ) {
      $sequence =~ tr/a-zA-Z//cd;
      $sequence =~ tr/a-z/A-Z/;

      $self->{_sequence} = $seq;
      $self->{_header} = $header;
      $self->{_id} = $id;
      $self->{_accession} = $acc;

      return 1;
    }
  }
  return;
}

sub parse_staden {
  my($self) = @_;

  my($flag,$count) = (0,0);
  my($seq,$head,$id);
  for(@{$self->{_filedata}}) {
    if( /<---*\s*(.*[^-\s])\s*-*--->(.*)/ && $flag =  = 0 ) {
      $id = $head = $1;
      $seq .= $2;
      $flag = 1;
    }elsif( /<---*(.*)-*--->/ && $flag =  = 1 ) {
      $count--;
      last;
    }elsif( $flag =  = 1 ) {
      $seq .= $_;
    }
    ++$count;
  }
  if( $flag ) {
    $seq =~ s/-/N/g;
    $seq =~ tr/a-zA-Z-//cd;
    $seq =~ tr/a-z/A-Z/;

    $self->{_sequence} = $seq;
    $self->{_header} = $head;
    $self->{_id} = $id;

    return 1;
  }
  return;
}

1;

That’s the end of the SeqFileIO.pm module that defines the SeqFileIO class.

You can see that to add the capability to handle a new sequence file format, you simply have to write new is_, put_, and parse_ methods, and add the name of the new format to the @_seqfiletypes array. So, extending this software to handle more sequence file formats is relatively easy. (See the exercises.)

To end this chapter, here is a small test program testSeqFileIO that exercises the main parts of the SeqFileIO.pm module, followed by its output. See the exercises for a discussion of ways to improve this test.

Testing SeqFileIO.pm

Here is the program testSeqFileIO to test the SeqFileIO module:

#!/usr/bin/perl

use strict;
use warnings;
use lib "/home/tisdall/MasteringPerlBio/development/lib";

use SeqFileIO;

#
# First test basic FileIO operations
#  (plus filetype attribute)
#

my $obj = SeqFileIO->new(  );

$obj->read(
  filename => 'file1.txt'
);

print "The file name is ", $obj->get_filename, "\n";
print "The contents of the file are:\n", $obj->get_filedata;
print "\nThe date of the file is ", $obj->get_date, "\n";
print "The filetype of the file is ", $obj->get_filetype, "\n";

$obj->set_date('today');
print "The reset date of the file is ", $obj->get_date, "\n";

print "The write mode of the file is ", $obj->get_writemode, "\n";

print "\nResetting the data and filename\n";
my @newdata = ("line1\n", "line2\n");
$obj->set_filedata( \@newdata );

print "Writing a new file \"file2.txt\"\n";
$obj->write(filename => 'file2.txt');

print "Appending to the new file \"file2.txt\"\n";
$obj->write(filename => 'file2.txt', writemode => '>>');

print "Reading and printing the data from \"file2.txt\":\n";

my $file2 = SeqFileIO->new(  );

$file2->read(
  filename => 'file2.txt'
);

print "The file name is ", $file2->get_filename, "\n";
print "The contents of the file are:\n", $file2->get_filedata;
print "The filetype of the file is ", $file2->get_filetype, "\n";

print <<'HEADER';

##########################################
#
# Test file format recognizing and reading
#
##########################################

HEADER

my $genbank = SeqFileIO->new(  );
$genbank->read(
  filename => 'record.gb'
);
print "The file name is ", $genbank->get_filename, "\n";
print "\nThe date of the file is ", $genbank->get_date, "\n";
print "The filetype of the file is ", $genbank->get_filetype, "\n";
print "The contents of the file are:\n", $genbank->get_filedata;

print "\n####################\n####################\n####################\n";

my $raw = SeqFileIO->new(  );
$raw->read(
  filename => 'record.raw'
);
print "The file name is ", $raw->get_filename, "\n";
print "\nThe date of the file is ", $raw->get_date, "\n";
print "The filetype of the file is ", $raw->get_filetype, "\n";
print "The contents of the file are:\n", $raw->get_filedata;

print "\n####################\n####################\n####################\n";

my $embl = SeqFileIO->new(  );
$embl->read(
  filename => 'record.embl'
);
print "The file name is ", $embl->get_filename, "\n";
print "\nThe date of the file is ", $embl->get_date, "\n";
print "The filetype of the file is ", $embl->get_filetype, "\n";
print "The contents of the file are:\n", $embl->get_filedata;

print "\n####################\n####################\n####################\n";

my $fasta = SeqFileIO->new(  );
$fasta->read(
  filename => 'record.fasta'
);
print "The file name is ", $fasta->get_filename, "\n";
print "\nThe date of the file is ", $fasta->get_date, "\n";
print "The filetype of the file is ", $fasta->get_filetype, "\n";
print "The contents of the file are:\n", $fasta->get_filedata;

print "\n####################\n####################\n####################\n";

my $gcg = SeqFileIO->new(  );
$gcg->read(
  filename => 'record.gcg'
);
print "The file name is ", $gcg->get_filename, "\n";
print "\nThe date of the file is ", $gcg->get_date, "\n";
print "The filetype of the file is ", $gcg->get_filetype, "\n";
print "The contents of the file are:\n", $gcg->get_filedata;

print "\n####################\n####################\n####################\n";

my $staden = SeqFileIO->new(  );
$staden->read(
  filename => 'record.staden'
);
print "The file name is ", $staden->get_filename, "\n";
print "\nThe date of the file is ", $staden->get_date, "\n";
print "The filetype of the file is ", $staden->get_filetype, "\n";
print "The contents of the file are:\n", $staden->get_filedata;

print "\n####################\n####################\n####################\n";

print <<'REFORMAT';

##########################################
#
# Test file format reformatting and writing
#
##########################################

REFORMAT

print "At this point there are ", $staden->get_count, " objects.\n\n";

print "######\n###### Testing put methods\n######\n\n";

print "\nPrinting staden data in raw format:\n";
print $staden->put_raw;

print "\nPrinting staden data in embl format:\n";
print $staden->put_embl;

print "\nPrinting staden data in fasta format:\n";
print $staden->put_fasta;

print "\nPrinting staden data in gcg format:\n";
print $staden->put_gcg;

print "\nPrinting staden data in genbank format:\n";
print $staden->put_genbank;

print "\nPrinting staden data in PIR format:\n";
print $staden->put_pir;

The test program depends on certain files being present on the system; the contents of the files are clear from the program output, and the files can be downloaded from this book’s web site, along with the rest of the programs for this book.

Results

Here is the output from testSeqFileIO:

The file name is file1.txt
The contents of the file are:
> sample dna  (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
acacctgagccactctcagatgaggaccta

The date of the file is Thu Dec  5 11:22:56 2002
The filetype of the file is _fasta
The reset date of the file is today
The write mode of the file is >

Resetting the data and filename
Writing a new file "file2.txt"
Appending to the new file "file2.txt"
Reading and printing the data from "file2.txt":
The file name is file2.txt
The contents of the file are:
line1
line2
line1
line2
The filetype of the file is _unknown

##########################################
#
# Test file format recognizing and reading
#
##########################################

The file name is record.gb

The date of the file is Sun Mar 30 14:30:09 2003
The filetype of the file is _genbank
The contents of the file are:
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Sequence severely truncated for demonstration.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN      
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//

####################
####################
####################
The file name is record.raw

The date of the file is Sun Mar 30 14:30:39 2003
The filetype of the file is _raw
The contents of the file are:
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGCGCAAT
AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTC
TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAAAAAAA
AAAAAAAAAAAA
####################
####################
####################
The file name is record.embl

The date of the file is Sun Mar 30 14:31:23 2003
The filetype of the file is _embl
The contents of the file are:
ID   AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely 
truncated for demonstration. AB031069 AB031069
XX
SQ   sequence 267 BP; 65 A; 54 C; 106 G; 42 T; 0 other;
AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG
AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG
GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT
CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT
AAAAAAAAAA AAAAAAAAAA AAAAAAA
//

####################
####################
####################
The file name is record.fasta

The date of the file is Sun Mar 30 14:31:40 2003
The filetype of the file is _fasta
The contents of the file are:
> AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely 
truncated for demonstration. AB031069
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG
TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT
GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC
TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT
CGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAA
AAAAAAAAAAAAAAAAA

####################
####################
####################
The file name is record.gcg

The date of the file is Sun Mar 30 14:31:47 2003
The filetype of the file is _gcg
The contents of the file are:
AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely
truncated for demonstration. AB031069
    AB031069 Length: 267 (today)  Check: 4285  ..
       1  AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG
      51  TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT
     101  GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC
     151  TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT
     201  CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA
     251  AAAAAAAAAA AAAAAAA

####################
####################
####################
The file name is record.staden

The date of the file is Sun Mar 30 14:32:01 2003
The filetype of the file is _staden
The contents of the file are:
;<----AB031069------>
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGG
AGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCG
GAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTCTGTGCAGGTT
CGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGAT
AAAAAAAAAAAAAAAAAAAAAAAAAAA

####################
####################
####################

##########################################
#
# Test file format reformatting and writing
#
##########################################

At this point there are 8 objects.

######
###### Testing put methods
######

Printing staden data in raw format:
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGCGCAAT
AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTC
TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAAAAAAAA
AAAAAAAAAAA
Printing staden data in embl format:
ID   AB031069 AB031069
XX
SQ   sequence 267 BP; 65 A; 54 C; 106 G; 42 T; 0 other;
AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG
AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG
GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT
CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT
AAAAAAAAAA AAAAAAAAAA AAAAAAA
//

Printing staden data in fasta format:
> AB031069
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG
TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT
GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC
TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT
CGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAA
AAAAAAAAAAAAAAAAA

Printing staden data in gcg format:
AB031069
    AB031069 Length: 267 (today)  Check: 4285  ..
       1  AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG
      51  TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT
     101  GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC
     151  TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT
     201  CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA
     251  AAAAAAAAAA AAAAAAA

Printing staden data in genbank format:
LOCUS       AB031069       267 bp
DEFINITION  AB031069 , 267 bases, 829 sum.
ACCESSION  
ORIGIN
       1  agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
      61  agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
     121  gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
     181  cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
     241  aaaaaaaaaa aaaaaaaaaa aaaaaaa
//

Printing staden data in PIR format:
ENTRY           AB031069
TITLE           AB031069
DATE            
REFERENCE       
SUMMARY         #Molecular-weight 0  #Length 267  #Checksum 4285
SEQUENCE
                5        10        15        20        25        30
      1 A G A T G G C G G C G C T G A G G G G T C T T G G G G G C T
     31 C T A G G C C G G C C A C C T A C T G G T T T G C A G C G G
     61 A G A C G A C G C A T G G G G C C T G C G C A A T A G G A G
     91 T A C G C T G C C T G G G A G G C G T G A C T A G A A G C G
    121 G A A G T A G T T G T G G G C G C C T T T G C A A C C G C C
    151 T G G G A C G C C G C C G A G T G G T C T G T G C A G G T T
    181 C G C G G G T C G C T G G C G G G G G T C G T G A G G G A G
    211 T G C G C C G G G A G C G G A G A T A T G G A G G G A G A T
    241 A A A A A A A A A A A A A A A A A A A A A A A A A A A
///

Resources

  • Inheritance is a fundamental OO technique; see Section 3.14 for Perl OO reference material that includes discussion of inheritance.

  • For programming with sequence data file formats, see the C program readseq by Don Gilbert (at http://iubio.bio.indiana.edu/soft/molbio/readseq).

  • See the Bioperl project at http://www.bioperl.org for alternate ways to handle this programming task in Perl.

  • For a more rigorous but slower approach to parsing sequence files (which is sometimes what you want) see the module Parse::RecDescent by Damien Conway at CPAN.

  • Each sequence file format has documentation that describes it. These formats sometimes change to keep up with the changing nature of biological data. One of the following exercises challenges you to find the documentation for one of the formats and to improve the code in this chapter for that format.

Exercises

Exercise 4.1

Write an object-oriented module DNAsequence whose object has one attribute, a sequence of DNA, and two methods, get_dna and set_dna. Start with the code for Gene.pm, but see how far you can whittle it down to the minimum amount of code necessary to implement this new class.

Exercise 4.2

The FileIO.pm module implements objects that read and write file data. However, they can, depending on the program, deviate substantially from what are actually present in files on your computer. For instance, you can read in all the files in a folder, and then change the filenames and data of all the objects, without writing them out. Is this a good thing or a bad thing?

Exercise 4.3

In the text, you are asked why the new constructor for FileIO.pm has been whittled down to the bare bones. You can see that all it does is create an empty object. What functionality has been moved out of the new constructor and into the read and write methods? Does it make more sense to do without a new constructor entirely and instead have the read and write methods create objects? Try rewriting the code that way. Alternately, does it make sense to try rewriting the code so that both reading and writing are handled by the new constructor? Is creating an object sometimes logically distinct from initializing it?

Exercise 4.4

Use FileIO.pm as a base class for a new class that manages the annotation of a pipeline in your laboratory. For example, perhaps your lab gets sequence from your ABI machine, screens it for vectors, assesses the quality of the sequencing run, searches your local database to determine if you’ve seen it or something like it before, then searches GenBank to see what other known sequences it matches or resembles, and finally adds it to an assembly project. Each step has a person or persons, a timestamp for the beginning and ending of each phase, and data. You want to be able to track the work done on each sequence that emerges from your ABI. (This is just an example. Pick a set of jobs that you actually do in your lab.)

Exercise 4.5

For each sequence file format handled by the SeqFileIO.pm module, find the documentation that specifies the format. Compare the documentation with the is_, parse_, and put_ method to recognize, read, and write files in each format. How can you improve this code? Make it more complete? Faster?

Exercise 4.6

My parse_ methods are somewhat ad hoc. They don’t really parse the whole file according to the definition of the format. They just extract the sequence and a small amount of annotation. Take one of the formats and write a more complete parser for it. What are the advantages and disadvantages of a simple versus a more complete parser in this code? How about for other applications you may want to develop in the future?

Exercise 4.7

Use the parser you developed in Exercise 4.6 to do a more complete job of identifying a file in the same format in the module’s is_ method.

Exercise 4.8

Add a new sequence file format to SeqFileIO.

Exercise 4.9

In FileIO.pm, and in many other places in this book, the program calls croak and exits when a problem arises (such as when unsuccessfully attempting to open a file for reading). Such drastic measures are sometimes desirable; for example, you may want to kill the program if a security problem is discovered in which someone is attempting to read a forbidden file. Or, when developing software, you may like your program to print an informative message and die when a problem occurs, as that might help you develop the program faster.

However, very often what you really want is for the program to notice the error and take some appropriate steps, not simply die. If a file cannot be opened, it may be something as simple as the user of the program mistyping the filename, and what you’d like is to give the user another couple of chances to type the name in correctly. Rewrite FileIO.pm without calling croak. This may entail checking for the success or failure of certain operations and taking reasonable actions on failure. Should the class module take all such actions, or should the program that uses the class module be expected to behave appropriately when a failure is reported?

Exercise 4.10

The AUTOLOAD method in FileIO.pm tests for attributes that are scalars and references to arrays. The need for this comes from the list of attributes given in the %_my_attribute_properties hash. Each attribute hash value is an anonymous array with two elements: default value and properties. From the default value you can see that a value is either a scalar (a string in this case) or an anonymous array (a reference to an array). The code that AUTOLOAD installs for accessor routines then checks if the attribute is either a scalar or a reference to an array.

This AUTOLOAD method is inherited by SeqFileIO.pm. One of the modifications that SeqFileIO.pm makes is defining its own %_my_attribute_properties to handle the new attributes that it defines, such as _sequence. In this case, all the attributes are either scalars or references to arrays, as before. What modifications are necessary if some other data type is needed for a new attribute by a class that inherited FileIO.pm? How can you rewrite FileIO.pm to make it easier to write classes that inherit it?

Exercise 4.11

The test program testSeqFileIO has certain shortcomings. For one thing, it repeats blocks of code that can be replaced with a short loop (with a little rewriting). Another problem is that it doesn’t test everything in the class.

Rewrite testSeqFileIO so that it’s clearer and more comprehensive. By default, make it just give a short summary of the number of tests performed and the number of tests passed, but add a verbose flag so that it prints out all its tests in detail when desired. The module SeqFileIO.pm is lacking POD documentation.Add POD documentation to the module that is fairly easily cut and pasted into a test program for the module.

Exercise 4.12

In SeqFileIO.pm, the hash %_all_attribute_properties changed from the base class and needed to be redefined. However, the code for the _all_attributes, _attribute_default, and _permissions helper methods didn’t change. Why then did the new class SeqFileIO redefine these methods? (Hint: are these helper methods closures?) SeqFileIO.pm is also lacking POD documentation. Try adding POD documentation to the module soy that it can be easily cut and pasted into a test program for the module.

Exercise 4.13

The h2xs program that ships with Perl simplifies module creation, and even helps you create the Makefile.PL that you’ll need to add your own module to CPAN or to your local installation (which helps you bypass the somewhat awkward use lib directive that appears in the programs in this book). See also the perlxstut, the ExtUtils::MakeMaker, and the AUTOLOAD manpages. In particular, see the -X option to h2xs. Write a module starting from the use of h2xs.

Exercise 4.14

The open calls in the read methods of the classes in this chapter specify a filehandle FileIOFH. Alternatives include using lexical scalars as filehandles or the IO::Handle package. Rewrite the read methods so files are opened with these alternative types of filehandles. What costs or benefits result from these rewritings? (See the perlopentut part of the Perl documentation.)

Exercise 4.15

In the AUTOLOAD method, a copy of the file data is returned from the get_filedata accessor; this will protect the actual file data in the object, but it makes a copy of a potentially very large amount of data, which can overtax your system. Discuss alternatives for this behavior, and implement one of them.

Exercise 4.16

Reading in a few hundred large files (as can easily happen with the modules in this chapter) can overtax your system, causing the system, or at least the program, to crash. Design two alternative methods that avoid this overuse of memory. For instance, you can avoid reading in a file until the sequence data is actually needed. You can also reread the data into the program each time needed but not save it in your object. Finally, you can reclaim memory from older files. Implement one of these methods or some other. What other parts of the code need to be altered?



[11] Don Gilbert’s readseq package (see http://iubio.bio.indiana.edu/soft/molbio/readseq and ftp://ftp.bio.indiana.edu/molbio/readseq/classic/src) is the classic program (written in C) for reading and writing multiple sequence file formats.

Get Mastering Perl for Bioinformatics now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.