Chapter 4. Sequence Formats and Inheritance
This chapter applies concepts and techniques from previous chapters to a concrete project: handling sequence files. The chapter also introduces a few new techniques including a very important one called class inheritance. The code developed in this chapter will also be incorporated into later chapters.
Class inheritance allows you to define a new class by inheriting from other classesâaltering or making additions as needed. Itâs a style of software reuse that is particular to object-oriented design.
The first class developed in this chapter is a simple one: reading and writing files. Using inheritance, you can extend that class to a new one that can recognize, read, and write data in several different biological sequence datafile formats.
The goal is, as always, to learn enough about Perl to develop software for your own needs. The code in this chapter is designed with this goal in mind. In particular, the exercises at the end of the chapter will ask you to extend and improve the code in various ways.
Inheritance
Youâve seen the use
of modules and how a Perl program can use all the code in a module by
simply loading it with the use
command. This is a
simple and powerful method of software
reuse
, by which software can be written once
but used many times by different programs.
Youâve also seen how object-oriented Perl defines classes in terms of modules, and how the use of classes, methods, and objects provides a more structured way to reuse Perl software.
Thereâs another way to reuse Perl classes. Itâs possible for a class to inherit all the code and definitions of another base class . (This base class is sometimes called a superclass or a parent class .) The new derived class (a.k.a. subclass) can add more definitions or redefine certain definitions. Perl then automatically uses the definitions of everything in the old class (but treats them as if they were defined in this new derived class), unless it finds them first in the derived class.
In this chapter, Iâll first develop a class
FileIO.pm
, and then use the technique of
inheritance to develop another class SeqFileIO.pm
that inherits from FileIO.pm
. This way of reusing
software by inheritance is extremely convenient when writing
object-oriented software. For instance, I make
SeqFileIO
do a lot of its work simply by
inheriting the base class FileIO
and then adding
methods that handle sequence file formats. I could use the same base
class FileIO
to write a new class that specializes
in handling HTML files, microarray datafiles, SNP database files, and
so on. (See the exercises at the end of the chapter.)
When inheriting a class, it is sometimes necessary to do a bit more
than just add new methods. In the SeqFileIO
class,
I add some attributes to the object, and as a result the hash
%_attribute_properties
also has to be changed. So
in the new class I define a new hash with that name, and as a result
the old definition from the base class is forgotten and the new,
redefined hash is used. As you read the new class, compare it with
the base class FileIO
. Make note of what is new in
the class (e.g., the various put
methods), what is
being redefined from the base class (e.g., the hash just mentioned),
and what is being inherited from the base class (e.g., the
new
constructor.) This can help prepare you to
write your own class that uses inheritance.
Occasionally, you want to invoke a method from a base class that has been overridden. You can use the special SUPER class for that purpose. I donât use that in the code for this chapter, but you should be aware that it is possible to do.
FileIO.pm: A Class to Read and Write Files
Even though you can easily obtain excellent modules for reading and writing files, this chapter shows you how to build a simple one from scratch. One reason for doing this is to better understand the issues every bioinformatics programmer needs to face, such as how to organize files and keep track of their contents. Another reason is so you can see how to extend the class to deal with the multiple file format problem that is peculiar to bioinformatics.
Itâs not uncommon for a biologist to use several
different types of formats of files containing DNA or protein
sequence data and translate from one format to another. Doing these
translations by hand is very tedious. Itâs also
tedious to save alternate forms of the same sequence data in
differently formatted files. Youâll see how to
alleviate some of this pain by automating some of these tasks in a
new class called SeqFileIO.pm
.
Class inheritance is one of the main reasons why object-oriented
software is so reusable. In order to see clearly how it works,
letâs start with the simple class
FileIO.pm
and later use it to define a more
complex class, SeqFileIO.pm
.
FileIO
is a simple class that reads and writes
files, and stores simple information such as the file contents, date,
and write permissions.
You know that itâs often possible to modify existing
code to create your own program. When I wrote
FileIO.pm
, I simply made a copy of the
Gene.pm
module from Chapter 3
and modified it.
On my Linux system, I started by copying FileIO.pm
from Gene.pm
and giving it a new name:
cp Gene.pm FileIO.pm
I then edited the new file FileIO.pm
changing the
line near the top that says:
package Gene;
to:
package FileIO;
The filename must be the same as the class name, with an additional
.pm
.
Though I now needed to modify the module to do what I want, a
surprising amount of the overall framework of the codeâits
constructor, accessor and mutator methods, and its basic data
structuresâremains the same. Gene.pm
already
contained such useful parts as a new
constructor,
a hash-based object data structure, accessor methods to retrieve
values of the attributes of the object, and mutator methods to alter
attribute values. These are likely to be needed by most classes that
youâll write in your own software projects.
Analysis of FileIO
Following is the code for FileIO
, with commentary
interspersed:
package FileIO; # # A simple IO class for sequence data files # use strict; use warnings; our $AUTOLOAD; # before Perl 5.6.0 say "use vars '$AUTOLOAD';" use Carp; # Class data and methods { # A list of all attributes with defaults and read/write/required/noinit properties my %_attribute_properties = ( _filename => [ '', 'read.write.required'], _filedata => [ [ ], 'read.write.noinit'], _date => [ '', 'read.write.noinit'], _writemode => [ '>', 'read.write.noinit'], ); # Global variable to keep count of existing objects my $_count = 0; # Return a list of all attributes sub _all_attributes { keys %_attribute_properties; } # Check if a given property is set for a given attribute sub _permissions { my($self, $attribute, $permissions) = @_; $_attribute_properties{$attribute}[1] =~ /$permissions/; } # Return the default value for a given attribute sub _attribute_default { my($self, $attribute) = @_; $_attribute_properties{$attribute}[0]; } # Manage the count of existing objects sub get_count { $_count; } sub _incr_count { ++$_count; } sub _decr_count { --$_count; } }
In this first part of FileIO.pm
file, the headers
are exactly the same as in Gene.pm
.
The opening block, which contains the class data and methods, also
remains the same except for the hash
%_attribute_properties
. This new version of the
hash has different attributes (the filename, the file data, the last
modification date of the file, and the mode to use in writing a file)
tailored to the needs of reading and writing files.
In addition to the read
, write
,
and required
properties, there is also a new
âno initializationâ (or
noinit
) property. An attribute with the
noinit
property may not be given an initial value
when an object is created with a call to the new
constructor. In this module, attributes such as data from the file,
or the date on the file, are set only when the file is read or
written. You may also have noticed that the default value for the
filedata
attribute is an anonymous array.
Note
that each attribute has both read
and
write
properties. This being the case, you can
simply omit the listing of the properties. However, in the interest
of future modification, when I may want to add some attribute that
wonât have both properties, Iâve
left in the specification of the two read
and
write
properties. (Note that one method name,
get_count
, doesnât start with an
underscore; this encourages you to call this method to get a count of
how many objects currently exist.)
The constructor method
Youâll notice in the following code that I have cut
the new
constructor down to the bare bones.
# The constructor method # Called from class, e.g. $obj = FileIO->new( ); sub new { my ($class, %arg) = @_; # Create a new object my $self = bless { }, $class; $class->_incr_count( ); return $self; }
Why did I do so? Read on.
The read method
The code continues with the
read
method:
# Called from object, e.g. $obj->read( ); sub read { my ($self, %arg) = @_; # Set attributes foreach my $attribute ($self->_all_attributes( )) { # E.g. attribute = "_filename", argument = "filename" my($argument) = ($attribute =~ /^_(.*)/); # If explicitly given if (exists $arg{$argument}) { # If initialization is not allowed if($self->_permissions($attribute, 'noinit')) { croak("Cannot set $argument from read: use set_$argument"); } $self->{$attribute} = $arg{$argument}; # If not given, but required }elsif($self->_permissions($attribute, 'required')) { croak("No $argument attribute as required"); # Set to the default }else{ $self->{$attribute} = $self->_attribute_default($attribute); } } # Read file data unless( open( FileIOFH, $self->{_filename} ) ) { croak("Cannot open file " . $self->{_filename} ); } $self->{'_filedata'} = [ <FileIOFH> ]; $self->{'_date'} = localtime((stat FileIOFH)[9]); close(FileIOFH); }
This new read
method has two parts. The first
includes the initialization of the objectâs
attributes from the arguments and the defaults as specified in the
%_attribute_properties
hash. The second includes
the reading of the file and the setting of the
_filedata
and _date
attributes
from the fileâs contents and its last modification
time.
The first loop in the program initializes the attributes. If an
attribute is specified as an argument, the first test is to see if
the noinit
property is set. This forbids
initializing the attribute, in which case the program
croak
s. Otherwise, the attribute is set.
If the attribute isnât passed as an argument but has
a required
property (only the
_filename
attribute has the
required
property), the program
croak
s.
Finally, if the argument isnât given and not required, the attribute is set to the default value.
After performing those initializations, the read
method reads in the specified file. If it canât open
the file, the program croak
s. (See the exercises
for a discussion of this use of croak
.)
The file is read by the line:
$self->{'_filedata'} = [ <FileIOFH> ];
In list context, the input operator on the opened filehandle, which
is given by <FileIOFH>
reads in the entire
file. This is done within an anonymous array, as determined by the
square brackets around the input operator angle brackets. A reference
to this anonymous array containing the fileâs
contents is then assigned to the _filedata
attribute.
stat and localtime functions
Finally, the Perl
stat
and localtime
functions
are called to generate a string with the fileâs last
modification time, which is assigned to the object attribute
_date
.
This method of reading a file makes many choices. For instance, the
stat
command returns an array with many more items
of interest about a file, such as its size, owner, access permission
modes, and so on (the tenth item of which is the modification time).
As you develop your programs, you should be paying attention to
details such as whether you need to save some of these additional
attributes of a file, the last modification date, or notes about the
kind of data in the file.
The next line of code in the program:
# # N.B. no "clone" method is necessary #
is yet another choice to think about. Are there occasions when cloning a file object makes sense? Maybe Iâd like to clone a file object, make some small change to the data, give it a new filename, and write it out. Why have I left this out?
The write method
# Write files # Called from object, e.g. $obj->write( ); sub write { my ($self, %arg) = @_; foreach my $attribute ($self->_all_attributes( )) { # E.g. attribute = "_filename", argument = "filename" my($argument) = ($attribute =~ /^_(.*)/); # If explicitly given if (exists $arg{$argument}) { $self->{$attribute} = $arg{$argument}; } } unless( open( FileIOFH, $self->get_writemode . $self->get_filename ) ) { croak("Cannot write to file " . $self->get_filename); } unless( print FileIOFH $self->get_filedata ) { croak("Cannot write to file " . $self->get_filename); } $self->set_date(scalar localtime((stat FileIOFH)[9])); close(FileIOFH); return 1; }
The write
method handles writing a file object out
to an actual file. First, all arguments corresponding to attributes
are set as requested. The file is then opened for writing, using the
_writemode
attribute to specify; for example, >
for truncating the file before writing or >> for appending to
the file. The print FileIOFH
statement actually
does the writing to the opened FileIOFH
filehandle, retrieving the file data from the object with the
get_filedata
method defined by means of
AUTOLOAD
. Finally, the objectâs
_date
attribute is reset to the new modification
time.
AUTOLOAD
The
next section of code is the AUTOLOAD
method
itself:
# This takes the place of such accessor definitions as: # sub get_attribute { ... } # and of such mutator definitions as: # sub set_attribute { ... } sub AUTOLOAD { my ($self, $newvalue) = @_; my ($operation, $attribute) = ($AUTOLOAD =~ /(get|set)(_\w+)$/); # Is this a legal method name? unless($operation && $attribute) { croak "Method name '$AUTOLOAD' is not in the recognized form\n"; } unless(exists $self->{$attribute}) { croak "No such attribute '$attribute' exists in the class ", ref($self); } # AUTOLOAD accessors if($operation eq 'get') { unless($self->_permissions($attribute, 'read')) { croak "$attribute does not have read permission"; } # Turn off strict references to enable symbol table manipulation no strict "refs"; # Install this accessor definition in the symbol table *{$AUTOLOAD} = sub { my ($self) = @_; unless($self->_permissions($attribute, 'read')) { croak "$attribute does not have read permission"; } if(ref($self->{$attribute}) eq 'ARRAY') { return @{$self->{$attribute}}; }else{ return $self->{$attribute}; } }; # Turn strict references back on no strict "refs"; # Return the attribute value # The attribute could be a scalar or a reference to an array if(ref($self->{$attribute}) eq 'ARRAY') { return @{$self->{$attribute}}; }else{ return $self->{$attribute}; } # AUTOLOAD mutators }elsif($operation eq 'set') { unless($self->_permissions($attribute, 'write')) { croak "$attribute does not have write permission"; } # Turn off strict references to enable symbol table manipulation no strict "refs"; # Install this mutator definition in the symbol table *{$AUTOLOAD} = sub { my ($self, $newvalue) = @_; unless($self->_permissions($attribute, 'write')) { croak "$attribute does not have write permission"; } $self->{$attribute} = $newvalue; }; # Turn strict references back on no strict "refs"; # Set and return the attribute value $self->{$attribute} = $newvalue; return $self->{$attribute}; } }
This AUTOLOAD
method has grown!
Thereâs only one difference, however, between this
code and the AUTOLOAD
code for the
Gene.pm
class. The new set of attributes for
FileIO.pm
donât all take simple
scalar values, as was the case with Gene.pm
.
Another attribute, _filedata
, is a reference to an
anonymous array. In order for the accessors to return the correct
data, they must check to see if an attribute is a scalar or a
reference to an array; the accessors can then dereference and return
the data from the method call.
So the accessors, and the definitions of them installed into the
symbol table, test for an array reference and dereference it
accordingly. Other than that, this AUTOLOAD
method
is exactly the same as that defined for Gene.pm
.
You may also have noticed that sections of code in the
AUTOLOAD
method are almost identical to each
other. Recall that AUTOLOAD
is invoked when a
method with no subroutine defining it is called.
AUTOLOAD
must do two things. First, it performs
whatever method is requested; for example, if an accessor method is
requested, it returns the appropriate value. Second, it defines the
subroutine that implements the requested method and installs it in
the symbol table so the next time the method is called,
AUTOLOAD
and its considerable overhead
wonât be necessary. Because of these parameters, the
code AUTOLOAD
executes to handle the requested
method is nearly identical with the method that
AUTOLOAD
also defines.
Finally, here are the last sections of the
FileIO.pm
program:
# When an object is no longer being used, this will be automatically called # and will adjust the count of existing objects sub DESTROY { my($self) = @_; $self->_decr_count( ); } # Other methods. They do not fall into the same form as the majority handled by AUTOLOAD # 1;
The only change here is that there are no other methods
(Gene.pm
had a citation method).
Finishing FileIO
To finish FileIO.pm
, hereâs some
very terse (too terse for anything but a textbook) POD documentation:
=head1 FileIO FileIO: read and write file data =head1 Synopsis use FileIO; my $obj = RawfileIO->read( filename => 'jkl' ); print $obj->get_filename, "\n"; print $obj->get_filedata; $obj->set_date('today'); print $obj->get_date, "\n"; print $obj->get_writemode, "\n"; my @newdata = ("line1\n", "line2\n"); $obj->set_filedata( \@newdata ); $obj->write(filename => 'lkj'); $obj->write(filename => 'lkj', writemode => '>>'); my $o = RawfileIO->read(filename => 'lkj'); print $o->get_filename, "\n"; print $o->get_filedata; my $gene1 = Gene->new( name => 'biggene', organism => 'Mus musculus', chromosome => '2p', pdbref => 'pdb5775.ent', author => 'L.G.Jeho', date => 'August 23, 1989', ); print "Gene name is ", $gene1->get_name( ); print "Gene organism is ", $gene1->_get_organism( ); print "Gene chromosome is ", $gene1->_get_chromosome( ); print "Gene pdbref is ", $gene1->_get_pdbref( ); print "Gene author is ", $gene1->_get_author( ); print "Gene date is ", $gene1->_get_date( ); $clone = $gene1->clone(name => 'biggeneclone'); $gene1-> set_chromosome('2q'); $gene1-> set_pdbref('pdb7557.ent'); $gene1-> set_author('G.Mendel'); $gene1-> set_date('May 25, 1865'); $clone->citation('T. Morgan', 'October 3, 1912'); print "Clone citation is ", $clone->citation; =head1 AUTHOR James Tisdall =head1 COPYRIGHT Copyright (c) 2003, James Tisdall =cut
Testing the FileIO Class Module
Now that weâve got a
class module, complete with examples of its use,
letâs write a small test program and see how it
works. Since the examples in the documentation are, in effect, a
small test program, letâs try running it.
Weâll use the file file1.txt
I
created with my text editor that contains:
> sample dna (This is a typical fasta header.) agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat acacctgagccactctcagatgaggaccta
Iâll take the code from the documentation pretty
much as is, just adding strict
and
warnings
. Iâll also include a
use
lib
directive that adds my
development library directory to the list of directories in
@INC
, which tells my computerâs
Perl where to look for modules. (Recall that you can either edit this
line, override it with the PERL5LIB
environmental variable, or give your own directory on the command
line.) I also add a few print statements to make the output easier to
read:
#!/usr/bin/perl use strict; use warnings; use lib "/home/tisdall/MasteringPerlBio/development/lib"; use FileIO; my $obj = FileIO->new( ); $obj->read( filename => 'file1.txt' ); print "The file name is ", $obj->get_filename, "\n"; print "The contents of the file are:\n", $obj->get_filedata; print "\nThe date of the file is ", $obj->get_date, "\n"; $obj->set_date('today'); print "The reset date of the file is ", $obj->get_date, "\n"; print "The write mode of the file is ", $obj->get_writemode, "\n"; print "\nResetting the data and filename\n"; my @newdata = ("line1\n", "line2\n"); $obj->set_filedata( \@newdata ); print "Writing a new file \"file2\"\n"; $obj->write(filename => 'file2'); print "Appending to the new file \"file2\"\n"; $obj->write(filename => 'file2', writemode => '>>'); print "Reading and printing the data from \"file2\":\n"; my $file2 = FileIO->new( ); $file2->read( filename => 'file2' ); print "The file name is ", $file2->get_filename, "\n"; print "The contents of the file are:\n", $file2->get_filedata;
I finally run the test program to get the following output:
The file name is file1.txt The contents of the file are: > sample dna (This is a typical fasta header.) agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat acacctgagccactctcagatgaggaccta The date of the file is Thu Dec 5 11:22:56 2002 The reset date of the file is today The write mode of the file is > Resetting the data and filename Writing a new file "file2" Appending to the new file "file2" Reading and printing the data from "file2": The file name is file2 The contents of the file are: line1 line2 line1 line2
The module seems to be performing as hoped. So, now we have a simple module that reads and writes files and provides a few options for the write mode.
But, frankly, this isnât too impressive.
Youâve already been reading and writing files in
Perl without the overhead of this FileIO
module.
The interface to the code is nice, and itâs good to
have objects that contain the file data, but what has really been
accomplished?
The real power of this approach is coming up next. Using class inheritance, this simple module can be extended relatively easily in a very useful direction.
Itâs another case of the basic software engineering
approach of making small, simple, generally useful tools, and then
combining them into more powerful and specific applications. So,
next, Iâll take my simple FileIO
class and use it as a base class for a
bioinformatics-specific class.
SeqFileIO.pm: Sequence File Formats
Our
primary interest is bioinformatics.Can we extend the
FileIO
class to handle biological sequence
datafiles? For example, can a class be written that takes a GenBank
file and writes the sequence out in FASTA format?
Using the technique of inheritance, in this section I present a
module for a new class SeqFileIO
that performs
several basic functions on sequence files of various formats. When
you call this moduleâs read
method, in addition to reading the fileâs contents
and setting the name, date, and write mode of the file, it
automatically determines the format of the sequence file, extracts
the sequence, and when available, extracts the annotation, ID, and
accession number. In addition, a set of put
methods makes it easy to present the sequence and annotation in other
formats.[11]
Analysis of SeqFileIO.pm
The first part of the module SeqFileIO.pm
contains
the block with definitions of the new, or revised, class data and
methods.
The first thing you should notice is the use
command:
use base ( "FileIO" );
This Perl command tells the current package
SeqFileIO
itâs inheriting from
the base class FileIO
. Hereâs
another statement thatâs often used for this
purpose:
@ISA = ( "FileIO" );
The @ISA
predefined variable tells a package that
it âis aâ version of some base
class; it then can inherit methods from that base class. The
use base
directive sets the
@ISA
array to the base class(es), plus a little
else besides. (Check perldoc
base
for the whole story.) Without getting bogged
down in details use base
works a little more
robustly than just setting the @ISA
array, so
thatâs what Iâll use here:
package SeqFileIO; use base ( "FileIO" ); use strict; use warnings; #use vars '$AUTOLOAD'; use Carp; # Class data and methods { # A list of all attributes with defaults and read/write/required/noinit properties my %_attribute_properties = ( _filename => [ '', 'read.write.required'], _filedata => [ [ ], 'read.write.noinit'], _date => [ '', 'read.write.noinit'], _writemode => [ '>', 'read.write.noinit'], _format => [ '', 'read.write'], _sequence => [ '', 'read.write'], _header => [ '', 'read.write'], _id => [ '', 'read.write'], _accession => [ '', 'read.write'], ); # Return a list of all attributes sub _all_attributes { keys %_attribute_properties; } # Check if a given property is set for a given attribute sub _permissions { my($self, $attribute, $permissions) = @_; $_attribute_properties{$attribute}[1] =~ /$permissions/; } # Return the default value for a given attribute sub _attribute_default { my($self, $attribute) = @_; $_attribute_properties{$attribute}[0]; } my @_seqfileformats = qw( _raw _embl _fasta _gcg _genbank _pir _staden ); sub isformat { my($self) = @_; for my $format (@_seqfileformats) { my $is_format = "is$format"; if($self->$is_format) { return $format; } } return 'unknown'; } }
The power of inheritance
A comparison with the opening block in the base class
FileIO
is instructive. Youâll see
that Iâm redefining the
%_attribute_properties
hash and the methods in the
block that access the hash as closures. This is necessary to extend
the new class to handle new attributes that relate specifically to
sequence datafiles:
-
_format
The format of the sequence datafile, such as FASTA or GenBank
-
_sequence
The sequence data extracted from the sequence datafile as a scalar string
-
_header
The header part of the annotation; defined somewhat loosely in this module
-
_id
The ID of the sequence datafile, such as a gene name or other identifier
-
_accession
The accession number of the sequence datafile, when provided
Youâll notice, in comparing this opening block with
the opening block of the base class FileIO
, that
the methods and variables relating to counting the number of objects
has been omitted in this new class.
Hereâs where you see the power of inheritance for
the first time. When the code in the new
SeqFileIO.pm
module tries to call, say, the
get_count
method, and Perl sees that
itâs not defined in the module, it will go on a hunt
for the method in the base class and use the definition it finds
there. On the other hand, if it finds the method in
SeqFileIO.pm
, it just uses that; if the
get_count
method appeared in the base class but is
redefined in SeqFileIO.pm
, it uses that
redefinition and ignores the older definition in the base class.
So, you donât have to rewrite methods in the base
class(es); you can just call them. Perl not only finds them, it also
calls them as if they were defined in the new class. For example,
ref($self)
returns SeqFileIO
,
not FileIO
, without regard to whether the method
is defined in the new class or inherited from the old class.
Finally, there are some new definitions in this new version of the
opening block. An array @_seqfileformats
lists the
sequence file formats the module knows about, and a method
isformat
calls the methods associated with each
format (such as is_fasta
defined later in the
module) until it either finds what format the file is in or returns
unknown
.
The @_seqfileformats
array uses the
qw
operator. This splits the words on whitespace
(including newlines) and returns a list of quoted words.
Itâs a convenience for giving lists of words without
having to quote each one or separate them by commas. (Check the
perlop
manpage for the section on
âRegExp Quote-Like Operatorsâ for
all the variations and alternative quoting operators.)
Following the opening block, notice that there is no
new
method defined in this class. The simple,
bare-bones new
method from the
FileIO
class serves just as well for this new
class; thanks to the inheritance mechanism, thereâs
no need to write a new one. A program that calls
SeqFileIO->new
will use the same method from
the FileIO
class, but the class name will be
properly set to SeqFileIO
for the new object
created.
A new read method
The next part of the program has a rewrite of the
read
method. As a result, the
read
method from the parent
FileIO
class is being overridden:
# Called from object, e.g. $obj->read( ); sub read { my ($self, %arg) = @_; # Set attributes foreach my $attribute ($self->_all_attributes( )) { # E.g. attribute = "_filename", argument = "filename" my($argument) = ($attribute =~ /^_(.*)/); # If explicitly given if (exists $arg{$argument}) { # If initialization is not allowed if($self->_permissions($attribute, 'noinit')) { croak("Cannot set $argument from read: use set_$argument"); } $self->{$attribute} = $arg{$argument}; # If not given, but required }elsif($self->_permissions($attribute, 'required')) { croak("No $argument attribute as required"); # Set to the default }else{ $self->{$attribute} = $self->_attribute_default($attribute); } } # Read file data unless( open( FileIOFH, $self->{_filename} ) ) { croak("Cannot open file " . $self->{_filename} ); } $self->{'_filedata'} = [ <FileIOFH> ]; $self->{'_date'} = localtime((stat FileIOFH)[9]); $self->{'_format'} = $self->isformat; my $parsemethod = 'parse' . $self->{'_format'}; $self->$parsemethod; close(FileIOFH); }
The new read
method just shown is almost exactly
the same as the read
function in the
FileIO
class. There are only three new lines of
code, just before the end of the subroutine, preceding the call to
the close
function:
$self->{'_format'} = $self->isformat; my $parsemethod = 'parse' . $self->{'_format'}; $self->$parsemethod;
These three new lines initialize the new attributes that were defined
for this class. First, a call to isformat
determines the format of the file and sets the
_format
attribute appropriately. The appropriate
parse_
method name is then constructed, and
finally, a call to that method is made. As youâre
about to see, the parse_
methods extract the
sequence, header, ID, and accession number (or as many of these as
possible) from the file data and set the objectâs
attributes accordingly.
So, by the end of the read
method, the file data
has been read into the object, the format determined, and the
interesting parts of the data (such as the sequence data) parsed out.
New Methods: is, parse, and put
The rest of the module consists of three groups of methods that
handle the different sequence file formats. These methods
didnât appear in the more simple and generic
FileIO
module and must be defined here:
-
is_
Tests to see if data is in a particular sequence file format
-
parse_
Parses out the sequence, and when possible the header, ID, and accession number
-
put_
Takes the sequence attribute, and when available the header, ID, and accession number attributes, and writes them in the sequence file format named in the format attribute
is_ methods
The first group of methods tests an objectâs file data to see if it conforms to a particular sequence file format:
sub is_raw { my($self) = @_; my $seq = join('', @{$self->{_filedata}} ); ($seq =~ /^[ACGNT\s]+$/) ? return 'raw' : 0; } sub is_embl { my($self) = @_; my($begin,$seq,$end) = (0,0,0); foreach( @{$self->{_filedata}} ) { /^ID\s/ && $begin++; /^SQ\s/ && $seq++; /^\/\// && $end++; (($begin = = 1) && ($seq = = 1) && ($end = = 1)) && return 'embl'; } return; } sub is_fasta { my($self) = @_; my($flag) = 0; for(@{$self->{_filedata}}) { #This to avoid confusion with Primer, which can have input beginning ">" /^\*seq.*:/i && ($flag = 0) && last; if( /^>/ && $flag = = 1) { last; }elsif( /^>/ && $flag = = 0) { $flag = 1; }elsif( (! /^>/) && $flag = = 0) { #first line must start with ">" last; } } $flag ? return 'fasta' : return; } sub is_gcg { my($self) = @_; my($i,$j) = (0,0); for(@{$self->{_filedata}}) { /^\s*$/ && next; /Length:.*Check:/ && ($i += 1); /^\s*\d+\s*[a-zA-Z\s]+/ && ($j += 1); ($i = = 1) && ($j = = 1) && return('gcg'); } return; } sub is_genbank { my($self) = @_; my $Features = 0; for(@{$self->{_filedata}}) { /^LOCUS/ && ($Features += 1); /^DEFINITION/ && ($Features += 2); /^ACCESSION/ && ($Features += 4); /^ORIGIN/ && ($Features += 8); /^\/\// && ($Features += 16); ($Features = = 31) && return 'genbank'; } return; } sub is_pir { my($self) = @_; my($ent,$ti,$date,$org,$ref,$sum,$seq,$end) = (0,0,0,0,0,0,0,0); for(@{$self->{_filedata}}) { /ENTRY/ && $ent++; /TITLE/ && $ti++; /DATE/ && $date++; /ORGANISM/ && $org++; /REFERENCE/ && $ref++; /SUMMARY/ && $sum++; /SEQUENCE/ && $seq++; /\/\/\// && $end++; $ent = = 1 && $ti = = 1 && $date >= 1 && $org >= 1 && $ref >= 1 && $sum = = 1 && $seq = = 1 && $end = = 1 && return 'pir'; } return; } sub is_staden { my($self) = @_; for(@{$self->{_filedata}}) { /<-+([^-]*)-+>/ && return 'staden'; } 0; }
is_
methods are designed to be fast. They
donât check to see if the file conforms to the
official file format in every detail. Instead, they look to see if,
given that the file is supposed to be a sequence file, there is a
good indication that it is a particular file format. In other words,
these methods perform heuristic, not rigorous, tests. If they are
well conceived, these methods correctly identify the different
formats without confusion and with a minimum of code and computation.
However, they donât ensure that a format is
conforming in every respect to the official format definition. (See
the exercises for other approaches to file format recognition.)
Also, note that these are fairly simple tests; for example, the
is_raw
method doesnât check for
the IUB ambiguity codes but only the four bases A, C, G, T, plus N
for an undetermined base. Furthermore, some software systems have
more than one format, for which only one format is shown here, such
as PIR and GCG. The bottom line is that this code does what it is
intended to do, but not more. It is, however, fairly short and easy
to modify and extend, and I encourage you to try your hand at doing
so in the exercises.
An interesting Perl note: the âleaning toothpickâ regular expression notation for three forward slashes /// is a bit forbidding because each forward slash needs a backslash escape in front of it:
/\/\/\// && $end++;
An alternative would be to use m
and a separator
other than /, which leads to more readable code:
m!///! && $end++;
One more interesting Perl note: in this code, I return a false value from a subroutine like so:
return;
This is a bit confusing because unless you already know or check in
the documentation for return
,
itâs not clear whatâs happening. In
a scalar context, return;
returns the undefined
value undef
; in a list context,
return;
returns an empty list. So, no matter what
context the subroutine is called from, a valid value is
returned.
put_ methods
The next group of put_
methods returns an
objectâs sequence and annotation data in a
particular sequence file format:
sub put_raw { my($self) = @_; my($out); ($out = $self->{_sequence}) =~ tr/a-z/A-Z/; return($out); } sub put_embl { my($self) = @_; my(@out,$tmp,$len,$i,$j,$a,$c,$g,$t,$o); $len = length($self->{_sequence}); $a=($self->{_sequence} =~ tr/Aa//); $c=($self->{_sequence} =~ tr/Cc//); $g=($self->{_sequence} =~ tr/Gg//); $t=($self->{_sequence} =~ tr/Tt//); $o=($len - $a - $c - $g - $t); $i=0; $out[$i++] = sprintf("ID %s %s\n",$self->{_header}, $self->{_id} ); $out[$i++] = "XX\n"; $out[$i++] = sprintf("SQ sequence %d BP; %d A; %d C; %d G; %d T; %d other;\n", $len, $a, $c, $g, $t, $o); for($j = 0 ; $j < $len ; ) { $out[$i] .= sprintf("%s",substr($self->{_sequence},$j,10)); $j += 10; if( $j < $len && $j % 60 != 0 ) { $out[$i] .= " "; }elsif ($j % 60 = = 0 ) { $out[$i++] .= "\n"; } } if($j % 60 != 0 ) { $out[$i++] .= "\n"; } $out[$i] = "//\n"; return @out; } sub put_fasta { my($self) = @_; my(@out,$len,$i,$j); $len = length($self->{_sequence}); $i = 0; $out[$i++] = "> " . $self->{_header} . "\n"; for($j=0; $j<$len ; $j += 50) { $out[$i++]=sprintf("%.50s\n",substr($self->{_sequence},$j,50)); } return @out; } sub put_gcg { my($self) = @_; my(@out,$len,$i,$j,$cnt,$sum); $len = length($self->{_sequence}); #calculate Checksum for($i=0; $i<$len ;$i++) { $cnt++; $sum += $cnt * ord(substr($self->{_sequence},$i,1)); ($cnt = = 57)&& ($cnt=0); } $sum %= 10000; $i = 0; $out[$i++] = sprintf("%s\n",$self->{_header}); $out[$i++] = sprintf(" %s Length: %d (today) Check: %d ..\n", $self->{_id}, $len, $sum); for($j = 0 ; $j < $len ; ) { if( $j % 50 = = 0) { $out[$i] = sprintf("%8d ",$j+1); } $out[$i] .= sprintf("%s",substr($self->{_sequence},$j,10)); $j += 10; if( $j < $len && $j % 50 != 0 ) { $out[$i] .= " "; }elsif ($j % 50 = = 0 ) { $out[$i++] .= "\n"; } } if($j % 50 != 0 ) { $out[$i] .= "\n"; } return @out; } sub put_genbank { my($self) = @_; my(@out,$len,$i,$j,$cnt,$sum); my($seq) = $self->{_sequence}; $seq =~ tr/A-Z/a-z/; $len = length($seq); for($i=0; $i<$len ;$i++) { $cnt++; $sum += $cnt * ord(substr($seq,$i,1)); ($cnt = = 57) && ($cnt=0); } $sum %= 10000; $i = 0; $out[$i++] = sprintf("LOCUS %s %d bp\n",$self->{_id}, $len); $out[$i++] = sprintf("DEFINITION %s , %d bases, %d sum.\n", $self->{_header}, $len, $sum); $out[$i++] = sprintf("ACCESSION %s\n", $self->{_accession}, ); $out[$i++] = sprintf("ORIGIN\n"); for($j = 0 ; $j < $len ; ) { if( $j % 60 = = 0) { $out[$i] = sprintf("%8d ",$j+1); } $out[$i] .= sprintf("%s",substr($seq,$j,10)); $j += 10; if( $j < $len && $j % 60 != 0 ) { $out[$i] .= " "; }elsif($j % 60 = = 0 ) { $out[$i++] .= "\n"; } } if($j % 60 != 0 ) { $out[$i] .= "\n"; ++i; } $out[$i] = "//\n"; return @out; } sub put_pir { my($self) = @_; my($seq) = $self->{_sequence}; my(@out,$len,$i,$j,$cnt,$sum); $len = length($seq); for($i=0; $i<$len ;$i++) { $cnt++; $sum += $cnt * ord(substr($seq,$i,1)); ($cnt= =57) && ($cnt=0); } $sum %= 10000; $i = 0; $out[$i++] = sprintf("ENTRY %s\n",$self->{_id}); $out[$i++] = sprintf("TITLE %s\n",$self->{_header}); #JDT ACCESSION out if defined $out[$i++] = sprintf("DATE %s\n",''); $out[$i++] = sprintf("REFERENCE %s\n",''); $out[$i++] = sprintf("SUMMARY #Molecular-weight %d #Length %d #Checksum %d\ n",0,$len,$sum); $out[$i++] = sprintf("SEQUENCE\n"); $out[$i++] = sprintf(" 5 10 15 20 25 30\n"); for($j=1; $seq && $j < $len ; $j += 30) { $out[$i++] = sprintf("%7d ",$j); $out[$i++] = sprintf("%s\n", join(' ',split(//,substr($seq, $j - 1,length($seq) < 30 ? length($seq) : 30))) ); } $out[$i++] = sprintf("///\n"); return @out; } sub put_staden { my($self) = @_; my($seq) = $self->{_sequence}; my($i,$j,$len,@out); $i = 0; $len = length($self->{_sequence}); $out[$i] = ";\<------------------\>\n"; substr($out[$i],int((20-length($self->{_id}))/2),length($self->{_id})) = $self-> {_id}; $i++; for($j=0; $j<$len ; $j+=60) { $out[$i++]=sprintf("%s\n",substr($self->{_sequence},$j,60)); } return @out; }
The put_
methods are, by necessity, more cognizant
of the detailed rules of a particular file format.
Note the Perl function ord
in
put_gcg
. This built-in function gives the numeric
value of a character. Other Perl functions such as
sprintf
and substr
can be
reviewed as necessary by typing perldoc
-f
substr
(for example) or by
visiting the http://www.perdoc.com web page.
parse_ methods
parse_
, the third and final group of methods,
parses the contents of a file in a particular format, extracting the
sequence and, if possible, some additional header information.
sub parse_raw { my($self) = @_; ## Header and ID should be set in calling program after this my($seq); $seq = join('',@{$self->{_filedata}}); if( ($seq =~ /^([acgntACGNT\-\s]+)$/)) { ($self->{_sequence} = $seq) =~ s/\s//g; }else{ carp("parse_raw failed"); } } sub parse_embl { my($self) = @_; my($begin,$seq,$end,$count) = (0,0,0,0); my($sequence,$head,$acc,$id); for(@{$self->{_filedata}}) { ++$count; if(/^ID/) { $begin++; /^ID\s*(.*\S)\s*/ && ($id = ($head = $1)) =~ s/\s.*//; }elsif(/^SQ\s/){ $seq++; }elsif(/^\/\//){ $end++; }elsif($seq = = 1){ $sequence .= $_; }elsif(/^AC\s*(.*(;|\S)).*/){ #put this here - AC could be sequence $acc .= $1; } if($begin = = 1 && $seq = = 1 && $end = = 1) { $sequence =~ tr/a-zA-Z//cd; $sequence =~ tr/a-z/A-Z/; $self->{_sequence} = $sequence; $self->{_header} = $head; $self->{_id} = $id; $self->{_accession} = $acc; return 1; } } return; } sub parse_fasta { my($self) = @_; my($flag,$count) = (0,0); my($seq,$head,$id); for(@{$self->{_filedata}}) { #avoid confusion with Primer, which can have input beginning ">" /^\*seq.*:/i && ($flag = = 0) && last; if(/^>/ && $flag = = 1) { last; }elsif(/^>/ && $flag = = 0){ /^>\s*(.*\S)\s*/ && ($id=($head=$1)) =~ s/\s.*//; $flag=1; }elsif( (! /^>/) && $flag = = 1) { $seq .= $_; }elsif( (! /^>/) && $flag = = 0) { last; } ++$count; } if($flag) { $seq =~ tr/a-zA-Z-//cd; $seq =~ tr/a-z/A-Z/; $self->{_sequence} = $seq; $self->{_header} = $head; $self->{_id} = $id; } } sub parse_gcg { my($self) = @_; my($seq,$head,$id); my($count,$flag) = (0,0); for(@{$self->{_filedata}}) { if(/^\s*$/) { ; }elsif($flag = = 0 && /Length:.*Check:/){ /^\s*(\S+).*Length:.*Check:/; $flag=1; ($id=$1) =~ s/\s.*//; }elsif($flag = = 0 && /^\S/) { ($head = $_) =~ s/\n//; }elsif($flag = = 1 && /^\s*[^\d\s]/) { last; }elsif($flag = = 1 && /^\s*\d+\s*[a-zA-Z \t]+$/) { $seq .= $_; } $count++; } $seq =~ tr/a-zA-Z//cd; $seq =~ tr/a-z/A-Z/; $head = $id unless $head; $self->{_sequence} = $seq; $self->{_header} = $head; $self->{_id} = $id; return 1; } # # sub parse_genbank { my($self) = @_; my($count,$features,$flag,$seqflag) = (0,0,0,0); my($seq,$head,$id,$acc); for(@{$self->{_filedata}}) { if( /^LOCUS/ && $flag = = 0 ) { /^LOCUS\s*(.*\S)\s*$/; ($id=($head=$1)) =~ s/\s.*//; $features += 1; $flag = 1; }elsif( /^DEFINITION\s*(.*)/ && $flag = = 1) { $head .= " $1"; $features += 2; }elsif( /^ACCESSION/ && $flag = = 1 ) { /^ACCESSION\s*(.*\S)\s*$/; $head .= " ".($acc=$1); $features += 4; }elsif( /^ORIGIN/ ) { $seqflag = 1; $features += 8; }elsif( /^\/\// ) { $features += 16; }elsif( $seqflag = = 0 ) { ; }elsif($seqflag = = 1) { $seq .= $_; } ++$count; if($features = = 31) { $seq =~ tr/a-zA-Z//cd; $seq =~ tr/a-z/A-Z/; $self->{_sequence} = $seq; $self->{_header} = $head; $self->{_id} = $id; $self->{_accession} = $acc; return 1; } } return; } sub parse_pir { my($self) = @_; my($begin,$tit,$date,$organism,$ref,$summary,$seq,$end,$count) = (0,0,0,0,0,0,0,0,0); my($flag,$seqflag) = (0,0); my($sequence,$header,$id,$acc); for(@{$self->{_filedata}}) { ++$count; if( /^ENTRY\s*(.*\S)\s*$/ && $flag = = 0 ) { $header=$1; $flag=1; $begin++; }elsif( /^TITLE\s*(.*\S)\s*$/ && $flag = = 1 ) { $header .= $1; $tit++; }elsif( /ORGANISM/ ) { $organism++; }elsif( /^ACCESSIONS\s*(.*\S)\s*$/ && $flag = = 1 ) { ($id=($acc=$1)) =~ s/\s*//; }elsif( /DATE/ ) { $date++; }elsif( /REFERENCE/ ) { $ref++; }elsif( /SUMMARY/ ) { $summary++; }elsif( /^SEQUENCE/ ) { $seqflag = 1; $seq++; }elsif( /^\/\/\// && $flag = = 1 ) { $end++; }elsif( $seqflag = = 0) { next; }elsif( $seqflag = = 1 && $flag = = 1) { $sequence .= $_; } if( $begin = = 1 && $tit = = 1 && $date >= 1 && $organism >= 1 && $ref >= 1 && $summary = = 1 && $seq = = 1 && $end = = 1 ) { $sequence =~ tr/a-zA-Z//cd; $sequence =~ tr/a-z/A-Z/; $self->{_sequence} = $seq; $self->{_header} = $header; $self->{_id} = $id; $self->{_accession} = $acc; return 1; } } return; } sub parse_staden { my($self) = @_; my($flag,$count) = (0,0); my($seq,$head,$id); for(@{$self->{_filedata}}) { if( /<---*\s*(.*[^-\s])\s*-*--->(.*)/ && $flag = = 0 ) { $id = $head = $1; $seq .= $2; $flag = 1; }elsif( /<---*(.*)-*--->/ && $flag = = 1 ) { $count--; last; }elsif( $flag = = 1 ) { $seq .= $_; } ++$count; } if( $flag ) { $seq =~ s/-/N/g; $seq =~ tr/a-zA-Z-//cd; $seq =~ tr/a-z/A-Z/; $self->{_sequence} = $seq; $self->{_header} = $head; $self->{_id} = $id; return 1; } return; } 1;
Thatâs the end of the
SeqFileIO.pm
module that defines the
SeqFileIO
class.
You can see that to add the capability to handle a new sequence file
format, you simply have to write new is_
,
put_
, and parse_
methods, and
add the name of the new format to the
@_seqfiletypes
array. So, extending this software
to handle more sequence file formats is relatively easy. (See the
exercises.)
To end this chapter, here is a small test program
testSeqFileIO
that exercises the main parts of the
SeqFileIO.pm
module, followed by its output. See
the exercises for a discussion of ways to improve this
test.
Testing SeqFileIO.pm
Here is the program
testSeqFileIO
to test the
SeqFileIO
module:
#!/usr/bin/perl use strict; use warnings; use lib "/home/tisdall/MasteringPerlBio/development/lib"; use SeqFileIO; # # First test basic FileIO operations # (plus filetype attribute) # my $obj = SeqFileIO->new( ); $obj->read( filename => 'file1.txt' ); print "The file name is ", $obj->get_filename, "\n"; print "The contents of the file are:\n", $obj->get_filedata; print "\nThe date of the file is ", $obj->get_date, "\n"; print "The filetype of the file is ", $obj->get_filetype, "\n"; $obj->set_date('today'); print "The reset date of the file is ", $obj->get_date, "\n"; print "The write mode of the file is ", $obj->get_writemode, "\n"; print "\nResetting the data and filename\n"; my @newdata = ("line1\n", "line2\n"); $obj->set_filedata( \@newdata ); print "Writing a new file \"file2.txt\"\n"; $obj->write(filename => 'file2.txt'); print "Appending to the new file \"file2.txt\"\n"; $obj->write(filename => 'file2.txt', writemode => '>>'); print "Reading and printing the data from \"file2.txt\":\n"; my $file2 = SeqFileIO->new( ); $file2->read( filename => 'file2.txt' ); print "The file name is ", $file2->get_filename, "\n"; print "The contents of the file are:\n", $file2->get_filedata; print "The filetype of the file is ", $file2->get_filetype, "\n"; print <<'HEADER'; ########################################## # # Test file format recognizing and reading # ########################################## HEADER my $genbank = SeqFileIO->new( ); $genbank->read( filename => 'record.gb' ); print "The file name is ", $genbank->get_filename, "\n"; print "\nThe date of the file is ", $genbank->get_date, "\n"; print "The filetype of the file is ", $genbank->get_filetype, "\n"; print "The contents of the file are:\n", $genbank->get_filedata; print "\n####################\n####################\n####################\n"; my $raw = SeqFileIO->new( ); $raw->read( filename => 'record.raw' ); print "The file name is ", $raw->get_filename, "\n"; print "\nThe date of the file is ", $raw->get_date, "\n"; print "The filetype of the file is ", $raw->get_filetype, "\n"; print "The contents of the file are:\n", $raw->get_filedata; print "\n####################\n####################\n####################\n"; my $embl = SeqFileIO->new( ); $embl->read( filename => 'record.embl' ); print "The file name is ", $embl->get_filename, "\n"; print "\nThe date of the file is ", $embl->get_date, "\n"; print "The filetype of the file is ", $embl->get_filetype, "\n"; print "The contents of the file are:\n", $embl->get_filedata; print "\n####################\n####################\n####################\n"; my $fasta = SeqFileIO->new( ); $fasta->read( filename => 'record.fasta' ); print "The file name is ", $fasta->get_filename, "\n"; print "\nThe date of the file is ", $fasta->get_date, "\n"; print "The filetype of the file is ", $fasta->get_filetype, "\n"; print "The contents of the file are:\n", $fasta->get_filedata; print "\n####################\n####################\n####################\n"; my $gcg = SeqFileIO->new( ); $gcg->read( filename => 'record.gcg' ); print "The file name is ", $gcg->get_filename, "\n"; print "\nThe date of the file is ", $gcg->get_date, "\n"; print "The filetype of the file is ", $gcg->get_filetype, "\n"; print "The contents of the file are:\n", $gcg->get_filedata; print "\n####################\n####################\n####################\n"; my $staden = SeqFileIO->new( ); $staden->read( filename => 'record.staden' ); print "The file name is ", $staden->get_filename, "\n"; print "\nThe date of the file is ", $staden->get_date, "\n"; print "The filetype of the file is ", $staden->get_filetype, "\n"; print "The contents of the file are:\n", $staden->get_filedata; print "\n####################\n####################\n####################\n"; print <<'REFORMAT'; ########################################## # # Test file format reformatting and writing # ########################################## REFORMAT print "At this point there are ", $staden->get_count, " objects.\n\n"; print "######\n###### Testing put methods\n######\n\n"; print "\nPrinting staden data in raw format:\n"; print $staden->put_raw; print "\nPrinting staden data in embl format:\n"; print $staden->put_embl; print "\nPrinting staden data in fasta format:\n"; print $staden->put_fasta; print "\nPrinting staden data in gcg format:\n"; print $staden->put_gcg; print "\nPrinting staden data in genbank format:\n"; print $staden->put_genbank; print "\nPrinting staden data in PIR format:\n"; print $staden->put_pir;
The test program depends on certain files being present on the system; the contents of the files are clear from the program output, and the files can be downloaded from this bookâs web site, along with the rest of the programs for this book.
Results
Here is the output from testSeqFileIO
:
The file name is file1.txt The contents of the file are: > sample dna (This is a typical fasta header.) agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat acacctgagccactctcagatgaggaccta The date of the file is Thu Dec 5 11:22:56 2002 The filetype of the file is _fasta The reset date of the file is today The write mode of the file is > Resetting the data and filename Writing a new file "file2.txt" Appending to the new file "file2.txt" Reading and printing the data from "file2.txt": The file name is file2.txt The contents of the file are: line1 line2 line1 line2 The filetype of the file is _unknown ########################################## # # Test file format recognizing and reading # ########################################## The file name is record.gb The date of the file is Sun Mar 30 14:30:09 2003 The filetype of the file is _genbank The contents of the file are: LOCUS AB031069 2487 bp mRNA PRI 27-MAY-2000 DEFINITION Sequence severely truncated for demonstration. ACCESSION AB031069 VERSION AB031069.1 GI:8100074 KEYWORDS . SOURCE Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to mRNA. ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (sites) AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and Takano,T. TITLE PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain, is regulated by proteolysis JOURNAL Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000) MEDLINE 20261256 REFERENCE 2 (bases 1 to 2487) AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and Takano,T. TITLE Direct Submission JOURNAL Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases. Tadahiro Fujino, Keio University School of Medicine, Department of Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan (E-mail:fujino@microb.med.keio.ac.jp, Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508) FEATURES Location/Qualifiers source 1..2487 /organism="Homo sapiens" /db_xref="taxon:9606" /sex="male" /cell_line="HuS-L12" /cell_type="lung fibroblast" /dev_stage="embryo" gene 229..2199 /gene="PCCX1" CDS 229..2199 /gene="PCCX1" /note="a nuclear protein carrying a PHD finger and a CXXC domain" /codon_start=1 /product="protein containing CXXC domain 1" /protein_id="BAA96307.1" /db_xref="GI:8100075" /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR" BASE COUNT 564 a 715 c 768 g 440 t ORIGIN 1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg 61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg 121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt 181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat 241 aaaaaaaaaa aaaaaaaaaa aaaaaaa // #################### #################### #################### The file name is record.raw The date of the file is Sun Mar 30 14:30:39 2003 The filetype of the file is _raw The contents of the file are: AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGCGCAAT AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTC TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAAAAAAA AAAAAAAAAAAA #################### #################### #################### The file name is record.embl The date of the file is Sun Mar 30 14:31:23 2003 The filetype of the file is _embl The contents of the file are: ID AB031069 2487 bp mRNA PRI 27-MAY-2000 Sequence severely truncated for demonstration. AB031069 AB031069 XX SQ sequence 267 BP; 65 A; 54 C; 106 G; 42 T; 0 other; AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA AAAAAAAAAA AAAAAAA // #################### #################### #################### The file name is record.fasta The date of the file is Sun Mar 30 14:31:40 2003 The filetype of the file is _fasta The contents of the file are: > AB031069 2487 bp mRNA PRI 27-MAY-2000 Sequence severely truncated for demonstration. AB031069 AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT CGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAA AAAAAAAAAAAAAAAAA #################### #################### #################### The file name is record.gcg The date of the file is Sun Mar 30 14:31:47 2003 The filetype of the file is _gcg The contents of the file are: AB031069 2487 bp mRNA PRI 27-MAY-2000 Sequence severely truncated for demonstration. AB031069 AB031069 Length: 267 (today) Check: 4285 .. 1 AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG 51 TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT 101 GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC 151 TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT 201 CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA 251 AAAAAAAAAA AAAAAAA #################### #################### #################### The file name is record.staden The date of the file is Sun Mar 30 14:32:01 2003 The filetype of the file is _staden The contents of the file are: ;<----AB031069------> AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGG AGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCG GAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTCTGTGCAGGTT CGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGAT AAAAAAAAAAAAAAAAAAAAAAAAAAA #################### #################### #################### ########################################## # # Test file format reformatting and writing # ########################################## At this point there are 8 objects. ###### ###### Testing put methods ###### Printing staden data in raw format: AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGCGCAAT AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTC TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAAAAAAAA AAAAAAAAAAA Printing staden data in embl format: ID AB031069 AB031069 XX SQ sequence 267 BP; 65 A; 54 C; 106 G; 42 T; 0 other; AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA AAAAAAAAAA AAAAAAA // Printing staden data in fasta format: > AB031069 AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT CGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAA AAAAAAAAAAAAAAAAA Printing staden data in gcg format: AB031069 AB031069 Length: 267 (today) Check: 4285 .. 1 AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG 51 TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT 101 GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC 151 TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT 201 CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA 251 AAAAAAAAAA AAAAAAA Printing staden data in genbank format: LOCUS AB031069 267 bp DEFINITION AB031069 , 267 bases, 829 sum. ACCESSION ORIGIN 1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg 61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg 121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt 181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat 241 aaaaaaaaaa aaaaaaaaaa aaaaaaa // Printing staden data in PIR format: ENTRY AB031069 TITLE AB031069 DATE REFERENCE SUMMARY #Molecular-weight 0 #Length 267 #Checksum 4285 SEQUENCE 5 10 15 20 25 30 1 A G A T G G C G G C G C T G A G G G G T C T T G G G G G C T 31 C T A G G C C G G C C A C C T A C T G G T T T G C A G C G G 61 A G A C G A C G C A T G G G G C C T G C G C A A T A G G A G 91 T A C G C T G C C T G G G A G G C G T G A C T A G A A G C G 121 G A A G T A G T T G T G G G C G C C T T T G C A A C C G C C 151 T G G G A C G C C G C C G A G T G G T C T G T G C A G G T T 181 C G C G G G T C G C T G G C G G G G G T C G T G A G G G A G 211 T G C G C C G G G A G C G G A G A T A T G G A G G G A G A T 241 A A A A A A A A A A A A A A A A A A A A A A A A A A A ///
Resources
Inheritance is a fundamental OO technique; see Section 3.14 for Perl OO reference material that includes discussion of inheritance.
For programming with sequence data file formats, see the C program readseq by Don Gilbert (at http://iubio.bio.indiana.edu/soft/molbio/readseq).
See the Bioperl project at http://www.bioperl.org for alternate ways to handle this programming task in Perl.
For a more rigorous but slower approach to parsing sequence files (which is sometimes what you want) see the module
Parse::RecDescent
by Damien Conway at CPAN.Each sequence file format has documentation that describes it. These formats sometimes change to keep up with the changing nature of biological data. One of the following exercises challenges you to find the documentation for one of the formats and to improve the code in this chapter for that format.
Exercises
- Exercise 4.1
Write an object-oriented module
DNAsequence
whose object has one attribute, a sequence of DNA, and two methods,get_dna
andset_dna
. Start with the code forGene.pm
, but see how far you can whittle it down to the minimum amount of code necessary to implement this new class.- Exercise 4.2
The
FileIO.pm
module implements objects that read and write file data. However, they can, depending on the program, deviate substantially from what are actually present in files on your computer. For instance, you can read in all the files in a folder, and then change the filenames and data of all the objects, without writing them out. Is this a good thing or a bad thing?- Exercise 4.3
In the text, you are asked why the
new
constructor forFileIO.pm
has been whittled down to the bare bones. You can see that all it does is create an empty object. What functionality has been moved out of thenew
constructor and into theread
andwrite
methods? Does it make more sense to do without anew
constructor entirely and instead have theread
andwrite
methods create objects? Try rewriting the code that way. Alternately, does it make sense to try rewriting the code so that both reading and writing are handled by thenew
constructor? Is creating an object sometimes logically distinct from initializing it?- Exercise 4.4
Use
FileIO.pm
as a base class for a new class that manages the annotation of a pipeline in your laboratory. For example, perhaps your lab gets sequence from your ABI machine, screens it for vectors, assesses the quality of the sequencing run, searches your local database to determine if youâve seen it or something like it before, then searches GenBank to see what other known sequences it matches or resembles, and finally adds it to an assembly project. Each step has a person or persons, a timestamp for the beginning and ending of each phase, and data. You want to be able to track the work done on each sequence that emerges from your ABI. (This is just an example. Pick a set of jobs that you actually do in your lab.)- Exercise 4.5
For each sequence file format handled by the
SeqFileIO.pm
module, find the documentation that specifies the format. Compare the documentation with theis_
,parse_
, andput_
method to recognize, read, and write files in each format. How can you improve this code? Make it more complete? Faster?- Exercise 4.6
My
parse_
methods are somewhat ad hoc. They donât really parse the whole file according to the definition of the format. They just extract the sequence and a small amount of annotation. Take one of the formats and write a more complete parser for it. What are the advantages and disadvantages of a simple versus a more complete parser in this code? How about for other applications you may want to develop in the future?- Exercise 4.7
Use the parser you developed in Exercise 4.6 to do a more complete job of identifying a file in the same format in the moduleâs
is_
method.- Exercise 4.8
Add a new sequence file format to
SeqFileIO
.- Exercise 4.9
In
FileIO.pm
, and in many other places in this book, the program callscroak
and exits when a problem arises (such as when unsuccessfully attempting to open a file for reading). Such drastic measures are sometimes desirable; for example, you may want to kill the program if a security problem is discovered in which someone is attempting to read a forbidden file. Or, when developing software, you may like your program to print an informative message and die when a problem occurs, as that might help you develop the program faster.However, very often what you really want is for the program to notice the error and take some appropriate steps, not simply die. If a file cannot be opened, it may be something as simple as the user of the program mistyping the filename, and what youâd like is to give the user another couple of chances to type the name in correctly. Rewrite
FileIO.pm
without callingcroak
. This may entail checking for the success or failure of certain operations and taking reasonable actions on failure. Should the class module take all such actions, or should the program that uses the class module be expected to behave appropriately when a failure is reported?- Exercise 4.10
The
AUTOLOAD
method inFileIO.pm
tests for attributes that are scalars and references to arrays. The need for this comes from the list of attributes given in the%_my_attribute_properties
hash. Each attribute hash value is an anonymous array with two elements: default value and properties. From the default value you can see that a value is either a scalar (a string in this case) or an anonymous array (a reference to an array). The code thatAUTOLOAD
installs for accessor routines then checks if the attribute is either a scalar or a reference to an array.This
AUTOLOAD
method is inherited bySeqFileIO.pm
. One of the modifications thatSeqFileIO.pm
makes is defining its own%_my_attribute_properties
to handle the new attributes that it defines, such as_sequence
. In this case, all the attributes are either scalars or references to arrays, as before. What modifications are necessary if some other data type is needed for a new attribute by a class that inheritedFileIO.pm
? How can you rewriteFileIO.pm
to make it easier to write classes that inherit it?- Exercise 4.11
The test program
testSeqFileIO
has certain shortcomings. For one thing, it repeats blocks of code that can be replaced with a short loop (with a little rewriting). Another problem is that it doesnât test everything in the class.Rewrite
testSeqFileIO
so that itâs clearer and more comprehensive. By default, make it just give a short summary of the number of tests performed and the number of tests passed, but add averbose
flag so that it prints out all its tests in detail when desired. The moduleSeqFileIO.pm
is lacking POD documentation.Add POD documentation to the module that is fairly easily cut and pasted into a test program for the module.- Exercise 4.12
In
SeqFileIO.pm
, the hash%_all_attribute_properties
changed from the base class and needed to be redefined. However, the code for the_all_attributes
,_attribute_default
, and_permissions
helper methods didnât change. Why then did the new classSeqFileIO
redefine these methods? (Hint: are these helper methods closures?)SeqFileIO.pm
is also lacking POD documentation. Try adding POD documentation to the module soy that it can be easily cut and pasted into a test program for the module.- Exercise 4.13
The
h2xs
program that ships with Perl simplifies module creation, and even helps you create theMakefile.PL
that youâll need to add your own module to CPAN or to your local installation (which helps you bypass the somewhat awkwarduse
lib
directive that appears in the programs in this book). See also theperlxstut
, theExtUtils::MakeMaker
, and theAUTOLOAD
manpages. In particular, see the-X
option toh2xs
. Write a module starting from the use ofh2xs
.- Exercise 4.14
The
open
calls in theread
methods of the classes in this chapter specify a filehandleFileIOFH
. Alternatives include using lexical scalars as filehandles or theIO::Handle
package. Rewrite theread
methods so files are opened with these alternative types of filehandles. What costs or benefits result from these rewritings? (See theperlopentut
part of the Perl documentation.)- Exercise 4.15
In the
AUTOLOAD
method, a copy of the file data is returned from theget_filedata
accessor; this will protect the actual file data in the object, but it makes a copy of a potentially very large amount of data, which can overtax your system. Discuss alternatives for this behavior, and implement one of them.- Exercise 4.16
Reading in a few hundred large files (as can easily happen with the modules in this chapter) can overtax your system, causing the system, or at least the program, to crash. Design two alternative methods that avoid this overuse of memory. For instance, you can avoid reading in a file until the sequence data is actually needed. You can also reread the data into the program each time needed but not save it in your object. Finally, you can reclaim memory from older files. Implement one of these methods or some other. What other parts of the code need to be altered?
[11] Don Gilbertâs
readseq
package (see http://iubio.bio.indiana.edu/soft/molbio/readseq
and ftp://ftp.bio.indiana.edu/molbio/readseq/classic/src)
is the classic program (written in C) for reading and writing
multiple sequence file formats.
Get Mastering Perl for Bioinformatics now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.