Preface

The history of biological research is filled with examples of new laboratory techniques which, at first, are suitable topics for doctoral theses but eventually become so widely useful and standard that they are learned by most undergraduates. Computer programming is one such technique: it is fast becoming a standard skill for many biologists. Bioinformatics is one of the most rapidly growing areas of biological science. Fundamentally, it’s a cross-disciplinary study, combining the questions of computer science and programming with those of biological research.

As active sciences evolve, unifying principles and techniques developed in one field are often found to be useful in other areas. As a result, the established boundaries between disciplines are sometimes blurred, and the new principles and techniques may result in new ways of seeing the science as a whole. For instance, molecular biology has developed a set of techniques over the past 50 years that has also proved useful throughout much of biology in general. Similarly, the methods of bioinformatics are finding fertile ground in such fields as genetics, biochemistry, molecular biology, evolutionary science, development, cell studies, clinical research, and field biology.

In my view, bioinformatics, which I define broadly as the use of computers in biological research, is becoming a foundational science for a broad range of biological studies. Just as it’s now commonplace to find a geneticist or a field biologist using the techniques of molecular biology as a routine part of her research, so can you frequently find that same researcher applying the techniques of bioinformatics. Molecular biology and bioinformatics may not be the researcher’s main areas of interest, but the tools from molecular biology and bioinformatics have become standard in searching for the answers to the questions of interest. The Perl programming language plays no small part in that search for answers.

About This Book

This book is a continuation of my previous book, Beginning Perl for Bioinformatics (also by O’Reilly & Associates). As the title implies, Mastering Perl for Bioinformatics moves you to a more advanced level of Perl programming in bioinformatics. In this volume, I cover such topics as advanced data structures, object-oriented programming, modules, relational databases, web programming, and more advanced algorithms. The main goal of this book is to help you learn to write Perl programs that support your research in biology and enable you to adapt and use programs written by others.

In the process of honing your programming skills, you will also learn the fundamentals of bioinformatics. For many readers, the material presented in these two books will be sufficient to support their goals in the laboratory. However, this book is not a comprehensive survey of bioinformatics techniques. Both Mastering Perl for Bioinformatics and Beginning Perl for Bioinformatics emphasize the computer programming aspects of bioinformatics. As a serious student, you should expect to follow this groundwork with further study in the bioinformatics literature. Even the Perl programming language has more complexity than can fit in this cross-disciplinary text.

Readers already familiar with basic Perl and the elements of DNA and proteins can use Mastering Perl for Bioinformatics without reference to Beginning Perl for Bioinformatics. However, the two books together make a complete course suitable for undergraduates, graduate students, and professional biologists who need to learn programming for biology research.

A companion web site at http://www.oreilly.com/catalog/mperlbio includes all the program code in the book.

What You Need to Know to Use This Book

This book assumes that you have some experience with Perl, including a working knowledge of writing, saving, and running programs; basic Perl syntax; control structures such as loops and conditional tests; the most common operators such as addition, subtraction, and string concatenation; input and output from the user, files, and other programs; subroutines; the basic data types of scalar, array, and hash; and regular expressions for searching and for altering strings. In other words, you should be able to program Perl well enough to extract data from sources such as GenBank and the Protein Data Bank using pattern matching and regular expressions.
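As a rough self-check of that level, here is a minimal sketch of the kind of program you should be comfortable reading. The record is a made-up, heavily shortened entry in GenBank flat-file style (the locus name EXAMPLE01 and the sequence are illustrative assumptions, not real data), and the variable names are my own choices for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up, heavily shortened record in GenBank flat-file style.
# The locus name and the sequence are illustrative only, not real data.
my $record = <<'END';
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN
DEFINITION  Hypothetical example sequence.
ORIGIN
        1 atggccgaat tcgggatcca ttaa
//
END

# Extract the locus name from the LOCUS line.
my ($locus) = $record =~ /^LOCUS\s+(\S+)/m;

# Capture everything between the ORIGIN line and the terminating "//",
# then strip the position numbers and whitespace to leave bare sequence.
my ($sequence) = $record =~ /^ORIGIN\s*\n(.*?)^\/\//ms;
$sequence =~ s/[\s\d]//g;

# Search for a motif: the EcoRI recognition site GAATTC.
my @positions;
while ($sequence =~ /gaattc/gi) {
    # pos() is the offset just past the match; convert to a 1-based start.
    push @positions, pos($sequence) - 6 + 1;
}

print "Locus: $locus\n";
print "Sequence: $sequence\n";
print "EcoRI sites at: @positions\n";
```

If the here-document, the regular expression modifiers, and the use of pos in this sketch all look familiar, you have the Perl background this book assumes.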

If you are new to Perl but feel you can forge ahead using a language summary and examples of programs, Appendix A provides a summary of the important parts of the Perl language. Previous programming experience in a high-level language such as C, Java, or FORTRAN (or any similar language); some experience at using subroutines to break a large problem into smaller, appropriately interrelated parts; and a tinkerer’s delight in taking things apart and seeing what makes them tick may be all the computer-science prerequisites you need.

This book is primarily written for biologists, so it assumes you know the elementary facts about DNA, proteins, and restriction enzymes; how to represent DNA and protein data in a Perl program; how to search for motifs; and the structure and use of the databases GenBank, PDB, and Rebase. Because the book assumes you are a biologist, it does not explain biology concepts in detail, so that it can concentrate on programming skills.

Biological data appears in many forms. The most important sources of biological data include the repository of public genetic data called GenBank (Genetic Data Bank) and the repository of public protein structure data called PDB (Protein Data Bank). Many other similar sources of biological data such as Rebase (Restriction Enzyme Database) are in wide use. All the databases just mentioned are most commonly distributed as text files, which makes Perl a good programming tool to find and extract information from the databases.

Organization of This Book

Here’s a quick summary of what the book covers. If you’re still relatively new to Perl, you may want to work through the chapters in order. If you have some programming experience and are looking for ways to approach problems in bioinformatics with Perl, feel free to skip around.

Part I

Chapter 1

Modules are the standard Perl way of “packaging” useful programs so that other programmers can easily use previous work. Such standard modules as CGI, for instance, put the power of interactive web site programming within reach of a programmer who knows basic Perl. Also discussed in later chapters are Bioperl, for manipulating biological data, and DBI, for gaining access to relational databases. Modules are sometimes considered the most important part of Perl because that’s where a lot of the functionality of Perl has been placed. In this chapter I show how to write your own modules, as well as how to find useful modules and use them in your programs.

Chapter 2

Complex data structures and references are fundamentally important to Perl. The basic Perl data structures of scalar, array, and hash go a long way toward solving many (perhaps most) Perl programming problems. However, many commonly used data structures, such as multidimensional arrays, require more sophisticated Perl data structures to handle them. Perl enables you to define quite complex data structures, and we’ll see how all that works.

String algorithms are standard techniques used in bioinformatics for finding important data in biological sequences; with them, you can compare two sequences, align two or more sequences, assemble a collection of sequence fragments, and so forth. String algorithms underlie many of the most commonly used programs in biology research, such as BLAST. In this chapter, a string matching algorithm that finds the closest match to a motif, based on the technique of dynamic programming, is presented in the form of a working Perl program.

Chapter 3

Object-oriented programming is a standard approach to designing programs. I assume, as a prerequisite, that you are familiar with the programming style called imperative (or procedural) programming. (For example, C and FORTRAN are imperative; C++ and Java are object-oriented; Perl can be either.) It’s important for the Perl programmer to be familiar with the object-oriented approach. For instance, modules are usually defined in an object-oriented manner.

This chapter presents, step by step, the concepts and techniques of object-oriented Perl programming, in the context of a module that defines a simple class for keeping track of genes.

Chapter 4

In this chapter, object-oriented programming is further explored in the context of developing software to convert sequence files to alternate formats (FASTA, GCG, etc.). The concept of class inheritance is introduced and implemented.

Chapter 5

This chapter further develops object-oriented programming by writing a class that handles Rebase restriction enzyme data, a class that calculates restriction maps, and a class that draws restriction maps.

Part II

Chapter 6

Relational databases are important in programming because they save, organize, and retrieve data sets. This chapter introduces relational databases and the SQL language and includes information on designing and administering databases. I take a close look at how one such relational database management system, the popular MySQL, is used from the Perl language.

Chapter 7

Web programming is one of Perl’s areas of strength. In this chapter, I start an example that puts a laboratory up on the Web using Perl and the CGI module. The software developed in previous chapters for restriction mapping is made accessible from the Web.

Chapter 8

Using computer graphics to display data is one of the most important programming skills in bioinformatics. In this chapter, graphics programs are used to dynamically display the output of restriction maps and data presented as graphs on the Web. The Perl module GD is discussed and used to generate maps on the fly from web page queries.

Chapter 9

Bioperl is a set of modules used by Perl programmers to write bioinformatics applications. In this chapter you’ll see an introduction to the Bioperl project. Bioperl is open source (free under a very nonrestrictive copyright) and developed by a group of volunteers, many based in supportive research organizations. In recent years it has achieved critical mass and is now adequately documented and fairly broad in scope. If you do Perl bioinformatics programming, you should certainly be aware of what Bioperl has to offer, to avoid reinventing the wheel.

Part III

Appendix A

This appendix summarizes the parts of Perl we’ve covered.

Appendix B

This appendix outlines how to install Perl.

Conventions Used in This Book

The following conventions are used in this book:

Constant width

Used for arrays, classes, code examples, loops, modules, namespaces, objects, packages, statements, and to show the output of commands.

Italics

Used for commands, directory names, filenames, example URLs, variables, and for new terms where they are defined.

Tip

This icon designates a note, which is an important aside to the nearby text.

Warning

This icon designates a warning relating to the nearby text.

Comments and Questions

Please address comments and questions concerning this book to the publisher:

O’Reilly & Associates, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international or local)
(707) 829-0104 (fax)

There is a web page for this book, which lists errata, examples, and any additional information. You can access this page at:

http://www.oreilly.com/catalog/mperlbio

To comment or ask technical questions about this book, send email to:

For more information about books, conferences, Resource Centers, and the O’Reilly Network, see the O’Reilly web site at:

http://www.oreilly.com

Acknowledgments

My editor, Lorrie LeJeune, deserves special thanks for her work in developing the bioinformatics titles at O’Reilly. Her level of expertise is rare in any field. I thank Lorrie, Tim O’Reilly, and their colleagues for making it possible to bring these books to the public. I thank my technical reviewers for their invaluable expert help: Joel Greshock, Joe Johnston, Andrew Martin, and Sean Quinlan. I also thank Dr. Michael Caudy for his helpful suggestions in Chapter 3. I thank again those individuals mentioned in the first volume, especially those friends who have supported me during the writing of this book. I am also grateful to all those readers of the first volume who took the time and trouble to point out errors and weaknesses; their comments have substantially improved this volume as well. I thank Eamon Grennan and Jay Parini for their patient help with my writing. And I especially thank my much-loved children Rose, Eamon, and Joe, who are my most sincere teachers.
