From the interaction of species and populations, to the function of tissues and cells within an individual organism, biology is defined as the study of living things. In the course of that study, biologists collect and interpret data. Now, at the beginning of the 21st century, we use sophisticated laboratory technology that allows us to collect data faster than we can interpret it. We have vast volumes of DNA sequence data at our fingertips. But how do we figure out which parts of that DNA control the various chemical processes of life? We know the function and structure of some proteins, but how do we determine the function of new proteins? And how do we predict what a protein will look like, based on knowledge of its sequence? We understand the relatively simple code that translates DNA into protein. But how do we find meaningful new words in the code and add them to the DNA-protein dictionary?
Bioinformatics is the science of using information to understand biology; it's the tool we can use to help us answer these questions and many others like them. Unfortunately, with all the hype about mapping the human genome, bioinformatics has achieved buzzword status; the term is being used in a number of ways, depending on who is using it. Strictly speaking, bioinformatics is a subset of the larger field of computational biology, the application of quantitative analytical techniques in modeling biological systems. In this book, we stray from bioinformatics into computational biology and back again. The distinctions between the two aren't important for our purpose here, which is to cover a range of tools and techniques we believe are critical for molecular biologists who want to understand and apply the basic computational tools that are available today.
The field of bioinformatics relies heavily on work by experts in statistical methods and pattern recognition. Researchers come to bioinformatics from many fields, including mathematics, computer science, and linguistics. Unfortunately, biology is a science of the specific as well as the general. Bioinformatics is full of pitfalls for those who look for patterns and make predictions without a complete understanding of where biological data comes from and what it means. By providing algorithms, databases, user interfaces, and statistical tools, bioinformatics makes it possible to do exciting things such as compare DNA sequences and generate results that are potentially significant. "Potentially significant" is perhaps the most important phrase. These new tools also give you the opportunity to overinterpret data and assign meaning where none really exists. We can't overstate the importance of understanding the limitations of these tools. But once you gain that understanding and become an intelligent consumer of bioinformatics methods, the speed at which your research progresses can be truly amazing.
An organism's hereditary and functional information is stored as DNA, RNA, and proteins, all of which are linear chains composed of smaller molecules. These macromolecules are assembled from a fixed alphabet of well-understood chemicals: DNA is made up of four deoxyribonucleotides (adenine, thymine, cytosine, and guanine), RNA is made up from the four ribonucleotides (adenine, uracil, cytosine, and guanine), and proteins are made from the 20 amino acids. Because these macromolecules are linear chains of defined components, they can be represented as sequences of symbols. These sequences can then be compared to find similarities that suggest the molecules are related by form or function.
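As a minimal sketch of this idea, each kind of macromolecule can be treated as a string over its own alphabet. The Python below is our own illustration: the function name `classify_sequence` is hypothetical, and the protein alphabet is assumed to be the standard one-letter codes for the 20 amino acids.

```python
# Alphabets for the three kinds of linear macromolecule described above.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # standard one-letter codes


def classify_sequence(seq):
    """Guess the molecule type from the symbols a sequence uses.

    DNA is tested first because its letters are also valid amino acid
    codes; a short peptide such as "GATTACA" is therefore reported as DNA.
    """
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    if letters <= DNA_ALPHABET:
        return "DNA"
    if letters <= RNA_ALPHABET:
        return "RNA"
    if letters <= PROTEIN_ALPHABET:
        return "protein"
    return "unknown"


print(classify_sequence("GATTACA"))
print(classify_sequence("GAUUACA"))
print(classify_sequence("HSGVNQLGGVFV"))
```

The ambiguity noted in the docstring is real: the symbols alone don't always tell you what kind of molecule you are looking at, which is one reason sequence databases record the molecule type explicitly.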
Sequence comparison is possibly the most useful computational tool to emerge for molecular biologists. The World Wide Web has made it possible for a single public database of genome sequence data to provide services through a uniform interface to a worldwide community of users. With a commonly used computer program called BLAST, a molecular biologist can compare an uncharacterized DNA sequence to the entire publicly held collection of DNA sequences. In the next section, we present an example of how sequence comparison using the BLAST program can help you gain insight into a real disease.
Fruit flies (Drosophila melanogaster) are a popular model system for the study of development of animals from embryo to adult. Fruit flies have a gene called eyeless, which, if it's "knocked out" (i.e., eliminated from the genome using molecular biology methods), results in fruit flies with no eyes. It's obvious that the eyeless gene plays a role in eye development.
Researchers have identified a human gene responsible for a condition called aniridia. In humans who are missing this gene (or in whom the gene has mutated just enough for its protein product to stop functioning properly), the eyes develop without irises.
If the gene for aniridia is inserted into an eyeless Drosophila "knock out," it causes the production of normal Drosophila eyes. It's an interesting coincidence. Could there be some similarity in how eyeless and aniridia function, even though flies and humans are vastly different organisms? Possibly. To gain insight into how eyeless and aniridia are related, we can compare their sequences. Always bear in mind, however, that genes have complex effects on one another. Careful experimentation is required to get a more definitive answer.
As little as 15 years ago, looking for similarities between eyeless and aniridia DNA sequences would have been like looking for a needle in a haystack. Most scientists compared the respective gene sequences by hand-aligning them one under the other in a word processor and looking for matches character by character. This was time-consuming, not to mention hard on the eyes.
In the late 1980s, fast computer programs for comparing sequences changed molecular biology forever. Pairwise comparison of biological sequences is the foundation of most widely used bioinformatics techniques. Many tools that are widely available to the biology community—including everything from multiple alignment, phylogenetic analysis, motif identification, and homology-modeling software, to web-based database search services—rely on pairwise sequence-comparison algorithms as a core element of their function.
These days, a biologist can find dozens of sequence matches in seconds using sequence-alignment programs such as BLAST and FASTA. These programs are so commonly used that the first encounter you have with bioinformatics tools and biological databases will probably be through the National Center for Biotechnology Information's (NCBI) BLAST web interface. Figure 1-1 shows a standard form for submitting data to NCBI for a BLAST search.
It's important to remember that biological sequence (DNA or protein) has a chemical function, but when it's reduced to a single-letter code, it also functions as a unique label, almost like a bar code. From the information technology point of view, sequence information is priceless. The sequence label can be applied to a gene, its product, its function, its role in cellular metabolism, and so on. The user searching for information related to a particular gene can then use rapid pairwise sequence comparison to access any information that's been linked to that sequence label.
The most important thing about these sequence labels, though, is that they don't just uniquely identify a particular gene; they also contain biologically meaningful patterns that allow users to compare different labels, connect information, and make inferences. So not only can the labels connect all the information about one gene, they can help users connect information about genes that are slightly or even dramatically different in sequence.
If simple labels were all that was needed to make sense of biological data, you could just slap a unique number (e.g., a GenBank ID) onto every DNA sequence and be done with it. But biological sequences are related by evolution, so a partial pattern match between two sequence labels is a significant find. BLAST differs from simple keyword searching in its ability to detect partial matches along the entire length of a protein sequence.
When the two sequences are compared using BLAST, you'll find that eyeless is a partial match for aniridia. The text that follows is the raw data that's returned from this BLAST search:
pir||A41644 homeotic protein aniridia - human
Length = 447

Score = 256 bits (647), Expect = 5e-67
Identities = 128/146 (87%), Positives = 134/146 (91%), Gaps = 1/146 (0%)

Query: 24  IERLPSLEDMAHKGHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSN 83
           I R P+   M +  HSGVNQLGGVFV GRPLPDSTRQKIVELAHSGARPCDISRILQVSN
Sbjct: 17  IPRPPARASMQNS-HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSN 75

Query: 84  GCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQEN 143
           GCVSKILGRYYETGSIRPRAIGGSKPRVAT EVVSKI+QYKRECPSIFAWEIRDRLL E
Sbjct: 76  GCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEG 135

Query: 144 VCTNDNIPSVSSINRVLRNLAAQKEQ 169
           VCTNDNIPSVSSINRVLRNLA++K+Q
Sbjct: 136 VCTNDNIPSVSSINRVLRNLASEKQQ 161

Score = 142 bits (354), Expect = 1e-32
Identities = 68/80 (85%), Positives = 74/80 (92%)

Query: 398 TEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQV 457
           +++ Q RL LKRKLQRNRTSFT +QI++LEKEFERTHYPDVFARERLA KI LPEARIQV
Sbjct: 222 SDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQV 281

Query: 458 WFSNRRAKWRREEKLRNQRR 477
           WFSNRRAKWRREEKLRNQRR
Sbjct: 282 WFSNRRAKWRREEKLRNQRR 301
The output shows local alignments of two high-scoring matching regions in the protein sequences of the eyeless and aniridia genes. In each set of three lines, the query sequence (the eyeless sequence that was submitted to the BLAST server) is on the top line, and the aniridia sequence is on the bottom line. The middle line shows where the two sequences match. If there is a letter on the middle line, the sequences match exactly at that position. If there is a plus sign on the middle line, the two sequences are different at that position, but there is some chemical similarity between the amino acids (e.g., D and E, aspartic and glutamic acid). If there is nothing on the middle line, the two sequences don't match at that position.
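The rules for the middle line can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how BLAST itself is written: BLAST decides which substitutions count as "similar" from a full substitution matrix, whereas the `SIMILAR_PAIRS` set here is a tiny hand-picked stand-in, and the function names are hypothetical. The two sequences are the last aligned block of the first match in the output above.

```python
# Hand-picked stand-in for a substitution matrix: unordered residue pairs
# that we treat as chemically similar (an assumption for illustration).
SIMILAR_PAIRS = {frozenset("DE"), frozenset("AS"), frozenset("EQ")}


def midline(query, subject):
    """Build the middle line of an alignment block.

    A letter marks an exact match, '+' marks a similar pair, and a
    space marks a mismatch or a gap.
    """
    marks = []
    for q, s in zip(query, subject):
        if q == s and q != "-":
            marks.append(q)
        elif frozenset((q, s)) in SIMILAR_PAIRS:
            marks.append("+")
        else:
            marks.append(" ")
    return "".join(marks)


def count_identities(query, subject):
    """Count aligned positions where both sequences have the same residue."""
    return sum(1 for q, s in zip(query, subject) if q == s and q != "-")


eyeless_block = "VCTNDNIPSVSSINRVLRNLAAQKEQ"
aniridia_block = "VCTNDNIPSVSSINRVLRNLASEKQQ"

print(eyeless_block)
print(midline(eyeless_block, aniridia_block))
print(aniridia_block)
print(count_identities(eyeless_block, aniridia_block), "identities of", len(eyeless_block))
```

Summing identities and similar pairs over every block of a match gives the "Identities" and "Positives" figures reported in the header of that match.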
In this example, you can see that, if you submit the whole eyeless gene sequence and look (as standard keyword searches do) for an exact match, you won't find anything. The local sequence regions make up only part of the complete proteins: the region from 24-169 in eyeless matches the region from 17-161 in the human aniridia gene, and the region from 398-477 in eyeless matches the region from 222-301 in aniridia. The rest of the sequence doesn't match! Even the two regions shown, which match closely, don't match 100%, as they would have to, in order to be found in a keyword search.
However, this partial match is significant. It tells us that the human aniridia gene, which we don't know much about, is substantially related in sequence to the fruit fly's eyeless gene. And we do know a lot about the eyeless gene, from its structure and function (it's a DNA binding protein that promotes the activity of other genes) to its effects on the phenotype—the form of the grown fruit fly.
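The difference between a keyword search and a partial match can be illustrated with a short Python sketch. The two strings are the first aligned block of the BLAST output above (with the gap character removed from the aniridia fragment). The helper names are hypothetical, and `difflib` from the Python standard library is used here only as a convenient way to find a shared run of characters; it is not a sequence-alignment tool.

```python
from difflib import SequenceMatcher

# First aligned block from the BLAST report (gap character removed).
eyeless_fragment = "IERLPSLEDMAHKGHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSN"
aniridia_fragment = "IPRPPARASMQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSN"


def keyword_hit(query, text):
    """A keyword search succeeds only if the whole query occurs verbatim."""
    return query in text


def longest_shared_run(a, b):
    """Return the longest stretch of characters found verbatim in both strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


print(keyword_hit(eyeless_fragment, aniridia_fragment))
shared = longest_shared_run(eyeless_fragment, aniridia_fragment)
print(shared, len(shared))
```

The keyword test fails even though more than half of the fragment is shared letter for letter, which is the situation a partial-match tool is designed to detect.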
BLAST finds local regions that match even in pairs of sequences that aren't exactly the same overall. It extends matches beyond a single-character difference in the sequence, and it keeps trying to extend them in all directions until the overall score of the sequence match gets too small. As a result, BLAST can detect patterns that are imperfectly replicated from sequence to sequence, and hence distant relationships that are inexact but still biologically meaningful.
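The extension idea can be sketched as a toy "seed and extend" search in Python. This is a deliberately simplified illustration, not BLAST's actual implementation: real BLAST scores residue pairs with a substitution matrix, allows gaps, and evaluates the statistical significance of each hit. The +1/-1 scoring, the word size of 3, the drop-off threshold of 3, and all function names are our own assumptions.

```python
MATCH, MISMATCH = 1, -1  # toy scores; BLAST uses a substitution matrix


def seeds(query, subject, k):
    """Yield (i, j) wherever query[i:i+k] equals subject[j:j+k]."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            yield i, j


def extend(query, subject, i, j, step, x_drop):
    """Extend without gaps from (i, j) in direction step (+1 or -1).

    Stops when the running score falls more than x_drop below the best
    score seen. Returns (best score gained, number of positions kept).
    """
    score = best = kept = walked = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += MATCH if query[i] == subject[j] else MISMATCH
        walked += 1
        if score > best:
            best, kept = score, walked
        elif best - score > x_drop:
            break
        i += step
        j += step
    return best, kept


def best_local_match(query, subject, k=3, x_drop=3):
    """Return (score, query start, subject start, length) of the best hit."""
    best = None
    for i, j in seeds(query, subject, k):
        r_score, r_len = extend(query, subject, i + k, j + k, 1, x_drop)
        l_score, l_len = extend(query, subject, i - 1, j - 1, -1, x_drop)
        hit = (k * MATCH + r_score + l_score, i - l_len, j - l_len,
               l_len + k + r_len)
        if best is None or hit[0] > best[0]:
            best = hit
    return best


query = "VCTNDNIPSVSSINRVLRNLAAQKEQ"
subject = "VCTNDNIPSVSSINRVLRNLASEKQQ"
score, q_start, s_start, length = best_local_match(query, subject)
print(score, query[q_start:q_start + length])
```

With these toy scores the reported segment ends before the mismatches near the end of the block, because the few matches that follow them never raise the running score above its earlier best. A similarity-aware scoring scheme such as the one BLAST uses would treat those positions more generously.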
Depending on the quality of the match between two labels, you can transfer the information attached to one label to the other. A high-quality sequence match between two full-length sequences may suggest the hypothesis that their functions are similar, although it's important to remember that the identification is only tentative until it's been experimentally verified. In the case of the eyeless and aniridia genes, scientists hope that studying the role of the eyeless gene in Drosophila eye development will help us understand how aniridia works in human eye development.
Much of what we currently think of as part of bioinformatics—sequence comparison, sequence database searching, sequence analysis—is more complicated than just designing and populating databases. Bioinformaticians (or computational biologists) go beyond just capturing, managing, and presenting data, drawing inspiration from a wide variety of quantitative fields, including statistics, physics, computer science, and engineering. Figure 1-2 shows how quantitative science intersects with biology at every level, from analysis of sequence data and protein structure, to metabolic modeling, to quantitative analysis of populations and ecology.
Bioinformatics is first and foremost a component of the biological sciences. The main goal of bioinformatics isn't developing the most elegant algorithms or the most arcane analyses; the goal is finding out how living things work. Like the molecular biology methods that greatly expanded what biologists were capable of studying, bioinformatics is a tool and not an end in itself. Bioinformaticians are the tool-builders, and it's critical that they understand biological problems as well as computational solutions in order to produce useful tools.
Research in bioinformatics and computational biology can encompass anything from abstraction of the properties of a biological system into a mathematical or physical model, to implementation of new algorithms for data analysis, to the development of databases and web tools to access them.
Biology as a science of the specific means that biologists need to remember a lot of details as well as general principles. Biologists have been dealing with problems of information management since the 17th century.
The roots of the concept of evolution lie in the work of early biologists who catalogued and compared species of living things. The cataloguing of species was the preoccupation of biologists for nearly three centuries, beginning with animals and plants and continuing with microscopic life upon the invention of the compound microscope. New forms of life and fossils of previously unknown, extinct life forms are still being discovered even today.
All this cataloguing of plants and animals resulted in what seemed a vast amount of information at the time. In the mid-16th century, Otto Brunfels published the first major modern work describing plant species, the Herbarum vivae eicones. As Europeans traveled more widely around the world, the number of catalogued species increased, and botanical gardens and herbaria were established. The number of catalogued plant types was 500 at the time of Theophrastus, a student of Aristotle. By 1623, Caspar Bauhin had observed 6,000 types of plants. Not long after, John Ray introduced the concept of distinct species of animals and plants, and developed guidelines based on anatomical features for distinguishing conclusively between species. In the 1730s, Carolus Linnaeus catalogued 18,000 plant species and over 4,000 species of animals, and established the basis for the modern taxonomic naming system of kingdoms, classes, genera, and species. By the end of the 18th century, Baron Cuvier had listed over 50,000 species of plants.
It was no coincidence that a concurrent preoccupation of biologists, at this time of exploration and cataloguing, was classification of species into an orderly taxonomy. A botany text might encompass several volumes of data, in the form of painstaking illustrations and descriptions of each species encountered. Biologists were faced with the problem of how to organize, access, and sensibly add to this information. It was apparent to the casual observer that some living things were more closely related than others. A rat and a mouse were clearly more similar to each other than a mouse and a dog. But how would a biologist know that a rat was like a mouse (but that rat was not just another name for mouse) without carrying around his several volumes of drawings? A nomenclature that uniquely identified each living thing and summed up its presumed relationship with other living things, all in a few words, needed to be invented.
The solution was relatively simple, but at the time, a great innovation. Species were to be named with a series of one-word names of increasing specificity. First a very general division was specified: animal or plant? This was the kingdom to which the organism belonged. Then, with increasing specificity, came the names for class, genus, and species. This schematic way of classifying species, as illustrated in Figure 1-3, is now known as the "Tree of Life."
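A naming scheme of increasing specificity maps naturally onto a simple data structure. The Python sketch below is our own illustration: the `LINEAGES` table and the `shared_ranks` function are hypothetical names, and we include the rank of order, which the text above doesn't mention, so that the rat, mouse, and dog example from earlier has a level at which the three animals separate.

```python
# Ranks from most general to most specific. "order" is added for
# illustration; the text above names only kingdom, class, genus, species.
RANKS = ("kingdom", "class", "order", "genus", "species")

LINEAGES = {
    "rat": ("Animalia", "Mammalia", "Rodentia", "Rattus", "norvegicus"),
    "mouse": ("Animalia", "Mammalia", "Rodentia", "Mus", "musculus"),
    "dog": ("Animalia", "Mammalia", "Carnivora", "Canis", "familiaris"),
}


def shared_ranks(first, second):
    """List the ranks two organisms share, reading from the top down."""
    shared = []
    for rank, a, b in zip(RANKS, LINEAGES[first], LINEAGES[second]):
        if a != b:
            break
        shared.append(rank)
    return shared


print(shared_ranks("rat", "mouse"))
print(shared_ranks("mouse", "dog"))
```

The longer the list of shared ranks, the closer the presumed relationship, which is exactly the information the nomenclature was invented to carry in a few words.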
A modern taxonomy of the earth's millions of species is too complicated for even the most zealous biologist to memorize, and fortunately computers now provide a way to maintain and access the taxonomy of species. The University of Arizona's Tree of Life project and NCBI's Taxonomy database are two examples of online taxonomy projects.
Taxonomy was the first informatics problem in biology. Now, biologists have reached a similar point of information overload by collecting and cataloguing information about individual genes. The problem of organizing this information and sharing knowledge with the scientific community at the gene level isn't being tackled by developing a nomenclature. It's being attacked directly with computers and databases from the start.
The evolution of computers over the last half-century has fortuitously paralleled the developments in the physical sciences that allow us to see biological systems in increasingly fine detail. Figure 1-4 illustrates the astonishing rate at which biological knowledge has expanded in the last 20 years.
Simply finding the right needles in the haystack of information that is now available can be a research problem in itself. Even in the late 1980s, finding a match in a sequence database was worth a five-page publication. Now this procedure is routine, but there are many other questions that follow on our ability to search sequence and structure databases. These questions are the impetus for the field of bioinformatics.
The science of informatics is concerned with the representation, organization, manipulation, distribution, maintenance, and use of information, particularly in digital form. There is more than one interpretation of what bioinformatics—the intersection of informatics and biology—actually means, and it's quite possible to go out and apply for a job doing bioinformatics and find that the expectations of the job are entirely different than you thought.
The functional aspect of bioinformatics is the representation, storage, and distribution of data. Intelligent design of data formats and databases, creation of tools to query those databases, and development of user interfaces that bring together different tools to allow the user to ask complex questions about the data are all aspects of the development of bioinformatics infrastructure.
Developing analytical tools to discover knowledge in data is the second, and more scientific, aspect of bioinformatics. There are many levels at which we use biological information, whether we are comparing sequences to develop a hypothesis about the function of a newly discovered gene, breaking down known 3D protein structures into bits to find patterns that can help predict how the protein folds, or modeling how proteins and metabolites in a cell work together to make the cell function. The ultimate goal of analytical bioinformaticians is to develop predictive methods that allow scientists to model the function and phenotype of an organism based only on its genome sequence. This is a grand goal, and one that will be approached only in small steps, by many scientists working together.
Cracking the genome code is complex. At the very simplest level, we still have difficulty identifying unknown genes by computer analysis of genomic sequence. We still have not managed to predict or model how a chain of amino acids folds into the specific structure of a functional protein.
Beyond the single-molecule level, the challenges are immense. The sheer amount of data in GenBank is now growing at an exponential rate, and as datatypes beyond DNA, RNA, and protein sequence begin to undergo the same kind of explosion, simply managing, accessing, and presenting this data to users in an intelligible form is a critical task. Human-computer interaction specialists need to work closely with academic and clinical researchers in the biological sciences to manage such staggering amounts of data.
Biological data is very complex and interlinked. A spot on a DNA array, for instance, is connected not only to immediate information about its intensity, but to layers of information about genomic location, DNA sequence, structure, function, and more. Creating information systems that allow biologists to seamlessly follow these links without getting lost in a sea of information is also a huge opportunity for computer scientists.
Finally, each gene in the genome isn't an independent entity. Multiple genes interact to form biochemical pathways, which in turn feed into other pathways. Biochemistry is influenced by the external environment, by interaction with pathogens, and by other stimuli. Putting genomic and biochemical data together into quantitative and predictive models of biochemistry and physiology will be the work of a generation of computational biologists. Computer scientists, mathematicians, and statisticians will be a vital part of this effort.
There's a wide range of topics that are useful if you're interested in pursuing bioinformatics, and it's not possible to learn them all. However, in our conversations with scientists working at companies such as Celera Genomics and Eli Lilly, we've picked up on the following "core requirements" for bioinformaticians:
You should have a fairly deep background in some aspect of molecular biology. It can be biochemistry, molecular biology, molecular biophysics, or even molecular modeling, but without a core of knowledge of molecular biology you will, as one person told us, "run into brick walls too often."
You must absolutely understand the central dogma of molecular biology. Understanding how and why DNA sequence is transcribed into RNA and translated into protein is vital. (In Chapter 2, we define the central dogma, as well as review the processes of transcription and translation.)
You should have substantial experience with at least one or two major molecular biology software packages, either for sequence analysis or molecular modeling. The experience of learning one of these packages makes it much easier to learn to use other software quickly.
You should have experience with programming in a computer language such as C/C++, as well as in a scripting language such as Perl or Python.
There are a variety of other advanced skill sets that can add value to this background: molecular evolution and systematics; physical chemistry—kinetics, thermodynamics and statistical mechanics; statistics and probabilistic methods; database design and implementation; algorithm development; molecular biology laboratory methods; and others.
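The central dogma named in the requirements above is also a good first programming exercise. The Python sketch below is a minimal illustration under stated assumptions: the DNA string is taken to be the coding strand, so transcription is modeled as replacing T with U; translation starts at the first base and stops at the first stop codon; and the function names are hypothetical. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Model transcription of a coding-strand DNA sequence into mRNA."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA from its first base, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGAAATGGTGA"
message = transcribe(gene)
print(message)
print(translate(message))
```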
Computers are powerful devices for understanding any system that can be described in a mathematical way. As our understanding of biological processes has grown and deepened, it isn't surprising, then, that the disciplines of computational biology and, more recently, bioinformatics, have evolved from the intersection of classical biology, mathematics, and computer science.
Biochemistry is often an anecdotal science. If you notice a disease or trait of interest, the imperative to understand it may drive the progress of research in that direction. Based on their interest in a particular biochemical process, biochemists have determined the sequence or structure or analyzed the expression characteristics of a single gene product at a time. Often this leads to a detailed understanding of one biochemical pathway or even one protein. How a pathway or protein interacts with other biological components can easily remain a mystery, due to lack of hands to do the work, or even because the need to do a particular experiment isn't communicated to other scientists effectively.
The Internet has changed how scientists share data and made it possible for one central warehouse of information to serve an entire research community. But more importantly, experimental technologies are rapidly advancing to the point at which it's possible to imagine systematically collecting all the data of a particular type in a central "factory" and then distributing it to researchers to be interpreted.
In the 1990s, the biology community embarked on an unprecedented project: sequencing all the DNA in the human genome. Even though a first draft of the human genome sequence has been completed, automated sequencers are still running around the clock, determining the entire sequences of genomes from various life forms that are commonly used for biological research. And we're still fine-tuning the data we've gathered about the human genome over the last 10 years. Immense strings of data, in which the locations of only a relatively few important genes are known, have been and still are being generated. Using image-processing techniques, maps of entire genomes can now be generated much more quickly than they could with chemical mapping techniques, but even with this technology, complete and detailed mapping of the genomic data that is now being produced may take years.
Recently, the techniques of x-ray crystallography have been refined to a degree that allows a complete set of crystallographic reflections for a protein to be obtained in minutes instead of hours or days. Automated analysis software allows structure determination to be completed in days or weeks, rather than in months. It has suddenly become possible to conceive of the same type of high-throughput approach to structure determination that the Human Genome Project takes to sequence determination. While crystallization of proteins is still the limiting step, it's likely that the number of protein structures available for study will increase by an order of magnitude within the next 5 to 10 years.
Parallel computing is a concept that has been around for a long time. Break a problem down into computationally tractable components, and instead of solving them one at a time, employ multiple processors to solve each subproblem simultaneously. The parallel approach is now making its way into experimental molecular biology with technologies such as the DNA microarray. Microarray technology allows researchers to conduct thousands of gene expression experiments simultaneously on a tiny chip. Miniaturized parallel experiments absolutely require computer support for data collection and analysis. They also require the electronic publication of data, because information in large datasets that may be tangential to the purpose of the data collector can be extremely interesting to someone else. Finding information by searching such databases can save scientists literally years of work at the lab bench.
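The computing side of this idea can be sketched in Python with the standard `concurrent.futures` module. The example is ours and makes several assumptions: the task (computing the fraction of G and C bases in each sequence) is chosen only because each sequence can be handled independently, the function names are hypothetical, and a thread pool is used to keep the sketch short. For work that is truly limited by processor time, `ProcessPoolExecutor` from the same module offers the same `map` interface and spreads the work across processor cores.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(seq):
    """Fraction of bases in a DNA sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_content_parallel(sequences, workers=4):
    """Compute GC content for many sequences, one subproblem per sequence.

    The pool hands sequences to the workers and returns the results in
    the same order as the input.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_content, sequences))


print(gc_content_parallel(["GGCC", "ATAT", "ACGT"]))
```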
The output of all these high-throughput experimental efforts can be shared only because of the development of the World Wide Web and the advances in communication and information transfer that the Web has made possible.
The increasing automation of experimental molecular biology and the application of information technology in the biological sciences have led to a fundamental change in the way biological research is done. In addition to anecdotal research (locating and studying in detail a single gene at a time), we are now cataloguing all the data that is available, making complete maps to which we can later return and mark the points of interest. This is happening in the domains of sequence and structure, and has begun to be the approach to other types of data as well. The trend is toward storage of raw biological data of all types in public databases, with open access by the research community. Instead of doing preliminary research in the lab, scientists are going to the databases first to save time and resources.
Up to now you've probably gotten by using word-processing software and other canned programs that run under user-friendly operating systems such as Windows or MacOS. In order to make the most of bioinformatics, you need to learn Unix, the classic operating system of powerful computers known as servers and workstations. Most scientific software is developed on Unix machines, and serious researchers will want access to programs that can be run only under Unix. Unix comes in a number of flavors, the two most popular being BSD and SunOS. Recently, however, a third choice has entered the marketplace: Linux. Linux is an open source Unix operating system. In Chapter 3, Chapter 4, and Chapter 5, we discuss how to set up a workstation for bioinformatics running under Linux. We cover the operating system and how it works: how files are organized, how programs are run, how processes are managed, and most importantly, what to type at the command prompt to get the computer to do what you want.
Setting up your computer with a Linux operating system allows you to take advantage of cutting-edge scientific-research tools developed for Unix systems. As it has grown popular in the mass market, Linux has retained the power of Unix systems for developing, compiling, and running programs, networking, and managing jobs started by multiple users, while also providing the standard trimmings of a desktop PC, including word processors, graphics programs, and even visual programming tools. This book operates on the assumption that you're willing to learn how to work on a Unix system and that you'll be working on a machine that has Linux or another flavor of Unix installed. For many of the specific bioinformatics tools we discuss, Unix is the most practical choice.
On the other hand, Unix isn't necessarily the most practical choice for office productivity in a predominantly Mac or PC environment. The selection of available word processing and desktop publishing software and peripheral devices for Linux is improving as the popularity of the operating system increases. However, it can't (yet) go head-to-head with the consumer operating systems in these areas. Linux is no more difficult to maintain than a normal PC operating system, once you know how, but the skills needed and the problems you'll encounter will be new at first.
As of this writing, my desktop computer has been reliably up and running Linux for nearly five months, with the exception of a few days' time out for a hardware failure. No software crashes, no little bombs or unhappy faces, no missing *.dll files or mysterious error messages. Installation of Linux took about two days and some help from tech support the first time I did it, and about one hour the second time (on a laptop, no less). Realistically, the main problem I have encountered being the only Linux user in a Mac/PC environment is opening email attachments from Mac users. -- CJG
Fortunately, some of the companies selling packaged Linux distributions have substantially automated the installation procedure, and also offer 90 days of phone and web technical support for your installation. Companies such as Red Hat and SuSE and organizations such as Debian provide Linux distributions for PCs, while Yellow Dog (and others) provide Linux distributions for Macintosh computers.
There are a couple of ways to phase Linux in gradually. Of course, if you have more than one computer workstation, you can experiment with converting one of your machines to Linux while leaving your familiar operating system on the rest. The other choice is to do a dual boot installation. In a dual boot installation, you create two sections (called partitions) on your hard drive, and install Linux in one of them, with your old operating system in the other. Then, when you turn on your computer, you have a choice of whether to start up Linux or your other operating system. You can leave all your old files and programs where they are and start with new work in your Linux partition. Newer versions of Linux, such as Yellow Dog Linux for the PowerPC, allow users to emulate a MacOS environment within Linux and access software and files for both platforms simultaneously.
In Chapter 6, we cover information literacy. Only a few years ago, biologists had to know how to do literature searches using printed indexes that led them to references in the appropriate technical journals. Modern biologists search web-based databases for the same information and have access to dozens of other information types as well. Knowing how to navigate these resources is a vital skill for every biologist, computational or not.
We then introduce the basic tools you'll need to locate databases, computer programs, and other resources on the Web, to transfer these resources to your computer, and to make them work once you get them there. In Chapter 7 through Chapter 11 we turn to particular types of scientific questions and the tools you will need to answer them. In some cases, there are computer programs that are becoming the standard for solving a particular type of problem (e.g., BLAST and FASTA for amino acid and nucleic acid sequence alignment). In other areas, where the method for solving a problem is still an open research question, there may be a number of competing tools, or there may be no tool that completely solves the problem.
Handling large volumes of complex data requires a systematic and automated approach. If you're searching a database for matches to one query, a web form will do the trick. But what if you want to search for matches to 10,000 queries, and then sort through the information you get back to find relationships in the results? You certainly don't want to type 10,000 queries into a web form, and you probably don't want your results to come back formatted to look nice on a web page. Shared public web servers are often slow, and using them to process large batches of data is impractical. Chapter 12 contains examples of how to use Perl as a driver to make your favorite program process large volumes of data using your own computer.
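As a rough illustration of the driver idea, the sketch below loops over a set of queries, runs an external program once per query, and collects the output so it can be sorted through afterward. It's written in Python rather than Perl purely for illustration. The names `run_batch` and `length_command` and the toy sequences are hypothetical, and the one-line length counter merely stands in for whatever analysis program you would really run.

```python
import subprocess
import sys


def run_batch(queries, build_command):
    """Run one external command per query and collect its standard output."""
    results = {}
    for name, sequence in queries.items():
        completed = subprocess.run(
            build_command(sequence),  # argument list for the external program
            capture_output=True,      # keep the output instead of printing it
            text=True,                # decode output as text
            check=True,               # raise an error if the program fails
        )
        results[name] = completed.stdout.strip()
    return results


def length_command(sequence):
    """Hypothetical stand-in for a real analysis program: reports sequence length."""
    return [sys.executable, "-c", "import sys; print(len(sys.argv[1]))", sequence]


# Toy queries; a real batch might hold thousands of entries read from a file.
queries = {"seq1": "ACGTACGT", "seq2": "GGCC"}
batch_results = run_batch(queries, length_command)
```

Because the results come back as an ordinary dictionary rather than a formatted web page, the same script can go on to filter, sort, or compare them.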
Anyone who has experience with designing and carrying out an experiment to answer a question has the basic skills needed to program a computer. A laboratory experiment begins with a question, which evolves into a testable hypothesis, that is, a statement that can be tested for truth based on the results of an experiment or experiments. The processes developed to test the hypotheses are analogous to computer programs. The essence of an experiment is: if you take system X, and do something to it, what happens? The experiment that is done must be designed to have results that can be clearly interpreted. Computer programs must also be carefully designed so that the values that are passed from one part of a program to the next can be clearly interpreted. The human programmer must set up unambiguous instructions to the computer and must think through, in advance, what different types of results mean and what the computer should do with them. A large part of practical computer programming is the ability to think critically, to design a process to answer a question, and to understand what is required to answer the question unambiguously.
Even if you have these skills, learning a computer language isn't a trivial undertaking, but it has been made a lot easier in recent years by the development of the Perl language. Perl, referred to by its creator as "the duct tape of the Internet, and of everything else," began its evolution as a scripting language optimized for data processing. It continues to evolve into a full-featured programming language, and it's practical to use Perl to develop prototypes for virtually any kind of computer program. Perl is a very flexible language; you can learn just enough to write a simple script to solve a one-off problem, and after you've done that once or twice, you have a core of knowledge to build on. The key to learning Perl is to use it and to use it right away. Just as no amount of reading the textbook can make you speak Spanish fluently, no amount of reading O'Reilly's Learning Perl is going to be as helpful as getting out there and trying to "speak" it. In Chapter 12, we provide example Perl code for parsing common biological datatypes, driving and processing output from programs written in other languages, and even a couple of Perl implementations that solve common computational biology problems. We hope these examples inspire you to try a little programming of your own.
Chapter 6 also introduces the public databases where biological data is archived to be shared by researchers worldwide.
While you can quickly find a single protein structure file or DNA sequence file by filling in a web form and searching a public database, it's likely that eventually you will want to work with more than one piece of data. You may even be collecting and archiving your own data; you may want to make a new type of data available to a broader research community. To do these things efficiently, you need to store data on your own computer. If you want to process your stored data using a computer program, you need to structure your data. Understanding the difference between structured and unstructured data and designing a data format that suits your data storage and access needs is the key to making your data useful and accessible.
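To make the idea of structured data a little more concrete, here is a minimal sketch that turns a block of sequence text into records with named fields. It assumes the input is in the common FASTA layout (a header line starting with `>` followed by sequence lines), uses Python purely for illustration, and the function name `parse_fasta` and the toy sequences are our own inventions rather than real data.

```python
def parse_fasta(text):
    """Split FASTA-style text into a list of {"id", "sequence"} records."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append({"id": header, "sequence": "".join(chunks)})
            parts = line[1:].split()
            header = parts[0] if parts else ""  # first word is the identifier
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append({"id": header, "sequence": "".join(chunks)})
    return records


# Toy input, invented for this example.
example = """>seq1 toy example
ACGTAC
GTTA
>seq2
GGCC
"""
records = parse_fasta(example)
```

Once the text has been broken into records like these, a program can ask for "the sequence of seq2" directly instead of searching through raw lines.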
There are many ways to organize data. While most biological data is still stored in flat file databases, this type of database becomes inefficient when the quantity of data being stored becomes extremely large. Chapter 13 covers the basic database concepts you need to talk to database experts and to build your own databases. We discuss the differences between flat file and relational databases, introduce the best public-domain tools for managing databases, and show you how to use them to store and access your data.
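The difference between the two approaches can be sketched in a few lines. In the example below, the same three toy records are stored once as tab-delimited text, where every lookup has to scan every line, and once in a relational table that can be queried and indexed. This is only an illustration under our own assumptions: it uses Python's built-in `sqlite3` module as a convenient stand-in for a database system, and the table layout, the function name `flat_lookup`, and the records themselves are hypothetical.

```python
import sqlite3

# Toy records, invented for this example: (identifier, organism, sequence).
rows = [
    ("seq1", "human", "ACGT"),
    ("seq2", "yeast", "GGCC"),
    ("seq3", "yeast", "TTAA"),
]

# Flat file approach: one record per line, fields separated by tabs.
flat_file = "\n".join("\t".join(row) for row in rows)


def flat_lookup(text, organism):
    """Find matching identifiers by reading every line of the flat file."""
    matches = []
    for line in text.splitlines():
        seq_id, seq_organism, _sequence = line.split("\t")
        if seq_organism == organism:
            matches.append(seq_id)
    return matches


# Relational approach: the same records in a table with named columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (id TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
)
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", rows)
# An index lets the database find matches without scanning the whole table.
conn.execute("CREATE INDEX idx_organism ON sequences (organism)")

flat_ids = flat_lookup(flat_file, "yeast")
db_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM sequences WHERE organism = ? ORDER BY id", ("yeast",)
    )
]
conn.close()
```

With three records the two approaches are indistinguishable in practice; the difference only starts to matter as the number of records grows.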
It's hard to make sense of your data, or make a point, without visualization tools. The extraction of cross sections or subsets of complex multivariate data sets is often required to make sense of biological data. Storing your data in structured databases, which are discussed in Chapter 13, creates the infrastructure for analysis of complex data.
Once you've stored data in an accessible, flexible format, the next step is to extract what is important to you and visualize it. Whether you need to make a histogram of your data or display a molecular structure in three dimensions and watch it move in real time, there are visualization tools that can do what you want. Chapter 14 covers data-analysis and data-visualization tools, from generic plotting packages to domain-specific programs for marking up biological sequence alignments, displaying molecular structures, creating phylogenetic trees, and a host of other purposes.
An important component of any kind of computational science is knowing when you need to write a program yourself and when you can use code someone else has written. The efficient programmer is a lazy programmer; she never wastes effort writing a program if someone else has already made a perfectly good program available. If you are looking to do something fairly routine, such as aligning two protein sequences, you can be sure that someone else has already written the program you need and that by searching you can probably even find some source code to look at. Similarly, many mathematical and statistical problems can be solved using standard code that is freely available in code libraries. Perl programmers make code that simplifies standard operations available in modules; there are many freely available modules that manage web-related processes, and there are projects underway to create standard modules for handling biological-sequence data.
There are some questions we can't answer for you, and that's one of them; in fact, it's one of the biggest open research questions in computational biology. What we can and do give you are the tools to find information about such problems and others who are working on them, and even, with the proper inspiration, to develop approaches to answering them yourself. Bioinformatics, like any other science, doesn't always provide quick and easy answers to problems.
The questions that drive (and fund) bioinformatics research are the same questions humans have been working away at in applied biology for the last few hundred years. How can we cure disease? How can we prevent infection? How can we produce enough food to feed all of humanity? Companies in the business of developing drugs, agricultural chemicals, hybrid plants, plastics and other petroleum derivatives, and biological approaches to environmental remediation, among others, are developing bioinformatics divisions and looking to bioinformatics to provide new targets and to help replace scarce natural resources.
The existence of genome projects implies our intention to use the data they generate. The implicit goals of modern molecular biology are, simply stated, to read the entire genomes of living things, to identify every gene, to match each gene with the protein it encodes, and to determine the structure and function of each protein. Detailed knowledge of gene sequence, protein structure and function, and gene expression patterns is expected to give us the ability to understand how life works at the highest possible resolution. Implicit in this is the ability to manipulate living things with precision and accuracy.