Beginning Perl for Bioinformatics

Book Description

With its highly developed capacity to detect patterns in data, Perl has become one of the most popular languages for biological data analysis. But if you're a biologist with little or no programming experience, starting out in Perl can be a challenge. Many biologists have a difficult time learning how to apply the language to bioinformatics. The most popular Perl programming books are often too theoretical and too focused on computer science for a non-programming biologist who needs to solve very specific problems. Beginning Perl for Bioinformatics is designed to get you quickly over the Perl language barrier by approaching programming as an important new laboratory skill, revealing Perl programs and techniques that are immediately useful in the lab. Each chapter focuses on solving a particular bioinformatics problem or class of problems, starting with the simplest and increasing in complexity as the book progresses. Each chapter includes programming exercises and teaches bioinformatics by showing and modifying programs that deal with various kinds of practical biological problems. By the end of the book you'll have a solid understanding of Perl basics, a collection of programs for such tasks as parsing BLAST and GenBank, and the skills to take on more advanced bioinformatics programming. Some of the later chapters focus in greater detail on specific bioinformatics topics. This book is suitable for use as a classroom textbook, for self-study, and as a reference. The book covers:

  • Programming basics and working with DNA sequences and strings

  • Debugging your code

  • Simulating gene mutations using random number generators

  • Regular expressions and finding motifs in data

  • Arrays, hashes, and relational databases

  • Regular expressions and restriction maps

  • Using Perl to parse PDB records, annotations in GenBank, and BLAST output

Table of Contents

    1. Beginning Perl for Bioinformatics
      1. SPECIAL OFFER: Upgrade this ebook with O’Reilly
      2. A Note Regarding Supplemental Files
      3. Preface
        1. What Is Bioinformatics?
          1. What Bioinformatics Can Do
        2. About This Book
        3. Who This Book Is For
        4. Why Should I Learn to Program?
        5. Structure of This Book
        6. Conventions Used in This Book
        7. Comments and Questions
        8. Acknowledgments
      4. 1. Biology and Computer Science
        1. 1.1. The Organization of DNA
        2. 1.2. The Organization of Proteins
        3. 1.3. In Silico
        4. 1.4. Limits to Computation
      5. 2. Getting Started with Perl
        1. 2.1. A Low and Long Learning Curve
        2. 2.2. Perl's Benefits
          1. 2.2.1. Ease of Programming
          2. 2.2.2. Rapid Prototyping
          3. 2.2.3. Portability, Speed, and Program Maintenance
          4. 2.2.4. Versions of Perl
        3. 2.3. Installing Perl on Your Computer
          1. 2.3.1. Perl May Already Be Installed!
          2. 2.3.2. No Internet Access?
          3. 2.3.3. Downloading
          4. 2.3.4. Binary Versus Source Code
          5. 2.3.5. Installation
            1. 2.3.5.1. Unix and Linux
            2. 2.3.5.2. Macintosh
            3. 2.3.5.3. Windows
        4. 2.4. How to Run Perl Programs
          1. 2.4.1. Unix or Linux
          2. 2.4.2. Macs
          3. 2.4.3. Windows
        5. 2.5. Text Editors
        6. 2.6. Finding Help
      6. 3. The Art of Programming
        1. 3.1. Individual Approaches to Programming
        2. 3.2. Edit—Run—Revise (and Save)
          1. 3.2.1. Saves and Backups
          2. 3.2.2. Error Messages
          3. 3.2.3. Debugging
        3. 3.3. An Environment of Programs
          1. 3.3.1. Open Source Programs
        4. 3.4. Programming Strategies
        5. 3.5. The Programming Process
          1. 3.5.1. The Design Phase
          2. 3.5.2. Algorithms
          3. 3.5.3. Pseudocode and Code
          4. 3.5.4. Comments
      7. 4. Sequences and Strings
        1. 4.1. Representing Sequence Data
        2. 4.2. A Program to Store a DNA Sequence
          1. 4.2.1. Control Flow
          2. 4.2.2. Comments Revisited
          3. 4.2.3. Command Interpretation
          4. 4.2.4. Statements
            1. 4.2.4.1. Variables
            2. 4.2.4.2. Strings
            3. 4.2.4.3. Assignment
            4. 4.2.4.4. Print
            5. 4.2.4.5. Exit
        3. 4.3. Concatenating DNA Fragments
        4. 4.4. Transcription: DNA to RNA
        5. 4.5. Using the Perl Documentation
        6. 4.6. Calculating the Reverse Complement in Perl
        7. 4.7. Proteins, Files, and Arrays
        8. 4.8. Reading Proteins in Files
        9. 4.9. Arrays
        10. 4.10. Scalar and List Context
        11. 4.11. Exercises
      8. 5. Motifs and Loops
        1. 5.1. Flow Control
          1. 5.1.1. Conditional Statements
            1. 5.1.1.1. Conditional tests and matching braces
          2. 5.1.2. Loops
            1. 5.1.2.1. open and unless
        2. 5.2. Code Layout
        3. 5.3. Finding Motifs
          1. 5.3.1. Getting User Input from the Keyboard
          2. 5.3.2. Turning Arrays into Scalars with join
          3. 5.3.3. do-until Loops
          4. 5.3.4. Regular Expressions
            1. 5.3.4.1. Regular expressions and character classes
            2. 5.3.4.2. Pattern matching with =~ and regular expressions
        4. 5.4. Counting Nucleotides
        5. 5.5. Exploding Strings into Arrays
        6. 5.6. Operating on Strings
        7. 5.7. Writing to Files
        8. 5.8. Exercises
      9. 6. Subroutines and Bugs
        1. 6.1. Subroutines
          1. 6.1.1. Advantages of Subroutines
          2. 6.1.2. Writing Subroutines
        2. 6.2. Scoping and Subroutines
          1. 6.2.1. Arguments
          2. 6.2.2. Scoping
        3. 6.3. Command-Line Arguments and Arrays
        4. 6.4. Passing Data to Subroutines
          1. 6.4.1. Subroutines: Pass by Value
          2. 6.4.2. Subroutines: Pass by Reference
        5. 6.5. Modules and Libraries of Subroutines
        6. 6.6. Fixing Bugs in Your Code
          1. 6.6.1. use warnings; and use strict;
          2. 6.6.2. Fixing Bugs with Comments and Print Statements
          3. 6.6.3. The Perl Debugger
            1. 6.6.3.1. A program with bugs
            2. 6.6.3.2. How to start and stop the debugger
            3. 6.6.3.3. Debugger command summary
            4. 6.6.3.4. Stepping through statements with the debugger
            5. 6.6.3.5. Setting breakpoints
            6. 6.6.3.6. Fixing another bug
            7. 6.6.3.7. use warnings; and use strict; redux
        7. 6.7. Exercises
      10. 7. Mutations and Randomization
        1. 7.1. Random Number Generators
        2. 7.2. A Program Using Randomization
          1. 7.2.1. Seeding the Random Number Generator
          2. 7.2.2. Control Flow
          3. 7.2.3. Making a Sentence
          4. 7.2.4. Randomly Selecting an Element of an Array
          5. 7.2.5. Formatting
          6. 7.2.6. Another Way to Calculate the Random Position
        3. 7.3. A Program to Simulate DNA Mutation
          1. 7.3.1. Pseudocode Design
            1. 7.3.1.1. Select a random position in a string
            2. 7.3.1.2. Choose a random nucleotide
            3. 7.3.1.3. Place a random nucleotide into a random position
          2. 7.3.2. Improving the Design
          3. 7.3.3. Combining the Subroutines to Simulate Mutation
          4. 7.3.4. A Bug in Your Program?
        4. 7.4. Generating Random DNA
          1. 7.4.1. Bottom-up Versus Top-down
          2. 7.4.2. Subroutines for Generating a Set of Random DNA
          3. 7.4.3. Turning the Design into Code
        5. 7.5. Analyzing DNA
          1. 7.5.1. Some Notes About the Code
        6. 7.6. Exercises
      11. 8. The Genetic Code
        1. 8.1. Hashes
        2. 8.2. Data Structures and Algorithms for Biology
          1. 8.2.1. A Gene Expression Database
          2. 8.2.2. Gene Expression Data Using Unsorted Arrays
          3. 8.2.3. Gene Expression Data Using Sorted Arrays and Binary Search
          4. 8.2.4. Gene Expression Data Using Hashes
          5. 8.2.5. Relational Databases
          6. 8.2.6. DBM
        3. 8.3. The Genetic Code
          1. 8.3.1. Background
          2. 8.3.2. Translating Codons to Amino Acids
          3. 8.3.3. The Redundancy of the Genetic Code
          4. 8.3.4. Using Hashes for the Genetic Code
        4. 8.4. Translating DNA into Proteins
        5. 8.5. Reading DNA from Files in FASTA Format
          1. 8.5.1. FASTA Format
          2. 8.5.2. A Design to Read FASTA Files
          3. 8.5.3. A Subroutine to Read FASTA Files
          4. 8.5.4. Writing Formatted Sequence Data
          5. 8.5.5. A Main Program for Reading DNA and Writing Protein
        6. 8.6. Reading Frames
          1. 8.6.1. What Are Reading Frames?
          2. 8.6.2. Translating Reading Frames
        7. 8.7. Exercises
      12. 9. Restriction Maps and Regular Expressions
        1. 9.1. Regular Expressions
        2. 9.2. Restriction Maps and Restriction Enzymes
          1. 9.2.1. Background
          2. 9.2.2. Planning the Program
          3. 9.2.3. Restriction Enzyme Data
          4. 9.2.4. Logical Operators and the Range Operator
          5. 9.2.5. Finding the Restriction Sites
        3. 9.3. Perl Operations
          1. 9.3.1. Precedence of Operations and Parentheses
        4. 9.4. Exercises
      13. 10. GenBank
        1. 10.1. GenBank Files
        2. 10.2. GenBank Libraries
        3. 10.3. Separating Sequence and Annotation
          1. 10.3.1. Using Arrays
          2. 10.3.2. Using Scalars
            1. 10.3.2.1. Pattern modifiers
            2. 10.3.2.2. Examples of pattern modifiers
            3. 10.3.2.3. Separating annotations from sequence
        4. 10.4. Parsing Annotations
          1. 10.4.1. Using Arrays
          2. 10.4.2. When to Use Regular Expressions
          3. 10.4.3. Main Program
          4. 10.4.4. Parsing Annotations at the Top Level
          5. 10.4.5. Parsing the FEATURES Table
            1. 10.4.5.1. Features
            2. 10.4.5.2. Parsing
        5. 10.5. Indexing GenBank with DBM
          1. 10.5.1. DBM Essentials
          2. 10.5.2. A DBM Database for GenBank
        6. 10.6. Exercises
      14. 11. Protein Data Bank
        1. 11.1. Overview of PDB
        2. 11.2. Files and Folders
          1. 11.2.1. Opening Directories
          2. 11.2.2. Recursion
          3. 11.2.3. Processing Many Files
        3. 11.3. PDB Files
          1. 11.3.1. PDB File Format
          2. 11.3.2. SEQRES
        4. 11.4. Parsing PDB Files
          1. 11.4.1. Extracting Primary Sequence
          2. 11.4.2. Finding Atomic Coordinates
        5. 11.5. Controlling Other Programs
          1. 11.5.1. The Stride Secondary Structure Predictor
          2. 11.5.2. Parsing Stride Output
        6. 11.6. Exercises
      15. 12. BLAST
        1. 12.1. Obtaining BLAST
        2. 12.2. String Matching and Homology
        3. 12.3. BLAST Output Files
        4. 12.4. Parsing BLAST Output
          1. 12.4.1. Extracting Annotation and Alignments
          2. 12.4.2. Parsing BLAST Alignments
        5. 12.5. Presenting Data
          1. 12.5.1. The printf Function
          2. 12.5.2. here Documents
          3. 12.5.3. format and write
        6. 12.6. Bioperl
          1. 12.6.1. Sample Modules
          2. 12.6.2. Bioperl Tutorial Script
        7. 12.7. Exercises
      16. 13. Further Topics
        1. 13.1. The Art of Program Design
        2. 13.2. Web Programming
        3. 13.3. Algorithms and Sequence Alignment
        4. 13.4. Object-Oriented Programming
        5. 13.5. Perl Modules
          1. 13.5.1. Bioperl
        6. 13.6. Complex Data Structures
        7. 13.7. Relational Databases
        8. 13.8. Microarrays and XML
        9. 13.9. Graphics Programming
        10. 13.10. Modeling Networks
        11. 13.11. DNA Computers
      17. A. Resources
        1. A.1. Perl
          1. A.1.1. Web Site
          2. A.1.2. CPAN: Comprehensive Perl Archive Network
          3. A.1.3. FAQs: Frequently Asked Questions
            1. A.1.3.1. Beginners
          4. A.1.4. Online Manuals
          5. A.1.5. Books
          6. A.1.6. Conference
          7. A.1.7. Newsgroups
        2. A.2. Computer Science
          1. A.2.1. Algorithms
          2. A.2.2. Software Engineering
          3. A.2.3. Theory of Computer Science
          4. A.2.4. General Programming
        3. A.3. Linux
        4. A.4. Bioinformatics
          1. A.4.1. Books
          2. A.4.2. Governmental Organizations
          3. A.4.3. Conferences
        5. A.5. Molecular Biology
      18. B. Perl Summary
        1. B.1. Command Interpretation
        2. B.2. Comments
        3. B.3. Scalar Values and Scalar Variables
          1. B.3.1. Strings
          2. B.3.2. Numbers
          3. B.3.3. Scalar Variables
        4. B.4. Assignment
        5. B.5. Statements and Blocks
        6. B.6. Arrays
        7. B.7. Hashes
        8. B.8. Operators
        9. B.9. Operator Precedence
        10. B.10. Basic Operators
          1. B.10.1. Arithmetic Operators
          2. B.10.2. Bitwise Operators
          3. B.10.3. String Operators
          4. B.10.4. File Test Operators
        11. B.11. Conditionals and Logical Operators
          1. B.11.1. true and false
          2. B.11.2. Logical Operators
          3. B.11.3. Using Logical Operators for Control Flow
          4. B.11.4. The if Statement
        12. B.12. Binding Operators
        13. B.13. Loops
        14. B.14. Input/Output
          1. B.14.1. Input from Files
          2. B.14.2. Input from STDIN
          3. B.14.3. Input from Files Named on the Command Line
          4. B.14.4. Output Commands
            1. B.14.4.1. Output to STDOUT, STDERR, and Files
        15. B.15. Regular Expressions
          1. B.15.1. Overview
          2. B.15.2. Metacharacters
            1. B.15.2.1. Escaping with \
            2. B.15.2.2. Alternation with |
            3. B.15.2.3. Grouping with ( )
            4. B.15.2.4. Character classes
            5. B.15.2.5. Matching any character with .
            6. B.15.2.6. Beginning and end of strings with ^ and $
            7. B.15.2.7. Quantifiers: * + {MIN,} {MIN,MAX} ?
            8. B.15.2.8. Making quantifiers match minimally with ?
          3. B.15.3. Capturing Matched Patterns
          4. B.15.4. Metasymbols
          5. B.15.5. Extending Regular-Expression Sequences
          6. B.15.6. Pattern Modifiers
        16. B.16. Scalar and List Context
        17. B.17. Subroutines and Modules
        18. B.18. Built-in Functions
      19. Index
      20. About the Author
      21. Colophon