Python for Bioinformatics

Book Description

Python for Bioinformatics provides a clear introduction to the Python programming language and guides beginners through the development of simple programming exercises.

Important Notice: The digital edition of this book is missing some of the images or content found in the physical edition.

Table of Contents

  1. Cover
  2. Title
  3. Copyright
  4. Dedication
  5. Preface
  6. Brief Contents
  7. Contents
  8. Chapter 1 Introduction
    1. 1.1 The Purpose of This Book
    2. 1.2 Use of Third-Party Software
    3. 1.3 Required Background of Readers
    4. 1.4 Object-Oriented Programming
    5. 1.5 Presentation Convention
    6. 1.6 Conversion from C/C++ to Python
      1. 1.6.1 Similarities
      2. 1.6.2 Fundamental Python Commands that Differ from C/C++
    7. 1.7 The Environment
    8. 1.8 Biopython
    9. Bibliography
  9. Chapter 2 NumPy and SciPy
    1. 2.1 Introduction to NumPy and SciPy
    2. 2.2 Basic Array Manipulations
    3. 2.3 Basic Math
    4. 2.4 More on Multiplication
    5. 2.5 More Math
      1. 2.5.1 Equals or Copy
      2. 2.5.2 Comparisons
      3. 2.5.3 More on Slicing
      4. 2.5.4 Sorting and Shaping
      5. 2.5.5 Random Numbers
      6. 2.5.6 Statistical Methods
    6. 2.6 Thinking About Problems
    7. 2.7 Array Conversions
    8. 2.8 SciPy
    9. 2.9 Summary
    10. Bibliography
    11. Problems
  10. Chapter 3 Image Manipulation
    1. 3.1 The Image Module
    2. 3.2 Colors and Conversions
    3. 3.3 Digital Image Formats
    4. 3.4 Simple Image Manipulations
    5. 3.5 Conversions to and from Arrays
    6. 3.6 Summary
    7. Bibliography
    8. Problems
  11. Chapter 4 The Akando and Dancer Modules
    1. 4.1 The Akando Module
      1. 4.1.1 Plotting Routines
      2. 4.1.2 Algebraic and Geometric Functions
      3. 4.1.3 Correlation
      4. 4.1.4 Image Conversions
    2. 4.2 The Dancer Module
    3. 4.3 Summary
    4. Problems
  12. Chapter 5 Statistics
    1. 5.1 Simple Statistics
    2. 5.2 Distributions
    3. 5.3 Normalization
    4. 5.4 Multivariate Statistics
    5. 5.5 Probabilities
    6. 5.6 Odds
    7. 5.7 Decisions from Distributions
    8. 5.8 Summary
    9. Problems
  13. Chapter 6 Parsing DNA Data Files
    1. 6.1 FASTA Files
    2. 6.2 Genbank Files
      1. 6.2.1 File Overview
      2. 6.2.2 Parsing the DNA
      3. 6.2.3 Gene and Protein Information
      4. 6.2.4 Gene Locations
      5. 6.2.5 Normal and Complement
      6. 6.2.6 Splices
      7. 6.2.7 Extracting All Gene Locations
      8. 6.2.8 Coding DNA
      9. 6.2.9 Proteins
      10. 6.2.10 Extracting Translations
    3. 6.3 ASN.1 File Format
    4. 6.4 Summary
    5. Bibliography
    6. Problems
  14. Chapter 7 Sequence Alignment
    1. 7.1 Alphabets
    2. 7.2 Matching Sequences
      1. 7.2.1 Perfect Matches
      2. 7.2.2 Insertions and Deletions
      3. 7.2.3 Rearrangements
      4. 7.2.4 Global Versus Local Alignments
      5. 7.2.5 Sequence Length
    3. 7.3 Simple Alignments
      1. 7.3.1 Direct Alignment
      2. 7.3.2 Statistical Alignment
      3. 7.3.3 Brute Force Alignment
    4. 7.4 Summary
    5. Bibliography
    6. Problems
  15. Chapter 8 Dynamic Programming
    1. 8.1 The Problem with the Brute Force Approach
    2. 8.2 The Dynamic Programming Algorithm
      1. 8.2.1 The Scoring Matrix
      2. 8.2.2 The Arrow Matrix
      3. 8.2.3 Extracting the Aligned Sequences
    3. 8.3 Efficient Programming
      1. 8.3.1 Flowing along the Diagonals
      2. 8.3.2 Slicing Matrices
      3. 8.3.3 Extracting Diagonal Element Locations
      4. 8.3.4 Extracting Values from the Substitution Matrix
      5. 8.3.5 Computing the Scoring Matrix Values for a Single Diagonal
      6. 8.3.6 An Efficient Computation of the Scoring Matrix
    4. 8.4 Global Versus Local Alignments
    5. 8.5 Gap Penalties
    6. 8.6 Does Dynamic Programming Find the Best Alignments?
    7. 8.7 Summary
    8. Problems
  16. Chapter 9 Tandem Repeats
    1. 9.1 Tandem Repeats
    2. 9.2 Hauth’s Solution
      1. 9.2.1 Foundation
      2. 9.2.2 Multiple Words
      3. 9.2.3 Tandem Repeats
    3. 9.3 Summary
    4. Bibliography
    5. Problems
  17. Chapter 10 Hidden Markov Models
    1. 10.1 The Emission HMM
    2. 10.2 The Transition HMM
    3. 10.3 The Recurrent HMM
    4. 10.4 Constructing a Transition HMM
    5. 10.5 Considerations
      1. 10.5.1 Assuming Data
      2. 10.5.2 Spurious Strings
      3. 10.5.3 Recurrent Probabilities
    6. 10.6 Summary
    7. Problems
  18. Chapter 11 Genetic Algorithms
    1. 11.1 Simulated Annealing
    2. 11.2 The Genetic Algorithm
      1. 11.2.1 Energy Surfaces
      2. 11.2.2 The Genetic Algorithm Approach
      3. 11.2.3 Checking the Solution
    3. 11.3 Nonnumerical Genetic Algorithms
      1. 11.3.1 Notes on Copying
      2. 11.3.2 Creating Random Arrangements
      3. 11.3.3 The Genetic Algorithm
    4. 11.4 Summary
    5. Problems
  19. Chapter 12 Multiple Sequence Alignment
    1. 12.1 The Greedy Approach
      1. 12.1.1 Sequence Comparison
      2. 12.1.2 Assembly
    2. 12.2 Nongreedy Approach
      1. 12.2.1 Creating Genes
      2. 12.2.2 Steps in the Genetic Algorithm
      3. 12.2.3 The Test Run
      4. 12.2.4 Improvements
    3. 12.3 Summary
    4. Problems
  20. Chapter 13 Gapped Alignments
    1. 13.1 Theory of Gapped Alignments
    2. 13.2 Chopping the Data
    3. 13.3 Pairwise Alignments
    4. 13.4 Building the Assembly
      1. 13.4.1 Creating New Contigs
      2. 13.4.2 Adding to a Contig
      3. 13.4.3 Joining Contigs
      4. 13.4.4 Performing the Assembly
    5. 13.5 Summary
    6. Bibliography
    7. Problems
  21. Chapter 14 Trees
    1. 14.1 Basic Tree Theory
    2. 14.2 Python and Trees
    3. 14.3 An Example Using UPGMA
    4. 14.4 Examples of Trees
      1. 14.4.1 Sorting Trees
      2. 14.4.2 Dictionary Trees
      3. 14.4.3 Percolation Trees
      4. 14.4.4 Suffix Trees
    5. 14.5 Decision Trees and Random Forests
    6. 14.6 Summary
    7. Problems
  22. Chapter 15 Text Mining
    1. 15.1 An Introduction to Text Mining
    2. 15.2 Collecting Bioinformatic Textual Data
    3. 15.3 Creating Dictionaries
    4. 15.4 Methods of Finding Root Words
      1. 15.4.1 Porter Stemming
      2. 15.4.2 Suffix Trees
      3. 15.4.3 Combining Simplified Porter Stemming with Slicing
    5. 15.5 Document Analysis
      1. 15.5.1 Text Mining Ten Documents
      2. 15.5.2 Word Frequency
      3. 15.5.3 Indicative Words
      4. 15.5.4 Document Classification
    6. 15.6 Summary
    7. Bibliography
    8. Problems
  23. Chapter 16 Measuring Complexity
    1. 16.1 Linguistic Complexity
    2. 16.2 Suffix Trees
    3. 16.3 Superstrings
    4. 16.4 Summary
    5. Bibliography
    6. Problems
  24. Chapter 17 Clustering
    1. 17.1 The Purpose of Clustering
    2. 17.2 k-Means Clustering
    3. 17.3 Solving More Difficult Problems
      1. 17.3.1 Preprocessing Data
      2. 17.3.2 Modifications of k-Means
    4. 17.4 Dynamic k-Means
    5. 17.5 Comments on k-Means
    6. 17.6 Summary
    7. Bibliography
    8. Problems
  25. Chapter 18 Self-Organizing Maps
    1. 18.1 SOM Theory
    2. 18.2 An SOM Example
      1. 18.2.1 Reading an Image
      2. 18.2.2 Initializing the SOM
      3. 18.2.3 The Best Matching Unit (BMU)
      4. 18.2.4 Updating the SOM
      5. 18.2.5 SOM Iterations
      6. 18.2.6 Interpreting the SOM
    3. 18.3 Summary
    4. Bibliography
    5. Problems
  26. Chapter 19 Principal Component Analysis
    1. 19.1 The Purpose of PCA
    2. 19.2 Eigenvectors
    3. 19.3 The PCA Process
      1. 19.3.1 Case 1: More Dimensions than Vectors
      2. 19.3.2 Case 2: Linear Combinations in the Data
      3. 19.3.3 Case 3: Imperfect Dimensionality Reductions
      4. 19.3.4 Coordinate Selection
    4. 19.4 Using SVD to Compute PCA
    5. 19.5 Describing Systems with Eigenvectors
    6. 19.6 Eigenimages
    7. 19.7 Summary
    8. Bibliography
    9. Problems
  27. Chapter 20 Species Identification
    1. 20.1 Data Collection
    2. 20.2 The First Clustering
    3. 20.3 Using Principal Component Analysis
    4. 20.4 The Second Clustering
    5. 20.5 Using a Self-Organizing Map
    6. 20.6 Summary
    7. Bibliography
    8. Problems
  28. Chapter 21 Fourier Transforms
    1. 21.1 Fourier Theory
    2. 21.2 Digital Fourier Transform
      1. 21.2.1 DFT Theory
      2. 21.2.2 Example with a Simple Sawtooth Signal
      3. 21.2.3 Features of the DFT
      4. 21.2.4 Power Spectrum
    3. 21.3 Fast Fourier Transform
      1. 21.3.1 Duplicate Computations
      2. 21.3.2 The FFT Method
      3. 21.3.3 FFTs in SciPy
      4. 21.3.4 The Swap Function
    4. 21.4 Frequency Analysis
      1. 21.4.1 Simple Signals
      2. 21.4.2 DNA Coding Regions
    5. 21.5 Summary
    6. Bibliography
    7. Problems
  29. Chapter 22 Correlations
    1. 22.1 Correlation Theory
    2. 22.2 Random Signal Correlation
    3. 22.3 Structured Signal Correlation
    4. 22.4 Correlation of DNA Strings
    5. 22.5 Higher Dimensions
      1. 22.5.1 Two-Dimensional FFTs in SciPy
      2. 22.5.2 Image Frequencies
    6. 22.6 The Onset of Image Processing
    7. 22.7 Two-Dimensional Correlations
    8. 22.8 Summary
    9. Bibliography
    10. Problems
  30. Chapter 23 Numerical Sequence Alignment
    1. 23.1 Alternate Encodings
      1. 23.1.1 Hydrophobicity
      2. 23.1.2 GC Content
      3. 23.1.3 Numerical Methods
    2. 23.2 Numerical Alignments
    3. 23.3 Measuring the Hurst Exponent
    4. 23.4 Chaos Representation
      1. 23.4.1 Representing the Data
      2. 23.4.2 A Simpler Method
      3. 23.4.3 Comparing Chaos Images of Different Species
      4. 23.4.4 Organizing the Data
    5. 23.5 Summary
    6. Bibliography
    7. Problems
  31. Chapter 24 Gene Expression Array Files
    1. 24.1 Raw Data
      1. 24.1.1 Reading Raw Data in Python
      2. 24.1.2 Dealing with 16-Bit Data
    2. 24.2 GEL Files
      1. 24.2.1 TIFF Headers
      2. 24.2.2 The Image File Directory
      3. 24.2.3 Reading the Data
    3. 24.3 Summary
    4. Bibliography
    5. Problems
  32. Chapter 25 Spot Finding and Measurement
    1. 25.1 Spot Finding
      1. 25.1.1 Intensity Variations
      2. 25.1.2 Block Location
      3. 25.1.3 The Coarse Grid
      4. 25.1.4 Fine-Tuning the Spot Locations
    2. 25.2 Spot Measurements
    3. 25.3 Summary
    4. Bibliography
    5. Problems
  33. Chapter 26 Spreadsheet Arrays and Data Displays
    1. 26.1 Reading Spreadsheets
      1. 26.1.1 The Platform File
      2. 26.1.2 The Z-Ratio File
      3. 26.1.3 Reading Two Channel Files
    2. 26.2 Displaying the Data
      1. 26.2.1 The Heat Map
      2. 26.2.2 The R Versus G Graph
      3. 26.2.3 The R/G Versus I Graph
      4. 26.2.4 M Versus A Graph
    3. 26.3 Summary
    4. Bibliography
    5. Problems
  34. Chapter 27 Applications with Expression Arrays
    1. 27.1 LOESS Normalization
    2. 27.2 Expressed Genes
    3. 27.3 Multiple Slides
      1. 27.3.1 Normalization
      2. 27.3.2 Extracting Outliers
    4. 27.4 Summary
    5. Bibliography
    6. Problems
  35. Index