You are previewing Algorithms on Strings, Trees and Sequences.
O'Reilly logo
Algorithms on Strings, Trees and Sequences

Book Description

String algorithms are a traditional area of study in computer science. In recent years their importance has grown dramatically with the huge increase of electronically stored text and of molecular sequence data (DNA or protein sequences) produced by various genome projects. This 1997 book is a general text on computer algorithms for string processing. In addition to pure computer science, the book contains extensive discussions on biological problems that are cast as string problems, and on methods developed to solve them. It emphasises the fundamental ideas and techniques central to today's applications. New approaches to this complex material simplify methods that up to now have been for the specialist alone. With over 400 exercises to reinforce the material and develop additional topics, the book is suitable as a text for graduate or advanced undergraduate students in computer science, computational biology, or bio-informatics. Its discussion of current algorithms and techniques also makes it a reference for professionals.

Table of Contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright
  5. Contents
  6. Preface
  7. I: Exact String Matching: The Fundamental String Problem
    1. 1. Exact Matching: Fundamental Preprocessing and First Algorithms
      1. 1.1 The naive method
      2. 1.2 The preprocessing approach
      3. 1.3 Fundamental preprocessing of the pattern
      4. 1.4 Fundamental preprocessing in linear time
      5. 1.5 The simplest linear-time exact matching algorithm
      6. 1.6 Exercises
    2. 2. Exact Matching: Classical Comparison-Based Methods
      1. 2.1 Introduction
      2. 2.2 The Boyer–Moore Algorithm
      3. 2.3 The Knuth-Morris-Pratt algorithm
      4. 2.4 Real-time string matching
      5. 2.5 Exercises
    3. 3. Exact Matching: A Deeper Look at Classical Methods
      1. 3.1 A Boyer–Moore variant with a “simple” linear time bound
      2. 3.2 Cole’s linear worst-case bound for Boyer–Moore
      3. 3.3 The original preprocessing for Knuth-Morris-Pratt
      4. 3.4 Exact matching with a set of patterns
      5. 3.5 Three applications of exact set matching
      6. 3.6 Regular expression pattern matching
      7. 3.7 Exercises
    4. 4. Seminumerical String Matching
      1. 4.1 Arithmetic versus comparison-based methods
      2. 4.2 The Shift-And method
      3. 4.3 The match-count problem and Fast Fourier Transform
      4. 4.4 Karp–Rabin fingerprint methods for exact match
      5. 4.5 Exercises
  8. II: Suffix Trees and Their Uses
    1. 5. Introduction to Suffix Trees
      1. 5.1 A short history
      2. 5.2 Basic definitions
      3. 5.3 A motivating example
      4. 5.4 A naive algorithm to build a suffix tree
    2. 6. Linear-Time Construction of Suffix Trees
      1. 6.1 Ukkonen’s linear-time suffix tree algorithm
      2. 6.2 Weiner’s linear-time suffix tree algorithm
      3. 6.3 McCreight’s suffix tree algorithm
      4. 6.4 Generalized suffix tree for a set of strings
      5. 6.5 Practical implementation issues
      6. 6.6 Exercises
    3. 7. First Applications of Suffix Trees
      1. 7.1 APL1: Exact string matching
      2. 7.2 APL2: Suffix trees and the exact set matching problem
      3. 7.3 APL3: The substring problem for a database of patterns
      4. 7.4 APL4: Longest common substring of two strings
      5. 7.5 APL5: Recognizing DNA contamination
      6. 7.6 APL6: Common substrings of more than two strings
      7. 7.7 APL7: Building a smaller directed graph for exact matching
      8. 7.8 APL8: A reverse role for suffix trees, and major space reduction
      9. 7.9 APL9: Space-efficient longest common substring algorithm
      10. 7.10 APL10: All-pairs suffix-prefix matching
      11. 7.11 Introduction to repetitive structures in molecular strings
      12. 7.12 APL11: Finding all maximal repetitive structures in linear time
      13. 7.13 APL12: Circular string linearization
      14. 7.14 APL13: Suffix arrays – more space reduction
      15. 7.15 APL14: Suffix trees in genome-scale projects
      16. 7.16 APL15: A Boyer–Moore approach to exact set matching
      17. 7.17 APL16: Ziv–Lempel data compression
      18. 7.18 APL17: Minimum length encoding of DNA
      19. 7.19 Additional applications
      20. 7.20 Exercises
    4. 8. Constant-Time Lowest Common Ancestor Retrieval
      1. 8.1 Introduction
      2. 8.2 The assumed machine model
      3. 8.3 Complete binary trees: a very simple case
      4. 8.4 How to solve lca queries in B
      5. 8.5 First steps in mapping T to B
      6. 8.6 The mapping of T to B
      7. 8.7 The linear-time preprocessing of T
      8. 8.8 Answering an lca query in constant time
      9. 8.9 The binary tree is only conceptual
      10. 8.10 For the purists: how to avoid bit-level operations
      11. 8.11 Exercises
    5. 9. More Applications of Suffix Trees
      1. 9.1 Longest common extension: a bridge to inexact matching
      2. 9.2 Finding all maximal palindromes in linear time
      3. 9.3 Exact matching with wild cards
      4. 9.4 The k-mismatch problem
      5. 9.5 Approximate palindromes and repeats
      6. 9.6 Faster methods for tandem repeats
      7. 9.7 A linear-time solution to the multiple common substring problem
      8. 9.8 Exercises
  9. III: Inexact Matching, Sequence Alignment, Dynamic Programming
    1. 10. The Importance of (Sub)sequence Comparison in Molecular Biology
    2. 11. Core String Edits, Alignments, and Dynamic Programming
      1. 11.1 Introduction
      2. 11.2 The edit distance between two strings
      3. 11.3 Dynamic programming calculation of edit distance
      4. 11.4 Edit graphs
      5. 11.5 Weighted edit distance
      6. 11.6 String similarity
      7. 11.7 Local alignment: finding substrings of high similarity
      8. 11.8 Gaps
      9. 11.9 Exercises
    3. 12. Refining Core String Edits and Alignments
      1. 12.1 Computing alignments in only linear space
      2. 12.2 Faster algorithms when the number of differences are bounded
      3. 12.3 Exclusion methods: fast expected running time
      4. 12.4 Yet more suffix trees and more hybrid dynamic programming
      5. 12.5 A faster (combinatorial) algorithm for longest common subsequence
      6. 12.6 Convex gap weights
      7. 12.7 The Four-Russians speedup
      8. 12.8 Exercises
    4. 13. Extending the Core Problems
      1. 13.1 Parametric sequence alignment
      2. 13.2 Computing suboptimal alignments
      3. 13.3 Chaining diverse local alignments
      4. 13.4 Exercises
    5. 14. Multiple String Comparison – The Holy Grail
      1. 14.1 Why multiple string comparison?
      2. 14.2 Three “big-picture” biological uses for multiple string comparison
      3. 14.3 Family and superfamily representation
      4. 14.4 Multiple sequence comparison for structural inference
      5. 14.5 Introduction to computing multiple string alignments
      6. 14.6 Multiple alignment with the sum-of-pairs (SP) objective function
      7. 14.7 Multiple alignment with consensus objective functions
      8. 14.8 Multiple alignment to a (phylogenetic) tree
      9. 14.9 Comments on bounded-error approximations
      10. 14.10 Common multiple alignment methods
      11. 14.11 Exercises
    6. 15. Sequence Databases and Their Uses – The Mother Lode
      1. 15.1 Success stories of database search
      2. 15.2 The database industry
      3. 15.3 Algorithmic issues in database search
      4. 15.4 Real sequence database search
      5. 15.5 FASTA
      6. 15.6 BLAST
      7. 15.7 PAM: the first major amino acid substitution matrices
      8. 15.8 PROSITE
      9. 15.9 BLOCKS and BLOSUM
      10. 15.10 The BLOSUM substitution matrices
      11. 15.11 Additional considerations for database searching
      12. 15.12 Exercises
  10. IV: Currents, Cousins, and Cameos
    1. 16. Maps, Mapping, Sequencing, and Superstrings
      1. 16.1 A look at some DNA mapping and sequencing problems
      2. 16.2 Mapping and the genome project
      3. 16.3 Physical versus genetic maps
      4. 16.4 Physical mapping
      5. 16.5 Physical mapping: STS-content mapping and ordered clone libraries
      6. 16.6 Physical mapping: radiation-hybrid mapping
      7. 16.7 Physical mapping: fingerprinting for general map construction
      8. 16.8 Computing the tightest layout
      9. 16.9 Physical mapping: last comments
      10. 16.10 An introduction to map alignment
      11. 16.11 Large-scale sequencing and sequence assembly
      12. 16.12 Directed sequencing
      13. 16.13 Top-down, bottom-up sequencing: the picture using YACs
      14. 16.14 Shotgun DNA sequencing
      15. 16.15 Sequence assembly
      16. 16.16 Final comments on top-down, bottom-up sequencing
      17. 16.17 The shortest superstring problem
      18. 16.18 Sequencing by hybridization
      19. 16.19 Exercises
    2. 17. Strings and Evolutionary Trees
      1. 17.1 Ultrametric trees and ultrametric distances
      2. 17.2 Additive-distance trees
      3. 17.3 Parsimony: character-based evolutionary reconstruction
      4. 17.4 The centrality of the ultrametric problem
      5. 17.5 Maximum parsimony, Steiner trees, and perfect phylogeny
      6. 17.6 Phylogenetic alignment, again
      7. 17.7 Connections between multiple alignment and tree construction
      8. 17.8 Exercises
    3. 18. Three Short Topics
      1. 18.1 Matching DNA to protein with frameshift errors
      2. 18.2 Gene prediction
      3. 18.3 Molecular computation: computing with (not about) DNA strings
      4. 18.4 Exercises
    4. 19. Models of Genome-Level Mutations
      1. 19.1 Introduction
      2. 19.2 Genome rearrangements with inversions
      3. 19.3 Signed inversions
      4. 19.4 Exercises
  11. Epilogue – where next?
  12. Bibliography
  13. Glossary
  14. Index