BLAST

Book description

Sequence similarity is a powerful tool for discovering biological function. Just as the ancient Greeks used comparative anatomy to understand the human body and linguists used the Rosetta stone to decipher Egyptian hieroglyphs, today we can use comparative sequence analysis to understand genomes. BLAST (Basic Local Alignment Search Tool) is a sophisticated software package for rapid searching of nucleotide and protein databases. It is one of the most important software packages used in sequence analysis and bioinformatics. Most users of BLAST, however, seldom move beyond the program's default parameters and never take advantage of its full power.

BLAST is the only book completely devoted to this popular suite of tools. It offers biologists, computational biology students, and bioinformatics professionals a clear understanding of BLAST as well as the science it supports. This book shows you how to move beyond the default parameters, how to get specific answers using BLAST, and how to interpret your results.

The book also contains tutorial and reference sections covering NCBI-BLAST and WU-BLAST, background material to help you understand the statistics behind BLAST, Perl scripts to help you prepare your data and analyze your results, and a wealth of tips and tricks for configuring BLAST to meet your own research needs. Some of the topics covered include:

  • BLAST basics and the NCBI web interface

  • How to select appropriate search parameters

  • BLAST programs: BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX, PHI-BLAST, and PSI-BLAST

  • Detailed BLAST references, including NCBI-BLAST and WU-BLAST

  • Understanding biological sequences

  • Sequence similarity, homology, scoring matrices, scores, and evolution

  • Sequence Alignment

  • Calculating BLAST statistics

  • Industrial-strength BLAST, including developing applications with Perl and BLAST

BLAST is the only comprehensive reference with detailed, accurate information on optimizing BLAST searches for high-throughput sequence analysis. This is a book that any biologist should own.

Table of contents

  1. Table of Contents (1/2)
  2. Table of Contents (2/2)
  3. Foreword
  4. Preface
    1. Audience for This Book
    2. Structure of This Book
    3. A Little Math, a Little Perl
    4. Conventions Used in This Book
    5. URLs Referenced in This Book
    6. Comments and Questions
    7. Acknowledgments
      1. Ian
      2. Mark
      3. Joey
  5. Part I
    1. Hello BLAST
      1. What Is BLAST?
      2. Using NCBI-BLAST
        1. Choosing the BLAST Program
        2. Entering the Query Sequence
        3. Choosing the Database to Search
        4. Choosing the Parameters of the Search
        5. Choosing the Format
        6. Submitting the Search
        7. Viewing the Results
      3. Alternate Output Formats
      4. Alternate Alignment Views
      5. The Next Step
      6. Further Reading
  6. Part II
    1. Biological Sequences
      1. The Central Dogma of Molecular Biology
        1. DNA
        2. RNA
        3. Protein
        4. The Genetic Code
      2. Evolution
        1. Mutation
        2. Natural Selection
        3. Genetic Drift
        4. The Neutral Theory of Evolution
        5. Molecular Clocks
        6. Homology, Phylogeny, and Trees
        7. The Tree of Life
      3. Genomes and Genes
        1. Prokaryotic Genes
        2. Eukaryotic Genes
        3. Transcripts
        4. Repeats
        5. Pseudogenes
      4. Biological Sequences and Similarity
      5. Further Reading
    2. Sequence Alignment
      1. Global Alignment: Needleman-Wunsch
        1. Initialization
        2. Fill
        3. Trace-Back
      2. Local Alignment: Smith-Waterman
      3. Dynamic Programming
      4. Algorithmic Complexity
      5. Global Versus Local
      6. Variations
        1. Gap Modifications
        2. Reduced Memory
        3. Aligning Transcripts to Genomic Sequence
      7. Final Thoughts
      8. Further Reading
    3. Sequence Similarity
      1. Introduction to Information Theory
      2. Amino Acid Similarity
      3. Scoring Matrices
        1. PAM and BLOSUM Matrices
      4. Target Frequencies, lambda, and H
        1. Lambda
        2. Relative Entropy
        3. Match-Mismatch Scoring
      5. Sequence Similarity
      6. Karlin-Altschul Statistics
        1. Gapped Alignments
        2. Length Correction
      7. Sum Statistics and Sum Scores
        1. Converting a Sum Score to a Sum Probability
        2. Probability Versus Expectation
      8. Further Reading
  7. Part III
    1. BLAST
      1. The Five BLAST Programs
      2. The BLAST Algorithm
        1. Seeding
          1. Implementation details
        2. Extension
          1. Implementation details
        3. Evaluation
          1. Implementation details
      3. Further Reading
    2. Anatomy of a BLAST Report
      1. Basic Structure
      2. Alignments
        1. BLASTP
        2. BLASTN
        3. BLASTX
        4. TBLASTN
        5. TBLASTX
        6. Alignment Groups
    3. A BLAST Statistics Tutorial
      1. Basic BLAST Statistics
        1. Actual Versus Effective Lengths
        2. The Raw Score and Bit Score
        3. The Expect of an HSP
        4. The WU-BLAST P-Value
        5. Sum Statistics
        6. An Expect(n) Means That Sum Statistics Were Applied
        7. Sum Statistics Are Pair-Wise in Their Focus
        8. The Sum Score
        9. Effective Length of a BLASTX Query
        10. Calculating a Sum Score
        11. Calculating the Pair-Wise Sum P-Value
        12. Correcting for Multiple Tests
        13. Correcting for Database Size
        14. Frame- and Size-Corrected Expects
      2. Using Statistics to Understand BLAST Results
      3. Where Did My Oligo Go?
        1. Karlin-Altschul Statistics as a Tool for Further Investigation
        2. What It All Means
    4. 20 Tips to Improve Your BLAST Searches
      1. 8.1 Don’t Use the Default Parameters
      2. 8.2 Treat BLAST Searches as Scientific Experiments
      3. 8.3 Perform Controls, Especially in the Twilight Zone
      4. 8.4 View BLAST Reports Graphically
      5. 8.5 Use the Karlin-Altschul Equation to Design Experiments
      6. 8.6 When Troubleshooting, Read the Footer First
      7. 8.7 Know When to Use Complexity Filters
      8. 8.8 Mask Repeats in Genomic DNA
      9. 8.9 Segment Large Genomic Sequences
      10. 8.10 Be Skeptical of Hypothetical Proteins
      11. 8.11 Expect Contaminants in EST Databases
      12. 8.12 Use Caution When Searching Raw Sequencing Reads
      13. 8.13 Look for Stop Codons and Frame-Shifts to Find Pseudo-Genes
      14. 8.14 Consider Using Ungapped Alignment for BLASTX, TBLASTN, and TBLASTX
      15. 8.15 Look for Gaps in Coverage as a Sign of Missed Exons
      16. 8.16 Parse BLAST Reports with Bioperl
      17. 8.17 Perform Pilot Experiments
      18. 8.18 Examine Statistical Outliers
      19. 8.19 Use links and topcomboN to Make Sense of Alignment Groups
      20. 8.20 How to Lie with BLAST Statistics
    5. BLAST Protocols
      1. BLASTN Protocols (1/3)
      2. BLASTN Protocols (2/3)
      3. BLASTN Protocols (3/3)
        1. Mapping Oligos to a Genome
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        2. Mapping Nonspliced DNA to a Genome
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        3. Mapping a cDNA/EST to a Genome
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        4. Cross-Species Sequence Exploration
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        5. Annotating Genomic DNA with ESTs
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        6. Transcript Clustering and Extension
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
        7. Clustering with blastclust
          1. Approach
          2. EST clustering
          3. Shotgun sequences
          4. Expected results
        8. Vector Clipping
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        9. Repeat Masking
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        10. Contaminant Detection
      4. BLASTP Protocols
        1. The Standard BLASTP Search
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        2. Fast, Insensitive Search
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        3. Slow, Sensitive Search
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
      5. BLASTX Protocols
        1. Gene Finding in Genomic DNA
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        2. Annotating ESTs (and Shotgun Sequence)
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        3. Super-Fast BLASTX
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. WU-BLAST 1.4 parameters
          5. Expected results
          6. Optimizations and variations
      6. TBLASTN Protocols
        1. Mapping a Protein to a Genome
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
        2. Mining ESTs (and Shotgun DNA) for Protein Similarities
          1. Approach
          2. NCBI-BLAST parameters
          3. WU-BLAST parameters
          4. Expected results
          5. Optimizations and variations
      7. TBLASTX Protocols
        1. Preventing Stop Codons
        2. Finding Undocumented Genes in Genomic DNA
          1. Approach
          2. NCBI-BLAST
          3. WU-BLAST
          4. Expected results
          5. Optimizations and variations
        3. Transcript-Transcript TBLASTX
          1. Approach
          2. NCBI-BLAST
          3. WU-BLAST
          4. Expected results
          5. Optimizations and variations
  8. Part IV
    1. Installation and Command-Line Tutorial
      1. NCBI-BLAST Installation
        1. Unix Installation
          1. Files and directories
          2. The .ncbirc file
          3. Setting the PATH and BLASTDB environment variables
        2. Windows Installation
          1. The ncbi.ini file
          2. Setting the PATH environment variable
        3. Macintosh OS X Installation
        4. Macintosh OS 9 Installation
      2. WU-BLAST Installation
        1. Expanding the tarball
        2. Files and Directories
        3. Executables
        4. Environment Variables
        5. Setting Resource Limits with /etc/sysblast
      3. Command-Line Tutorial (1/4)
      4. Command-Line Tutorial (2/4)
      5. Command-Line Tutorial (3/4)
      6. Command-Line Tutorial (4/4)
        1. NCBI-BLAST
          1. formatdb
          2. blastn
          3. megablast
          4. blastp
          5. blastx
          6. tblastn
          7. tblastx
          8. bl2seq
          9. fastacmd
          10. PSI-BLAST
          11. PHI-BLAST
          12. Environment variables and .ncbirc
        2. WU-BLAST
          1. xdformat
          2. blastn
          3. blastp
          4. blastx
          5. tblastn
          6. tblastx
          7. xdget
          8. nrdb and patdb
          9. Environment variables
      7. Editing Scoring Matrices
    2. BLAST Databases
      1. FASTA Files
        1. NCBI Identifier Format
          1. Compound identifiers
          2. Concatenated definition lines
        2. Descriptions
      2. BLAST Databases
        1. Large Databases
          1. Large NCBI databases
          2. Large WU-BLAST databases
        2. Virtual Databases
        3. Alias Databases
        4. Removing Redundancy
        5. Standard BLAST Databases
        6. Custom BLAST Databases
      3. Sequence Databases
        1. International Nucleotide Sequence Database
        2. Database Growth
        3. Flat Files
          1. ACCESSION, LOCUS, VERSION, and GI
          2. DEFINITION, KEYWORDS, and SOURCE
          3. FEATURES
        4. Other Common Databases
      4. Sequence Database Management Strategies
        1. Queries, Indexes, and Reports
        2. Local Database Considerations
        3. Retrieving FASTA Files by Accession
        4. Flat File Indexing
        5. Commercial Sequence Management Software
        6. Tools on the Internet
    3. Hardware and Software Optimizations
      1. The Persistence of Memory
        1. BLAST Pipelines and Caching
      2. CPUs and Computer Architecture
        1. Multiprocessor Computers
        2. Operating Systems and Compilers
      3. Compute Clusters
        1. Remote Versus Local Databases
          1. Remote databases
          2. Local databases
      4. Distributed Resource Management
      5. Software Tricks
        1. Multiplexing/Query Packing
        2. Query Chopping
        3. Database Splitting
        4. Serial BLAST Searching
      6. Optimized NCBI-BLAST
        1. Apple/Genentech BLAST
        2. Paracel-BLAST and BlastMachine
        3. TimeLogic Tera-BLAST
  9. Part V
    1. NCBI-BLAST Reference
      1. Usage Statements
      2. Command-Line Syntax
      3. blastall Parameters (1/2)
      4. blastall Parameters (2/2)
        1. -a [integer]
        2. -A [integer]
        3. -b [integer]
        4. -B [integer]
        5. -d [database]
        6. -D [1..23]
        7. -e [real number]
        8. -E [integer]
        9. -f [integer]
        10. -F [T/F], -F [string]
        11. -g [T/F]
        12. -G [integer]
        13. -i [input file]
        14. -I [T/F]
        15. -J [T/F]
        16. -K [integer]
        17. -l [file]
        18. -L [string]
        19. -m [0..11]
        20. -M [matrix file]
        21. -n [T/F]
        22. -o [output file]
        23. -p [program name]
        24. -P [0/1]
        25. -q [negative integer]
        26. -Q [1..23]
        27. -r [integer]
        28. -R [checkpoint file]
        29. -S [1..3]
        30. -t [integer]
        31. -T [T/F]
        32. -v [integer]
        33. -w [integer]
        34. -W [integer]
        35. -X [integer]
        36. -y [integer]
        37. -Y [real number]
        38. -z [real number]
        39. -Z [integer]
      5. formatdb Parameters
        1. -B [file]
        2. -F [file]
        3. -i [file]
        4. -l [file]
        5. -L [file]
        6. -n [string]
        7. -o [T/F]
        8. -p [T/F]
        9. -s [T/F]
        10. -t [string]
        11. -v [integer]
        12. -V [T/F]
      6. fastacmd Parameters
        1. -a [T/F]
        2. -c [T/F]
        3. -d [string]
        4. -D [T/F]
        5. -i [file]
        6. -I
        7. -l [integer]
        8. -L [integer],[integer]
        9. -o [file]
        10. -p [T/F/G]
        11. -P [integer]
        12. -s [string]
        13. -S [1..2]
        14. -t [T/F]
        15. -T [T/F]
      7. megablast Parameters (1/2)
      8. megablast Parameters (2/2)
        1. -a [integer]
        2. -A [integer]
        3. -b [integer]
        4. -d [string]
        5. -D [0..3]
        6. -e [real number]
        7. -E [integer]
        8. -f [T/F]
        9. -F [T/F] [string]
        10. -G [integer]
        11. -H [integer]
        12. -i [file]
        13. -I [T/F]
        14. -l [file]
        15. -L [string]
        16. -m [0..11]
        17. -M [integer]
        18. -n [T/F]
        19. -N [0,1,2]
        20. -o [file]
        21. -p [real number]
        22. -P [integer]
        23. -q [negative integer]
        24. -Q [file]
        25. -r [integer]
        26. -R [T/F]
        27. -s [integer]
        28. -S [0..3]
        29. -t [16,18,21]
        30. -T [T/F]
        31. -U [T/F]
        32. -v [integer]
        33. -W [integer]
        34. -X [integer]
        35. -y [integer]
        36. -z [real number]
        37. -Z [integer]
      9. bl2seq Parameters
        1. -a [file]
        2. -A [T/F]
        3. -d [real number]
        4. -D [0/1]
        5. -e [real number]
        6. -E [integer]
        7. -F [T/F] [string]
        8. -g [T/F]
        9. -G [integer]
        10. -i [file]
        11. -I [integer],[integer]
        12. -j [file]
        13. -J [integer],[integer]
        14. -m [T/F]
        15. -M [string]
        16. -o [file]
        17. -p [string]
        18. -q [negative integer]
        19. -r [integer]
        20. -S [1..3]
        21. -t [integer]
        22. -T [T/F]
        23. -U [T/F]
        24. -W [integer]
        25. -X [integer]
        26. -Y [real number]
      10. blastpgp Parameters (PSI-BLAST and PHI-BLAST) (1/2)
      11. blastpgp Parameters (PSI-BLAST and PHI-BLAST) (2/2)
        1. PSI-BLAST
        2. PHI-BLAST
          1. -a [integer]
          2. -A [integer]
          3. -b [integer]
          4. -B [file]
          5. -c [integer]
          6. -C [file]
          7. -d [string]
          8. -e [real]
          9. -E [integer]
          10. -f [integer]
          11. -F [string]
          12. -g [T/F]
          13. -G [integer]
          14. -h [real number]
          15. -H [integer]
          16. -i [file]
          17. -I [T/F]
          18. -j [integer]
          19. -J [T/F]
          20. -k [file]
          21. -K [integer]
          22. -l [string]
          23. -L [integer]
          24. -m [0..9]
          25. -M [string]
          26. -N [real number]
          27. -o [file]
          28. -O [file]
          29. -p [string]
          30. -Q [file]
          31. -R [file]
          32. -s [T/F]
          33. -S [integer]
          34. -t [T/F]
          35. -T [T/F]
          36. -U [T/F]
          37. -v [integer]
          38. -W [1..3]
          39. -X [integer]
          40. -y [real number]
          41. -Y [real number]
          42. -z [real number]
          43. -Z [integer]
      12. blastclust Parameters
        1. -a [integer]
        2. -b [T/F]
        3. -c [file]
        4. -C [T/F]
        5. -d [file]
        6. -e [T/F]
        7. -i [file]
        8. -l [file]
        9. -L [real number]
        10. -p [T/F]
        11. -r [file]
        12. -s [file]
        13. -v [file]
        14. -W [integer]
    2. WU-BLAST Reference
      1. Usage Statements
      2. Command-Line Syntax
      3. WU-BLAST Parameters (1/3)
      4. WU-BLAST Parameters (2/3)
      5. WU-BLAST Parameters (3/3)
        1. altscore=[string]
        2. B=[integer]
        3. bottom
        4. cpus=[integer]
        5. dbrecmax=[integer]
        6. dbrecmin=[integer]
        7. E=[number]
        8. E2=[number]
        9. echofilter
        10. errors
        11. filter=[string]
        12. gapE2=[number]
        13. gapH=[number]
        14. gapK=[number]
        15. gapL=[number]
        16. gapS2=[integer]
        17. gapsepqmax=[int]
        18. gapsepsmax=[int]
        19. gapX
        20. gi
        21. golf=[number]
        22. golmax=[integer]
        23. gspmax=[integer]
        24. H=[number]
        25. hspmax=[integer]
        26. hitdist=[integer]
        27. hspsepqmax=[int]
        28. hspsepsmax=[int]
        29. K=[number]
        30. kap
        31. L=[number]
        32. lcfilter
        33. lcmask
        34. links
        35. M=[integer]
        36. maskextra=[integer]
        37. matrix=[file]
        38. N=[integer]
        39. nogap
        40. nonnegok
        41. nosegs
        42. notes
        43. novalidctxok
        44. nwlen=[integer]
        45. nwstart=[integer]
        46. o=[file]
        47. olf=[number]
        48. olmax=[integer]
        49. postsw
        50. Q=[integer]
        51. qoffset=[integer]
        52. qrecmax=[integer]
        53. qrecmin=[integer]
        54. R=[integer]
        55. restest
        56. S=[integer]
        57. S2=[integer]
        58. seqtest
        59. span, span1, span2
        60. T=[integer]
        61. top
        62. topcomboN=[integer]
        63. V=[integer]
        64. warnings
        65. wink=[integer]
        66. wordmask=[method]
        67. W=[integer]
        68. X=[integer]
        69. Y=[number]
        70. Z=[number]
      6. xdformat Parameters
        1. -A [0..2]
        2. -a [database]
        3. -c [character]
        4. -D [integer]
        5. -d [string]
        6. -e [file]
        7. -G
        8. -i
        9. -K [integer]
        10. -k
        11. -L [number]
        12. -l [number]
        13. -M [number]
        14. -O [4..8]
        15. -P [integer]
        16. -q [0..3]
        17. -r
        18. -T [string]
        19. -v
        20. -X
      7. xdget Parameters
        1. -A [n, 0]
        2. -a [integer]
        3. -b [integer]
        4. -d
        5. -D [integer]
        6. -e [file]
        7. -F
        8. -f
        9. -G
        10. -o [file]
        11. -N [0, n]
        12. -P [integer]
        13. -r
        14. -T [string]
        15. -t
  10. Part VI
    1. NCBI Display Formats
      1. Brief Descriptions
      2. Detailed Descriptions and Examples
        1. Option 0: Pairwise Alignments
        2. Query-Anchored Alignments
        3. Option 1: Query-Anchored Showing Identities
        4. Option 2: Query-Anchored, No Identities
        5. Option 3: Flat Query-Anchored Showing Identities
        6. Option 4: Flat Query-Anchored, No Identities
        7. Option 5: Query-Anchored, No Identities, and Blunt Ends
        8. Option 6: Flat Query-Anchored, No Identities, and Blunt Ends
        9. Option 7: XML
        10. Option 8: Tabular, Without Comment Lines
        11. Option 9: Tabular, with Comment Lines
        12. Option 10: ASN.1 Text Format
        13. Option 11: ASN.1 Binary Format
    2. Nucleotide Scoring Schemes
    3. NCBI-BLAST Scoring Schemes
      1. NCBI-BLAST Matrices and Gap Costs
    4. blast-imager.pl
    5. blast2table.pl
  11. Glossary
  12. Index

Product information

  • Title: BLAST
  • Author(s): Ian Korf, Mark Yandell, Joseph Bedell
  • Release date: July 2003
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9780596002992