This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 1: Hello BLAST
Using NCBI-BLAST
This book begins by exploring the BLAST pages on the NCBI web site. The NCBI,
part of the National Institutes of Health, is a U.S. government-funded center for the
curation and presentation of public biological knowledge. The NCBI is a public
repository for DNA and protein sequences (GenBank), but it’s far more than just a
data storehouse. The NCBI also maintains a comprehensive medical publication
archive (PubMed), distributes many tools for biological analyses (NCBI toolbox),
and puts together its own tools for making the most use of the data that it stores
(LocusLink, UniGene, RefSeq, Taxonomy browser). Most importantly, for our purposes,
it’s where the BLAST algorithm was first developed (Altschul et al., 1990) and
where it can be obtained, distributed, and used for free without restrictions. Anyone
with access to the Internet can run a BLAST search and explore the plethora of
genetic resources that have been amassed and curated by the NCBI over the years.
You’ll get the most out of this chapter if you follow along with a web browser. Begin
by going to the BLAST homepage at http://www.ncbi.nlm.nih.gov/BLAST.
Choosing the BLAST Program
Without explaining all of the options presented on the homepage, let’s get right into
it with a default BLASTN search. Choose “Standard nucleotide-nucleotide BLAST
[blastn]” as shown in Figure 1-1. BLASTN is a program that compares a nucleotide
query sequence to a database of nucleotide sequences.
Figure 1-1. NCBI BLAST home page
Entering the Query Sequence
After choosing the kind of search you want to perform, the next step is to define the
sequence with which to search. There are three options for this: paste in the bare
sequence, paste in a file in FASTA format, or enter a valid NCBI identifier. You can
just start typing a sequence in the search box; however, when the search is done,
there will be no identifier to describe the sequence you entered. After several such
searches, the lack of an identifier will make it difficult to keep track of which results
go with which sequence. The second option allows you to define the sequence using
the FASTA format. The FASTA format is described in detail in Chapter 11, but the
basic specifications are that it’s a text file beginning with a greater-than sign (>)
and an identifier and definition line, with the one-letter nucleotide or peptide
sequence on the subsequent lines. Let’s use the following
sequence:
>gi|11611818|gb|AF287139.1|AF287139 Latimeria chalumnae Hoxa-11 gene, partial cds
TACTTGCCAAGTTGCACCTACTACGTTTCGGGTCCCGATTTCTCCAGCCTCCCTTCTTTTTTGCCCCAGACCCCGTCTTCTCGCC
CCATGACATACTCCTATTCGTCTAATCTACCCCAAGTTCAACCTGTGAGAGAAGTTACCTTCAGGGACTATGCCATTGATACATC
CAATAAATGGCATCCCAGAAGCAATTTACCCCATTGCTACTCAACAGAGGAGATTCTGCACAGGGACTGCCTAGCAACCACCACC
GCTTCAAGCATAGGAGAAATCTTTGGGAAAGGCAACGCTAACGTCTACCATCCTGGCTCCAGCACCTCTTCTAATTTCTATAACA
CAGTGGGTAGAAACGGGGTCCTACCGCAAGCCTTTGACCAGTTTTTCGAGACGGCTTATGGCACAACAGAAAACCACTCTTCTGA
CTACTCTGCAGACAAGAATTCCGACAAAATACCTTCGGCAGCAACTTCAAGGTCGGAGACTTGCAGGGAGACAGACGAGAAGGAG
AGACGGGAAGAAAGCAGTAGCCCAGAGTCTTCTTCCGGCAACAATGAGGAGAAATCAAGCAGTTCCAGTGGTCAACGTACAAGGA
AGAAGAGGTGC
Before you try to type all this into the search text box, let’s look at identifiers, which
are an easier and more reliable way to enter queries. The previous example of the
coelacanth (Latimeria chalumnae) Hoxa-11 gene has three valid NCBI identifiers that
can be entered into the search box. The three identifiers are separated by pipes (|)
and designate the GI (11611818), the accession number and version (AF287139.1),
and the locus (AF287139). These identifiers are explained in detail in Chapter 11.
For the current search (Figure 1-2), use the locus identifier, AF287139.
Using the locus, BLAST pulls out the FASTA file from the NCBI databases and uses
it in the search just as if you had entered it all in the search box. If you are dealing
with a public sequence, this is the fastest and most reliable way to enter the query.
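To make the FASTA layout and the pipe-separated identifiers concrete, here is a minimal Python sketch that splits a record into its identifier, definition, and sequence, and then breaks the identifier into the GI, accession.version, and locus. The function names and dictionary keys are illustrative assumptions, not part of any NCBI tool, and the record is shortened to the first and last sequence lines of the example above.

```python
from typing import Dict, Tuple


def parse_fasta_record(text: str) -> Tuple[str, str, str]:
    """Split one FASTA record into (identifier, definition, sequence).

    Hypothetical helper for illustration; it handles a single record only.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must begin with '>'")
    header = lines[0][1:]                        # drop the leading '>'
    identifier, _, definition = header.partition(" ")
    sequence = "".join(lines[1:])                # rejoin the wrapped sequence lines
    return identifier, definition, sequence


def split_ncbi_identifier(identifier: str) -> Dict[str, str]:
    """Split an identifier of the form gi|GI|gb|ACCESSION.VERSION|LOCUS.

    Assumes exactly the five-field layout shown in the example above;
    the key names are this sketch's own choice.
    """
    fields = identifier.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("expected gi|<GI>|<db>|<accession.version>|<locus>")
    _, gi, database, accession_version, locus = fields
    return {
        "gi": gi,
        "database": database,
        "accession_version": accession_version,
        "locus": locus,
    }


# Shortened copy of the coelacanth record (first and last sequence lines only).
record = """>gi|11611818|gb|AF287139.1|AF287139 Latimeria chalumnae Hoxa-11 gene, partial cds
TACTTGCCAAGTTGCACCTACTACGTTTCGGGTCCCGATTTCTCCAGCCTCCCTTCTTTTTTGCCCCAGACCCCGTCTTCTCGCC
AGAAGAGGTGC
"""

identifier, definition, sequence = parse_fasta_record(record)
ids = split_ncbi_identifier(identifier)
print(ids["gi"], ids["accession_version"], ids["locus"])
print(definition)
```

Any one of the three values printed on the first line could be entered in the search box in place of the full sequence.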
Choosing the Database to Search
For this search, we’ll leave the default database as nr (Figure 1-3). Historically, the
database was curated to contain a nonredundant set of nucleotide sequences (hence
nr); however, it’s no longer screened to be nonredundant. Because of its comprehensive
nature, nr is usually a good starting point when trying to identify a novel sequence or
when determining if related sequences have been described previously. The database
is curated by the NCBI and consists of nucleotide sequences from all of GenBank,
RefSeq, EMBL, and DDBJ. You don’t need to be concerned about the details of these
sequence sources now; just know that they provide a comprehensive set of
