Let’s assume that you’ve already chosen a search engine. What content should you index for searching? It’s certainly reasonable to point your search engine at your site, tell it to index the full text of every document it finds, and walk away. That’s a large part of the value of search systems—they can be comprehensive, and are able to cover a huge amount of content quickly.
But indexing everything doesn’t always serve users well. In a large, complex web environment chock-full of heterogeneous subsites and databases, you may want to allow users to search the silo of technical reports or the staff directory without muddying their search results with the latest HR newsletter articles on the addition of fish sticks to the cafeteria menu. The creation of search zones—pockets of more homogeneous content—reduces the apples-and-oranges effect and allows users to focus their searches.
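The idea of a search zone can be sketched in a few lines of code. The example below is only an illustration, not how any particular search engine implements zones: the `Document` class, the zone labels, the URLs, and the `search` function are all hypothetical names invented here, and matching is a naive whole-word test rather than real relevance ranking.

```python
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    zone: str   # hypothetical zone label, e.g. "tech-reports"
    text: str


def search(docs, query, zone=None):
    """Return URLs of documents containing every query term.

    If `zone` is given, only documents in that zone are considered.
    """
    terms = query.lower().split()
    hits = []
    for doc in docs:
        # Skip documents outside the requested zone.
        if zone is not None and doc.zone != zone:
            continue
        words = doc.text.lower().split()
        if all(term in words for term in terms):
            hits.append(doc.url)
    return hits


# Made-up sample content standing in for three subsites.
DOCS = [
    Document("/reports/tr-101", "tech-reports", "Cafeteria refrigeration load study"),
    Document("/hr/news/fish-sticks", "hr-newsletter", "Fish sticks join the cafeteria menu"),
    Document("/staff/jsmith", "staff-directory", "Jane Smith cafeteria manager"),
]

everything = search(DOCS, "cafeteria")
reports_only = search(DOCS, "cafeteria", zone="tech-reports")
```

Searching everything for "cafeteria" returns all three sample documents, newsletter included, while restricting the same query to the assumed "tech-reports" zone returns only the technical report.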
Choosing what to make searchable isn’t limited to selecting the right search zones. Each document or record in a collection has some sort of structure, whether rendered in HTML, XML, or database fields. In turn, that structure stores content components: pieces or “atoms” of content that typically are smaller than a document. Some of that structure—say, an author’s name—may be leveraged by a search engine, while other parts—such as the legal disclaimer at the bottom of each page—might be left out.
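To make the notion of indexing selected content components concrete, here is a minimal sketch that builds a per-field index from structured records. Everything in it is an assumption made for illustration: the field names (`title`, `author`, `body`, `disclaimer`), the sample records, and the `build_index` and `lookup` helpers are hypothetical, and a real search engine would configure this through its own schema or settings.

```python
from collections import defaultdict

# Assumed choice of components worth indexing; "disclaimer" is left out.
INDEXED_FIELDS = ("title", "author", "body")


def build_index(records, fields=INDEXED_FIELDS):
    """Map (field, word) pairs to the set of record IDs containing them."""
    index = defaultdict(set)
    for rec_id, record in records.items():
        for field in fields:
            for word in record.get(field, "").lower().split():
                index[(field, word)].add(rec_id)
    return index


def lookup(index, field, word):
    """Return a sorted list of record IDs whose `field` contains `word`."""
    return sorted(index.get((field, word.lower()), set()))


# Made-up records; each carries the same boilerplate disclaimer.
RECORDS = {
    "tr-101": {
        "title": "Refrigeration load study",
        "author": "Jane Smith",
        "body": "Measured compressor load over one week",
        "disclaimer": "All rights reserved",
    },
    "tr-102": {
        "title": "Network audit",
        "author": "Raj Patel",
        "body": "Audit of switch firmware",
        "disclaimer": "All rights reserved",
    },
}

INDEX = build_index(RECORDS)
by_author = lookup(INDEX, "author", "Smith")
by_disclaimer = lookup(INDEX, "disclaimer", "reserved")
```

Because the disclaimer text never enters the index, a query against it finds nothing, while the author field can be searched on its own.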
Finally, if you’ve conducted an inventory and analysis of your site’s content, you already ...