4.5. Summary

The search engine built in this project is rather basic. Although it suggests corrections for perceived misspellings, it lacks other features that users have come to expect from search engines, such as relevancy ranking and word-stemming.

The database also stores each word's position, which records where the word was found within a document. By comparing these values, you can identify which words appear close to one another or even in sequence, and use that information to rank the results. You may also want to consider the number of times a term appears in a document when sorting the results. The ranking algorithms used by commercial search engines are closely guarded secrets, and there's no single right or wrong answer. Feel free to experiment.
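As one possible starting point for such experiments, the sketch below combines term frequency with word proximity. It is written in Python for brevity and makes several assumptions: the index is represented as a plain dictionary mapping a document ID to each word's list of positions (a stand-in for the project's database tables, whose layout is not reproduced here), the names `score_document` and `rank_documents` are hypothetical, and the scoring formula is arbitrary rather than any established ranking method.

```python
from itertools import product


def score_document(positions_by_term, query_terms):
    """Score one document from its stored word positions (higher is better)."""
    # Term frequency: total number of occurrences of the query terms.
    frequency = sum(len(positions_by_term.get(t, [])) for t in query_terms)

    # Proximity: for each pair of neighbouring query terms, find the
    # smallest distance between any two of their occurrences.
    proximity = 0.0
    for first, second in zip(query_terms, query_terms[1:]):
        a = positions_by_term.get(first, [])
        b = positions_by_term.get(second, [])
        if a and b:
            gap = min(abs(q - p) for p, q in product(a, b))
            if gap > 0:  # a gap of 0 only occurs when a term is repeated
                proximity += 1.0 / gap  # adjacent words (gap 1) add the most
    return frequency + proximity


def rank_documents(index, query_terms):
    """Return the IDs of matching documents, best score first."""
    scores = {doc: score_document(terms, query_terms)
              for doc, terms in index.items()}
    matches = [doc for doc, score in scores.items() if score > 0]
    return sorted(matches, key=lambda doc: scores[doc], reverse=True)


# Hypothetical index: document ID -> word -> positions within the document.
index = {
    "doc1": {"quick": [0], "fox": [9]},
    "doc2": {"quick": [3], "fox": [4]},
    "doc3": {"quick": [1, 5]},
}
print(rank_documents(index, ["quick", "fox"]))
```

Here the document in which the two words sit next to each other ranks first, ahead of documents that contain the same number of matches spread further apart. Changing the weight given to frequency versus proximity is exactly the kind of tuning left open to experimentation.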

Word-stemming is another area where many algorithms are available and there is no one right way to do it. Stemming allows a user to enter fish in a query, for example, but receive results for fishing, fishes, and fished as well. The engine treats such words as different forms of the same base word and retrieves them all from the index.

Probably the easiest stemming method to implement is to truncate common suffixes such as -s, -es, -ies, -ly, -ing, and -ed from words before they are added to the database, and then from terms before they are used in the search query. This approach is naïve, however, because the English language has a large number of exceptions. A search for run might include runs and running, but not ran. See http://en.wikipedia.org/wiki/Stemming ...
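The suffix-truncation approach described above might be sketched as follows. This is a minimal illustration in Python, not the project's code: the function name `naive_stem` and the `min_stem` length guard are assumptions added for the example, and the suffix list is simply the one given in the text, checked longest first so that -ies is tried before -es and -s.

```python
# Suffixes from the text, ordered so longer ones are tried first.
SUFFIXES = ("ies", "ing", "es", "ed", "ly", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# The same function would be applied to words at indexing time and to
# the user's terms at query time, so both sides reduce to the same stem.
for w in ("fish", "fishing", "fishes", "fished", "runs", "running", "ran"):
    print(w, "->", naive_stem(w))
```

With this sketch, fishing, fishes, and fished all reduce to fish, and runs reduces to run. It also shows the exceptions at work: running becomes runn rather than run because of the doubled consonant, and ran is left untouched, so neither would match a search for run without extra rules.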
