Before You Go Off and Try to Build a Search Engine…

While this chapter has hopefully given you some good insight into how to extract useful information from unstructured text, it’s barely scratched the surface of the most fundamental concepts, both in terms of theory and engineering considerations. Information retrieval is a multibillion-dollar industry, so you can imagine the combined investment that goes into both the theory and the implementations that work at scale to power search engines such as Google and Yahoo!. This section is a modest attempt to make sure you’re aware of some of the inherent limitations of TF-IDF, cosine similarity, and other concepts introduced in this chapter, in the hope that it will help shape your overall view of this space.

While TF-IDF is a powerful tool that’s easy to use, our specific implementation of it has a few important limitations that we’ve conveniently overlooked but that you should consider. One of the most fundamental is that it treats a document as a bag of words, which means that the order of terms in both the document and the query itself does not matter. For example, querying for “Green Mr.” would return the same results as “Mr. Green” if we didn’t implement logic to take the query term order into account or interpret the query as a phrase as opposed to a pair of independent terms. But obviously, the order in which terms appear is very important.
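
To make the bag-of-words limitation concrete, here is a minimal sketch in Python. It is not the chapter's TF-IDF implementation: the helper names (bag_of_words, overlap_score, contains_phrase), the naive whitespace tokenizer, and the sample document are all assumptions made up for illustration. It shows that an order-insensitive score cannot tell the two queries apart, while a simple phrase check can.

```python
from collections import Counter


def bag_of_words(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    # Counter keeps term frequencies but discards term order.
    return Counter(text.lower().split())


def overlap_score(query, document):
    # Hypothetical order-insensitive score: for each query term, multiply
    # its count in the query by its count in the document, then sum.
    q = bag_of_words(query)
    d = bag_of_words(document)
    return sum(count * d[term] for term, count in q.items())


def contains_phrase(query, document):
    # Order-sensitive check: do the query tokens appear contiguously,
    # in the same order, somewhere in the document?
    q = query.lower().split()
    d = document.lower().split()
    n = len(q)
    return any(d[i:i + n] == q for i in range(len(d) - n + 1))


# Made-up sample document, for illustration only.
doc = "Mr. Green killed Colonel Mustard"

# Both queries collapse to the same bag of words, so they score the same.
print(bag_of_words("Mr. Green") == bag_of_words("Green Mr."))
print(overlap_score("Mr. Green", doc), overlap_score("Green Mr.", doc))

# Treating the query as a phrase distinguishes the two orderings.
print(contains_phrase("Mr. Green", doc), contains_phrase("Green Mr.", doc))
```

The phrase check is only one way to recover ordering. It trades recall for precision, since a document that mentions both terms far apart no longer matches.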

Even if you carry out an n-gram ...

Mining the Social Web by Matthew A. Russell
