- Mining the Social Web
- Preface
- 1. Introduction: Hacking on Twitter Data
- 2. Microformats: Semantic Markup and Common Sense Collide
- 3. Mailboxes: Oldies but Goodies
- 4. Twitter: Friends, Followers, and Setwise Operations
- 5. Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
- 6. LinkedIn: Clustering Your Professional Network for Fun (and Profit?)
- 7. Google+: TF-IDF, Cosine Similarity, and Collocations
- 8. Blogs et al.: Natural Language Processing (and Beyond)
- 9. Facebook: The All-in-One Wonder
- 10. The Semantic Web: A Cocktail Discussion
- Index
- About the Author
- Colophon

Although rigorous approaches to natural language processing (NLP) that include such things as sentence segmentation, tokenization, word chunking, and entity detection are necessary in order to achieve the deepest possible understanding of textual data, it’s helpful to first introduce some fundamentals from Information Retrieval (IR) theory. The remainder of this chapter introduces some of the more foundational aspects of IR theory, including TF-IDF, the cosine similarity metric, and some of the theory behind collocation detection. Chapter 8 provides a deeper discussion of NLP.

If you want to dig deeper into IR theory, the full text of
*Introduction to Information Retrieval* is
available online
and provides more information than you could ever want to know about
the field.

Information retrieval is an extensive field with many specialties.
This discussion narrows in on TF-IDF, one of the most fundamental
techniques for retrieving relevant documents from a corpus. TF-IDF
stands for term frequency-inverse document
frequency and can be used to query a corpus by calculating
normalized scores that express the relative importance of terms in the
documents. Mathematically, TF-IDF is expressed as the product of the
term frequency and the inverse document frequency, `tf_idf = tf * idf`,
where the term `tf` represents the importance of a term in a specific
document, and `idf` represents the importance of a term relative to the
entire corpus. Multiplying these ...
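
To make the product concrete, here is a minimal sketch in Python. It assumes one common variant of the definitions, which may differ in detail from the formulas used elsewhere in this chapter: `tf` is a term's count divided by the document's length, and `idf` is the natural logarithm of the number of documents divided by the number of documents that contain the term. The function names and the three-document toy corpus are illustrative only.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: fraction of the document's tokens equal to `term`."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing `term`).

    Assumption: an unsmoothed variant. Terms that appear in no document
    get a score of 0.0 to avoid dividing by zero.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0
    return math.log(len(corpus) / n_containing)


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF score of `term` in one document: the product tf * idf."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Toy corpus (made up for illustration): each document is a list of tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

for term in ("the", "cat", "sat"):
    print(term, tf_idf(term, corpus[0], corpus))
```

In this toy corpus, "the" appears in every document, so its `idf` is log(3/3) = 0 and its score vanishes no matter how often it occurs. "sat" appears in only one document and therefore scores higher than "cat", which appears in two.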