O'Reilly logo

Natural Language Processing with Java and LingPipe Cookbook by Krishna Dayanidhi, Breck Baldwin

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

The Tf-Idf distance

A very useful distance metric between strings is provided by the TfIdfDistance class. It is, in fact, closely related to the distance metric from the popular open source search engine, Lucene/SOLR/Elastic Search, where the strings being compared are the query against documents in the index. Tf-Idf stands for the core formula that is term frequency (TF) times inverse document frequency (IDF) for terms shared by the query and the document. A very cool thing about this approach is that common terms (for example, the) that are very frequent in documents are downweighted, while rare terms are upweighted in the distance comparison. This can help focus the distance on terms that are actually discriminating in the document collection. ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required