A very useful distance metric between strings is provided by the TfIdfDistance class. It is closely related to the scoring used by the popular open source search library Lucene, which underlies Solr and Elasticsearch; there, the strings being compared are the query and the documents in the index. Tf-Idf stands for the core formula: term frequency (TF) times inverse document frequency (IDF), computed for the terms shared by the query and the document. A very useful property of this approach is that common terms that occur in most documents (for example, "the") are downweighted, while rare terms are upweighted in the distance comparison. This focuses the distance on the terms that actually discriminate between documents in the collection. ...
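
The weighting described above can be illustrated with a small, self-contained sketch. This is not the TfIdfDistance implementation itself; the class name TfIdfSketch, the whitespace tokenizer, the square-root damping of term counts, and the log(N/df) form of IDF are all assumptions chosen to keep the example short. The distance is taken as one minus the cosine of the two TF-IDF weighted term vectors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of TF-IDF weighted string distance.
// Not the library class; names and formula details are assumptions.
class TfIdfSketch {
    // Number of training documents that contain each term.
    private final Map<String, Integer> docFrequency = new HashMap<>();
    private int numDocs = 0;

    // Assumed tokenizer: lowercase, split on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Count each distinct term once per document.
    void train(String doc) {
        numDocs++;
        Set<String> distinct = new HashSet<>(tokenize(doc));
        for (String term : distinct) {
            docFrequency.merge(term, 1, Integer::sum);
        }
    }

    // IDF = log(N / df); terms never seen in training get weight 0.
    double idf(String term) {
        int df = docFrequency.getOrDefault(term, 0);
        return df == 0 ? 0.0 : Math.log((double) numDocs / df);
    }

    // Weight of each term in a string: sqrt(count) * idf.
    Map<String, Double> weights(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokenize(text)) {
            counts.merge(t, 1, Integer::sum);
        }
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            w.put(e.getKey(), Math.sqrt(e.getValue()) * idf(e.getKey()));
        }
        return w;
    }

    // Cosine of the two weighted vectors; only shared terms add to the dot product.
    double proximity(String a, String b) {
        Map<String, Double> wa = weights(a);
        Map<String, Double> wb = weights(b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : wa.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = wb.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : wb.values()) normB += v * v;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / Math.sqrt(normA * normB);
    }

    double distance(String a, String b) {
        return 1.0 - proximity(a, b);
    }

    public static void main(String[] args) {
        TfIdfSketch model = new TfIdfSketch();
        model.train("the cat sat");
        model.train("the dog ran");
        model.train("the bird flew");
        System.out.println(model.distance("the cat", "the dog"));
        System.out.println(model.distance("the cat", "cat"));
    }
}
```

In this toy collection "the" occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing to the comparison: "the cat" and "the dog" end up at the maximum distance of 1 even though they share a word, while "the cat" and "cat" are treated as essentially identical.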