Removing duplicate documents (deduplication)

Solr can prevent duplicate or near-duplicate documents from being indexed by using a signature (fingerprint) field. It supports this kind of deduplication natively through the Signature class, which can also be extended to implement new hash and signature algorithms.
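To make the idea concrete, here is a minimal Python sketch of signature-based deduplication. It is not Solr's implementation: it only illustrates the general technique of hashing selected fields into a fingerprint and keeping one document per fingerprint. The field names (artist, title), the sample documents, and the helper functions are all hypothetical, and MD5 is simply one convenient hash choice.

```python
import hashlib


def signature(doc, fields):
    """Return an MD5 hex digest computed over the chosen fields of a document."""
    h = hashlib.md5()
    for name in fields:
        value = str(doc.get(name, ""))
        # A separator byte keeps ("ab", "c") distinct from ("a", "bc").
        h.update(name.encode("utf-8"))
        h.update(b"\x00")
        h.update(value.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()


def index_with_dedupe(docs, fields):
    """Keep one document per signature; a later duplicate overwrites an earlier one."""
    index = {}
    for doc in docs:
        sig = signature(doc, fields)
        index[sig] = {**doc, "signature": sig}
    return index


# Hypothetical sample data: documents 1 and 2 differ only in their id.
docs = [
    {"id": "1", "artist": "Queen", "title": "Bohemian Rhapsody"},
    {"id": "2", "artist": "Queen", "title": "Bohemian Rhapsody"},
    {"id": "3", "artist": "Queen", "title": "Under Pressure"},
]
index = index_with_dedupe(docs, ["artist", "title"])
print(len(index))  # prints 2
```

Because the id field is left out of the signature, the first two documents hash to the same value and only one of them survives. An exact hash like this catches only exact duplicates on the chosen fields; detecting near duplicates requires a fuzzier signature.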

Let's see how to implement deduplication in Solr. We'll reuse the musicCatalog core from the previous chapter and modify it:

  1. Copy the musicCatalog core to create a new core called musicCatalog-dedupe. Once the new core exists, change its schema.xml to add a signature field that will hold the document signature/fingerprint:
    <!-- Field to store the fingerprint/signature ...
