
Ruby Cookbook by Leonard Richardson, Lucas Carlson

23.7. Finding Duplicate Files

Problem

You want to find the duplicate files that are taking up all the space on your hard drive.

Solution

The simple solution is to group the files by size and then by their MD5 checksum. Two files are presumed identical if they have the same size and MD5 sum.

The following program takes a list of directories on the command line and prints out all sets of duplicate files. You can pass a different code block into each_set_of_duplicates for different behavior: for instance, to prompt the user about which of the duplicates to keep and which to delete.

	#!/usr/bin/ruby
	# find_duplicates.rb
	
	require 'find'
	require 'digest/md5'

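	# Yields each group of two or more files under the given paths that
	# share both a size and an MD5 checksum.
	def each_set_of_duplicates(*paths)
	  # First pass: group every regular file by its size.
	  sizes = {}
	  Find.find(*paths) do |f|
	    (sizes[File.size(f)] ||= []) << f if File.file? f
	  end
	  # Second pass: checksum only the files that share a size with another.
	  sizes.each do |size, files|
	    next unless files.size > 1
	    md5s = {}
	    files.each do |f|
	      digest = Digest::MD5.hexdigest(File.read(f))
	      (md5s[digest] ||= []) << f
	    end
	    md5s.each { |sum, dupes| yield dupes if dupes.size > 1 }
	  end
	end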

	each_set_of_duplicates(*ARGV) do |f|
	  puts "Duplicates: #{f.join(", ")}"
	end
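
As an illustration of passing a different block, the sketch below splits one set of duplicates into a file to keep and the rest to delete. It is a non-interactive stand-in for the prompt mentioned above: the helper name plan_cleanup and its keep-the-shortest-path rule are assumptions made for this example, not part of the recipe, and nothing is actually deleted.

```ruby
# Hypothetical helper (not from the recipe): given one set of duplicate
# paths, pick a file to keep and list the others as deletion candidates.
# Assumed rule: keep the shortest path, breaking ties alphabetically.
def plan_cleanup(files)
  keeper, *extras = files.sort_by { |f| [f.length, f] }
  { keep: keeper, delete: extras }
end

# Intended use with the recipe's method (a dry run that only prints):
#
#   each_set_of_duplicates(*ARGV) do |files|
#     plan = plan_cleanup(files)
#     puts "Keep:   #{plan[:keep]}"
#     plan[:delete].each { |f| puts "Delete: #{f}" }
#   end

# Stand-alone demonstration on made-up paths.
plan = plan_cleanup(["/tmp/backup/notes.txt", "/tmp/notes.txt", "/tmp/old/notes.txt"])
puts "Keep:   #{plan[:keep]}"
plan[:delete].each { |f| puts "Delete: #{f}" }
```

Keeping the decision in a separate method means the same rule can be tested on plain arrays of strings before it is ever pointed at a real directory tree.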

Discussion

This is one task that can't be handled with a simple Find.find code block, because the goal is to work out which files have a certain relationship to each other. Find.find takes care of walking the file tree, but it would be very inefficient to try to make a single trip through the tree and immediately spit out a set of duplicates. Instead, we group the files by size and then by their MD5 checksum.
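
The two-stage grouping can also be sketched with Enumerable#group_by. This is a rough restatement under assumptions, not the recipe's own code: duplicate_sets is a hypothetical name, it takes an array of paths instead of walking directories, and it uses Digest::MD5.file to checksum each file without first reading it into a string.

```ruby
require 'digest/md5'
require 'tmpdir'

# Hypothetical helper: returns an array of duplicate sets (arrays of paths).
def duplicate_sets(paths)
  # Stage 1 (cheap): only files of equal size can possibly be identical.
  same_size_groups = paths.group_by { |f| File.size(f) }.values
  candidates = same_size_groups.select { |group| group.size > 1 }

  # Stage 2 (expensive): checksum only the files that survived stage 1.
  candidates.flat_map do |group|
    group.group_by { |f| Digest::MD5.file(f).hexdigest }
         .values
         .select { |same_sum| same_sum.size > 1 }
  end
end

# Demonstration on throwaway files in a temporary directory.
Dir.mktmpdir do |dir|
  contents = { "a.txt" => "hello", "b.txt" => "hello",
               "c.txt" => "world", "d.txt" => "hi" }
  contents.each { |name, body| File.write(File.join(dir, name), body) }

  paths = Dir.glob(File.join(dir, "*")).sort
  duplicate_sets(paths).each do |files|
    puts "Duplicates: #{files.map { |f| File.basename(f) }.join(', ')}"
  end
end
```

In the demonstration, c.txt has the same size as a.txt and b.txt, so it is checksummed and then rejected, while d.txt is ruled out by its size alone and is never read.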

The MD5 ...
