FileLister

Find duplicate files on a Mac using checksums, not guesswork

Two files with different names can be byte-for-byte identical. Two files with the same name can be completely different. If you want to find real duplicates on a Mac, comparing names and sizes is not enough. You need to compare content, and that means checksums.

Duplicate files pile up in ways that are hard to see. You save the same PDF from an email twice. A backup copies a folder into two places. You export the same photo at two moments and the names differ by a number. Over a few years a working drive can carry tens of gigabytes of files that exist more than once, and nothing in Finder tells you.

The instinct is to sort by name and look for pairs. That misses most of them. Here is why, and what actually works.

Why name and size matching fails

Name matching catches the obvious cases and nothing else. invoice.pdf and invoice copy.pdf get flagged. But IMG_4821.jpg and wedding-final.jpg can be the exact same image, and name matching will never pair them.

Size matching is a bit better but still not proof. Two different photos shot seconds apart can land on the same byte count by coincidence, so you get false pairs. And a file saved by two apps can differ by a few metadata bytes while the picture inside is identical; the sizes differ, so size matching never pairs them, and strictly speaking they are no longer byte-for-byte copies. Size is a hint, not proof.

What a checksum actually proves

A checksum, also called a hash, runs the entire contents of a file through a function that spits out a short fixed-length string. The same bytes always produce the same string. Different bytes produce a different one, with accidental collisions so unlikely that you can ignore them in practice. MD5, SHA-1, and SHA-256 are the common ones.
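To make that concrete, here is a minimal sketch of hashing a file with Python's standard hashlib module. The function name file_checksum and the 1 MiB chunk size are choices made for this example; this is not how FileLister itself is implemented.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file's contents.

    The file is read in chunks so a large file never has to fit in memory.
    `algorithm` is any name hashlib accepts, such as "md5", "sha1" or "sha256".
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Whatever the size of the file, the result has a fixed length: 32 hex characters for MD5, 40 for SHA-1, and 64 for SHA-256.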

So the logic for finding duplicates becomes simple and honest:

  • If two files have the same checksum, they have the same content, whatever they are named and wherever they live.
  • If two files have different checksums, they do not have the same content, even if the name and size match.

This is how backup tools, forensics tools, and package managers decide whether two things are equal. Short of comparing every byte directly, it is the only method that does not lie to you.
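Both rules can be checked in a few lines of Python. This is only an illustration: the byte strings below are made-up stand-ins for file contents, and sha256_hex is a helper named for this example.

```python
import hashlib

# Hypothetical contents standing in for real files on disk.
original = b"quarterly report, final version\n"
renamed_copy = bytes(original)                          # same bytes, imagine a different name
same_size_edit = b"quarterly report, FINAL version\n"   # same length, different bytes


def sha256_hex(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


# Identical content gives an identical checksum, whatever the name.
print(sha256_hex(original) == sha256_hex(renamed_copy))     # True

# Equal size does not mean equal content, and the checksums show it.
print(len(original) == len(same_size_edit))                 # True
print(sha256_hex(original) == sha256_hex(same_size_edit))   # False
```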

Doing it by hand

You can compute a checksum in Terminal:

md5 ~/Downloads/report.pdf

Or for a stronger hash:

shasum -a 256 ~/Downloads/report.pdf

That is fine for checking one or two files. To find duplicates across a whole folder you would hash every file, collect the results, group by hash, and report the groups with more than one member. People write shell scripts for this, and they work, but they are slow, awkward to read, and they give you a wall of hex with no thumbnails and no easy way to act on the result.
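For readers who want to see the hash, collect, group, and report steps spelled out, here is a rough sketch in Python rather than shell. The function names are invented for this example, and it assumes you are happy to skip symlinks and unreadable files; treat it as a starting point, not a finished tool.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash one file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return {checksum: [paths]} for every checksum shared by 2+ files under root."""
    by_hash = defaultdict(list)
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and anything that is not a regular file
            try:
                by_hash[sha256_of_file(path)].append(path)
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
    # Keep only the groups with more than one member.
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Calling find_duplicates(os.path.expanduser("~/Downloads")) returns one entry per set of identical files. Note that all empty files share a checksum, so they show up as one group.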

Doing it across a folder

FileLister computes MD5, SHA-1, and SHA-256 while it scans, so the checksum is just another column in your catalog. Once every file has a hash, finding duplicates is a matter of grouping identical hashes, which the app does for you.

  1. Scan the folder or drive you want to check.
  2. Turn on the hash column you prefer: SHA-256 if you want the strongest guarantee, MD5 if you want speed and the files are not security-sensitive.
  3. Use duplicate detection to see every set of files that share a hash.
  4. Keep one from each set and deal with the rest.

Because the match is on content, it does not matter that the copies have different names or sit in different folders. If the bytes are equal, they land in the same group.

Comparing two folders

A related job is checking whether a copy finished correctly, or whether an old folder and a new one hold the same files. Hashes answer that too. Catalog both folders, compare the hash sets, and you get three lists: files only in the first, files only in the second, and files in both. That tells you exactly what a copy missed, without trusting the copy tool's own report.
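The three-list comparison boils down to set operations on checksums. The sketch below shows the idea in Python; catalog and compare_folders are hypothetical names for this example and say nothing about how FileLister does it internally.

```python
import hashlib
import os


def catalog(root, chunk_size=1024 * 1024):
    """Return {sha256 checksum: [paths relative to root]} for the files under root."""
    hashes = {}
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and other non-files
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(chunk_size), b""):
                    digest.update(chunk)
            rel = os.path.relpath(path, root)
            hashes.setdefault(digest.hexdigest(), []).append(rel)
    return hashes


def compare_folders(first, second):
    """Return (only_in_first, only_in_second, in_both) as sorted path lists."""
    a, b = catalog(first), catalog(second)
    only_first = sorted(p for h in a.keys() - b.keys() for p in a[h])
    only_second = sorted(p for h in b.keys() - a.keys() for p in b[h])
    in_both = sorted(p for h in a.keys() & b.keys() for p in a[h])  # paths as named in first
    return only_first, only_second, in_both
```

Because the comparison is on content, a file that was renamed or moved during the copy still counts as present in both folders.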

A note on MD5 versus SHA-256

MD5 is fast and perfectly good for spotting accidental duplicates. It is considered broken for security, meaning someone can deliberately craft two different files with the same MD5, but nobody is doing that to your holiday photos. For finding real duplicates on your own drive, MD5 is fine. When the check has to stand up to tampering, use SHA-256.

Either way, the principle holds: compare what is inside the file, not what it is called. Names lie. Hashes do not.
