Does an identical cryptographic hash or checksum for two files mean they are identical?

checksum

I have two Excel documents and I want to check whether they are exactly the same, apart from the file name.

For example, the files are called fileone.xls and filetwo.xls. Apart from the file names, their contents are presumed to be identical, but this is what I want to check.

I've been looking for ways to check this without installing a bunch of plugins. There doesn't seem to be a straightforward way.

I've tried generating MD5 hashes for both files. When the hashes are identical, does this mean that the file contents are 1:1 the same?

Best Answer

When the hashes are identical, does this mean that the file contents are 1:1 the same?

All files are a collection of bytes (values 0-255). If two files' MD5 hashes match, those two collections of bytes are extremely likely to be exactly the same (same order, same values).
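As a minimal sketch of that check, the Python below hashes both files with MD5 and, as a cross-check, also compares them byte for byte. The helper names (file_md5, same_md5, same_bytes) and the chunk size are illustrative choices, not part of the question; only hashlib and filecmp from the standard library are used.

```python
import filecmp
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_md5(path_a, path_b):
    """True if both files produce the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)


def same_bytes(path_a, path_b):
    """True if both files have identical contents, byte for byte."""
    # shallow=False forces a content comparison instead of trusting os.stat data.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Calling same_md5("fileone.xls", "filetwo.xls") answers the hash question; same_bytes on the same two paths removes even the tiny collision doubt, since it compares the actual bytes.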

There's a very small chance that two files can generate the same MD5, which is a 128 bit hash. The probability is:

Probability of just two hashes accidentally colliding is 1/2^128, which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456. (from an answer on StackOverflow.)
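The size of that number can be reproduced with a few lines of Python, since its integers have arbitrary precision. This is only an arithmetic illustration of the quoted figure; the variable names are arbitrary.

```python
from fractions import Fraction

# A 128-bit hash has 2**128 possible values.
hash_space = 2 ** 128
print(f"{hash_space:,}")
# 340,282,366,920,938,463,463,374,607,431,768,211,456

# Chance that one specific pair of different files collides by accident.
collision_chance = Fraction(1, hash_space)
print(float(collision_chance))
```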

Hashes are meant to work in "one direction only" - i.e. you take a collection of bytes and get a hash, but you can't take a hash and get back a collection of bytes.

Cryptography depends on this (it's one way two things can be compared without knowing what those things are.)

Around the year 2005, methods were discovered to create two documents that have the same MD5 hash (a collision attack). See @user2357112's comment below. This means an attacker can create two executables, for example, that have the same MD5, and if you are depending on MD5 to determine which one to trust, you'll be fooled.

Thus MD5 should not be used for cryptography or security. Publishing an MD5 on a download site to vouch for download integrity is a bad idea, for example. What you want to avoid is depending on an MD5 hash you did not generate yourself to verify file or data contents.

If you generate your own, you know you're not being malicious to yourself (hopefully). So for your use, it's OK, but if you want someone else to be able to reproduce the check, and you want to publish the hash publicly, use a stronger hash such as SHA-256.
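If the hash is going to be published, one possible stronger choice is SHA-256, which Python's hashlib also provides. The sketch below assumes SHA-256 is acceptable for your purpose; file_sha256 and matches_published are hypothetical helper names, and the published digest is assumed to be a hex string.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """True if the file's SHA-256 equals a published hex digest.

    Surrounding whitespace and letter case in the published value are ignored.
    """
    return file_sha256(path) == published_hex.strip().lower()
```

The person receiving the file would run matches_published on their copy with the digest you published, ideally obtained over a channel the attacker cannot also modify.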


Note that it's possible for two Excel files to contain the same values in the same rows and columns, but for the bytestream of the file to be completely different due to different formatting, styles, settings, etc.

If you want to compare the data in the files, export each one to CSV with the same rows and columns first, to strip out all formatting, and then hash or compare the CSVs.
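Once both sheets are exported, the CSVs can also be compared as parsed rows rather than as raw bytes, which ignores harmless differences such as optional quoting or line endings. This is a rough sketch using Python's csv module; the function names are made up for illustration, and the UTF-8 encoding is an assumption that depends on how the files were exported.

```python
import csv


def read_rows(path, encoding="utf-8-sig"):
    """Parse a CSV file into a list of rows (each row is a list of strings)."""
    # Assumption: the export is UTF-8; "utf-8-sig" also tolerates a leading BOM.
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle))


def first_difference(path_a, path_b):
    """Return the 1-based number of the first differing row, or None if equal."""
    rows_a, rows_b = read_rows(path_a), read_rows(path_b)
    for number, (row_a, row_b) in enumerate(zip(rows_a, rows_b), start=1):
        if row_a != row_b:
            return number
    if len(rows_a) != len(rows_b):
        # One file is a prefix of the other; the first extra row differs.
        return min(len(rows_a), len(rows_b)) + 1
    return None
```

A result of None from first_difference on the two exported files means every cell matches as text; a number points at the first row worth inspecting.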
