Is a checksum infallible?

Asked by rexpresso (922

) May 21st, 2010

Thinking about the checksums that are used to confirm the correct transfer of files through networks… I don’t know much about this but what I know lead me to this doubt… let’s say I have a bitmap image (hypothetical case) of 100000×100000 pixels all white and I do a checksum. If I paint one pixel black certainly the file will only be minimally changed. How sure can we be that the checksum will be different? Thanks!

Observing members: 0

Composing members: 0

10 Answers

No, they are not. However, because checksums generate a long string the chances of a collision are extremely small, and I do mean extremely. If you paint one pixel black the checksum will be different. If you’re very paranoid, you can use SHA checksums, to lower the chances further, but this isn’t something that conventional users need to worry about. The chances of a collision in file checksumming are low to the point of ridiculous.

gggritso (5459

)“Great Answer” (5

) Flag as…

To continue @gggritso excellent explanation, the checksum is ALWAYS designed to change based on a single bit difference. Single bit changes are the most common types of errors.

You can sort of think of checksums like an error amplifier.

That is, if I have a string of binary digits and a simple checksum (summing), say:

11011010

We add all the digits (get 101), modulo 1, and get 1.

If we flip a bit, so we have
11111010
add the digits (get 110), modulo 1, we get 0. It doesn’t matter what (single) bit we flip, we get the same checksum.

This checksum fails if an even number of bits has been flipped, of course:
111111110, checksums to 1 again.

However, MD5, which is a rather complicated checksum, if we give it the same values:

11011010 => 2a7f fe43 a66d a06b 7b84 0ea6 bf0e de76
11111010 => 3e05 4a70 3f7d dd8f d41e 69a8 d594 048d
11111110 => a033 8fb4 29c0 3d5b 58a2 f43d 87ee fa40

So MD5 detects both our single & double bit flips.

Make any sense?

grumpyfish (6657

)“Great Answer” (6

) Flag as…

^ I agree with what @gggritso and @grumpyfish said about catching single bit errors and stuff…

I thought I’d chime in with a counter-argument about how (incredibly) likely errors can actually be: http://lambda-diode.com/opinion/ecc-memory

timtrueman (5765

)“Great Answer” (3

) Flag as…

relevant wiki page here.

arpinum (1989

)“Great Answer” (0

) Flag as…

If you ever have a mapping from A to B where B is smaller than A, its not infallible. However, as long as B is large enough (compared to A) it will be good enough for practical purposes.

In the case of images, as I understand it, two images with the same checksum will look nothing alike and will require so many bit flips that its unlikely that you’ll get the same checksum with a different image.

roundsquare (5522

)“Great Answer” (1

) Flag as…

Checksums are useful if only you’re talking about small files and you have limited resources. I used them to check the boot data when I was working with embedded processors and I had to work inside of a 2K stack frame. They are by no means infallible. A better algorithm is CRC64, cyclic redundancy check.

IchtheosaurusRex (8676

)“Great Answer” (0

) Flag as…

Going a bit further on picking your brains (thanks!) basically there’s no way to be 100% sure that a 5gb file on another computer is the same as on my computer without comparing everything bit by bit, is that right?

rexpresso (922

)“Great Answer” (0

) Flag as…

@rexpresso , you cannot be 100% sure, but CRC has a very low error rate, even with very large files. Checksum reliability degrades as the file size increases.

IchtheosaurusRex (8676

)“Great Answer” (0

) Flag as…

@rexpresso Just checked back in a week later =)

Correct, for 100.000% accuracy, you must compare bit-by-bit, but also check your comparison stack bit-by-bit for things like cosmic ray bit flips, etc.

One option is to pull md5 sums (or some other hash) on chunks of the data, twice using different slices. That is, run it once with 100K slices, then again with 66K slices. CHANCES are that you won’t have a collision on both (as collisions are rather rare)

Edited to add: My other thought is this: if you’re looking to detect random accidental changes, a checksum is a good way to go. If you’re trying to detect a deliberate attack, you may need to use some other system.

Far more efficient than my slice system above, would simply be to run two different hashes against it. E.g., md5 and SHA1 or something—to my knowledge no one has disocvered a collision that works on more than one hash.

grumpyfish (6657

)“Great Answer” (0

) Flag as…