Scanning and Hashing: Unraveling a Case of Hidden ESI
While helping counsel understand what amounted to an obstructive data dump and ferret out undisclosed electronic collections, I recently tried to contrast having multiple scans of a document in an electronic collection with having multiple copies of the same scan. I was surprised at how tough it was to articulate the difference clearly, and it’s a critical distinction for de-duplication and, in this instance, for exposing chicanery.
In paper collections, it’s common to see the same document again and again, originating with the various custodians who had copies. A human reviewer can look at two independent TIFF images of the same document, note obvious differences such as marginalia, then quickly conclude, “They’re the same document.” Subtle changes may be overlooked; but in practice, the process rarely hinges on subtlety.
But scan the very same document multiple times, even using the same equipment and exercising utmost care in positioning the paper on the glass, and each scan will have a different hash value from the last. A hash calculation is such an exacting fingerprint of electronic data that the most minuscule, invisible variation inherent to the scanning process will generate different hash values. Like snowflakes...or the phenotypes of identical twins.
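To make the sensitivity concrete, here is a minimal sketch in Python. It assumes SHA-256 as the hash (the hash algorithm used in the actual matter isn't specified here), and the byte strings are hypothetical stand-ins for pixel data, not real TIFF files. One "scan" differs from the other by a single bit, while the "copy" is byte-for-byte identical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the pixel data of one scan (not a real TIFF).
scan_a = bytes(1000)

# A file copy of that scan: byte-for-byte identical.
copy_of_a = bytes(scan_a)

# A second pass over the same page: one pixel value differs by one bit.
rescan = bytearray(scan_a)
rescan[500] ^= 1
scan_b = bytes(rescan)

# Copies of one scan share a hash; separate scans do not.
same_scan_copied = sha256_hex(scan_a) == sha256_hex(copy_of_a)
rescanned_matches = sha256_hex(scan_a) == sha256_hex(scan_b)

print("copy matches original:", same_scan_copied)
print("rescan matches original:", rescanned_matches)
```

The copy matches and the rescan does not, even though a reviewer looking at the two images would see no difference at all.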
Now, I expect you could run optical character recognition against each digitally unique scan of identical documents and arrive at matching hash values for the OCR, but success will hinge on variables like skew, scan lag and image speckling. I haven’t tested this theory, but I suspect there’s enough inherent variation in OCR that you’d need some well-crafted quality assurance mechanisms to ensure this approach meets expectations.
The bottom line is that trying to use hash matching to cull duplicates of the same document scanned multiple times is a losing proposition.
The point I made about the data dump collection was that a high percentage of the TIFF image files produced shared matching hash values with other TIFF images in the collection. These TIFF matches couldn’t have come about coincidentally, i.e., from multiple custodians having the same paper document. The only way this could happen for paper documents is if the same scanned image is copied and produced, over and over again, falsely attributed to multiple custodians or to multiple paper copies from the same custodian. When you consider how paper collections from multiple custodians dovetail into electronic collections, these tens of thousands of hash matches fail the smell test. Suddenly, the claim that the source was paper and they don’t have native electronic counterparts of the documents starts to fall apart like a two-bit suitcase in the rain.
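For anyone who wants to see how such matches surface, here is a rough sketch of the kind of check involved. It is an illustration only, not the tool used in this matter: SHA-256 is an assumed choice of hash, and the function names are hypothetical. It hashes each file and reports only the hash values shared by more than one file.

```python
import hashlib
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(paths):
    """Map each hash shared by two or more files to the sorted list of those files."""
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[file_sha256(path)].append(str(path))
    return {h: sorted(ps) for h, ps in by_hash.items() if len(ps) > 1}
```

Pointed at a production folder (for example, the paths from `pathlib.Path("production").rglob("*.tif")`, where the folder name is made up), every group returned is a set of files that are copies of one scan, whatever custodian each is attributed to.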
But even as I’m writing this, I wonder if I’ll be able to make it clear enough for the court and counsel to understand what’s afoot. Thoughts?





I think the best way to make it clearer to others is to use an analogy to an old fax machine. Consider the case of someone who faxes the same color photograph to multiple people by feeding the photo into an old fax machine multiple times. Even if the person doing the faxing is sending to the same machine, no two of the faxes will be identical. The color-to-gray adjustment will be a bit off, the squareness of the photo edge will be a little different, and the speed of the photo through the scanner will result in minor variances in stretching on the recipient's end. Even if these variables match perfectly, external light or shadow will often cause blocks of pixels to have different jaggedness of their edges, and changes in the dust on the photo or sending machine will make some pixels appear in different areas. At a typical 100 x 200 dpi resolution, two 8½ x 11 faxes would need to match across just under 2 million pixels to be identical. No changes in ambient lighting. No movement of any dust particle larger than 1/100th of an inch. Perfect alignment.
A modern document scanner is really no different from an old fax machine. Although the precision has been improved, the resolution has been improved, too. A 1600 dpi scanner will need to match about 239 million pixels in order to identically reproduce an 8½ x 11 image. For a second scan to have a chance at producing the same hash value -- no pixel change -- lighting would need to be identical, document placement would need to match within 1/1600th of an inch, and no dust could be in the system that's any bigger than that size.
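The pixel counts are easy to check. The short sketch below (the function name is just for illustration) multiplies the page dimensions in inches by the resolution in each direction, assuming one sample per pixel and a full letter-size page.

```python
def pixel_count(width_in, height_in, dpi_x, dpi_y):
    """Number of pixels needed to cover a page at the given resolution."""
    return round(width_in * dpi_x) * round(height_in * dpi_y)


# Old fax machine: 8.5 x 11 inches at 100 x 200 dpi.
fax_pixels = pixel_count(8.5, 11, 100, 200)

# Modern scanner: 8.5 x 11 inches at 1600 dpi in both directions.
scanner_pixels = pixel_count(8.5, 11, 1600, 1600)

print(f"fax:     {fax_pixels:,} pixels")      # 1,870,000
print(f"scanner: {scanner_pixels:,} pixels")  # 239,360,000
```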
A different hash value indicates that somewhere in those quarter billion pixels, at least one is different. An identical hash value indicates that the images are from the same scan, or that the different people who scanned the documents at different times aligned the documents and dust specks within 1/1600th of an inch of precision, using the exact same scanner (not just the same model of scanner). It's clear which one really happened.
Posted by: TravisL | September 11, 2008 at 02:40 PM