Legal Technology News - E-Discovery and Compliance Blog


September 01, 2008

Scanning and Hashing: Unraveling a Case of Hidden ESI

While helping counsel understand what amounted to an obstructive data dump and ferret out undisclosed electronic collections, I recently tried to contrast having multiple scans of a document in an electronic collection with having multiple copies of the same scan. I was surprised at how hard it was to articulate the difference clearly, yet it’s a critical distinction for de-duplication and, in this instance, for exposing chicanery.

In paper collections, it’s common to see the same document again and again, originating with the various custodians who had copies. A human reviewer can look at two independent TIFF images of the same document, note the obvious differences and marginalia, then quickly conclude, “They’re the same document.” Subtle changes may be overlooked, but in practice the process rarely hinges on subtlety.

 

But scan the very same document multiple times, even using the same equipment and exercising utmost care in positioning the paper on the glass, and each scan will have a different hash value than the next. A hash calculation is such an exacting fingerprint of electronic data that the most minuscule, invisible variation inherent to the scanning process will generate different hash values. Like snowflakes...or the phenotypes of identical twins.
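To make the point above concrete, here is a minimal sketch in Python. It assumes SHA-256 as the hash (the post does not name an algorithm) and uses made-up bytes as a stand-in for image data, not a real TIFF. One "rescan" differs from the original by a single bit; the other file is a byte-for-byte copy.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical stand-in for the bytes of one scanned image file.
scan_a = bytes(range(256)) * 64

# A second "scan" of the same page: identical except for one flipped bit,
# standing in for a speck of dust or a hair's-breadth shift on the glass.
rescan = bytearray(scan_a)
rescan[1000] ^= 0x01
scan_b = bytes(rescan)

# A byte-for-byte copy of the first scan file.
copy_of_a = bytes(scan_a)

print(sha256_hex(scan_a) == sha256_hex(copy_of_a))  # True: copies of one scan match
print(sha256_hex(scan_a) == sha256_hex(scan_b))     # False: a second scan does not
```

The two inputs are the same length and differ in one bit out of roughly 131,000, yet their digests do not match. Only the copy produces the same hash.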

 

Now, I expect you could run optical character recognition against each digitally unique scan of identical documents and arrive at matching hash values for the OCR, but success will hinge on variables like skew, scan lag and image speckling. I haven’t tested this theory, but I suspect there’s enough inherent variation in OCR that you’d need some well-crafted quality assurance mechanisms to ensure this approach meets expectations.
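The untested idea above might be sketched roughly as follows. This is only an illustration under stated assumptions: the OCR output strings are invented, the `text_fingerprint` helper is hypothetical, and the normalization (collapse whitespace, lowercase) is one arbitrary choice among many. It shows why quality assurance would matter: layout noise can be normalized away, but a single misread character still changes the hash.

```python
import hashlib
import re

def text_fingerprint(ocr_text: str) -> str:
    """Hash OCR text after collapsing whitespace and lowercasing (hypothetical helper)."""
    normalized = re.sub(r"\s+", " ", ocr_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Invented OCR output from three scans of the same page.
ocr_a = "Meeting  Notes\nQ3 revenue was $1.2 million."
ocr_b = "Meeting Notes\n\nQ3 revenue was $1.2 million. "  # spacing differs only
ocr_c = "Meeting Notes\nQ3 revenue was $l.2 million."     # digit 1 misread as letter l

print(text_fingerprint(ocr_a) == text_fingerprint(ocr_b))  # True
print(text_fingerprint(ocr_a) == text_fingerprint(ocr_c))  # False
```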

 

The bottom line is that trying to use hash matching to cull duplicates of the same document scanned multiple times is a losing proposition.

 

The point I made about the data dump collection was that a high percentage of the TIFF image files produced shared matching hash values with other TIFF images in the collection. These TIFF matches couldn’t have come about coincidentally, i.e., from multiple custodians having the same paper document. The only way this could happen for paper documents is if the same scanned image is copied and produced over and over again, falsely attributed to multiple custodians or to multiple paper copies from the same custodian. When you consider how paper collections from multiple custodians dovetail into electronic collections, these tens of thousands of hash matches fail the smell test. Suddenly, the claim that the source was paper and they don’t have native electronic counterparts of the documents starts to fall apart like a two-bit suitcase in the rain.
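The kind of check described above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool actually used in the matter: SHA-256 is an assumed choice of hash, `group_identical` is a hypothetical helper, and the file names and contents are invented placeholders.

```python
import hashlib
from collections import defaultdict

def group_identical(files):
    """Map each digest shared by two or more files to the sorted names of those files."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return {digest: sorted(names) for digest, names in groups.items() if len(names) > 1}

# Invented production set: file name -> file bytes (placeholders, not real TIFFs).
production = {
    "custodian_A/0001.tif": b"image-bytes-1",
    "custodian_B/0417.tif": b"image-bytes-1",  # same bytes, attributed to another custodian
    "custodian_C/0093.tif": b"image-bytes-2",
}

duplicates = group_identical(production)
for names in duplicates.values():
    print(names)
```

On a real production, the dictionary would be built by reading each image file from disk. Every group that comes back is a set of files that are copies of a single scan, whatever custodian each is attributed to.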

 

But even as I’m writing this, I wonder if I’ll be able to make it clear enough for the court and counsel to understand what’s afoot.  Thoughts?


Comments

I think the best way to make it clearer to others is to use an analogy to an old fax machine. Consider the case of someone who faxes the same color photograph to multiple people by feeding the photo into an old fax machine multiple times. Even if the person doing the faxing is sending to the same machine, none of the faxes will be identical. The color-to-gray adjustment will be a bit off, the squareness of the photo edge will be a little different, and the speed of the photo through the scanner will result in minor variances in stretching on the recipients' end. Even if these variables match perfectly, external light or shadow will often cause blocks of pixels to have different jaggedness of their edges, and changes in the dust on the photo or sending machine will make some pixels appear in different areas. At a typical 100 x 200 dpi resolution, two 8½ x 11 faxes would have to match across just under 2 million pixels (850 x 2,200 = 1.87 million) to be identical. No changes in ambient lighting. No movement of any dust particle larger than 1/100th of an inch. Perfect alignment.

A modern document scanner is really no different than an old fax machine. Although the precision has been improved, the resolution has been improved, too. A 1600 dpi scanner will need to match 239 million pixels in order to identically reproduce an 8½ x 11 image. In order to produce a document that would have a chance at producing the same hash value -- no pixel change -- lighting would need to be identical, document placement would need to match within 1/1600th of an inch, and no dust could be in the system that's any bigger than that size.
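The pixel counts quoted here can be checked with simple arithmetic. A small sketch follows; the `pixel_count` helper is a hypothetical name, and the page is assumed to be 8.5 x 11 inches.

```python
def pixel_count(width_in, height_in, dpi_x, dpi_y):
    """Total pixels for a page of the given size at the given resolution."""
    return round(width_in * dpi_x) * round(height_in * dpi_y)

fax_pixels = pixel_count(8.5, 11, 100, 200)     # old fax, 100 x 200 dpi
scan_pixels = pixel_count(8.5, 11, 1600, 1600)  # 1600 dpi scanner

print(f"{fax_pixels:,}")   # 1,870,000
print(f"{scan_pixels:,}")  # 239,360,000
```

Every one of those pixels would have to come out the same for two independent scans to share a hash.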

A different hash value indicates that somewhere in those quarter billion pixels, at least one is different. An identical hash value indicates that the images are from the same scan, or that the different people who scanned the documents at different times aligned the documents and dust specks within 1/1600th of an inch of precision, using the exact same scanner (not just the same model of scanner). It's clear which one really happened.
