Legal Technology News - E-Discovery and Compliance Blog

June 10, 2009

One More Thing You'll Wish You'd Never Heard About Search

I don't think I've ever met Perlustro CTO Jim Baker; but, now-and-again he calls my mobile and talks for hours at a blistering pace belying his easy southern drawl.  He’s a smart fellow who’s been around the block, and he's so steeped in the world of electronic data and computer forensics, listening to him is like trying to drink from a fire hose.  It's all I can do to keep up.

Perlustro is the never-heard-of-'em company responsible for a world-famous product: ILook, billed as "the most widely distributed law enforcement, military, and intelligence agency multifunction computer forensic examination system in the world."  Indeed, until recently, only law enforcement and agency types could use the ILook tools; but now they are emerging as commercial products to compete head on with the likes of EnCase, FTK, ProDiscover and X-Ways Forensics.

Amidst a million other tidbits, Jim shared an "I-can-hardly-believe-it's-true" flaw impacting indexing engines.  If you're doing e-discovery by keyword searching indices of client data--and who isn’t?--you need to have this issue on your radar.  Or, you may wish you kept your head in the sand with everybody else.

Here's where I need to offer a disclaimer.  I haven't independently run tests to confirm that Jim's report is true, but the source is solid, and the logic seems unassailable.

My readers know that I'm always flogging the importance of using tools that understand the data they're searching and deliver reliable recursion through nested, encoded data--tools that go as deep as is needed to get to the responsive information.  Plus, good tools need to flag what they skip and flub with as much clarity as they report what they accomplish. 

That's quite a mouthful.  Let's break it down.

If you open up Windows Notepad, write a note and save it, your system encodes the data in about the simplest form for electronic information storage around, called ASCII text or, more simply "plain text."  There won't be any bolding or italics, and you can forget about fancy stuff like colorful fonts, diàĉŗiŧicāl marks or (every lawyer's much-beloved) section symbols. § § §. Just...plaintext.

ASCII text was designed to be stored using just 7 binary digits or “bits.”  That means any ASCII character fits in a single 8-bit “byte” of data: 7 ones or zeroes for the character itself, with the eighth bit used for parity to provide a “bit” of assurance that nothing got lost in transmission.  This 7-bit encoding allowed for 128 unique characters (2^7 = 128).  Because ASCII text has been around for about 50 years, the first 32 “non-printing” characters are reserved for antiquated teletype functions like “carriage return” and “bell.”  The remaining characters are dedicated to basic punctuation, the numbers 0-9 and the English alphabet in upper and lower case.
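For the curious, here is a small Python sketch of that arithmetic.  It is my own illustration, not anything from Jim: the function name with_even_parity is made up, and even parity is just one of the parity conventions that were in use.

```python
# Illustrative sketch: a 7-bit ASCII code with an even-parity bit in the eighth position.
# The function name and the choice of even parity are assumptions for this example.

def with_even_parity(ch: str) -> int:
    """Return the 8-bit value for a 7-bit ASCII character plus an even-parity bit."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
    ones = bin(code).count("1")
    parity_bit = ones % 2  # set only when needed to make the total count of ones even
    return (parity_bit << 7) | code

ascii_size = 2 ** 7                                   # 128 possible characters
control_chars = [c for c in range(ascii_size) if c < 32]

print(ascii_size, len(control_chars))
print(format(with_even_parity("A"), "08b"))  # 'A' is 1000001: two ones, parity bit stays 0
print(format(with_even_parity("C"), "08b"))  # 'C' is 1000011: three ones, parity bit set to 1
```

Note that a section symbol never makes it through: it has no 7-bit ASCII code, so the function raises an error instead of guessing.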

If all the ESI we had to search were plain text, e-discovery would be easier.  Not easy, mind you, because the volume of data would still be great, and our language still complex; but, easier.

Alas (and contrary to common presumption), little of what we collect and search in e-discovery is plain text.  We just think of it as text because that’s how it appears to us within applications. 

Instead, ESI is encoded in many different ways, and it’s quite common for these encoded objects to be nested like Russian matryoshka dolls: a Word document inside a Zip archive attached to an e-mail message within a compressed Outlook PST container file residing on an encrypted volume.  Each nested object is encoded differently from its parent and child objects, and even within the body of a single document or file, encoding changes as content changes. 
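To make the nesting concrete, here is a minimal Python sketch that builds a Zip inside a Zip in memory and then recurses down to the leaves.  It handles only Zip containers, and the helper names (walk_container, make_zip) are hypothetical; a real processing tool has to do the same dance for PSTs, e-mail attachments, encrypted volumes and everything else.

```python
import io
import zipfile

def walk_container(data: bytes, path: str = "root"):
    """Yield (path, bytes) for every leaf item, descending into nested Zip archives."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for name in zf.namelist():
                # Each member may itself be a container, so recurse.
                yield from walk_container(zf.read(name), f"{path}/{name}")
    else:
        yield path, data

def make_zip(members: dict) -> bytes:
    """Build an in-memory Zip from a {name: bytes} mapping (test fixture only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    return buf.getvalue()

inner = make_zip({"memo.txt": b"privileged and confidential"})
outer = make_zip({"attachments.zip": inner, "readme.txt": b"see attached"})

leaves = dict(walk_container(outer, "message"))
for leaf_path in leaves:
    print(leaf_path)
```

A tool that stops at the first layer sees one opaque attachment; only the recursion reaches memo.txt.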

And lest we forget, foreign alphabets employ many more than our paltry 26 letters.  In order to support Cyrillic and Arabic and CJK ideograms, the world needed far more than 128 or even the 256 unique characters made available by re-tasking the parity bit.  So, Unicode was born in 1991, and its various Unicode Transformation Formats or UTF encodings have gained wide acceptance, supporting enormous character sets.
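Here is a quick Python sketch of why this matters for search.  The same characters occupy different bytes under different encodings, so a tool that scans raw bytes for an ASCII keyword can miss the very same word stored as UTF-16.  The function name encoded_sizes and the sample strings are mine, chosen only for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of the same text under three Unicode encodings."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(encoded_sizes("§"))
print(encoded_sizes("warranty"))

# The section symbol has no 7-bit ASCII representation at all.
try:
    "§".encode("ascii")
    ascii_can_store_section_symbol = True
except UnicodeEncodeError:
    ascii_can_store_section_symbol = False

# A naive byte-level search for the ASCII keyword misses the UTF-16 copy,
# because UTF-16 places a zero byte between the letters.
keyword = "warranty".encode("ascii")
stored = "limited warranty applies".encode("utf-16-le")
found_in_raw_bytes = keyword in stored
found_after_decoding = "warranty" in stored.decode("utf-16-le")
print(found_in_raw_bytes, found_after_decoding)
```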

Whew.  Are we having fun yet?

When an e-discovery tool processes encoded ESI, it must first apply the proper filter to the data to convert it to plain text so that it can be indexed.  If the data is encoded in multiple ways, multiple filters must be applied in the correct sequence to recurse (that is, cycle through) all the different forms of encoding to reach any textual content.  If no filter or the wrong filter is applied along the way, the text isn’t indexed.  This can happen in several ways, e.g., the encoding isn’t recognized, the tool doesn’t support the encoding, the content isn’t text or the file is corrupted, encrypted or password protected.
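A toy Python sketch of filters in sequence, using three layers I picked for convenience (UTF-16 text, gzip compression, base64 transfer encoding).  The layer choice and the names wrap, RIGHT_ORDER, WRONG_ORDER and extract_text are assumptions for illustration, not a description of any vendor's pipeline.

```python
import base64
import gzip
import zlib

def wrap(text: str) -> bytes:
    """Bury text under three layers: UTF-16, then gzip, then base64."""
    return base64.b64encode(gzip.compress(text.encode("utf-16")))

def decode_utf16(data: bytes) -> str:
    return data.decode("utf-16")

# Filters must undo the layers in the reverse of the order they were applied.
RIGHT_ORDER = [base64.b64decode, gzip.decompress, decode_utf16]
WRONG_ORDER = [gzip.decompress, base64.b64decode, decode_utf16]

def extract_text(blob: bytes, filters):
    """Apply each filter in turn; return None if any layer cannot be undone."""
    data = blob
    try:
        for apply_filter in filters:
            data = apply_filter(data)
    except (OSError, ValueError, EOFError, zlib.error):
        return None
    return data

blob = wrap("indemnify and hold harmless")
print(extract_text(blob, RIGHT_ORDER))
print(extract_text(blob, WRONG_ORDER))
```

Same data, same filters; only the sequence differs, and in the wrong sequence nothing reaches the index.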

If an indexing engine’s text extractor doesn’t understand the encoding of the file, it may default to applying the most common textual encoding schemes to the unrecognized content in a last-ditch effort to find intelligible text.  So, we’d expect a sweep of the data for ASCII certainly, but also other common encoding schemes like UTF-7 or UTF-8.
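That last-ditch sweep might look something like the following Python sketch.  The candidate list and its order are my assumptions; real extractors use their own lists and smarter heuristics, and a decode that succeeds is still no guarantee the result is the text the author wrote.

```python
def sniff_text(data: bytes, candidates=("ascii", "utf-8", "utf-16")):
    """Try each candidate encoding in order; return (encoding, text) for the first that decodes."""
    for encoding in candidates:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None, None  # nothing worked: this belongs on an exceptions list

print(sniff_text(b"plain old text"))
print(sniff_text("Section § 12".encode("utf-8")))
print(sniff_text("naïve".encode("utf-16")))
print(sniff_text(b"\xff\xfe\xfd"))
```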

When a file expected to yield text doesn’t, e.g., because the text extractor can’t interpret the encoding or can’t access the contents, such failure mustn’t be ignored.  Accordingly, e-discovery tools must account for exceptions, and workflows should be designed to follow up on and rectify exceptions.  That is, someone who understands encoding and has the tools and skills to extract the text needs to ensure that text makes it into the index.
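In code, accounting for exceptions can be as simple as refusing to swallow them.  Here is a minimal Python sketch under stated assumptions: build_index and utf8_extract are hypothetical names, the "index" is a bare word-to-document mapping, and the extractor is a stand-in that only understands UTF-8.

```python
def build_index(documents: dict, extract):
    """Index each document's words; record every failure instead of skipping it silently."""
    index, exceptions = {}, []
    for doc_id, blob in documents.items():
        try:
            text = extract(blob)
        except Exception as err:
            exceptions.append((doc_id, f"{type(err).__name__}: {err}"))
            continue
        if not text.strip():
            # Extraction "succeeded" but produced nothing: still worth a human look.
            exceptions.append((doc_id, "no text extracted"))
            continue
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index, exceptions

def utf8_extract(blob: bytes) -> str:
    """Stand-in extractor that understands UTF-8 and nothing else."""
    return blob.decode("utf-8")

documents = {
    "a.txt": b"Breach of contract",
    "b.bin": b"\xff\xfe\x00",
    "c.txt": b"   ",
}
index, exceptions = build_index(documents, utf8_extract)
print(sorted(index))
for doc_id, reason in exceptions:
    print(doc_id, "->", reason)
```

The point is the second return value: every file that didn't make it into the index lands on a list someone has to work through.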

All of this is background to understanding why what you think you're doing may never have been done.  What’s the word for that in law and medicine?  Oh!  Yes!  Malpractice.

Okay, maybe it’s not malpractice if you don’t know you’re ignoring potentially responsive material; but, what do you call it once you know your tools aren’t effective and you keep using them?

This is where Jim Baker throws the proverbial turd in the punchbowl.

Two of the most common indexing engines around are Microsoft’s Index Server and Windows Search.  These tools do a reasonable job teasing text out of the most common productivity formats, like Microsoft Word and Excel files.  But, Jim points out that it’s common for users bumping up against Excel’s row limit to export their big spreadsheet to an Access database, such that the Excel .XLS file becomes an Access .MDB file.  One would think that would be no problem, because if the search tool can pull text from the Excel spreadsheet, surely it can pull the same text from the Access database. Right?

Except that apparently it can't.

In its wisdom, Microsoft implemented Windows Search such that it only indexes text from Access MDB files when it's encoded in the customary textual formats.  Were the text from the Excel spreadsheet stored in the Access database as text, no worries.  But Jim says it’s not text.  It’s what’s called OLE content (for Object Linking and Embedding and pronounced “o-lay”—just pretend you’re at a bullfight).  Because the indexer dives in looking only for text, it doesn’t apply the filter required to see encoded text in the embedded OLE/spreadsheet data. 

It won’t trigger an error message, and the MDB won’t appear on any exceptions list.  Instead, the system will report that it successfully indexed the Access MDB when, in fact, it missed everything you intended to index.
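Since the indexer won't confess, one crude way to catch this kind of silent failure is to compare how much text came out of each file against how big the file is.  The Python sketch below is only that, a sketch: flag_low_yield, the thresholds and the sample figures are all hypothetical, and sensible cutoffs will vary by file type.

```python
def flag_low_yield(records, min_ratio=0.01, min_size=4096):
    """Flag files reported as indexed that yielded suspiciously little text for their size.

    records: iterable of (name, file_size_bytes, indexed_char_count).
    The thresholds are illustrative assumptions, not industry standards.
    """
    suspects = []
    for name, size, chars in records:
        if size >= min_size and chars / size < min_ratio:
            suspects.append(name)
    return suspects

# Made-up figures for illustration only.
records = [
    ("budget.xls", 500_000, 120_000),   # plenty of text for its size
    ("budget.mdb", 2_000_000, 350),     # "indexed successfully" but nearly empty
    ("note.txt", 200, 0),               # too small to judge by ratio
]
print(flag_low_yield(records))
```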

And that’s just one tiny example of how the search engines and text extraction tools in common use today are not doing what lawyers and litigants naïvely believe they do. 

The problem is compounded because the parade of vendors in the e-discovery space encourages consumers to believe that there is comparable differentiation among the underlying technologies.  It doesn’t help that so many claim to use super-duper proprietary tools developed by crack in-house software engineers and coders.  For most, that’s just so much poppycock (from the Dutch pappekak: pap=mush and kak=dung).  Oh heck, let’s just use the Texas technical term for it: BULLSHIT!

At the heart of so many “proprietary EDD technologies” lie just a tiny handful of the same text extraction and indexing engines, sharing the same limitations and the same flaws.  Limitations and flaws that are understood…and by convention and expedience, simply, silently ignored.

Tick, tick, tick, tick.  That’s the countdown to sanctions, when our complacent reliance on wobbly text extractors, indexers and indexing methods blows up in our face.  It’s not enough to introduce quality control just to keyword selections.  You must be sure that your indices truly reflect the informational content of your collections. 

But there’s hope.  It’s not too late to ask hard questions, test your tools and audit your outcomes.  Remember that in most e-discovery efforts, it’s a fiction to claim that you ran keyword searches against the collection.  You ran them against the index; and if the index is a dud, the search is no better.
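One way to test your tools is to plant a unique, made-up token in sample files of each type you expect to encounter, index them, and then see whether the index gives every one of them back.  A minimal Python sketch, assuming the index can be represented as a mapping from lowercase terms to sets of document names (audit_index and the sample tokens are hypothetical):

```python
def audit_index(index: dict, planted: dict) -> list:
    """Return the documents whose planted token the index fails to return.

    index:   maps lowercase term -> set of document names (an assumed, simplified shape).
    planted: maps document name -> a unique token known to be inside that document.
    """
    return sorted(
        doc for doc, token in planted.items()
        if doc not in index.get(token.lower(), set())
    )

# The spreadsheet's token made it into the index; the database's token did not.
index = {"zebra7741": {"sheet.xls"}}
planted = {"sheet.xls": "ZEBRA7741", "sheet.mdb": "QUOKKA5502"}
print(audit_index(index, planted))
```

Anything that comes back from the audit is a file type your index can't be trusted to search.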
