Over There: Where Angels Have No Fear to Tread
I am late to the party in discussing the case of Digicel et al v. Cable & Wireless, et al. Others, including the extraordinary Chris Dale and the magnificent Sharon Nelson, long ago put their stamp on the case. The peripatetic Sultan of Search, Jason Baron, even guest blogged it for the prolific Ralph Losey. But as it was decided "Over There," and Sir Andrew Lloyd Webber hasn't set it to music, I paid it little heed.
But lately, I'm obsessed with sensible ways to improve keyword searches and practical means to test searches before they're trotted out against vast swaths of ESI.
Mr. Justice Morgan's opinion is the rare case where a jurist closely analyzed the efficacy and burden of particular keywords for electronic search--an undertaking that U.S. Magistrate Judge John Facciola artfully characterized as a fool's errand for lawyers and judges. Still, once we change the "esses" to "zeds," there's much we Yanks can learn from the Digicel decision.
Digicel is a fight between mobile phone service providers operating in seven Caribbean markets and the island phone companies to which they're obliged to interconnect in order to offer service. The claimants sought damages for the defendants' alleged foot dragging in offering interconnection.
The Defendants unilaterally selected and deployed ten keywords in their electronic searches; to wit: Digicel, interconnect, interconnection, licence, liberalise, liberalisation, strategy, competing, competitor, competition.
Let’s consider what’s amiss with these terms and where they can be tweaked to improve performance:
Digicel: Wouldn’t you think it likely that some would place a second “L” at the end of the name? At the very least, I’d test to assess that potential before running a broad search.
Interconnect and Interconnection: In the manner searched, would there be any occurrence of “interconnection” that didn’t overlap with a hit for the root “interconnect?” How would occurrences that word wrap with hyphenation be handled?
Licence: Doubtless the English spelling is preferred in the collections searched, but it’s wise to anticipate that some may have employed the American spelling “license” and to search for it as well. A wildcard character will fill the bill without hurting precision.
Liberalise and Liberalisation: Wouldn’t a search for the root “liberali!” make more sense in that it would grab both variants along with the American spelling? It won’t hit on “liberal”, so it’s not likely to be significantly noisier.
Strategy: Perhaps they were seeking to steer clear of “strategic;” but here again, using the root makes more sense. Additionally, the word “strategy” is prone to transposition error. If it’s a crucial term, it’s wise to search for “startegy,” too. A test against a sample helps decide.
Competing, Competitor and Competition: Doesn’t stemming make more sense here? If you don’t want to grab items with “compete,” then just use the stem “competi!.”
With a little thought, our list of ten terms becomes six:
Digicel!
Interconnect!
Licen*e
Liberali!
Strateg!
competi!
This list doesn’t cure every potential problem I mention, but it’s a better, faster approach that won’t materially boost the cost of review.
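For readers who want to see what the revised terms would and wouldn't catch, here is a minimal Python sketch. It rests on an assumption about syntax: that "!" expands a root to any ending and "*" stands in for exactly one character. Search tools differ on this, so check your own tool's documentation. The function names are mine, purely for illustration.

```python
import re

# The six revised terms from the list above.
REVISED_TERMS = ["Digicel!", "Interconnect!", "Licen*e",
                 "Liberali!", "Strateg!", "competi!"]

def term_to_regex(term):
    """Translate a search term into a case-insensitive regex.

    Assumed syntax (varies by tool): "!" = zero or more trailing
    word characters; "*" = exactly one word character.
    """
    parts = []
    for ch in term:
        if ch == "!":
            parts.append(r"\w*")
        elif ch == "*":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)

def matching_terms(word):
    """Return the revised terms that match a single word in full."""
    return [t for t in REVISED_TERMS if term_to_regex(t).fullmatch(word)]

if __name__ == "__main__":
    for word in ["Digicell", "license", "licence", "liberal",
                 "liberalization", "strategic", "startegy", "compete"]:
        print(word, "->", matching_terms(word))
```

Under these assumptions, "liberal" and "compete" match nothing, which is the point of stopping the roots where they stop. Neither does the transposed "startegy," one of the problems the list leaves uncured; it would need its own term.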
With misgivings, the Court went on to require Defendants to run several other search terms against a much broader collection on the theory that, had the Defendants worked cooperatively with the Claimants at the outset, the additional terms should have been run along with those discussed above: delay, frustra*, impede and obstruct.
I'm not so sure.
Though the Defendants did a good job acquainting the Court with empirical data about the terms they ran, I see no indication that they undertook any testing to demonstrate to the Court that the proposed search terms were painfully overbroad. Had they done so using a sampling of the data to be searched, I think the Court would have listened and ruled differently.
The term "delay" is particularly problematic. On average, when run across an array of file types, the term "delay" can be expected to generate thousands of false hits for every potentially relevant one. No one's established a threshold "false hit ratio" to disallow a keyword as unacceptably imprecise, but plowing through many thousands of hay straws for a single needle can't be what the Court intended.
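To make the idea of a "false hit ratio" concrete, here is a rough Python sketch that counts false and true hits for a term across a small, hand-reviewed sample. The sample documents and the relevance calls are invented for illustration; they are not drawn from the case or from my own testing, and the function name is hypothetical.

```python
import re

# Hypothetical hand-reviewed sample: (text, reviewer marked it relevant?)
SAMPLE = [
    ("Please delay the interconnection agreement as long as possible.", True),
    ("Flight delay: your departure is now 4:15 PM.", False),
    ("Set the retry delay to 500 ms in the config file.", False),
    ("Sorry for the delay in replying to your lunch invitation.", False),
    ("Quarterly sales figures attached.", False),
]

def false_hit_ratio(term, sample):
    """Return (false_hits, true_hits) for a whole-word, case-insensitive term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    true_hits = 0
    false_hits = 0
    for text, relevant in sample:
        if pattern.search(text):
            if relevant:
                true_hits += 1
            else:
                false_hits += 1
    return false_hits, true_hits

if __name__ == "__main__":
    false_hits, true_hits = false_hit_ratio("delay", SAMPLE)
    print(f"{false_hits} false hits for {true_hits} relevant hit(s)")
```

A real test would use a far larger sample drawn from the collection to be searched, but the arithmetic presented to the Court is the same.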
Why should you believe me as to the burden? Because I tested it. Obviously, I don't have any of the parties' data, but I can test it against data from, e.g., other telecommunications providers or a bare Windows installation or a Caribbean concern and prove that, in each instance, including the word "delay" without some Boolean restraint or other limitation will rake in huge numbers of irrelevant hits.
That's the power of testing search terms against sample data. It equips you with the single most effective persuasion tool you can bring to court: credibility.
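The "Boolean restraint" mentioned above could take many forms. One is a proximity limit, sketched here in Python on the assumption that "delay" only matters when it falls within a few words of an "interconnect" variant. The ten-word window, the sample documents and the function name are all hypothetical; the right restraint for a real matter is something to settle by testing.

```python
import re

def within_words(text, first, second_root, max_gap=10):
    """True if the word `first` occurs within `max_gap` words of any
    word beginning with `second_root` (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w.startswith(second_root)]
    return any(abs(i - j) <= max_gap for i in first_pos for j in second_pos)

# Hypothetical sample documents, invented for illustration.
SAMPLE = [
    "We should delay their interconnection request until next quarter.",
    "Shipping delay on the new office chairs.",
    "Adjust the delay timer on the voicemail system.",
    "Interconnection pricing is attached for your review.",
]

# Unrestrained: every document containing the whole word "delay".
unrestrained = [d for d in SAMPLE if re.search(r"\bdelay\b", d, re.IGNORECASE)]

# Restrained: "delay" must sit near an "interconnect" variant.
restrained = [d for d in SAMPLE if within_words(d, "delay", "interconnect")]

if __name__ == "__main__":
    print(len(unrestrained), "unrestrained hits;", len(restrained), "restrained hits")
```

Running both versions against the same sample and comparing the counts is the sort of side-by-side showing a court can weigh.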




Craig -- Your suggestions here are solid. Practitioners must get in the habit of testing their terms before running the big scale searches and, importantly, choosing vendors whose systems (and pricing structure) are most compatible with this kind of approach to production.
--Chaumette
Posted by: David Chaumette | July 01, 2009 at 06:51 PM