A perfect time for a new beginning
EDD continues to be top of mind for everyone, and the timing for this blog could not be any more perfect. After all, this time of year has always been marked by new beginnings. The New York Rangers are about to start their quest for the Stanley Cup and the Yankees look forward to adding yet another pennant to their collection. :) Schools are open for a new school year, and some even believe the world came into being around this time many moons ago…
Speaking of the world, international electronic discovery is one of the most recent, and perhaps the most complicated and uncertain, areas of the industry. In addition to the already complex issues associated with domestic EDD, we now have to worry about data (and metadata) in multiple languages, as well as our ability to index and search documents written in non-Latin (Asian, Cyrillic, etc.) characters. The notion of Unicode compliance (i.e., the ability of discovery databases to store, index and search the characters of all languages) is moving to the top of database developers' priorities, but in the opinion of this author not fast enough: Unicode-compliant products are very few, and most of them are far from perfect.
Add to this already lethal mix the complications of collecting and mining data in the EU and Asia, which stem from data protection and privacy regulations as well as customs restrictions, and you get very fertile ground for many interesting discussions.
We certainly look forward to them!




Well, true. No software is perfect. It's been that way in the industry since LTS and BRS.
That said, I am quite certain MetaLINCS is fully Unicode compliant and can auto-detect and categorize languages without having to be told in advance which language the data is in. With analytics on top of full categorization, threading and deduping of the foreign language data, it's a powerful tool.
As an evangelist, just my two cents.
Posted by: Mark Reichenbach | October 01, 2007 at 02:09 PM
While no software package is perfect, more will be required to deal with multiple language data than recognizing Unicode characters and throwing analytics on top of it.
It's key to have the different components work hand in hand throughout the entire process:
- data processing prowess (extraction modules fully capable of recognizing not only full text but also the encoding of specific metadata fields, encoding variations, and dictionary libraries)
- how much and how fast: what is the computing power of the indexer?
- filtering in a language-agnostic manner: if there are several languages in the collection, can the system handle a single search-term query list instead of combining results from multiple runs to manage recall? Is there a single dictionary library (better) or several?
- auto-categorization: how sophisticated is it really? The majority of cases bring mixed or multiple character set challenges: can a threshold be set in the software so that the language categorization algorithms order and score the results? (Can you set the language score order if only the parent email is in Italian but the attachment is in English? What if the email thread is partly Italian / German (30% / 60%) but the attachment drafts are 100% English?)
- can the platform normalize what is indexed? Can the indexing software filter out binary artifacts ("gibberish") generated by logos, graphics and image imprints embedded in a document?
- how does the system handle scanned paper and OCR-extracted content? Can you seamlessly incorporate OCR results into the indexer (do you have Unicode OCR/OWR engines plugged into the indexer)?
- how does the review platform manage UTF character sets (Java, PHP, .NET, HTML)? How well does the system handle end-user input and present results so that it plays flawlessly with the back end?
- how well do the review system's concept and analytics components fare with pictographic scripts, where phrases cannot be tokenized on white space?
- is reliable machine translation available?
- can the system properly handle TIFF / PDF rendering of Unicode characters at production time? Can Unicode endorsements be managed?
In the coming years we will see the industry come of age in managing data in different languages and character sets, some sooner and better than others.
Posted by: Gyorgy Pados | October 01, 2007 at 06:01 PM
Full support for the languages of the world involves many complex technical issues.
First, there are all of the different character encoding systems used around the world. These all need to be recognized and transformed to Unicode, wherever these encodings might occur in content and metadata.
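As a rough illustration of that first step, here is a minimal Python sketch of turning raw bytes into normalized Unicode. The function name `to_unicode` and its fallback order (declared encoding first, then UTF-8, then a lossy decode) are assumptions made for this example, not how any particular EDD product works; real pipelines typically also run statistical encoding detection.

```python
import unicodedata
from typing import Optional, Tuple


def to_unicode(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Decode raw bytes to NFC-normalized text; return (text, encoding used).

    `declared` is the encoding named in the file's metadata, if any
    (for example a MIME charset header). Hypothetical helper for illustration.
    """
    candidates = [declared] if declared else []
    candidates.append("utf-8")  # assumed default when nothing else works
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
        # NFC makes equivalent sequences (e + combining accent vs. é) compare equal.
        return unicodedata.normalize("NFC", text), encoding
    # Last resort: keep what we can and flag the loss for later review.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


if __name__ == "__main__":
    print(to_unicode("Привет".encode("cp1251"), declared="cp1251"))
```

Returning the encoding alongside the text lets a later stage flag documents that had to be decoded lossily.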
The next challenge is tokenizing the content, i.e. breaking it up into words, numbers, and other sensible units. As westerners, we think this is easy: just look for the spaces and other punctuation. But take a look at languages like Chinese and Japanese, where there are typically no spaces, and you'll begin to appreciate the problem.
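To make the contrast concrete, here is a small Python sketch, offered as one possible approach rather than the way any given product does it. It splits space-delimited text into words but indexes runs of Chinese or Japanese characters as overlapping character bigrams, a common fallback when no dictionary-based segmenter is available. The character ranges and the name `tokenize` are assumptions chosen for the example, and the ranges cover only the most common CJK blocks.

```python
import re

# Assumed coverage: Hiragana, Katakana and the main CJK Unified Ideographs block.
CJK = r"\u3040-\u30ff\u4e00-\u9fff"
# A token is either a run of CJK characters or a run of non-CJK word characters.
TOKEN_RE = re.compile(rf"[{CJK}]+|[^\W{CJK}]+")
CJK_START_RE = re.compile(rf"[{CJK}]")


def tokenize(text):
    """Return index tokens: lowercased words, plus bigrams for CJK runs."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        run = match.group()
        if CJK_START_RE.match(run):
            if len(run) == 1:
                tokens.append(run)
            else:
                # Overlapping bigrams: "東京都" -> "東京", "京都"
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.append(run.lower())
    return tokens


if __name__ == "__main__":
    print(tokenize("Review the 契約書 today"))
```

Bigrams trade precision for recall: "東京都" (Tokyo Metropolis) also yields "京都" (Kyoto), which is exactly the kind of false hit a proper segmenter is meant to avoid.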
Then we get into the whole area of linguistic analysis, starting from something simple like stemming (e.g., "going" --> "go"), moving into more complex features like identifying noun phrases (which form the basis for many notions of "concept" used in EDD). These functions are all language-specific.
A good system needs to identify the language(s) in which each document has been written, both for internal reasons like applying the correct linguistic analysis rules, and for the benefit of users who may wish to search, review or translate content in a specific language and so need to identify the documents using that language.
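A very rough first pass at this can be sketched in Python by measuring which writing systems a document uses. This is only a stand-in for real language identification: script is not language (Italian, German and English all come out as LATIN), and production systems generally rely on statistical models instead. The function names and the 20% default threshold below are assumptions made for the example.

```python
import unicodedata
from collections import Counter


def script_shares(text):
    """Return {script: share of letters}, using the first word of each
    character's Unicode name (e.g. LATIN, CYRILLIC, CJK) as its script."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {script: n / total for script, n in counts.items()}


def scripts_above(text, threshold=0.2):
    """List scripts whose share meets the (assumed) threshold, largest first."""
    shares = script_shares(text)
    hits = [s for s, share in shares.items() if share >= threshold]
    return sorted(hits, key=lambda s: -shares[s])


if __name__ == "__main__":
    print(script_shares("Hello мир"))
```

Even this crude measure is enough to route a document to the right tokenizer or to flag mixed-script files for closer inspection.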
Integration with the UI is another challenging area. Consider query term highlighting and entering queries in different languages. Right-to-left languages like Arabic and Hebrew present unique challenges in this area.
Fortunately, these are all well-understood technical issues and good products exist that address them effectively. Continuing the unabashed evangelism, if you need a fully I18n-capable EDD tool, check out MetaLINCS.
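One small piece of the right-to-left problem can be sketched in Python: deciding which way a query box or result snippet should be laid out. The sketch below uses the "first strong character" heuristic, which follows the spirit of the Unicode Bidirectional Algorithm's paragraph-level rule. The name `base_direction` and the left-to-right default are assumptions for the example, and full bidirectional rendering involves far more than this.

```python
import unicodedata


def base_direction(text, default="ltr"):
    """Guess layout direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-type and Arabic-type letters
            return "rtl"
        if bidi == "L":  # Latin, Cyrillic, CJK and other left-to-right letters
            return "ltr"
    # Digits, spaces and punctuation alone carry no strong direction.
    return default


if __name__ == "__main__":
    print(base_direction("שלום world"))
```

A review UI could use the result to set the text direction of an input field or snippet before rendering it.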
Posted by: Chuck Williams | October 07, 2007 at 09:41 PM