Technology Assisted Review:
Three Streams Flowing Into One River
Dexter Associates, LLC
The litigation world is undergoing something of a ‘Sea Change’ with regard to the Discovery Process. Driven by the advent of Electronically Stored Information (ESI), Discovery is slowly transforming from:
“… part of the pre-trial litigation process during which each party requests relevant information and documents from the other side in an attempt to “discover” pertinent facts. Generally discovery devices include depositions, interrogatories, requests for admissions, document production requests and requests for inspection1.”
to:
“… refers to any process in which electronic data is sought, located, secured, and searched with the intent of using it as evidence in a civil or criminal legal case. E-discovery can be carried out offline on a particular computer or it can be done in a network. Court-ordered or government sanctioned hacking for the purpose of obtaining critical evidence is also a type of e-discovery.”2
Now, while Discovery’s traditional definition and devices are still valid and enforceable, they are slow, tedious and expensive. Electronic Discovery (eDiscovery) shows promise in reducing the volume of documents to be reviewed, thus saving time and money.
Several commercial businesses providing litigation support have sprung up in the past ten (10) years. This paper examines three (3) technologies through which eDiscovery may be conducted and examines pertinent issues brought about by eDiscovery.
Current eDiscovery technologies
Predictive Coding is currently in vogue, due in large part to the Da Silva Moore3, Global Aerospace4 and Kleen Products5 cases currently being heard in their respective courts. Each of these cases involves Predictive Coding, but they vary in implementation, as explained by Brandon D. Hollinder on the eDiscovery Blogspot on April 25, 2012:
- “In Da Silva Moore, the parties initially agreed to use predictive coding (although they never agreed to all of the details) and Magistrate Judge Peck allowed its use. Plaintiffs have since attacked Judge Peck and most recently formally sought his recusal from the matter. That request is currently pending.
- Global Aerospace Inc., et al, v. Landow Aviation, L.P. dba Dulles, is the most recent case to address predictive coding, and it goes a step further than Da Silva Moore. In Global Aerospace, the defendants wanted to use predictive coding themselves, but plaintiffs objected. Virginia County Circuit Judge James H. Chamblin, ordered that Defendants could use predictive coding to review documents. Like Da Silva Moore, the court did not impose the use of predictive coding, rather, the court allowed a party to use it upon request.
- Kleen Prods., LLC v. Packaging Corp. of Am. goes the furthest, and is perhaps the most interesting of the three predictive coding cases because it is different than Da Silva Moore and Global Aerospace in one very important way: the plaintiffs in Kleen are asking the court to force the defendants to use predictive coding when defendants review their own material. The court has yet to rule on the issue.”
All of the vendors in the recent Gartner 2012 ‘Leaders Quadrant’ utilize Predictive Coding in their products. Essentially, Predictive Coding:
- start[s] with a set of data, derived or grouped in any number of variety of ways (e.g., through keyword or concept searching);
- use[s] a human-in-the-loop iterative strategy of manually coding a seed or sample set of documents for responsiveness and/or privilege;
- employ[s] machine learning software to categorize similar documents in the larger set of data;
- analyze[s] user annotations for purposes of quality control feedback and coding consistency.6
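The steps above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a naive Bayes word model written from scratch; the seed documents, labels, and function names are hypothetical, and commercial products use their own, undisclosed methods.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(seed_docs):
    """Build word counts from (text, label) pairs coded by a human reviewer.

    Assumes the seed set contains at least one document of each label.
    """
    word_counts = {"responsive": Counter(), "not_responsive": Counter()}
    doc_counts = Counter()
    for text, label in seed_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["responsive"]) | set(word_counts["not_responsive"])
    return word_counts, doc_counts, vocab

def score(text, model):
    """Return the estimated probability that a document is responsive."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_post = {}
    for label in word_counts:
        log_p = math.log(doc_counts[label] / total_docs)  # prior from seed set
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in the seed set are ignored
                # Add-one smoothing avoids zero probabilities.
                log_p += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        log_post[label] = log_p
    top = max(log_post.values())
    weights = {k: math.exp(v - top) for k, v in log_post.items()}
    return weights["responsive"] / sum(weights.values())

# Hypothetical seed set, manually coded for responsiveness.
seed = [
    ("pricing agreement with competitor", "responsive"),
    ("competitor pricing memo", "responsive"),
    ("friday lunch menu", "not_responsive"),
    ("office party friday", "not_responsive"),
]
model = train(seed)

# Rank the unreviewed documents so likely-responsive ones are reviewed first.
unreviewed = [
    "lunch menu for friday",
    "draft pricing agreement",
    "competitor memo on pricing",
]
ranked = sorted(unreviewed, key=lambda d: score(d, model), reverse=True)
```

In an iterative workflow, reviewers would code the top of `ranked`, add those decisions to `seed`, and retrain.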
Speaking purely from the technological perspective, Predictive Coding is merely an application of proven Bayesian7 Statistical theory. In this case, it is used to reduce the volume of document files selected for discovery. The fundamental hypothesis driving the search may vary depending on which side is formulating it, but the model doesn’t really care. It will produce new data for repetitive inclusion in subsequent runs (called the “wash, rinse, repeat” cycle by Sharon Nelson8) regardless of which side is conducting the search on the same corpus of information. Thus, it would appear the sides are almost arguing over how many angels can dance on the head of a pin.
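The Bayesian idea of updating an estimate as each run adds evidence can be shown with a small worked example. The sketch below assumes a Beta-Binomial model of the share of responsive documents in the corpus; the review counts are invented for illustration and do not come from any of the cited cases.

```python
def update(prior, responsive, not_responsive):
    """Beta-Binomial update: add the observed counts to the prior pseudo-counts."""
    alpha, beta = prior
    return (alpha + responsive, beta + not_responsive)

def mean(dist):
    """Expected share of responsive documents under a Beta(alpha, beta) belief."""
    alpha, beta = dist
    return alpha / (alpha + beta)

belief = (1, 1)  # uniform prior: no opinion before any review

# Hypothetical review rounds: (responsive, not responsive) in each sample of 100.
review_rounds = [(12, 88), (8, 92)]
for responsive, not_responsive in review_rounds:
    belief = update(belief, responsive, not_responsive)

estimated_share = mean(belief)  # 21 / 202, roughly 10 percent
```

The same arithmetic applies whichever party runs the review; only the sample coding decisions differ.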
Does Predictive Coding work? Viewed from a strictly Bayesian modeling perspective, yes: such models have been in use for years. Viewed from the Predictive Coding/litigation perspective, the smaller number (volume) of documents produced for relevancy coding does cut time compared to manual review. However, the question should really be posed as: “Are the hypotheses producing the desired results?” That is a question that surrounds each and every case involving discovery. The Defendant may certainly dump hundreds or even thousands of files to swamp Plaintiff; but could judicial sanctions be far behind in such circumstances? Conversely, each time Plaintiff requests additional time to continue refining its hypothesis, claims of ‘fishing trips’, and then judicial sanctions, may not be far behind.
Cost Effectiveness over time – As the volume of ESI increases, cost effectiveness decreases as ‘experienced attorneys’ must review each document for relevancy, thus slowing the overall process. Is there an upper limit to the ESI volume before Predictive Coding methods become too expensive and too slow (aka “Return to GO and collect $200”)?
Quality control in assembling seed groups, preparing runs and assessing the relevancy of produced documents is not a priority. One vendor, Compiled Services, LLC, suggests mistakes are frequently made in the electronic discovery process:
“What “mistakes” are we talking about? In the electronic discovery process of collecting, preserving, de-duplicating, filtering, culling, and reviewing ESI, each stage represents opportunities for errors ranging from entering improper date ranges to failure to accurately enter specific format requirements for tools utilized in downstream stages of the process. Like an assembly line, each step in the discovery process has its own issues related to configuration, setting parameters, calibrating specifications, and tired multi-tasking humans responsible for monitoring every aspect of billions of pieces of data. That is to say, quality is an overwhelming challenge for a discovery process that is dealing with ever-growing volumes of data with each passing day. The opportunity for minor mistakes, oversight, or simple carelessness comes from the fallible nature of people who simply cannot guarantee 100 percent focus and attention to such massive quantities of information day after day after day.”9
When it comes to defects (or ‘mistakes’), I prefer Dr. W. Edwards Deming’s first principle:
“Create constancy of purpose toward improvement of product and service”10
when dealing with human interaction with automated processes. In other words, no one involved in the project should permit a defect to be inserted into the end product. It’s an ‘attitude thang’ that should permeate each and every member of the organization conducting eDiscovery.
Unknown scalability – Is there an upper limit to the size of the corpus of documents to be searched? How many runs are too many, making the search too burdensome to complete? It is far too early to treat scalability as an issue, but it should be something in the back of people’s minds as the technique becomes ubiquitous.
Whereas Predictive Coding applies statistical analysis to identify a group of documents that may have relevant case information, it is not the be-all/end-all. Even with the explosive growth in Electronically Stored Information, surely there is a better means to obtain legal information in a concise manner. For that matter, what humans call ‘information’ is merely ‘data’ (actually 1s and 0s) to a computer. How can these two parts of the equation overcome what is essentially a communications roadblock? Enter the Semantic Web.
The Semantic Web is not one single concept, but a natural outgrowth of the original World Wide Web (WWW) design. Sir Tim Berners-Lee, the creator of web 1.0, describes it as “an extension of the Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”11 It comprises a set of design principles and technologies (see illustration12) to formalize the representation of meaning between humans and computers.
[Figure: Generalized ontology structure]
To Adam Zachary Wyner13, the aspect of the Semantic Web most pertinent to this discussion is the concept of ‘ontology’14 or vocabulary. This vital tool is used to create a unique and formalized structure, syntax and vocabulary for the ‘world’ it describes, in this case the legal world. Comprised of several inter-operating layers, an ontology filters and translates human-machine understanding. The fundamental component upon which the ontology is based is the eXtensible Markup Language (XML). XML’s purpose is to transport and store data, with focus on what data is15. XML is used to create file descriptors, known as ‘tags’, using the syntax:
<tagname> descriptor content </tagname>
<moviename> Star Trek </moviename>
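As a small illustration of how software reads such tags, the sketch below parses a record with Python’s standard `xml.etree.ElementTree` module. The enclosing `<document>` element and the `custodian` and `created` tags are hypothetical additions for the example; only `moviename` comes from the text above.

```python
import xml.etree.ElementTree as ET

# A hypothetical record: one file described by three descriptor tags.
record = """
<document>
  <moviename> Star Trek </moviename>
  <custodian> J. Smith </custodian>
  <created> 2011-03-14 </created>
</document>
"""

root = ET.fromstring(record)

# Map each tag name to its content, trimming the surrounding whitespace.
tags = {child.tag: child.text.strip() for child in root}

for name, value in tags.items():
    print(f"{name}: {value}")
```

Because the tag names travel with the data, a program can ask for `tags["custodian"]` without knowing anything else about the file.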
These tags can thus provide an additional, richer description of the contents of the file. Existing near the ‘bottom’ of the Semantic Web ‘stack’, XML is hardware and software agnostic. The components between the XML and User Interface & Applications layers are the formal ontologic languages RDF, OWL and SPARQL.
- RDF (Resource Description Framework) provides the foundation for publishing and linking data.
- OWL (Web Ontology Language) is used to build vocabularies.
- SPARQL is the query language for the Semantic Web.
These languages, derived from World Wide Web Consortium (W3C) recommendations, provide consistent means of information interchange between vocabularies. While they do provide a formal, standardized way of sharing ontology information, making this sharing work at the human level is left to the tools comprising the layers above them.
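The division of labor among these languages can be pictured with a toy example. RDF expresses facts as subject-predicate-object triples, and a SPARQL query matches patterns against them. The sketch below imitates that idea in plain Python; it is not an RDF or SPARQL implementation, and every identifier in it (`doc:1001`, `ex:mentions`, and so on) is invented for illustration.

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
triples = {
    ("doc:1001", "rdf:type", "ex:Email"),
    ("doc:1001", "ex:author", "person:jsmith"),
    ("doc:1001", "ex:mentions", "ex:PricingAgreement"),
    ("doc:1002", "rdf:type", "ex:Memo"),
    ("doc:1002", "ex:author", "person:jsmith"),
    ("doc:1003", "rdf:type", "ex:Email"),
    ("doc:1003", "ex:mentions", "ex:PricingAgreement"),
}

def match(pattern, store):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]

def subjects(predicate, obj, store):
    """Return every subject that has the given predicate and object."""
    return {s for s, _, _ in match((None, predicate, obj), store)}

# Roughly what a SPARQL query such as
#   SELECT ?d WHERE { ?d rdf:type ex:Email ; ex:mentions ex:PricingAgreement . }
# asks for: every email that mentions the pricing agreement.
emails_about_pricing = (
    subjects("rdf:type", "ex:Email", triples)
    & subjects("ex:mentions", "ex:PricingAgreement", triples)
)
```

Here `emails_about_pricing` contains `doc:1001` and `doc:1003`; the memo is excluded because it fails the type pattern.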
Vocabularies may be found at two (2) different levels of detail:
- Core, which models general concepts believed to be central to the understanding of a world (e.g., law)
- Domain, which focuses upon the representation of more specific areas (e.g., copyright) and is thus built for particular applications.
Building a core ontology covering all aspects of legal theory, practice, precedents, etc. is quite out of the question today. However, domain specific vocabularies covering more specific areas are not only possible, but are being considered by some European organizations.
Fortunately, we are not concerned about the ontological structure for the entire corpus of legal information. What we are focusing on is the body of information to be searched during eDiscovery. There are three (3) natural solutions to this thorny problem:
- The Defendant has previously compiled and constructed an ontology covering its entire documentation. This is the simplest solution and, thanks to the Sarbanes-Oxley (SOX) legislation, may already be in progress. Garrie and Armstrong16, while arguing the effects of SOX in light of the Zubulake17 decisions, make note of the following:
“Prior to Sarbanes-Oxley most public and private companies in industries other than financial services and healthcare did not have to comply with burdensome legally mandated data retention policies. Under Sarbanes-Oxley, however, public companies are distinguished from their private counterparts in that they must retain financial data in order to comply with the legislation. Not only are public companies forced to retain more data than private companies, but public companies are now required to maintain the data in an easily accessible manner.”
Companies have incorporated Semantic Web capability into Documentation Management (DM) systems. Jennifer Zaino reports:
“… the movement to include semantic capabilities as part of DM systems has already started, <George Roth, president and CEO of Recognos Inc.> says. He cites as an example Microsoft Sharepoint11 and the vendor’s $1.2 billion buy of search vendor FAST Search awhile back. “The shift to semantic search [for the enterprise] is happening big-time, and I think Microsoft is one of the leaders in this,” Roth says, even if Microsoft isn’t advertising the semantics behind its system.”18
- The Defendant hires a document management vendor, such as Recognos, to construct an ontology in response to litigation. Of course, the amount of time and costs associated with the effort would be subject to judicial review and approval.
- Revert to Predictive Coding or manual methods.
What does all this techno-babble mean to litigators? Well, Semantic Web searches, as seen simply from surfing the Web, are capable of identifying potentially valid Electronically Stored Information (ESI) very quickly. Given the increasing use of electronic storage, backup and the attendant ‘metadata’ (descriptor tags) associated with ESI, properly framed queries have a higher probability of identifying relevant material. Depending upon the detail of the vocabularies involved, it may also be possible to identify and select relevant material in a fraction of the time necessary to conduct Predictive Coding. Such ‘maybes’ are dependent upon the design of the vocabulary, the amount and content of descriptor tags and the overall implementation of a domain specific ontology.
The Semantic Web approach is not without drawbacks:
- Massive up-front costs in creating unique domain level vocabularies – a brief review of the W3C Case Studies shows the level of effort needed to construct a corporate wide ontology. Corporate organizations from General Counsel, Finance, Security, Human Resources and, of course, Information Technology must provide input, guidance and monitor the evolving design.
- Massive deployment costs associated with the corpus of existent ESI – a corporation’s Intranet may possess some of the necessary connections, and the corporation’s CobiT effort may provide more. However, implementing a full scale vocabulary is labor intensive.
- Formal language and syntax – the English language is full of ambiguity, which means humans need to learn the formal (structured) language and syntax of the W3C Web components.
Promising eDiscovery Technologies
Technologists know that, given time, technology follows Moore’s Law19. Newer, better, more accurate tools and methods will be introduced and replace the current ‘Top Dogs’. One only needs to view the Cellular/Mobile Communications sector for proof. Considering that Electronic Discovery has only been around for 10 years, Technology Assisted Review is younger still.
Already we see court cases in which one party or the other is complaining about eDiscovery costs. Indeed, while technology serves to reduce the cost (time and labor) of producing potentially valuable documents for review, the review itself accounts for upwards of 70% of the total cost of a given eDiscovery project.
New technology and methods are coming that should reduce costs due to human review.
Natural Language Processing
We’ve briefly examined Predictive Coding’s algorithmic/human review hybrid and the wrapping of document files in a formalized XML vocabulary. Wyner20 distinguishes between the Predictive Coding and Semantic Web methods as follows:
Predictive Coding is ‘knowledge light’ in that “… the processing presumes very little knowledge of the system or analyst”. Thus, when the statistical models are applied to the (often very) large population of documents, the contents are evaluated as meeting or not meeting query specifications.
Semantic Web is also ‘knowledge light’ but not to the same extent as Predictive Coding. In this method, wrapping a file with informational tags (metadata) does add some level of knowledge to the search. However, such searches are dependent upon the content of the tags, which, in turn, are dependent upon the knowledge the expert contributors bring to the design of the vocabularies.
However, there is a third method:
Natural Language Processing (NLP) is ‘knowledge heavy’ in that rather than search for similarities and/or differences or search amongst tags, we know what we are looking for and we examine the actual file content.
The sort of ‘Natural Language Processing’ we speak of here is not the sort of HAL 9000 computer interface where one speaks to a computer which responds in one’s own language. In this case, we are considering the myriad and literally uncountable words kept as Electronically Stored Information (ESI).
The written language of any culture is multi-dimensional in nature. Consider:
- Approximate age of the writing can be established by the syntax and lexicon of the writer. For example, the epic poem Beowulf is written in Old English, which contains characters and pronunciation not found in Modern English. Thus, someone from this era reading this alliterative poem in its original form faces a challenge as great as Grendel and his Mother.
- The intellectual level of the writer(s) may be derived from the document’s lexical density21. The higher the density, the more information is being communicated by the writer. Consider reading John Locke’s First Principles in one sitting. Not only does one need to wade through archaic language structures, the philosophical concepts themselves are hard to grasp.
- The tone of the document is identified by the word choices and within the context of other, related documents. This is especially true with electronic mail or social networks. For example, a serious peer reviewed document may contain examples where the author(s) hotly dispute another’s conclusions but never outright label those conclusions as “imbecilic”. In contrast, email or social web sites may contain language that would shame a sailor.
- Some say the writer’s gender may be discerned through the words, syntax and imagery contained in the document. For example, no one could confuse the authors’ genders when reading a Tom Clancy or Marion Zimmer Bradley novel, even if no information about the author were revealed.
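Of the dimensions listed above, lexical density is the simplest to compute: it is the share of content-carrying (lexical) words among all words in a text. The sketch below is a rough approximation that assumes a short, hand-picked list of English function words; a serious tool would use a part-of-speech tagger instead.

```python
import re

# A small, hand-picked set of function words (an assumption for this sketch).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "if", "of", "in", "on", "at",
    "to", "for", "with", "by", "from", "as", "is", "are", "was", "were",
    "be", "been", "it", "its", "this", "that", "these", "those", "he",
    "she", "they", "we", "you", "i", "not", "do", "does", "did", "have",
    "has", "had", "will", "would", "can", "could",
}

def lexical_density(text):
    """Return the fraction of words that are not function words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    lexical = [w for w in words if w not in FUNCTION_WORDS]
    return len(lexical) / len(words)

casual = "The cat sat on the mat"
dense = "Predictive coding reduces document review volume"
print(lexical_density(casual), lexical_density(dense))
```

With this word list, the first sentence scores 0.5 (three of six words carry content) and the second scores 1.0.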
Each of these dimensions requires an innate ability humans possess and computers can be ‘taught’ but never acquire themselves. The ability to ‘Comprehend’ or ‘Understand’ concepts contained within a document, communication or a series of documents, is the hallmark of Human Reasoning. The RAND Corporation has an excellent summary of the effort a human reader encounters when dealing with reading and comprehension:
“Comprehension does not occur by simply extracting meaning from text. During reading, the reader constructs different representations of the text that are important for comprehension. These representations include, for example, the surface code (the exact wording of the text), the text base (idea units representing the meaning), and a representation of the mental models embedded in the text”22.
It is the uniquely human ability to create constantly changing, internal models (or representations) of the text in order to fully comprehend the entirety. No computer in existence possesses the innate and inherent capability required to perform the task of comprehension. That is, unless and until, an application program (or series of programs) is built telling the computer EXACTLY how to accomplish it.
Legal Sector Implications
Whereas Predictive Coding and Semantic Web methods simply identify potentially valuable information, Natural Language Processing can be used to actually read the contents of a document and, eventually, assess that information for relevancy. It is this singular ability that leads to true lower costs for all concerned litigants.
“Natural Language Processing isn’t perfect yet: computers cannot understand human language. However, legal text is quite structured, and offers a lot more handholds for automated translation than, say, a novel”25.
Wyner and Peters have postulated what can be called an “interim solution” using semantic annotation within a document26.
“To analyse a legal case, legal professionals annotate the case into its constituent parts. The analysis is summarised in a case brief. However, the current approach is very limited:
- Analysis is time-consuming and knowledge-intensive.
- Case briefs may miss relevant information.
- Case analyses and briefs are privately held.
- Case analyses are in paper form, so not searchable over the Internet.
- Current search tools are for text strings, not conceptual information. We want to search for concepts such as for the holdings by a particular judge and with respect to causes of action against a particular defendant. (emphasis added)
With annotated legal cases, we can enable conceptual search.”
Conceptual analysis27 is a key NLP component. One example, in which conceptual analysis has been applied to detect plagiarism in student-submitted papers, is described by Dreher28. Granted, the systems selected by Dreher approach ‘discovery’ using relatively common string-by-string comparison methods, but such methods still require knowledge of language to identify relevant comparisons.
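To give a sense of what a string-by-string comparison involves, the sketch below measures how many three-word sequences two texts share. It is a generic word n-gram overlap measure (Jaccard similarity), chosen here for illustration; it is not the method used by the systems Dreher examined, and the sample sentences are invented.

```python
def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Share of n-word sequences common to both texts (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "the quick brown fox jumps"
suspect = "the quick brown fox sleeps"
similarity = jaccard(original, suspect)
```

For these two sentences the score is 0.5: two of the four distinct three-word sequences appear in both. A comparison like this sees only matching strings, which is why it cannot by itself recognize a paraphrase.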
Dr. Kathleen Dahlgren and her team at Cognition Technologies have taken a different and highly interesting NLP approach.
2Definition of electronic discovery (e-discovery or ediscovery) downloaded from http://searchfinancialsecurity.techtarget.com/definition/electronic-discovery on 12Jun03
3Monique da Silva Moore, et. al. v. Publicis Group SA, et al. Case No. 11-CV-1279 U.S. District Court for the Southern District of New York
4Global Aerospace Inc. v. Landow Aviation, L.P., No. CL 61040 (Va. Cir. Ct. Apr. 23, 2012) Circuit Court for Loudoun County
5Kleen Products, LLC, et. al. v. Packaging Corporation of America, et. al. Case No. 10-CV-05711, United States District Court, Northern District of Illinois
6M. Whittingham, E. H. Rippey and S. L. Perryman quoting Jason R. Baron, Law in the Age of Exabytes: Some Further Thoughts on ‘Information Inflation’ and Current Issues in E-Discovery Search, 17 RICH. J.L. & TECH. 9, 32 (Spring 2011) in Litigation Support Technology Review http://www.litigationsupporttechnologyandnews.com/2011/09/predictive-coding-e-discovery-game.html
7Bayesian statistics is an approach for learning from evidence as it accumulates. In clinical trials, traditional (frequentist) statistical methods may use information from previous studies only at the design stage. Then, at the data analysis stage, the information from these studies is considered as a complement to, but not part of, the formal analysis. In contrast, the Bayesian approach uses Bayes’ Theorem to formally combine prior information with current information on a quantity of interest. The Bayesian idea is to consider the prior information and the trial results as part of a continual data stream, in which inferences are being updated each time new data become available. Downloaded from Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials http://www.fda.gov/medicaldevices/deviceregulationandguidance/guidancedocuments/ucm071072.htm on 12Jun05
8“Predictive Coding: Dozens of Names, No Definition, Lots of Controversy”, Sharon D. Nelson, Esq downloaded from http://www.legalitprofessionals.com/Legal-Technology-Observer/predictive-coding-dozens-of-names-no-definition-lots-of-controversy.html on 12Jun13
9“Quality Control in the age of digital data”, Compiled Services, LLC White Paper, downloaded from http://www.compiledservices.com/download/eDiscovery-Quality-Control on 12Jun03
10Deming W.E., Out of the Crisis, Chapter 2, “Elaboration on the 14 Points”, Published by Massachusetts Institute of Technology, Center of Advanced Education Services, Cambridge, MA, 1986
11Casellas, N., “Semantic Enhancement of Legal Information… Are We Up for the Challenge?” downloaded from http://blog.law.cornell.edu/voxpop/2010/02/15/semantic-enhancement-of-legal-information%E2%80%A6-are-we-up-for-the-challenge/ on 12Jun13
13Wyner, A. Z. “Weaving the Legal Semantic Web with Natural Language Processing”, VoxPopuLII, 17May2010, retrieved from http://blog.law.cornell.edu/voxpop/2010/05/17/weaving-the-legal-semantic-web-with-natural-language-processing/ on 12Jun13
14Casellas, ibid – “ontology refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world.”
16Garrie, D.B. & Armstrong, M.J. “Electronic Discovery and the Challenge Posed by the Sarbanes-Oxley Act”, downloaded from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=743204 on 12Jun15
17Zubulake v. UBS Warburg LLC, 217 F.R.D. 309, 322 (S.D.N.Y. 2003)
18Zaino, J. “Semantic Tech’s On The Way to Document Management Systems”, downloaded from http://semanticweb.com/semantic-techs-on-the-way-to-document-management-systems_b24689 on 12Jun15
19Moore’s Law is a computing term which originated around 1970; the simplified version of this law states that processor speeds, or overall processing power for computers will double every two years.
21Williamson, G, from Lexical Density (1) lexical words (the so-called content or information-carrying words) and, (2) function words (those words which bind together a text). http://www.speech-therapy-information-and-resources.com/lexical-density.html
22RAND Corporation document “Defining Comprehension” https://encrypted.google.com/search?q=definition+of+understanding+text&ie=utf-8&oe=utf-8&client=ubuntu&channel=fs#hl=en&client=ubuntu&hs=6qa&channel=fs&sclient=psy-ab&q=definition+of+comprehension+in+reading&oq=definition+of+comprehension&gs_l=serp.1.1.0l220.127.116.11.718.104.22.168.0.0.0.0.0..0.0…0.0.SWx60ao1iGg&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=a8e35bc8121bc212&biw=1375&bih=784, retrieved on 12Jul09
23See: Science Daily pages on Artificial Intelligence and Cognition – http://www.sciencedaily.com/articles/computers_math/artificial_intelligence/
http://blog.law.cornell.edu/voxpop/tag/legal-natural-language-processing/ retrieved on 12Jul12
26Wyner, A. and Peters, W. “Semantic Annotations for Legal Text Processing using GATE Teamware” retrieved from http://wyner.info/LanguageLogicLawSoftware/index.php/2012/05/01/crowdsourced-legal-case-annotation/ on 12Jul12.
27The division of a physical or abstract whole into its constituent parts to examine or determine their relationship or value
28Dreher, H., “Automatic Conceptual Analysis for Plagiarism Detection”, Issues in Informing Science and Information Technology, Volume 4, 2007