Charting the Un-Discovered Country: Discovery About Discovery
Technology Assisted Review (“TAR”), also known as predictive coding, is increasingly
popular as a means of controlling discovery costs, especially with large
organizational defendants faced with the mounting costs of reliably searching
and producing hundreds of gigabytes – or even terabytes – of potentially
discoverable information. TAR proponents tout studies showing its ability
to simultaneously lower costs and increase search recall and precision (both
terms of art are discussed below). See Maura R.
Grossman & Gordon V. Cormack, Technology-Assisted Review in
E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual
Review, Rich. J.L. & Tech., Spring 2011, at 8-9 (available at http://jolt.richmond.edu/v17i3/article11.pdf). Indeed, advances in TAR
technology, when properly employed, can allow predictive
coding algorithms to deal effectively with situations that have complicated
the use of TAR in the past, such as low-richness data sets (data sets with a
low percentage of responsive documents), unrepresentative initial seed sets,
and other potential complications.
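To make “richness” concrete, the short Python sketch below computes it for a hypothetical collection and shows how many responsive documents a purely random sample would be expected to contain. The function names and all figures are illustrative assumptions and are not drawn from any particular TAR tool or case.

```python
def richness(responsive_docs: int, total_docs: int) -> float:
    """Fraction of the collection that is responsive (sometimes called prevalence)."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return responsive_docs / total_docs


def expected_responsive_in_sample(sample_size: int, rate: float) -> float:
    """Expected number of responsive documents in a simple random sample."""
    return sample_size * rate


# Hypothetical collection: 1,000,000 documents, 5,000 of them responsive.
rate = richness(5_000, 1_000_000)
print(f"richness: {rate:.2%}")
print(
    "expected responsive documents in a 2,000-document random sample: "
    f"{expected_responsive_in_sample(2_000, rate):.0f}"
)
```

On these assumed figures, a random sample of 2,000 documents would be expected to surface only about ten responsive examples, which suggests why a low-richness collection can be difficult to train on from a random sample alone.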
The numbers, however, tell only part of the story. TAR is not a
one-size-fits-all solution, and each search process must be individually
tailored to the data set under review. This tailoring (or “training”) can
be conducted in a number of different ways, but generally involves creating one
or more “seed set(s)” of documents, having a human reviewer review and code the
“seed” or “training” documents, and then feeding those coding decisions back to
the algorithm so it can learn the distinction between relevant and non-relevant
documents. Most training processes are iterative and some TAR tools
involve the correction of errors made by the algorithm. The training
process is repeated until the algorithm’s recall (the percentage of responsive
documents retrieved) and precision (the percentage of retrieved documents which
are responsive) are within acceptable limits. The “seed sets” themselves
can be generated in several ways, including hand-picking documents through
keyword searches or other means, reviewing a random sample of documents, and/or
using “intelligent learning” algorithms (“active learning”) to select
documents that would assist the algorithm in learning, independent of direct
human intervention. These techniques are not mutually exclusive, and different
predictive coding algorithms may make use of them either singly or in
combination, depending on the needs of the production and the characteristics
of the data set. As noted above, not all TAR tools and protocols are
created equal, so some diligence is needed in selecting a tool and implementing
a reasonable and defensible process.
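The recall and precision measures described above can be computed from a set of documents that a human reviewer has already coded. The Python sketch below (Python 3.9 or later) is a minimal illustration under assumed data: the document IDs, the 75% recall and 50% precision targets, and the function names are all hypothetical, and no court or vendor standard is implied.

```python
def recall(retrieved: set[str], responsive: set[str]) -> float:
    """Share of the responsive documents that the algorithm retrieved."""
    if not responsive:
        raise ValueError("no responsive documents to measure against")
    return len(retrieved & responsive) / len(responsive)


def precision(retrieved: set[str], responsive: set[str]) -> float:
    """Share of the retrieved documents that are actually responsive."""
    if not retrieved:
        raise ValueError("the algorithm retrieved no documents")
    return len(retrieved & responsive) / len(retrieved)


def within_limits(r: float, p: float, min_recall: float, min_precision: float) -> bool:
    """True when both measures meet the (assumed) agreed targets."""
    return r >= min_recall and p >= min_precision


# Hypothetical validation sample: documents a human reviewer coded as responsive.
responsive = {"D1", "D2", "D3", "D4"}

# Hypothetical documents the algorithm flagged after each training round.
rounds = [
    {"D1", "D5", "D6", "D7"},  # round 1: one of four flagged documents is responsive
    {"D1", "D2", "D3", "D5"},  # round 2: three of four flagged documents are responsive
]

for number, retrieved in enumerate(rounds, start=1):
    r = recall(retrieved, responsive)
    p = precision(retrieved, responsive)
    done = within_limits(r, p, min_recall=0.75, min_precision=0.50)
    print(f"round {number}: recall={r:.2f} precision={p:.2f} stop training={done}")
```

In this toy data the first round falls short of both assumed targets and the second meets them, mirroring the idea that training repeats until the measures are within acceptable limits. What counts as “acceptable” is a matter for the parties and the court, not something the code can settle.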
With TAR’s increasing popularity, the training process is being subjected to
increasing scrutiny, with a growing number of attorneys seeking “discovery
about discovery” to ascertain the methods used to train the TAR process – and
to identify any errors, omissions, or flaws in that process. In
particular, the creation and composition of the iterative “seed sets” used to
train the algorithm are often the subject of great interest to plaintiffs, who
may wish to review, and potentially provide input into, the generation of the
seed set and/or the conduct of the training process.
The courts are still struggling to determine how to approach the thorny issues
present at the intersection of broad discovery, the work product doctrine, the
attorney-client privilege, and the cooperation protocols enunciated by the
Sedona Conference. Some have expressed hesitation regarding discovery
about discovery, while others argue in favor of broad discoverability.
Notably, Judge Paul Grimm, writing in a law review, stated simply that, at
least in the context of record preservation, “[i]t is axiomatic that an
opponent may routinely obtain discovery of a client’s actions taken to
implement the duty to preserve information[,]” explaining that “[i]t is of no
moment that the…search was conducted at the direction of counsel. Parties
are permitted to inquire into an opponent’s efforts to preserve relevant
information[.]” Hon. Paul W. Grimm et al., “Discovery
About Discovery: Does the Attorney-Client Privilege Protect All Attorney-Client
Communications Relating to the Preservation of Potentially Relevant
Information?”, 37 U. Balt. L. Rev. 413 (2008) (available at: http://www.aporter.com/resources/documents/9_Grimm_et_al_Discovery_About_Discovery%5B1%5D%5B1%5D.pdf). While TAR is currently used
principally in support of document review and production efforts, it is hard to
see why the processes used by a party to identify responsive documents should
be provided greater protection than the processes used to identify the location
of potentially responsive documents.
A number of courts have relied on Rule 26(f) and the Sedona Conference’s
Cooperation Proclamation to permit discovery about discovery. For example,
courts have compelled disclosure of the data repositories (e.g., custodians and
sources) searched, as well as the search terms used to conduct that
sources) searched, as well as the search terms used to conduct that
search. See Am. Home Assurance Co. v. Greater Omaha Packing Co.,
Inc., No. 8:11-cv-270, 2013 U.S. Dist. LEXIS 129638, 2013 WL 4875997
(D. Neb. Sept. 11, 2013); Apple Inc. v. Samsung Electronics Co. Ltd.,
No. 12-cv-0630, 2013 U.S. Dist. LEXIS 67085 (N.D. Cal. May 9, 2013); Uelian
de Abadia-Peixoto v. U.S. Dept. of Homeland Sec., Civ. No. 11-04001 (N.D.
Cal. Aug. 23, 2013) (all compelling production of search terms); see
also Ralph Losey, More Courts Are Requiring Disclosure of Keywords,
E-Discovery Law Today (May 28, 2013) (available at: http://www.ediscoverylawtoday.com/2013/05/more-courts-are-requiring-disclosure-of-keywords/). Other courts have permitted
wider-ranging discovery on discovery in appropriate circumstances. For
example, in Ruiz-Bueno, III v. Scott, the federal district court
for the Southern District of Ohio found that, by refusing to provide discovery
on discovery, defendants had “fail[ed] to acknowledge the nuanced nature of
discovery.” No. 2:12-cv-0809, 2013 U.S. Dist. LEXIS 162953, 2013 WL
6055402 (S.D. Ohio Nov. 15, 2013). While noting that, ideally, the need
for discovery on discovery should be obviated by the Rule 26(f) planning
process, the Court nevertheless held that “[s]imply put, when plaintiffs
expressed some skepticism about the sufficiency of defendants’ efforts to
produce…defendants’ counsel should have been forthcoming with information…[t]hat
did not happen. The Court has the power…to make that happen now.”
The debate about discovery of the discovery process, as it relates to TAR,
primarily revolves around the “seed set.” While this terminology implies
a single “set” of documents, TAR programs are trained in many different ways,
and often make use of an iterative process with multiple “sets” of documents
coupled with human review and correction. As such, discovery about the
seed set should be viewed as discovery of the documents and processes used to
“train” the algorithm to recognize responsive and non-responsive
documents.
However, rather than address a deficiency after the fact as in Ruiz-Bueno,
potentially after the expenditure of substantial time and expense by both
parties, plaintiffs may be better off trying to obtain transparency and
cooperation in advance – either through agreement with defendants or by use of
a motion to compel cooperation. In Moore v. Publicis Groupe,
Magistrate Judge Andrew Peck avoided the need for “discovery about discovery”
by encouraging that the seed set be disclosed to plaintiffs as part
of the discovery protocol. 287 F.R.D. 182 (S.D.N.Y. 2012); see
also William A. Gross Constr. Assocs., Inc. v. Am. Mfrs. Mut. Ins. Co., 256
F.R.D. 134 (S.D.N.Y. 2009) (Peck, M.J.) (holding that the parties must
cooperate in selecting appropriate key-words to facilitate computerized search
for relevant e-mails). Indeed, under the protocol outlined in Moore,
the parties agreed to participate cooperatively in an iterative process which
included conferring several times regarding the composition of the seed set, and on the training process in general. Id.
In that case, defendant committed to provide to plaintiff all non-privileged
documents used as part of the seed set, regardless of final
relevancy. Id. at 185, 192 (while not entering a ruling
on the subject, the Court noted that “[i]f you do predictive coding, you are
going to have to give your seed set, including the seed documents marked as
nonresponsive to the plaintiffs counsel[.]”).
Other courts also appear to have contemplated the establishment
of the same sort of collaborative effort envisioned by the Sedona Conference
and established in Moore. See, e.g., Gordon v. Kaleida
Health, No. 08-cv-378S, 2013 U.S. Dist. LEXIS 73330 (W.D.N.Y. May 21, 2013)
(Foschio, M.J.); Hinterberger v. Catholic Health Sys., No.
08-cv-380S, 2013 U.S. Dist. LEXIS 73141 (W.D.N.Y. May 21, 2013) (Foschio,
M.J.); see also Sedona Conference, The Sedona Conference
Cooperation Proclamation, 10 Sedona Conf. J. 331 (2009) (available
at: https://thesedonaconference.org/cooperation-proclamation). But
see H. Christopher Boehning & Daniel J. Toal, ‘Seed Set’
Documents Should Not Be Discoverable, New York Law Journal (Feb. 4, 2014)
(available at: http://www.newyorklawjournal.com/id=1202641220784). In Gordon and Hinterberger,
plaintiffs moved to compel defendants to “engage in meaningful meet and
confer discussions regarding an ESI protocol” and, if an agreement could not be
reached, to compel the submission by each party of a proposed protocol for
adoption by the Court. Id. The Court denied
plaintiffs’ motion in each case without prejudice, explaining that “Defendants
state they are prepared to meet and confer with Plaintiffs… regarding
Defendants’ ESI production using predictive coding… [a]ccordingly, it is not
necessary for the court to further address the merits of Plaintiffs’ motion at
this time.” Gordon, 2013 U.S. Dist. LEXIS 73330, at *11; see
also Hinterberger, 2013 U.S. Dist. LEXIS 73141, at *10 (same). While
the Court in Gordon and Hinterberger did not
find any need to enter an order, given defendants’ expressed willingness to
cooperate, where defendants prove unwilling to cooperate – or where there is
reason to doubt the sufficiency of their production – a court may prove more
amenable to compelling either cooperation or permitting discovery on discovery,
as did the court in Ruiz-Bueno, above. 2013 U.S. Dist. LEXIS
162953, 2013 WL 6055402.
At least one court, however, has taken a more restrictive view about the
discoverability of the seed set. This position is well summarized by the
federal district court for the Northern District of Indiana in In re
Biomet M2A Magnum Hip Implant Prods. Liability Litig.,
No. 3:12-MD-2391, 2013 U.S. Dist. LEXIS 172570 (N.D. Ind. Aug. 21,
2013). In that case, plaintiff requested that defendant produce “the
discoverable documents used in the training of the ‘predictive coding’
algorithm.” Defendant disclosed only that the discoverable documents
used in the training had already been provided, without specifically identifying
those documents. Id. at *2. After first noting
that it was “self evident” that plaintiff did not have a right to discover the
entirety of the “seed set” used to train the algorithm, the Court addressed
whether defendant was required to disclose which of the admittedly responsive
documents were used in training the algorithm. Id. at
*3. The Court held that Rule 26(b)(1) does not permit discovery into the
use to which defendant put discoverable documents prior to their
production. Id. at *4. Nevertheless, the Court
called defendant’s refusal to cooperate “troubling” and indicated that,
although it could not compel production of the seed set, “[defendant]’s
cooperation falls below what the Sedona Conference endorses[,]” going on to
state that “[a]n unexplained lack of cooperation in discovery can lead a court
to question why the uncooperative party is hiding something, and such questions
can affect the exercise of discretion.” Id. at *5.
Given the concerns identified in Biomet, plaintiffs’ counsel should
work assiduously with defense counsel to arrive at an agreeable protocol
whereby, to the extent practicable, they are able to review and participate in
the creation of the “seed set” and the training of the TAR algorithm used.
This cooperation is desirable not only for its potential to resolve this issue
with a minimum of time and expense, but also to ensure that the real goal –
maximal production of responsive information in an efficient and timely
fashion – can be achieved with a minimum of collateral litigation. It is
worth noting that even the Court in Biomet found defendant’s
lack of transparency “troubling.” As such, it may be that courts would be
more open to mandating such cooperation than to mandating after-the-fact discovery on discovery. That said, if such a
compromise cannot be reached, plaintiff should consider moving to compel
cooperation and/or to compel entry of a cooperative discovery protocol.
Please be sure to visit our website at http://RobertBFitzpatrick.com