The volume of duplicate documents that remains following standard deduplication is a common source of frustration for legal teams, particularly when faced with vast volumes of documents and tight review deadlines.
The level of duplication can appear exaggerated when the family relationship of documents is overlooked, causing further frustration and confusion as to why seemingly duplicate documents are presented for review more than once.
This article considers approaches and solutions available to help manage and eliminate duplicate documents. Some are ‘off-the-shelf’ solutions built into popular document review platforms; others are customised solutions which require input from skilled technicians.
The standard approach to deduplicating data in eDisclosure is to compare MD5 hashes (the equivalent of a digital fingerprint) at a family level, either on a custodian or global basis. By way of an example, a parent document (e.g. an email) and its attachments are considered collectively as part of the deduplication process. Identical documents produce identical MD5 hashes.
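As a minimal sketch of the hashing step, the following Python snippet (using the standard hashlib library; the family-level combination shown is an illustrative simplification, not any review platform’s actual implementation) shows how identical content always produces an identical MD5 hash, and how a parent and its attachments can be compared as a single unit:

```python
import hashlib

def md5_hash(content: bytes) -> str:
    """Return the MD5 hex digest of a document's content (its 'digital fingerprint')."""
    return hashlib.md5(content).hexdigest()

def family_hash(parent: bytes, attachments: list[bytes]) -> str:
    """Hash a parent document together with its attachments so the whole
    family is compared as one unit (a simplified illustration only)."""
    digest = hashlib.md5(parent)
    for attachment in attachments:
        digest.update(md5_hash(attachment).encode())
    return digest.hexdigest()

# Identical content always yields an identical MD5 hash.
email_a = b"From: alice\nSubject: Q3 Report\n\nPlease see attached."
email_b = b"From: alice\nSubject: Q3 Report\n\nPlease see attached."
assert md5_hash(email_a) == md5_hash(email_b)

# A family with a different attachment produces a different family hash.
assert family_hash(email_a, [b"report v1"]) != family_hash(email_a, [b"report v2"])
```

Note that any change to parent or attachment content changes the family hash, which is precisely why whole families survive when only one member differs.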
As the name suggests, family level deduplication analyses duplication as between complete families of documents, excluding duplicate families and ensuring only one copy of a family survives processing and is published for review. The process is largely automatic, requires very little human input and allows for large volumes of duplicate documents to be automatically culled.
A degree of duplication will almost always remain following standard deduplication, particularly where identical email attachments have been received, forwarded and otherwise circulated as part of different email chains. Such circulation and receipt of documents may be highly relevant (for context reasons) to a review. The more aggressive item-level deduplication will remove more documents as part of the deduplication process at the expense of context, for example only presenting one copy of an attachment to different emails.
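To make the trade-off concrete, here is a toy Python illustration (the document families are invented) of how item-level deduplication culls more than family-level deduplication, at the cost of losing the second context of a shared attachment:

```python
# Two email families sharing an identical attachment but with different parents.
families = [
    {"parent": "email about budgets", "attachments": ["REPORT-CONTENT"]},
    {"parent": "forwarded email",     "attachments": ["REPORT-CONTENT"]},
]

def family_key(f):
    # The whole family is the unit of comparison.
    return (f["parent"], tuple(f["attachments"]))

# Family-level: a family is only culled if the ENTIRE family is a duplicate.
# The parents differ, so both families survive: 4 reviewable items.
surviving_families = {family_key(f): f for f in families}.values()
family_items = sum(1 + len(f["attachments"]) for f in surviving_families)

# Item-level: every individual document is deduplicated, so only one copy
# of the shared attachment survives: 3 reviewable items, one context lost.
seen = set()
item_level = []
for f in families:
    for doc in [f["parent"], *f["attachments"]]:
        if doc not in seen:
            seen.add(doc)
            item_level.append(doc)
```

Here `family_items` is 4 while `item_level` holds 3 documents: the attachment to the forwarded email has been removed, along with the context of its second circulation.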
A range of factors also exist which can undermine standard deduplication, many associated with the ways in which organisations manage information, including the use of email archiving systems (e.g. Mimecast) and document management repositories (e.g. FileSite). These can alter the content of documents and cause different MD5 hashes to be generated for seemingly identical documents.
Where seemingly duplicate information is escaping the deduplication process, a customised approach can be applied to defensibly eliminate this information, reduce reviewable document counts and help manage the costs of the disclosure process.
The first stage in this process involves a technician interrogating data to identify the reasons why documents are escaping deduplication.
The most common examples we encounter are caused by email archiving or storage systems which strip out attachments, insert references or otherwise adjust the text of a document. Deduplication then fails, for example, as between an email recovered from an archived source and one recovered from a live mailbox, despite the obvious similarities between the documents.
It is usually straightforward to identify sample documents for comparison, and a simple review of metadata can reveal obvious characteristics which have undermined deduplication. As innocuous as it might sound, the simple act of a storage system adding a unique identifier string to an email subject, as FileSite does for example, can mean that two seemingly identical emails are not considered duplicates.
Obvious characteristics aside, archiving software can also strip out attachments and add text to the body of documents. Missing attachments are relatively easy to spot, however document text should be analysed with greater care and subjected to more rigorous testing.
Document text can be compared by running it through a ‘diffing tool’ (this article assumes everyone has a favourite) to identify differences. If the tool identifies valid differences between the documents, then you are out of luck and the documents are not duplicates. However, you may succeed in identifying text laid down by software which can validly be ignored for the purposes of deduplication.
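A lightweight version of that comparison can be sketched with Python’s standard difflib module; the archiver-inserted reference shown here is invented for illustration:

```python
import difflib

# The same email recovered from a live mailbox and from an archived source.
live = "Please find the contract attached.\nRegards, Bob"
archived = ("Please find the contract attached.\n"
            "Regards, Bob\n"
            "[Archived: ref 0012345]")  # hypothetical stamp added by archiving software

diff = list(difflib.unified_diff(live.splitlines(),
                                 archived.splitlines(),
                                 lineterm=""))

# Collect only the lines present in the archived copy but not the live one.
added = [line[1:] for line in diff
         if line.startswith("+") and not line.startswith("+++")]

# The sole difference is text laid down by the software, which can
# validly be disregarded for the purposes of deduplication.
assert added == ["[Archived: ref 0012345]"]
```

If the diff instead surfaced substantive differences in the body text, the documents would not be duplicates and both would need to survive.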
With causes identified, a customised process can be applied to the documents which disregards the identified differences between the documents which previously undermined deduplication. New data is generated exclusively for this purpose, with no original metadata being compromised and native documents remaining untouched.
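By way of a simplified sketch of such a process, the Python snippet below (the identifier pattern is invented; a real implementation would be tailored to the storage system in question) strips the offending string before generating a fresh hash, leaving the original data untouched:

```python
import hashlib
import re

# Hypothetical pattern for a storage-system identifier appended to a subject,
# e.g. "Q3 Report [DMS#4821093]" - the real pattern depends on the system.
IDENTIFIER = re.compile(r"\s*\[DMS#\d+\]")

def normalised_hash(subject: str, body: str) -> str:
    """Generate a new hash after disregarding the identified differences.
    The hash exists purely for deduplication; original metadata and
    native documents remain untouched."""
    clean_subject = IDENTIFIER.sub("", subject)
    return hashlib.md5((clean_subject + "\n" + body).encode("utf-8")).hexdigest()

# Two copies of the same email, one tagged by the storage system:
a = normalised_hash("Q3 Report", "Figures attached.")
b = normalised_hash("Q3 Report [DMS#4821093]", "Figures attached.")
assert a == b  # now recognised as duplicates
```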
The original deduplication process will have been applied during data processing, and it is typical to refine this process using another platform, for example SQL or a custom-built software solution.
SQL has a useful function which allows hashes to be generated according to selected characteristics of a document. By the time duplication issues are encountered, data has typically been loaded into a database for review, therefore using SQL to address duplication has the obvious advantage that the data is already in a convenient format and requires little organisation before the process can be applied.
There is a limit to the amount of information that can be hashed in SQL, though, so you may have to be selective about the information used. The more information considered, the more accurate the custom process; if SQL’s limits force too narrow a selection, the approach becomes too “loose”, in which case custom-built software will be preferable.
Generic pieces of software exist which will generate document hashes; however, at Anexsys we have the expertise to create customised software to do this, with the following advantages:
- complete control over the process
- transparent and defensible
- capable of independent validation
- no size limitations: the process can consider as much data as necessary to validly complete the process, allowing you to set as high a duplicative standard as required
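The ‘no size limitations’ advantage can be illustrated by incremental hashing, where the digest is built up chunk by chunk so that as many fields and as much data as required feed into a single hash. A minimal Python sketch follows (the field selection is an invented example, not our production process):

```python
import hashlib

def custom_hash(fields: dict[str, str], chunk_size: int = 8192) -> str:
    """Build a deduplication hash incrementally over every selected field,
    so there is no cap on how much data contributes to the comparison."""
    digest = hashlib.sha256()
    for name in sorted(fields):          # stable field order = reproducible hash
        digest.update(name.encode("utf-8"))
        value = fields[name].encode("utf-8")
        # Feed arbitrarily large values into the digest chunk by chunk.
        for i in range(0, len(value), chunk_size):
            digest.update(value[i:i + chunk_size])
    return digest.hexdigest()

# A document with a very large body hashes without any size restriction.
doc = {"subject": "Q3 Report", "body": "x" * 1_000_000, "sender": "alice@example.com"}
dup = dict(doc)
assert custom_hash(doc) == custom_hash(dup)
```

Because every chosen field contributes to the digest, the duplicative standard can be tightened simply by adding fields to the selection.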
A range of ‘off-the-shelf’ solutions also exist to help manage or eliminate duplicate and near-duplicate information. Anexsys are a Relativity ‘Best in Service’ provider, so this article naturally focusses on the solutions offered by Relativity, which include the following:
- Email Threading
Email threading is a process which identifies the longest or most complete email in a chain (the ‘Inclusive Email’) and allows the shorter component emails to be excluded from review, eliminating unnecessary review of duplicate information. It is a defensible approach to eliminating duplicate information because a shorter email is only excluded when its content is known to be contained within an inclusive email.
In practice we have seen email threading reduce reviewable volumes by between 10% and 50%.
Although there is an emerging trend of parties adopting this methodology, where parties are unwilling to exclude documents on this basis, email threading can still be applied to group similar information together, allowing for a more contextual review and assisting with consistency in decision making.
- Near Duplicate Analysis
Relativity’s analytics function can detect similar documents on a contextual and textual basis. Similar documents are grouped together and ranked by similarity. As with email threading, this can help group documents together and support a contextual review; where a high degree of similarity is identified between documents, exclusions may also be possible.