Once all necessary data has been collected, deduplication and filtering can dramatically reduce the time and cost of electronic discovery. Removing duplicate documents and e-mails shrinks the data set and makes the review process more manageable. After deduplication, the set can be culled further by metadata properties such as document date or file type. Finally, and in some cases most importantly, complex search queries can be built to group and identify targeted information.
Objectively eliminating these documents not only reduces costly review time but also lessens the amount of data sent for conversion or hosting, providing added savings. In fact, simply deduplicating and culling a data set reduces it by an average of thirty percent (30%).
- Mark duplicates and include full record
- Mark duplicates but only include placeholder
- Detect and remove duplicates
- Do not detect or remove duplicates
Duplicate Detection Criteria
- Across a single custodian
- Globally across all documents within a case
- Within a single or multiple sources, such as hard drives or tapes
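To make the difference between custodian-level and global detection concrete, here is a minimal sketch. It assumes duplicates are identified by a SHA-256 hash of the raw file content, which is an illustrative choice and not necessarily how any particular processing tool works. The `Document` class, the `find_duplicates` function, and the `scope` parameter are hypothetical names introduced for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical record type for illustration only.
    doc_id: str
    custodian: str
    content: bytes


def find_duplicates(docs, scope="global"):
    """Map each duplicate's doc_id to the doc_id of the first copy seen.

    scope="global" compares documents across the whole case;
    scope="custodian" only compares documents held by the same custodian.
    """
    seen = {}
    duplicates = {}
    for doc in docs:
        # Assumption: identical bytes means identical document.
        digest = hashlib.sha256(doc.content).hexdigest()
        key = digest if scope == "global" else (doc.custodian, digest)
        if key in seen:
            duplicates[doc.doc_id] = seen[key]
        else:
            seen[key] = doc.doc_id
    return duplicates


docs = [
    Document("D1", "alice", b"Q3 forecast"),
    Document("D2", "alice", b"Q3 forecast"),
    Document("D3", "bob", b"Q3 forecast"),
    Document("D4", "bob", b"Meeting notes"),
]

global_dupes = find_duplicates(docs, scope="global")
custodian_dupes = find_duplicates(docs, scope="custodian")
```

With this sample data, global detection flags both `D2` and `D3` as copies of `D1`, while custodian-level detection flags only `D2`, because Bob's copy is the first one within his own set. The returned mapping could then drive any of the handling options above: marking, replacing with a placeholder, or removing.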
- Custodian and file location
- Email properties such as sent and received date
- File properties such as created and modified date, and type
- Removal of system files (if necessary)
- Filtering can be performed on extended metadata fields if they are already extracted (requires native processing)
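The filtering criteria above can be sketched as a simple pass over file records. This is only an illustration under stated assumptions: the `FileRecord` fields, the `cull` function, and the `SYSTEM_TYPES` set are hypothetical, and a real system-file list would be far longer than the three extensions used here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FileRecord:
    # Hypothetical metadata record for illustration only.
    path: str
    custodian: str
    file_type: str
    modified: date


# Assumption: a tiny stand-in for a real system-file list.
SYSTEM_TYPES = {"dll", "exe", "sys"}


def cull(records, custodians=None, start=None, end=None,
         allowed_types=None, drop_system=True):
    """Keep only records that pass every filter that was supplied."""
    kept = []
    for rec in records:
        ftype = rec.file_type.lower()
        if custodians is not None and rec.custodian not in custodians:
            continue
        if start is not None and rec.modified < start:
            continue
        if end is not None and rec.modified > end:
            continue
        if drop_system and ftype in SYSTEM_TYPES:
            continue
        if allowed_types is not None and ftype not in allowed_types:
            continue
        kept.append(rec)
    return kept


records = [
    FileRecord("/alice/plan.docx", "alice", "docx", date(2008, 3, 14)),
    FileRecord("/alice/old.docx", "alice", "docx", date(2005, 1, 2)),
    FileRecord("/alice/kernel32.dll", "alice", "dll", date(2008, 4, 1)),
    FileRecord("/bob/budget.xls", "bob", "xls", date(2008, 5, 20)),
]

kept = cull(records, custodians={"alice"},
            start=date(2008, 1, 1), end=date(2008, 12, 31))
kept_paths = [rec.path for rec in kept]
```

Each filter is optional, so the same function covers a date-only cull, a custodian-only cull, or any combination of the two with file-type restrictions.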
- Keyword
Identifying documents containing verbatim text
- Stemming and Fuzzy
Including inflected forms of keywords (stemming) and hits that fall within a threshold of character differences (fuzzy), which compensates for misspellings or OCR inaccuracy
- Proximity
Returns results if keywords occur within a specified number of words from each other
- Boolean
Extending keyword searches by logically connecting terms
- Near Duplicate Processing
Identifying and grouping similar documents based on like content
- E-mail Thread Identification
Identifying and splitting e-mail threads to eliminate redundant content
- Conceptual and Clustering
Grouping and connecting documents with related topics
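Several of the search techniques above can be illustrated in a few lines. The sketch below is a simplified approximation, not a description of any specific search engine: fuzzy matching is modeled as a Levenshtein edit-distance threshold, proximity as a difference in word positions, and near-duplicate detection as Jaccard similarity over three-word shingles. All function names and default values are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_hit(text, keyword, max_edits=1):
    """True if any word is within max_edits characters of the keyword."""
    kw = keyword.lower()
    return any(edit_distance(kw, tok) <= max_edits for tok in tokenize(text))


def proximity_hit(text, a, b, max_distance):
    """True if words a and b occur within max_distance positions."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


def jaccard_similarity(text_a, text_b, shingle_size=3):
    """Share of overlapping word shingles between two texts (0.0 to 1.0)."""
    def shingles(text):
        toks = tokenize(text)
        count = max(len(toks) - shingle_size + 1, 1)
        return {tuple(toks[i:i + shingle_size]) for i in range(count)}
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


scanned = "The contrct was signed before the audit began"
ocr_match = fuzzy_hit(scanned, "contract", max_edits=1)
exact_match = fuzzy_hit(scanned, "contract", max_edits=0)
near = proximity_hit(scanned, "signed", "audit", max_distance=3)

draft_1 = "please review the attached draft agreement today"
draft_2 = "please review the attached draft agreement tomorrow"
similarity = jaccard_similarity(draft_1, draft_2)
```

Here the misspelled "contrct" is found only when one character of difference is tolerated, which is the kind of OCR error fuzzy searching is meant to absorb. A Boolean query is then just a logical combination of such tests, for example `ocr_match and near`. A near-duplicate process could group documents whose similarity exceeds a chosen cutoff; that cutoff is a tuning decision left open here.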