Ediscovery Glossary



Active Data Data or files on a computer that can be accessed without a restoration process.
Active Learning An algorithm that allows the machine to interactively query the user to obtain the desired clarification of input. This is used in Machine Learning and so in Predictive Coding.
Algorithm A mathematical formula that is used in review platforms to support the Review Phase, specifically in Predictive Coding.
Archived Data Information that is not directly accessible to the user of a computer system but that the organisation maintains for long-term storage and record keeping purposes.
Attachment In common use, this term refers to a file (or files) associated with an email for transfer and storage as a single message unit.
Backup The procedure of making extra copies of data in case the original is lost or damaged.
Backup Tape Portable media used to store data that is not presently in use by an organisation to free up space but still allow for disaster recovery.
Batch A set of documents within the database, often organised so that it can be reviewed by a particular person.
Bates Number Unique number designated to each document that is produced from a database.
Bit Short for binary digit, the smallest unit of data in a computer. A bit has a single binary value, either 0 or 1.
Bit-by-bit Copy See Forensic Copy
Byte A sequence of adjacent bits operated as a unit by a computer. A byte usually consists of eight bits.
Boolean search A type of search allowing users to combine keywords with operators such as AND, NOT and OR to create a more targeted search.
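The three operators can be illustrated with set operations. This is a minimal sketch, not how any particular review platform implements search: the sample documents and the `matching` helper are hypothetical, and matching is done on whole words only.

```python
# Hypothetical mini data set: document ID -> extracted text
documents = {
    1: "merger agreement signed by both parties",
    2: "draft merger press release",
    3: "weekly newsletter about the office party",
}

def matching(keyword: str) -> set[int]:
    """Return the IDs of documents that contain the keyword as a whole word."""
    return {
        doc_id
        for doc_id, text in documents.items()
        if keyword.lower() in text.lower().split()
    }

# merger AND agreement: documents containing both keywords
both = matching("merger") & matching("agreement")

# merger OR newsletter: documents containing at least one keyword
either = matching("merger") | matching("newsletter")

# merger NOT draft: documents containing the first keyword but not the second
without = matching("merger") - matching("draft")

print(both, either, without)  # {1} {1, 2, 3} {1}
```

AND narrows the result, OR widens it and NOT excludes documents, which is why combining them produces a more targeted search than a single keyword.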
Categories Sometimes also referred to as tags, each document review exercise will have a list of categories which relate to issues that are pertinent in that matter. A document is categorised when it has been deemed to belong to a particular category. Examples of categories may be "relevant" or "privileged".
Category Tree A list of the categories organised in a systematic way. Also sometimes referred to as a “coding pane”.
Chain of Custody A document that tracks who had possession of a piece of media.
Clustering A form of Topic Grouping.
Coding Document coding is the process of capturing case-relevant information (e.g. author, date authored, date sent, recipient, date opened, etc.) from a paper document. Sometimes coding is also used to refer to assigning Categories to documents.
Compression A technology that reduces the size of a file. Compression programs are valuable to network users because they help save both time and bandwidth.
Concept Searching A type of search which yields a list of other words most commonly found to be in the same document in the data set as the word you searched for. For example, a search for the word "merger" may yield "acquisition", "shares" and "loan". This can provide greater intelligence around how to refine keyword searches, and whether or not code words are being used.
Confidence level See Sample Size: This is one of the figures that generate the Sample Size. It helps to establish how much trust we can have in the predictions produced by the machine in Predictive Coding. Normally set at 95%, it states: "If I were to take this test 100 times, 95 times out of 100 I would get the same result."
Connectors Words and symbols that are used to create a logical link between Search Terms, to include or exclude certain combinations of words. These are used as part of Boolean logic.
Control Number A unique number designated to each document in the database, which makes it easy to identify and refer to.
Corrupt file Corrupt files are those that are damaged prior to or during collection/processing. They cannot be processed.
Culling Decreasing the number of documents by filtering out unnecessary documents such as newsletters.
Custodian Refers to an individual who creates or stores Electronically Stored Information. When an ediscovery provider asks how many custodians there are, they are referring to the number of individuals within an organisation from whom electronically stored information is being collected.
Data Centre A facility that houses the servers on which the database is hosted.
Data Mapping Data mapping involves mapping out a company’s IT infrastructure (the operating systems in use, the hardware, software and data storage areas) and the flow of information into and out of the company.
Data Sampling The process of checking data by identifying and checking representative individual documents.
Data Set The total number of documents that are collected for a project.
Database All the documents that are uploaded into the review platform.
Date filter Filtering documents in or out of the data set by identifying documents within an applicable date range.
Date Range Search Search through the database within a specific date range.
Dawn Raid An unannounced inspection of a company by a competition authority. The authority will typically enter the business premises and look for relevant data throughout the premises - including in lockers, desks, safes, computers and servers.
Decision Tree See Category Tree
Decryption Unlocking Encryption on a document or set of documents by using the provided password.
Deduplication During the data filtering stage, duplicate documents can be filtered out, either across the entire data set or per custodian.
Deleted Data Deleted data is data that, in the past, existed on the computer as live data but has been deleted by the computer system or end-user activity.
DeNIST By DeNISTing a dataset, junk data is removed from the set by applying a filter. This filter - originally designed by the FBI to identify files with no evidentiary value - removes the majority of junk data, such as system files, based on the hash value of each document. NIST stands for National Institute of Standards and Technology.
Disclosure A process resulting in a list of documents and copies of such documents to be produced or delivered to other parties in a dispute, particularly in English litigation.
Disk Mirroring A way to store data more securely. Multiple hard drives contain the same information, so no data is lost when one hard drive fails.
Document Collection See Forensic Collection
Document Retention The preservation of documents and data, including hard copy and electronic documents, databases and emails that are created, sent and received in an organisation’s ordinary course of business.
Document Type Is linked with the File Extension and identifies the type of document, such as Word document, Excel file or an image (picture).
Duplicate An identical document in the same data set.
E01 File A standard file format for a forensic copy taken by EnCase, an industry-standard forensic software package.
Edisclosure The application of ediscovery processes and technologies for the purpose of the disclosure phase of English litigation. Sometimes written as e-disclosure or eDisclosure.
Ediscovery Techniques, processes and technologies which allow the analysis of electronically stored information (ESI) in order to respond to a legal request. Often used in dispute resolution matters (e.g. litigation, arbitration), regulatory investigations and internal corporate investigations. Sometimes written as e-discovery or eDiscovery. See also EDRM and Edisclosure.
EDRM Electronic Discovery Reference Model. An industry standard and diagram which illustrates the spectrum of stages in an ediscovery exercise.
Electronic Document A file stored on digital media. See also ESI. Commonly regarded as the opposite of Hard copy.
Electronic Documents Questionnaire In English Civil Procedure, a document which directs parties in litigation to consider the Electronic Documents which may be relevant in their case. It is recommended by Practice Direction 31B in the Civil Procedure Rules. Often abbreviated as EDQ or N264.
Email Threading Visual graphic of email chains with their corresponding attachments. This enables reviewers to easily identify how email chains evolved and which parties were copied into which emails.
Embedded files Embedded files are files that live within a different file. For example, a company logo in an email signature or an Excel spreadsheet in a PowerPoint presentation. Should not be confused with an email attachment.
Encryption A procedure/technology that renders the contents of a message or file unintelligible to anyone not authorised to read it.
ESI Electronically Stored Information. Documents or other data stored digitally. In ediscovery, this could include emails, Word documents, Excel spreadsheets, chat logs, audio recordings and other information.
Exception Report A list of files that could not be processed onto the system. An example is files that are corrupted beyond repair.
Family A family is a group of documents that are related to each other. The most common example is an email and its attachment. Breaking up families during collection or review can have consequences.
File Extension Three or four letters after the full stop in a file name, which indicate the data file's format or the application used to create the file. E.g. ".doc" to indicate a Word document.
File Server A server within an organisation that contains company-related data.
Filtering The process during which data is filtered out of the system so that irrelevant documents are not uploaded to the review platform. Examples of filters that can be applied in this context are date ranges, keywords and duplicate documents.
Forensic Analysis Technical investigation into a piece of media. An example would be looking at a hard drive to analyse what data had been deleted from it.
Forensic Copy A copy of everything on a media source, including all unused and partially overwritten space.
Format The internal structure of a file, which defines the way it is stored and used. Specific applications may define unique formats for their data (e.g. “MS Word document file format”).
FTP File Transfer Protocol is used to transfer documents from one computer to another over the Internet.
Gigabyte 1024 Megabytes. Commonly abbreviated as GB.
Hash Value A unique reference code for each document derived from a mathematical calculation based on the content and/or properties of a document. It is used to identify duplicate files and also to verify copies of files, because any difference to a file will result in a completely different hash value. Some common hash algorithms include MD5 and SHA (Secure Hash Algorithm).
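The two uses of a hash value described above, verifying copies and identifying duplicates, can be sketched with Python's standard `hashlib` module. This is an illustrative example only: the `hash_value` and `deduplicate` helpers and the sample file names are hypothetical, and hashing here covers file content only, not document properties.

```python
import hashlib

def hash_value(content: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal hash value of a document's raw bytes."""
    return hashlib.new(algorithm, content).hexdigest()

def deduplicate(documents: dict[str, bytes]) -> dict[str, bytes]:
    """Keep only the first document seen for each distinct hash value."""
    seen: set[str] = set()
    unique: dict[str, bytes] = {}
    for name, content in documents.items():
        digest = hash_value(content)
        if digest not in seen:
            seen.add(digest)
            unique[name] = content
    return unique

original = b"Quarterly report - final"
exact_copy = b"Quarterly report - final"
altered = b"Quarterly report - final."  # one extra character

# Verification: an exact copy has the same hash value, an altered file does not
print(hash_value(original) == hash_value(exact_copy))  # True
print(hash_value(original) == hash_value(altered))     # False

# Deduplication: the exact copy is filtered out, the altered file is kept
files = {"report.doc": original, "report_copy.doc": exact_copy, "report_v2.doc": altered}
print(list(deduplicate(files)))  # ['report.doc', 'report_v2.doc']
```

Passing `"md5"` as the `algorithm` argument selects MD5 instead of SHA-256, where the Python build permits it.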
Hashing Determining the hash value of a document.
Hosting The database is hosted on a server in order to keep the review platform populated with documents.
Image (copy) Also known as a forensic copy.
Image (file type) A flat 'copy' of the native file. Comparable to a scanned paper document saved as a picture. Some common types of images include JPEG (.jpg) and TIFF (.tif).
Intelligent Prioritisation A technology which relies on the decisions applied by human reviewers to a percentage of the document population, to suggest other documents for review. It is the most basic form of predictive coding.
Keyword Search A search through the database by using words that are determined to be "key" to finding relevant documents within it.
Kilobyte A kilobyte is 1,024 bytes, but the term is often used loosely as a synonym for 1,000 bytes. Commonly abbreviated as kB.
Linear Review A document review where a team reviews an entire data set without the use of predictive coding.
Load File A file created from a database containing the Work Product, so it can be uploaded into a different system.
Machine Learning A type of artificial intelligence that allows a machine to learn and understand based upon human input. This is the underlying technology for Predictive Coding.
Mail container A file type that holds many email message files. Some common mail containers include PST and OST (from Microsoft Exchange, Outlook) or NSF (Lotus Notes).
Megabyte 1024 Kilobytes. Commonly abbreviated as MB.
Metadata Information held within a document other than the actual content. This includes: create date, last modified, language, when an email was sent, and who was copied into the email.
Native Application The application needed to open the file in its native format. For example, the native application for a spreadsheet is Microsoft Excel.
Native file The original format of a document, as created by its native application, for example Word, PDF, Excel, or PowerPoint.
Nearline A way of hosting data between warm and cold storage, which enables the user to reduce hosting fees while maintaining full flexibility over their dataset.
Near-duplicate Two or more documents that are almost identical based on the text content. They may be in different file formats.
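One simple way to measure how close two texts are is to compare their overlapping word sequences ("shingles"). The sketch below is an assumption for illustration, not the method any specific review platform uses: the `shingles` and `similarity` helpers, the shingle size and the sample sentences are all hypothetical.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Split text into overlapping sequences of `size` consecutive words."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

first = "The merger will complete on Friday subject to board approval"
revised = "The merger will complete on Monday subject to board approval"
unrelated = "Please remember to book the meeting room for the team lunch"

print(similarity(first, first))                                   # 1.0
print(similarity(first, revised) > similarity(first, unrelated))  # True
```

Because only the extracted text is compared, the same score results whether the two documents are stored as Word files, PDFs or emails. A platform would treat documents scoring above some chosen threshold as near-duplicates.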
OCR Optical Character Recognition. Hard copy documents are scanned and the text extracted to make it searchable in the review platform.
Onsite collection An independent third party enters the premises of a client to collect data on their behalf.
Parent document In a family of documents, the email that holds an attachment is called the parent document.
Petabyte 1024 Terabytes. Commonly abbreviated as PB.
Predictive Coding A type of machine-learning technology which enables a computer to predict how documents should be categorised based on how an expert lawyer in a particular case has trained it to do so. This technology can help clients find relevant data more consistently and cost-effectively, and potentially eliminate irrelevant data so they don’t have to spend time and money reviewing it.
Privileged Documents Documents that are protected from disclosure to third parties, either because they involve legal advice from a lawyer or because they were created for the dominant purpose of litigation.
Processing Phase The phase in the EDRM model where the data is being processed and uploaded to the review platform.
Production The process by which certain documents that have been uploaded onto the database are produced. This may be in various formats, including USB stick or paper.
Project Manager A person who oversees an ediscovery project and manages the database.
Quality Control The process by which documents which have been categorised in the database are checked for accuracy.
RAID Redundant Array of Independent Disks. A storage method that combines multiple hard drives to reduce the risk of data loss when a hard drive fails.
RAM An acronym for Random Access Memory, the short-term memory of a device. Valuable data can be recovered from it with forensic tools.
Redactions Sensitive information can be redacted in documents when they are being reviewed. Redactions place a rectangular mark over the sensitive information so that it cannot be viewed by other parties.
Reports Reports can be run in review tools to provide an array of information about the progress of the document review, including how many documents are remaining and how many documents are being reviewed per hour.
Review Phase Once the data is uploaded to the review platform, the reviewers can start reviewing the documents.
Reviewed A document which has been viewed and categorised.
Sample Set A statistically valid random selection of documents that is used for two purposes in Predictive Coding. The first is to generate prevalence estimates for the composition of the data (how many relevant documents to expect) and the second is to test to see how well the system has learned for the purposes of Predictive Coding (how many of those relevant documents are we likely to find).
Sample Size See Sample Set: In order for the Sample Set to be statistically sound, the sample size needs to be correctly calculated. This involves three numbers - the size of the data corpus, the confidence level and the margin of error. These three figures affect the amount of human input that is required for Predictive Coding, and the amount of confidence we have in its results.
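As an illustration of how these three figures interact, the sketch below uses a widely taught sample size formula (Cochran's formula with a finite population correction). It is an assumption that this formula resembles what a given review platform applies. The `sample_size` function and its `prevalence` parameter are hypothetical, and a prevalence of 0.5 is the conservative choice that yields the largest sample.

```python
import math
from statistics import NormalDist

def sample_size(population: int, confidence_level: float = 0.95,
                margin_of_error: float = 0.05, prevalence: float = 0.5) -> int:
    """Estimate how many documents to sample from a corpus of `population` documents."""
    # z-score for a two-sided interval, about 1.96 at a 95% confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for an effectively unlimited population
    n0 = (z ** 2) * prevalence * (1 - prevalence) / margin_of_error ** 2
    # Finite population correction for the actual size of the data corpus
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# 100,000 documents, 95% confidence level, 5% margin of error
print(sample_size(100_000))  # 383
```

Tightening the margin of error or raising the confidence level increases the sample size, and with it the human review effort. Growing the data corpus has comparatively little effect once the corpus is large.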
Search Hit A term that has been entered in a Keyword Search that matches with a word in the database.
Search Term See Keyword Search.
Search Term List A list of search terms in combination with connectors.
Second Request A discovery procedure where the authorities investigate mergers and acquisitions which may have anticompetitive consequences on the market. These are often responded to utilising ediscovery technology.
Seed Set The first step of Predictive Coding. It is often called "seeding" and involves selecting documents judgmentally via search, concept query, analytics, etc. to input good information so that the system can learn for the purposes of Predictive Coding.
Self-collection When an organisation collects the needed data itself, instead of using an independent third party.
Slack Space The part of a hard drive where parts of deleted documents reside.
Stand-alone document A document without any family members.
Structured Data Data that is extracted from structured databases such as CRM systems. Data can be a mix of structured and unstructured data.
Syncing Once a document is categorised all identical documents will automatically be categorised in the same way.
System Administrator The person within an organisation who holds all the IT passwords and has access to the data.
Tag The category applied to a document.
Terabyte 1024 Gigabytes. Commonly abbreviated as TB.
Text-only A database where all the documents are uploaded with just their text and metadata - excluding the native and TIFF images. This reduces the data volume and is therefore used to save on hosting costs.
TIFF Tagged Image File Format is a file format in which documents can be viewed in a review platform (as well as native and text-only). Redaction, printing and highlighting are done in TIFF mode.
Topic Grouping Documents grouped by topic.
Topical Search A search based on topic grouping.
Training Set When using predictive coding, the training set is the minimum set of documents that has to be reviewed before the algorithm will be effective.
Unitisation The process of splitting up one large document into separate, logically divided documents.
User Group Users who need to access documents in the database can be split into user groups, which will have different levels of access and permissions.
Workflow The design of the flow of the documents within the review tool. Documents can be allocated on an automated basis based on which user groups should be receiving which documents.
Work Product Information added to the documents in a Database during the Review Phase. This includes: categorisations, redactions, highlights, and comments.