Ediscovery Glossary



Active DataData or files on a computer that can be accessed without a restoration process.
Active LearningAn algorithm that allows the machine to interactively query the user to clarify its input. This is used in Machine Learning and therefore in Predictive Coding.
AlgorithmA defined set of mathematical rules or steps that a computer follows. Algorithms are used in review platforms to support the Review Phase, specifically in Predictive Coding.
Archived DataInformation that is not directly accessible to the user of a computer system but that the organisation maintains for long-term storage and record keeping purposes.
AttachmentIn common use, this term refers to a file (or files) associated with an email for transfer and storage as a single message unit.
BackupThe procedure of making extra copies of data in case the original is lost or damaged.
Backup TapePortable media used to store data that is not presently in use by an organisation to free up space but still allow for disaster recovery.
BatchA set of documents within the database, often organised so that the set can be reviewed by a particular person.
Bates NumberUnique number designated to each document that is produced from a database.
BitShort for binary digit. A bit is the smallest unit of data in a computer and has a single binary value, either 0 or 1.
Bit-by-bit CopySee Forensic Copy
ByteA sequence of adjacent bits operated as a unit by a computer. A byte usually consists of eight bits.
Boolean searchA type of search allowing users to combine keywords with operators such as AND, NOT and OR to create a more targeted search.
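
To illustrate how the AND, OR and NOT operators narrow a search, here is a minimal Python sketch. The `matches` helper, the sample documents and their identifiers are hypothetical and are not taken from any review platform; real platforms use their own query syntax and indexing.

```python
import re

def tokenize(text):
    """Lower-case a text and return the set of words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(text, all_of=(), any_of=(), none_of=()):
    """Return True if the text satisfies the AND / OR / NOT conditions."""
    words = tokenize(text)
    has_all = all(term.lower() in words for term in all_of)                # AND
    has_any = not any_of or any(term.lower() in words for term in any_of)  # OR
    has_none = not any(term.lower() in words for term in none_of)          # NOT
    return has_all and has_any and has_none

# Hypothetical documents, keyed by an invented control number.
documents = {
    "DOC-001": "Draft merger agreement for review",
    "DOC-002": "Weekly newsletter: merger rumours and sports",
    "DOC-003": "Acquisition timetable and share purchase terms",
}

# Equivalent to: merger AND (agreement OR acquisition) NOT newsletter
hits = [
    doc_id
    for doc_id, text in documents.items()
    if matches(text, all_of=["merger"],
               any_of=["agreement", "acquisition"],
               none_of=["newsletter"])
]
print(hits)
```

Each added operator removes documents from the result, which is why combining them produces a more targeted search than a single keyword.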
CategoriesSometimes also referred to as tags, each document review exercise will have a list of categories which relate to issues that are pertinent in that matter. A document is categorised when it has been deemed to belong to a particular category. Examples of categories may be "relevant" or "privileged".
Category TreeA list of the categories organised in a systematic way. Also sometimes referred to as a “coding pane”.
Chain of CustodyA document that tracks who had possession of a piece of media.
ClusteringA form of Topic Grouping.
CodingDocument coding is the process of capturing case-relevant information (e.g. author, date authored, date sent, recipient, date opened, etc.) from a paper document. Sometimes coding is also used to refer to assigning Categories to documents.
CompressionA technology that reduces the size of a file. Compression programs are valuable to network users because they help save both time and bandwidth.
Concept SearchingA type of search which yields a list of other words most commonly found to be in the same document in the data set as the word you searched for. For example, a search for the word "merger" may yield "acquisition", "shares" and "loan". This can provide greater intelligence around how to refine keyword searches, and whether or not code words are being used.
Confidence levelSee Sample Size. This is one of the figures used to calculate the Sample Size. It helps to establish how much trust we can have in the predictions produced by the machine in Predictive Coding. Normally set at 95%, it states: "If I were to take this test 100 times, 95 times out of 100 I would get the same result, within the margin of error."
ConnectorsWords and symbols that are used to create a logical link between Search Terms, to include or exclude certain combinations of words. These are used as part of Boolean logic.
Control NumberA unique number designated to each document in the database, which makes it easy to identify and refer to.
Corrupt fileCorrupt files are those that are damaged prior to or during collection/processing. They cannot be processed.
CullingDecreasing the number of documents by filtering out unnecessary documents such as newsletters.
CustodianRefers to an individual who creates or stores Electronically Stored Information. When an ediscovery provider asks how many custodians there are, they are referring to the number of individuals within an organisation from whom electronically stored information is being collected.
Data CentreA facility that houses servers. The database will be hosted in a data centre.
Data MappingData mapping involves mapping out a company’s IT infrastructure (the operating systems in use, the hardware, software and data storage areas) and the flow of information into and out of the company.
Data SamplingThe process of checking data by identifying and checking representative individual documents.
Data SetThe total number of documents that are collected for a project.
DatabaseAll the documents that are uploaded into the review platform.
Date filterFiltering documents in or out of the data set by identifying documents within an applicable date range.
Date Range SearchSearch through the database within a specific date range.
Dawn RaidAn unannounced inspection of a company by a competition authority. The authority will typically enter the business premises and look for relevant data throughout the premises - including in lockers, desks, safes, computers and servers.
Decision TreeSee Category Tree
DecryptionUnlocking Encryption on a document or set of documents by using the provided password.
DeduplicationDuring the data filtering stage, duplicate documents can be filtered out, either across the entire data set or per custodian.
Deleted DataDeleted data is data that, in the past, existed on the computer as live data but has been deleted by the computer system or end-user activity.
DeNISTDeNISTing a dataset removes junk data, such as system files, by applying a filter. The filter removes the majority of junk data based on the hash value of each file. It was originally designed by the FBI to identify files with no evidentiary value. NIST stands for National Institute of Standards and Technology.
DisclosureA process resulting in a list of documents and copies of such documents to be produced or delivered to other parties in a dispute, particularly in English litigation.
Disk MirroringA way to store data more securely. Multiple hard drives contain the same information, so there will be no data loss when one hard drive fails.
Document CollectionSee Forensic Collection
Document RetentionThe preservation of documents and data, including hard copy and electronic documents, databases and emails that are created, sent and received in an organisation’s ordinary course of business.
Document TypeIs linked with the File Extension and identifies the type of document, such as Word document, Excel file or an image (picture).
DuplicateAn identical document in the same data set.
E01 FileA standard file format for a forensic copy taken by EnCase, an industry-standard forensic software package.
EdisclosureThe application of ediscovery processes and technologies for the purpose of the disclosure phase of English litigation. Sometimes written as e-disclosure or eDisclosure.
EdiscoveryTechniques, processes and technologies which allow the analysis of electronically stored information (ESI) in order to respond to a legal request. Often used in dispute resolution matters (e.g. litigation, arbitration), regulatory investigations and internal corporate investigations. Sometimes written as e-discovery or eDiscovery. See also EDRM and Edisclosure.
EDRMElectronic Discovery Reference Model. An industry standard and diagram which illustrates the spectrum of stages in an ediscovery exercise.
Electronic DocumentA file stored on digital media. See also ESI. Commonly regarded as the opposite of Hard copy.
Electronic Documents QuestionnaireIn English Civil Procedure, a document which directs parties in litigation to consider the Electronic Documents which may be relevant in their case. It is recommended by Practice Direction 31B in the Civil Procedure Rules. Often abbreviated as EDQ or N264.
Email ThreadingVisual graphic of email chains with their corresponding attachments. This enables reviewers to easily identify how email chains evolved and which parties were copied into which emails.
Embedded filesEmbedded files are files that live within a different file. For example, a company logo in an email signature or an Excel spreadsheet in a PowerPoint presentation. Should not be confused with an email attachment.
EncryptionA procedure/technology that renders the contents of a message or file unintelligible to anyone not authorised to read it.
ESIElectronically Stored Information. Documents or other data stored digitally. In ediscovery, this could include emails, Word documents, Excel spreadsheets, chat logs, audio recordings and other information.
Exception ReportA list of files that could not be processed onto the system. An example is files that are corrupted beyond repair.
FamilyA family is a group of documents that are related to each other. The most common example is an email and its attachment. Breaking up families during collection or review can have consequences.
File ExtensionThree or four letters after the full stop in a file name, which indicate the data file's format or the application used to create the file. E.g. ".doc" to indicate a Word document.
File ServerA server within an organisation that contains company-related data.
FilteringThe process during which data is filtered out of the system so that irrelevant documents are not uploaded to the review platform. Examples of filters that can be applied in this context are date ranges, keywords and duplicate documents.
Forensic AnalysisTechnical investigation into a piece of media. An example would be looking at a hard drive to analyse what data had been deleted from it.
Forensic CopyDuring a forensic copy everything from a media source will be copied. This includes all unused and partially overwritten spaces.
FormatThe internal structure of a file, which defines the way it is stored and used. Specific applications may define unique formats for their data (e.g. “MS Word document file format”).
FTPFile Transfer Protocol is used to transfer documents from one computer to another over the Internet.
Gigabyte1024 Megabytes. Commonly abbreviated as GB.
Hash ValueA unique reference code for each document derived from a mathematical calculation based on the content and/or properties of a document. It is used to identify duplicate files and also to verify copies of files, because any difference to a file will result in a completely different hash value. Some common hash algorithms include MD5 and SHA (Secure Hash Algorithm).
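
As an illustration of how hash values can identify duplicates and verify copies, here is a minimal Python sketch using the standard `hashlib` module. The file names and contents are invented, and the choice of SHA-256 as the default is an assumption; it hashes file content only, whereas a review platform may also take document properties into account.

```python
import hashlib
from collections import defaultdict

def hash_value(content: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal hash value of the given content."""
    return hashlib.new(algorithm, content).hexdigest()

def verify_copy(original: bytes, copy: bytes) -> bool:
    """A copy is verified when its hash value equals that of the original."""
    return hash_value(original) == hash_value(copy)

# Hypothetical files: two are identical, one differs by a single character.
files = {
    "report_v1.txt": b"Quarterly figures attached.",
    "copy_of_report.txt": b"Quarterly figures attached.",
    "report_v2.txt": b"Quarterly figures attached!",
}

# Group file names by hash value; any group with more than one
# member is a set of duplicates.
groups = defaultdict(list)
for name, content in files.items():
    groups[hash_value(content)].append(name)

duplicates = [names for names in groups.values() if len(names) > 1]
print(duplicates)
print(verify_copy(files["report_v1.txt"], files["report_v2.txt"]))
```

Changing a single character in `report_v2.txt` is enough to give it a different hash value, so it is not grouped with the other two files.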
HashingDetermining the hash value of a document.
HostingThe database is hosted on a server in order to keep the review platform populated with documents.
Image (copy)Also known as a forensic copy.
Image (file type)A flat 'copy' of the native file. Comparable to a scanned paper document saved as a picture. Some common types of images include JPEG (.jpg) and TIFF (.tif).
Intelligent PrioritisationA technology which relies on the decisions applied by human reviewers to a percentage of the document population, to suggest other documents for review. It is the most basic form of predictive coding.
Keyword SearchA search through the database by using words that are determined to be "key" to finding relevant documents within it.
KilobyteA kilobyte is 1,024 bytes, but the term is often used loosely as a synonym for 1,000 bytes. Commonly abbreviated as kB.
Linear ReviewA document review where a team reviews an entire data set without the use of predictive coding.
Load FileA file created from a database containing the Work Product, so it can be uploaded into a different system.
Machine LearningA type of artificial intelligence that allows a machine to learn and understand based upon human input. This is the underlying technology for Predictive Coding.
Mail containerA file type that holds many email message files. Some common mail containers include PST and OST (from Microsoft Exchange, Outlook) or NSF (Lotus Notes).
Megabyte1024 Kilobytes. Commonly abbreviated as MB.
MetadataInformation held within a document other than the actual content. This includes: create date, last modified, language, when an email was sent, and who was copied into the email.
Native ApplicationThe application needed to open the file in its native format. For example, the native application for an Excel spreadsheet is Microsoft Excel.
Native fileThe original format of a document, as created by its native application, for example Word, PDF, Excel, or PowerPoint.
NearlineA way of hosting data between warm and cold storage, which enables the user to reduce hosting fees while maintaining full flexibility over their dataset.
Near-duplicateTwo or more documents that are almost identical based on the text content. They may be in different file formats.
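
One simple way to score how similar two documents are on their text content is the Jaccard similarity of overlapping word sequences ("shingles"). The Python sketch below shows that technique only as an illustration; it is not the method of any particular review platform, and the shingle size of three words and the sample texts are assumptions.

```python
import re

def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(text_a, text_b):
    """Similarity between 0.0 (no shared shingles) and 1.0 (identical sets)."""
    a, b = shingles(text_a), shingles(text_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical document texts.
original = "Please find attached the signed merger agreement for your review"
revised = "Please find attached the signed merger agreement for your records"
unrelated = "Lunch menu for Friday"

print(round(jaccard(original, revised), 3))
print(jaccard(original, unrelated))
```

Because the comparison works on extracted text, the same score results whether the two documents are stored as Word files, PDFs or emails. The score above which two documents count as near-duplicates is a setting that has to be chosen for each project.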
OCROptical Character Recognition. Hard copy documents are scanned and the text extracted to make it searchable in the review platform.
Onsite collectionAn independent third party enters the premises of a client to collect data on their behalf.
Parent documentIn a family of documents, the email that holds an attachment is called the parent document.
Petabyte1024 Terabytes. Commonly abbreviated as PB.
Predictive CodingA type of machine-learning technology which enables a computer to predict how documents should be categorised based on how an expert lawyer in a particular case has trained it to do so. This technology can help clients find relevant data more consistently and cost-effectively, and potentially eliminate irrelevant data so they don’t have to spend time and money reviewing it.
Privileged DocumentsDocuments that are protected from disclosure to third parties because they either involve legal advice from a lawyer or were created for the dominant purpose of litigation.
Processing PhaseThe phase in the EDRM model where the data is being processed and uploaded to the review platform.
ProductionThe process by which certain documents that have been uploaded onto the database are produced. This may be in various formats, including USB stick or paper.
Project ManagerA person who oversees an ediscovery project and manages the database.
Quality ControlThe process by which documents which have been categorised in the database are checked for accuracy.
RAIDRedundant Array of Independent Disks. A storage method that combines multiple hard drives to reduce the risk of data loss when a hard drive fails. It is not a substitute for a Backup.
RAMAn acronym for Random Access Memory, the short-term memory of a device. Valuable data can be recovered from this with forensic tools.
RedactionsSensitive information can be redacted in documents when they are being reviewed. Redactions place a rectangular mark over the sensitive information so that it cannot be viewed by other parties.
ReportsReports can be run in review tools to provide an array of information about the progress of the document review, including how many documents are remaining and how many documents are being reviewed per hour.
Review PhaseOnce the data is uploaded to the review platform, the reviewers can start reviewing the documents.
ReviewedA document which has been viewed and categorised.
Sample SetA statistically valid random selection of documents that is used for two purposes in Predictive Coding. The first is to generate prevalence estimates for the composition of the data (how many relevant documents to expect) and the second is to test to see how well the system has learned for the purposes of Predictive Coding (how many of those relevant documents are we likely to find).
Sample SizeSee Sample Set. For the Sample Set to be statistically sound, the sample size needs to be calculated correctly. The calculation involves three numbers: the size of the data corpus, the confidence level and the margin of error. These three figures affect the amount of human input that is required for Predictive Coding, and the amount of confidence we can have in its results.
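
The glossary does not give a formula for the sample size. The Python sketch below uses one common statistical approach (Cochran's formula with a finite population correction) to show how the three figures interact. The function name, the worst-case assumption that half the documents are relevant, and the example corpus of 100,000 documents are all illustrative assumptions; a review platform may calculate the figure differently.

```python
import math
from statistics import NormalDist

def sample_size(corpus_size, confidence_level=0.95, margin_of_error=0.02,
                expected_proportion=0.5):
    """Estimate how many documents to sample from a corpus.

    expected_proportion=0.5 is the most conservative assumption
    (it gives the largest sample size).
    """
    # z-score for a two-sided interval at the given confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for an infinitely large corpus (Cochran's formula).
    n0 = (z ** 2) * expected_proportion * (1 - expected_proportion) \
        / margin_of_error ** 2
    # Finite population correction for the actual corpus size.
    n = n0 / (1 + (n0 - 1) / corpus_size)
    return math.ceil(n)

# Hypothetical corpus of 100,000 documents.
for margin in (0.05, 0.02, 0.01):
    print(margin, sample_size(100_000, margin_of_error=margin))
```

Under these assumptions, a tighter margin of error or a higher confidence level increases the number of documents that must be reviewed by a person, while the size of the corpus has comparatively little effect once it is large.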
Search HitA term that has been entered in a Keyword Search that matches with a word in the database.
Search TermSee Keyword Search.
Search Term ListA list of search terms in combination with connectors.
Second RequestA discovery procedure where the authorities investigate mergers and acquisitions which may have anticompetitive consequences on the market. These are often responded to utilising ediscovery technology.
Seed SetThe first step of Predictive Coding. It is often called "seeding" and involves selecting documents judgmentally via search, concept query, analytics, etc. to input good information so that the system can learn for the purposes of Predictive Coding.
Self-collectionWhen an organisation collects the needed data itself, instead of using an independent third party.
Slack SpaceThe unused space between the end of a file and the end of the storage space allocated to it on a hard drive, where parts of deleted documents may reside.
Stand-alone documentA document without any family members.
Structured DataData that is extracted from structured databases such as CRM systems. Data can be a mix of structured and unstructured data.
SyncingOnce a document is categorised all identical documents will automatically be categorised in the same way.
System AdministratorThe person within an organisation who holds all the IT passwords and has access to the data.
TagThe category applied to a document.
Terabyte1024 Gigabytes. Commonly abbreviated as TB.
Text-onlyA database where all the documents are uploaded with just their text and metadata, excluding the native files and TIFF images. This reduces the data volume and is therefore used to save on hosting costs.
TIFFTagged Image File Format is a file format in which documents can be viewed in a review platform (as well as native and text-only). Redaction, printing and highlighting are done in TIFF mode.
Topic GroupingDocuments grouped by topic.
Topical SearchA search based on topic grouping.
Training SetWhen using predictive coding, the training set is the minimum set of documents that has to be reviewed before the algorithm will be effective.
UnitisationThe process of splitting up one large document into separate, logically divided documents.
User GroupUsers who need to access documents in the database can be split into user groups, which will have different levels of access and permissions.
WorkflowThe design of the flow of the documents within the review tool. Documents can be allocated on an automated basis based on which user groups should be receiving which documents.
Work ProductInformation added to the documents in a Database during the Review Phase. This includes: categorisations, redactions, highlights, and comments.