
What Is Cloud DLP? Discover, Classify, Enforce

Cloud DLP is the set of capabilities that discovers, classifies, monitors, and protects sensitive data inside cloud apps, storage, and email so it does not leak or get exfiltrated.

The data that leaves an organization rarely leaves through the front door. It leaves in a Google Drive link set to "anyone with the link." It leaves in a customer export attached to a personal Gmail. It leaves in an S3 bucket that someone made public to debug a deploy and forgot to lock back down. It leaves in a Slack message with an API key pasted into it. None of these trigger malware alerts. None of them look like an attack. They look like work, and that is exactly why the sensitive data is gone before anyone notices.

Cloud data loss prevention is the discipline of finding where sensitive data lives across cloud apps, storage, and email, deciding what is allowed to happen to it, and enforcing that decision before the data leaks or is exfiltrated. This guide covers what cloud DLP actually does: the three states of data it has to cover, how discovery and classification feed policy and enforcement, the cloud-specific problems that break traditional endpoint DLP, how it relates to CASB and DSPM, and what a defender does with the alerts it generates. It is written for the people who answer for the data after it moves: SOC analysts, threat hunters, and DFIR responders who have to explain what left and whether it mattered.

What is cloud DLP?

Cloud data loss prevention is the set of capabilities that discovers, classifies, monitors, and protects sensitive data inside cloud services, so that data does not leak or get exfiltrated through them. NIST defines data loss prevention as "a system's ability to identify, monitor, and protect data in use, data in motion, and data at rest." Cloud DLP is that capability applied where the data now lives: SaaS applications, object storage, collaboration suites, and email, rather than on a laptop or a file server inside a perimeter.

The distinction matters because the perimeter assumption is gone. Classic DLP watched the edges of a corporate network: an agent on the endpoint, a gateway on the egress link, a scanner on the mail relay. That model assumed data sat inside a boundary and you controlled every path out. In a cloud-first organization, the data sits in Microsoft 365, Salesforce, Workday, and a dozen object stores you provision with an API call. The paths out are sharing links, OAuth-connected third-party apps, and API access keys. Cloud DLP exists because the thing you are protecting and the thing you used to inspect at the boundary are now the same cloud service.

Cloud DLP is not one product either. It shows up as native controls inside a SaaS suite (Microsoft Purview, Google Workspace DLP), as a function of a cloud access security broker (CASB) sitting between users and cloud apps, and increasingly as a feature of broader data security platforms. What unites them is the job: know what sensitive data you hold, where it sits, and stop it from going where it should not.

The three states of data DLP covers

DLP protects data in three states, and the controls differ for each. The states come straight from the definition: data at rest, data in motion, and data in use. A cloud DLP program that only covers one of them leaves the other two open.

Data at rest is data sitting in storage: files in an S3 bucket, records in a database, documents in SharePoint or Google Drive, messages archived in a mailbox. Cloud DLP protects it by scanning the store to find sensitive content, checking how it is shared and who can reach it, and flagging or remediating exposure. A public bucket holding customer PII is a data-at-rest finding. So is a spreadsheet of credit card numbers shared with an external domain.

Data in motion is data moving across a network: an upload to a SaaS app, an email leaving the organization, a file synced to a personal cloud account, an API call carrying records out. Cloud DLP inspects these transfers and applies policy in line: block the upload, quarantine the email, strip the attachment, or allow it and log it. This is the state classic network DLP focused on, now applied to cloud egress paths rather than a single network gateway.

Data in use is data being actively worked on at an endpoint or in an application: copied to a clipboard, printed, screen-captured, or saved to removable media. This state is the hardest to cover in a pure-cloud model because it happens on the device, and it is where endpoint DLP agents still do most of the work. Cloud DLP overlaps here when the "use" happens inside a browser-based SaaS app, where a CASB or in-app control can govern download, copy, and share actions.

How cloud DLP works: discovery, classification, policy, enforcement

Cloud DLP pipeline: discover, classify, policy, enforce

01 Discover. Connect to cloud stores and SaaS over APIs. Scan every bucket, drive, mailbox, and table. Surface shadow data.
02 Classify. Label what is sensitive. Regex plus validation (Luhn) for structured data. ML and trained classifiers for unstructured.
03 Policy. Bind a data type to a condition to an action. PII, PCI, PHI, secrets mapped to the rule that fires.
04 Enforce. Act on a match: block, quarantine, encrypt, redact, or alert. Start in monitor mode, then turn on blocking.

Enforcement actions: block (stop the transfer or share), quarantine (hold pending review), encrypt (authorized parties only), redact or mask (strip the sensitive elements), alert (allow, record, notify). Bad classification upstream makes every action unsafe, so tune before you block.

Every cloud DLP deployment runs the same four stages in order. Get the early stages wrong and the later ones fail with them: bad classification produces either a flood of false positives or silent misses, and taking no enforcement action at all is safer than firing the wrong one on a misclassified file.

Discovery finds where sensitive data actually is. The tool connects to cloud stores and SaaS apps over their APIs and scans content: every bucket, drive, mailbox, and table it can reach. Discovery is what surfaces "shadow data," the copies of sensitive records sitting in places no one is tracking: a test database cloned from production, an export dropped in a personal drive, a backup in a forgotten bucket. You cannot protect data you do not know you have, so discovery is the stage everything else depends on.
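One way to picture how discovery surfaces shadow data is to fingerprint content and look for identical copies outside the locations you already track. The sketch below assumes the inventory has already been listed through provider APIs; `StoredObject`, `find_shadow_copies`, and the sample locations are hypothetical names for illustration, and exact hashing only catches byte-identical copies, not edited or re-exported ones.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    location: str   # hypothetical path such as "s3://prod-exports/customers.csv"
    content: bytes  # a real connector would stream this from the provider API


def find_shadow_copies(objects, tracked_prefixes):
    """Return {tracked location: [untracked locations holding identical bytes]}."""
    by_digest = defaultdict(list)
    for obj in objects:
        by_digest[hashlib.sha256(obj.content).hexdigest()].append(obj.location)

    def is_tracked(location):
        return any(location.startswith(prefix) for prefix in tracked_prefixes)

    shadows = {}
    for locations in by_digest.values():
        tracked = [loc for loc in locations if is_tracked(loc)]
        untracked = [loc for loc in locations if not is_tracked(loc)]
        if tracked and untracked:
            for loc in tracked:
                shadows[loc] = sorted(untracked)
    return shadows


# Made-up sample inventory: one governed export and one copy in a personal drive.
inventory = [
    StoredObject("s3://prod-exports/customers.csv", b"id,name\n1,Ann\n"),
    StoredObject("drive://personal/jdoe/customers-copy.csv", b"id,name\n1,Ann\n"),
    StoredObject("s3://prod-exports/readme.txt", b"export job docs\n"),
]
shadow_report = find_shadow_copies(inventory, tracked_prefixes=("s3://prod-exports/",))
```

Here `shadow_report` maps the governed export to the untracked copy in the personal drive, which is the kind of finding the later stages then classify and act on.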

Classification decides what is sensitive and how. The two common methods are pattern matching and machine learning. Pattern matching uses regular expressions and validation logic to catch structured data: a regex for credit card numbers backed by a Luhn checksum, a pattern for Social Security numbers, formats for passport and national ID numbers. Machine learning and trained classifiers handle the unstructured cases a regex cannot (a contract, source code, a medical record written in prose) by learning what the category looks like rather than matching a fixed shape. Good classification combines both and tunes them, because a raw regex for nine-digit numbers will flag every nine-digit order ID in the company.
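The "regex plus Luhn" pairing for card numbers can be sketched in a few lines. This is a minimal illustration, not any vendor's classifier: the candidate pattern (13 to 19 digits with optional single spaces or hyphens) and the names `CARD_CANDIDATE`, `luhn_valid`, and `find_card_numbers` are assumptions chosen for the example.

```python
import re

# Candidate shape: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit strings that match the card shape and pass the Luhn check."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


# The first number is a widely used test card number; the second is an
# arbitrary 16-digit order ID that fails the checksum.
sample = "Card 4111 1111 1111 1111 on order 1234567890123456"
```

Calling `find_card_numbers(sample)` keeps the test card number and drops the order ID, which is the false positive the checksum exists to remove.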

Policy is the rule set that says what is allowed. A policy binds a data type and a condition to an action: "if a file classified as PCI is shared with an external domain, block and alert," or "if a document classified as PHI is uploaded to an unmanaged app, quarantine it." Policies encode the organization's actual obligations (regulatory, contractual, and internal) into machine-checkable rules.
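The two example rules above can be written as data to show what "machine-checkable" means in practice. This is a sketch under stated assumptions: `Event`, `Policy`, `INTERNAL_DOMAINS`, and the field names are hypothetical and do not correspond to any product's policy schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Event:
    data_type: str            # label produced by classification, e.g. "PCI"
    activity: str             # "share" or "upload"
    destination_domain: str
    app_managed: bool


@dataclass(frozen=True)
class Policy:
    name: str
    data_type: str
    condition: Callable[[Event], bool]
    actions: tuple[str, ...]


INTERNAL_DOMAINS = {"example.com"}  # assumption: the organization's own domains

POLICIES = [
    Policy("pci-external-share", "PCI",
           lambda e: e.activity == "share"
           and e.destination_domain not in INTERNAL_DOMAINS,
           ("block", "alert")),
    Policy("phi-unmanaged-upload", "PHI",
           lambda e: e.activity == "upload" and not e.app_managed,
           ("quarantine",)),
]


def evaluate(event: Event) -> Optional[Policy]:
    """Return the first policy whose data type and condition both match."""
    for policy in POLICIES:
        if policy.data_type == event.data_type and policy.condition(event):
            return policy
    return None


external_share = Event("PCI", "share", "partner.example.net", app_managed=True)
internal_share = Event("PCI", "share", "example.com", app_managed=True)
```

`evaluate(external_share)` returns the PCI policy with its block and alert actions, while `evaluate(internal_share)` returns `None` because the condition does not hold.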

Enforcement is what the tool does when a policy matches. The common actions are block (stop the transfer or share outright), quarantine (move the data to a holding location pending review), encrypt (apply protection so only authorized parties can open it), redact or mask (remove the sensitive elements and pass the rest), and alert (let it proceed but record it and notify). Most programs start in alert-only "monitor mode" to measure the false-positive rate before turning on blocking, because a block on a misclassified file stops legitimate work and trains people to route around the control.
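Monitor mode can be thought of as a switch that downgrades every configured action to an alert while still recording what would have happened. The sketch below is one possible shape for that switch; the `Enforcer` class and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Enforcer:
    monitor_mode: bool = True                      # start in alert-only mode
    audit_log: list = field(default_factory=list)

    def enforce(self, policy_name: str, configured_action: str, item: str) -> str:
        """Apply the configured action, or downgrade it to an alert in monitor mode."""
        effective = "alert" if self.monitor_mode else configured_action
        # Log both the configured and the effective action so the impact of
        # turning on blocking can be measured before it is turned on.
        self.audit_log.append({
            "policy": policy_name,
            "item": item,
            "configured": configured_action,
            "effective": effective,
        })
        return effective


enforcer = Enforcer()
first = enforcer.enforce("pci-external-share", "block", "q3-invoices.xlsx")
enforcer.monitor_mode = False  # flipped only after the policy has been tuned
second = enforcer.enforce("pci-external-share", "block", "q3-invoices.xlsx")
```

The first call only alerts and the second blocks; the audit log from the monitor period shows which blocks would have landed on legitimate work.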

Why the cloud breaks traditional DLP

The cloud does not just move DLP to a new place. It changes the problem in ways that defeat the endpoint-and-gateway model.

SaaS sprawl. Sensitive data spreads across dozens or hundreds of SaaS apps, many adopted by a team without security's knowledge. Each app has its own sharing model, its own API, and its own idea of what "external" means. There is no single egress point to inspect, so coverage means integrating with each app rather than watching one gateway.

Shadow data. Cloud makes copying data trivial. A snapshot, a clone, a CSV export, a sync to a personal account: each creates a new copy in a new place. These copies are the ones that leak, precisely because no one is governing them. Discovery is the only thing that finds them, and in the cloud the volume of places to look is enormous.

Multi-cloud inconsistency. Data in AWS, Azure, and Google Cloud sits behind three different storage models, three permission systems, and three sets of native controls. A policy that works in one does not translate cleanly to another, and centralized visibility across all three is hard to assemble. Misconfiguration is widely cited as a leading root cause of cloud data exposure, and a single public bucket or over-permissive share can undo the rest of the program.

Sharing links and OAuth grants. The most common cloud leak does not involve an attacker. It is a "share with anyone who has the link" toggle, or a third-party app a user granted broad OAuth scopes to read their entire drive. These are sanctioned features doing exactly what they were built to do, which is why they slip past controls aimed at malicious behavior.
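Governing the features themselves usually starts with an audit of sharing settings and app grants. The sketch below assumes that metadata has already been exported from the SaaS provider; the record fields, the scope names in `BROAD_SCOPES`, and the `audit_sharing` function are placeholders, not a real provider's schema.

```python
# Assumption: these scope names are placeholders for "can read everything" scopes.
BROAD_SCOPES = {"drive.read_all", "mail.read_all"}


def audit_sharing(files, oauth_grants):
    """Return findings for public links and broadly scoped third-party grants."""
    findings = []
    for f in files:
        if f["link_access"] == "anyone_with_link":
            findings.append(("public_link", f["name"]))
    for g in oauth_grants:
        broad = sorted(BROAD_SCOPES & set(g["scopes"]))
        if broad:
            findings.append(("broad_oauth_grant", g["app"], tuple(broad)))
    return findings


# Made-up sample metadata for illustration.
files = [
    {"name": "customer-export.csv", "link_access": "anyone_with_link"},
    {"name": "roadmap.docx", "link_access": "restricted"},
]
oauth_grants = [
    {"app": "pdf-converter", "user": "jdoe", "scopes": ["drive.read_all"]},
    {"app": "calendar-helper", "user": "jdoe", "scopes": ["calendar.read"]},
]
findings = audit_sharing(files, oauth_grants)
```

Neither finding involves malicious behavior, which is why a check like this looks at configuration rather than at traffic.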

The thread is that cloud data leaves through legitimate, sanctioned channels far more often than through an exploit. Cloud DLP has to govern the features themselves, not just watch for attacks.

Cloud DLP vs CASB vs DSPM

These three terms get used interchangeably and they are not the same. They overlap, and a mature program often runs all three, but they answer different questions.

Cloud DLP answers "is sensitive data about to leave or be exposed, and what do I do about it?" It is the data-type-aware policy and enforcement layer: classify the content, match a rule, take an action.

CASB answers "what is happening between my users and my cloud apps, and can I control it?" A cloud access security broker, a category Gartner defined in 2012, sits between users and cloud services as a policy enforcement point. It governs access, surfaces shadow IT, and enforces controls on app usage. DLP is one of the functions a CASB performs, which is why the two blur, but a CASB is the broader broker and DLP is the specific data-protection capability.

DSPM answers "where is all my sensitive data, who can reach it, and how exposed is it?" Data security posture management, a term Gartner introduced in its 2022 Hype Cycle for Data Security, is discovery and risk assessment at scale: it maps where sensitive data lives across cloud and SaaS, who has access, and where the exposure is, continuously. DSPM is heavy on discovery and posture; it tells you the risk. DLP acts on the data itself to prevent the loss.

| Dimension | Cloud DLP | CASB | DSPM |
| --- | --- | --- | --- |
| Core question | Is data leaking, and stop it | What are users doing with cloud apps | Where is sensitive data and how exposed |
| Primary focus | Content classification + enforcement | Access broker between users and apps | Data discovery + posture/risk |
| Acts on | Data in motion, at rest, in use | App usage and access | Data at rest (mostly) |
| Typical action | Block, quarantine, encrypt, alert | Allow, block, govern app session | Surface risk, recommend remediation |
| Coined / origin | Long-standing data security control | Gartner, 2012 | Gartner, 2022 |
| Best at | Stopping a specific leak | Controlling SaaS usage | Finding shadow data and exposure |

The honest summary: DSPM finds and rates the data, CASB controls how users reach the apps, and DLP classifies content and enforces what happens to it. They are layers, not competitors. A program with DSPM but no DLP knows its risk and cannot stop a leak; a program with DLP but no discovery enforces policy on the data it happens to see and misses the shadow copies.

Common cloud DLP policies

Most cloud DLP programs are built around a handful of data categories, each tied to a regulation or a contract that defines what must be protected.

PII (personally identifiable information). Names, government identifiers, dates of birth, contact details. The data covered by privacy regimes like the GDPR. Policies typically watch for PII leaving to external domains, landing in unmanaged apps, or sitting in publicly shared stores.

PCI (payment card data). Cardholder data, which PCI DSS defines as at minimum the full Primary Account Number (PAN), alone or combined with cardholder name, expiration date, or service code. Policies also cover sensitive authentication data such as card verification codes and PINs, which PCI DSS treats as a separate category. Classification leans on regex plus a Luhn check, and policies almost always block, because card data has no business sitting in a chat message or a shared drive.

PHI (protected health information). Medical records and health data covered by HIPAA in the US. PHI is often unstructured prose, which is why this category leans hardest on trained classifiers rather than patterns alone.

Secrets and credentials. API keys, access tokens, private keys, and passwords committed to repositories, pasted into chat, or exposed in logs and API responses. This category has grown fast because cloud runtime leaks, a credential exposed in a log line or an API response through a misconfiguration, are a common path to compromise. Secret-scanning policies look for the distinctive shapes of keys and tokens and tend to alert or block on the spot. Encrypting data at rest reduces the blast radius when a store is exposed; see cloud encryption for how key management and encryption fit a data-protection program.
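Secret scanning by "distinctive shape" can be illustrated with two patterns: the `AKIA` prefix that AWS long-term access key IDs use, and the header line of a PEM private key. This is a deliberately small sketch; real scanners carry many more patterns and often add entropy or validity checks, and `SECRET_PATTERNS` and `scan_for_secrets` are names invented for the example.

```python
import re

# Illustrative patterns only, not a complete rule set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan_for_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern name, redacted match) pairs so the finding does not re-leak the value."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            findings.append((name, value[:4] + "*" * (len(value) - 4)))
    return findings


# The key below is the placeholder access key ID used in AWS documentation.
chat_message = "deploy creds: AKIAIOSFODNN7EXAMPLE please rotate after use"
```

`scan_for_secrets(chat_message)` reports one `aws_access_key_id` finding with everything after the first four characters masked, so the alert itself is safe to store and forward.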

The categories overlap and a single file can hit several. The work is less in listing them than in tuning the classifiers so each policy fires on the real thing and not on its look-alikes.

The defender's view: tuning and false positives

Cloud DLP is a detection system, and like any detection system its value is set by its false-positive rate. A DLP alert is a signal that sensitive data may have done something it should not have. The analyst's job is to decide whether it mattered, and that job is only possible if the signal is trustworthy.

The dominant failure mode is over-classification. A regex for credit card numbers with no Luhn check flags every 16-digit number. A PII classifier with a loose threshold flags every spreadsheet with a "name" column. When the queue fills with these, analysts stop reading the queue, and the one real exfiltration scrolls past with the noise. This is why mature programs run in monitor mode first, measure the false-positive rate per policy, tune the classifiers and conditions, and only then turn on blocking, starting with the highest-confidence policies (secrets, PCI) where the patterns are tight.
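Measuring the false-positive rate per policy during the monitor period can be as simple as counting analyst verdicts. The sketch below assumes each triaged alert carries a `policy` name and a `verdict` field; those field names, the sample counts, and the 5% promotion threshold are all assumptions for illustration, not recommended values.

```python
from collections import Counter


def false_positive_rates(triaged_alerts):
    """Return {policy: share of triaged alerts that analysts marked false positive}."""
    totals, false_positives = Counter(), Counter()
    for alert in triaged_alerts:
        totals[alert["policy"]] += 1
        if alert["verdict"] == "false_positive":
            false_positives[alert["policy"]] += 1
    return {policy: false_positives[policy] / totals[policy] for policy in totals}


def ready_to_block(rates, max_fp_rate=0.05):
    """Policies whose measured false-positive rate is at or below the threshold."""
    return sorted(p for p, rate in rates.items() if rate <= max_fp_rate)


# Made-up triage results from a monitor-mode period.
triaged = (
    [{"policy": "secrets", "verdict": "true_positive"}] * 20
    + [{"policy": "pii-loose", "verdict": "false_positive"}] * 9
    + [{"policy": "pii-loose", "verdict": "true_positive"}] * 1
)
rates = false_positive_rates(triaged)
```

With this sample, `ready_to_block(rates)` returns only the secrets policy; the loose PII policy stays in monitor mode until its classifier and conditions are tightened.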

What a DLP alert gives a defender is a specific, investigable artifact: a data type, a source, a destination, a user, a timestamp, and an action taken. That maps directly onto an investigation: what data, whose account, leaving to where, allowed or blocked. The same alert stream that prevents a leak in real time is the evidence trail that reconstructs one after the fact, which is why DLP findings feed both the prevention side and the data breach investigation side of a security program. Tune it so the signal is real, and a DLP queue is one of the better-shaped data sources a SOC has, because every entry already carries the who, what, and where an investigation starts from.
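The fields listed above map naturally onto a record type, and an investigation often starts as a filter and sort over those records. A minimal sketch, assuming alerts have already been normalized into this shape; `DlpAlert`, `timeline_for`, `left_the_org`, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DlpAlert:
    data_type: str
    source: str
    destination: str
    user: str
    timestamp: datetime
    action: str  # "blocked" or "allowed"


def timeline_for(alerts, user):
    """One user's alerts in chronological order."""
    return sorted((a for a in alerts if a.user == user), key=lambda a: a.timestamp)


def left_the_org(alerts, user):
    """Alerts where the transfer was allowed, meaning the data actually moved."""
    return [a for a in timeline_for(alerts, user) if a.action == "allowed"]


# Made-up sample alerts for illustration.
alerts = [
    DlpAlert("PCI", "sharepoint://finance/cards.xlsx", "gmail.com", "jdoe",
             datetime(2024, 5, 2, 14, 30), "blocked"),
    DlpAlert("PII", "drive://exports/customers.csv", "gmail.com", "jdoe",
             datetime(2024, 5, 2, 9, 15), "allowed"),
    DlpAlert("PII", "drive://hr/roster.csv", "partner.example.net", "asmith",
             datetime(2024, 5, 1, 11, 0), "allowed"),
]
```

`timeline_for(alerts, "jdoe")` reconstructs the sequence of events for one account, and `left_the_org(alerts, "jdoe")` narrows it to what was not stopped.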

The bottom line

Cloud DLP discovers where sensitive data lives across cloud apps, storage, and email, classifies what is sensitive, applies policy, and enforces it with block, quarantine, encrypt, or alert before the data leaks or is exfiltrated. It covers three states of data, at rest, in motion, and in use, and it exists because the cloud erased the network perimeter that traditional DLP inspected. The data leaves through sharing links, OAuth grants, misconfigured buckets, and personal accounts, which look like normal work, so the control has to govern the features themselves.

It is not the same as CASB or DSPM. DSPM finds and rates the data, CASB controls how users reach the apps, and DLP classifies content and enforces what happens to it. They layer. For a defender, the payoff is a detection system whose every alert already carries a data type, a source, a destination, a user, and a timestamp. Tune it so the signal is trustworthy, and the same stream that stops a leak in real time becomes the evidence that reconstructs one after the fact.

Frequently asked questions

What is cloud DLP in simple terms?

Cloud DLP is the practice of finding sensitive data across cloud apps, storage, and email, deciding what is allowed to happen to it, and enforcing that decision before the data leaks or is stolen. It discovers where data lives, classifies what is sensitive, applies policies, and takes actions like block, quarantine, encrypt, or alert. It is data loss prevention applied where data now lives, in the cloud rather than on a corporate network.

What are the three states of data that DLP protects?

Data at rest (sitting in storage like buckets, drives, and mailboxes), data in motion (moving across a network, such as an upload or an outbound email), and data in use (being actively worked on at an endpoint, such as copied or printed). Cloud DLP scans data at rest for exposure, inspects data in motion to block leaks, and overlaps with endpoint controls for data in use.

How does cloud DLP classify sensitive data?

It uses two main methods. Pattern matching with regular expressions and validation logic catches structured data like credit card numbers (a regex plus a Luhn check) and Social Security numbers. Machine learning and trained classifiers handle unstructured content like contracts, source code, and medical records that a fixed pattern cannot match. Good programs combine both and tune them to cut false positives.

What is the difference between cloud DLP, CASB, and DSPM?

Cloud DLP classifies content and enforces what happens to it (block, quarantine, encrypt, alert). A CASB is a broker between users and cloud apps that governs access and app usage, with DLP as one of its functions. DSPM continuously discovers where sensitive data lives across cloud and SaaS and rates its exposure. DSPM finds and rates the data, CASB controls how users reach apps, and DLP acts to prevent the loss.

What policies does cloud DLP commonly enforce?

The common categories are PII (personal data covered by privacy laws like GDPR), PCI (payment card data defined by PCI DSS), PHI (health data covered by HIPAA), and secrets (API keys, tokens, and passwords). Each binds a data type to a condition and an action, for example, block payment card data shared to an external domain, or alert on an API key pasted into chat.

Why is cloud DLP harder than traditional DLP?

The cloud removes the single network perimeter that classic DLP inspected. Data spreads across many SaaS apps with different sharing models, copies multiply as shadow data, multi-cloud environments use inconsistent controls, and the most common leaks happen through sanctioned features like public sharing links and OAuth grants rather than through attacks. Coverage means integrating with each cloud service rather than watching one gateway.
