
What Is a Deepfake Attack? How It Works and Defenses

12 min read · Updated June 2026 · Incident Response · Blue Team · Fundamentals · Threat Detection · Cyber Threat Intelligence

In January 2024, a finance worker in the Hong Kong office of the engineering firm Arup joined a routine video call. The chief financial officer was on it. So were several colleagues he recognized. They walked him through a set of confidential transactions, and over the course of the day he made 15 wire transfers totaling about 25 million US dollars to five Hong Kong bank accounts. Every face on that call was fake. Every voice was synthetic. The CFO and the colleagues were AI-generated reconstructions built from publicly available footage of real Arup executives. The money was gone before anyone noticed.

That is a deepfake attack. It takes an old crime, the fraudulent payment request, and removes the one defense employees still trusted: seeing and hearing the person who asked. For a blue teamer this is not a novelty. It is the latest delivery mechanism for fraud and intrusion, and it breaks the verification habits a security awareness program spent years building.

This guide covers what a deepfake attack is, the technology that makes it possible, the forms it takes, how it fits the wider attack chain, how to detect it, and how a defender actually builds a process that holds up when a familiar voice on the phone is lying.

What is a deepfake attack?

A deepfake is synthetic media (an image, an audio clip, or a video) generated by artificial intelligence to convincingly impersonate a real person, showing them saying or doing something they never said or did. A deepfake attack is the use of that synthetic media to deceive a target for a malicious end: authorizing a payment, handing over credentials, leaking data, or shaping opinion.

The term combines deep learning and fake. The defining feature is the same one that makes a deepfake attack dangerous: it targets human trust in a face and a voice. A spoofed email asks you to trust text. A deepfake asks you to trust your own eyes and ears, so resisting it means doubting them, which is a far higher bar for a person to clear in the moment.

This makes it a form of social engineering, not a software exploit. No firewall rule blocks a phone call. No patch closes the gap when the attacker's payload is a face the victim recognizes. The control surface is the human, and the attacker now has a tool that defeats the instinct most people rely on to tell real from fake.

The technology behind a deepfake

Two model families do most of the work, and knowing the difference matters for detection.

The first is the generative adversarial network, or GAN. A GAN pairs two neural networks against each other. One, the generator, produces fake images or audio. The other, the discriminator, tries to tell the fakes from real samples. Each correction makes the generator better, and after enough rounds the output is good enough to fool the discriminator, and often a person. GANs powered the first wave of convincing face swaps.
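For readers who want the formal version, the contest described above is usually written as the minimax objective from the original GAN formulation. The notation here is an addition, not something this guide defines elsewhere: D(x) is the discriminator's estimated probability that a sample x is real, G(z) is the generator's output for a random noise input z, and p_data and p_z are the real-data and noise distributions.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator pushes the value up by scoring real samples near 1 and fakes near 0. The generator pushes it down by producing samples the discriminator scores as real, which is the "each correction makes the generator better" loop in symbols.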

The second, and the one driving the current surge, is the diffusion model and the large generative models built on or alongside it. These are the engines behind most modern text-to-image and text-to-video tools and many voice-cloning tools. The practical shift is the input cost. Early deepfakes needed hours of target footage and real machine-learning skill. Today a usable voice clone can be produced from a short audio sample, and the tooling is packaged for people with no technical background. That collapse in cost and skill is why the threat moved from a research curiosity to a fraud commodity.

The creation process is consistent across both. The attacker collects source material on the target, a few clips of their voice or face pulled from earnings calls, conference talks, social media, or YouTube. A model learns the target's voice, appearance, and mannerisms. The attacker then drives the model to produce new media: the CFO saying words he never said, on a call he never joined.

Types of deepfake attacks

Deepfakes split by the medium they forge and by what the attacker is trying to achieve.

Type	Medium	Typical use	Primary goal
Voice cloning (vishing)	Audio	Fake call from a CEO or relative	Fraudulent payment, credential reset
Video impersonation	Video	Fake executive on a conference call	Wire fraud, access requests
Real-time face swap	Live video	Live deepfake during a meeting or interview	Fraud, fake job candidate, access
Synthetic identity	Image and document	Fake person for KYC or account fraud	Account opening, evasion
Disinformation media	Video and audio	Fake clip of a public figure	Manipulate opinion, move markets

A few matter more than their row suggests.

Voice cloning is the cheapest and most common. A cloned voice over a phone call carries no visual tells and exploits the trust people place in a familiar voice. It powers both corporate fraud and the family emergency scam, where a victim hears what sounds exactly like a relative in distress.

Video impersonation on a conference call is the Arup pattern, and it is the most expensive. The attacker often opens with a pretext message, frequently a payment request that reads like ordinary business, then uses the live or pre-rendered video call to overcome the target's doubt. The call is not proof. It is the manipulation.

Real-time face swap is the fastest-moving category. Attackers now use it to pass remote job interviews and onboard as fake employees, a tactic tied to state-backed operations placing operatives inside target companies. The face on the screen is not the person who took the job.

Synthetic identity and disinformation widen the blast radius beyond a single victim, from defrauding identity-verification systems to fabricating a clip of a public figure to move a stock or a vote.

How a deepfake attack fits the attack chain

Deepfake Attack Chain

The synthetic media is the delivery stage. The fake call gets the target to act. The real objective begins after.

01 Collect source media. Voice and face clips from calls, talks, social media.
02 Build the clone. Model learns voice, appearance, mannerisms.
03 Set the pretext. A written request that reads like ordinary business.
04 Deploy the fake call (exploit). Synthetic voice or video closes the ask.
05 Victim acts (payout). Wire sent, credentials reset, or access granted.
06 Cash out or go deeper. Account takeover, wire fraud, insider access.

Break the chain. Out-of-band verification at stage 05 stops the attack no matter how good the fake is. A single callback to a known number would have stopped the Arup transfers.

A deepfake is rarely the whole attack. It is the delivery and exploitation stage, the thing that gets the target to act, after which the real objective begins.

The pattern is consistent. The attacker gathers source media and intelligence on the target and the organization. They clone the voice or build the video model. They contact the victim, often with a written pretext first to set up the ask, then deploy the synthetic call to close it. The victim acts: the wire goes out, the credentials are reset, the access is granted. Then the attacker cashes out or moves deeper.

From there a deepfake feeds the same chains every social-engineering attack does. A cloned-voice help-desk call becomes an account takeover. A fake executive's request becomes a wire transfer that never comes back, the deepfake-era version of business email compromise, where the fraudulent call now backstops the fraudulent email. A fake job candidate becomes an insider with legitimate access. This is why deepfakes sit squarely inside AI social engineering: the AI does not breach the system, it manipulates the person who can.

How to detect a deepfake

Detection runs on two tracks: spotting the artifacts in the moment, and verifying through a channel the attacker does not control. The second is far more reliable than the first.

The in-the-moment tells still exist, though they shrink with every model release:

Visual artifacts. Unnatural or absent blinking, edges that blur where the face meets hair or neck, lighting and skin tone that do not match the scene, teeth and eyes that render poorly.
Audio artifacts. A flat or slightly robotic cadence, odd pacing, missing breaths, background audio that does not fit the supposed location.
Sync and behavior. Lip movement that lags the audio, a face that holds too still, emotion that does not match the words.
Context. An urgent, unusual, confidential request, especially one involving money or access, that pushes the target to skip normal checks. This is the strongest signal, because it is the social-engineering core under the synthetic media.
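The context signal in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a detector: the phrase lists, the `ContextFlags` type, and the `must_verify` rule are all hypothetical, and the matching is deliberately naive substring search.

```python
from dataclasses import dataclass

# Hypothetical phrase lists for illustration; a real program would tune these
# to the language its own finance and help-desk teams actually see.
URGENCY = ("urgent", "immediately", "right now", "today", "asap")
SECRECY = ("confidential", "secret", "keep this between", "do not tell")
MONEY_OR_ACCESS = ("wire", "transfer", "payment", "bank details",
                   "password", "mfa", "credential")
SKIP_CHECKS = ("no need to", "skip the", "bypass", "make an exception")


@dataclass(frozen=True)
class ContextFlags:
    urgent: bool
    secret: bool
    money_or_access: bool
    skip_checks: bool

    @property
    def count(self) -> int:
        # Number of context signals present, from 0 to 4.
        return sum((self.urgent, self.secret,
                    self.money_or_access, self.skip_checks))


def _mentions(text: str, phrases) -> bool:
    # Naive substring match; it will produce false positives and misses.
    return any(phrase in text for phrase in phrases)


def context_flags(request_text: str) -> ContextFlags:
    text = request_text.lower()
    return ContextFlags(
        urgent=_mentions(text, URGENCY),
        secret=_mentions(text, SECRECY),
        money_or_access=_mentions(text, MONEY_OR_ACCESS),
        skip_checks=_mentions(text, SKIP_CHECKS),
    )


def must_verify(flags: ContextFlags) -> bool:
    # Assumed policy: anything touching money or access is verified,
    # however calm the request sounds. The other flags only raise priority.
    return flags.money_or_access
```

The design choice worth noting is that `must_verify` ignores urgency and secrecy. Those flags help a reviewer prioritize, but a calm, well-worded request for a payment still goes through verification.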

The technical detection layer adds AI-based detectors that score media for manipulation, digital forensics that examine a file's integrity and provenance, and emerging content-authentication standards that attach verifiable provenance to genuine media. None of these is a complete answer on its own, and a detector trained on last year's models degrades against this year's.
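One small, concrete piece of the integrity work is fixing a cryptographic hash of a captured file at the moment of collection, so later analysis can show the sample has not changed. The sketch below uses Python's standard `hashlib`; the `custody_record` fields and function names are illustrative assumptions, not a forensic standard. A matching hash shows only that the file is unchanged since collection. It says nothing about whether the media is genuine.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large recordings do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path: Path, collected_by: str) -> dict:
    """Build a minimal record of what was collected, by whom, and when.

    The field names are hypothetical; adapt them to your case system.
    """
    return {
        "file": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": sha256_of_file(path),
        "collected_by": collected_by,
        "collected_at_utc": datetime.now(timezone.utc).isoformat(),
    }


def is_unmodified(path: Path, record: dict) -> bool:
    """Re-hash the file and compare against the hash taken at collection."""
    return sha256_of_file(path) == record["sha256"]
```

A typical use would be to call `custody_record` once when the recording is saved, store the result with the case, and call `is_unmodified` before the file is handed to a detector or to investigators.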

The reliable control is out-of-band verification. Detection of artifacts is a losing race as the models improve. Verification through an independent, known channel is not, because it does not depend on the quality of the fake at all.

How to defend against deepfake attacks

No single tool stops deepfake attacks, because they span technology, people, and process. The defense that actually holds is process, and it is cheaper than the fraud it prevents.

Process controls. These do the heavy lifting.

Verify money and access out of band, every time. Any payment, payment-detail change, or credential or MFA reset request gets confirmed through a separate known channel: a callback to a number from the corporate directory, not one supplied in the request. A single callback would have stopped the Arup transfers.
Require multi-party approval for high-value transactions. No single person, convinced by any call, should be able to move large sums alone.
Use a verification word or challenge for sensitive voice and video requests, agreed in advance and never shared over the channel being verified.
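The first two process controls above can be expressed as a simple release gate. The following is a sketch under assumptions: the `DIRECTORY` contents, the `APPROVAL_THRESHOLD` value, and the `PaymentRequest` fields are all hypothetical, and a real system would pull the directory from an authoritative identity source. The point it illustrates is that the callback number comes only from the directory, and any number supplied in the request is never used.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

# Hypothetical corporate directory; the number is a fictional placeholder.
DIRECTORY: Dict[str, str] = {"cfo@example.com": "+852-5550-0100"}

# Hypothetical amount at or above which two independent approvers are needed.
APPROVAL_THRESHOLD = 10_000


@dataclass
class PaymentRequest:
    requester: str
    amount: float
    # Recorded for the investigation only; never used for the callback.
    callback_number_in_request: Optional[str] = None
    callback_confirmed: bool = False
    approvers: Set[str] = field(default_factory=set)


def callback_number(req: PaymentRequest,
                    directory: Dict[str, str] = DIRECTORY) -> Optional[str]:
    """Return the number to call, taken only from the directory."""
    return directory.get(req.requester)


def may_release(req: PaymentRequest) -> Tuple[bool, str]:
    """Decide whether a payment may go out, with a reason for the audit log."""
    if callback_number(req) is None:
        return False, "requester not in directory"
    if not req.callback_confirmed:
        return False, "callback not confirmed"
    required = 2 if req.amount >= APPROVAL_THRESHOLD else 1
    # The requester cannot count as an approver of their own request.
    independent = req.approvers - {req.requester}
    if len(independent) < required:
        return False, f"needs {required} independent approver(s)"
    return True, "ok"
```

Because `may_release` never inspects how convincing the call was, the quality of the fake has no bearing on the outcome.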

Human controls.

Train for the deepfake era specifically. The old advice was to trust a phone call over an email. That advice is now wrong. Awareness programs need to teach that a familiar face or voice is no longer proof of identity.
Make reporting easy and blameless. The fastest way a SOC learns about a live campaign is a target who pauses, doubts, and reports instead of complying.

Technical controls.

Reduce the source material. Limit needlessly public audio and video of executives and finance staff, the raw input an attacker needs to build the clone.
Harden identity. Phishing-resistant authentication and strict verification for high-risk actions limit what a successful deception can reach.
Deploy detection where it fits. Use deepfake detection on high-risk channels and feed any confirmed indicators, accounts, numbers, and source media into your phishing and fraud workflows.

How a SOC handles a reported deepfake

When an employee reports a suspected deepfake, whether they complied or paused in time, the response follows a repeatable loop. This is the part most explainers skip.

Triage. Establish what was requested and whether the target acted. A deepfake that was caught is an intelligence win. One that succeeded is the start of a fraud and possibly an intrusion case.
Preserve evidence. Capture the recording, the call metadata, the originating numbers or accounts, and any pretext messages. This media is both evidence and a detection sample.
Scope it. If money moved, work with finance and the bank immediately, because speed is the only thing that recovers funds. If access was granted, treat it as a compromised account, and ask where else the same pretext was used and who else was contacted.
Contain and remediate. Reverse or freeze transfers where possible, reset affected credentials, revoke sessions, and block the indicators.
Hunt and improve. Search for the follow-on activity an attacker attempts after a successful deception: new logins, forwarding rules, access changes. Then feed the indicators into detections and seed proactive hunting.
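The hunt step above amounts to filtering account activity in a window after the deception. The sketch below assumes events have already been exported as dictionaries with `account`, `action`, and `time` keys; that schema, the action names in `FOLLOW_ON_ACTIONS`, and the 72-hour default window are illustrative assumptions, not the format of any particular SIEM.

```python
from datetime import datetime, timedelta

# Hypothetical action names; map them to whatever your log source emits.
FOLLOW_ON_ACTIONS = {
    "new_login",
    "forwarding_rule_created",
    "access_change",
    "mfa_reset",
}


def follow_on_events(events, account, incident_time,
                     window=timedelta(hours=72)):
    """Return suspicious events for one account after the incident.

    Keeps events whose action is in FOLLOW_ON_ACTIONS and whose time falls
    between incident_time and incident_time + window, oldest first.
    """
    end = incident_time + window
    hits = [
        e for e in events
        if e["account"] == account
        and e["action"] in FOLLOW_ON_ACTIONS
        and incident_time <= e["time"] <= end
    ]
    return sorted(hits, key=lambda e: e["time"])
```

Running the same filter for every account that received the pretext, not only the one that complied, is a cheap way to check whether the campaign reached further than the initial report suggests.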

When a deepfake leads to a confirmed transfer or account takeover, this rolls into full incident response. The skill that matters most is not running a detector. It is the analyst and the process that insist on verifying a high-stakes request through a channel the attacker cannot fake.

Getting started with deepfake defense

If you are building the capability, the work is in the process and the artifacts.

Map your high-risk requests. Identify every workflow where one person can move money, change payment details, or grant access on the strength of a call or message. Those are the targets.
Build out-of-band verification into each. Define the known-channel callback and the multi-party approval for each high-risk workflow, and make skipping it the exception that gets flagged.
Learn the artifacts. Practice spotting visual and audio tells on known deepfake samples, while treating them as a secondary check behind verification.
Trace a deepfake-driven fraud case. Follow a real incident from the pretext message through the synthetic call to the transfer, so you see the whole chain rather than just the fake.
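The first two steps above lend themselves to a simple inventory check. This sketch assumes each workflow has been recorded by hand with the fields shown; the `Workflow` type, its field names, and the example entries are hypothetical. It lists the workflows where a single convinced person could still act.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Workflow:
    name: str
    moves_money_or_access: bool  # can it move money, change payment details, or grant access?
    min_approvers: int           # people who must approve before it executes
    callback_required: bool      # is a known-channel callback mandatory?


def exposed(workflows: Iterable[Workflow]) -> List[str]:
    """Names of high-risk workflows lacking a callback or a second approver."""
    return [
        w.name for w in workflows
        if w.moves_money_or_access
        and (w.min_approvers < 2 or not w.callback_required)
    ]


# Hypothetical inventory for illustration.
INVENTORY = [
    Workflow("vendor bank detail change", True, 1, False),
    Workflow("wire transfer", True, 2, True),
    Workflow("help-desk MFA reset", True, 1, True),
    Workflow("meeting room booking", False, 1, False),
]
```

With this inventory, `exposed(INVENTORY)` returns the vendor bank detail change and the help-desk MFA reset, which are the two entries that still rest on one person's judgment.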

Frequently Asked Questions

What is a deepfake attack in simple terms?

A deepfake attack uses AI to fake someone's voice or face so convincingly that a victim believes they are talking to a real, trusted person. The attacker uses that fake call, video, or message to trick the target into sending money, sharing data, or granting access. It targets human trust rather than any software weakness.

How are deepfakes created?

An attacker collects samples of the target's voice or face from public sources like videos, calls, and social media, then uses AI models, originally generative adversarial networks and now diffusion and voice-cloning models, to learn and reproduce them. The cost has dropped sharply: a convincing voice clone can now be built from a short audio sample with no specialist skill.

Can a deepfake beat a video call?

Yes. The 2024 Arup fraud used deepfaked video of a CFO and colleagues on a live conference call to convince an employee to wire about 25 million US dollars. A video call is no longer proof of identity. High-stakes requests must be confirmed through a separate, known channel regardless of who appears on screen.

How can you detect a deepfake?

Look for artifacts such as unnatural blinking, blurred face edges, mismatched lighting, flat or oddly paced audio, and lip-sync that lags. The more reliable signal is context: an urgent, unusual request for money or access. Because detection of artifacts loses ground as models improve, out-of-band verification through a trusted channel is the dependable control.

What is the difference between a deepfake attack and phishing?

Phishing usually relies on fraudulent text, an email or message impersonating a trusted sender. A deepfake attack adds synthetic voice or video so the impersonation extends to a call or meeting. Deepfakes are often combined with phishing: a written pretext sets up the ask, and the deepfake call closes it.

How do you defend an organization against deepfake fraud?

Build process controls that do not depend on spotting the fake: verify all payment and access requests out of band through a known channel, require multi-party approval for large transfers, and use a pre-agreed verification word. Pair that with training that treats a familiar face or voice as unproven, and reduce the public audio and video that attackers use as source material.

The bottom line

A deepfake attack is social engineering with the last visual and audio defenses stripped away. The technology is now cheap, fast, and widely available, which is why a synthetic CFO on a video call could talk an employee out of 25 million dollars in a single day. It does not break systems. It breaks the human habit of trusting a face and a voice.

The defense is not a better detector, because that is a race against models that keep improving. It is process: verify every high-stakes request through a channel the attacker does not control, require more than one person to move money, and train people that seeing and hearing is no longer believing. The constraint, as always, is whether the process holds when the voice on the line sounds exactly like someone you trust.
