What Is Mean Time to Repair (MTTR)? Explained

11 min read · Updated June 2026 · incident response · SOC · Detection Engineering

A ransomware operator lands on a finance workstation at 02:14. The SOC's sensor fires the same minute, but the alert sits in a queue until an analyst opens it at 02:51. Triage and escalation eat another 40 minutes. Containment, eradication, and full restoration finish at 06:30. That single incident took four hours and sixteen minutes from foothold to recovery, and every one of those minutes falls inside an interval that some metric is supposed to capture. Mean time to repair is the metric that captures the recovery portion, and it is the number a SOC manager is asked about first after an outage.

Mean time to repair (MTTR) is the average time it takes to restore a system to working order after a failure or security incident. You measure it by summing the repair time across every incident in a window, then dividing by the number of incidents. It answers one question: once we start fixing this, how long until the system is back. It says nothing about how long the problem went unnoticed, which is a different metric, and conflating the two is the most common mistake teams make when they report it.

This guide covers MTTR in the security operations context: the exact formula, the related metrics it is constantly confused with (MTTD, MTTA, MTBF, MTTF), where it fits in the incident timeline, and the levers a SOC actually pulls to bring it down. It is written for the people who own the number: SOC analysts, incident responders, and the managers who report these KPIs upward.

What mean time to repair measures

MTTR is a reliability and incident-response KPI borrowed from maintenance engineering. In its original sense it stands for mean time to repair: the average time spent actively fixing a failed component, from the moment work begins to the moment the component is operational again. It measures the speed of the fix, not the speed of noticing the problem.

The acronym MTTR is overloaded, and that is the root of most confusion. The same four letters are used for at least four different metrics:

Mean time to repair measures the active repair window only, from start of work to restored function. This is the strict maintenance definition.
Mean time to recovery (sometimes mean time to restore) measures from the moment the incident begins to the moment service is fully restored. It includes detection and response delay, not just the repair work.
Mean time to respond measures from alert to the start of remediation, capturing how fast the team mobilizes.
Mean time to resolve is the widest: from first alert through full restoration and the post-incident work that prevents a repeat.

When someone quotes an MTTR number, the first thing to pin down is which R they mean and when its clock starts. A 15-minute mean time to repair and a 15-minute mean time to resolve describe two very different operations. In security operations the number most teams actually care about is recovery, the wall-clock time from incident start to restored service, because that is the window in which an attacker has access and the business is degraded.

The MTTR formula

The formula is a plain average.

MTTR = total repair time / number of incidents

Sum the time spent across every incident in your measurement window, then divide by the count of incidents in that window. Worked example: a service fails four times during one workday and the team spends one hour fixing it each time. Total repair time is four hours. Divide by four incidents and the MTTR is one hour. If instead each fix took 15 minutes across those four incidents, total repair time is 60 minutes, divided by four incidents gives an MTTR of 15 minutes.
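The worked example can be reproduced in a few lines. This is a minimal sketch, not a reporting tool: the function name `mean_time_to_repair` and the two sample lists are illustrative assumptions, and the inputs are assumed to be repair durations already measured under one clock definition.

```python
from datetime import timedelta


def mean_time_to_repair(repair_times: list[timedelta]) -> timedelta:
    """Return total repair time divided by the number of incidents."""
    if not repair_times:
        raise ValueError("MTTR is undefined for a window with no incidents")
    return sum(repair_times, timedelta()) / len(repair_times)


# Four incidents at one hour each: 4 h / 4 incidents = 1 h.
hourly_fixes = [timedelta(hours=1)] * 4
# Four incidents at 15 minutes each: 60 min / 4 incidents = 15 min.
quick_fixes = [timedelta(minutes=15)] * 4

print(mean_time_to_repair(hourly_fixes))  # 1:00:00
print(mean_time_to_repair(quick_fixes))   # 0:15:00
```

An empty window raises an error instead of returning zero, because a period with no incidents has no MTTR and a zero would read as a perfect score.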

Two decisions make or break the number. First, define the clock precisely. Strict mean time to repair starts when remediation work begins; mean time to recovery starts when the incident begins. Pick one, write it down, and apply it the same way to every incident, or your trend line is measuring your bookkeeping, not your team. Second, segment by incident type. A single global MTTR that averages a password reset against a domain-wide ransomware event is a vanity number. Break it out by severity and by category so the figure means something.
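To see why a blended figure misleads, compare it with the per-category figures on a small sample. The data below is hypothetical, as are the category labels and function names; the sketch only illustrates the segmentation point.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical repair times in minutes, labelled by incident category.
repairs = [
    ("password_reset", 10),
    ("password_reset", 12),
    ("password_reset", 8),
    ("ransomware", 250),
]


def blended_mttr(rows):
    """One global average across every incident, regardless of type."""
    return mean([minutes for _, minutes in rows])


def mttr_by_category(rows):
    """Average repair time computed separately for each category."""
    groups = defaultdict(list)
    for category, minutes in rows:
        groups[category].append(minutes)
    return {category: mean(values) for category, values in groups.items()}


print(blended_mttr(repairs))      # 70
print(mttr_by_category(repairs))  # {'password_reset': 10, 'ransomware': 250}
```

The blended 70 minutes describes neither kind of incident: it is seven times the typical password reset and well under a third of the ransomware event.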

MTTR vs MTTD vs MTTA vs MTBF

MTTR is one interval in a chain of metrics that together describe an incident's full lifecycle. Reporting MTTR without its siblings hides where time is actually lost.

Metric	Full name	Measures	Clock
MTTD	Mean time to detect	Failure onset to discovery	Incident start to detection
MTTA	Mean time to acknowledge	Alert firing to analyst pickup	Alert to acknowledgement
MTTR	Mean time to repair / recovery	Repair work to restored service	Remediation start (or incident start) to restored
MTBF	Mean time between failures	Average uptime between failures	End of one failure to start of next
MTTF	Mean time to failure	Expected lifespan of a non-repairable asset	Deployment to first and final failure

The distinctions matter operationally:

MTTD (mean time to detect) is the dwell time before anyone knows there is a problem. A low MTTR with a high MTTD means you fix fast but detect slowly, so the attacker still had hours of undetected access. This is the number that threat detection and detection engineering work attack directly.
MTTA (mean time to acknowledge) is the gap between an alert firing and an analyst owning it. A high MTTA points at alert fatigue, understaffed shifts, or a noisy detection stack drowning real alerts.
MTBF (mean time between failures) measures reliability of repairable systems. Rising MTBF means a system is failing less often. It is a stability metric, not a response metric.
MTTF (mean time to failure) applies to assets you replace rather than repair, like a disk or a sensor, and estimates how long one lasts before it dies for good.

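A useful way to read them together: MTTD plus MTTA plus strict mean time to repair is roughly the total time from compromise to recovery, short of any triage gap between acknowledgement and the start of remediation. If your MTTR already means recovery from incident start, it contains the other two, so do not add them again. Improving only MTTR while ignoring MTTD optimizes the cheapest part of the timeline and leaves the expensive part untouched.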

Where MTTR fits in the incident timeline

Figure: Incident timeline, where each metric sits. MTTR is one bracket, not the whole line. Incident begins (02:14 foothold; recovery clock starts) → Detection (sensor fires; MTTD) → Acknowledge (02:51 analyst owns it; MTTA) → Repair & restore (06:30 service back; MTTR).

Lay the metrics on a single timeline and the overlaps become obvious. An incident begins, runs undetected for a while, gets detected, waits for acknowledgement, gets worked, and is finally restored. Each metric is a bracket over a different span of that line, and several of them overlap, which is exactly why a team has to define its clocks before it can trust a trend.

The order a SOC experiences it:

1. Incident begins. The compromise or failure starts. The clock for mean time to recovery starts here.
2. Detection. A sensor, correlation rule, or hunt surfaces it. The span from step 1 to here is MTTD.
3. Acknowledgement. An analyst picks up the alert and takes ownership. The span from detection to here is MTTA.
4. Repair. The team contains, eradicates, and restores. The span of this active work is strict mean time to repair.
5. Restored. Service is back. The span from step 1 to here is mean time to recovery; from first alert through the post-incident review is mean time to resolve.
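As a rough illustration, the spans in this sequence can be computed for the single incident in the opening example. The 03:31 remediation start is an assumption (the 02:51 acknowledgement plus the 40 minutes of triage and escalation), the date is a placeholder, and the dictionary keys are hypothetical names. These are per-incident spans; averaging each one across many incidents gives the corresponding "mean time to" metric.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp on a placeholder date."""
    return datetime.strptime(f"2026-06-01 {hhmm}", "%Y-%m-%d %H:%M")


timeline = {
    "incident_start": at("02:14"),     # foothold
    "detected": at("02:14"),           # sensor fires the same minute
    "acknowledged": at("02:51"),       # analyst opens the alert
    "remediation_start": at("03:31"),  # assumed: 40 min of triage after 02:51
    "restored": at("06:30"),           # service back
}


def span_minutes(start: str, end: str) -> int:
    """Whole minutes between two named events on the timeline."""
    return int((timeline[end] - timeline[start]).total_seconds() // 60)


ttd = span_minutes("incident_start", "detected")            # time to detect
tta = span_minutes("detected", "acknowledged")              # time to acknowledge
triage = span_minutes("acknowledged", "remediation_start")  # gap before the fix
repair = span_minutes("remediation_start", "restored")      # strict repair
recovery = span_minutes("incident_start", "restored")       # full recovery

print(ttd, tta, triage, repair, recovery)  # 0 37 40 179 256
```

The four spans sum to 256 minutes, the four hours and sixteen minutes of the opening example. The 40-minute triage gap belongs to none of MTTD, MTTA, or strict repair unless the clock definition assigns it somewhere, which is one reason to write that definition down.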

This is why MTTR alone is a partial story. A team can post an excellent repair time while its detection gap quietly grows, and the single MTTR number on the dashboard will not show it. The security operations center that reports the full chain catches that drift; the one that reports only MTTR does not.

How a SOC measures MTTR

You cannot improve a number you cannot measure cleanly, and MTTR is easy to measure badly. Three things have to be in place.

Accurate timestamps for every interval. Each transition in the timeline above needs a recorded, trustworthy time: incident start (often reconstructed during investigation), detection, acknowledgement, remediation start, and restoration. Most of these come from the SIEM, the EDR, and the case-management or ticketing system. If acknowledgement time is whenever an analyst remembers to update the ticket, the metric is fiction.
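One hedged way to catch untrustworthy timestamps before they reach the metric is an ordering check on each incident record. The field names below are hypothetical and would need mapping to whatever the SIEM, EDR, and ticketing exports actually provide; the sample records reuse the opening example with an assumed 03:31 remediation start.

```python
from datetime import datetime
from typing import Optional

# Timeline order from the text; the field names themselves are hypothetical.
ORDERED_FIELDS = [
    "incident_start", "detected", "acknowledged", "remediation_start", "restored",
]


def timestamp_problems(record: dict[str, Optional[datetime]]) -> list[str]:
    """Return readable problems; an empty list means the record is usable."""
    problems = []
    previous_name, previous_time = None, None
    for name in ORDERED_FIELDS:
        value = record.get(name)
        if value is None:
            problems.append(f"missing {name}")
            continue
        if previous_time is not None and value < previous_time:
            problems.append(f"{name} is earlier than {previous_name}")
        previous_name, previous_time = name, value
    return problems


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp on a placeholder date."""
    return datetime.strptime(f"2026-06-01 {hhmm}", "%Y-%m-%d %H:%M")


clean = {
    "incident_start": at("02:14"), "detected": at("02:14"),
    "acknowledged": at("02:51"), "remediation_start": at("03:31"),
    "restored": at("06:30"),
}
# Same incident, but the ticket was only marked acknowledged after the fact.
late_ticket = dict(clean, acknowledged=at("07:00"))

print(timestamp_problems(clean))        # []
print(timestamp_problems(late_ticket))  # ['remediation_start is earlier than acknowledged']
```

Records that fail the check are better excluded from the average, or corrected during investigation, than silently included.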

A consistent clock definition. Decide once whether MTTR means strict repair or full recovery, whether it counts calendar hours or business hours, and when exactly the start and stop events fire. Apply it identically to every incident. Most reporting discrepancies are not performance changes; they are two analysts measuring different spans and calling both MTTR.

Segmentation by severity and type. Track MTTR per incident class: phishing, malware, account compromise, denial of service. A blended average tells you nothing actionable. Segmented MTTR tells you which playbook is slow and where to invest.

With those in place, the math is the formula above: total repair (or recovery) time in the window, divided by incident count, computed per segment, trended over time.
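A minimal sketch of that computation, assuming closed incidents have already been exported as (month, category, minutes) rows under one clock definition. The rows, the month format, and the function name `mttr_trend` are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical closed incidents: (month, category, recovery time in minutes).
closed = [
    ("2026-04", "phishing", 50), ("2026-04", "phishing", 70),
    ("2026-04", "malware", 240),
    ("2026-05", "phishing", 40), ("2026-05", "phishing", 50),
    ("2026-05", "malware", 300), ("2026-05", "malware", 260),
]


def mttr_trend(rows):
    """Map each category to a month-ordered list of (month, MTTR in minutes)."""
    buckets = defaultdict(list)
    for month, category, minutes in rows:
        buckets[(category, month)].append(minutes)
    trend = defaultdict(list)
    # Sorting the (category, month) keys puts each category's months in order.
    for category, month in sorted(buckets):
        trend[category].append((month, mean(buckets[(category, month)])))
    return dict(trend)


for category, points in mttr_trend(closed).items():
    print(category, points)
```

In this sample, phishing improves from 60 to 45 minutes while malware worsens from 240 to 280, a divergence that a single blended monthly number would hide.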

How to reduce MTTR

Reducing MTTR is mostly about removing delay and hesitation from the response, not about working faster under pressure. The levers, in rough order of impact:

A defined incident response process. A team that follows a written incident response plan with per-scenario playbooks does not spend the first 20 minutes deciding what to do. The decisions were made in advance, so the clock spent on the live incident is execution, not improvisation. This is the single biggest lever.
Automation and orchestration. SOAR playbooks execute the repetitive containment and enrichment steps in seconds: isolate the host, disable the account, pull the related logs, open the case. Automating the mechanical work removes the queue time and the human latency that dominate most MTTR figures.
Visibility and the right tooling. You cannot restore what you cannot see. Centralized logging, EDR on every endpoint, and a SIEM that correlates across them shorten the investigation that sits inside the repair window. Less time spent figuring out scope is less time to recovery.
Tuned detections to cut noise. Every false positive an analyst chases is repair capacity spent on nothing. Tuning detections so real incidents surface cleanly shortens both acknowledgement and the investigation inside repair.
Skilled, available responders. Enough trained people on shift to act without escalating through three layers first. Staffing and on-call design show up directly in the number.
Post-incident review. Every incident is data. A short retrospective that asks where the time went, then fixes the slowest step, is how the next MTTR gets smaller. Teams that skip it repeat the same delays.
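The "where did the time go" question from the post-incident review can be sketched as a ranking of phases by average duration. The phase names and figures below are hypothetical; a real review would pull them from the recorded timestamps for each incident.

```python
from statistics import mean

# Hypothetical per-incident phase durations in minutes.
incidents = [
    {"detect": 0, "acknowledge": 37, "triage": 40, "repair": 179},
    {"detect": 5, "acknowledge": 55, "triage": 20, "repair": 60},
    {"detect": 2, "acknowledge": 48, "triage": 30, "repair": 40},
]


def phases_by_mean_duration(records):
    """Return (phase, mean minutes) pairs, slowest phase first."""
    phases = records[0].keys()
    averages = {phase: mean([r[phase] for r in records]) for phase in phases}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


for phase, minutes in phases_by_mean_duration(incidents):
    print(f"{phase:12s} {minutes:6.1f} min")
```

Here repair leads at 93 minutes on average, with acknowledgement second at roughly 47, so the queue gap is the next place to look once the repair playbook has been tightened.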

Machine learning and AI increasingly automate parts of this: triaging alerts, suppressing false positives, recommending or executing containment. Used well, they compress the human-latency portion of MTTR, which is usually the largest portion. Used badly, they add a new noisy alert source that raises MTTA instead.

Frequently asked questions

What is mean time to repair (MTTR)?

Mean time to repair (MTTR) is the average time required to restore a system to working order after a failure or incident. It is calculated by dividing the total repair time across all incidents in a period by the number of incidents. In security operations it measures the speed of recovery once a response is underway.

What is the difference between MTTR and MTTD?

MTTR (mean time to repair) measures how long it takes to fix and restore a system once a response begins. MTTD (mean time to detect) measures how long a problem goes unnoticed before anyone discovers it. A team can have a low MTTR and still suffer long undetected compromises if its MTTD is high, so the two are reported together.

What does the R in MTTR stand for?

The R is ambiguous and depends on the team. It can mean repair (the active fix window), recovery or restore (incident start to full restoration), respond (alert to start of remediation), or resolve (alert through post-incident work). Pin down which one is meant and when its clock starts before comparing any two MTTR figures.

What is a good MTTR for a SOC?

There is no universal target, because MTTR depends on incident type, environment, and how the clock is defined. A password reset and a domain-wide ransomware event have very different expected times. The useful approach is to segment MTTR by severity and category, set a baseline, and drive a downward trend within each segment rather than chasing a single industry number.

How is MTTR calculated?

Sum the repair or recovery time across every incident in your measurement window, then divide by the number of incidents. For example, four incidents each taking 15 minutes to fix gives a total of 60 minutes divided by four, an MTTR of 15 minutes. Calculate it per incident category rather than as one blended average so the figure is actionable.

Why does MTTR matter in security operations?

MTTR is a direct proxy for how long an attacker retains access and how long the business stays degraded after detection. A shorter recovery time means less data exfiltrated, less lateral movement, and less downtime. Reported alongside MTTD and MTTA, it tells a SOC manager where time is being lost across the full incident lifecycle.