
What Is Application Monitoring? Metrics, Logs, Traces

Application monitoring is the practice of instrumenting a running application to track its availability, performance, errors, and behavior using metrics, logs, and traces.

A checkout service starts returning HTTP 500s on one in twenty requests. The host is up, CPU is flat, the firewall logs nothing. Infrastructure monitoring says the box is healthy, and it is right. The failure lives one layer up, inside the application: a database connection pool that ran dry after a deploy. Nothing at the host or network layer would ever show it. The only signal that catches it is the application reporting on its own behavior: the error rate climbing, the request latency spiking, the trace that ends at a timed-out query. That signal is what application monitoring produces.

Application monitoring is the practice of instrumenting a running application so you can see its health and behavior from the outside: how fast it responds, how often it fails, what it is doing internally on each request, and whether any of that has changed. It is built on three kinds of telemetry (metrics, logs, and traces) and is sometimes called application performance monitoring (APM). This guide covers what application monitoring is, the three telemetry types and how they differ, the signals worth alerting on, how it relates to infrastructure monitoring and full observability, and why a SOC analyst should care about a discipline that started life on the performance-engineering side of the house. It is written for the people who read this telemetry under pressure: SOC analysts, detection engineers, and incident responders.

What is application monitoring?

Application monitoring is the process of collecting and analyzing telemetry from a running application to track its availability, performance, errors, and behavior over time, so problems are caught and diagnosed before they reach the user. The subject is the application itself, the code and the services it is built from, not the server it runs on or the network it sits behind. The questions it answers are concrete: is the app up, is it fast, is it failing, and if it changed, what changed and when.

The term is often used interchangeably with application performance monitoring (APM), and in practice they point at the same thing. APM is the older, vendor-driven label that grew out of performance engineering, where the goal was to keep response times low and find the slow code path. Application monitoring is the broader reading of the same job: not only is it fast, but is it healthy, correct, and behaving the way it did yesterday. The distinction rarely matters in conversation. The mechanism is identical: instrument the app, emit telemetry, collect it centrally, and watch it.

That telemetry comes in three forms (metrics, logs, and traces), and an application monitoring setup almost always uses all three. They are not interchangeable. Each answers a different question, and the skill is knowing which one to reach for when something breaks.

Metrics, logs, and traces: the three telemetry types

[Figure: Application Monitoring: The Three Telemetry Types. Panels: 01 Metrics (something changed, and when); 02 Traces (which request, which service); 03 Logs (exactly what happened). Caption: together they narrow an incident from "something is wrong" to "this line of code."]

These three are commonly called the three pillars of observability, a framing popularized by the observability tooling community and adopted by OpenTelemetry, the vendor-neutral standard for application telemetry. OpenTelemetry treats metrics, logs, and traces as the core signals an instrumented application emits. They overlap less than the name suggests, and confusing them is a common reason an investigation stalls.

Metrics are numeric measurements aggregated over time: request rate, error count, response latency at the 95th percentile, CPU saturation, queue depth. OpenTelemetry defines a metric as "a measurement about a service, captured at runtime." Metrics are cheap to store and fast to query because they are just numbers over time, which makes them what dashboards and alerts are built on. Their limit is that they tell you something changed, not why. A latency graph that spikes at 14:02 tells you when, not which request or what it was doing.
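As a rough illustration of "just numbers over time," the sketch below reduces a window of raw request samples to three of the metrics named above. The `Sample` type, the field names, and the nearest-rank percentile method are assumptions made for this example, not part of OpenTelemetry or any particular APM product.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float   # seconds since the start of the window (illustrative)
    latency_ms: float
    status: int        # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of an empty window is undefined")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def summarize(samples, window_s):
    """Collapse one window of samples into aggregate metrics."""
    latencies = [s.latency_ms for s in samples]
    errors = sum(1 for s in samples if s.status >= 500)
    return {
        "request_rate": len(samples) / window_s,
        "error_count": errors,
        "p95_latency_ms": percentile(latencies, 95),
    }


# Synthetic window: 60 requests in 60 seconds, every twentieth one fails.
samples = [
    Sample(float(i), 100.0 + i, 500 if i % 20 == 0 else 200)
    for i in range(60)
]
summary = summarize(samples, window_s=60)
print(summary)
```

The summary is compact enough to graph and alert on, and it also shows the limit described above: once the samples are aggregated, the identity of the failing requests is gone.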

Logs are timestamped records of discrete events, emitted by the application as it runs: a request received, a query executed, an exception thrown with its stack trace, a user authenticated. OpenTelemetry defines a log as "a timestamped text record, either structured (recommended) or unstructured, with optional metadata." Logs carry the detail metrics lack (the actual error message, the offending parameter, the user ID), but they are voluminous and expensive to search at scale. They are where you go once a metric tells you something is wrong and you need to know what. These are the same application logs that feed detection and incident response, which is the first overlap with security.
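To make the "structured (recommended)" part concrete, here is a minimal sketch that emits one JSON log event using only Python's standard `logging` module. The `JsonFormatter` class, the `fields` convention for extra attributes, and the `user_id` and `order_id` values are assumptions for illustration; a real service would more likely use an established structured-logging library or an OpenTelemetry log exporter.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra context passed via extra={"fields": {...}} (a convention
        # invented for this sketch).
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)


stream = io.StringIO()              # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

try:
    raise TimeoutError("query timed out waiting for a pooled connection")
except TimeoutError:
    logger.error(
        "order lookup failed",
        exc_info=True,
        extra={"fields": {"user_id": "u-123", "order_id": "o-456"}},
    )

event = json.loads(stream.getvalue())
print(event["level"], event["message"], event["user_id"])
```

Because every field is a named key rather than a substring of free text, the same event can be filtered by `user_id` during an outage or during an incident investigation.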

Traces record the full path of a single request as it moves through a distributed system, service to service, with timing at each hop. A trace is made of spans, one per unit of work, and the spans nest to show the call tree. OpenTelemetry defines a trace as tracking "the progression of a single request as it is handled by services that make up an application," built from spans. Traces are the answer to "the checkout is slow, but which of the eight services it calls is the slow one." In a monolith you might never need them; in a microservice architecture they are the only thing that makes a cross-service latency problem tractable.
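The "which service is the slow one" question can be sketched with a toy span model. The `Span` type below is a simplified stand-in invented for this example, not the OpenTelemetry SDK's span class, and the calculation assumes a span's direct children run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Each span's duration minus the time spent in its direct children."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


def slowest_service(spans):
    """Return (service, self_time_ms) for the span doing the most work itself."""
    times = self_times(spans)
    worst = max(spans, key=lambda s: times[s.span_id])
    return worst.service, times[worst.span_id]


# Hypothetical trace of one slow checkout request.
trace = [
    Span("a", None, "checkout", 0, 900),
    Span("b", "a", "cart", 10, 60),
    Span("c", "a", "payments", 70, 880),
    Span("d", "c", "fraud-check", 80, 120),
    Span("e", "c", "orders-db", 130, 870),
]
print(slowest_service(trace))
```

Ranking by total duration would blame `checkout` or `payments`, which are slow only because they wait on something downstream. Subtracting child time points at the span where the time is actually spent.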

The working relationship is a funnel. A metric tells you something is wrong and roughly when. A trace narrows it to the request path and the service at fault. A log tells you exactly what happened at that point in the code. An investigation that has all three moves fast; one missing a layer guesses.

The signals worth watching: the four golden signals

Telemetry is only useful if you alert on the right slice of it. The most durable answer to "what do I actually watch" is the four golden signals, defined in Google's Site Reliability Engineering book (Chapter 6, "Monitoring Distributed Systems," by Rob Ewaschuk). The book's guidance is blunt: if you can measure only four things about a user-facing system, measure these.

| Signal | What it measures | Example metric | Why it matters |
| --- | --- | --- | --- |
| Latency | The time it takes to service a request | 95th/99th percentile response time | Slow is the failure users feel before an outage; separate success latency from error latency |
| Traffic | How much demand is on the system | Requests per second, transactions per second | Sets the baseline; a sudden drop or spike is itself a signal |
| Errors | The rate of requests that fail | HTTP 500 rate, failed-transaction rate | The most direct measure of broken; includes implicit failures like a 200 with wrong content |
| Saturation | How "full" the most constrained resource is | Memory used, connection pool depth, queue length | The leading indicator; services degrade as they approach saturation, before they fail outright |

The reason the four golden signals have outlasted a decade of tooling churn is that they are user-facing and resource-aware at once. Latency, traffic, and errors describe what the user experiences. Saturation describes the resource that will break next. Watch all four and you catch both the failure that already happened and the one about to. A monitoring setup that alerts on raw CPU but not on error rate is measuring the wrong layer, which is exactly the checkout-500s scenario from the opening.
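A minimal sketch of evaluating all four signals for one time window follows. The `WindowStats` fields and every threshold value are placeholders chosen for illustration; the SRE book does not prescribe numbers, and real thresholds come from a service's own baseline.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int
    errors: int
    p95_latency_ms: float
    pool_in_use: int      # database connections checked out
    pool_size: int
    window_s: int = 60


# Hypothetical thresholds, not recommendations.
MAX_P95_MS = 500.0
MAX_ERROR_RATIO = 0.01
MAX_SATURATION = 0.8
MIN_TRAFFIC_FRACTION = 0.5   # alert if traffic falls below half of baseline


def golden_signals(w):
    return {
        "latency_p95_ms": w.p95_latency_ms,
        "traffic_rps": w.requests / w.window_s,
        "error_ratio": w.errors / w.requests if w.requests else 0.0,
        "saturation": w.pool_in_use / w.pool_size,
    }


def breaches(w, baseline_rps):
    """Names of the signals that crossed their threshold in this window."""
    s = golden_signals(w)
    out = []
    if s["latency_p95_ms"] > MAX_P95_MS:
        out.append("latency")
    if s["traffic_rps"] < baseline_rps * MIN_TRAFFIC_FRACTION:
        out.append("traffic")
    if s["error_ratio"] > MAX_ERROR_RATIO:
        out.append("errors")
    if s["saturation"] >= MAX_SATURATION:
        out.append("saturation")
    return out


# The opening scenario: one in twenty requests fails and the pool is exhausted.
window = WindowStats(requests=1200, errors=60, p95_latency_ms=2400.0,
                     pool_in_use=20, pool_size=20)
print(golden_signals(window))
print(breaches(window, baseline_rps=20.0))
```

Host CPU does not appear in the check at all, yet three of the four signals fire for the exhausted connection pool.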

Application monitoring vs infrastructure monitoring vs observability

These three terms get used loosely, and the looseness causes real gaps. They sit at different layers and answer different questions.

|  | Infrastructure monitoring | Application monitoring (APM) | Observability |
| --- | --- | --- | --- |
| Subject | Hosts, network, OS, containers | The application code and services | The whole system as a queryable surface |
| Core question | Is the box healthy? | Is the app healthy and fast? | Why is this happening, including problems I never predicted? |
| Primary telemetry | Host metrics, syslog | Metrics, logs, traces from the app | All telemetry, correlated and explorable |
| Typical signal | CPU, memory, disk, packet loss | Latency, error rate, traces | Arbitrary questions across all signals |
| Catches | Dead host, full disk, link down | App errors, slow code paths, bad deploys | Novel, unanticipated failure modes |

Infrastructure monitoring watches the platform: is the host up, is the disk full, is the link saturated. It is necessary and it is not enough, because a perfectly healthy host can run a completely broken application, which is the failure the opening scenario describes.

Application monitoring watches the app on top of that platform. It is where the error rate, the request latency, and the traces live. The line is not always clean (a container's memory pressure is arguably both), but the test is the subject: if the signal is about the code's behavior, it is application monitoring.

Observability is the larger goal, not a fourth tool. The working definition, from the OpenTelemetry primer, is the ability to understand a system's internal state from its external outputs, "without knowing its inner workings," well enough to answer questions you did not anticipate when you instrumented it. Monitoring tells you whether the things you predicted would break are broken; observability lets you ask why something you never predicted is happening. Application monitoring with all three telemetry types is how most teams reach observability for their applications. The relationship to broader security telemetry mirrors how log analysis turns raw events into answers: collect widely, then query against whatever question the incident raises.

Why application monitoring matters to security

Application monitoring grew up on the performance and reliability side, which is why a blue team can overlook it. That is a mistake. The same telemetry that tells an SRE the app is slow tells a defender the app is under attack, and often the application layer is the only place an attack is visible at all.

Consider what does not show up at the host or network layer. A credential-stuffing run against a login endpoint looks like normal traffic to a firewall: every packet is well-formed HTTPS to a service that is supposed to receive logins. At the application layer it is unmistakable: the authentication error rate climbs, the traffic to one endpoint spikes, the latency rises as the backend strains. Those are golden signals. The same telemetry built to catch a bad deploy catches the attack. An injection attempt, an authorization bypass, an account-takeover spike, a sudden surge of 500s from a path being fuzzed: all of these are anomalies in application metrics and logs before they are anything else.
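One way to turn "the authentication error rate climbs" into a check is to compare the current window against a recent baseline. The sketch below uses a simple mean-plus-k-sigma rule from the standard library. The failure ratios, the sigma floor, and the multiplier are made-up values for illustration, and this rule is one common heuristic rather than the only or best detector.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, k=3.0, sigma_floor=0.01):
    """Flag `current` if it exceeds the baseline mean by more than k sigmas.

    sigma_floor keeps a very quiet baseline from making the rule hypersensitive.
    """
    mu = mean(history)
    sigma = max(pstdev(history), sigma_floor)
    return current > mu + k * sigma


# Hypothetical per-minute authentication failure ratios for a login endpoint.
baseline = [0.02, 0.03, 0.02, 0.04, 0.03, 0.02, 0.03, 0.03]

print(is_anomalous(baseline, 0.04))   # an ordinary bad minute
print(is_anomalous(baseline, 0.61))   # most login attempts failing
```

The same function would flag a failing identity provider as readily as a credential-stuffing run. The metric says that something changed; logs and traces are what separate an outage from an attack.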

This is the overlap between application monitoring and threat monitoring. The pipeline is the same: instrument, collect, baseline, alert on deviation. The difference is the question. A reliability team asks "is this normal for our traffic"; a security team asks "is this an attack." The same error-rate spike can be both a failing dependency and an active exploit. Application logs in particular do double duty. They are the diagnostic record for an outage and the evidence trail for an incident: the request that triggered the exception is also the request that carried the payload. A SOC that ignores application telemetry is blind to the layer where modern application attacks actually land, and a monitoring stack that feeds only reliability dashboards and never the detection pipeline is wasting half its value.

The practical takeaway is that application monitoring data belongs in the security pipeline, not only the SRE dashboard. Route application error rates, authentication failures, and anomalous traces into the same place that correlates the rest of your detections, and the instrumentation you already pay for starts earning a second return.
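As a sketch of what that routing might look like, the function below fans application events out to two destinations. The event shape, the `event_type` values, and the destination names are all hypothetical. In practice this logic usually lives in a log shipper or collector configuration rather than in application code.

```python
# Hypothetical event types treated as security-relevant; not a standard taxonomy.
SECURITY_EVENT_TYPES = {"auth_failure", "authz_denied", "input_validation_error"}


def destinations(event):
    """Every event feeds reliability; security-relevant ones also feed detection."""
    targets = ["sre_dashboard"]
    is_security = event.get("event_type") in SECURITY_EVENT_TYPES
    is_server_error = event.get("status", 0) >= 500
    if is_security or is_server_error:
        targets.append("detection_pipeline")
    return targets


def route(events):
    """Group events by destination name."""
    routed = {"sre_dashboard": [], "detection_pipeline": []}
    for event in events:
        for target in destinations(event):
            routed[target].append(event)
    return routed


events = [
    {"event_type": "request", "status": 200, "path": "/cart"},
    {"event_type": "auth_failure", "status": 401, "path": "/login"},
    {"event_type": "request", "status": 500, "path": "/checkout"},
]
routed = route(events)
print(len(routed["sre_dashboard"]), len(routed["detection_pipeline"]))
```

Nothing is removed from the reliability view. The detection pipeline simply receives a copy of the subset worth correlating with other detections.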

The bottom line

Application monitoring is how you see an application's health from the outside, built on three telemetry types that answer different questions: metrics tell you something changed and when, traces narrow it to the request path and service, and logs tell you exactly what happened in the code. The four golden signals (latency, traffic, errors, and saturation) are the durable answer to what to alert on, because they capture both what the user feels and the resource about to break.

It sits one layer above infrastructure monitoring, watching the application rather than the box, and full observability is the larger goal of being able to ask any question of your telemetry, not just the ones you predicted. For a defender, the payoff is that this telemetry is not only a reliability asset. The error-rate spike that means a bad deploy can equally mean an attack, and the application logs that diagnose an outage are the same evidence that reconstructs an incident. The instrumentation is already there. Route it into the detection pipeline and it works twice.

Frequently asked questions

What is application monitoring?

Application monitoring is the practice of instrumenting a running application to track its availability, performance, errors, and behavior over time, using telemetry the application emits about itself. It is built on three telemetry types (metrics, logs, and traces) and is often called application performance monitoring (APM). The goal is to catch and diagnose problems in the application before they reach the user.

What is the difference between application monitoring and APM?

In practice they are the same discipline. APM (application performance monitoring) is the older, performance-focused label that grew out of performance engineering, where the goal was keeping response times low. Application monitoring is the broader reading of the same job, covering availability, errors, and behavior changes as well as speed. The mechanism is identical: instrument the app, emit telemetry, collect it, and watch it.

What are the three pillars of observability?

The three pillars are metrics, logs, and traces. Metrics are numeric measurements aggregated over time, such as request rate or latency. Logs are timestamped records of discrete events with detail like error messages and stack traces. Traces follow a single request across services to show where time is spent. OpenTelemetry treats these three as the core signals an instrumented application emits.

What are the four golden signals?

The four golden signals are latency, traffic, errors, and saturation, defined in Google's Site Reliability Engineering book. Latency is how long requests take, traffic is how much demand the system sees, errors is the rate of failing requests, and saturation is how full the most constrained resource is. The book's guidance is that if you can measure only four things about a user-facing system, these are the four.

What is the difference between application monitoring and infrastructure monitoring?

Infrastructure monitoring watches the platform (hosts, network, disk, CPU) and tells you whether the box is healthy. Application monitoring watches the app running on that platform and tells you whether the code is healthy, fast, and correct. A perfectly healthy host can run a completely broken application, so a healthy infrastructure dashboard does not mean the application is fine. The two are complementary layers, not substitutes.

How does application monitoring relate to security?

The telemetry that application monitoring produces (error rates, traffic patterns, latency, and application logs) is the same data that reveals many application-layer attacks. A credential-stuffing run, an injection attempt, or an account-takeover spike shows up as an anomaly in application metrics and logs before it is visible anywhere else, and is often invisible to host and network monitoring. Routing application telemetry into the detection pipeline lets the instrumentation built for reliability double as security visibility.
