What Is Observability? The Three Pillars Explained

Observability is the ability to understand a system's internal state from its external outputs (its metrics, logs, and traces), well enough to answer questions you did not anticipate when you built it.

An outage starts as a question nobody planned for. Checkout latency triples at 02:14, but only for users in one region, only on mobile, and only when they pay with a saved card. No dashboard has a panel for that exact combination, because no one predicted it. With monitoring alone, the team is stuck: the alerts that fired say the system is "mostly fine," and the rest is guesswork. With observability, the same team asks the system directly, slicing the telemetry by region, device, and payment path until the slow dependency falls out of the data. The difference is not more dashboards. It is being able to ask a question you never set up in advance and get an answer from data the system already emits.

Observability rests on three kinds of telemetry (metrics, logs, and traces), commonly called the three pillars. This guide covers what observability is, how it differs from monitoring, the three pillars and what each is for, the role of OpenTelemetry, and why a security team should treat observability as a detection asset and not just a reliability one. It is written for the people who end up reading this telemetry under pressure: SOC analysts, detection engineers, and incident responders.

What is observability?

Observability is the property of a system that lets you understand what is happening inside it purely from the data it emits, without shipping new code to answer each new question. The OpenTelemetry primer puts it as the ability to understand a system's internal state from its external outputs "without knowing its inner workings." A system is observable to the degree that its outputs, the metrics, logs, and traces it produces, let you reconstruct what it is doing and why.

The term comes from control theory, where a system is observable if its internal state can be inferred from its outputs. Applied to software, the bar is practical: when something breaks in a way you did not foresee, can the data already flowing out of the system answer the question, or do you have to add instrumentation and wait for it to happen again? A truly observable system answers on the first incident.

That bar matters more as systems get more distributed. A single monolith on one server is easy to reason about; you can almost guess where a failure is. A request that crosses a dozen microservices, queues, and managed cloud services has too many moving parts to hold in your head. The number of ways it can fail is effectively unbounded, and most of them are failure modes no one wrote an alert for. Observability is the discipline of instrumenting the system so those unanticipated failures are still answerable from the data.

Observability vs monitoring

Observability and monitoring get used interchangeably, and the blur causes real gaps. They are related but not the same, and the difference is the kind of question each one answers.

Monitoring watches for conditions you defined in advance. You decide what "broken" looks like (CPU over 90 percent, error rate above one percent, disk nearly full) and you alert when the system crosses that line. Monitoring is excellent at catching the failures you predicted. It answers known questions about known problems: is the thing I am watching in the state I said was bad?

Observability is the broader capability of asking arbitrary questions of the system after the fact, including questions you never thought to set up. It answers unknown questions about problems you did not predict: not "is CPU high" but "why are exactly these users slow, and what do their requests have in common." Monitoring tells you that something you expected is wrong. Observability lets you investigate something you never expected.

| | Monitoring | Observability |
|---|---|---|
| Question type | Known: is a predefined condition true? | Unknown: why is this happening, including problems I never predicted? |
| Set up in advance | Yes, you define the alerts | The telemetry is broad; the questions are asked later |
| Catches | Anticipated failures (CPU, errors, disk) | Novel, unanticipated failure modes |
| Output | Alerts and dashboards | Explorable, queryable telemetry |
| Relationship | A subset of what observability enables | The larger capability monitoring sits inside |

The two are not rivals. Monitoring is a subset of what an observable system gives you: once the telemetry is rich enough to answer arbitrary questions, encoding the known-bad conditions as alerts is the easy part. The mistake is stopping at monitoring, building dashboards for every failure you can imagine and being blind to the ones you cannot. The unimagined failures are the expensive ones.
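The contrast can be sketched in a few lines of Python. This is a minimal illustration, not a real telemetry query: the event fields (region, device, payment, latency_ms), the 1500 ms threshold, and both function names are assumptions invented for the example. The predefined alert stays quiet because the fleet-wide average looks healthy, while a slice chosen at query time surfaces the one slow combination.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request events; every field name and value is an assumption.
EVENTS = [
    {"region": "eu", "device": "mobile", "payment": "saved_card", "latency_ms": 2900},
    {"region": "eu", "device": "mobile", "payment": "saved_card", "latency_ms": 3100},
    {"region": "eu", "device": "mobile", "payment": "new_card", "latency_ms": 310},
    {"region": "eu", "device": "desktop", "payment": "saved_card", "latency_ms": 290},
    {"region": "us", "device": "mobile", "payment": "saved_card", "latency_ms": 300},
    {"region": "us", "device": "desktop", "payment": "new_card", "latency_ms": 280},
]


def monitoring_alert(events, threshold_ms=1500):
    """Monitoring: evaluate one condition that was defined in advance."""
    return mean(e["latency_ms"] for e in events) > threshold_ms


def slice_latency(events, *dimensions):
    """Observability: group by whatever dimensions the analyst picks later."""
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[d] for d in dimensions)].append(e["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}


alert_fired = monitoring_alert(EVENTS)
by_slice = slice_latency(EVENTS, "region", "device", "payment")
slowest_slice = max(by_slice, key=by_slice.get)
```

Here alert_fired is False because the overall mean sits under the threshold, while slowest_slice points at EU mobile users paying with a saved card. Nothing about that combination had to be configured beforehand; the dimensions were already on the events.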

The three pillars of observability

[Figure: The three pillars of observability. Metrics (something changed, and when), traces (which request, which service), and logs (exactly what happened) are linked by shared IDs into one queryable surface, so an analyst can pivot from a spiking metric to the traces behind it to the logs those traces produced.]

Observability is built on three kinds of telemetry, metrics, logs, and traces, commonly called the three pillars. The framing was popularized by the observability community and adopted by OpenTelemetry, which treats metrics, logs, and traces as the core signals an instrumented system emits. They are not interchangeable. Each answers a different question, and an investigation that is missing one of them guesses where it should know.

Metrics are numeric measurements aggregated over time: request rate, error count, response latency at the 95th percentile, CPU saturation, queue depth. OpenTelemetry defines a metric as "a measurement about a service, captured at runtime." Metrics are cheap to store and fast to query, which is why dashboards and alerts are built on them. Their limit is that they tell you something changed, not why. A latency graph that spikes at 02:14 tells you when, not which request or what it was doing.
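As a rough sketch of what "aggregated over time" means, the snippet below buckets raw latency samples by minute and reduces each bucket to a 95th percentile. The sample values, the minute labels, and the nearest-rank percentile method are assumptions chosen for illustration; real metric backends typically use histograms or sketches instead of keeping every sample.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def p95_by_minute(samples):
    """Reduce (minute_label, latency_ms) pairs to one p95 value per minute."""
    buckets = defaultdict(list)
    for minute, latency_ms in samples:
        buckets[minute].append(latency_ms)
    return {minute: p95(values) for minute, values in sorted(buckets.items())}


# Hypothetical samples: a quiet minute followed by a degraded one.
SAMPLES = [("02:13", v) for v in (110, 120, 125, 130, 140)] + [
    ("02:14", v) for v in (115, 130, 900, 950, 1000)
]

series = p95_by_minute(SAMPLES)
```

The resulting series has one number per minute, which is why it is cheap to store and graph. It is also why it cannot say which request was slow: the individual samples are gone after aggregation.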

Logs are timestamped records of discrete events: a request received, a query executed, an exception thrown with its stack trace, a user authenticated. OpenTelemetry defines a log as "a timestamped text record, either structured (recommended) or unstructured, with optional metadata." Logs carry the detail metrics lack (the actual error message, the offending parameter, the user ID), but they are voluminous and expensive to search at scale. They are where you go once a metric says something is wrong and you need to know what.
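To make the structured case concrete, here is a minimal sketch that emits log records as JSON lines and then filters them by field. The helper names (make_log_record, search) and every field other than the timestamp are assumptions for the example, not a prescribed schema.

```python
import json
from datetime import datetime, timezone


def make_log_record(level, message, **fields):
    """Build one structured log line: a timestamp, a level, a message, metadata."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    record.update(fields)  # optional metadata such as user_id or trace_id
    return json.dumps(record)


def search(lines, **criteria):
    """Return parsed records whose fields equal every given criterion."""
    matches = []
    for line in lines:
        record = json.loads(line)
        if all(record.get(key) == value for key, value in criteria.items()):
            matches.append(record)
    return matches


# Hypothetical events from one slow checkout and one healthy one.
LINES = [
    make_log_record("INFO", "request received", user_id="u-1042", trace_id="t2"),
    make_log_record("ERROR", "card token lookup timed out",
                    user_id="u-1042", trace_id="t2", timeout_ms=2800),
    make_log_record("INFO", "request received", user_id="u-2001", trace_id="t3"),
]

errors = search(LINES, level="ERROR")
same_user = search(LINES, user_id="u-1042")
```

Because each record is a set of named fields, "all errors" and "everything this user did" are exact-match filters, with no regular expressions over free text. The linear scan in search also hints at why logs get expensive at scale without an index.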

Traces record the full path of a single request as it moves through a distributed system, service to service, with timing at each hop. A trace is made of spans, one per unit of work, and the spans nest to show the call tree. OpenTelemetry defines a trace as tracking "the progression of a single request as it is handled by services that make up an application." Traces answer "the checkout is slow, but which of the eight services it calls is the slow one." In a monolith you may never need them; in a distributed system they are the only thing that makes a cross-service problem tractable.
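A small sketch shows how nested spans point at the slow service. The Span fields, service names, and durations are hypothetical, and the calculation assumes child spans run one after another; with parallel children, subtracting their summed durations would undercount the parent's own time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    duration_ms: int


def self_time_by_service(spans):
    """Time each service spent on its own work, excluding direct children."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0) + span.duration_ms
            )
    result = {}
    for span in spans:
        own = span.duration_ms - child_total.get(span.span_id, 0)
        result[span.service] = result.get(span.service, 0) + own
    return result


# Hypothetical trace of one slow checkout request.
TRACE = [
    Span("a", None, "checkout", 3000),
    Span("b", "a", "cart", 40),
    Span("c", "a", "payments", 2900),
    Span("d", "c", "card-vault", 2800),
]

self_times = self_time_by_service(TRACE)
slowest_service = max(self_times, key=self_times.get)
```

The root span alone only says checkout took three seconds. Subtracting child time from each parent shows that almost all of it was spent inside card-vault, two hops down the call tree.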

The three work as a funnel. A metric tells you something is wrong and roughly when. A trace narrows it to the request path and the service at fault. A log tells you exactly what happened at that point in the code. An investigation with all three moves fast; one missing a layer falls back to guessing. Many teams now treat the three pillars as a starting point rather than the whole story, because the real power is in correlating across them: jumping from a spiking metric to the exact traces behind it, then to the logs those traces produced. But you cannot correlate signals you never collected.

How observability works in practice

Producing observability is a pipeline, from raw signals in the code to a question answered during an incident.

  1. Instrument. Add code, or auto-instrumentation, that emits metrics, logs, and traces from the application and its services. This is where OpenTelemetry lives.
  2. Collect and export. Ship the telemetry off the hosts to a central place, usually through a collector that batches and routes it. Telemetry that stays on the box it was generated on cannot be correlated and is lost when the box is gone.
  3. Store and index. Keep the signals in backends built for each type: time-series storage for metrics, a searchable store for logs, and a trace store for spans, each indexed so it can be queried quickly.
  4. Correlate and explore. Tie the signals together by shared identifiers (a trace ID stamped on the logs, labels shared across metrics) so an analyst can pivot from one signal to another. This is the part that makes a system observable rather than merely monitored.
  5. Alert and investigate. Encode the known-bad conditions as alerts (the monitoring subset), and keep the rest of the telemetry explorable for the questions no alert was written for.

Most of the value, and most of the difficulty, sits in collection and correlation. Signals that are never centralized or never linked by a common identifier are three disconnected datasets, not observability. The correlation is what lets one spiking metric lead to the trace and the log that explain it.
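The pivot described above can be sketched with three small in-memory datasets. Everything here is hypothetical: the field names, the 1000 ms threshold, and the choice to link metrics to traces by shared minute and service labels and traces to logs by trace_id. It is one plausible way to wire the signals together, not the only one.

```python
# Hypothetical telemetry; field names and values are assumptions.
METRIC_POINTS = [
    {"minute": "02:13", "service": "checkout", "p95_ms": 140},
    {"minute": "02:14", "service": "checkout", "p95_ms": 3000},
]
TRACES = [
    {"trace_id": "t1", "minute": "02:13", "service": "checkout", "duration_ms": 130},
    {"trace_id": "t2", "minute": "02:14", "service": "checkout", "duration_ms": 3050},
    {"trace_id": "t3", "minute": "02:14", "service": "checkout", "duration_ms": 120},
]
LOGS = [
    {"trace_id": "t1", "level": "INFO", "message": "order placed"},
    {"trace_id": "t2", "level": "ERROR", "message": "card-vault timeout after 2800 ms"},
    {"trace_id": "t3", "level": "INFO", "message": "order placed"},
]


def spiking_points(points, threshold_ms):
    """Step 1: the metric says something is wrong, and when."""
    return [p for p in points if p["p95_ms"] > threshold_ms]


def slow_traces(traces, point, threshold_ms):
    """Step 2: shared labels lead from the metric point to the slow traces."""
    return [
        t for t in traces
        if t["minute"] == point["minute"]
        and t["service"] == point["service"]
        and t["duration_ms"] > threshold_ms
    ]


def logs_for(logs, trace_ids):
    """Step 3: the trace ID stamped on each log leads to what happened."""
    wanted = set(trace_ids)
    return [entry for entry in logs if entry["trace_id"] in wanted]


def investigate(threshold_ms=1000):
    findings = []
    for point in spiking_points(METRIC_POINTS, threshold_ms):
        ids = [t["trace_id"] for t in slow_traces(TRACES, point, threshold_ms)]
        findings.extend(entry["message"] for entry in logs_for(LOGS, ids))
    return findings


findings = investigate()
```

Remove the trace_id field from the logs or the shared labels from the metrics and each function still works on its own dataset, but investigate has nothing to join on. That is the "three disconnected datasets" failure in miniature.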

What is OpenTelemetry?

OpenTelemetry is the vendor-neutral, open standard for generating, collecting, and exporting telemetry: the metrics, logs, and traces themselves. It is a Cloud Native Computing Foundation (CNCF) project and is widely adopted as the default way to instrument an application without locking the telemetry to one vendor's format.

The scope matters and is often blurred. OpenTelemetry handles instrumentation and data collection, not storage or analysis. It defines the signal types, the APIs and SDKs that emit them, and a collector that routes them, then hands the data to a backend that stores it and provides the dashboards, queries, and alerting. In practice a team instruments with OpenTelemetry and pairs it with a backend, commercial or open source, that holds the telemetry and lets people query it. Separating the instrumentation standard from the storage and analysis layer is the cleanest way to reason about an observability stack, and it is the distinction most vendor pages gloss over.
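The separation between collection and storage can be illustrated with a toy sketch. To be clear, this is not the OpenTelemetry API or its Collector: the Exporter protocol, the Collector class, and InMemoryBackend are hypothetical names invented here to show the shape of the boundary, where the collecting side batches and forwards records and knows nothing about how the backend stores or queries them.

```python
from typing import Protocol


class Exporter(Protocol):
    """Anything that can accept a batch of telemetry records."""

    def export(self, batch: list) -> None: ...


class InMemoryBackend:
    """Hypothetical stand-in for a storage and analysis backend."""

    def __init__(self):
        self.stored = []

    def export(self, batch):
        self.stored.extend(batch)


class Collector:
    """Batches records and forwards them to whichever exporter is configured."""

    def __init__(self, exporter: Exporter, batch_size: int = 2):
        self.exporter = exporter
        self.batch_size = batch_size
        self._pending = []

    def receive(self, record):
        self._pending.append(record)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self._pending:
            self.exporter.export(self._pending)
            self._pending = []


backend = InMemoryBackend()
collector = Collector(backend, batch_size=2)
collector.receive({"signal": "metric", "name": "http.latency"})
stored_after_one = len(backend.stored)
collector.receive({"signal": "log", "message": "request received"})
collector.receive({"signal": "trace", "span": "checkout"})
collector.flush()
```

Swapping InMemoryBackend for any other object with an export method changes where the data lands without touching the code that emits it, which is the practical meaning of not locking telemetry to one vendor.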

Why observability matters to security

Observability grew up on the reliability and performance side, which is why a blue team can write it off as an SRE concern. That is a mistake. The same telemetry that tells an SRE the system is slow tells a defender the system is under attack, and the application and service layer is often the only place an attack is visible at all.

Consider what network-level monitoring alone misses. A credential-stuffing run against a login endpoint is well-formed HTTPS to a service built to receive logins, so the firewall sees nothing wrong. At the observability layer it stands out: the authentication error rate climbs, traffic to one endpoint spikes, latency rises as the backend strains, and the traces and logs show the same source pattern hammering one path. The instrumentation built to catch a bad deploy catches the attack. Injection attempts, authorization bypasses, account-takeover spikes, and data-exfiltration patterns are all anomalies in this telemetry before they are anything else.
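A hedged sketch of how that pattern might be pulled out of authentication logs: count failed logins per source and flag sources that fail often across many distinct usernames. The event fields, both thresholds, and the function name are assumptions for illustration; a production detection would also window by time and account for shared egress addresses such as corporate NAT.

```python
from collections import Counter


def suspicious_sources(auth_events, min_failures=5, min_distinct_users=4):
    """Flag sources with many failed logins spread over many usernames."""
    failures = Counter()
    users_tried = {}
    for event in auth_events:
        if event["outcome"] == "failure":
            source = event["source_ip"]
            failures[source] += 1
            users_tried.setdefault(source, set()).add(event["username"])
    return sorted(
        source
        for source, count in failures.items()
        if count >= min_failures and len(users_tried[source]) >= min_distinct_users
    )


# Hypothetical events. The addresses come from documentation-reserved ranges.
AUTH_EVENTS = (
    [{"source_ip": "203.0.113.7", "username": f"user{i}", "outcome": "failure"}
     for i in range(6)]
    + [{"source_ip": "198.51.100.20", "username": "alice", "outcome": "failure"}
       for _ in range(2)]
    + [{"source_ip": "198.51.100.20", "username": "alice", "outcome": "success"}]
)

flagged = suspicious_sources(AUTH_EVENTS)
```

The second source, one user mistyping a password twice before succeeding, is left alone. The distinction rests on the username spread, which is only available because the logs carry that field.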

This is where observability and threat monitoring overlap. The pipeline is the same: instrument, collect, baseline, investigate deviation. The difference is the question. A reliability team asks "is this normal for our traffic," a security team asks "is this an attack," and the same error-rate spike can be both a failing dependency and an active exploit. The logs in particular do double duty: the request that triggered an exception is also the request that carried the payload, which is exactly why log analysis is a security discipline and not only an operations one.

The unknown-unknowns property is the real security argument. Monitoring catches the attacks you wrote a rule for. Observability lets a hunter ask a question no rule anticipated ("why is this service account suddenly calling an API it never touched?" or "why does this trace cross a trust boundary it should not?") and get an answer from data already collected. That is the same capability proactive application monitoring gives reliability teams, pointed at an adversary instead of a bug. A SOC that ignores observability data is blind to the layer where modern application attacks land, and an observability stack that feeds only reliability dashboards and never the detection pipeline is leaving half its value unused.
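The service-account question can be asked of existing telemetry with a simple set difference, sketched below. The principal and api field names, the sample events, and the idea of splitting history into a baseline window and a recent window are all assumptions made for the example.

```python
def novel_calls(baseline_events, recent_events):
    """Return (principal, api) pairs seen recently but never in the baseline."""
    seen = {(e["principal"], e["api"]) for e in baseline_events}
    recent = {(e["principal"], e["api"]) for e in recent_events}
    return sorted(recent - seen)


# Hypothetical call records derived from traces or access logs.
BASELINE = [
    {"principal": "svc-reporting", "api": "GET /reports"},
    {"principal": "svc-reporting", "api": "GET /metrics"},
    {"principal": "svc-billing", "api": "POST /invoices"},
]
RECENT = [
    {"principal": "svc-reporting", "api": "GET /reports"},
    {"principal": "svc-reporting", "api": "GET /admin/users"},
]

new_behaviour = novel_calls(BASELINE, RECENT)
```

No detection rule names svc-reporting or the admin endpoint. The query works only because who-called-what was already being recorded for reliability reasons, which is the argument of this section in one function.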

The practical takeaway is that observability telemetry belongs in the security pipeline, not only the SRE one. Route application error rates, authentication failures, and anomalous traces into the same place that correlates the rest of your detections, and the instrumentation you already pay for starts earning a second return.

Frequently Asked Questions

What is observability in simple terms?

Observability is the ability to understand what is going on inside a system just from the data it produces, without having to add new code to answer each new question. A system is observable when its outputs, its metrics, logs, and traces, let you figure out what it is doing and why, including for problems you never predicted. The goal is to be able to ask any question of the system and get an answer from data it already emits.

What are the three pillars of observability?

The three pillars are metrics, logs, and traces. Metrics are numeric measurements aggregated over time, such as request rate or latency. Logs are timestamped records of discrete events with detail like error messages and stack traces. Traces follow a single request across services to show where time is spent. OpenTelemetry treats these three as the core signals an instrumented system emits, and together they let you move from "something is wrong" to the exact cause.

What is the difference between observability and monitoring?

Monitoring watches for conditions you defined in advance and alerts when the system crosses them, so it catches the failures you predicted. Observability is the broader ability to ask arbitrary questions of the system after the fact, including ones you never set up, so it catches novel failures you did not anticipate. Monitoring is effectively a subset of what an observable system gives you: once the telemetry is rich enough to answer any question, encoding the known-bad conditions as alerts is the easy part.

What is OpenTelemetry?

OpenTelemetry is the vendor-neutral, open standard for generating and collecting telemetry, the metrics, logs, and traces, from an application. It is a Cloud Native Computing Foundation (CNCF) project and is widely adopted as a way to instrument software without tying the data to one vendor. It handles instrumentation and collection, not storage or analysis, so teams pair OpenTelemetry data with a separate backend that stores the telemetry and provides dashboards, queries, and alerting.

Why does observability matter for security?

Because the same telemetry that reveals reliability problems reveals attacks. A credential-stuffing run, an injection attempt, or an account-takeover spike shows up as an anomaly in metrics, logs, and traces at the application layer before it is visible anywhere else, and is often invisible to host and network monitoring. Observability also lets a threat hunter ask questions no detection rule anticipated, which is exactly how the subtle, unanticipated intrusions get caught. Routing observability data into the detection pipeline turns instrumentation built for reliability into security visibility.

Is observability just another word for monitoring with more data?

No. More dashboards are still monitoring: you are just watching more predefined conditions. Observability is a different capability: being able to explore and query telemetry to answer questions you never set up in advance. The test is whether, when a novel failure appears, the data already flowing out of the system can answer it, or whether you have to add instrumentation and wait for the problem to recur. A system that can answer on the first incident is observable; one that only shows the panels you built is monitored.

The bottom line

Observability is the ability to understand a system from the outside well enough to answer questions you never anticipated, built on three telemetry types that answer different questions: metrics tell you something changed and when, traces narrow it to the request path and service, and logs tell you exactly what happened in the code. Its difference from monitoring is the difference between checking conditions you predicted and investigating ones you did not, and monitoring is really a subset of what an observable system gives you.

For a defender the payoff is concrete. The same telemetry that an SRE uses to find a slow dependency reveals credential stuffing, injection, and account takeover at the layer where they actually land, and the unknown-unknowns property is what lets a hunter catch an intrusion no rule was written for. The instrumentation is already there for reliability. Route it into the detection pipeline and it works twice.
