
What Is AWS Infrastructure Observability? A Security Guide

AWS infrastructure observability is the practice of collecting and analyzing the telemetry AWS emits (API calls, network flows, configuration state, and metrics) so security teams can detect threats and reconstruct incidents.

An attacker steals a long-lived access key from a developer laptop, calls AssumeRole from an IP in another country, enumerates S3 with ListBuckets, and copies a sensitive bucket out through a VPC endpoint. Every one of those actions is an ordinary AWS API call. Nothing exploited a flaw. To AWS, it is normal activity by a valid credential. The only thing that separates that intrusion from a Tuesday is whether you recorded it, and whether anyone was watching the record.

That is what AWS infrastructure observability is for. Not dashboards for an SRE to watch latency. The security version asks a different question: can I see what every identity and resource in my AWS environment did, well enough to detect an intrusion and reconstruct it afterward? In AWS, activity is invisible by default. CloudTrail is on for 90 days of history but ships nothing durable until you create a trail. VPC Flow Logs are off until you enable them. The telemetry a SOC needs does not exist until someone turns it on and points it somewhere a detection can read it.

This guide is the defender's view of AWS observability: what it means in a security context, the three pillars (logs, metrics, traces) and which one carries the security weight, the native AWS sources (CloudTrail, CloudWatch, VPC Flow Logs, GuardDuty, Config), how they feed a SIEM, the specific signals worth alerting on, and the gaps that bite teams who assume the cloud watches itself. It is written for blue teamers: SOC analysts, threat hunters, and incident responders who inherit an AWS estate and have to detect intrusions in it.

What is AWS infrastructure observability?

AWS infrastructure observability is the practice of collecting and analyzing the telemetry that AWS emits about your environment (the API calls, the network flows, the resource configurations, and the service metrics) so you can understand what happened, detect what should not be happening, and investigate an incident after the fact. The operational discipline borrows the word from DevOps, where observability means inferring a system's internal state from its outputs. The security discipline keeps the mechanics and changes the question: not "is the system healthy?" but "is anything in here behaving like an attacker?"

Three things make this its own problem rather than on-premises monitoring relabeled.

The control plane is an API, and the API is the crime scene. On-premises, a lot of attacker activity happens below the logging layer: a process spawned on a host, a packet on a wire. In AWS, the highest-value actions are API calls to the control plane: creating a user, attaching a policy, disabling a log, copying a snapshot. Those calls are exactly what CloudTrail records. The audit log is not a supplement to detection in AWS. It often is the detection.

Everything is off until you turn it on. A new AWS account does not ship its security telemetry anywhere. CloudTrail keeps 90 days of management-event history you can browse, but durable, queryable logs require a trail writing to S3. VPC Flow Logs, DNS logs, and most service logs are opt-in. The first finding in most cloud assessments is not a misconfiguration. It is a blind spot: logging that was never enabled, or enabled in one account and not the other forty.

Scale and ephemerality work against you. Resources are created and destroyed by code in seconds. An EC2 instance an attacker used can be gone before you investigate, taking its local logs with it. Observability has to capture evidence centrally and continuously, because the resource that holds the clue may not exist by the time you go looking.

The three pillars: logs, metrics, and traces

Observability is conventionally built on three pillars. For security, they are not equal. Logs do almost all the work; metrics and traces play supporting roles.

Pillar | What it is | AWS source | Security weight
Logs | Discrete, timestamped records of events | CloudTrail, VPC Flow Logs, CloudWatch Logs, service logs | Primary. The audit trail of who did what
Metrics | Numeric measurements over time | CloudWatch metrics and alarms | Secondary. Anomalies (CPU spike, egress surge) as a tripwire
Traces | The path of a request across services | AWS X-Ray, CloudWatch Application Signals | Minor for infrastructure security; matters for app-layer attacks

Logs are the spine of cloud detection. A log entry is a fact: at this time, this identity made this API call from this IP, and it succeeded or failed. That is the raw material of every cloud detection and every cloud investigation. CloudTrail tells you who did what to the control plane; VPC Flow Logs tell you what talked to what on the network; service logs (S3 access logs, load balancer logs, DNS query logs) fill in the data-plane detail. If you do nothing else, get the logs.

Metrics are a coarse tripwire. A metric is a number sampled over time: CPU utilization, network bytes out, failed-login count. Metrics rarely identify an attacker by themselves, but a sudden change is a cheap signal that something is off. A cryptominer pins CPU. An exfiltration spikes outbound bytes. A credential-stuffing run spikes failed authentications. Metrics catch the symptom; logs explain the cause.
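The tripwire idea can be sketched in a few lines. This is a minimal illustration, not how CloudWatch alarms are implemented: it flags a sample that sits several standard deviations above a recent baseline. The function name, the threshold of 3.0, and the sample byte counts are all assumptions made for the example.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Return True when `latest` is more than `threshold` standard
    deviations above the mean of `history` (a simple z-score test)."""
    if len(history) < 2:
        return False  # too little data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest > mu  # flat baseline: any increase stands out
    return (latest - mu) / sigma > threshold

# Hypothetical hourly outbound-byte samples for one instance.
baseline = [120e6, 135e6, 110e6, 128e6, 140e6, 125e6]

print(is_anomalous(baseline, 130e6))  # an ordinary hour
print(is_anomalous(baseline, 4e9))    # a sudden multi-gigabyte hour
```

The check only says that something changed. Deciding whether the change was a backup job or an exfiltration still takes the logs.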

Traces matter at the application layer, less for infrastructure. A trace follows one request as it hops across services. For application security and for diagnosing abuse of a specific API path, that is useful. For the infrastructure-level intrusions this guide is about (stolen credentials, misconfigured resources, lateral movement through IAM), traces are rarely where the detection lives. This is the same shift that log analysis underwent on-premises: the structured record of events, not the request path, is what you hunt in.

The practical takeaway: spend your first effort on logs, wire a few high-value metrics as alarms, and treat traces as an application-team concern unless you are defending a specific service.

The native AWS sources a defender lives in

Figure: AWS security telemetry. Five native sources (CloudTrail, CloudWatch, VPC Flow Logs, GuardDuty, AWS Config) feed one detection pipeline, centralized in a SIEM backed by one logging account with immutable storage.

AWS gives you the telemetry; configuring it correctly is the work. Five native sources carry most of the security signal. Each answers a different question, and each is verified here against AWS's own documentation.

Source | What it records | The question it answers
CloudTrail | API calls (management and data events) | Who did what, where, and did it succeed?
CloudWatch | Metrics, logs, alarms | Is a resource behaving abnormally, and can I alert on it?
VPC Flow Logs | IP traffic metadata to and from network interfaces | What talked to what, accepted or rejected?
GuardDuty | Threat findings from CloudTrail, VPC, and DNS logs | Has AWS already spotted a known-bad pattern?
AWS Config | Resource configuration state and changes over time | What is the configuration, and when did it change?

CloudTrail is the foundation. AWS describes CloudTrail as the service that records "actions taken by a user, role, or an AWS service" as events, covering the console, CLI, SDKs, and APIs. Management events (the control-plane operations: CreateUser, AttachRolePolicy, StopLogging) are recorded by default in the 90-day event history; durable logging needs a trail delivering to S3. Data events (object-level S3 reads, Lambda invokes) are high-volume and off by default, but they are where you see the actual data access. For a defender, CloudTrail answers the most important question in a cloud breach: which identity took which action against which resource, and when.
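To make "which identity took which action against which resource, and when" concrete, here is a minimal sketch that reduces one CloudTrail record to those answers. The field names (eventTime, eventSource, eventName, sourceIPAddress, userIdentity, errorCode) are standard CloudTrail record fields; the sample values and the summarize helper are invented for illustration.

```python
import json

# An invented record using standard CloudTrail field names.
SAMPLE_RECORD = json.loads("""
{
  "eventTime": "2024-05-14T09:21:07Z",
  "eventSource": "iam.amazonaws.com",
  "eventName": "AttachRolePolicy",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.45",
  "userIdentity": {
    "type": "AssumedRole",
    "arn": "arn:aws:sts::111122223333:assumed-role/dev-role/session-1"
  },
  "requestParameters": {
    "roleName": "dev-role",
    "policyArn": "arn:aws:iam::aws:policy/AdministratorAccess"
  }
}
""")

def summarize(record):
    """Reduce a CloudTrail record to who / what / where / when / outcome."""
    identity = record.get("userIdentity", {})
    return {
        "when": record.get("eventTime"),
        "who": identity.get("arn") or identity.get("type"),
        "what": f'{record.get("eventSource")}:{record.get("eventName")}',
        "where": record.get("sourceIPAddress"),
        "region": record.get("awsRegion"),
        # errorCode is present only when the call returned an error
        "succeeded": "errorCode" not in record,
        "error": record.get("errorCode"),
    }

for field, value in summarize(SAMPLE_RECORD).items():
    print(f"{field:10} {value}")
```

The target resource lives in requestParameters, whose shape varies by API, so a real pipeline would extract it per event type rather than generically.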

CloudWatch is AWS's monitoring service, collecting metrics, storing logs (CloudWatch Logs), and firing alarms against thresholds. AWS positions it as real-time observability of "performance, operational health, and resource utilization." For security it plays two roles: it is a destination other logs are shipped to (VPC Flow Logs and CloudTrail can both land in CloudWatch Logs, where metric filters extract numbers from log text), and its alarms turn a metric breach into an automated action. A metric filter that counts ConsoleLogin failures and an alarm that fires on a spike is a working brute-force detection built entirely from CloudWatch.

VPC Flow Logs capture metadata about IP traffic to and from network interfaces: source, destination, ports, protocol, byte counts, and whether the traffic was accepted or rejected. The critical limit, stated plainly in the AWS docs, is that flow logs capture metadata, not packet contents. They tell you that 4 GB moved from an internal instance to an unfamiliar external IP; they do not tell you what was in it. For network-side detection (rejected-traffic patterns, connections to known-bad IPs, and unexplained egress volume), flow logs are the source.
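As a rough illustration of what "metadata, not contents" gives you to work with, the sketch below parses records in the default flow-log format (version 2 field order) and totals accepted bytes sent from internal addresses to external ones. The sample lines are invented. Treating only the RFC 1918 ranges as internal is a simplifying assumption that a real environment would replace with its own VPC CIDRs.

```python
import ipaddress
from collections import defaultdict

# Field order of the default (version 2) flow log record format.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

# Assumption: "internal" means RFC 1918 space.
INTERNAL = [ipaddress.ip_network(n)
            for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_internal(addr):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in INTERNAL)

def parse_flow_record(line):
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

def egress_bytes_by_destination(lines):
    """Sum accepted bytes from internal sources to external destinations."""
    totals = defaultdict(int)
    for line in lines:
        rec = parse_flow_record(line)
        # NODATA / SKIPDATA records carry "-" placeholders; skip them.
        if rec["log_status"] != "OK" or rec["action"] != "ACCEPT":
            continue
        if is_internal(rec["srcaddr"]) and not is_internal(rec["dstaddr"]):
            totals[rec["dstaddr"]] += int(rec["bytes"])
    return dict(totals)

SAMPLE_LINES = [
    "2 111122223333 eni-0a1b2c3d 10.0.1.15 203.0.113.45 49152 443 6 1200 900000 1715678400 1715678460 ACCEPT OK",
    "2 111122223333 eni-0a1b2c3d 10.0.1.15 203.0.113.45 49153 443 6 800 600000 1715678460 1715678520 ACCEPT OK",
    "2 111122223333 eni-0a1b2c3d 10.0.1.15 10.0.2.20 49154 5432 6 10 840 1715678400 1715678460 ACCEPT OK",
    "2 111122223333 eni-0a1b2c3d 198.51.100.7 10.0.1.15 40000 22 6 1 40 1715678400 1715678460 REJECT OK",
    "2 111122223333 eni-0a1b2c3d - - - - - - - 1715678400 1715678460 - NODATA",
]

print(egress_bytes_by_destination(SAMPLE_LINES))
```

Everything the function can say is volumetric and relational: how much went where. Nothing in the record says what the bytes were.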

GuardDuty is AWS's managed threat-detection service. When enabled, it automatically ingests its foundational data sources (AWS confirms these as CloudTrail management events, VPC Flow Logs, and DNS logs) and applies threat intelligence feeds and machine learning to flag suspicious activity: compromised credentials, cryptomining, data exfiltration patterns, communication with known-malicious infrastructure. It is the cheapest high-value detection in AWS because it reads logs you would otherwise have to build detections against yourself. It is not a replacement for your own logging and SIEM; it is a strong first layer that produces findings you route into the same pipeline.

AWS Config records the configuration of your resources and how it changed over time. AWS describes it as a "detailed view of the configuration of AWS resources," including historical state and relationships, with rules that evaluate resources for compliance and flag noncompliant ones. For security its value is twofold: it answers "was this bucket ever public, and when did that change?" during an investigation, and its rules continuously catch misconfigurations (open security groups, unencrypted volumes, disabled logging) that are among the most common causes of cloud breaches.
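The investigation question "was this bucket ever public, and when did that change?" comes down to walking a configuration history in time order. The sketch below assumes that history has already been reduced to (timestamp, is_public) pairs. That shape is hypothetical: real Config configuration items are much richer, and deriving is_public from them is not shown.

```python
def public_windows(history):
    """Given (iso_timestamp, is_public) pairs sorted oldest first,
    return (opened_at, closed_at) windows during which the resource
    was public. closed_at is None if it is still public."""
    windows = []
    opened_at = None
    for captured_at, is_public in history:
        if is_public and opened_at is None:
            opened_at = captured_at            # exposure begins
        elif not is_public and opened_at is not None:
            windows.append((opened_at, captured_at))  # exposure ends
            opened_at = None
    if opened_at is not None:
        windows.append((opened_at, None))
    return windows

# Hypothetical history for one bucket.
bucket_history = [
    ("2024-03-01T10:00:00Z", False),
    ("2024-03-04T02:15:00Z", True),
    ("2024-03-04T09:40:00Z", False),
    ("2024-03-09T00:00:00Z", True),
]

for opened, closed in public_windows(bucket_history):
    print(f"public from {opened} until {closed or 'now'}")
```

Each window boundary is a timestamp to pivot on in CloudTrail, where the call that changed the configuration names the identity behind it.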

How the sources feed a SIEM

Native AWS tools detect, but a SOC does not want to live in five consoles across forty accounts. The pattern that works is to centralize: each source ships to one place a SIEM can read, where AWS telemetry sits alongside endpoint, identity, and on-premises logs and gets correlated.

The standard flow:

  • CloudTrail writes to a dedicated, locked-down S3 bucket in a logging account, ideally an organization trail covering every account. The SIEM pulls from that bucket, or CloudTrail also delivers to CloudWatch Logs, which forwards on.
  • VPC Flow Logs publish to CloudWatch Logs or S3, then forward to the SIEM. AWS supports CloudWatch Logs, S3, and Data Firehose as flow-log destinations.
  • GuardDuty findings route through Amazon EventBridge to a Lambda, an SNS topic, or directly into the SIEM, so a finding becomes an alert in the same queue an analyst already watches.
  • AWS Config changes and compliance results stream to S3 and EventBridge, feeding both investigation lookups and posture alerting.
  • CloudWatch metrics and alarms publish to SNS or EventBridge for the metric-based tripwires.
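The GuardDuty leg of that flow can be sketched as the function a forwarding Lambda might run: take the EventBridge event, check that it is a GuardDuty finding, and flatten it into the alert shape a SIEM queue expects. The source and detail-type values and the detail field names follow the GuardDuty finding event format. The output schema, the priority thresholds, and the sample values are assumptions made for illustration.

```python
def priority_for(severity):
    """Map a numeric severity to a queue priority.
    Assumed thresholds for a hypothetical SIEM, not an AWS definition."""
    if severity >= 7:
        return "high"
    if severity >= 4:
        return "medium"
    return "low"

def finding_to_alert(event):
    """Flatten an EventBridge GuardDuty finding event into a SIEM alert.
    Returns None for events that are not GuardDuty findings."""
    if event.get("source") != "aws.guardduty":
        return None
    if event.get("detail-type") != "GuardDuty Finding":
        return None
    detail = event.get("detail", {})
    severity = float(detail.get("severity", 0))
    return {
        "alert_id": detail.get("id"),
        "rule": detail.get("type"),
        "title": detail.get("title"),
        "account": detail.get("accountId"),
        "region": detail.get("region"),
        "severity": severity,
        "priority": priority_for(severity),
    }

# Invented event; the finding type string is a real GuardDuty type.
SAMPLE_EVENT = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {
        "id": "example-finding-id",
        "type": "CryptoCurrency:EC2/BitcoinTool.B!DNS",
        "title": "Example finding title",
        "severity": 8,
        "accountId": "111122223333",
        "region": "us-east-1",
    },
}

print(finding_to_alert(SAMPLE_EVENT))
```

Keeping the account and region on every alert matters once forty accounts feed one queue, because the analyst needs to know where to look before anything else.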

Two design rules decide whether this holds up. First, centralize into a separate logging account with tight access and S3 object-lock or equivalent immutability, so an attacker who lands in a workload account cannot delete the evidence of their own intrusion. StopLogging and bucket deletion are classic anti-forensic moves; the logs have to live somewhere the compromised credentials cannot reach. Second, make it an organization-wide trail, not per-account, so a new account does not silently start life with no logging. The most common real-world gap is not a missing detection. It is one account out of many that nobody enabled CloudTrail in.

The security signals worth watching

Collecting telemetry is half the job. The other half is knowing which patterns in it mean trouble. These are the AWS-native signals that earn an alert, mapped to the source that carries them.

Signal | Source | Why it matters
StopLogging / CloudTrail trail deleted or disabled | CloudTrail | Anti-forensics. An attacker blinding the logs is itself the alert
Root account used at all | CloudTrail | Root should be near-dormant; any use is high-severity
Console login without MFA, or from a new geo/IP | CloudTrail | Credential compromise indicator
AssumeRole chains and unusual cross-account role use | CloudTrail | Privilege escalation and lateral movement through IAM
Spike in AccessDenied / failed API calls | CloudTrail | Enumeration: an attacker probing what a stolen credential can do
New IAM user, access key, or policy attachment | CloudTrail | Persistence: attackers create their own durable access
Security group opened to 0.0.0.0/0 | Config / CloudTrail | Exposure created, often the prelude to or cover for an intrusion
Outbound traffic to known-bad IPs, or large unexplained egress | VPC Flow Logs | Exfiltration or C2
GuardDuty finding (cryptomining, exfiltration, recon) | GuardDuty | AWS already correlated the pattern; triage immediately
Public S3 bucket or disabled bucket encryption | Config | The classic data-exposure misconfiguration

A few of these deserve emphasis. StopLogging is the canary. A competent attacker tries to disable logging early; an alert on that single API call catches intrusions whose other steps you missed. Enumeration shows up as failed calls. A stolen credential of unknown scope gets tested, and the test generates a burst of AccessDenied across services that stands out sharply against a normal baseline. Identity actions are persistence. Creating a new user, minting an access key, or attaching AdministratorAccess is how an attacker keeps access after the original credential is rotated, and each is a single, alertable CloudTrail event.
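Because each of these is a single event, the detections can be plain predicates over a CloudTrail record. The sketch below is one possible arrangement, not a vetted ruleset: the rule names and the chosen event lists are assumptions, while the field names and event names are standard CloudTrail and IAM ones. The MFA rule in particular tends to be noisy in environments that use federated sign-in.

```python
def logging_tampered(e):
    return (e.get("eventSource") == "cloudtrail.amazonaws.com"
            and e.get("eventName") in {"StopLogging", "DeleteTrail", "UpdateTrail"})

def root_used(e):
    return e.get("userIdentity", {}).get("type") == "Root"

def console_login_no_mfa(e):
    return (e.get("eventName") == "ConsoleLogin"
            and (e.get("additionalEventData") or {}).get("MFAUsed") == "No")

def iam_persistence(e):
    return (e.get("eventSource") == "iam.amazonaws.com"
            and e.get("eventName") in {"CreateUser", "CreateAccessKey",
                                       "AttachUserPolicy", "AttachRolePolicy"})

# Hypothetical rule names paired with their predicates, in evaluation order.
RULES = [
    ("logging_tampered", logging_tampered),
    ("root_used", root_used),
    ("console_login_no_mfa", console_login_no_mfa),
    ("iam_persistence", iam_persistence),
]

def match_rules(event):
    """Return the names of every rule the event triggers."""
    return [name for name, predicate in RULES if predicate(event)]

stop_logging = {
    "eventSource": "cloudtrail.amazonaws.com",
    "eventName": "StopLogging",
    "userIdentity": {"type": "IAMUser"},
}
root_key = {
    "eventSource": "iam.amazonaws.com",
    "eventName": "CreateAccessKey",
    "userIdentity": {"type": "Root"},
}

print(match_rules(stop_logging))  # ['logging_tampered']
print(match_rules(root_key))      # ['root_used', 'iam_persistence']
```

An event that trips two rules at once, like the root access key above, is a natural candidate for the highest severity.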

The through-line: most high-value AWS detections are single API calls or simple rate anomalies in CloudTrail. You do not need machine learning to catch the basics. You need the log, a baseline of normal, and a rule.
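"The log, a baseline of normal, and a rule" can be shown for the AccessDenied spike. This sketch counts denied calls per identity in one time window and flags identities that exceed both an absolute floor and a multiple of their usual count. The floor, the multiplier, the substring match on error codes, and the sample data are all assumptions. Services do not use one uniform error code for denials, so the match is deliberately loose.

```python
from collections import Counter

def is_denied(event):
    """Loose match for authorization failures across services."""
    code = event.get("errorCode", "")
    return "AccessDenied" in code or "UnauthorizedOperation" in code

def denied_by_identity(events):
    counts = Counter()
    for e in events:
        if is_denied(e):
            counts[e.get("userIdentity", {}).get("arn", "unknown")] += 1
    return counts

def enumeration_suspects(events, baseline, multiplier=5, floor=10):
    """Flag identities whose denied-call count in this window is at least
    `floor` and more than `multiplier` times their baseline count."""
    suspects = {}
    for arn, n in denied_by_identity(events).items():
        normal = baseline.get(arn, 0)
        if n >= floor and n > multiplier * normal:
            suspects[arn] = n
    return suspects

# Hypothetical window: one identity probing many services, one behaving normally.
STOLEN = "arn:aws:iam::111122223333:user/dev-laptop"
NORMAL = "arn:aws:iam::111122223333:user/ci-runner"
window = (
    [{"errorCode": "AccessDenied", "userIdentity": {"arn": STOLEN}}] * 9
    + [{"errorCode": "Client.UnauthorizedOperation", "userIdentity": {"arn": STOLEN}}] * 3
    + [{"errorCode": "AccessDenied", "userIdentity": {"arn": NORMAL}}] * 2
    + [{"userIdentity": {"arn": NORMAL}}] * 50  # successful calls
)
usual = {STOLEN: 1, NORMAL: 1}

print(enumeration_suspects(window, usual))
```

The floor keeps a quiet identity with a baseline of zero from alerting on one or two stray denials.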

The gaps that bite teams

AWS observability fails in predictable ways. Knowing them is the difference between coverage you trust and coverage you assume.

Logging that was never enabled, or enabled unevenly. The default state is off. A trail in the primary account and none in the dev account, where the breach actually starts, is the most common real gap. Organization trails fix this; per-account setup invites drift.
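One cheap way to find the account nobody enabled logging in is to compare the accounts you expect against the accounts that actually appear in the central log bucket. The sketch below works on a list of S3 object keys and relies on the account ID appearing as a path segment just before CloudTrail/, which matches the standard CloudTrail key layout. The function names, the organization ID, and the sample keys are invented, and listing the bucket is left out.

```python
import re

# A 12-digit account ID path segment immediately before "CloudTrail/".
ACCOUNT_IN_KEY = re.compile(r"/(\d{12})/CloudTrail/")

def accounts_with_logs(object_keys):
    """Collect account IDs that have at least one CloudTrail log object."""
    found = set()
    for key in object_keys:
        match = ACCOUNT_IN_KEY.search(key)
        if match:
            found.add(match.group(1))
    return found

def silent_accounts(expected_accounts, object_keys):
    """Accounts that should be logging but have no objects in the bucket."""
    return sorted(set(expected_accounts) - accounts_with_logs(object_keys))

# Hypothetical keys from a central logging bucket.
keys = [
    "AWSLogs/o-exampleorgid/111122223333/CloudTrail/us-east-1/2024/05/14/"
    "111122223333_CloudTrail_us-east-1_20240514T0920Z_abc.json.gz",
    "AWSLogs/444455556666/CloudTrail/eu-west-1/2024/05/14/"
    "444455556666_CloudTrail_eu-west-1_20240514T0925Z_def.json.gz",
]
expected = ["111122223333", "444455556666", "777788889999"]

print(silent_accounts(expected, keys))  # ['777788889999']
```

Run against a recent date prefix rather than the whole bucket, the same check also catches an account whose logging was switched off last week.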

Data events off means data access is invisible. CloudTrail management events show that a role was assumed and a bucket's policy changed. They do not show the objects read out of it. S3 data events are off by default and high-volume, so teams skip them, and then cannot answer "what did they actually take?" during a breach. Decide deliberately which buckets warrant data-event logging.
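With S3 data events enabled, "what did they actually take?" becomes a filter over GetObject records. A minimal sketch under stated assumptions: the events are already loaded as dictionaries, the bucket and key are read from requestParameters.bucketName and requestParameters.key as S3 data events record them, and the sample identities and object keys are invented.

```python
from collections import defaultdict

def objects_read(events, bucket):
    """Map each identity to the object keys it successfully read
    from `bucket`, in the order the events were supplied."""
    taken = defaultdict(list)
    for e in events:
        if e.get("eventSource") != "s3.amazonaws.com":
            continue
        if e.get("eventName") != "GetObject" or "errorCode" in e:
            continue  # only successful object reads count
        params = e.get("requestParameters") or {}
        if params.get("bucketName") != bucket:
            continue
        who = e.get("userIdentity", {}).get("arn", "unknown")
        taken[who].append(params.get("key"))
    return dict(taken)

def _get(arn, bucket, key, error=None):
    # Helper that builds a hypothetical S3 data event for the demo.
    event = {
        "eventSource": "s3.amazonaws.com",
        "eventName": "GetObject",
        "userIdentity": {"arn": arn},
        "requestParameters": {"bucketName": bucket, "key": key},
    }
    if error:
        event["errorCode"] = error
    return event

ATTACKER = "arn:aws:sts::111122223333:assumed-role/dev-role/session-1"
data_events = [
    _get(ATTACKER, "customer-exports", "2024/q1/customers.csv"),
    _get(ATTACKER, "customer-exports", "2024/q1/invoices.csv"),
    _get(ATTACKER, "customer-exports", "2024/q1/secrets.csv", error="AccessDenied"),
    _get(ATTACKER, "public-assets", "logo.png"),
]

print(objects_read(data_events, "customer-exports"))
```

Without data events on that bucket, the list this function builds cannot be reconstructed from management events at all.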

Flow logs are metadata, not content. They prove 4 GB left for an external IP. They cannot tell you whether it was the customer database or cache files. Network-layer detection in AWS is volumetric and relational, not content-based, and an investigation that needs payload has to get it elsewhere.

GuardDuty is broad, not deep, and not free. It catches known patterns well and is the best first layer, but it will not catch a slow, careful, novel intrusion that never trips a known signature. Treating a quiet GuardDuty console as proof of safety is the trap. It complements your own detections; it does not replace them.

Cost throttles collection. Data events, verbose flow logs, and high log volumes cost real money to ingest and store, in AWS and in the SIEM. Teams quietly sample or drop logs to control spend, and the dropped log is the one the next investigation needed. The fix is deliberate: log the high-value sources fully, sample the noisy low-value ones, and never let the cost conversation silently turn off CloudTrail.

The cloud does not watch itself. The deepest gap is the assumption behind all the others, that moving to AWS means AWS handles security monitoring. AWS secures its infrastructure. Watching your activity inside it is your job, the same customer-owned responsibility that defines cloud security generally. Observability is how you discharge it.

Building AWS observability for detection

If you are standing this up, the order that reduces risk fastest:

  1. Turn on an organization-wide CloudTrail to a locked, centralized logging account before anything else. This is the single highest-value step; it is the audit trail every other detection depends on.
  2. Enable GuardDuty across the organization. It is low-effort, analyzes its foundational log sources without you configuring them separately, and gives an immediate detection baseline while you build your own.
  3. Enable VPC Flow Logs on production VPCs and ship them centrally for network-side visibility.
  4. Turn on AWS Config with a baseline ruleset (public buckets, open security groups, unencrypted storage, disabled logging) so misconfigurations are caught continuously, not at audit time.
  5. Centralize into a SIEM and write detections for the signals in the table above, starting with StopLogging, root usage, MFA-less logins, and AccessDenied spikes.
  6. Decide on data events deliberately, enabling S3 object-level logging on the buckets that hold sensitive data.

The fastest way to build the instinct for reading this telemetry is to work real cloud intrusions and trace them through CloudTrail and flow logs the way you would in production. Following an attacker from a leaked key through AssumeRole, enumeration, and exfiltration, in the actual logs, teaches what each source shows and where each one goes quiet.

The bottom line

AWS infrastructure observability for security is the discipline of recording what every identity and resource in your environment does, and watching that record for the patterns that mean an intrusion. The mechanics come from DevOps; the question is a defender's. Logs carry almost all the weight, CloudTrail above all, with VPC Flow Logs, GuardDuty, and Config filling in the network, threat-intel, and configuration sides, and CloudWatch metrics as cheap tripwires.

The trap is the default state: in AWS the telemetry is off until you turn it on, the cloud does not watch itself, and the most common breach finding is a blind spot rather than a clever exploit. Turn on an organization-wide CloudTrail to an immutable logging account, enable GuardDuty, ship everything to a SIEM, and alert on the signals that are single API calls.

Frequently asked questions

What is AWS infrastructure observability?

AWS infrastructure observability is the practice of collecting and analyzing the telemetry AWS emits about your environment (API calls, network flows, configuration state, and service metrics) so you can detect threats and investigate incidents. For security teams it answers a specific question: can I see what every identity and resource in my AWS account did, well enough to catch an intrusion and reconstruct it afterward? In AWS most of this telemetry is off by default and must be enabled.

What are the three pillars of observability?

Logs, metrics, and traces. Logs are timestamped records of events (in AWS, CloudTrail, VPC Flow Logs, and service logs). Metrics are numeric measurements over time (CloudWatch). Traces follow a request across services (X-Ray). For infrastructure security the weight is on logs, which carry the audit trail of who did what; metrics serve as anomaly tripwires; traces matter mostly at the application layer.

What is the difference between CloudTrail and CloudWatch?

CloudTrail records API activity: which identity called which AWS API, from where, and whether it succeeded. It is the audit log of actions taken in your account. CloudWatch is the monitoring service: it collects metrics, stores logs, and fires alarms on thresholds. In short, CloudTrail answers "who did what," and CloudWatch answers "is a resource behaving abnormally, and can I alert on it." They are often used together, with CloudTrail events forwarded into CloudWatch Logs.

What does GuardDuty monitor?

Amazon GuardDuty is AWS's managed threat-detection service. It automatically analyzes three foundational data sources (CloudTrail management events, VPC Flow Logs, and DNS logs) using threat intelligence feeds and machine learning. It flags suspicious activity such as compromised credentials, cryptomining, data exfiltration, and communication with known-malicious infrastructure, and generates findings you can route into a SIEM. It reads those logs without you building detections against them yourself, which makes it a strong first detection layer.

Do VPC Flow Logs capture packet contents?

No. VPC Flow Logs capture metadata about IP traffic: source and destination addresses and ports, protocol, byte and packet counts, and whether the traffic was accepted or rejected. They do not record the contents of the packets. They can show that a large volume of data left an instance for an external address, which is a strong exfiltration signal, but they cannot tell you what the data was. Payload-level inspection requires a different source.

How do AWS logs feed a SIEM?

CloudTrail writes to a centralized, locked-down S3 bucket (ideally one organization-wide trail), which the SIEM pulls from. VPC Flow Logs publish to CloudWatch Logs or S3 and forward on. GuardDuty findings route through Amazon EventBridge into the SIEM. AWS Config and CloudWatch alarms publish through EventBridge or SNS. The goal is a single logging account with immutable storage, so AWS telemetry is correlated alongside endpoint and identity logs and an attacker cannot delete the evidence of their own intrusion.
