What Is Data Onboarding? Getting Logs Into a SIEM

12 min read · Updated June 2026 · SIEM · Blue Team · SOC · Detection Engineering

A detection engineer writes a perfect rule for a credential-stuffing attack: ten failed logins then a success from one source. It never fires. Not because the attack did not happen, but because the authentication logs reach the SIEM with the source IP parsed into the wrong field, the timestamp in local time instead of UTC, and the username buried in a free-text message the rule never reads. The rule is fine. The data onboarding is broken.

Data onboarding is the work of getting a log source into the SIEM so it is actually usable: connected, parsed into fields, normalized to a common schema, enriched, and routed to the right place. It is the least glamorous stage of running a SIEM and the one that decides whether everything downstream works. A detection is only as good as the data under it.

This guide defines data onboarding precisely, walks the pipeline stage by stage, gives a checklist table of what a source needs before it counts as "onboarded," and is honest about where onboarding breaks. It is written for the people who do the work: SOC analysts, detection engineers, and the platform owners responsible for ingestion.

What is data onboarding?

Data onboarding is the process of bringing a new data source into a SIEM or analytics platform and making its events queryable, correlatable, and ready for detection. It is more than pointing a log at the platform. A source is onboarded only when its events arrive reliably, parse into the right fields, map to the platform's schema, carry the context an analyst needs, and land where detections and searches can reach them.

The distinction that matters: ingestion is not onboarding. Ingestion is the raw act of accepting bytes. Onboarding is the full job of turning those bytes into structured, normalized, searchable telemetry. A firewall that streams syslog into the SIEM is ingested. It is not onboarded until the deny events are parsed, the source and destination IPs sit in known fields, the action maps to a standard value, and a detection can ask "show me denies to this host" and get an answer.

Onboarding happens at two moments. The first is initial onboarding: adding a brand-new source, building or selecting its parser, mapping its fields, and validating the result. The second is ongoing onboarding: a vendor changes a log format, a new application ships, a cloud account is added, and the existing parser silently breaks. Onboarding is not a one-time project. It is a standing operational discipline, because the sources never stop changing.

Why it carries weight: every detection, dashboard, hunt, and investigation reads onboarded data. If a source is missing, your coverage has a hole you cannot see. If a source is onboarded wrong, your detections run against garbage and either miss real attacks or bury analysts in false positives. Coverage and data quality both live or die at onboarding.

How the data onboarding pipeline works

Figure: Data onboarding, raw log to detection-ready. Five stages; a failure at any one corrupts the rest.

01 Collection: Get every needed event off the source and to the platform, reliably.
02 Parsing: Break the raw event into named fields: user, IP, action, timestamp.
03 Normalization: Map fields to one common schema so sources speak the same language.
04 Enrichment: Add context: geo, asset, identity, threat-intel tags.
05 Routing & storage: Hot tier for detection, cold tier for retention, cost controlled.

Bottom line: Ingestion accepts bytes. Onboarding runs all five stages, then validates a real event end to end. A source is onboarded only when a detection fires on it.

Data onboarding is a pipeline. A raw log enters at one end; structured, searchable, detection-ready telemetry comes out the other. Each stage has one job, and a failure at any stage corrupts everything after it.

Collection

Get the events off the source and to the platform. The method depends on the source: an agent on an endpoint, syslog from a network device, an API pull from a cloud provider, a forwarder reading flat files, a webhook from a SaaS app. The decisions here are reliability and completeness. Does the transport survive a network blip without dropping events? Are you collecting every event type you need, or only the easy ones? A collection gap is the worst onboarding failure because nothing downstream can recover data that never arrived.
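
The "survive a network blip" question can be made concrete with a small sketch. It assumes a hypothetical send callable that raises ConnectionError when the link is down; the class name SpoolingForwarder and the simulated transport are illustrative only, not any real agent's API.

```python
from collections import deque
from typing import Callable, Deque


class SpoolingForwarder:
    """Keeps each event in a local spool until the transport accepts it."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send                  # hypothetical transport callable
        self._spool: Deque[str] = deque()

    def submit(self, event: str) -> None:
        """Queue an event, then try to drain the spool."""
        self._spool.append(event)
        self.flush()

    def flush(self) -> int:
        """Send spooled events in order; stop at the first failure."""
        sent = 0
        while self._spool:
            try:
                self._send(self._spool[0])
            except ConnectionError:
                break                      # event stays spooled for the next flush
            self._spool.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        return len(self._spool)


# Simulated transport (assumption): fails while the link is down.
link = {"up": False}
delivered = []


def flaky_send(event: str) -> None:
    if not link["up"]:
        raise ConnectionError("link down")
    delivered.append(event)


forwarder = SpoolingForwarder(flaky_send)
forwarder.submit("event-1")   # link down: spooled, not lost
forwarder.submit("event-2")
pending_during_outage = forwarder.pending
link["up"] = True
forwarder.flush()             # link restored: both events go out in order
```

An event leaves the spool only after the send succeeds, so an outage delays events instead of dropping them. This sketch spools in memory, so it would still lose data on a process restart; a real forwarder would more likely spool to disk.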

Parsing

Turn the raw event into structured fields. A syslog line, a JSON blob, a CSV row, a Windows Event XML record: each has to be broken into named fields the platform can index. This is where the username, source IP, action, and timestamp get pulled out of the raw text. Parsing is the stage that breaks most often and most silently, because a vendor can change a log format in a minor update and the parser keeps running while quietly dropping fields into the wrong place.
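
As a minimal sketch of this stage, the parser below pulls the user, source IP, port, and outcome out of an OpenSSH-style authentication line. The sample line, the field names, and the function name parse_auth_event are assumptions for illustration; a production parser would cover far more message variants.

```python
import re
from typing import Optional

# Matches OpenSSH-style password results, e.g.
# "Failed password for invalid user bob from 203.0.113.7 port 52144 ssh2"
AUTH_LINE = re.compile(
    r"(?P<action>Failed|Accepted) password for (?:invalid user )?"
    r"(?P<user>\S+) from (?P<src_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"port (?P<src_port>\d+)"
)


def parse_auth_event(raw: str) -> Optional[dict]:
    """Return named fields, or None when the line does not match."""
    match = AUTH_LINE.search(raw)
    if match is None:
        return None  # surface unparsed lines instead of guessing
    fields = match.groupdict()
    fields["src_port"] = int(fields["src_port"])
    fields["action"] = "failure" if fields["action"] == "Failed" else "success"
    return fields


sample = (
    "Jun  3 14:22:07 web01 sshd[4121]: "
    "Failed password for invalid user bob from 203.0.113.7 port 52144 ssh2"
)
parsed = parse_auth_event(sample)
unparsed = parse_auth_event("kernel: eth0 link up")
```

Returning None for a non-matching line is a deliberate choice here: a count of unparsed lines is something you can monitor, while fields filed in the wrong place are not.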

Normalization

Map the parsed fields to a common schema so sources speak the same language. One firewall calls it src_ip, another calls it source-address, a cloud log calls it sourceIPAddress. Normalization maps all three to one field name, so a detection or a log analysis query can ask one question across every source instead of one question per vendor. Schemas like the Elastic Common Schema (ECS) or the Open Cybersecurity Schema Framework (OCSF) exist precisely to standardize this mapping. Without normalization, cross-source correlation breaks down: the SIEM cannot connect a firewall deny to an endpoint process if the host identifier is named differently in each, unless every rule carries its own per-vendor field logic.
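
The three-names-for-one-field problem can be sketched as a simple alias table. The target name source.ip follows the Elastic Common Schema; the alias table, the unmapped bucket, and the sample events are assumptions for illustration.

```python
# Vendor field names mapped to one schema field (ECS-style "source.ip").
FIELD_ALIASES = {
    "src_ip": "source.ip",
    "source-address": "source.ip",
    "sourceIPAddress": "source.ip",
}


def normalize(event: dict) -> dict:
    """Rename known vendor fields; keep unknown ones visible under 'unmapped'."""
    normalized: dict = {}
    unmapped: dict = {}
    for key, value in event.items():
        target = FIELD_ALIASES.get(key)
        if target is None:
            unmapped[key] = value
        else:
            normalized[target] = value
    if unmapped:
        normalized["unmapped"] = unmapped
    return normalized


vendor_events = [
    {"src_ip": "203.0.113.7"},                            # firewall A
    {"source-address": "203.0.113.7"},                    # firewall B
    {"sourceIPAddress": "203.0.113.7", "region": "eu"},   # cloud log
]
normalized_events = [normalize(e) for e in vendor_events]
```

After normalization, one query on source.ip covers all three sources. Keeping leftovers under an explicit unmapped key, instead of silently dropping them, makes it easy to see which fields still need a mapping.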

Enrichment

Add context the raw event does not carry. Geolocation on an IP, a hostname resolved from an asset inventory, a user's department from the directory, a threat-intelligence tag on a known-bad domain. Enrichment is what lets an analyst read an alert and understand it without pivoting to five other tools. A login from a new country means more when the event already carries the user's normal location and role.
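
A minimal enrichment sketch, assuming in-memory lookup tables stand in for the asset inventory, the directory, and a threat-intelligence feed. Every table, field name, and value below is hypothetical.

```python
# Hypothetical lookup tables; real ones would come from a CMDB, a directory,
# and a threat-intelligence feed.
ASSET_INVENTORY = {"10.0.4.17": {"hostname": "fin-db-01", "criticality": "high"}}
DIRECTORY = {"alice": {"department": "Finance", "usual_country": "DE"}}
THREAT_INTEL_IPS = {"198.51.100.23"}


def enrich(event: dict) -> dict:
    """Return a copy of the event with asset, identity, and intel context."""
    enriched = dict(event)
    asset = ASSET_INVENTORY.get(event.get("destination.ip"))
    if asset:
        enriched["destination.hostname"] = asset["hostname"]
        enriched["destination.criticality"] = asset["criticality"]
    person = DIRECTORY.get(event.get("user.name"))
    if person:
        enriched["user.department"] = person["department"]
        enriched["user.usual_country"] = person["usual_country"]
    enriched["threat.known_bad_source"] = event.get("source.ip") in THREAT_INTEL_IPS
    return enriched


login = {
    "user.name": "alice",
    "source.ip": "198.51.100.23",
    "destination.ip": "10.0.4.17",
}
enriched_login = enrich(login)
```

The enriched event now tells the analyst that a Finance user reached a high-criticality database from a flagged address, without a pivot to another tool.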

Routing and storage

Decide where each event goes and how long it lives. Not all data deserves the same treatment. High-value security logs go to hot, searchable storage feeding real-time detection. High-volume, low-signal data may go to cheaper storage or a separate tier, kept for compliance and retroactive hunting but not indexed for live correlation. This is where data onboarding meets cost: SIEM pricing is usually tied to volume, so routing decisions directly control the bill.
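
A rough sketch of a routing decision and its effect on volume per tier. The source types, the tier names, and the size_bytes field are assumptions; real platforms express this as pipeline or index configuration, not application code.

```python
# Source types considered high-value for live detection (assumed list).
HOT_SOURCE_TYPES = {"auth", "edr", "firewall_deny"}


def route(event: dict) -> str:
    """Send high-value security sources to 'hot', everything else to 'cold'."""
    return "hot" if event.get("source_type") in HOT_SOURCE_TYPES else "cold"


def bytes_per_tier(events: list) -> dict:
    """Sum event sizes per tier to see what each routing choice costs."""
    totals = {"hot": 0, "cold": 0}
    for event in events:
        totals[route(event)] += event.get("size_bytes", 0)
    return totals


batch = [
    {"source_type": "auth", "size_bytes": 400},
    {"source_type": "edr", "size_bytes": 1200},
    {"source_type": "app_debug", "size_bytes": 9000},
    {"source_type": "netflow", "size_bytes": 5000},
]
tier_totals = bytes_per_tier(batch)
```

In this made-up batch the two low-signal sources account for most of the bytes, which is the typical argument for keeping them out of the hot tier.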

What a source needs before it is "onboarded"

A source is not onboarded because it is sending data. It is onboarded when it clears every row below. Use this as the checklist before you call a source done.

Stage	The question	Onboarded when
Collection	Are all needed events arriving reliably?	No drops, all required event types covered
Parsing	Are events broken into the right fields?	Key fields extracted correctly, no data in free-text
Normalization	Do fields map to the common schema?	Standard field names, consistent values across sources
Enrichment	Is analyst context attached?	Geo, asset, identity, and threat-intel context present
Routing	Is the data in the right tier?	Hot for detection, cold for retention, cost controlled
Validation	Has the result been confirmed?	Test events verified end to end, detections firing

Read the table top to bottom: each stage depends on the one above it. Perfect routing cannot fix broken parsing. Enrichment cannot rescue events that collection dropped. Onboarding is only as strong as its weakest stage.

Where data onboarding breaks

Onboarding fails quietly, which is what makes it dangerous. A detection that never fires looks exactly like an environment with no attacks. These are the failure modes to watch.

Silent collection gaps. A forwarder dies, a cloud API key expires, a firewall stops sending. No error reaches the SOC, the dashboard just goes quiet, and the gap is discovered during an incident when the logs are not there. Onboarding needs monitoring for the absence of data, not only for the data.
Parser drift. A vendor changes a log format in a routine update. The parser keeps running but starts misfiling fields. Detections built on those fields silently stop matching. This is the most common ongoing onboarding failure and the hardest to spot, because nothing breaks loudly.
Normalization mismatches. A new source is parsed but never mapped to the schema, so its events sit in the SIEM unreachable by cross-source detections. The data is technically there and effectively invisible.
Onboarding the easy fields only. A source gets connected with a default parser that grabs the obvious fields and leaves the rest in a raw message blob. The source looks onboarded on a dashboard and fails the moment a detection needs a field the default parser skipped.
Volume without value. Onboarding everything because you can, not because you need it, inflates cost and buries signal. High-volume debug logs with no security value still get indexed and still get billed. Onboarding is a decision about what to collect, not only how.
No validation. A source is "onboarded" and never tested with a known event to confirm it parses, normalizes, and triggers a detection end to end. The first real test becomes the first real incident.

None of these are exotic. They are the ordinary ways a SIEM ends up with blind spots its operators do not know about.
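
Parser drift and easy-fields-only onboarding both show up as fields that are less populated than they should be. One way to catch that, sketched under assumptions, is to compare per-field fill rates against a baseline. The function names, the 0.2 threshold, and the sample events are illustrative.

```python
def fill_rates(events: list, fields: list) -> dict:
    """Fraction of events in which each field is present and non-empty."""
    total = len(events)
    rates = {}
    for field in fields:
        populated = sum(1 for e in events if e.get(field) not in (None, ""))
        rates[field] = populated / total if total else 0.0
    return rates


def drifted_fields(baseline: dict, current: dict, max_drop: float = 0.2) -> list:
    """Fields whose fill rate fell by more than max_drop since the baseline."""
    return sorted(
        field
        for field, base_rate in baseline.items()
        if base_rate - current.get(field, 0.0) > max_drop
    )


FIELDS = ["user.name", "source.ip"]
last_week = [{"user.name": "alice", "source.ip": "203.0.113.7"}] * 4
# After a vendor update the username mostly stops being extracted.
this_week = [
    {"user.name": "alice", "source.ip": "203.0.113.7"},
    {"user.name": "", "source.ip": "203.0.113.7"},
    {"user.name": "", "source.ip": "203.0.113.8"},
    {"user.name": "", "source.ip": "203.0.113.9"},
]
drifted = drifted_fields(fill_rates(last_week, FIELDS), fill_rates(this_week, FIELDS))
```

A check like this only catches fields that go empty. A field that fills with the wrong value needs value-level checks as well, such as confirming that source.ip still parses as an IP address.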

Data onboarding best practices

The discipline that keeps onboarding healthy is mostly about validation and monitoring, not the initial connection.

Onboard against use cases, not availability. Start from the detections and investigations you need, then onboard the sources that feed them. Collecting a log because it exists, with no detection that reads it, is paying to store noise. This keeps both cost and signal under control.

Validate every source end to end. Generate a known test event at the source and confirm it arrives, parses into the right fields, normalizes to the schema, and trips a detection. A source is not onboarded until you have seen a real event travel the full pipeline. Re-validate after any vendor update.
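
Part of that validation can be automated. The sketch below checks a test event after it has passed through the pipeline: required fields present and a timezone-aware UTC timestamp. The field names follow ECS; the required list and the function name are assumptions, and confirming that a detection fired remains a separate step.

```python
from datetime import datetime, timedelta, timezone

# ECS-style fields a hypothetical authentication detection relies on.
REQUIRED_FIELDS = ("@timestamp", "source.ip", "user.name", "event.outcome")


def validation_problems(event: dict) -> list:
    """List what is wrong with a test event; an empty list means it passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not event.get(field):
            problems.append(f"missing field: {field}")
    timestamp = event.get("@timestamp")
    # A naive datetime has no offset at all, so it is flagged as well.
    if isinstance(timestamp, datetime) and timestamp.utcoffset() != timedelta(0):
        problems.append("timestamp is not UTC")
    return problems


good_event = {
    "@timestamp": datetime(2026, 6, 3, 14, 22, 7, tzinfo=timezone.utc),
    "source.ip": "203.0.113.7",
    "user.name": "test-onboarding",
    "event.outcome": "failure",
}
bad_event = {
    "@timestamp": datetime(2026, 6, 3, 9, 22, 7),   # local time, no zone
    "source.ip": "203.0.113.7",
    "message": "login failed for test-onboarding",  # username left in free text
    "event.outcome": "failure",
}
good_result = validation_problems(good_event)
bad_result = validation_problems(bad_event)
```

Using a recognizable test username makes the event easy to find at every stage and easy to exclude from real alerting afterwards.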

Monitor for the absence of data. Build alerts that fire when an expected source goes quiet. Collection gaps are invisible by default. The only way to catch them before an incident does is to watch each source's volume and alert on a drop.
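
A minimal sketch of an absence check, assuming you can obtain the last-seen timestamp per source from the platform. The source names, the thresholds, and the one-hour default are illustrative.

```python
from datetime import datetime, timedelta, timezone


def silent_sources(
    last_seen: dict,
    now: datetime,
    max_silence: dict,
    default: timedelta = timedelta(hours=1),
) -> list:
    """Sources whose most recent event is older than their allowed silence."""
    quiet = []
    for source, seen in last_seen.items():
        allowed = max_silence.get(source, default)
        if now - seen > allowed:
            quiet.append(source)
    return sorted(quiet)


now = datetime(2026, 6, 3, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "firewall-edge": now - timedelta(minutes=2),
    "cloud-audit": now - timedelta(hours=3),
    "hr-app": now - timedelta(hours=20),
}
# Chatty sources get tight thresholds; low-volume ones get generous ones.
max_silence = {
    "firewall-edge": timedelta(minutes=5),
    "hr-app": timedelta(hours=24),
}
alerts = silent_sources(last_seen, now, max_silence)
```

Per-source thresholds matter: a firewall that is quiet for ten minutes is suspicious, while a low-volume application that is quiet overnight is normal.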

Standardize on a schema early. Pick a common schema, such as OCSF or the Elastic Common Schema, and map every source to it. Retrofitting normalization across dozens of already-onboarded sources is far more expensive than mapping each one as it lands.

Treat parsers as code. Version parsers, test them against sample logs, and review them when a source changes. Parser drift is a code-maintenance problem in disguise, and the teams that handle it well treat it like one.
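
In practice, "test them against sample logs" can be as simple as a table of raw lines with their expected fields, run on every parser change. The key=value parser and the samples below are hypothetical stand-ins for a real parser and real captured logs.

```python
def parse_kv(raw: str) -> dict:
    """Toy parser for 'key=value key=value' lines (illustrative only)."""
    fields = {}
    for token in raw.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


def regression_failures(parser, samples: list) -> list:
    """Run the parser over (raw, expected) pairs and collect mismatches."""
    failures = []
    for raw, expected in samples:
        actual = parser(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


EXPECTED = {"user": "alice", "src": "203.0.113.7", "action": "deny"}

# Samples captured from the current log format: the parser should pass.
current_samples = [("user=alice src=203.0.113.7 action=deny", EXPECTED)]
# Sample from a hypothetical vendor update that switched '=' to ':'.
updated_samples = [("user:alice src:203.0.113.7 action:deny", EXPECTED)]

current_failures = regression_failures(parse_kv, current_samples)
updated_failures = regression_failures(parse_kv, updated_samples)
```

Adding a sample from each new vendor release to the table turns silent drift into a failing test before the change reaches production.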

How a blue team uses data onboarding

Onboarding is not a one-time setup task handed to a platform admin. It is continuous work that shapes what every analyst can see.

Detection engineering depends on it directly. A detection engineering team cannot write a rule for a field that onboarding never parsed. Before building detections for a new source, the engineer confirms the fields the rule needs are extracted and normalized. Half of detection work is verifying the data is there in the shape the rule expects.
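
The credential-stuffing rule from the introduction shows this dependency. The sketch below is one possible implementation, assuming time-ordered events with ECS-style source.ip and event.outcome fields; the function name and threshold are illustrative.

```python
from collections import defaultdict


def credential_stuffing_sources(events: list, threshold: int = 10) -> list:
    """Sources with at least `threshold` failed logins followed by a success.

    Assumes events are already sorted by time.
    """
    failures = defaultdict(int)
    hits = []
    for event in events:
        source = event.get("source.ip")
        outcome = event.get("event.outcome")
        if source is None or outcome is None:
            continue  # the rule cannot read events that lack its fields
        if outcome == "failure":
            failures[source] += 1
        elif outcome == "success":
            if failures[source] >= threshold:
                hits.append(source)
            failures[source] = 0
    return hits


attack = [{"source.ip": "203.0.113.7", "event.outcome": "failure"}] * 10
attack = attack + [{"source.ip": "203.0.113.7", "event.outcome": "success"}]

# Same attack, but onboarding left the IP in a field the rule never reads.
misparsed = [{"src": e["source.ip"], "event.outcome": e["event.outcome"]} for e in attack]

fired_on_good_data = credential_stuffing_sources(attack)
fired_on_bad_data = credential_stuffing_sources(misparsed)
```

The rule logic is identical in both runs. Only the field name differs, and that alone decides whether the attack is detected.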

Threat hunting reveals onboarding gaps. A hunter who goes looking for a behavior and finds the source missing, or the field empty, has found an onboarding gap before an attacker did. Hunts double as coverage audits.

Incident response pays the bill for bad onboarding. When responders need the logs and a source was never onboarded, or was onboarded wrong, the evidence is not there. The investigation stalls on missing data that should have been collected months earlier. The quality of an investigation is set long before the incident, at onboarding.

SOC analysts feel parser drift first. When a once-reliable detection goes quiet or a field that used to populate is suddenly empty, the analyst working the queue is the early warning that a parser broke. Closing that loop back to whoever owns onboarding is part of the job.

Working real logs end to end (parsing them, normalizing them, and deciding what is signal) is the same skill onboarding demands, and the fastest way to build it is to practice on realistic data.

Frequently Asked Questions

What is data onboarding in a SIEM?

Data onboarding is the process of bringing a log source into a SIEM and making its events usable: collected reliably, parsed into fields, normalized to a common schema, enriched with context, and routed to the right storage. A source is onboarded only when a detection or search can actually read its data, not just when it is sending events.

What is the difference between data ingestion and data onboarding?

Ingestion is the raw act of accepting log data into the platform. Onboarding is the full job of turning that raw data into structured, normalized, searchable telemetry that detections can use. A source can be ingested (sending bytes) but not yet onboarded (parsed and mapped), which is why ingested data often sits in a SIEM unusable.

Why does data onboarding matter for detection?

Every detection reads onboarded data, so a missing or badly onboarded source creates a blind spot. If a field a rule needs was never parsed, the rule cannot fire even when the attack happens. Coverage gaps and false negatives usually trace back to onboarding, not to the detection logic itself.

What is log normalization?

Normalization maps parsed fields from different sources to one common schema, so a source IP carries the same field name whether it came from a firewall, an endpoint, or a cloud log. It is what makes cross-source correlation possible. Schemas like OCSF and the Elastic Common Schema exist to standardize this mapping.

What is parser drift?

Parser drift is when a log source changes its format, often in a routine vendor update, and the existing parser keeps running while misfiling or dropping fields. Detections built on those fields silently stop matching. It is the most common ongoing onboarding failure because nothing throws an error; the data just quietly becomes wrong.

How do you validate that a source is onboarded correctly?

Generate a known test event at the source and follow it through the full pipeline: confirm it arrives, parses into the correct fields, normalizes to the schema, and triggers a detection. A source is not onboarded until a real event has traveled end to end and a detection has fired on it. Re-validate after any source or vendor change.

The bottom line

Data onboarding is the work of getting a source into a SIEM so it is actually usable: collected reliably, parsed into fields, normalized to a common schema, enriched with context, and routed to the right tier. Ingestion accepts bytes; onboarding makes them mean something. Every detection, hunt, and investigation downstream reads that onboarded data, so the quality of your security operations is capped by the quality of your onboarding.

It is not a one-time project. Sources change, vendors update formats, parsers drift, and collection gaps open silently. The teams that run onboarding well treat it as a standing discipline: onboard against use cases, validate every source end to end, monitor for the absence of data, standardize on a schema early, and version parsers like the code they are. Get onboarding right and the detections have something solid to stand on. Get it wrong and the best rule in the world fires on nothing.
