What Are Log File Formats? JSON, CEF, Syslog, W3C
A log file format is the convention that defines how a system writes each event: which fields appear, in what order, what separates them, and how they are encoded.
Two log lines describe the same event. One reads CEF:0|Security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2. The other reads {"vendor":"Security","product":"threatmanager","event":"worm stopped","src":"10.0.0.1","dst":"2.1.2.2"}. A human can read both. A parser cannot, unless it was told which one to expect. That gap, between a line a person can skim and a line a machine can split into fields, is what a log file format is for. Get the format wrong and the source IP that should be a pivot point stays buried in a string.
A log file format is the agreed structure a system uses to write each event: which fields appear, in what order, separated by what, and whether a machine can split them back apart without guessing. Web servers, firewalls, operating systems, and cloud services each pick a format, and the format determines how cheaply you can search, correlate, and alert on what they wrote. This guide covers what a log file format is, the split between structured and unstructured logs, the six formats you will actually meet (JSON, Windows Event logs, CEF, CLF, ELF, and W3C), the syslog framing that carries many of them, and why format mismatches are where ingestion pipelines break. It is written for the people who feed these files into a detection stack: SOC analysts, threat hunters, and DFIR responders.
What is a log file format?
A log file format is the convention that defines how a single log event is written to a line or record: the set of fields, their order, the delimiter between them, and the encoding. Two systems can record the identical event, a blocked connection from 10.0.0.1, and produce completely different bytes, because they follow different formats. The event is the same. The structure is not.
Format matters because almost nothing reads a raw log by eye at scale. The log gets shipped to a collector, split into fields, normalized to a common schema, and queried. Every one of those steps depends on the parser knowing the format in advance. A parser built for space-delimited web logs will mangle a JSON line. A regex tuned for one vendor's syslog will drop a field when the vendor reorders its output. The format is the contract between the system that writes the log and the pipeline that reads it, and when the contract is unstated or wrong, fields go missing silently.
For a defender that silent failure is the risk. A log that ingests but parses into one giant unsplit message field is worse than no log, because dashboards show green while the field you need to alert on is unsearchable. Knowing the formats is how you catch that before an incident, not during one.
Structured, semi-structured, and unstructured logs
Before the named formats, one distinction organizes all of them: how much pattern the log carries, and therefore how much work a machine does to read it.
Structured logs follow a clear, consistent pattern that both humans and machines can read. Fields sit in a fixed order, separated by a known character such as a comma, a space, or a pipe. A CSV log or a fixed-field web log is structured: the third field is always the status code, so a parser splits on the delimiter and counts. Predictable and cheap to parse, but rigid. Add a field in the wrong place and every downstream parser that counts positions breaks.
Semi-structured logs carry a schema without forcing a fixed position. JSON is the canonical example: each value is labeled by a key, so order does not matter and fields can nest, but a machine still needs to parse the JSON to read it. This is the sweet spot most modern logging targets, readable enough for a human, structured enough for a machine, flexible enough to add fields without breaking the reader.
Unstructured logs have no enforced pattern. They are human-readable free text, a line an application printed for a person, and a machine cannot reliably split them without custom parsing, usually regex written per-source. Many application and debug logs land here. They are the most expensive to work with at scale, because every new message shape is a new parsing problem.
The practical takeaway: the more structure a format enforces, the less custom parsing you write, and the more reliably a field survives the trip into your search index. Most of the named formats below exist to push logs out of the unstructured bucket.
The six log file formats you will meet
Six formats cover the large majority of logs a defender ingests. Read them as a spectrum, from rigid fixed-field web logs to fully customizable schemas.
JSON
JSON (JavaScript Object Notation) is semi-structured and is the default for most modern applications, cloud services, and APIs. An event is a set of key-value pairs, values can nest, and the format supports data types (strings, numbers, booleans, arrays) and UTF-8 text. Because every value is labeled, field order is irrelevant and a source can add a field without breaking a parser that does not know about it. That flexibility is exactly why it dominates new logging: it tolerates change. The cost is verbosity (the keys repeat on every line) and that a malformed brace breaks the whole record.
A JSON log line looks like this:
{"timestamp":"2026-06-20T02:14:07Z","src_ip":"10.0.0.1","action":"blocked","status":403,"user_agent":"python-requests/2.31.0"}
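As a minimal sketch of what "semi-structured" means for a reader, the line above can be parsed with Python's standard json module. The helper name parse_json_line is an assumption made for illustration; the point is that fields are looked up by key rather than by position, and that a single missing brace loses the whole record.

```python
import json

line = ('{"timestamp":"2026-06-20T02:14:07Z","src_ip":"10.0.0.1",'
        '"action":"blocked","status":403,'
        '"user_agent":"python-requests/2.31.0"}')

def parse_json_line(raw):
    """Return the event as a dict, or None if the record is malformed.

    parse_json_line is a hypothetical helper name, not part of any library.
    """
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # A bare string or number is valid JSON but not a usable event.
    return event if isinstance(event, dict) else None

event = parse_json_line(line)
print(event["src_ip"], event["status"])  # fields are read by key, not position

# Dropping the closing brace makes the whole record unparseable.
truncated = parse_json_line(line[:-1])
print(truncated)
```

Returning None instead of raising is a design choice for this sketch: a pipeline usually wants to count and route malformed records rather than stop on the first one.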
Windows Event logs
Windows Event logs are the operating system's own structured record of OS and security events, stored in the binary .evtx format and read through Event Viewer or exported to XML. Each event carries a timestamp, an Event ID, the source, the host, and often a username, plus event-specific data. They are detailed and consistent, which is why so much endpoint detection keys on Event IDs (the 4624 / 4625 logon family, 4688 process creation). The catch is they are a Windows-native binary, so getting them into a cross-platform pipeline means forwarding or converting them first. The Windows event log is the first place most host-based investigations start.
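Once an event has been exported to XML, flattening it into searchable fields is short work. The sketch below uses Python's standard xml.etree.ElementTree and the namespace that Windows event XML declares. The sample event is hand-written for illustration (the host name, user name, and address are made up), and flatten_event is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root element of Windows event XML.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Illustrative, hand-written sample: not taken from a real host.
SAMPLE_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>4625</EventID>
    <TimeCreated SystemTime="2026-06-20T02:14:07.000Z"/>
    <Computer>WS01.example.local</Computer>
  </System>
  <EventData>
    <Data Name="TargetUserName">frank</Data>
    <Data Name="IpAddress">10.0.0.1</Data>
  </EventData>
</Event>"""

def flatten_event(xml_text):
    """Flatten one exported event into a single dict of fields."""
    root = ET.fromstring(xml_text)
    system = root.find("e:System", NS)
    flat = {
        "event_id": int(system.findtext("e:EventID", namespaces=NS)),
        "time": system.find("e:TimeCreated", NS).get("SystemTime"),
        "host": system.findtext("e:Computer", namespaces=NS),
    }
    # Event-specific data arrives as <Data Name="...">value</Data> pairs.
    for data in root.findall("e:EventData/e:Data", NS):
        flat[data.get("Name")] = data.text
    return flat

flat = flatten_event(SAMPLE_XML)
print(flat["event_id"], flat["TargetUserName"], flat["IpAddress"])
```

This assumes every event has a System block with EventID, TimeCreated, and Computer; a production reader would guard against missing elements.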
CEF (Common Event Format)
CEF is an open, text-based format ArcSight created to carry security events from many device types into a SIEM in one shape. A CEF message is usually carried as the body of a syslog line: a pipe-delimited header of exactly seven fields followed by a key-value extension:
CEF:Version|Device Vendor|Device Product|Device Version|Device Event Class ID|Name|Severity|Extension
The canonical example from the specification:
Sep 19 08:26:10 host CEF:0|Security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232
CEF:0 is the format version (0 or 1). Then vendor, product, and product version identify the source. 100 is the device event class ID, worm successfully stopped is the human-readable name, and 10 is the severity (0 to 10). Everything after the seventh pipe is the extension: space-separated key=value pairs in any order (src, dst, spt here). CEF's whole point is normalization at the source, so a SIEM ingesting twenty vendors sees one structure instead of twenty.
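The layout described above can be sketched as a small parser: split the header on unescaped pipes, then walk the extension as key=value pairs. This is a simplified illustration, not a complete CEF implementation. The names parse_cef, HEADER_KEYS, and EXT_KEY are assumptions, and the escape handling is deliberately minimal (a header field ending in an escaped backslash directly before a pipe would be mis-split).

```python
import re

HEADER_KEYS = ["version", "device_vendor", "device_product", "device_version",
               "device_event_class_id", "name", "severity"]
# An extension key is a word followed by "=", at the start or after whitespace.
EXT_KEY = re.compile(r"(?:^|\s)(\w+)=")

def parse_cef(line):
    """Parse one CEF message; return a dict, or None if it is not CEF."""
    start = line.find("CEF:")
    if start == -1:
        return None
    # Split on pipes that are not preceded by a backslash. Seven splits
    # yield the seven header fields plus the extension as the eighth part.
    parts = re.split(r"(?<!\\)\|", line[start + 4:], maxsplit=7)
    if len(parts) < 7:
        return None
    event = {k: re.sub(r"\\([|\\])", r"\1", v)
             for k, v in zip(HEADER_KEYS, parts)}
    ext = parts[7] if len(parts) > 7 else ""
    fields = {}
    matches = list(EXT_KEY.finditer(ext))
    for i, m in enumerate(matches):
        # A value runs until the next key starts, so it may contain spaces.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(ext)
        fields[m.group(1)] = re.sub(r"\\([=\\])", r"\1",
                                    ext[m.end():end].strip())
    event["extension"] = fields
    return event

sample = ("Sep 19 08:26:10 host CEF:0|Security|threatmanager|1.0|100|"
          "worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232")
event = parse_cef(sample)
print(event["name"], event["extension"]["src"])
```

Keeping the extension in its own nested dict avoids a collision if a vendor ever uses an extension key that matches a header field name.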
CLF (Common Log Format)
CLF is one of the oldest web server log formats and is rigidly fixed: seven fields, fixed order, no customization. Apache defines it as %h %l %u %t "%r" %>s %b, and a line reads:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Client IP, identity, userid, timestamp, request line, status code, response bytes. Because the structure never changes, a parser counts fields and never guesses. The limitation is the flip side: CLF cannot record anything outside those seven fields, which is why most operators run a richer format instead.
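Because the seven fields never move, one regular expression is enough to read CLF. The sketch below is an illustration under simplifying assumptions: the pattern and the name parse_clf are not from Apache, and a request line containing an escaped double quote would not match.

```python
import re

# One named group per CLF field, in the fixed order the format defines.
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf(line):
    """Return the seven CLF fields as a dict, or None if the line is not CLF."""
    # fullmatch rejects lines with trailing extra fields instead of
    # silently ignoring them.
    m = CLF.fullmatch(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    # Apache writes "-" when no response body was sent.
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    return rec

rec = parse_clf('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
                '"GET /apache_pb.gif HTTP/1.0" 200 2326')
print(rec["host"], rec["status"], rec["bytes"])
```

Using fullmatch rather than match is deliberate here: a line in a richer format fails loudly (None) instead of parsing as CLF with the extra fields dropped.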
ELF (Extended Log Format)
ELF (the NCSA Extended/Combined Log Format) extends CLF with more fields. In practice this is Apache's Combined Log Format: CLF plus the referer and the user-agent appended as two quoted fields, %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i". Unlike the W3C format covered next, a Combined log carries no header directives: there are no # lines declaring version or field layout, so the parser has to know the layout in advance. The extra fields are exactly the ones a hunt wants: the user-agent is where scanners like sqlmap and python-requests announce themselves, and the referer can expose where a malicious link was hosted.
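To make the hunting value of the two extra fields concrete, here is a rough sketch that parses a Combined line and checks the user-agent against a short list of tool names. The sample line, the SUSPECT_AGENTS list, and the function names are all assumptions made for illustration; a real hunt would use a maintained list and treat a match as a lead, not a verdict.

```python
import re

# CLF fields followed by the quoted referer and user-agent.
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Hypothetical watch list: substrings that tools put in their user-agent.
SUSPECT_AGENTS = ("sqlmap", "python-requests")

def parse_combined(line):
    """Return the Combined fields as a dict, or None on a non-matching line."""
    m = COMBINED.fullmatch(line.strip())
    return m.groupdict() if m else None

def is_suspect(rec):
    """True when the user-agent contains one of the watched substrings."""
    agent = rec["user_agent"].lower()
    return any(name in agent for name in SUSPECT_AGENTS)

# Illustrative line, written for this example.
sample = ('10.0.0.1 - - [20/Jun/2026:02:14:07 +0000] '
          '"GET /login.php?id=1 HTTP/1.1" 403 512 "-" '
          '"sqlmap/1.7 (https://sqlmap.org)"')
rec = parse_combined(sample)
print(rec["host"], rec["user_agent"], is_suspect(rec))
```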
W3C (W3C Extended Log Format)
The W3C Extended Log Format is the most customizable of the six and is the default for Microsoft IIS. The administrator picks which fields to log, and #Fields: header directives at the top of the file declare exactly which fields appear and in what order, so the file is self-describing. Field names use prefix notation to mark direction: s- for server, c- for client, cs- for client-to-server, and sc- for server-to-client. So cs-method is the request method the client sent, sc-status is the status the server returned, and c-ip is the client address. The flexibility is the strength and the trap: because every IIS deployment can log a different field set, you cannot assume column positions, you read the #Fields: line first.
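Reading the #Fields: line first can be sketched in a few lines: take the column names from the directive, then zip each data row against them. The sample log and the name parse_w3c are assumptions for illustration, and the sketch relies on values being space-separated with no embedded spaces.

```python
def parse_w3c(lines):
    """Yield one dict per data row, keyed by the names in #Fields:."""
    fields = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A new #Fields: directive can appear mid-file and replaces
            # the layout for every row that follows it.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        if fields is None:
            raise ValueError("data row before any #Fields: directive")
        values = line.split()
        if len(values) != len(fields):
            continue  # layout mismatch: skip rather than mislabel columns
        yield dict(zip(fields, values))

# Illustrative file contents, written for this example.
sample = """#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Fields: date time c-ip cs-method cs-uri-stem sc-status
2026-06-20 02:14:07 10.0.0.1 GET /login.aspx 403
"""
rows = list(parse_w3c(sample.splitlines()))
print(rows[0]["c-ip"], rows[0]["cs-method"], rows[0]["sc-status"])
```

Skipping rows whose column count does not match the declared layout is a conservative choice for this sketch; counting those skips would be a natural addition in a real pipeline.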
How the formats compare
The same four questions sort all six formats: how much structure, who writes it, can a machine parse it without custom code, and how flexible is the field set.
| Format | Structure | Typical source | Machine-parsable | Field flexibility |
|---|---|---|---|---|
| JSON | Semi-structured | Apps, cloud, APIs | Yes, native | High, add keys freely |
| Windows Event | Structured (binary) | Windows OS / security | Yes, after export | Fixed schema per Event ID |
| CEF | Structured | Security devices to SIEM | Yes, known layout | Fixed header, open extension |
| CLF | Structured | Web servers (legacy) | Yes, fixed positions | None, 7 fixed fields |
| ELF / Combined | Structured | Web servers | Yes, fixed positions | Low, a few extra fields |
| W3C Extended | Structured, self-describing | IIS | Yes, read #Fields first | High, admin selects fields |
The pattern: JSON and W3C buy flexibility at the cost of needing the schema, CLF and ELF buy simplicity at the cost of being fixed, CEF buys cross-vendor normalization, and Windows Event logs buy OS-level detail at the cost of being a native binary.
Syslog: the framing that carries the rest
Syslog is often called a format, but it is more accurately the transport and framing that many of these logs travel inside. CEF, for instance, is a body wrapped in a syslog line. The modern syslog message format is defined by RFC 5424 (2009), which obsoleted the older RFC 3164.
An RFC 5424 message is a header, optional structured data, and the message text. The header carries seven fields in order: PRI, VERSION, TIMESTAMP, HOSTNAME, APP-NAME, PROCID, and MSGID. The PRI value, written in angle brackets like <165>, encodes both the facility and the severity in one number: facility times eight plus severity. Facilities run 0 to 23 (kernel, mail, auth, local0 through local7), severities run 0 to 7 (emergency through debug). So <165> is facility 20, severity 5. The structured data element carries machine-parsable [SD-ID key="value"] pairs in UTF-8.
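The PRI arithmetic and the header layout above can be sketched as follows. The sample line is modeled on the style of the examples in RFC 5424 and is illustrative only; decode_pri, parse_header, and the regular expression are assumptions, and the structured data is left unparsed in the rest field.

```python
import re

SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]

def decode_pri(pri):
    """Split a PRI value into (facility, severity): pri = facility * 8 + severity."""
    if not 0 <= pri <= 191:  # 23 * 8 + 7 is the largest valid value
        raise ValueError(f"PRI out of range: {pri}")
    return divmod(pri, 8)

# PRI and VERSION are adjacent; the remaining header fields are space-separated.
HEADER = re.compile(
    r"<(?P<pri>\d{1,3})>(?P<version>\d{1,2}) (?P<timestamp>\S+) "
    r"(?P<hostname>\S+) (?P<app_name>\S+) (?P<procid>\S+) (?P<msgid>\S+) "
    r"(?P<rest>.*)", re.DOTALL
)

def parse_header(line):
    """Return the seven header fields plus the unparsed remainder, or None."""
    m = HEADER.match(line)
    if m is None:
        return None
    parsed = m.groupdict()
    parsed["facility"], parsed["severity"] = decode_pri(int(parsed["pri"]))
    return parsed

# Illustrative message; "-" is the nil value for PROCID and structured data.
sample = ("<165>1 2003-10-11T22:14:15.003Z mymachine.example.com "
          "evntslog - ID47 - An application event log entry")
parsed = parse_header(sample)
print(parsed["facility"], SEVERITIES[parsed["severity"]])  # 20 notice
```

Because facility and severity come straight from the PRI, a router can filter on them without touching the message body at all.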
Why a defender cares: when logs from network gear, firewalls, and Unix hosts all arrive over syslog, the PRI and the header are how you filter and route before you ever parse the body. A flood of high-severity messages from the auth facility is a signal on its own, before the message text is parsed at all.
Why formats break ingestion pipelines
Format is where log pipelines fail quietly, and the failures all look the same from the dashboard: data is flowing, so nothing alarms, but a field you need is gone.
Wrong parser, one big field. Point a JSON parser at syslog text, or a CLF parser at a W3C file, and the line ingests but lands as one unsplit message blob. Searches on src_ip return nothing because there is no src_ip field, only text that happens to contain an IP. The log is present and useless at the same time.
Position assumptions on a flexible format. W3C and JSON do not guarantee field order. A parser that hard-codes "status is the ninth column" breaks the moment IIS is reconfigured to log a different field set, which is exactly why the #Fields: line exists and must be read first.
Timestamp and timezone drift. Formats disagree on time. CLF uses [10/Oct/2000:13:55:36 -0700], JSON tends toward ISO 8601 (2026-06-20T02:14:07Z), RFC 5424 syslog uses an RFC 3339 timestamp, and legacy RFC 3164 syslog uses a form like Sep 19 08:26:10 that carries no year and no timezone. If the parser misreads the format or the timezone, every event lands at the wrong time and your incident timeline is wrong in a way that is hard to notice.
Encoding and escaping. An unescaped pipe inside a CEF header field, an unescaped equals sign inside a CEF extension value, or an unescaped quote in a JSON string can split a field early or break the record. Multi-line stack traces in unstructured application logs get chopped into separate events when the collector treats every newline as a record boundary.
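For the timestamp drift described above, one common approach is to convert every source's timestamp to UTC at parse time. The sketch below does that for the CLF and ISO 8601 forms quoted in this section, plus the legacy syslog form that carries no year or zone. The function names are assumptions, the year and zone for the legacy form must be supplied by the caller, and %b depends on the process locale using English month names.

```python
from datetime import datetime, timezone

def clf_to_utc(ts):
    """Convert a CLF timestamp such as 10/Oct/2000:13:55:36 -0700 to UTC."""
    return datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").astimezone(timezone.utc)

def iso_to_utc(ts):
    """Convert an ISO 8601 timestamp to UTC."""
    # Older Python versions reject a trailing "Z", so rewrite it as an offset.
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)

def rfc3164_to_utc(ts, year, tz):
    """Convert a legacy syslog timestamp such as Sep 19 08:26:10.

    The format has no year and no zone, so both are assumptions the
    caller must supply; guessing them wrong shifts every event.
    """
    local = datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
    return local.replace(tzinfo=tz).astimezone(timezone.utc)

clf_time = clf_to_utc("10/Oct/2000:13:55:36 -0700")
iso_time = iso_to_utc("2026-06-20T02:14:07Z")
legacy_time = rfc3164_to_utc("Sep 19 08:26:10", 2026, timezone.utc)
print(clf_time.isoformat())  # 2000-10-10T20:55:36+00:00
print(iso_time.isoformat())  # 2026-06-20T02:14:07+00:00
```

Forcing the caller to pass the year and zone for the legacy form keeps the assumption visible instead of burying it inside the parser.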
The reason this is a format problem and not just an ops problem: getting raw logs of mixed formats into one queryable, correlatable store is the core of log analysis, and that pipeline is only as reliable as its weakest parser. Pulling many formats into one place, centralized logging, is what makes correlation possible, but it only works if each source is parsed to the right schema first. The format you ignore at onboarding is the field you cannot search during the incident.
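One way to catch the quiet failures described in this section is to measure, per source, how often the fields you alert on are actually populated after parsing. The sketch below is a hypothetical health check: field_coverage, the field names, and the sample events are assumptions, not part of any particular SIEM.

```python
def field_coverage(events, required):
    """Return, per required field, the fraction of events that carry it."""
    total = len(events)
    counts = {name: 0 for name in required}
    for event in events:
        for name in required:
            # A missing key and an empty value are both unusable for search.
            if event.get(name) not in (None, ""):
                counts[name] += 1
    return {name: (counts[name] / total if total else 0.0)
            for name in required}

# Illustrative parsed output: the second event went through the wrong
# parser and landed as one unsplit message blob.
events = [
    {"src_ip": "10.0.0.1", "status": 403, "message": "blocked"},
    {"message": "CEF:0|Security|threatmanager|1.0|100|worm successfully "
                "stopped|10|src=10.0.0.1 dst=2.1.2.2"},
]
coverage = field_coverage(events, ["src_ip", "status"])
print(coverage)
```

A coverage figure that drops for one source while event volume stays flat is the signature of a format mismatch: data is flowing, but the field is gone.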
Frequently Asked Questions
What is a log file format?
A log file format is the convention that defines how a system writes each log event: which fields appear, in what order, what separates them, and how they are encoded. The same event written in two different formats produces different bytes. Format matters because parsers must know the structure in advance to split a log line back into searchable fields.
What is the difference between structured and unstructured logs?
Structured logs follow a fixed, consistent pattern with fields in a known order separated by a known delimiter, so a machine parses them cheaply. Unstructured logs are free text with no enforced pattern, readable by humans but requiring custom parsing (usually regex) for machines. Semi-structured logs like JSON sit between: they carry a labeled schema but still need parsing, trading a little cost for flexibility.
What are the most common log file formats?
The most common are JSON (apps and cloud), Windows Event logs (Windows OS and security), CEF (security devices feeding a SIEM), CLF and its Extended/Combined form ELF (web servers), and W3C Extended Log Format (Microsoft IIS). Many of these, CEF in particular, travel inside syslog framing when shipped between hosts.
What is the CEF log format?
CEF (Common Event Format) is an open, text-based format created by ArcSight to carry security events from many devices into a SIEM in one shape. A CEF message has a pipe-delimited header of seven fields (version, device vendor, product, product version, event class ID, name, and severity) followed by a key-value extension of space-separated pairs. Its purpose is to normalize logs at the source.
What is the syslog message format?
Syslog is the framing and transport many logs travel inside. The current message format is defined by RFC 5424 (2009), which obsoleted RFC 3164. A message has a header (PRI, version, timestamp, hostname, app-name, procid, msgid), optional structured data, and the message text. The PRI value encodes facility and severity as facility times eight plus severity.
Why do log file formats matter for security?
Because detection depends on parsing logs into the right fields. If a parser does not match the format, the log ingests but the fields you alert and pivot on (source IP, status code, username) are buried in one unsplit blob. A log that parses wrong looks healthy on a dashboard while being unsearchable, so knowing the format of each source is how you keep telemetry usable before an incident.
What is the W3C Extended Log Format?
The W3C Extended Log Format is a customizable, self-describing format that is the default for Microsoft IIS. The administrator selects which fields to log, and #Fields: directives at the top of the file declare which fields appear and in what order. Field names use prefixes (s-, c-, cs-, sc-) to mark server, client, client-to-server, and server-to-client direction, so a parser must read the #Fields: line before assuming any column.
The bottom line
A log file format is the structure a system uses to write each event, and it decides how cheaply a machine can read that event back. The split that organizes them all is structured versus semi-structured versus unstructured: the more pattern a format enforces, the less custom parsing you write. The six you will meet are JSON (flexible, semi-structured, everywhere), Windows Event logs (detailed OS binary), CEF (cross-vendor normalization for a SIEM), CLF and ELF (fixed web logs), and W3C (self-describing, IIS), with syslog as the framing that carries many of them between hosts.
For a defender the formats are not trivia. The format determines whether a source IP is a field you can pivot on or a substring lost in a blob. Pipelines fail quietly at the format layer, ingesting fine while the field you need goes missing, so the work is to know each source's format and confirm it parses to the right schema before the incident, not during it.