Batch vs Streaming Architecture
Batch processing collects data over a window (an hour, a day) and processes it in scheduled runs — high throughput, cheap, simple. Stream processing handles each event as it arrives — low latency, expensive, complex. Modern data stacks usually combine both: batch for analytics, finance, and ML training; streaming for fraud detection, alerting, and personalization. Apache Kafka is the dominant streaming substrate; Apache Flink, Spark Streaming, and ksqlDB are the leading processors. The architecture decision is not 'which is better' — it is 'which problems genuinely need streaming, and which are batch problems people are dressing up as streaming because it sounds modern.'
The Trap
The trap is streaming-by-default in 2026 data stacks. Teams reach for Kafka + Flink because it's the prestigious architecture, then spend years debugging exactly-once semantics, watermarks, and late-arriving events, and carrying the operational on-call burden, for use cases where a cron job running dbt every 30 minutes would have been adequate. Streaming pipelines are roughly 5x more expensive to operate than batch (more infrastructure, more on-call, scarcer skills on the team). The business value rarely justifies it. The opposite trap also exists: using batch for genuinely time-sensitive use cases like fraud detection or operational alerting, then wondering why the business is unhappy with delayed signals.
What to Do
Run the latency-vs-cost decision: write down the business consumer of each pipeline and the maximum acceptable end-to-end delay. If the consumer is a daily dashboard, that's 24 hours. If it's an analyst running ad-hoc queries, that's hours. If it's a fraud alert, that's seconds. Map each pipeline to the cheapest tier that meets its SLA. Default to batch unless you can articulate why seconds matter. For genuine streaming needs, isolate them — don't streamify the entire stack.
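The tiering exercise above can be sketched as a small lookup. The thresholds, tier names, and pipeline examples below are illustrative assumptions, not industry standards:

```python
# Sketch: map each pipeline's latency SLA to the cheapest processing tier.
# Thresholds and tier names are illustrative assumptions.

def cheapest_tier(max_delay_seconds: float) -> str:
    """Return the cheapest tier that still meets an end-to-end delay SLA."""
    if max_delay_seconds >= 24 * 3600:   # daily dashboard
        return "daily batch"
    if max_delay_seconds >= 3600:        # analyst ad-hoc queries
        return "hourly batch"
    if max_delay_seconds >= 5 * 60:      # most 'real-time' business asks
        return "micro-batch (every 5 min)"
    return "streaming"                   # genuine seconds-matter cases only

# Hypothetical pipeline inventory: consumer -> max acceptable delay (seconds)
pipelines = {
    "daily revenue dashboard": 24 * 3600,
    "analyst ad-hoc marts": 2 * 3600,
    "campaign attribution": 15 * 60,
    "fraud alerts": 5,
}

for name, sla in pipelines.items():
    print(f"{name}: {cheapest_tier(sla)}")
```

Writing the inventory down this explicitly is the point: 'streaming' should appear only on the rows where you can defend why seconds matter.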
In Practice
Apache Kafka was open-sourced from LinkedIn in 2011 to solve the 'every team builds a custom queue' problem. By 2023 it powered the streaming substrate at Netflix, Uber, Airbnb, Pinterest, and most Fortune 500 data stacks. But Confluent's own customer surveys consistently show that the majority of Kafka topics power what are effectively batch use cases — events flowing through Kafka but consumed by hourly batch jobs into a warehouse. The lesson: Kafka as transport is broadly useful; full-streaming compute is narrow.
Pro Tips
1. The 'micro-batch' middle ground (Spark Structured Streaming, dbt every 5 minutes) gives you near-real-time freshness at near-batch cost. For most 'real-time' business asks, micro-batch is the right answer.
2. Confluent's Jay Kreps (Kafka co-creator) has publicly written that 'streaming is not a replacement for batch' — they coexist. Read 'Questioning the Lambda Architecture' for the canonical thinking.
3. Streaming on-call is materially more painful than batch on-call. Consider that cost when evaluating: a streaming pipeline that pages your team three times a quarter consumes engineering capacity invisibly.
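The micro-batch pattern from the first tip can be sketched in a few lines: buffer events as they arrive, flush them to the sink on a timer. The `MicroBatcher` class and its sink are hypothetical illustrations, not a real framework API:

```python
# Minimal sketch of micro-batching: events are buffered and loaded in
# small scheduled batches, not processed one at a time on arrival.
import time

class MicroBatcher:
    def __init__(self, flush_interval_s: float, sink):
        self.flush_interval_s = flush_interval_s
        self.sink = sink                  # callable that bulk-loads one batch
        self.buffer = []
        self.last_flush = time.monotonic()

    def ingest(self, event):
        self.buffer.append(event)
        if time.monotonic() - self.last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)        # one cheap bulk load per interval
            self.buffer = []
        self.last_flush = time.monotonic()

# Demo: with a long interval, three events land in a single batch.
batches = []
b = MicroBatcher(flush_interval_s=300, sink=batches.append)
for event in range(3):
    b.ingest(event)
b.flush()  # end-of-window flush
```

In a real stack the flush timer's role is played by Spark Structured Streaming's processing-time trigger or a scheduled dbt run; the point is that downstream systems see small batches, not individual events.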
Myth vs Reality
Myth
“Streaming is the modern way; batch is legacy”
Reality
Batch underpins most analytics, ML training, and finance reporting at every modern data company. Snowflake, BigQuery, and Databricks all primarily run batch workloads. Streaming is a specialized tool for use cases that genuinely need it — not a general replacement for batch.
Myth
“Streaming is faster than batch for everything”
Reality
Streaming is lower-latency per event but often lower-throughput per dollar than batch for the same total volume. Batch can use cheaper spot compute, larger parallelism, and fewer correctness guarantees. For weekly reports, batch finishes faster and cheaper.
Knowledge Check
Marketing wants 'real-time' attribution data so campaign managers can pause underperforming ads. Currently the data lands in the warehouse every 4 hours via batch. They claim they need streaming. What should you ask?
Industry benchmarks
Calibrate against real-world tiers. Use these ranges as targets — not absolutes.
Cost Multiplier (Streaming vs Batch, same workload)
Includes infrastructure + on-call + engineering overhead.
Best Case (well-tuned): 2-3x
Typical: 4-6x
Common: 6-10x
Poorly Designed: >10x
Source: Hypothetical synthesis of Confluent, Databricks, and Snowflake customer reports
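A back-of-envelope check, applying the multiplier tiers above to a batch baseline. The dollar figure is made up for illustration:

```python
# Apply the benchmark multiplier tiers to a hypothetical batch baseline
# to see the all-in cost range of running the same workload as streaming.

batch_monthly_cost = 10_000  # hypothetical all-in monthly cost of the batch version

tiers = {
    "best case (well-tuned)": (2, 3),
    "typical": (4, 6),
    "common": (6, 10),
}

for tier, (lo, hi) in tiers.items():
    print(f"{tier}: ${batch_monthly_cost * lo:,}-${batch_monthly_cost * hi:,}/month")
```

Even the well-tuned best case doubles the bill, which is the number to put next to the business value of lower latency.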
Real-world cases
LinkedIn (Apache Kafka origin)
2010-2011
LinkedIn built Kafka because every internal team was building one-off pipelines for activity events, metrics, and logs. The original Kafka design was a unified streaming substrate that any system could publish to and any system could consume from. But LinkedIn explicitly designed Kafka to support BOTH streaming consumers (real-time alerting) AND batch consumers (Hadoop pulled from Kafka in chunks every hour). The 'log as a unified substrate' insight, not 'streaming everywhere,' is what made Kafka transformative.
Original Use Case
Activity stream + metrics
Architecture
Streaming substrate, mixed consumers
Year Open-Sourced
2011
Even Kafka — the canonical streaming technology — was designed to serve batch consumers as a first-class use case. The architecture that won was 'streaming transport, mixed compute,' not 'streaming everything.'