Streaming Data Pipeline
A Streaming Data Pipeline is a continuous, low-latency data flow that processes events as they arrive, rather than processing batches of accumulated data at scheduled intervals. The defining stack: an event broker (Apache Kafka, AWS Kinesis, Apache Pulsar, Confluent Cloud, Redpanda) for ingestion and durable buffering; a stream processor (Apache Flink, Kafka Streams, Spark Structured Streaming, Materialize) for stateful computation; and downstream sinks (warehouse, lake, search index, online feature store, downstream microservice). Streaming pipelines enable use cases impossible with batch: fraud blocking in under 200ms, real-time recommendations updated as users browse, operational dashboards reflecting current state, and search indices kept fresh. They cost 5-15x more than equivalent batch pipelines in operational complexity (24/7 on-call, replication slots, exactly-once semantics, schema evolution, dead-letter queues); the question is always whether the use case justifies the premium.
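The broker → stateful processor → sink shape can be sketched in a few lines. This is a toy, in-memory illustration of the architecture (a deque standing in for the broker, a dict for processor state, a list for the sink); all names are illustrative, and a real pipeline would use Kafka, Flink, and a warehouse instead.

```python
from collections import deque

broker = deque()        # stands in for a durable event log (e.g. a Kafka topic)
running_totals = {}     # processor state: per-user running spend
sink = []               # stands in for a warehouse / online feature store

def produce(event: dict) -> None:
    """Append an event to the 'broker' (durable buffering in a real system)."""
    broker.append(event)

def process_available() -> None:
    """Drain the broker, updating state and emitting enriched records downstream."""
    while broker:
        event = broker.popleft()
        user = event["user"]
        running_totals[user] = running_totals.get(user, 0) + event["amount"]
        sink.append({**event, "total_spend": running_totals[user]})

produce({"user": "a", "amount": 40})
produce({"user": "a", "amount": 60})
process_available()
print(sink[-1]["total_spend"])  # 100
```

The point of the sketch is the division of labor: the broker only buffers, the processor owns state, and the sink receives already-enriched records.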
The Trap
The trap is building a streaming-first data platform when 90% of analytical use cases are dashboards reviewed daily. The expensive failure mode: companies adopt Kafka + Flink as the universal data-movement pattern, then spend years maintaining streaming infrastructure to power dashboards that batch could have served at 5% of the cost. The streaming complexity tax compounds: 24/7 on-call, exactly-once edge cases, schema-evolution incidents, partition rebalancing, lag monitoring, replication-slot management. Without a dedicated platform team, streaming pipelines become the source of weekly incidents that batch pipelines never produce. KnowMBA POV: streaming is justified by the consuming USE CASE, not the source data shape. A Postgres CDC stream feeding a once-daily warehouse refresh is just expensive batch; the streaming infrastructure adds cost without value. The honest test: name the sub-minute decision that depends on this pipeline. If you can't, you don't need streaming for it.
What to Do
Adopt streaming pipelines selectively. Categorize use cases by required freshness:
1. Sub-minute decisions (fraud, personalization, real-time inventory, search indexing): streaming required.
2. Minute-to-15-minute freshness: micro-batch via CDC or scheduled Flink jobs.
3. Hourly-to-daily: batch via dbt + Fivetran.
Use streaming for tier 1 only. Choose the streaming stack based on team capacity and workload: Confluent Cloud or AWS MSK for managed Kafka with low ops burden; self-hosted Apache Kafka only with a dedicated platform team; Apache Flink for complex stateful stream processing; Materialize for SQL-defined streaming where latency matters more than scale; AWS Kinesis if you're AWS-native and want fewer moving parts. Sequence implementation: ship the FIRST streaming use case end-to-end on the simplest stack possible. Don't build a 'general-purpose streaming platform' before proving the pattern in a single use case. Generalize after 3+ proven use cases, never before.
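The three tiers above can be made mechanical. A minimal sketch of the triage rule, with the thresholds taken from the tiers above (tune them to your org; the function name and cutoffs are assumptions, not a standard):

```python
def freshness_tier(max_staleness_seconds: float) -> str:
    """Map a use case's required freshness to a pipeline tier."""
    if max_staleness_seconds < 60:
        return "streaming"    # tier 1: sub-minute decisions (fraud, search)
    if max_staleness_seconds <= 15 * 60:
        return "micro-batch"  # tier 2: CDC or scheduled jobs
    return "batch"            # tier 3: dbt + scheduled warehouse loads

print(freshness_tier(0.2))    # streaming   (fraud blocking)
print(freshness_tier(300))    # micro-batch (ops metrics)
print(freshness_tier(86400))  # batch       (daily dashboards)
```

Running every proposed pipeline through a rule like this is a cheap way to enforce the "tier 1 only" discipline in design reviews.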
Formula
In Practice
Apache Kafka (originally LinkedIn, 2011) defined the modern event streaming category. Confluent (founded 2014 by Kafka's creators) commercialized it; their published case studies span Walmart, Target, Capital One, Wise, and many financial services and retail customers. Apache Flink (originally TU Berlin / Data Artisans, 2014) is the dominant open-source stream processor for stateful computation; commercial backers include Ververica (acquired by Alibaba) and AWS (managed Flink as part of Kinesis Data Analytics). AWS Kinesis is the AWS-native streaming alternative, popular for AWS-only shops. Recurring case studies: Wise (formerly TransferWise) runs CDC from MySQL into Kafka for cross-border payment processing; Netflix uses Kafka for trillions of events per day; Uber uses Kafka + Flink for surge pricing computation; Pinterest uses Kafka for real-time feature ingestion. The pattern: streaming wins when the consuming use case has sub-minute decision latency, loses to batch when it doesn't.
Pro Tips
- 01
Replication lag is your #1 operational metric. A streaming pipeline with growing lag is a failing pipeline: within hours it is hours stale, defeating the purpose. Alert at 30 seconds of lag; page at 5 minutes for any pipeline marketed as real-time.
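The alerting policy in this tip can be expressed as a simple threshold function. A minimal sketch, assuming you already export end-to-end lag in seconds from your monitoring system (the function name and defaults are illustrative):

```python
def lag_action(lag_seconds: float,
               warn_at: float = 30.0,
               page_at: float = 300.0) -> str:
    """Return the alerting action for a pipeline's current end-to-end lag.
    Defaults follow the tip: warn at 30 seconds, page at 5 minutes."""
    if lag_seconds >= page_at:
        return "page"
    if lag_seconds >= warn_at:
        return "warn"
    return "ok"

print(lag_action(5))    # ok
print(lag_action(45))   # warn
print(lag_action(600))  # page
```

In practice you would also alert on the lag trend (derivative), since steadily growing lag predicts failure before the absolute threshold is crossed.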
- 02
Schema evolution is the second-hardest streaming problem (after exactly-once semantics). Use a schema registry (Confluent Schema Registry, AWS Glue Schema Registry, Apicurio) and enforce backward-compatible changes only. An upstream column rename can cascade into 12 broken downstream consumers within minutes.
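The backward-compatibility rule a registry enforces can be sketched in miniature. This is a deliberately simplified check, loosely modeled on Avro's rules (the new schema must still read old data, so any added field needs a default); real registries like Confluent Schema Registry check far more, and the dict-based schema shape here is an assumption for illustration:

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Each dict maps field name -> {"type": ..., optional "default": ...}.
    A field added in the new schema without a default breaks reads of old data."""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True

old = {"user_id": {"type": "string"}}
ok_change = {"user_id": {"type": "string"},
             "country": {"type": "string", "default": "unknown"}}
bad_change = {"user_id": {"type": "string"},
              "country": {"type": "string"}}  # new field, no default

print(is_backward_compatible(old, ok_change))   # True
print(is_backward_compatible(old, bad_change))  # False
```

Note that a column rename is a removal plus an addition, so under this rule a rename without a default is incompatible, which is exactly how the cascade in the tip above starts.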
- 03
Exactly-once semantics are easier to claim than to deliver. Most streaming pipelines provide at-least-once delivery with deduplication on the consumer side. For financial use cases where double-counting matters, build idempotency into your consumers explicitly; don't trust the streaming framework's exactly-once promises without testing them under failure conditions.
Myth vs Reality
Myth
"Streaming is the modern default; batch is legacy"
Reality
Batch is the right answer for the vast majority of analytical use cases. Daily and hourly dashboards don't benefit from sub-second freshness. Streaming infrastructure costs 5-15x more in operational overhead than equivalent batch pipelines, and most companies underestimate the on-call burden until it hits engineering morale. Use streaming where it matters; use batch where it's good enough.
Myth
"Streaming pipelines eliminate the need for batch transformations"
Reality
Streaming handles ingestion and stateful processing for latency-critical paths. You still need transformations (joins, aggregations, business logic) on the destination side, and most are still better expressed as batch dbt models running every 5-15 minutes than as continuous stream processing. Stream processing is hard to debug, hard to backfill, and overkill for most aggregation logic. The dominant modern pattern is selective streaming + micro-batch dbt, not full end-to-end streaming.
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
Your team is debating whether to build a Kafka + Flink streaming platform as the foundation for the data team. Most use cases are dashboards reviewed daily, with one fraud-detection model needing sub-minute features. What's the right architecture?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Streaming Pipeline Operational Burden vs Batch
Operational burden multipliers across pipeline types
Batch (dbt + warehouse)
1x ops burden baseline
Managed Streaming (Confluent Cloud, Kinesis)
3-5x ops burden
Self-Hosted Kafka + Flink
8-15x ops burden
Custom-Built Streaming Platform
20x+ ops burden
Source: https://www.confluent.io/blog/cloud-event-streaming-deployment-options/
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Apache Kafka / Confluent
2011-present
Apache Kafka was originally built at LinkedIn (open-sourced 2011) and commercialized by Confluent (founded 2014 by Kafka's creators). Today it processes trillions of events per day across global enterprises: Walmart, Target, Capital One, Wise, Netflix, Uber, and many others. Confluent's published case studies emphasize the use cases where streaming wins decisively: fraud detection, real-time inventory, customer personalization, IoT telemetry, and operational dashboards. The same case studies emphasize the operational maturity required: Kafka at scale demands a dedicated platform team and disciplined SRE practices.
Daily Events at Scale
Trillions across enterprises
Notable Customers
Walmart, Target, Capital One, Wise, Netflix
Confluent Revenue (2024)
~$1B annual run-rate
Operational Reality
Requires dedicated platform team
Kafka is the dominant streaming foundation for use cases that genuinely need it. The operational burden is real and requires platform team investment.
Apache Flink
2014-present
Apache Flink is the dominant open-source stream processor for stateful computation: sessionization, windowed aggregations, complex event processing, exactly-once semantics. Commercial backing comes from Ververica (acquired by Alibaba in 2019), AWS (managed Flink in Kinesis Data Analytics), and Confluent (Flink-on-Confluent-Cloud GA 2024). Production users include Uber (Marmaray, surge pricing), Pinterest, Lyft, Netflix, Alibaba (massive scale), and many financial services firms. The operational burden of Flink at scale is significant; it requires deep expertise in checkpointing, state backends, watermarks, and recovery patterns.
Era
2014+, ongoing
Commercial Backers
Ververica/Alibaba, AWS, Confluent
Notable Production Users
Uber, Pinterest, Lyft, Netflix, Alibaba
Operational Skill Required
Deep expertise: checkpointing, state, watermarks
Flink is the right answer for stateful complex stream processing at scale, but requires deep operational expertise. Most companies don't need this depth and should use simpler alternatives.
AWS Kinesis
2013-present
AWS Kinesis (Streams, Firehose, Data Analytics) is the AWS-native streaming alternative to Kafka. Popular for AWS-only shops that want fewer moving parts and lower operational burden: Kinesis is fully managed, integrates natively with other AWS services, and has predictable per-shard pricing. The trade-off vs Kafka: less throughput per dollar at very high scale, less ecosystem flexibility, AWS lock-in. Public customers include Lyft, Netflix (for some workloads), Hulu, and many AWS-native SaaS companies. For mid-scale streaming on AWS, Kinesis is often the right choice over self-hosted Kafka.
Differentiator
Fully managed, AWS-native
Lower Ops Burden
vs self-hosted Kafka
Trade-Off
Less throughput per dollar at extreme scale
Best Fit
AWS-native shops at mid-scale
Managed streaming services trade flexibility for operational simplicity. For most companies below hyperscale on AWS, Kinesis or Confluent Cloud beats self-hosted Kafka on TCO.
Decision scenario
The 'Streaming-First Platform' Pitch
You're CTO at a Series C SaaS company at $30M ARR. Your new VP Data Engineering wants to build a 'streaming-first data platform' on Kafka + Flink + ClickHouse, replacing the current Fivetran + Snowflake + dbt stack. Budget request: $1.8M one-time + $700K/year ongoing. Current real-time use cases: zero. Future hypothetical use cases: 'we'll need real-time eventually'.
Current Real-Time Use Cases
0
Current Stack Cost
$300K/year (Fivetran + Snowflake + dbt)
Proposed Streaming Stack Cost
$1.8M one-time + $700K/year
VP's Justification
'Future-proofing'
Engineering Team Size
12 (no dedicated SRE for streaming)
Decision 1
The VP wants approval this quarter. The CEO is impressed by the streaming-first vision. You know that streaming-first without real use cases is the canonical overengineering trap, but pushing back will be politically difficult.
Approve the streaming-first platform: the VP is energetic and the future-proofing argument is plausible
Reject the streaming-first platform. Counter-propose: identify the first genuine real-time use case (with measurable business value) and build streaming for that one use case using AWS Kinesis or Confluent Cloud (managed). If 3+ real-time use cases emerge over the next 12 months, revisit a broader platform. Otherwise, stay on Fivetran + Snowflake + dbt. ✓ Optimal
Related concepts
Keep connecting.
The concepts that orbit this one: each one sharpens the others.
Beyond the concept
Turn Streaming Data Pipeline into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
Typical response time: 24h · No retainer required