Streaming Data Pipeline
A Streaming Data Pipeline is a continuous, low-latency data flow that processes events as they arrive, rather than processing batches of accumulated data at scheduled intervals. The defining stack: an event broker (Apache Kafka, AWS Kinesis, Apache Pulsar, Confluent Cloud, Redpanda) for ingestion and durable buffering; a stream processor (Apache Flink, Kafka Streams, Spark Structured Streaming, Materialize) for stateful computation; and downstream sinks (warehouse, lake, search index, online feature store, downstream microservice). Streaming pipelines enable use cases impossible with batch: fraud blocking in under 200ms, real-time recommendations updated as users browse, operational dashboards reflecting current state, and search indices kept fresh. They cost 5-15x more than equivalent batch pipelines in operational complexity (24/7 on-call, replication slots, exactly-once semantics, schema evolution, dead-letter queues); the question is always whether the use case justifies the premium.
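The broker → stateful processor → sink shape can be sketched in a few lines. This is a toy, in-memory illustration of the architecture (a deque standing in for the broker, a dict for processor state, a list for the sink); all names are illustrative, and a real pipeline would use Kafka, Flink, and a warehouse instead.

```python
from collections import deque

broker = deque()        # stands in for a durable event log (e.g. a Kafka topic)
running_totals = {}     # processor state: per-user running spend
sink = []               # stands in for a warehouse / online feature store

def produce(event: dict) -> None:
    """Append an event to the 'broker' (durable buffering in a real system)."""
    broker.append(event)

def process_available() -> None:
    """Drain the broker, updating state and emitting enriched records downstream."""
    while broker:
        event = broker.popleft()
        user = event["user"]
        running_totals[user] = running_totals.get(user, 0) + event["amount"]
        sink.append({**event, "total_spend": running_totals[user]})

produce({"user": "a", "amount": 40})
produce({"user": "a", "amount": 60})
process_available()
print(sink[-1]["total_spend"])  # 100
```

The point of the sketch is the division of labor: the broker only buffers, the processor owns state, and the sink receives already-enriched records.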
The Trap
The trap is building a streaming-first data platform when 90% of analytical use cases are dashboards reviewed daily. The expensive failure mode: companies adopt Kafka + Flink as the universal data-movement pattern, then spend years maintaining streaming infrastructure to power dashboards that batch could have served at 5% of the cost. The streaming complexity tax compounds: 24/7 on-call, exactly-once edge cases, schema-evolution incidents, partition rebalancing, lag monitoring, replication-slot management. Without a dedicated platform team, streaming pipelines become the source of weekly incidents that batch pipelines never produce. KnowMBA POV: streaming is justified by the consuming USE CASE, not the source data shape. A Postgres CDC stream feeding a once-daily warehouse refresh is just expensive batch; the streaming infrastructure adds cost without value. The honest test: name the sub-minute decision that depends on this pipeline. If you can't, you don't need streaming for it.
What to Do
Adopt streaming pipelines selectively. Categorize use cases by required freshness:
1. Sub-minute decisions (fraud, personalization, real-time inventory, search indexing): streaming required.
2. Minute-to-15-minute freshness: micro-batch via CDC or scheduled Flink jobs.
3. Hourly-to-daily: batch via dbt + Fivetran.
Use streaming for tier 1 only. Choose the streaming stack based on team capacity and workload: Confluent Cloud or AWS MSK for managed Kafka with low ops burden; self-hosted Apache Kafka only with a dedicated platform team; Apache Flink for complex stateful stream processing; Materialize for SQL-defined streaming where latency matters more than scale; AWS Kinesis if you're AWS-native and want fewer moving parts. Sequence implementation: ship the FIRST streaming use case end-to-end on the simplest stack possible. Don't build a 'general-purpose streaming platform' before proving the pattern in a single use case. Generalize after 3+ proven use cases, never before.
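The three tiers above can be made mechanical. A minimal sketch of the triage rule, with the thresholds taken from the tiers above (tune them to your org; the function name and cutoffs are assumptions, not a standard):

```python
def freshness_tier(max_staleness_seconds: float) -> str:
    """Map a use case's required freshness to a pipeline tier."""
    if max_staleness_seconds < 60:
        return "streaming"    # tier 1: sub-minute decisions (fraud, search)
    if max_staleness_seconds <= 15 * 60:
        return "micro-batch"  # tier 2: CDC or scheduled jobs
    return "batch"            # tier 3: dbt + scheduled warehouse loads

print(freshness_tier(0.2))    # streaming   (fraud blocking)
print(freshness_tier(300))    # micro-batch (ops metrics)
print(freshness_tier(86400))  # batch       (daily dashboards)
```

Running every proposed pipeline through a rule like this is a cheap way to enforce the "tier 1 only" discipline in design reviews.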
Formula
In Practice
Apache Kafka (originally LinkedIn, 2011) defined the modern event streaming category. Confluent (founded 2014 by Kafka's creators) commercialized it; their published case studies span Walmart, Target, Capital One, Wise, and many financial services and retail customers. Apache Flink (originally TU Berlin / Data Artisans, 2014) is the dominant open-source stream processor for stateful computation; commercial backers include Ververica (acquired by Alibaba) and AWS (managed Flink as part of Kinesis Data Analytics). AWS Kinesis is the AWS-native streaming alternative, popular for AWS-only shops. Recurring case studies: Wise (formerly TransferWise) runs CDC from MySQL into Kafka for cross-border payment processing; Netflix uses Kafka for trillions of events per day; Uber uses Kafka + Flink for surge pricing computation; Pinterest uses Kafka for real-time feature ingestion. The pattern: streaming wins when the consuming use case has sub-minute decision latency, loses to batch when it doesn't.
Pro Tips
- 01
Replication lag is your #1 operational metric. A streaming pipeline with growing lag is a failing pipeline: within hours it is hours stale, defeating the purpose. Alert at 30 seconds of lag; page at 5 minutes for any pipeline marketed as real-time.
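The alerting policy in this tip can be expressed as a simple threshold function. A minimal sketch, assuming you already export end-to-end lag in seconds from your monitoring system (the function name and defaults are illustrative):

```python
def lag_action(lag_seconds: float,
               warn_at: float = 30.0,
               page_at: float = 300.0) -> str:
    """Return the alerting action for a pipeline's current end-to-end lag.
    Defaults follow the tip: warn at 30 seconds, page at 5 minutes."""
    if lag_seconds >= page_at:
        return "page"
    if lag_seconds >= warn_at:
        return "warn"
    return "ok"

print(lag_action(5))    # ok
print(lag_action(45))   # warn
print(lag_action(600))  # page
```

In practice you would also alert on the lag trend (derivative), since steadily growing lag predicts failure before the absolute threshold is crossed.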
- 02
Schema evolution is the second-hardest streaming problem (after exactly-once semantics). Use a schema registry (Confluent Schema Registry, AWS Glue Schema Registry, Apicurio) and enforce backward-compatible changes only. An upstream column rename can cascade into 12 broken downstream consumers within minutes.
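The backward-compatibility rule a registry enforces can be sketched in miniature. This is a deliberately simplified check, loosely modeled on Avro's rules (the new schema must still read old data, so any added field needs a default); real registries like Confluent Schema Registry check far more, and the dict-based schema shape here is an assumption for illustration:

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Each dict maps field name -> {"type": ..., optional "default": ...}.
    A field added in the new schema without a default breaks reads of old data."""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True

old = {"user_id": {"type": "string"}}
ok_change = {"user_id": {"type": "string"},
             "country": {"type": "string", "default": "unknown"}}
bad_change = {"user_id": {"type": "string"},
              "country": {"type": "string"}}  # new field, no default

print(is_backward_compatible(old, ok_change))   # True
print(is_backward_compatible(old, bad_change))  # False
```

Note that a column rename is a removal plus an addition, so under this rule a rename without a default is incompatible, which is exactly how the cascade in the tip above starts.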
- 03
Exactly-once semantics are easier to claim than to deliver. Most streaming pipelines provide at-least-once delivery with deduplication on the consumer side. For financial use cases where double-counting matters, build idempotency into your consumers explicitly; don't trust the streaming framework's exactly-once promises without testing them under failure conditions.
Myth vs Reality
Myth
"Streaming is the modern default; batch is legacy"
Reality
Batch is the right answer for the vast majority of analytical use cases. Daily and hourly dashboards don't benefit from sub-second freshness. Streaming infrastructure costs 5-15x more in operational overhead than equivalent batch pipelines, and most companies underestimate the on-call burden until it hits engineering morale. Use streaming where it matters; use batch where it's good enough.
Myth
"Streaming pipelines eliminate the need for batch transformations"
Reality
Streaming handles ingestion and stateful processing for latency-critical paths. You still need transformations (joins, aggregations, business logic) on the destination side, and most are still better expressed as batch dbt models running every 5-15 minutes than as continuous stream processing. Stream processing is hard to debug, hard to backfill, and overkill for most aggregation logic. The dominant modern pattern is selective streaming + micro-batch dbt, not full end-to-end streaming.
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
Your team is debating whether to build a Kafka + Flink streaming platform as the foundation for the data team. Most use cases are dashboards reviewed daily, with one fraud-detection model needing sub-minute features. What's the right architecture?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Streaming Pipeline Operational Burden vs Batch
Operational burden multipliers across pipeline types
Batch (dbt + warehouse)
1x ops burden baseline
Managed Streaming (Confluent Cloud, Kinesis)
3-5x ops burden
Self-Hosted Kafka + Flink
8-15x ops burden
Custom-Built Streaming Platform
20x+ ops burden
Source: https://www.confluent.io/blog/cloud-event-streaming-deployment-options/
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Apache Kafka / Confluent
2011-present
Apache Kafka was originally built at LinkedIn (open-sourced 2011) and commercialized by Confluent (founded 2014 by Kafka's creators). Today it processes trillions of events per day across global enterprises: Walmart, Target, Capital One, Wise, Netflix, Uber, and many others. Confluent's published case studies emphasize the use cases where streaming wins decisively: fraud detection, real-time inventory, customer personalization, IoT telemetry, and operational dashboards. The same case studies emphasize the operational maturity required: Kafka at scale demands a dedicated platform team and disciplined SRE practices.
Daily Events at Scale
Trillions across enterprises
Notable Customers
Walmart, Target, Capital One, Wise, Netflix
Confluent Revenue (2024)
~$1B annual run-rate
Operational Reality
Requires dedicated platform team
Kafka is the dominant streaming foundation for use cases that genuinely need it. The operational burden is real and requires platform team investment.
Apache Flink
2014-present
Apache Flink is the dominant open-source stream processor for stateful computation: sessionization, windowed aggregations, complex event processing, exactly-once semantics. Commercial backing comes from Ververica (acquired by Alibaba in 2019), AWS (managed Flink in Kinesis Data Analytics), and Confluent (Flink-on-Confluent-Cloud GA 2024). Production users include Uber (Marmaray, surge pricing), Pinterest, Lyft, Netflix, Alibaba (massive scale), and many financial services firms. The operational burden of Flink at scale is significant; it requires deep expertise in checkpointing, state backends, watermarks, and recovery patterns.
Era
2014+, ongoing
Commercial Backers
Ververica/Alibaba, AWS, Confluent
Notable Production Users
Uber, Pinterest, Lyft, Netflix, Alibaba
Operational Skill Required
Deep expertise: checkpointing, state, watermarks
Flink is the right answer for stateful complex stream processing at scale, but requires deep operational expertise. Most companies don't need this depth and should use simpler alternatives.
AWS Kinesis
2013-present
AWS Kinesis (Streams, Firehose, Data Analytics) is the AWS-native streaming alternative to Kafka. Popular for AWS-only shops that want fewer moving parts and lower operational burden: Kinesis is fully managed, integrates natively with other AWS services, and has predictable per-shard pricing. The trade-off vs Kafka: less throughput per dollar at very high scale, less ecosystem flexibility, AWS lock-in. Public customers include Lyft, Netflix (for some workloads), Hulu, and many AWS-native SaaS companies. For mid-scale streaming on AWS, Kinesis is often the right choice over self-hosted Kafka.
Differentiator
Fully managed, AWS-native
Lower Ops Burden
vs self-hosted Kafka
Trade-Off
Less throughput per dollar at extreme scale
Best Fit
AWS-native shops at mid-scale
Managed streaming services trade flexibility for operational simplicity. For most companies below hyperscale on AWS, Kinesis or Confluent Cloud beats self-hosted Kafka on TCO.
Decision scenario
The 'Streaming-First Platform' Pitch
You're CTO at a Series C SaaS company at $30M ARR. Your new VP Data Engineering wants to build a 'streaming-first data platform' on Kafka + Flink + ClickHouse, replacing the current Fivetran + Snowflake + dbt stack. Budget request: $1.8M one-time + $700K/year ongoing. Current real-time use cases: zero. Future hypothetical use cases: 'we'll need real-time eventually'.
Current Real-Time Use Cases
0
Current Stack Cost
$300K/year (Fivetran + Snowflake + dbt)
Proposed Streaming Stack Cost
$1.8M one-time + $700K/year
VP's Justification
'Future-proofing'
Engineering Team Size
12 (no dedicated SRE for streaming)
Decision 1
The VP wants approval this quarter. The CEO is impressed by the streaming-first vision. You know that streaming-first without real use cases is the canonical overengineering trap, but pushing back will be politically difficult.
Approve the streaming-first platform: the VP is energetic and the future-proofing argument is plausible
Reject the streaming-first platform. Counter-propose: identify the first genuine real-time use case (with measurable business value) and build streaming for that one use case using AWS Kinesis or Confluent Cloud (managed). If 3+ real-time use cases emerge over the next 12 months, revisit a broader platform. Otherwise, stay on Fivetran + Snowflake + dbt. ✓ Optimal
Related concepts
Keep connecting.
The concepts that orbit this one: each one sharpens the others.
Beyond the concept
Turn Streaming Data Pipeline into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
Typical response time: 24h · No retainer required