KnowMBA Advisory
Data Strategy · Advanced · 7 min read

CDC and Streaming

Change Data Capture (CDC) is the technique of reading a source database's transaction log (PostgreSQL WAL, MySQL binlog, Oracle redo log, SQL Server CDC tables) to capture every insert, update, and delete as a stream of change events, typically published into Kafka, Kinesis, or directly into a destination warehouse. CDC + streaming replaces the traditional 'batch ETL every 4 hours' pattern with continuous, low-latency replication: change events flow within seconds of the source commit. The architecture pairs a CDC tool (Debezium is the dominant open-source implementation; Fivetran, Airbyte, Striim, and Estuary offer managed alternatives) with a streaming backbone (Confluent Kafka, AWS Kinesis, Redpanda) and a destination (warehouse, lakehouse, downstream microservice, search index). The honest test: does your business actually need sub-minute data freshness for the use cases you would build? If yes, CDC pays for itself; if no, you're paying a streaming infrastructure tax for batch data.
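To make the "stream of change events" concrete, here is a minimal sketch of folding change events into a replica. The envelope fields ("before", "after", "op") follow Debezium's documented event shape; the handler itself, the `id` primary key, and the sample rows are illustrative assumptions, not a real consumer.

```python
# Hypothetical sketch of applying Debezium-style change events to an
# in-memory replica keyed by primary key. Not a production consumer.

def apply_change_event(state: dict, event: dict) -> dict:
    """Fold one CDC change event into a replica of the source table."""
    op = event["op"]            # "c" = create, "u" = update, "d" = delete
    if op in ("c", "u", "r"):   # "r" = snapshot read during initial load
        row = event["after"]
        state[row["id"]] = row
    elif op == "d":
        row = event["before"]
        state.pop(row["id"], None)
    return state

state = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "balance": 100}},
    {"op": "u", "before": {"id": 1, "balance": 100}, "after": {"id": 1, "balance": 80}},
    {"op": "d", "before": {"id": 1, "balance": 80}, "after": None},
]
for e in events:
    apply_change_event(state, e)
# state is empty again: the row was created, updated, then deleted
```

Replaying the same log always reconverges to the same state, which is why log-based CDC can rebuild a destination from scratch.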

Also known as: Change Data Capture, CDC Pipelines, Real-Time Replication, Event Streaming, Debezium, Kafka CDC, Log-Based Replication

The Trap

The trap is adopting CDC + streaming because real-time sounds modern, then discovering that 95% of your dashboards are reviewed daily and the streaming infrastructure cost (Kafka cluster, ops overhead, schema registry, monitoring) is 5-10x what an hourly batch job would have cost. The other trap is the operational maturity gap: CDC pipelines fail in subtle ways (replication slot exhaustion in Postgres, binlog rotation in MySQL, schema drift cascades, exactly-once semantics edge cases) that batch ETL does not. Without 24/7 on-call and strong observability, CDC will create incidents you didn't have before. KnowMBA POV: most companies that adopt CDC + streaming would be better served by Fivetran + 15-minute batch loads + dbt for 90% of use cases, adding CDC selectively for the genuinely real-time use cases (fraud detection, operational dashboards, search indexing, customer-facing personalization). Treating CDC as the universal pipeline pattern is overengineering.

What to Do

Adopt CDC selectively, not universally.

Step 1: list your actual use cases by required freshness: 'CFO dashboard reviewed Monday morning' (daily is fine), 'fraud detection model' (sub-minute), 'product analytics' (15 minutes is usually fine), 'customer-facing search index' (sub-minute).
Step 2: use batch ETL (Fivetran, Airbyte) for the 80% that is daily/hourly.
Step 3: use CDC + streaming only for the 20% that genuinely needs sub-minute freshness.
Step 4: choose tooling: Debezium + Kafka (open source, high control, high ops burden), Fivetran HVR (managed, lower control, lower burden), or your warehouse vendor's native CDC (Snowflake Snowpipe Streaming, Databricks Auto Loader).
Step 5: invest in observability: replication lag dashboards, schema-drift alerts, dead-letter queue monitoring, exactly-once verification.
Step 6: write runbooks for the failure modes (replication slot full, schema break, lag spike) and rehearse them.
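Steps 1-3 above amount to a routing decision per use case. A minimal sketch, assuming a 5-minute staleness threshold (the article's cutoff) and hypothetical use-case names; real triage would also weigh cost and ops maturity.

```python
# Illustrative triage: route each use case to batch or CDC by required
# freshness. The 300-second threshold is this article's rule of thumb.

BATCH_OK_SECONDS = 300  # anything tolerating >= 5-minute staleness stays on batch

def route_pipelines(use_cases: dict) -> dict:
    """use_cases maps name -> required freshness in seconds."""
    plan = {"batch": [], "cdc": []}
    for name, freshness_s in sorted(use_cases.items()):
        tier = "cdc" if freshness_s < BATCH_OK_SECONDS else "batch"
        plan[tier].append(name)
    return plan

plan = route_pipelines({
    "cfo_dashboard": 86_400,       # reviewed Monday morning: daily is fine
    "product_analytics": 900,      # 15 minutes is usually fine
    "fraud_detection": 60,         # sub-minute
    "storefront_search_index": 1,  # sub-second
})
# plan["cdc"] -> ["fraud_detection", "storefront_search_index"]
```

Most inventories run this exercise and discover the "cdc" bucket is far smaller than expected.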

Formula

CDC Adoption Decision: Required Freshness × Use Case Volume × Operational Maturity. CDC fits when freshness < 5 minutes is required for >20% of use cases AND the team has 24/7 on-call. Otherwise batch + dbt is cheaper, simpler, and good enough.
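The rule above is simple enough to encode directly. The thresholds (5-minute freshness, 20% of use cases, 24/7 on-call) come straight from the formula; the function name and signature are hypothetical.

```python
# Minimal sketch of the CDC adoption rule. All three conditions must hold.

def cdc_fits(freshness_required_s: float,
             realtime_use_case_share: float,
             has_247_oncall: bool) -> bool:
    """True when CDC + streaming is worth the operational tax."""
    return (freshness_required_s < 300
            and realtime_use_case_share > 0.20
            and has_247_oncall)

cdc_fits(60, 0.25, True)   # -> True: real need, enough use cases, ops maturity
cdc_fits(60, 0.05, True)   # -> False: too few use cases; batch + dbt wins
cdc_fits(60, 0.25, False)  # -> False: no 24/7 on-call means CDC incidents go unhandled
```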

In Practice

Confluent (the company commercializing Apache Kafka) and Debezium (the open-source CDC framework now under the Red Hat umbrella) together define the modern CDC + streaming reference architecture. Public case studies: Wise (formerly TransferWise) runs CDC from MySQL into Kafka into multiple downstream services for cross-border payment processing, where sub-second freshness on transaction state is a regulatory and customer experience requirement. Netflix uses CDC for replication between their Cassandra-based services. Uber's Marmaray and Apache Hudi work was driven by CDC ingestion needs for their massive operational data volumes. The recurring pattern: CDC + streaming wins decisively when sub-minute freshness has clear business value (payments, fraud, real-time inventory, customer-facing search/personalization) and loses to simpler batch when the data is consumed in dashboards reviewed once a day.

Pro Tips

  • 01

Replication lag is your #1 operational metric. A CDC pipeline with growing lag is a failing pipeline: within hours it becomes hours-stale, defeating the entire point. Alert at 30-second lag; page at 5-minute lag for any pipeline marketed as real-time.

  • 02

    Schema evolution is the second-hardest CDC problem (after exactly-once semantics). Use a schema registry (Confluent Schema Registry, AWS Glue Schema Registry) and enforce backward-compatible changes only at the source. An upstream column rename can cascade into 12 broken downstream consumers within minutes.

  • 03

Exactly-once semantics are easier to claim than to deliver. Most CDC pipelines provide 'at-least-once' with deduplication on the consumer side. For financial use cases where double-counting matters, build idempotency into your consumers explicitly, and don't trust the streaming framework's exactly-once promises without testing them under failure conditions.
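The alerting tiers from tip 01 can be sketched as a small severity function. The 30-second and 5-minute thresholds are the article's; the function name and severity labels are illustrative assumptions.

```python
# Map end-to-end replication lag to an alerting tier (sketch).

WARN_LAG_S = 30    # a "real-time" pipeline lagging 30s deserves attention
PAGE_LAG_S = 300   # at 5 minutes of lag the pipeline is effectively batch

def lag_severity(lag_seconds: float) -> str:
    """Return the alerting tier for the current replication lag."""
    if lag_seconds >= PAGE_LAG_S:
        return "page"
    if lag_seconds >= WARN_LAG_S:
        return "warn"
    return "ok"

lag_severity(4)    # -> "ok"
lag_severity(45)   # -> "warn": alert the data team
lag_severity(600)  # -> "page": the real-time SLA is broken
```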
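Tip 02's "backward-compatible changes only" rule can be approximated with a toy check: a schema change is safe for existing consumers if no field they rely on disappears and every newly added field carries a default. Real registries (Confluent Schema Registry, AWS Glue) define BACKWARD/FORWARD modes precisely over Avro/Protobuf/JSON Schema; this dict-based version is purely illustrative.

```python
# Toy consumer-safety check for a schema change (sketch, not registry semantics).

def safe_for_existing_consumers(old_fields: dict, new_fields: dict) -> bool:
    for name in old_fields:
        if name not in new_fields:
            return False  # dropped/renamed field breaks existing readers
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False  # new required field breaks old records
    return True

old = {"id": {"type": "long"}, "email": {"type": "string"}}
ok  = safe_for_existing_consumers(old, {**old, "plan": {"type": "string", "default": "free"}})
bad = safe_for_existing_consumers(old, {"id": {"type": "long"}, "email_address": {"type": "string"}})
# ok is True; bad is False: a rename looks like a drop plus a new required field,
# which is exactly the cascade the tip warns about
```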
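Tip 03's explicit idempotency looks roughly like this: each event carries a unique id, and redeliveries are dropped before the side effect is applied. The class, event shape, and balance example are hypothetical; in production the seen-id set must live in a durable store, not memory.

```python
# Sketch of an at-least-once consumer made safe by explicit idempotency.

class IdempotentConsumer:
    def __init__(self):
        self.seen_ids = set()   # in production: a durable store, not memory
        self.balances = {}

    def handle(self, event: dict) -> bool:
        """Apply the event exactly once; return False for a duplicate delivery."""
        if event["event_id"] in self.seen_ids:
            return False
        self.seen_ids.add(event["event_id"])
        acct = event["account"]
        self.balances[acct] = self.balances.get(acct, 0) + event["amount"]
        return True

c = IdempotentConsumer()
e = {"event_id": "tx-001", "account": "a1", "amount": 50}
c.handle(e)  # applied: balance becomes 50
c.handle(e)  # duplicate delivery: ignored, balance stays 50
```

With this in place, at-least-once delivery from the broker is sufficient even for double-counting-sensitive workloads.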

Myth vs Reality

Myth

"Real-time streaming is the modern default; batch is legacy"

Reality

Batch is the right answer for the vast majority of analytics use cases. Daily and hourly dashboards don't benefit from sub-second freshness. Streaming infrastructure costs 5-10x more in operational overhead than equivalent batch pipelines, and most companies underestimate that delta until the on-call burden hits engineering morale. Use streaming where it matters; use batch where it's good enough.

Myth

"CDC eliminates the need for batch transformations"

Reality

CDC handles ingestion. You still need transformations (joins, aggregations, business logic) on the destination side, and most of those are still better expressed as batch dbt models running every 5-15 minutes than as continuous stream processing. Stream processing is hard to debug, hard to backfill, and overkill for most aggregation logic. The dominant modern pattern is CDC ingestion + micro-batch dbt transformations, not full end-to-end streaming.
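The "CDC ingestion + micro-batch transformation" pattern above boils down to a dedup step: within each micro-batch, keep only the latest change per primary key, ordered by log position. In a warehouse this is typically a dbt incremental model with MERGE; this in-memory version, with its `lsn` field and sample batch, is illustrative only.

```python
# Collapse a micro-batch of change events to final row state per id (sketch).

def latest_per_key(batch: list) -> dict:
    """Keep only the last change per primary key, in log order."""
    latest = {}
    for ev in sorted(batch, key=lambda e: e["lsn"]):  # lsn = log sequence number
        if ev["op"] == "d":
            latest[ev["id"]] = None                   # tombstone: row deleted
        else:
            latest[ev["id"]] = ev["row"]
    return {k: v for k, v in latest.items() if v is not None}

batch = [
    {"lsn": 3, "op": "u", "id": 7, "row": {"id": 7, "qty": 2}},
    {"lsn": 1, "op": "c", "id": 7, "row": {"id": 7, "qty": 5}},
    {"lsn": 2, "op": "c", "id": 8, "row": {"id": 8, "qty": 1}},
    {"lsn": 4, "op": "d", "id": 8, "row": None},
]
# latest_per_key(batch) -> {7: {"id": 7, "qty": 2}}  (row 8 created then deleted)
```

Because each micro-batch collapses to final state, the downstream dbt model stays a plain batch transformation: easy to debug and easy to backfill.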

Try it

Run the numbers.

Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.

🧪

Knowledge Check

A 300-person company is debating whether to migrate all data pipelines from Fivetran (15-min batch) to Debezium + Kafka (CDC streaming). Their use cases: 60 dashboards reviewed daily, 8 dashboards reviewed hourly, 2 fraud detection models requiring sub-minute data, and 1 customer-facing personalization service requiring sub-second data. What is the right architecture?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

CDC Pipeline Replication Lag (production benchmarks)

Debezium + Kafka pipelines in production at mid-to-large enterprises

Excellent: < 5 seconds end-to-end
Good: 5-30 seconds
Acceptable: 30 seconds - 2 minutes
Degraded (failing the SLA): > 2 minutes

Source: https://debezium.io/documentation/reference/stable/architecture.html

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.

💸

Wise (formerly TransferWise)

2018-present

success

Wise runs CDC pipelines from MySQL into Kafka into multiple downstream services for cross-border payment processing. Sub-second freshness on transaction state is required by both regulators (real-time fraud and AML monitoring) and customers (instant balance updates). Their published architecture uses Debezium for change capture, Kafka as the streaming backbone, and downstream consumers ranging from fraud-detection ML models to customer-facing balance services to compliance reporting pipelines. The CDC + streaming architecture is foundational to the product, not a layer added later; for Wise, sub-second data is the product.

Source: MySQL via Debezium
Streaming Backbone: Apache Kafka
Latency Requirement: Sub-second end-to-end
Business Driver: Regulatory (AML/fraud) + UX (balance freshness)

Streaming wins decisively when sub-second freshness is part of the product. The cost is justified by the experience and regulatory outcomes the architecture enables.

Source ↗
🌊

Confluent + Debezium

2014-present

success

Confluent (commercializing Apache Kafka) and Debezium (now under Red Hat) together define the open-source reference architecture for CDC + streaming. Confluent Cloud handles managed Kafka, Schema Registry, and ksqlDB; Debezium provides connectors for MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, and others. Customer adoption is concentrated in financial services, fraud detection, real-time inventory, and customer-facing personalization. Public case studies (Wise, Robinhood, Trivago, Lyft) all share a common pattern: sub-second freshness is required by either a regulator or a customer-facing experience.

Reference Stack: Debezium + Kafka + Schema Registry
Confluent Cloud Customers: 5,000+
Sweet Spot Industries: Finance, fraud, real-time commerce
Common Pattern: CDC ingestion + downstream micro-services

The Debezium + Kafka stack is the industry default for serious CDC + streaming. The question to ask is not 'can we do CDC' but 'do we need CDC for this specific use case'.

Source ↗
📋

Hypothetical: Mid-Market SaaS

2021-2023

failure

A 350-person SaaS company decided to standardize all data pipelines on self-managed Kafka + Debezium because the CTO wanted a 'real-time data platform'. They migrated 35 pipelines over 18 months at a fully-loaded cost of ~$1.4M (infrastructure + 2 dedicated streaming engineers + opportunity cost of slower analytical shipping). The actual freshness benefit: 4 of the 35 pipelines had a use case for sub-minute freshness; the other 31 dashboards were reviewed daily. After the new CFO did the math, the company hybridized (Fivetran for the 31 batch pipelines, Kafka for the 4 real-time ones), saving ~$700K/year in ongoing operational cost. The lesson written up internally: 'real-time should be a feature for the use cases that need it, not a default for everything.'

Migration Investment: ~$1.4M over 18 months
Pipelines Genuinely Needing Real-Time: 4 of 35
Annual Operational Cost Reduction (after hybrid): ~$700K
Hindsight Architecture: Hybrid (batch + selective streaming)

'Real-time everywhere' is the most expensive architectural choice you can make for the wrong reasons. Reserve streaming for the use cases that actually need sub-minute data.

Decision scenario

The CDC Adoption Decision

You're VP of Data at a 600-person ecommerce company. Currently using Fivetran ($150K/year) for 50 source-to-warehouse pipelines, dbt for transformations, mostly batch use cases. The product team wants real-time inventory updates for the storefront (sub-second freshness on stock levels) and the fraud team wants sub-minute transaction streaming. Your CTO suggests 'while we're at it, let's migrate everything to Kafka and standardize'. Engineering capacity is tight. You have 6 months and need to deliver the real-time use cases without blowing the data team's roadmap.

Total Pipelines: 50
Pipelines Needing Sub-Minute Freshness: 2 (real-time inventory, fraud)
Current Annual Pipeline Spend: $150K (Fivetran)
CTO Proposal: Migrate all 50 to Kafka
Engineering Headroom: Tight

Decision 1

You can either accept the CTO's universal-streaming proposal, or push for hybrid (batch for 48 pipelines, streaming for 2). The CTO is a respected technical voice and the proposal sounds modern.

Option A: Universal streaming. Migrate all 50 pipelines to Confluent Cloud + Debezium over 12 months. Hire 2 streaming engineers.

Month 6: 12 of 50 pipelines migrated; replication lag incidents are now a weekly occurrence. A schema drift cascade caused a 4-hour analytics outage. The 2 hired streaming engineers are senior and expensive ($240K each loaded). Total spend trajectory: ~$1.2M/year ongoing. The CFO asks why the company is spending 8x what Fivetran cost for marginal freshness on 96% of dashboards. Project paused at month 9. The hybrid retrofit eventually consumes another $400K. Total damage: ~$2M and 18 months of distraction.

Annual Cost: $150K → $1.2M+ trajectory
Outages Caused: +8 in year 1
Roadmap Slip: 12+ months of analytics work delayed
Option B: Hybrid. Keep Fivetran for the 48 batch pipelines, deploy Confluent Cloud + Debezium for the 2 real-time use cases (inventory + fraud). Add 1 streaming engineer (not 2).

Month 4: real-time inventory live with <1-second freshness; fraud streaming live with <30-second latency. Storefront conversion lift from accurate stock display: ~1.8% (worth ~$3M/year on company revenue). Fraud chargebacks down 22% from faster detection. Total incremental cost: ~$200K/year ($80K Confluent + $120K incremental engineer). Fivetran handles the 48 batch pipelines uneventfully. The CFO sees 15x ROI on the streaming investment because it was scoped to where it actually matters. The roadmap remains intact.

Annual Incremental Cost: +$200K (vs +$1M for the alternative)
Storefront Conversion Lift: +1.8%
Fraud Chargebacks: -22%
