
Data Engineering Practice

A Data Engineering Practice is the team and operating model responsible for the pipes: ingestion, storage, orchestration, schema management, and the underlying compute platform. Its work product is reliable, performant, well-modeled raw data, not dashboards and not insights. The team owns SLAs on data freshness and pipeline uptime; it owns the cost of the warehouse; it owns the schema evolution policy. A healthy DE practice runs like an SRE team for data: on-call rotations, post-mortems, capacity planning, and a roadmap measured in platform-reliability metrics, not 'tickets closed.'

Also known as: DE Practice, Data Platform Engineering, Data Infrastructure Team

The Trap

The trap is staffing a data engineering team and then expecting it to also write analytics queries, build dashboards, and answer business questions. This is the 'one team does everything' anti-pattern. KnowMBA POV: data engineers are software engineers who happen to work on data; most are weak at SQL business logic and uninterested in stakeholder management. Asking them to do analytics work makes them quit AND produces bad analytics. The fix is splitting data engineering from analytics engineering (the next step in the maturity model).

What to Do

Define your DE practice charter on three axes: (1) Scope: they own the raw and staging layers and the platform; analytics engineers own the modeled marts. (2) Reliability targets: publish freshness and uptime SLAs (e.g., 99.5% on-time delivery for tier-1 pipelines). (3) On-call rotation: a real one, with a runbook and a paging tool. If you can't fund all three, you have an 'ad-hoc data engineering effort,' not a practice.
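
A minimal sketch of how such a charter might be encoded as reviewable configuration, in Python. The names, tiers, and thresholds here are illustrative assumptions, not a standard; substitute your own.

    from dataclasses import dataclass

    @dataclass
    class PipelineSLA:
        tier: int                # 1 = business-critical, 3 = best-effort
        freshness_hours: float   # max acceptable data staleness
        on_time_target: float    # e.g. 0.995 = 99.5% on-time delivery

    # Hypothetical charter covering the three axes above.
    CHARTER = {
        "scope": {
            "data_engineering": ["raw", "staging", "platform"],
            "analytics_engineering": ["marts"],
        },
        "reliability_targets": {
            1: PipelineSLA(tier=1, freshness_hours=1, on_time_target=0.995),
            2: PipelineSLA(tier=2, freshness_hours=6, on_time_target=0.99),
            3: PipelineSLA(tier=3, freshness_hours=24, on_time_target=0.95),
        },
        "on_call": {"rotation_weeks": 1, "runbook_required": True},
    }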

Formula

Pipeline Reliability = (Successful On-Time Pipeline Runs ÷ Total Scheduled Runs) × 100%
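
A direct translation of the formula, with made-up run counts (a hypothetical hourly tier-1 pipeline over 30 days):

    # Worked example of the reliability formula; the counts are invented.
    successful_on_time_runs = 713   # hourly pipeline, 30 days = 720 scheduled
    total_scheduled_runs = 720

    reliability = successful_on_time_runs / total_scheduled_runs * 100
    print(f"Pipeline reliability: {reliability:.2f}%")  # -> 99.03%

    # Compare against a published tier-1 SLA of 99.5% on-time delivery.
    print("Meets SLA" if reliability >= 99.5 else "SLA breach")  # -> SLA breach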

In Practice

Netflix's Data Platform team famously operates one of the world's largest data infrastructures (multi-petabyte daily processing) with strict reliability SLAs. Their engineering blog details on-call rotations, formal post-mortems for pipeline outages, and dedicated platform PMs: practices borrowed from SRE, not from traditional analytics teams. The result: thousands of internal users self-serve on a platform that almost never goes down, and the central data engineering team scales sub-linearly with usage.

Pro Tips

  • 01

    Adopt SRE practices wholesale: error budgets, blameless post-mortems, SLO/SLI/SLA distinction, and weekly operational reviews. Data engineering is closer to SRE than to traditional ETL engineering.

  • 02

    Build a 'pipeline catalog' that lists every pipeline with: owner, tier (1/2/3), freshness SLA, and last 30-day reliability score. This single artifact unlocks executive accountability and engineering prioritization (see the sketch after these tips).

  • 03

    The single biggest skill gap in data engineering hiring is software engineering rigor: testing, version control, code review, CI/CD. Hire from backend engineering rather than from BI/ETL backgrounds and you'll see immediate quality improvement.
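
A minimal sketch of the pipeline catalog from tip 02, in Python. The field names and example rows are invented; a real catalog usually lives in a warehouse table or a YAML file checked into the pipeline repo.

    from dataclasses import dataclass

    @dataclass
    class CatalogEntry:
        name: str
        owner: str               # a team, never an individual
        tier: int                # 1 = business-critical, 3 = best-effort
        freshness_sla_hours: float
        reliability_30d: float   # % of on-time runs, trailing 30 days

    CATALOG = [
        CatalogEntry("billing_events", "data-platform", 1, 1.0, 99.1),
        CatalogEntry("marketing_attribution", "data-platform", 2, 6.0, 97.2),
        CatalogEntry("legacy_export", "unowned", 3, 24.0, 88.0),
    ]

    # The accountability view: tier-1 pipelines below a 99.5% target.
    for entry in CATALOG:
        if entry.tier == 1 and entry.reliability_30d < 99.5:
            print(f"AT RISK: {entry.name} ({entry.reliability_30d}% vs 99.5%)")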

Myth vs Reality

Myth

“Data engineers and analytics engineers are interchangeable”

Reality

They are different jobs requiring different skills. Data engineers optimize Spark jobs and design Kafka topologies; analytics engineers write dbt models and define metric semantics. The Venn overlap is maybe 30%. Treating them as one role produces a team that is mediocre at both.

Myth

“Pipelines should be built by whoever needs the data”

Reality

Decentralized pipeline ownership without governance creates a maintenance nightmare. Within 18 months you have 200 hand-rolled pipelines, no one knows which are still in use, and the warehouse bill is 3x what it should be. Centralized DE practice + decentralized analytics engineering is the proven model.

Try it

Run the numbers.

Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.


Knowledge Check

Your data engineering team is constantly being pulled into 'urgent' analytics requests from sales and marketing. Reliability of core pipelines is declining. What is the structural fix?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

Tier-1 Pipeline Reliability

Business-critical data pipelines (revenue, billing, executive reporting)

Elite: ≥ 99.9%
Strong: 99-99.9%
Acceptable: 98-99%
Poor: < 98%

Source: hypothetical KnowMBA synthesis from Monte Carlo's State of Data Quality 2024 and practitioner interviews
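
To calibrate a measured number against these bands programmatically, a small helper with the thresholds copied from the table above:

    def reliability_tier(pct: float) -> str:
        """Map a tier-1 pipeline reliability % to the benchmark bands."""
        if pct >= 99.9:
            return "Elite"
        if pct >= 99.0:
            return "Strong"
        if pct >= 98.0:
            return "Acceptable"
        return "Poor"

    print(reliability_tier(99.03))  # -> Strong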

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.


Netflix Data Platform

2015-Present

Outcome: success

Netflix's data platform team operates one of the world's largest data infrastructures, processing multi-petabyte volumes daily for billions of viewing events. They publish their architecture and operational practices openly: dedicated on-call rotations, formal post-mortems, error budgets, and platform PMs. The result is a self-serve platform used by thousands of internal users with sub-linear central headcount growth; the team has grown a fraction as fast as the data volume.

Daily Data Volume: Multi-PB scale
Internal Users: Thousands
Operating Model: SRE-style rotations + post-mortems
Headcount Growth vs Usage Growth: Sub-linear

Treat data engineering like infrastructure SRE, not like a service desk. The discipline that comes from on-call, SLOs, and post-mortems is what enables sub-linear scaling.


Decision scenario

Scaling the Data Engineering Team

You're the head of data at a 400-person company. Your data engineering team of 6 is drowning. Pipeline reliability has dropped from 98% to 91% over six months. The CFO is asking for a hiring case. The CTO is asking why you can't 'just use AI to fix this.'

DE Team Size: 6 engineers
Total Pipelines: 320
Tier-1 Pipeline Reliability (6 months ago): 98%
Tier-1 Pipeline Reliability (today): 91%
Open Pipeline Backlog: 47 requests

Decision 1

Investigation reveals: 60% of the team's time is spent on ad-hoc analytics requests from business teams (which they're poorly suited to). Only 40% goes to actual platform work. The reliability drop correlates exactly with when the central BI team was 'consolidated' into data engineering.

Hire 4 more data engineers to absorb the workload
Six months later, you have 10 engineers, the same problems, and a $1.2M+ annual bill. The new hires also get pulled into ad-hoc requests. Reliability stays at ~92%. The CFO is questioning the entire data team's ROI. You've doubled cost without fixing the structural problem.
Team Size: 6 → 10 · Annual Cost: +$1.2M · Reliability: 91% → 92%
Restructure: keep 5 data engineers focused only on platform + pipelines; hire 3 analytics engineers to own modeled data and serve business teams
Within one quarter, data engineers have time to fix the reliability issues (back to 98% within 4 months). Analytics engineers absorb the business request queue with a 70% faster turnaround because they think in business semantics. Total headcount adds 2 (not 4), the CFO is happier with the smaller bill, and business teams are dramatically more satisfied. The structural separation pays for itself.
Team Composition: 6 DE → 5 DE + 3 AE · Reliability: 91% → 98% · Annual Cost Increase: +$540K (vs $1.2M) · Business Request Turnaround: 70% faster


Beyond the concept

Turn Data Engineering Practice into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
