KnowMBA Advisory

Data Strategy · Intermediate · 6 min read

Data Lineage

Data lineage is the map of how data flows from source systems through transformations to final consumption (dashboards, ML models, reports). At its simplest, it answers: 'where did this number come from?' At its most useful, it answers the two questions a data team gets asked daily: (1) Impact analysis: 'if I change this upstream column, which dashboards and models will break?' (2) Root cause: 'this dashboard shows wrong numbers; which transformation introduced the error?' Modern lineage tools (Atlan, dbt's exposures, Monte Carlo, OpenLineage) parse SQL and code to auto-generate column-level lineage across the warehouse. Without lineage, every schema change becomes a two-week archaeology project, and every data incident becomes a panicked investigation across Slack channels.
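The two questions can be made concrete with a minimal sketch: a table-level lineage graph where impact analysis walks edges downstream and root cause walks them upstream. The table names are hypothetical, and real tools work at column level across far larger graphs.

```python
from collections import defaultdict

# Hypothetical table-level lineage: an edge (a, b) means "a feeds b".
EDGES = [
    ("raw.salesforce_accounts", "stg.accounts"),
    ("stg.accounts", "fct.churn"),
    ("fct.churn", "dash.churn_dashboard"),
    ("stg.accounts", "dash.revenue_dashboard"),
]

downstream = defaultdict(set)   # impact analysis walks this direction
upstream = defaultdict(set)     # root cause walks this direction
for src, dst in EDGES:
    downstream[src].add(dst)
    upstream[dst].add(src)

def walk(graph, node):
    """Every node transitively reachable from `node` in `graph`."""
    seen, stack = set(), [node]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Impact analysis: what breaks if stg.accounts changes?
print(sorted(walk(downstream, "stg.accounts")))
# -> ['dash.churn_dashboard', 'dash.revenue_dashboard', 'fct.churn']

# Root cause: what feeds the churn dashboard?
print(sorted(walk(upstream, "dash.churn_dashboard")))
# -> ['fct.churn', 'raw.salesforce_accounts', 'stg.accounts']
```

Both questions are the same graph traversal run in opposite directions, which is why one lineage graph serves both workflows.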

Also known as: Data Provenance · Pipeline Lineage · Column-Level Lineage · Data Flow Mapping · Impact Analysis

The Trap

The trap is treating lineage as a pretty visualization to show in board decks rather than an operational tool. A lineage graph nobody actually queries before making changes is worth zero. The other trap is column-level-lineage perfectionism: spending 12 months to map every column across 5,000 tables when only ~20% of columns are actively used in production. Coverage of the right datasets matters far more than total coverage. The most expensive failure: shipping a 'data catalog with lineage' that becomes a wiki of dead links. No one trusts it because it's outdated; no one updates it because it's not part of any workflow.

What to Do

Implement lineage as part of your CI/CD workflow, not as a separate documentation effort. Step 1: adopt a tool that auto-extracts lineage from your warehouse (dbt for transformations, Atlan/Monte Carlo/Datafold for end-to-end). Step 2: establish a hard rule that every PR changing an upstream model must show downstream impact (this is what dbt's `--defer` and slim CI patterns enable). Step 3: integrate lineage into incident response: when a dashboard breaks, the on-call engineer's first move is to walk lineage upstream to find the bad input. Step 4: integrate lineage into deprecation decisions: never deprecate a table without lineage showing zero downstream consumers. The discipline is enforcing that lineage is consulted; the tool is just the surface.
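Step 2 can be approximated even before adopting a vendor tool. A minimal sketch of such a CI gate, assuming a hypothetical `pr_gate` check; the model names and the "Impact review:" PR-body convention are invented for illustration:

```python
# Hypothetical CI gate: fail the check when a PR touches models that have
# downstream consumers but the PR body carries no impact review.
def pr_gate(changed_models, downstream, pr_body):
    impacted = sorted({d for m in changed_models for d in downstream.get(m, ())})
    ok = not impacted or "Impact review:" in pr_body
    return ok, impacted

lineage = {"stg.accounts": {"fct.churn", "dash.revenue"}}

print(pr_gate({"stg.accounts"}, lineage, "Renamed a column."))
# -> (False, ['dash.revenue', 'fct.churn'])  blocked until reviewed
print(pr_gate({"stg.accounts"}, lineage, "Impact review: both consumers checked."))
# -> (True, ['dash.revenue', 'fct.churn'])
```

The point of the sketch is the failure mode: the gate blocks the merge and names the impacted consumers, so consulting lineage stops being optional.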

Formula

Lineage Value = Coverage of Production Models × Workflow Integration × Freshness. A complete graph that's never consulted (Workflow Integration = 0) has zero value, regardless of coverage.
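A toy encoding of the formula, with each factor normalized to [0, 1]. The scoring scale is an assumption; the point is the multiplicative structure, where any zero factor zeroes the product:

```python
def lineage_value(coverage, workflow_integration, freshness):
    """Each factor scored in [0, 1]; any zero factor zeroes the whole product."""
    return coverage * workflow_integration * freshness

# Complete graph that nobody consults: worthless.
print(lineage_value(coverage=1.0, workflow_integration=0.0, freshness=1.0))  # -> 0.0

# Partial coverage that is actually enforced in PR reviews beats it.
print(round(lineage_value(0.7, 0.8, 0.9), 3))  # -> 0.504
```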

In Practice

Atlan, the active metadata platform, publishes case studies showing how customers use lineage to prevent breakage. One customer (a Fortune 500 retailer) reduced incidents caused by upstream schema changes by ~70% after enforcing a policy that every PR to upstream tables must include a downstream impact review using Atlan lineage. The decisive change was workflow integration: making lineage a required gate in the PR review process, not just a graph available in a UI. Before the policy, lineage existed but wasn't used. After, it became the most-checked artifact in the data engineering workflow.

Pro Tips

  • 01

    Column-level lineage matters more than table-level lineage. Knowing 'orders.fact feeds revenue_dashboard' is interesting; knowing 'orders.fact.discount_amount feeds revenue_dashboard.net_revenue' is what prevents a bad change from breaking a board metric.

  • 02

    Lineage is most valuable in three workflows: (1) Pre-change impact analysis (what breaks if I rename this?), (2) Post-incident root cause (what upstream change caused this dashboard to be wrong?), (3) Deprecation decisions (can I delete this table?). If your lineage tool isn't being consulted in those three moments, it's failing.

  • 03

    Auto-extracted lineage from SQL parsers covers ~80% of the warehouse but misses lineage that runs in Python notebooks, ad-hoc scripts, or BI tools. Map and close those gaps explicitly: anything outside the auto-extracted graph is a 'dark zone' where impact analysis is impossible.
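The deprecation check in workflow (3) above reduces to a one-line query over the downstream mapping. A sketch with hypothetical table names:

```python
# Hypothetical deprecation check: a table is safe to drop only when the
# lineage graph shows zero downstream consumers.
def safe_to_deprecate(table, downstream):
    consumers = sorted(downstream.get(table, ()))
    return not consumers, consumers

downstream = {
    "stg.legacy_orders": {"dash.old_report"},
    "stg.unused_tmp": set(),
}

print(safe_to_deprecate("stg.unused_tmp", downstream))     # -> (True, [])
print(safe_to_deprecate("stg.legacy_orders", downstream))  # -> (False, ['dash.old_report'])
```

Note the caveat from the dark-zone tip: this answer is only as trustworthy as the graph's coverage, since a consumer in a notebook or ad-hoc script won't appear in `downstream` at all.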

Myth vs Reality

Myth

"Data lineage is documentation"

Reality

Documentation is human-maintained, gets stale, and is consulted only when someone remembers to. Real lineage is auto-extracted from code, updated continuously, and enforced as a CI gate. The shift from 'lineage as docs' to 'lineage as code-driven artifact' is what separates organizations where lineage works from those where it sits unused.

Myth

"BI tools provide enough lineage natively"

Reality

BI tools show table-level dependencies for their own dashboards but don't see across the upstream warehouse, ETL, or other BI tools in the same org. End-to-end lineage requires a dedicated platform that integrates source systems, transformation layer, BI tools, and reverse ETL. Companies relying on BI-native lineage have a partial map that misses 60-80% of the actual graph.


Knowledge Check

An upstream engineer renames a column in a Salesforce export from `account_status` to `account_state`. Three weeks later, your CFO complains the churn dashboard is showing wrong numbers. You have no lineage tooling. What is the typical investigation time?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

Lineage Coverage of Production Models (active metadata / lineage adoption surveys: Atlan, Alation, dbt 2024)

  • Mature data org: >90% column-level coverage
  • Good: 70-90% table-level
  • Average: 40-70% partial coverage
  • Lineage Dark Zone: <40% coverage

Source: https://atlan.com/active-metadata/
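As a rough self-check, the tiers above can be encoded as a classifier. This is a simplification: the published tiers mix column-level and table-level coverage, which a single percentage can't capture.

```python
def coverage_tier(pct):
    """Map lineage coverage of production models (%) to the benchmark tiers."""
    if pct > 90:
        return "Mature data org"
    if pct >= 70:
        return "Good"
    if pct >= 40:
        return "Average"
    return "Lineage Dark Zone"

print(coverage_tier(95))  # -> Mature data org
print(coverage_tier(35))  # -> Lineage Dark Zone
```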

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.

🛒

Atlan customer (Fortune 500 retailer)

2022-2024

success

A Fortune 500 retailer adopted Atlan as its active metadata platform with column-level lineage across the warehouse, dbt models, and BI tools. Initially the lineage existed but wasn't part of any workflow. After establishing a policy that every PR modifying upstream models must include an Atlan-generated downstream impact review, breaking changes shipped to production fell ~70%. Engineering time spent on schema-change root cause dropped sharply. The decisive move was workflow integration: making lineage a required artifact in PR reviews, not a side-channel reference.

  • Breaking Changes Reduction: ~70%
  • Lineage Coverage: Column-level across warehouse + BI
  • Workflow Gate: PR review enforcement
  • Time to Root Cause: Hours → minutes

Lineage delivers value when it's enforced in workflows (PRs, deprecation, incident response). Lineage as a documentation artifact delivers near-zero value.

🧱

dbt Labs (open source community)

2018-present

success

dbt's transformation framework includes built-in lineage extraction from compiled SQL: every model declares its dependencies, and dbt generates a DAG visualization plus the `--defer` and slim CI patterns that show downstream impact of any change. This has become the de facto standard for lineage in the modern data stack. dbt Cloud customers regularly cite lineage-driven impact analysis as a primary reason for adoption. The dbt model proves that lineage works best when it's a byproduct of how the work is already done, not a separate documentation step.

  • Active Projects (open source + Cloud): 30,000+
  • Lineage Type: Model-level + column-level (Cloud)
  • Source: Auto-extracted from SQL DAG
  • CI Pattern: Slim CI / --defer

The best lineage is the lineage that comes free from how you already build pipelines. Adopting dbt automatically delivers ~80% of lineage value because the DAG is the source of truth.

📋

Hypothetical: Mid-Market SaaS

2021

failure

A 350-person SaaS bought a data catalog tool with lineage features for $80K/year. The catalog was launched with great fanfare but was never integrated into PR reviews, incident response, or deprecation workflows. After 6 months, the lineage UI was visited an average of 3 times per week (mostly by the catalog admin). When a schema change broke a key dashboard, engineers grepped the dbt repo as before; they didn't think to check the catalog. Renewal was declined the next year. The lesson: an $80K wiki is still a wiki if no one uses it.

  • Annual Cost: $80K
  • Lineage UI Usage: ~3 visits/week
  • Workflow Integration: None
  • Incidents Prevented: Zero attributable

Lineage tools without workflow enforcement are the most common form of data investment waste. The tool buys you nothing; the workflow integration buys you everything.


Beyond the concept

Turn Data Lineage into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
