Data Strategy · Advanced · 8 min read

Data Lakehouse Architecture

A Data Lakehouse is an architecture that combines the cheap, flexible storage of a data lake (S3, ADLS, GCS) with the ACID transactions, schema enforcement, and fast SQL of a data warehouse. The technical breakthrough is open table formats — Apache Iceberg, Delta Lake, Apache Hudi — which sit on top of Parquet files in object storage and provide warehouse-like semantics (transactions, time travel, schema evolution, performant queries) without locking data into a proprietary engine. The strategic appeal: store data once in open formats, query it from any engine (Spark, Trino, Snowflake, Databricks SQL, DuckDB), and avoid vendor lock-in. The trade-off vs a pure cloud warehouse (Snowflake, BigQuery): more flexibility and lower storage cost, but more engineering complexity to operate well. The lakehouse is now the dominant architecture for new data platforms at scale (>5 PB).

Also known as: Lakehouse, Open Table Format Architecture, Iceberg/Delta/Hudi, Lake + Warehouse Convergence

The Trap

The trap is adopting a lakehouse architecture for a 50-person company with 10 TB of data because the engineering blogs say it's the future. At that scale, Snowflake or BigQuery will be cheaper, faster, and dramatically simpler to operate than a self-managed Iceberg + Spark + Trino stack. The other trap: choosing a table format (Iceberg vs Delta vs Hudi) based on which engineering blog is loudest, then realizing 18 months in that the format doesn't integrate well with your downstream consumers. The most expensive failure is the 'lakehouse' that's actually just an S3 bucket of Parquet files with no table format, no governance, no ACID — i.e., a swamp wearing a lakehouse t-shirt.

What to Do

Apply a scale + use-case test before adopting lakehouse architecture. (1) Below ~1 PB and ~50 sources: a cloud warehouse (Snowflake/BigQuery) is almost always cheaper and faster — skip the lakehouse. (2) 1-5 PB, or 50-200 sources, or significant ML/data science workloads on raw data: hybrid (warehouse for BI + open formats for data science). (3) 5+ PB, or strong vendor-lock-in concerns, or polyglot engine requirements: full lakehouse with Iceberg/Delta. Then choose the table format based on your dominant engine (Delta if Databricks-centric, Iceberg if multi-engine / Snowflake / Trino, Hudi if heavy CDC/streaming workloads). Invest in a catalog (Unity, Polaris, AWS Glue) and in governance from day one — without these, the lakehouse becomes a swamp.
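The three-tier test above can be sketched as a short decision function. The function name, signature, and return strings are illustrative choices, not a standard API; the thresholds mirror the text.

```python
def recommend_architecture(volume_pb: float, num_sources: int,
                           heavy_ml_on_raw: bool = False,
                           multi_engine: bool = False) -> str:
    """Sketch of the scale + use-case test (thresholds from the text)."""
    # Tier 3: 5+ PB, 200+ sources, or hard polyglot-engine requirements.
    if volume_pb >= 5 or num_sources > 200 or multi_engine:
        return "full lakehouse (Iceberg/Delta on object storage)"
    # Tier 2: 1-5 PB, 50-200 sources, or heavy ML/data science on raw data.
    if volume_pb >= 1 or num_sources > 50 or heavy_ml_on_raw:
        return "hybrid (warehouse for BI + open formats for data science)"
    # Tier 1: below ~1 PB and ~50 sources, a cloud warehouse wins.
    return "cloud warehouse (Snowflake/BigQuery)"
```

Run the Knowledge Check company through it (8 TB is 0.008 PB, 25 sources) and it lands squarely in the cloud-warehouse tier.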

Formula

Lakehouse Worth-It Score ≈ (Data Volume in PB × Number of Query Engines × Vendor Lock-In Cost) ÷ Operating Complexity Tolerance. Score < 3 favors cloud warehouse; > 8 favors lakehouse; 3-8 is hybrid territory.
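As a sanity check, the formula translates directly into code. One assumption to flag: the text does not fix scales for the non-volume inputs, so treating engines as a count and lock-in cost / complexity tolerance as rough 1-10 ratings is an illustrative choice.

```python
def worth_it_score(volume_pb: float, num_engines: int,
                   lock_in_cost: float, complexity_tolerance: float) -> str:
    """Lakehouse Worth-It Score per the formula above.

    volume_pb: data volume in petabytes.
    num_engines: count of distinct query engines you need.
    lock_in_cost, complexity_tolerance: rough 1-10 ratings (an assumption;
    the text leaves their scales open).
    """
    score = (volume_pb * num_engines * lock_in_cost) / complexity_tolerance
    if score < 3:
        return "cloud warehouse"
    if score > 8:
        return "lakehouse"
    return "hybrid"
```

A 0.5 PB shop with two engines rates "cloud warehouse"; a 10 PB estate with four engines and high lock-in cost rates "lakehouse".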

In Practice

Apple, Netflix, Pinterest, and Shopify run massive Iceberg-based lakehouses. Netflix is a particularly well-documented case: they created Iceberg specifically because the Hive table format was breaking at their scale (hundreds of PB, thousands of concurrent queries). Iceberg solved schema evolution, hidden partitioning, and atomic writes that Hive couldn't. Today Netflix runs hundreds of PB on Iceberg, queried from Spark, Trino, Flink, and Pinot — one storage layer, many engines. Without Iceberg, Netflix would have had to either commit to a proprietary warehouse (expensive at PB scale) or accept the limitations of Hive (which were causing real production incidents). The decisive insight: at hyperscale, the cost difference between proprietary warehouse compute and an open-format lakehouse is hundreds of millions per year.

Pro Tips

  • 01

    Choosing a table format is a 5-year commitment. Iceberg is winning the multi-engine race (now supported by Snowflake, BigQuery, Databricks, Trino, Spark, Flink). Delta has the best Databricks experience but weaker non-Databricks support. Hudi excels at CDC/streaming but has narrower adoption. Pick based on your engine future, not blog volume.

  • 02

    The catalog matters as much as the table format. Without a strong catalog (Unity Catalog, Apache Polaris, AWS Glue), a lakehouse devolves into ungoverned files. The catalog is what enforces schemas, permissions, lineage, and consistency across engines. Plan catalog architecture before storage architecture.
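To make "plan catalog architecture first" concrete: with Iceberg, the catalog is wired into each engine through configuration. A minimal sketch of the Spark side, assuming a REST-based catalog (the model Apache Polaris implements); the catalog name `lake`, the endpoint URL, and the bucket path are placeholders, while the property keys are Iceberg's documented Spark catalog settings.

```python
# Hypothetical Spark configuration attaching one Iceberg REST catalog.
# Every engine (Trino, Flink, ...) points at the same catalog endpoint,
# which is what gives you consistent schemas and permissions across engines.
spark_conf = {
    # Enable Iceberg SQL extensions (MERGE, time travel, etc.).
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    # Register a catalog named "lake" backed by Iceberg's SparkCatalog.
    "spark.sql.catalog.lake": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lake.type": "rest",
    "spark.sql.catalog.lake.uri": "https://catalog.example.com/api/catalog",
    "spark.sql.catalog.lake.warehouse": "s3://example-bucket/warehouse",
}
```

The same catalog endpoint then appears in the Trino and Flink configurations, which is the mechanism behind "one storage layer, many engines".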

  • 03

    Cloud warehouse vendors (Snowflake, BigQuery, Databricks SQL) have all added lakehouse interop with open formats — meaning the binary 'warehouse vs lakehouse' choice is dissolving. The pragmatic 2024+ architecture is often: warehouse engine for BI + open table format for storage + multi-engine read for ML. You get warehouse simplicity AND open format flexibility.

Myth vs Reality

Myth

Lakehouses always replace data warehouses

Reality

For most companies under ~1 PB, a cloud warehouse is faster, cheaper, and simpler. Lakehouses become economically dominant only at large scale (5+ PB) or when polyglot engines are required (Spark + Trino + Flink + ML frameworks). Below that, the operational overhead of lakehouse exceeds the cost savings vs Snowflake or BigQuery.

Myth

An S3 bucket of Parquet files is a lakehouse

Reality

Without an open table format (Iceberg/Delta/Hudi) and a catalog, a Parquet-on-S3 setup has no ACID transactions, no schema evolution, no time travel, and no query optimization — it's a data lake (and likely a swamp). The 'house' part of 'lakehouse' is what the table format and catalog provide. Skipping them is the most common form of fake lakehouse adoption.

Try it

Run the numbers.

Pressure-test the concept against your own knowledge — answer the challenge or try the live scenario.

🧪

Knowledge Check

A 200-person Series B SaaS company has 25 source systems and ~8 TB of total data growing to ~50 TB in 3 years. The CTO is excited about adopting a Databricks-based lakehouse. What's the right answer?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets — not absolutes.

Storage Architecture by Data Volume

Industry architecture norms 2024 across enterprise data platforms

<10 TB

Cloud warehouse (Snowflake, BigQuery)

10 TB - 1 PB

Cloud warehouse, optional Iceberg interop

1-5 PB

Hybrid warehouse + open format

>5 PB

Full lakehouse (Iceberg/Delta) on object storage

Source: https://www.databricks.com/glossary/data-lakehouse

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.

🎬

Netflix

2018-present

success

Netflix created Apache Iceberg specifically because the Hive table format was breaking at their scale. With hundreds of PB across thousands of tables, Hive's lack of atomic writes, slow partition listing, and schema evolution problems were causing real production incidents. Iceberg introduced hidden partitioning, snapshots, time travel, and atomic operations — and Netflix open-sourced it in 2018. Today Netflix runs hundreds of PB on Iceberg, queried by Spark, Trino, Flink, and Pinot from one storage layer. Iceberg has since been adopted by Apple, Pinterest, Shopify, Snowflake, and most major data platforms.

Data on Iceberg

Hundreds of PB

Query Engines

Spark, Trino, Flink, Pinot

Open-Sourced

2018

Industry Adoption

Now de facto multi-engine standard

At hyperscale, open table formats deliver compounding value: cost savings, engine flexibility, and freedom from any one vendor. The lakehouse is the architectural answer for the largest data estates in the world.

🚕

Uber

2017-present

success

Uber created Apache Hudi to handle the unique requirements of incremental data lake updates from CDC streams. With hundreds of PB and constant updates from operational systems (trips, payments, user state), Uber needed a lakehouse table format that could handle upserts efficiently. Hudi provides incremental query, record-level upsert, and time travel on top of Parquet/ORC files. Today Uber runs much of their analytics and ML data platform on Hudi, with Spark, Presto, Hive, and Flink as compute engines. Hudi has been adopted by ByteDance, Walmart, and Robinhood for similar CDC-heavy lakehouse use cases.

Data on Hudi

Hundreds of PB

Use Case Strength

CDC, upserts, streaming

Open-Sourced

2017

Architecture Type

Streaming-first lakehouse

The right table format depends on your dominant workload. Iceberg for batch + multi-engine. Delta for Databricks-centric. Hudi for CDC and streaming-heavy. The choice locks you in for years — analyze workload first, blogs second.

📦

Hypothetical: Series B SaaS

2022

failure

A 180-person SaaS company with 12 TB of data adopted a self-managed Iceberg + Spark + Trino lakehouse after the CTO returned from a conference. The migration took 11 months and required hiring 2 platform engineers. After 18 months operating the lakehouse, total cost (engineering + infrastructure) was 2.4x what Snowflake would have cost for the same workload. BI users complained about slower dashboards. The team migrated back to Snowflake at month 22, retaining Iceberg only for a small data science workload. Total opportunity cost: ~$2.5M and 22 months.

Data Volume

12 TB

Migration Time

11 months

Total Cost vs Snowflake

2.4x

Eventual Outcome

Migrated back at month 22

Lakehouse complexity is justified by scale. At 12 TB, the operational overhead of a self-managed lakehouse dwarfs any storage cost savings. Architecture must match scale, not aspiration.
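The failure mode above is mostly arithmetic: at small data volumes, dedicated platform headcount dominates total cost. The component figures below are invented to illustrate the shape of the comparison (the case reports only the 2.4x outcome); the $250K loaded engineer cost is likewise an assumption.

```python
def platform_tco(infra_per_year: float, dedicated_engineers: int,
                 loaded_cost_per_engineer: float = 250_000) -> float:
    """Annual total cost of ownership: infrastructure plus the loaded
    cost of engineers who exist only to run the platform.
    All inputs are illustrative assumptions, not figures from the case."""
    return infra_per_year + dedicated_engineers * loaded_cost_per_engineer

# Hypothetical comparison at ~12 TB: a managed warehouse needs no
# dedicated platform engineers; a self-managed lakehouse needed two hires.
warehouse = platform_tco(infra_per_year=400_000, dedicated_engineers=0)
lakehouse = platform_tco(infra_per_year=450_000, dedicated_engineers=2)
ratio = lakehouse / warehouse
```

With these assumed inputs the ratio comes out near the case's 2.4x; the point is that the headcount term, not storage, drives the gap at this scale.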

Decision scenario

The Lakehouse Migration Pitch

You're CTO at a 1,400-person retailer. Current state: on-prem Hadoop cluster with ~1.5 PB of data, increasingly unreliable. The Databricks team pitches a Delta-based lakehouse ($1.6M/year). The Snowflake team pitches their warehouse with Iceberg interop ($2.4M/year). Your data team is 35 people: 20 on BI/analytics, 15 on data engineering and platform. You have a 6-month deadline before the Hadoop cluster reaches end of vendor support.

Data Volume

1.5 PB

Workloads

BI + ML + ad-hoc analytics

Data Team Size

35

Deadline

6 months

Budget Range

$1.6M - $2.4M/year

01

Decision 1

The CFO wants the cheapest option (Databricks lakehouse). The BI team wants Snowflake because it 'just works' for SQL analysts. The data science team wants Databricks for Spark and ML. You have to choose one to meet the deadline.

Option

Choose Snowflake despite higher cost — BI team familiarity reduces migration risk and meets the 6-month deadline confidently.

Outcome

Migration completes in 5 months. BI team is happy and productive immediately. Data science team continues on a separate (now legacy) Spark setup, creating two copies of governed data. By month 12, governance is bifurcated and the data science team complains about freshness. You spend year 2 evaluating Snowflake's Iceberg interop to unify the two — adding another year of architectural debt. Net: works but expensive and creates unification debt.

Time to Migrate: 5 months · Annual Cost: $2.4M · Unification Debt: Created year 2 problem
Option

Choose Databricks lakehouse with Delta as the table format. Migrate BI workloads to Databricks SQL Warehouse and ML to Spark, both reading the same Delta tables under Unity Catalog. Accept the harder upskill curve to unify the data layer.

Outcome

Migration takes 7 months (1 month over deadline; you negotiated extended Hadoop support for the gap). BI team initially struggles with Databricks SQL but stabilizes by month 9. Data science team is immediately productive. By month 12, both teams query the same governed Delta tables. ML deployment time drops 60%. Total annual cost runs $1.7M, $700K below the Snowflake alternative. By year 2, the unified data layer is publicly cited as the architectural foundation enabling AI/ML work that wouldn't have been possible with split platforms.

Time to Migrate: 7 months · Annual Cost: $1.7M (saved $700K) · Data Layer: Unified BI + ML governance
