KnowMBAAdvisory
Data Strategy · Intermediate · 6 min read

Data Discovery

Data Discovery is the practice — and the user experience — of letting an analyst answer 'where is the data I need?' in seconds, without pinging anyone. It's the consumption surface of the catalog. Where Data Catalog is the inventory, Discovery is the search bar, the ranking algorithm, the 'people who used this also used' suggestions, and the workflow integration that surfaces datasets in Slack or the BI tool. The honest measure is the time-to-trusted-dataset: from the moment a question forms ('what's revenue by segment?') to the moment the analyst is querying the right, certified table. Best-in-class orgs hit under 2 minutes; typical orgs sit at 30 minutes to 2 hours; broken orgs measure it in days.
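The 'people who used this also used' suggestions mentioned above can be mined directly from warehouse query logs by counting which tables co-occur in the same analyst's sessions. A minimal sketch — the table and session data are hypothetical:

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_query_suggestions(sessions, table, top_n=5):
    """'People who used this also used': count how often pairs of
    tables appear together in the same query-log session."""
    pair_counts = defaultdict(Counter)
    for tables_in_session in sessions:
        for a, b in combinations(sorted(set(tables_in_session)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return [t for t, _ in pair_counts[table].most_common(top_n)]

# Each session lists the tables one analyst touched (illustrative data).
sessions = [
    ["orders.fact", "customers.dim", "dates.dim"],
    ["orders.fact", "customers.dim"],
    ["orders.fact", "payments.fact"],
]
print(co_query_suggestions(sessions, "orders.fact"))
# customers.dim ranks first (co-occurs twice), ahead of the one-off pairs
```

In production the sessions would come from the warehouse's query history (e.g. Snowflake's `QUERY_HISTORY` or BigQuery's audit logs), but the counting logic stays this simple.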

Also known as: Self-Serve Data Discovery · Search-Driven Analytics · Dataset Search · Data Findability · Knowledge Discovery

The Trap

The trap is confusing 'we have search' with 'discovery works'. Most catalogs ship with full-text search that returns 200 results when an analyst types 'revenue', with no ranking signal beyond keyword match. The analyst still has to ask a senior engineer which one is real. Real discovery uses popularity (how many queries hit this table last 30 days), certification status, recency, and lineage proximity to ground-truth sources. The KnowMBA POV: most companies invest in cataloging metadata (the inventory) but neglect ranking and surfacing (the discovery), then wonder why nobody uses the catalog. It's like building a library with great cataloging but no Dewey Decimal, no recommendations, and no librarian — accurate but unusable.

What to Do

Treat discovery as a search/recommendation product, not a documentation problem. Step 1: instrument query logs from your warehouse — popularity is your single most useful ranking signal. Step 2: combine certification status + popularity + freshness into a ranking score; show certified-and-popular tables first. Step 3: surface 'related datasets' using lineage and co-query patterns ('analysts who used orders.fact also used customers.dim'). Step 4: integrate into Slack and the BI tool — search needs to happen where the question is asked. Step 5: track discovery success metrics weekly: median time-to-trusted-dataset, % of questions resolved without engineering escalation, % of queries hitting non-certified tables.
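Steps 1 and 2 amount to a scoring function over query-log and catalog metadata. A minimal sketch — the weights, thresholds, and field names are assumptions to tune against your own logs, not a reference implementation:

```python
import math
from datetime import date

def ranking_score(table, today=date(2024, 6, 1)):
    """Blend certification, popularity, and freshness into one score.
    Weights are illustrative starting points, not tuned values."""
    # Popularity: log-scale 30-day query count, so a 10,000-query
    # table doesn't drown out every other signal.
    popularity = math.log1p(table["queries_30d"])
    # Freshness: decay by days since the last successful load.
    days_stale = (today - table["last_loaded"]).days
    freshness = 1.0 / (1.0 + days_stale)
    # Certification: a strong, discrete boost for certified tables.
    certified = 2.0 if table["certified"] else 0.0
    return certified + 0.5 * popularity + freshness

tables = [
    {"name": "finance.revenue_certified", "queries_30d": 800,
     "certified": True, "last_loaded": date(2024, 5, 31)},
    {"name": "scratch.revenue_tmp", "queries_30d": 3,
     "certified": False, "last_loaded": date(2024, 1, 2)},
]
for t in sorted(tables, key=ranking_score, reverse=True):
    print(t["name"])  # certified-and-popular table surfaces first
```

The design choice worth keeping even if you rework the weights: certification is a discrete boost rather than a multiplier, so an uncertified table can never outrank a certified one purely on raw query volume.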

Formula

Discovery Effectiveness = Coverage × Ranking Quality × Workflow Integration. Median time-to-trusted-dataset is the operational KPI; % of questions answered without escalation is the strategic KPI.
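Both KPIs fall out of lightweight instrumentation: log one event per analyst question, with the elapsed time until a certified table was queried and whether engineering had to step in. A sketch with made-up numbers:

```python
from statistics import median

# One event per analyst question: minutes from question to querying a
# trusted (certified) table, and whether it escalated to engineering.
# All values are illustrative.
events = [
    {"minutes_to_trusted": 4,   "escalated": False},
    {"minutes_to_trusted": 25,  "escalated": False},
    {"minutes_to_trusted": 90,  "escalated": True},
    {"minutes_to_trusted": 12,  "escalated": False},
    {"minutes_to_trusted": 180, "escalated": True},
]

# Operational KPI: median time-to-trusted-dataset.
median_ttd = median(e["minutes_to_trusted"] for e in events)
# Strategic KPI: share of questions resolved without escalation.
no_escalation = 1 - sum(e["escalated"] for e in events) / len(events)

print(f"median time-to-trusted-dataset: {median_ttd} min")
print(f"resolved without escalation: {no_escalation:.0%}")
```

With these five sample events the median is 25 minutes ("average" on the benchmark tiers below) and 60% of questions resolve without escalation — the weekly tracking from Step 5 is just this computation over a rolling window.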

In Practice

Airbnb's Dataportal was the canonical internal discovery tool. They built it in 2018 because their analyst hiring was outpacing their ability to onboard people to the data warehouse — new hires were spending weeks in Slack asking 'which table has bookings?' Dataportal ranked datasets by query frequency from the warehouse logs, surfaced owners and sample queries, and turned dataset discovery from a 2-week scavenger hunt into a 5-minute search. Spotify built Lexikon for the same reason. Lyft built and open-sourced Amundsen (now a Linux Foundation AI & Data project), which Airbnb later adopted. The pattern: every large data org reaches a scaling crisis around 50-100 analysts where discovery becomes the bottleneck, and they all build essentially the same product.

Pro Tips

  • 01: Your warehouse query logs are the highest-signal ranking input you have. Tables queried 1,000+ times per month are real; tables queried 3 times per month are museum exhibits. Most catalogs ignore this signal because they treat metadata as static — the ones that win use query logs as the popularity oracle.

  • 02: Certification + popularity is a 2D ranking that solves 80% of discovery problems. Show certified-and-popular tables at the top, certified-but-niche tables next, popular-but-uncertified tables (with a warning badge) third, and everything else last. Analysts naturally gravitate to the right places without policy enforcement.

  • 03: Add 'top queries on this table' to every dataset detail page. The single most useful discovery aid is showing the 5-10 most common SQL queries other analysts have run on this table — it teaches new users not just where the data is but how it's actually used. Mode, Hex, and Atlan all do this. It cuts onboarding time roughly in half.
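Tip 02's four-bucket ranking reduces to a sort key. A sketch — the popularity threshold and table data are illustrative assumptions:

```python
def discovery_tier(table):
    """The 2x2 certification-by-popularity ranking: lower tier sorts first."""
    popular = table["queries_30d"] >= 100  # threshold is an assumption
    if table["certified"] and popular:
        return 0  # certified and popular: top of results
    if table["certified"]:
        return 1  # certified but niche
    if popular:
        return 2  # popular but uncertified: show a warning badge
    return 3      # everything else

results = [
    {"name": "scratch.rev_tmp",         "certified": False, "queries_30d": 2},
    {"name": "finance.revenue",         "certified": True,  "queries_30d": 950},
    {"name": "marketing.rev_attrib",    "certified": False, "queries_30d": 400},
    {"name": "finance.revenue_archive", "certified": True,  "queries_30d": 8},
]
# Sort by tier, breaking ties within a tier by popularity, descending.
results.sort(key=lambda t: (discovery_tier(t), -t["queries_30d"]))
print([t["name"] for t in results])
# ['finance.revenue', 'finance.revenue_archive',
#  'marketing.rev_attrib', 'scratch.rev_tmp']
```

Because the tier dominates the sort key, an uncertified table can never outrank a certified one no matter how heavily it is queried — the "gravitate without policy enforcement" effect the tip describes.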

Myth vs Reality

Myth

Better search keywords solve discovery

Reality

Search is necessary but insufficient. The hard problem isn't matching the keyword 'revenue' to a table — it's deciding which of the 47 tables containing the word 'revenue' is the one you should actually use. That's a ranking and trust problem, not a keyword problem. Tools that win on discovery win on ranking, not on lexical matching.

Myth

AI/LLM-powered semantic search is the future of discovery

Reality

Semantic search helps at the margins. But analysts don't usually fail at finding tables that match their query semantically — they fail at deciding which match to trust. Trust signals (certification, ownership, freshness, popularity) are doing more work than semantic understanding. LLM-powered descriptions are useful but they don't fix the trust problem; they just make the inventory more navigable.

Try it

Run the numbers.

Pressure-test the concept against your own knowledge by answering the challenge.


Knowledge Check

An analyst types 'monthly recurring revenue' into the catalog search bar. The system returns 12 tables, all containing those words somewhere in column names or descriptions. Which ranking signal will most reliably surface the 'right' table to the top?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets — not absolutes.

Median Time-to-Trusted-Dataset (internal benchmarks from Atlan, Amundsen, and OpenMetadata customer telemetry)

  • Best-in-class: < 2 minutes
  • Good: 2-15 minutes
  • Average: 15-60 minutes
  • Poor: 1-4 hours
  • Broken: days (Slack archaeology)

Source: https://www.amundsen.io/amundsen/

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.


Airbnb (Dataportal / Amundsen)

2018-present

Outcome: success

Airbnb built Dataportal in 2018 to fix a scaling crisis: analyst hiring outpaced the org's ability to onboard people to the warehouse. New hires were spending weeks pinging seniors in Slack to find datasets. Dataportal ranked datasets by query frequency pulled from warehouse logs, showed owners and sample queries, and integrated with Slack. Time-to-first-trusted-dataset for new analysts dropped from weeks to minutes. Lyft open-sourced the same pattern as Amundsen, which was adopted by Square, ING, Workday, and dozens of others. The lesson: large data orgs hit the same discovery scaling wall around 50-100 analysts and converge on essentially the same solution.

Catalyst: Analyst hiring outpaced onboarding
Primary Ranking Signal: Warehouse query frequency
Open-Source Counterpart: Amundsen (Linux Foundation)
Notable Adopters: Lyft, Square, ING, Workday

Query frequency from the warehouse is the highest-signal ranking input most catalogs ignore. Use it.


OpenMetadata

2021-present

Outcome: success

OpenMetadata (founded by ex-Uber data platform engineers) became one of the leading open-source data discovery platforms by combining auto-crawled metadata with usage-based ranking and a strong API. Customers like Slack, Stripe, Mercedes-Benz, and others have adopted it for self-hosted discovery. The product's bet — that ranking and lineage matter more than an enterprise-style governance UI — has been validated by its adoption velocity. The trade-off versus commercial alternatives is integration depth (Slack, BI tools), which OpenMetadata customers typically build themselves.

Founders: Ex-Uber data platform team
Primary Bet: Ranking + lineage > governance UI
Notable Adopters: Stripe, Slack, Mercedes-Benz
Trade-off: Integration depth requires in-house build

Open-source discovery works when you have engineering capacity to integrate with your stack. It fails for teams that need 'works out of the box' on day one.



Beyond the concept

Turn Data Discovery into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
