Data Discovery
Data Discovery is the practice — and the user experience — of letting an analyst answer 'where is the data I need?' in seconds, without pinging anyone. It's the consumption surface of the catalog. Where Data Catalog is the inventory, Discovery is the search bar, the ranking algorithm, the 'people who used this also used' suggestions, and the workflow integration that surfaces datasets in Slack or the BI tool. The honest measure is the time-to-trusted-dataset: from the moment a question forms ('what's revenue by segment?') to the moment the analyst is querying the right, certified table. Best-in-class orgs hit under 2 minutes; typical orgs sit at 30 minutes to 2 hours; broken orgs measure it in days.
The Trap
The trap is confusing 'we have search' with 'discovery works'. Most catalogs ship with full-text search that returns 200 results when an analyst types 'revenue', with no ranking signal beyond keyword match. The analyst still has to ask a senior engineer which one is real. Real discovery uses popularity (how many queries hit this table in the last 30 days), certification status, recency, and lineage proximity to ground-truth sources. The KnowMBA POV: most companies invest in cataloging metadata (the inventory) but neglect ranking and surfacing (the discovery), then wonder why nobody uses the catalog. It's like building a library with meticulous cataloging but no Dewey Decimal system, no recommendations, and no librarian — accurate but unusable.
What to Do
Treat discovery as a search/recommendation product, not a documentation problem. Step 1: instrument query logs from your warehouse — popularity is your single most useful ranking signal. Step 2: combine certification status + popularity + freshness into a ranking score; show certified-and-popular tables first. Step 3: surface 'related datasets' using lineage and co-query patterns ('analysts who used orders.fact also used customers.dim'). Step 4: integrate into Slack and the BI tool — search needs to happen where the question is asked. Step 5: track discovery success metrics weekly: median time-to-trusted-dataset, % of questions resolved without engineering escalation, % of queries hitting non-certified tables.
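Step 1 is mechanical enough to sketch. A minimal example, assuming the warehouse query history has been exported to rows with a raw SQL string and a run timestamp (the export shape and the regex-based table extraction are simplifying assumptions; real warehouses expose richer access-history views):

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical export of warehouse query history: one dict per executed query.
query_log = [
    {"sql": "SELECT segment, SUM(amount) FROM analytics.orders_fact GROUP BY 1",
     "run_at": datetime(2024, 5, 2)},
    {"sql": "SELECT * FROM analytics.orders_fact JOIN analytics.customers_dim USING (customer_id)",
     "run_at": datetime(2024, 5, 3)},
]

# Naive table-reference extraction: good enough to rank, not to parse SQL.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def table_popularity(logs, days=30, now=None):
    """Count how often each table is referenced in queries from the last `days` days."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=days)
    counts = Counter()
    for entry in logs:
        if entry["run_at"] < cutoff:
            continue
        for table in TABLE_REF.findall(entry["sql"]):
            counts[table.lower()] += 1
    return counts

print(table_popularity(query_log, now=datetime(2024, 5, 10)).most_common(10))
```

The point is not the parser: it's that a popularity count like this exists for every table and can feed the ranking score below.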
Formula
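The section above stops short of an exact formula, so here is one hedged sketch of the certification + popularity + freshness score from step 2. The weights, the log dampening, and the 30-day freshness decay are illustrative assumptions to be tuned against your own click-through and escalation data, not a standard:

```python
import math

def ranking_score(is_certified, queries_30d, days_since_update,
                  w_cert=2.0, w_pop=1.0, w_fresh=0.5):
    """Illustrative score: certified, popular, fresh tables float to the top."""
    cert = w_cert if is_certified else 0.0
    pop = w_pop * math.log1p(queries_30d)                # dampen heavy-hitter dominance
    fresh = w_fresh * math.exp(-days_since_update / 30)  # decay as the table goes stale
    return cert + pop + fresh

# A certified, heavily queried, recently updated table outranks a stale orphan.
print(ranking_score(True, 1200, 1))   # roughly 9.6
print(ranking_score(False, 3, 90))    # roughly 1.4
```

Pro Tip 02 below is the discrete version of the same idea: certification and popularity alone already produce a usable four-tier ordering.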
In Practice
Airbnb's Dataportal was the canonical internal discovery tool. They built it in 2018 because their analyst hiring was outpacing their ability to onboard people to the data warehouse — new hires were spending weeks in Slack asking 'which table has bookings?' Dataportal ranked datasets by query frequency from the warehouse logs, surfaced owners and sample queries, and turned dataset discovery from a 2-week scavenger hunt into a 5-minute search. Spotify built Lexikon for the same reason. Lyft built and open-sourced Amundsen (now a Linux Foundation project). The pattern: every large data org reaches a scaling crisis around 50-100 analysts where discovery becomes the bottleneck, and they all build essentially the same product.
Pro Tips
- 01
Your warehouse query logs are the highest-signal ranking input you have. Tables queried 1,000+ times per month are real; tables queried 3 times per month are museum exhibits. Most catalogs ignore this signal because they treat metadata as static — the ones that win use query logs as the popularity oracle.
- 02
Certification + popularity is a 2D ranking that solves 80% of discovery problems. Show certified-and-popular tables at the top, certified-but-niche tables next, popular-but-uncertified tables (with a warning badge) third, and everything else last. Analysts naturally gravitate to the right places without policy enforcement.
- 03
Add 'top queries on this table' to every dataset detail page. The single most useful discovery aid is showing the 5-10 most common SQL queries other analysts have run on this table — it teaches new users not just where the data is but how it's actually used. Mode, Hex, and Atlan all do this. It cuts onboarding time roughly in half.
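A hedged sketch of how that 'top queries on this table' panel could be computed from the same exported query history used earlier; the normalization (collapsing whitespace and literals) is a rough assumption, and production implementations usually fingerprint queries more carefully:

```python
import re
from collections import Counter, defaultdict

TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def normalize(sql):
    """Collapse whitespace and literals so near-identical queries group together."""
    sql = re.sub(r"\s+", " ", sql.strip().lower())
    sql = re.sub(r"'[^']*'", "'?'", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)     # numeric literals
    return sql

def top_queries_per_table(logs, n=5):
    """For each referenced table, return the n most common normalized queries.

    `logs` is an iterable of {"sql": str, ...} rows, as in the popularity sketch above.
    """
    per_table = defaultdict(Counter)
    for entry in logs:
        norm = normalize(entry["sql"])
        for table in TABLE_REF.findall(entry["sql"]):
            per_table[table.lower()][norm] += 1
    return {table: counts.most_common(n) for table, counts in per_table.items()}
```

Rendering the result on each dataset detail page is the cheap half of the tip; keeping it useful means re-running the aggregation on a schedule.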
Myth vs Reality
Myth
“Better search keywords solve discovery”
Reality
Search is necessary but insufficient. The hard problem isn't matching the keyword 'revenue' to a table — it's deciding which of the 47 tables containing the word 'revenue' is the one you should actually use. That's a ranking and trust problem, not a keyword problem. Tools that win on discovery win on ranking, not on lexical matching.
Myth
“AI/LLM-powered semantic search is the future of discovery”
Reality
Semantic search helps at the margins. But analysts don't usually fail at finding tables that match their query semantically — they fail at deciding which match to trust. Trust signals (certification, ownership, freshness, popularity) are doing more work than semantic understanding. LLM-powered descriptions are useful but they don't fix the trust problem; they just make the inventory more navigable.
Knowledge Check
An analyst types 'monthly recurring revenue' into the catalog search bar. The system returns 12 tables, all containing those words somewhere in column names or descriptions. Which ranking signal will most reliably surface the 'right' table to the top?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets — not absolutes.
Median Time-to-Trusted-Dataset
Internal benchmarks from Atlan, Amundsen, and OpenMetadata customer telemetry
- Best-in-class: < 2 minutes
- Good: 2-15 minutes
- Average: 15-60 minutes
- Poor: 1-4 hours
- Broken: Days (Slack archaeology)
Source: https://www.amundsen.io/amundsen/
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Airbnb (Dataportal) / Lyft (Amundsen)
2018-present
Airbnb built Dataportal in 2018 to fix a scaling crisis: analyst hiring outpaced the org's ability to onboard people to the warehouse. New hires were spending weeks pinging seniors in Slack to find datasets. Dataportal ranked datasets by query frequency pulled from warehouse logs, showed owners and sample queries, and integrated with Slack. Time-to-first-trusted-dataset for new analysts dropped from weeks to minutes. Lyft later built and open-sourced the very similar Amundsen (a Linux Foundation project), adopted by Square, ING, Workday, and dozens of others. The lesson: large data orgs hit the same discovery scaling wall around 50-100 analysts and converge on essentially the same solution.
Catalyst
Analyst hiring outpaced onboarding
Primary Ranking Signal
Warehouse query frequency
Open-Source Counterpart
Amundsen (Linux Foundation, built at Lyft)
Notable Amundsen Adopters
Square, ING, Workday
Query frequency from the warehouse is the highest-signal ranking input most catalogs ignore. Use it.
OpenMetadata
2021-present
OpenMetadata (founded by ex-Uber data platform engineers) became the leading open-source data discovery platform by combining auto-crawled metadata with usage-based ranking and a strong API. Customers like Slack, Stripe, Mercedes-Benz, and others have adopted it for self-hosted discovery. The product's bet — that ranking and lineage matter more than enterprise-style governance UI — has been validated by adoption velocity. The trade-off versus commercial alternatives is integration depth (Slack, BI tools), which OpenMetadata customers typically have to build themselves.
Founders
Ex-Uber data platform team
Primary Bet
Ranking + lineage > governance UI
Notable Adopters
Stripe, Slack, Mercedes-Benz
Trade-off
Integration depth requires in-house build
Open-source discovery works when you have engineering capacity to integrate with your stack. It fails for teams that need 'works out of the box' on day one.
Beyond the concept
Turn Data Discovery into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.