Medallion Architecture Explained: The Bronze-Silver-Gold Pattern for Data Lakehouses
Sponsored
Connect with 50,000+ businesses seeking data consulting services — See advertising options
Medallion architecture is a data design pattern that organizes a data lakehouse into three progressively refined layers: Bronze (raw), Silver (cleansed), and Gold (business-ready). Originally formalized by Databricks for Delta Lake environments, it has since become the dominant organizing pattern for modern data platforms regardless of the underlying storage technology.
The pattern solves a problem that data teams have wrestled with since the earliest data warehouses: how do you maintain a trustworthy, auditable record of raw data while simultaneously providing clean, fast, business-ready datasets to downstream consumers? Medallion architecture answers that question with a clear separation of concerns — each layer has a single, well-defined purpose, and data flows directionally from one layer to the next.
The Bronze Layer: Raw Data, Untouched
The Bronze layer is your system of record. It contains data exactly as it arrived from source systems — no transformation, no filtering, no deduplication. The only processing that happens in Bronze is the mechanics of landing data: parsing file formats, writing to the storage layer, and recording ingestion metadata (arrival timestamp, source system, file name, batch ID).
Why preserve raw data with no changes? Three reasons:
- Auditability. When a data quality issue surfaces six months from now, you need to be able to prove exactly what your system received and when. Transformed data cannot do this.
- Reprocessability. If your Silver or Gold logic changes — because of a bug fix, a business rule change, or a new analytical requirement — you need a stable foundation to reprocess from. Bronze is that foundation.
- Schema evolution tolerance. Source systems change their schemas. If you transform on the way in, schema changes break your pipeline. If you land raw first, you can handle schema evolution in the transformation layer where it's easier to manage.
In practice, Bronze tables often have messy schemas: columns that are typed as strings because the source sends mixed types, nullable fields everywhere, duplicate rows, records with missing required fields. That's fine. Bronze's job is not to be clean — it's to be complete and accurate.
Technical implementation: Bronze data is typically stored as Parquet or Delta files in cloud object storage (S3, ADLS, GCS), partitioned by ingestion date. For streaming sources, Bronze tables are often append-only. For CDC sources, you may include both insert and delete records, handling deduplication downstream in Silver.
-- Bronze: land exactly what arrived
CREATE TABLE bronze.orders_raw (
_ingested_at TIMESTAMP,
_source_file STRING,
order_id STRING, -- string, not int — source sends both
customer_id STRING,
order_total STRING, -- sometimes "$12.50", sometimes "12.50"
status STRING,
created_at STRING -- ISO string, timezone inconsistent
)
USING DELTA
PARTITIONED BY (DATE(_ingested_at));
The Silver Layer: Cleansed, Conformed, Enriched
Silver is where the real data engineering work happens. The Silver layer takes Bronze data and applies the transformations that make it reliable and consistent: type casting, deduplication, null handling, business rule application, and joins that conform data from multiple source systems into a unified model.
A Silver table for orders might:
- Cast
order_totalfrom string to decimal, stripping currency symbols - Parse
created_atinto a proper timestamp with a consistent timezone (UTC) - Deduplicate records using
order_idas the primary key, keeping the latest version - Enrich with customer data joined from the
silver.customerstable - Apply validity flags for records that fail business rules (negative order totals, invalid status codes)
Silver tables should be normalized, not aggregated. They represent entities and events in their cleansed form, not yet shaped for any specific business question. This is important: the moment you start building Silver tables for a specific use case, you've started building Gold, and you'll eventually have a mess of purpose-built tables that can't be reused.
-- Silver: clean, typed, deduplicated
-- dbt model: models/silver/orders.sql
WITH ranked AS (
SELECT
CAST(order_id AS BIGINT) AS order_id,
CAST(customer_id AS BIGINT) AS customer_id,
CAST(REPLACE(order_total, '$', '') AS DECIMAL(12,2)) AS order_total_usd,
CAST(status AS STRING) AS status,
TO_UTC_TIMESTAMP(created_at, 'America/Chicago') AS created_at_utc,
_ingested_at,
ROW_NUMBER() OVER (
PARTITION BY order_id ORDER BY _ingested_at DESC
) AS rn
FROM {{ source('bronze', 'orders_raw') }}
WHERE order_id IS NOT NULL
)
SELECT * EXCEPT (rn)
FROM ranked
WHERE rn = 1
Silver is the layer that most data consumers should never see. It exists to serve the Gold layer, not BI tools or ad hoc analysts. If you find analysts querying Silver tables directly, it's usually a sign that your Gold layer is incomplete — they've given up waiting and gone to the raw data themselves.
The Gold Layer: Business-Ready, Aggregated, Consumer-Facing
Gold tables are purpose-built for consumption. They answer specific business questions, are optimized for query performance, and use business terminology — not engineering terminology. Where Silver has a orders table, Gold might have daily_revenue_by_channel, customer_lifetime_value, and product_return_rate.
Gold tables are typically denormalized. Joins are expensive for analysts and BI tools, so Gold tables pre-join the data into wide, flat tables that can be sliced and filtered without additional logic. A Gold table might contain 50 columns that span what are technically three or four normalized entities in Silver.
Gold tables are often organized by domain:
gold.finance— revenue, margins, payment reconciliationgold.marketing— campaign performance, customer acquisition, attributiongold.operations— fulfillment rates, SLAs, inventory turnsgold.executive— company-level KPIs, board metrics
There can be multiple Gold tables serving the same underlying Silver data. A single silver.orders table might power three different Gold tables: one for the finance team (focused on revenue recognition), one for operations (focused on fulfillment and returns), and one for the executive dashboard (focused on GMV and growth rate). This is correct behavior — Gold is where you optimize for consumer needs, not for data normalization.
Implementing Medallion Architecture with dbt
dbt is the most natural implementation tool for the transformation layers (Silver and Gold) because its model-based approach maps directly to the medallion pattern. A well-structured dbt project for medallion looks like this:
models/
bronze/ # sources only — no models, just source.yml declarations
silver/ # cleansing and conforming models
orders.sql
customers.sql
products.sql
gold/
finance/
daily_revenue.sql
order_margins.sql
marketing/
customer_acquisition.sql
campaign_performance.sql
A few implementation choices that matter:
- Materialize Silver as tables, not views. Views recompute on every query, which is expensive at the scale Silver operates at. Incremental tables (using dbt's incremental materialization) process only new records, which reduces compute cost dramatically.
- Use
unique_keyin incremental models. Incremental models without a unique key will produce duplicates when records are updated at the source. Define your unique key and usemergeordelete+insertstrategy explicitly. - Add dbt tests at the Silver boundary. Primary key uniqueness, not-null constraints on required fields, and referential integrity checks should all run on Silver models. Catching bad data at Silver prevents it from poisoning Gold.
When Medallion Architecture Is the Right Choice
Medallion architecture is well-suited to organizations that:
- Have multiple source systems that need to be unified into a consistent data model
- Need to support both operational reporting (relatively fresh data) and analytical workloads (historical aggregations)
- Have compliance or audit requirements that demand a raw data archive
- Are operating at scale where rebuilding from source on every run is not feasible
- Use or are migrating to a data lakehouse platform (Databricks, Delta Lake, Apache Iceberg, Hudi)
When Medallion Architecture Is Overkill
Not every data problem needs three layers. Teams that are over-engineering their stack often do so by applying medallion architecture to situations that don't warrant it:
- Single source, simple use case. If you have one Postgres database and you want dashboards, a staging + marts pattern in dbt is sufficient. Adding Bronze and Silver layers adds operational complexity without meaningful benefit.
- Small data volumes. If your entire dataset fits comfortably in a SQL warehouse and query performance isn't a concern, the separation of Gold and Silver is unnecessary overhead.
- Startup pre-product-market fit. The cost of maintaining three layers is unjustified before you know what questions you'll be asking of your data. Build the minimum viable data layer first.
Common Mistakes in Medallion Implementations
These are the failure modes we see most often when teams adopt medallion architecture:
Skipping Bronze. It seems wasteful to store raw data when you know you'll transform it. But teams that skip Bronze always regret it the first time they need to audit a data discrepancy, reprocess historical data with new logic, or recover from a transformation bug. Bronze is cheap insurance.
Putting business logic in Bronze. Any transformation — even something that seems trivial, like filtering out test records — belongs in Silver, not Bronze. The moment you apply business logic in Bronze, you've lost the raw archive you were trying to preserve.
Building Gold tables that only one person uses. Gold tables require maintenance. If a Gold table is only ever queried by one analyst's dashboard, the operational cost of maintaining it may exceed the benefit. Consolidate Gold where possible, and deprecate unused tables aggressively.
Not versioning schema changes. When a Silver or Gold table schema changes, downstream consumers break. Use a schema registry or at minimum maintain a changelog, and communicate breaking changes before they hit production.
Medallion Architecture and the Modern Data Stack
Medallion architecture fits naturally into the modern data stack. Bronze ingestion is handled by tools like Fivetran, Airbyte, or custom Spark jobs. Silver and Gold transformations run in dbt on top of Snowflake, BigQuery, Databricks, or Redshift. Orchestration runs in Airflow or Dagster. Observability tools like Monte Carlo or Soda monitor data quality at the Silver boundary.
One nuance worth calling out: Databricks' implementation uses Delta Lake tables throughout all three layers, which enables features like time travel (querying a table as it existed at a previous point in time) and ACID transactions across layers. If you're in a Databricks environment, these features make medallion architecture even more powerful — time travel alone eliminates many of the reprocessing headaches that simpler architectures face.
Outside of Databricks, Apache Iceberg has emerged as the strongest open-source alternative for lakehouse table formats, and its adoption is accelerating across AWS (S3 Tables), Snowflake (Iceberg tables), and standalone deployments.
Designing Your Data Architecture?
Practical Data Work helps companies design and implement modern data architectures — from initial platform selection through production deployment. Free architecture review, no obligation.
Get a Free Architecture Review