Medallion Architecture Explained: The Bronze-Silver-Gold Pattern for Data Lakehouses

April 10, 2025 · 2,100 words · Advertise

The Bronze Layer: Raw Data, Untouched

The Bronze layer is your system of record. It contains data exactly as it arrived from source systems — no transformation, no filtering, no deduplication. The only processing that happens in Bronze is the mechanics of landing data: parsing file formats, writing to the storage layer, and recording ingestion metadata (arrival timestamp, source system, file name, batch ID).

Why preserve raw data with no changes? Three reasons:

Auditability. When a data quality issue surfaces six months from now, you need to be able to prove exactly what your system received and when. Transformed data cannot do this.
Reprocessability. If your Silver or Gold logic changes — because of a bug fix, a business rule change, or a new analytical requirement — you need a stable foundation to reprocess from. Bronze is that foundation.
Schema evolution tolerance. Source systems change their schemas. If you transform on the way in, schema changes break your pipeline. If you land raw first, you can handle schema evolution in the transformation layer where it's easier to manage.

In practice, Bronze tables often have messy schemas: columns that are typed as strings because the source sends mixed types, nullable fields everywhere, duplicate rows, records with missing required fields. That's fine. Bronze's job is not to be clean — it's to be complete and accurate.

Technical implementation: Bronze data is typically stored as Parquet or Delta files in cloud object storage (S3, ADLS, GCS), partitioned by ingestion date. For streaming sources, Bronze tables are often append-only. For CDC sources, you may include both insert and delete records, handling deduplication downstream in Silver.

-- Bronze: land exactly what arrived
CREATE TABLE bronze.orders_raw (
  _ingested_at     TIMESTAMP,
  _source_file     STRING,
  order_id         STRING,   -- string, not int — source sends both
  customer_id      STRING,
  order_total      STRING,   -- sometimes "$12.50", sometimes "12.50"
  status           STRING,
  created_at       STRING    -- ISO string, timezone inconsistent
)
USING DELTA
PARTITIONED BY (DATE(_ingested_at));

The Silver Layer: Cleansed, Conformed, Enriched

Silver is where the real data engineering work happens. The Silver layer takes Bronze data and applies the transformations that make it reliable and consistent: type casting, deduplication, null handling, business rule application, and joins that conform data from multiple source systems into a unified model.

A Silver table for orders might:

Cast order_total from string to decimal, stripping currency symbols
Parse created_at into a proper timestamp with a consistent timezone (UTC)
Deduplicate records using order_id as the primary key, keeping the latest version
Enrich with customer data joined from the silver.customers table
Apply validity flags for records that fail business rules (negative order totals, invalid status codes)

Silver tables should be normalized, not aggregated. They represent entities and events in their cleansed form, not yet shaped for any specific business question. This is important: the moment you start building Silver tables for a specific use case, you've started building Gold, and you'll eventually have a mess of purpose-built tables that can't be reused.

-- Silver: clean, typed, deduplicated
-- dbt model: models/silver/orders.sql
WITH ranked AS (
  SELECT
    CAST(order_id AS BIGINT)                              AS order_id,
    CAST(customer_id AS BIGINT)                          AS customer_id,
    CAST(REPLACE(order_total, '$', '') AS DECIMAL(12,2)) AS order_total_usd,
    CAST(status AS STRING)                               AS status,
    TO_UTC_TIMESTAMP(created_at, 'America/Chicago')      AS created_at_utc,
    _ingested_at,
    ROW_NUMBER() OVER (
      PARTITION BY order_id ORDER BY _ingested_at DESC
    ) AS rn
  FROM {{ source('bronze', 'orders_raw') }}
  WHERE order_id IS NOT NULL
)
SELECT * EXCEPT (rn)
FROM ranked
WHERE rn = 1

Silver is the layer that most data consumers should never see. It exists to serve the Gold layer, not BI tools or ad hoc analysts. If you find analysts querying Silver tables directly, it's usually a sign that your Gold layer is incomplete — they've given up waiting and gone to the raw data themselves.

The Gold Layer: Business-Ready, Aggregated, Consumer-Facing

Gold tables are purpose-built for consumption. They answer specific business questions, are optimized for query performance, and use business terminology — not engineering terminology. Where Silver has a orders table, Gold might have daily_revenue_by_channel, customer_lifetime_value, and product_return_rate.

Gold tables are typically denormalized. Joins are expensive for analysts and BI tools, so Gold tables pre-join the data into wide, flat tables that can be sliced and filtered without additional logic. A Gold table might contain 50 columns that span what are technically three or four normalized entities in Silver.

Gold tables are often organized by domain:

gold.finance — revenue, margins, payment reconciliation
gold.marketing — campaign performance, customer acquisition, attribution
gold.operations — fulfillment rates, SLAs, inventory turns
gold.executive — company-level KPIs, board metrics

There can be multiple Gold tables serving the same underlying Silver data. A single silver.orders table might power three different Gold tables: one for the finance team (focused on revenue recognition), one for operations (focused on fulfillment and returns), and one for the executive dashboard (focused on GMV and growth rate). This is correct behavior — Gold is where you optimize for consumer needs, not for data normalization.

Implementing Medallion Architecture with dbt

dbt is the most natural implementation tool for the transformation layers (Silver and Gold) because its model-based approach maps directly to the medallion pattern. A well-structured dbt project for medallion looks like this:

models/
  bronze/           # sources only — no models, just source.yml declarations
  silver/           # cleansing and conforming models
    orders.sql
    customers.sql
    products.sql
  gold/
    finance/
      daily_revenue.sql
      order_margins.sql
    marketing/
      customer_acquisition.sql
      campaign_performance.sql

A few implementation choices that matter:

Materialize Silver as tables, not views. Views recompute on every query, which is expensive at the scale Silver operates at. Incremental tables (using dbt's incremental materialization) process only new records, which reduces compute cost dramatically.
Use unique_key in incremental models. Incremental models without a unique key will produce duplicates when records are updated at the source. Define your unique key and use merge or delete+insert strategy explicitly.
Add dbt tests at the Silver boundary. Primary key uniqueness, not-null constraints on required fields, and referential integrity checks should all run on Silver models. Catching bad data at Silver prevents it from poisoning Gold.

When Medallion Architecture Is the Right Choice

Medallion architecture is well-suited to organizations that:

Have multiple source systems that need to be unified into a consistent data model
Need to support both operational reporting (relatively fresh data) and analytical workloads (historical aggregations)
Have compliance or audit requirements that demand a raw data archive
Are operating at scale where rebuilding from source on every run is not feasible
Use or are migrating to a data lakehouse platform (Databricks, Delta Lake, Apache Iceberg, Hudi)

When Medallion Architecture Is Overkill

Not every data problem needs three layers. Teams that are over-engineering their stack often do so by applying medallion architecture to situations that don't warrant it:

Single source, simple use case. If you have one Postgres database and you want dashboards, a staging + marts pattern in dbt is sufficient. Adding Bronze and Silver layers adds operational complexity without meaningful benefit.
Small data volumes. If your entire dataset fits comfortably in a SQL warehouse and query performance isn't a concern, the separation of Gold and Silver is unnecessary overhead.
Startup pre-product-market fit. The cost of maintaining three layers is unjustified before you know what questions you'll be asking of your data. Build the minimum viable data layer first.

Common Mistakes in Medallion Implementations

These are the failure modes we see most often when teams adopt medallion architecture:

Skipping Bronze. It seems wasteful to store raw data when you know you'll transform it. But teams that skip Bronze always regret it the first time they need to audit a data discrepancy, reprocess historical data with new logic, or recover from a transformation bug. Bronze is cheap insurance.

Putting business logic in Bronze. Any transformation — even something that seems trivial, like filtering out test records — belongs in Silver, not Bronze. The moment you apply business logic in Bronze, you've lost the raw archive you were trying to preserve.

Building Gold tables that only one person uses. Gold tables require maintenance. If a Gold table is only ever queried by one analyst's dashboard, the operational cost of maintaining it may exceed the benefit. Consolidate Gold where possible, and deprecate unused tables aggressively.

Not versioning schema changes. When a Silver or Gold table schema changes, downstream consumers break. Use a schema registry or at minimum maintain a changelog, and communicate breaking changes before they hit production.

Medallion Architecture and the Modern Data Stack

Medallion architecture fits naturally into the modern data stack. Bronze ingestion is handled by tools like Fivetran, Airbyte, or custom Spark jobs. Silver and Gold transformations run in dbt on top of Snowflake, BigQuery, Databricks, or Redshift. Orchestration runs in Airflow or Dagster. Observability tools like Monte Carlo or Soda monitor data quality at the Silver boundary.

One nuance worth calling out: Databricks' implementation uses Delta Lake tables throughout all three layers, which enables features like time travel (querying a table as it existed at a previous point in time) and ACID transactions across layers. If you're in a Databricks environment, these features make medallion architecture even more powerful — time travel alone eliminates many of the reprocessing headaches that simpler architectures face.

Outside of Databricks, Apache Iceberg has emerged as the strongest open-source alternative for lakehouse table formats, and its adoption is accelerating across AWS (S3 Tables), Snowflake (Iceberg tables), and standalone deployments.

Designing Your Data Architecture?

Practical Data Work helps companies design and implement modern data architectures — from initial platform selection through production deployment. Free architecture review, no obligation.

Get a Free Architecture Review