Wed Oct 30 2024

Understanding the Medallion Architecture: A comprehensive guide with a use case

Alexandros Oikonomidis

Data management is an increasingly important topic, and organizations are constantly seeking ways to optimize their data processes for better efficiency and reliability. Achieving this depends on choosing the right data architecture. One architecture that has gained significant attention in recent years is the Medallion Architecture. It is a multi-layered (N-tier) data design that appears in the IT world under several names and variants, such as the raw→validated→enriched, the raw→cleaned→model, and the development→acceptance→production data design patterns, among many others.

This guide aims to explain the Medallion Architecture in a way that is both accessible and technically robust, so that it is useful to technical and non-technical readers alike: BI/data analysts, data scientists, data/cloud/AI/ML engineers, IT leaders, tech managers, and other stakeholders in the tech industry.

What is the Medallion Architecture?

The Medallion Architecture is a data design that organizes data into different layers, often referred to as the "bronze," "silver," and "gold" layers. Each layer serves a specific purpose in the data lifecycle, helping to systematically manage data quality, accessibility, and usability. The Medallion Architecture integrates well with the Lakehouse Architecture too.

Bronze Layer: Raw data ingestion

  • Purpose: To capture and store raw, unprocessed data from various sources.
  • Description: The bronze layer is the foundation of the Medallion Architecture. It ingests raw data in its native format from sources like databases, APIs, and IoT devices. This layer acts as the landing zone for all incoming data, preserving its original structure and content; cleaning steps such as removing duplicates are left to the silver layer.
  • Inclusions: Raw logs, unstructured data, streaming data and batch data.
  • Example of an application from everyday life: Storing raw taxi ride data from New York City, including fields like pickup and drop-off locations, timestamps, fare amounts, and passenger counts.
  • Who can work with it: Data engineers, who are responsible for ingesting raw data from various sources and ensuring it is stored in its native format; cloud engineers, who help with setting up and maintaining the cloud infrastructure needed to store and process large volumes of raw data; data analysts, who may access raw data for initial exploration and basic analysis.

Silver Layer: Cleansed and enriched data

  • Purpose: To clean, transform, and enrich data for better quality and usability.
  • Description: The silver layer processes data from the bronze layer by handling missing values and applying business logic, including joining data from different sources. This stage creates a refined dataset that is more consistent and reliable, making it suitable for analytical purposes. The silver version of the data can also be used by other teams, such as data, AI, and ML engineers and data scientists.
  • Inclusions: Cleansed data, joined tables, and datasets with basic transformations applied.
  • Example: Cleaning taxi ride data by removing duplicates and filling in missing passenger counts with default values.
  • Who can work with it: Data engineers who clean, transform, and enrich data, removing duplicates and handling missing values to create a consistent dataset; data scientists as they begin preliminary analysis and feature engineering on cleansed data; ML engineers as they prepare data for training machine learning models by applying basic transformations and ensuring data consistency; data analysts, who can use the refined data for more detailed analysis and reporting.

Gold Layer: Business-ready data

  • Purpose: To provide highly processed data ready for business intelligence (BI), analytics and machine learning.
  • Description: The gold layer contains data that has been aggregated, summarized, and structured to support specific business use cases. This layer delivers high-quality data ready for the end-users, enabling them to generate insights and make data-driven decisions.
  • Inclusions: Aggregated data, key performance indicators (KPIs), business metrics and features for ML models.
  • Example: Aggregating taxi ride data to calculate total trips per day, average fare per trip, and other relevant metrics.
  • Who can work with it: Business analysts, who can use the aggregated and processed data to produce business insights, KPIs, reports, and other business metrics; executives and managers, who can rely on dashboards and reports generated from this layer for decision-making; data analysts, who can use the business-ready data for reports, dashboards, and detailed analysis; data scientists, who can perform advanced analytics and predictive modeling on refined datasets; AI/ML engineers, who can develop, optimize, and deploy AI and machine learning models on structured, highly processed, high-quality data; BI tools (e.g., Power BI), which access this data to provide visualizations and insights for various stakeholders.

Why use the Medallion Architecture?

  • Data quality management: By segregating data into layers, organizations can ensure data quality at each stage before it moves to the next layer.
  • Flexibility: The layers give teams a consistent way to structure data across diverse data environments, and the output of each layer can be reused by multiple downstream consumers. This reuse also supports maintainability.
  • Enhanced governance: The layered approach simplifies data governance and compliance management. For example, bronze and silver layers are accessible among data team members, but business users may only have access to the gold layer for business insights.
  • Improved data lineage: It provides clear visibility into data transformation processes, aiding in tracking and auditing.
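The governance example above (data team members see bronze and silver, business users see only gold) can be sketched as a simple access check. This is a minimal illustration in plain Python, not a Databricks feature: the role names and the LAYER_ACCESS mapping are hypothetical, and granting the data team access to gold as well is an assumption.

```python
# Hypothetical role-to-layer mapping; role names are illustrative only.
# Assumption: the data team may also read gold, business users only gold.
LAYER_ACCESS = {
    "data_team": {"bronze", "silver", "gold"},
    "business_user": {"gold"},
}


def can_read(role: str, layer: str) -> bool:
    """Return True if the given role may read the given layer."""
    # Unknown roles get an empty set, so they are denied by default.
    return layer in LAYER_ACCESS.get(role, set())


print(can_read("data_team", "silver"))      # True
print(can_read("business_user", "silver"))  # False
print(can_read("business_user", "gold"))    # True
```

In a real platform this rule would normally be enforced by the catalog or storage permissions rather than by application code; the sketch only makes the layer-based policy explicit.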

When to use the Medallion Architecture?

  • Large volumes of data: The Medallion Architecture is ideal for organizations dealing with substantial data from multiple sources. For instance, companies handling terabytes, or even petabytes, of data across various departments can benefit from its structured approach.
  • Stringent data quality and governance measures: This applies where high data quality and governance are essential, such as in healthcare, finance, or compliance-driven industries. High-quality data means accurate, complete, and timely data that adheres to defined standards and policies.
  • Scalability: The Medallion Architecture is suitable for businesses aiming for scalable data architectures to support advanced analytics and machine learning. It helps organize vast amounts of raw data, such as data from e-commerce platforms, into progressive layers of quality (bronze, silver, gold). This ensures that data is clean, structured, and ready for advanced use cases, such as real-time recommendations or predictive analytics.

Implementing the Medallion Architecture: A practical use case

To demonstrate the Medallion Architecture, we will use the New York Taxi dataset 2020, a publicly available dataset containing detailed information about taxi rides in New York City. We'll process this data using PySpark on Databricks, a platform for big data and AI workflows. We aim to calculate the maximum trip duration, the average trip distance, and the minimum fare amount for rides with two passengers in December 2020.

Why use Databricks with the Medallion Architecture?

  • Unified analytics platform: Databricks is built on Apache Spark, providing a unified platform for data engineering, data science, and ML.
  • Scalability: Databricks scales effortlessly to handle large datasets, making it ideal for implementing the Medallion Architecture.
  • Collaboration: Databricks' collaborative environment supports multiple users working together, enhancing productivity and innovation.

Step-by-step implementation of the Medallion Architecture

Step 1: Setting up the environment

First, we need to set up a Spark session in Databricks. This session acts as the entry point for using Spark functionalities. For example, we create a session called “MedallionArchitecture”.
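A minimal sketch of this step is shown here. The helper name get_spark is hypothetical; the session name comes from the text. Note that in a Databricks notebook a session is normally already available as spark, and getOrCreate() returns that existing session rather than starting a new one.

```python
def get_spark(app_name="MedallionArchitecture"):
    """Return a SparkSession, reusing the active one if it exists."""
    # Imported inside the function so this file can be loaded even where
    # PySpark is not installed; Databricks clusters ship with it.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


# Usage in a notebook or job:
# spark = get_spark()
```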

Step 2: Ingesting raw data (Bronze Layer)

Next, we ingest the raw data into the bronze layer. This involves reading the data from its source and storing it in a format that Spark can process efficiently (here, the Delta format). Assuming that our source data is stored in an on-premises database, we access the database to ingest the data into the Lakehouse within Databricks.
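The ingestion could look roughly like the sketch below, which reads a table through Spark's JDBC data source and appends it unchanged to a Delta table. Several details are assumptions, since the article does not name them: the database type (PostgreSQL is used here purely as an example), the helper names jdbc_options and ingest_to_bronze, and every host, table, and path value in the usage comment.

```python
def jdbc_options(host, port, database, table, user, password):
    """Build reader options for Spark's JDBC data source."""
    return {
        # PostgreSQL is an assumption; other databases use a different
        # URL prefix and driver class.
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def ingest_to_bronze(spark, options, bronze_path):
    """Copy the source table as-is into a Delta table at bronze_path."""
    raw_df = spark.read.format("jdbc").options(**options).load()
    # Append without transforming, so bronze keeps the source content.
    raw_df.write.format("delta").mode("append").save(bronze_path)
    return raw_df


# Usage (all values are placeholders):
# opts = jdbc_options("db.example.internal", 5432, "taxi",
#                     "public.trips", "reader", "<password>")
# ingest_to_bronze(spark, opts, "/mnt/lake/bronze/nyc_taxi")
```

Appending is one possible choice for the write mode; a pipeline that reloads the full table on every run would more likely overwrite instead. In practice the password would come from a secret store rather than from code.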

Step 3: Processing raw data (Silver Layer)

To move from the bronze to the silver layer, we process and clean the data. In general, this step includes filtering out invalid records, handling missing values, and performing basic data transformations. In our case, we remove duplicate rows, drop rows with missing values, and filter out records with negative fare amounts.
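The cleaning rules for this step can be sketched as a small function over a Spark DataFrame. The function name bronze_to_silver and the paths in the usage comment are hypothetical, and the column name fare_amount is assumed to match the taxi dataset's schema.

```python
def bronze_to_silver(bronze_df):
    """Apply the cleaning rules described above to a Spark DataFrame."""
    return (
        bronze_df
        .dropDuplicates()            # remove exact duplicate rows
        .dropna()                    # drop rows with any missing value
        .filter("fare_amount >= 0")  # discard negative fares
    )


# Usage (paths are placeholders):
# bronze_df = spark.read.format("delta").load("/mnt/lake/bronze/nyc_taxi")
# silver_df = bronze_to_silver(bronze_df)
# silver_df.write.format("delta").mode("overwrite").save("/mnt/lake/silver/nyc_taxi")
```

Dropping every row that has any missing value is the simplest policy and may discard more data than intended; filling defaults for selected columns is a common alternative.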

Step 4: Aggregating data for analytics (Gold Layer)

Finally, to move from the silver to the gold layer, we apply more complex transformations and aggregate the data to make it ready for business analysis. This step involves grouping, summarizing, and calculating key metrics as required. For our NYC Taxi context, we filter the data to include only rides with two passengers in December 2020 and then calculate the maximum trip duration, average trip distance, and minimum fare amount. We also look up information about the taxi companies in our metadata table, so that the data is enriched with attributes of interest to the business.
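A possible shape for the aggregation is sketched below using Spark SQL expressions on a DataFrame. The function name silver_to_gold is hypothetical, and the column names (tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, fare_amount) are assumed to follow the yellow-taxi schema; they are not stated in the article. The metadata lookup for taxi companies is left out because the article does not describe that table.

```python
def silver_to_gold(silver_df):
    """Aggregate two-passenger December 2020 rides into one summary row."""
    december_two_passengers = silver_df.filter(
        "passenger_count = 2 "
        "AND tpep_pickup_datetime >= '2020-12-01' "
        "AND tpep_pickup_datetime < '2021-01-01'"
    )
    # Aggregate expressions without a GROUP BY produce a single row.
    return december_two_passengers.selectExpr(
        # Duration in minutes, derived from the two timestamps.
        "max((unix_timestamp(tpep_dropoff_datetime) "
        "- unix_timestamp(tpep_pickup_datetime)) / 60) AS max_duration_min",
        "avg(trip_distance) AS avg_trip_distance",
        "min(fare_amount) AS min_fare_amount",
    )


# Usage (paths are placeholders):
# silver_df = spark.read.format("delta").load("/mnt/lake/silver/nyc_taxi")
# gold_df = silver_to_gold(silver_df)
# gold_df.write.format("delta").mode("overwrite").save("/mnt/lake/gold/nyc_taxi_summary")
```

The date filter uses a half-open range on the pickup time, so a ride that starts in December and ends in January still counts as a December ride.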

Final Thoughts

This introduction provides a clear understanding of the Medallion Architecture and its benefits. It also showcases practical implementation steps using PySpark and Databricks, making it accessible for both technical professionals and non-technical leaders.

The Medallion Architecture offers a powerful framework for managing large-scale data, ensuring data quality and a clear data structure. This in turn enables further activities such as advanced analytics and machine learning. By leveraging Databricks and PySpark, organizations can implement this architecture efficiently and effectively. Whether you are a technical lead or an IT manager, understanding and utilizing the Medallion Architecture can significantly enhance your data strategy, providing high-quality, accessible, and actionable data for your business needs.

By adopting this architecture, you can ensure that your organization's data processes are robust, scalable, and ready to meet the demands of modern data analytics and machine learning. With clear stages and defined processes, the Medallion Architecture transforms how data is handled, making it a cornerstone for any forward-thinking data strategy.