Just the tip of the Iceberg! #11171
colin-rogers-dbt
announced in
Announcements
At Coalesce 2024, we announced that we had dipped our toes into the Iceberg-filled waters by adding support for materializing your models in Snowflake’s Managed Iceberg catalog in dbt-snowflake. Today we want to share another milestone in helping the dbt Community leverage this brave new world of Iceberg through catalogs.
Setting the scene
If you are reading this, you are most likely already familiar with the concept of “catalogs” (and may even be using one without knowing it), but just in case, we will disambiguate what we mean here. Broadly speaking, when we say “catalogs,” we are referring to services like Glue Data Catalog or anything that implements Iceberg REST. These services provide a warehouse-agnostic interface for managing data in object storage.
But this is not just a story about supporting a new technology; it is about a new paradigm in which dbt is aware of catalogs and knows how best to leverage them in the context of your warehouse of choice.
In this paradigm, it’s important to bear in mind what a catalog does and does not do.
To be clear, we aren’t committing to a table format, catalog protocol, or technology. Instead, we believe the approach of writing data to a platform-agnostic storage layer and registering it with a similarly agnostic catalog service will play an important part in the Analytics Development Lifecycle (ADLC) moving forward. The technology we use to accomplish that is an implementation detail. Iceberg is just the first test of this framework; the industry will decide what to standardize on.
Getting down to business…
Our first steps in supporting Iceberg focused on extending our existing model-level configuration to support new table formats. The next step on this journey is to elevate catalogs to first-class citizens in dbt.
Doing this will allow us to support the wide world of catalogs by handling the catalog-specific DDL for each platform in our adapter. Consider Snowflake’s CREATE ICEBERG TABLE docs:
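As a rough sketch of that difference (placeholder names; the exact options live in Snowflake’s documentation), compare an Iceberg table that uses Snowflake as its catalog with one that points at an AWS Glue catalog integration:

```sql
-- Sketch only; consult Snowflake's CREATE ICEBERG TABLE docs for exact syntax.

-- Snowflake as the Iceberg catalog: columns are declared in the DDL and
-- Snowflake manages the table's metadata.
CREATE ICEBERG TABLE my_managed_table (id INT, created_at TIMESTAMP)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_external_volume'
  BASE_LOCATION = 'my_managed_table/';

-- An externally managed catalog (here, a Glue catalog integration): the schema
-- comes from the catalog, and a different set of options applies.
CREATE ICEBERG TABLE my_glue_table
  CATALOG = 'my_glue_catalog_integration'
  EXTERNAL_VOLUME = 'my_external_volume'
  CATALOG_TABLE_NAME = 'my_table';
```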
Notice that the two forms differ in which options are available. Without awareness of both catalogs, we can’t fully support either.
To do this, we will add a new `catalogs.yml` config file. This will allow users to specify their catalog integration info in a single, versioned file. It will be a top-level file, which by default dbt will expect to find at the root of the project.

Before we dive into this new file, let’s take a moment to talk about catalog integrations. You might have noticed by now that we are not talking to the catalog directly; instead, we are working with an intermediary: a catalog integration. A “catalog integration” is what your platform has built to interface with a catalog. In Snowflake, this is literally called a catalog integration, but other platforms refer to it differently; see Redshift’s external schemas as a good example. Regardless of name, they provide the interface for each platform’s integration with your catalog of choice.
With that out of the way, let’s get to our proposed spec:
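A minimal sketch of what such a `catalogs.yml` could look like, using the field names described below (the `catalog_type` value and all names are placeholders, not a final spec):

```yaml
# catalogs.yml -- illustrative sketch, not the final spec
catalogs:
  - catalog_name: my_glue_catalog            # alias used in model configs
    active_write_integration: glue_prod      # which write_integration to use
    write_integrations:
      - integration_name: glue_prod          # catalog integration created in your warehouse
        catalog_type: glue                   # placeholder enum value
        table_format: iceberg
        external_volume: analytics_external_volume
```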
Then, in your model’s configuration, specify the `catalog` field:
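For example, a model might look like the following (a sketch; `my_glue_catalog` refers to the `catalog_name` defined in the `catalogs.yml` sketch above, and the upstream ref is a hypothetical model):

```sql
-- models/my_iceberg_model.sql (illustrative sketch)
-- `catalog` is the new model-level field proposed in this post; it points at
-- the catalog_name defined in catalogs.yml.
{{
  config(
    materialized='table',
    catalog='my_glue_catalog'
  )
}}

select *
from {{ ref('stg_orders') }}  -- hypothetical upstream model
```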
What does this do?

Now when you run the above model, dbt will materialize it as an Iceberg table in S3, registered in the AWS Glue catalog.
Diving into config:

`catalogs` is a list where each catalog has three fields:

- `catalog_name`: this aliases the underlying catalog you’re interacting with and will be used in your model’s config.
- `active_write_integration`: the integration config you want to use for the next dbt invocation, chosen from the list of `write_integrations`.
- `write_integrations`: the information we need to know to work with your platform’s catalog integration.

Integration Config
Universally Required Fields

- `integration_name`: the name of the catalog integration you have set up in your warehouse.
- `catalog_type`: used by the adapter to determine the correct DDL statements for the platform, which other fields in the config are required, and how to contextually validate those fields. It will be an enum of options.

Commonly Required Fields

- `table_format`: some catalogs support multiple formats, but generally a table format is specific to a particular catalog. You can set it here; otherwise the adapter will assume a sane default (typically Iceberg) for the catalog you’re connecting to.
- `external_volume`: sometimes required for creating tables (for example, in Databricks and Snowflake); this represents the name of the platform’s abstraction over the object store (S3, GCS, etc.) being used.

Optional / Integration specific
It’s possible that as we get into the implementation or add support for other catalogs, we will need to add more configuration fields. The sketch below shows how these fields might combine for two different catalogs.
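To make the shape concrete (illustrative only: `catalog_type` values, integration names, and volume names are placeholders, not a confirmed enum):

```yaml
catalogs:
  # A catalog managed by the warehouse itself (e.g. Snowflake's built-in
  # Iceberg catalog): external_volume is required, table_format can fall back
  # to the adapter's default.
  - catalog_name: snowflake_managed
    active_write_integration: snowflake_prod
    write_integrations:
      - integration_name: snowflake_prod
        catalog_type: snowflake
        external_volume: analytics_external_volume

  # An external catalog reached through a catalog integration (e.g. AWS Glue):
  # the same universally required fields, plus an explicit table_format.
  - catalog_name: glue_catalog
    active_write_integration: glue_prod
    write_integrations:
      - integration_name: glue_prod
        catalog_type: glue
        table_format: iceberg
        external_volume: analytics_external_volume
```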
FAQ
Why not just add the configuration in `profiles.yml`?

Profiles have historically been tied to an adapter, and we don’t believe that the configuration for a catalog is necessarily tied to an adapter/profile.
Why not just add the configuration in `schema.yml`?

We took a hard look at re-using existing project configs, but we want to leave the door open to this configuration not being tied to a project. One idea, for example, is supplying catalog configuration via a flag, similar to `profiles.yml`.
Why isn’t Iceberg a new materialization?
Iceberg is a table format; it doesn’t change how we materialize a ‘table’ or any other data warehouse object, so we will not be creating a net new materialization for it.
Why might you want to define multiple write integrations if only one can be active at a time?
A common pattern for testing changes to dbt projects is to run them against a staging environment which is logically or physically isolated from your production environment. Being able to swap between different integrations makes it easier to support this pattern in dbt.
Consider that if we didn’t have this, you would likely have to use multiple env_var calls to accomplish the same thing, along the lines of the sketch below:
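A hedged sketch of that alternative (assuming, hypothetically, that Jinja and `env_var` were rendered in this file; variable names are made up):

```yaml
# Hypothetical single-integration setup: every environment-specific value has
# to be swapped by changing environment variables outside the project.
catalogs:
  - catalog_name: my_glue_catalog
    write_integrations:
      - integration_name: "{{ env_var('DBT_CATALOG_INTEGRATION_NAME') }}"
        catalog_type: glue
        external_volume: "{{ env_var('DBT_CATALOG_EXTERNAL_VOLUME') }}"
```

With `write_integrations`, you instead define both your prod and staging integrations once and flip `active_write_integration` between them.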
There are a couple of drawbacks here. First, if you want to know which integration you’re connecting to, you have to go find the environment variable configuration somewhere else, which largely defeats the point of checking the config file into your git project in the first place. Second, if you want to change your prod or staging environments, you have to remember to update multiple environment variables. In general, we would rather you have fewer things to manage and therefore fewer things that can go wrong.
What about better supporting external tables as sources or materializing your models as external tables?
Over the long term, we believe the industry will keep standardizing on catalogs as the correct storage abstraction for managing data outside of your warehouse. As such, we want to prioritize making the path to using these catalogs as simple and robust as possible.
However, we understand that this change might move at a glacial pace. In the meantime we have a package, dbt-external-tables, that already supports users in leveraging external tables.
Will dbt connect directly to catalogs, or configure connections to catalogs in my DWH?

A semi-explicit principle of dbt Core is that it operates on data warehouses/platforms, and this will continue to be true for now. Our current approach will be to not interact with your Iceberg catalog directly. Instead, we will rely on you to set up your warehouse to connect to the catalog of your choice.
We will be catalog aware in the DDL statements each platform’s adapter generates.