`PythonAdapter` Proposal #5758

turbo1912 · 2022-09-02T14:13:32Z

turbo1912
Sep 2, 2022

What is the problem you are trying to solve?

dbt adapters are all in one solutions that connect dbt-core with a database and a Python runtime (new!). While some platforms, like Snowflake, offer both a database and a Python runtime, a large number of databases don’t support Python. For example, there is no way for a user with a dbt-postgres adapter to work with Python models.

A more modular dbt adapter system can abstract away a lot of complexity from data architectures with multiple vendors.

Requirements

dbt-core process should not download any data into memory
Python runtimes should not re-invent how to connect to each sql database and visa versa

Proposed Solution

This post proposes a new way for dbt users to configure their Python runtimes separately from their databases. To enable this we propose a new abstraction over the BaseAdapter class, called PythonAdapter (just like SQLAdapter). This new abstraction will facilitate a modular adapter system that will allow pairing SQL-only databases with different Python runtimes. Users will be able to mix and match their Python runtime adapters with the existing SQL adapters.

The proposed solution has 3 main parts.

Specifying a Python engine in the profiles.yml
PythonAdapter interface
Minor changes to the SQLAdapter interface to facilitate Python runtimes to materialize referenced SQL models

Specifying a Python runtime in the `profiles.yml`

dbt users need a way to configure the Python runtime separately from their main database. This is done with a new YML dictionary key python_engines(alternatively called python_runtimes). Similar to default SQL targets, users can also specify a default Python target. Notice python_engines is not a required field, when omitted, dbt python models are executed by the adapter listed under the outputs field (current behavior).

test:
  target: dev-pg            #    ***** SQL target
  python_target: dev-py     #    ***** python runtime target
  outputs:
    dev-pg:
      type: postgres
      host: localhost
      port: 5432
      user: pguser
      # ...
  python_engines:
    dev-py:
      type: spark
      mode: local
      # ...

`PythonAdapter` interface/class

Each platform adapter that intends to run Python models implements the PythonAdapter class.

Adapter authors have the option to inherit from both or just one of SQLAdapter and PythonAdapter classes. For example platforms like Snowflake, that support both Python and SQL query engines, inherit from both classes, and SQL only databases like Postgres will keep using just the SQLAdapter class.

Most of the SQL specific methods and macros that are present in a SQL adapter are not relevant for a Python adapter. The PythonAdapter class consists of a few methods, the most relevantly the submit_python_job method listed below.

class PythonAdapter(BaseAdapter): 
  def submit_python_job(self, parsed_model: dict, compiled_code: str):
    raise NotImplementedException("`submit_python_job` is not implemented for this adapter!")

After PythonAdapter is introduced, the submit_python_job in the BaseAdapter is also removed. One of the reasons to introduce a new PythonAdapter class, instead of building the same functionality with the BaseAdapter is to limit the surface area of the second adapter in dbt-core.

Changes to the SQL adapters to materialize models for python runtimes

Python runtimes need additional help from their SQL adapters in order to materialize the actual datasets into Python-level objects (dataframes) and vice versa (consider ref and model's return) when executing the model.

Consider the example below, dim_all_learners model is a SQL model materialized by the SQL target. Python runtime needs to connect to the SQL target to query the materialization and turn the dim_all_learners model into a dataframe.

# models/my_python_model.py

def model(dbt):

    dbt.config(
        materialized='table'
    )

    # python runtime needs to know how to materialize dim_all_learners_model
    dim_all_learners = dbt.ref("dim_all_learners")
    philly_sample = dim_all_learners
                        .filter(col("city"=="Philly"))
                        .limit(1000)
    df = philly_sample.select("*")

    return df

This proposal suggests that the dataframe conversion, and writing a dataframe back to the SQL target to be optionally located in each SQL target adapter package. Just like the submit_python_job method in the current BaseAdapter SQL adapters may choose to not implement them if it’s not trivial for their platforms.

def read_table_as_dataframe(self, relation: BaseRelation) -> pd.DataFrame

def write_dataframe_as_table(
	self, 
	relation: BaseRelation, 
	df: pd.DataFrame, 
	mode: Union[Literal['table'], Literal['incremental']] = 'table'
) -> AdapterResponse

Another dataframe type other than pandas can be used to pass data between adapters. Possible alternatives include arrow and agate but both of these are not as broadly used as pandas. Arrow would be the most performant choice but its not as stable as pandas.

Moving Data Between Adapters

Two methods mentioned in the section above, read_table_as_dataframe and write_dataframe_as_table defines how to move data in and out of the SQL database. It is defined in the SQL adapter code, therefore all the Python runtimes can share the same way to move data. These methods are only allowed to be called in the Python runtime and not inside the adapter plugin code. Each platform can implement these methods however it is efficient for them. For example the dbt-snowflake adapter might choose to dump the data to S3.

This code snippet illustrates how read_table_as_dataframe and write_dataframe_as_table methods could be executed inside a Python runtime

# This part is user provided model code
# you will need to copy the next section to run the code

def model(...): ...

# COMMAND ----------
# this part is dbt logic for get ref work, do not modify

def ref(*args,dbt_load_df_function):
    refs = {}
    key = ".".join(args)
    return dbt_load_df_function(refs[key])

[...]

class dbtObj:
    def __init__(self, load_df_function) -> None:
        self.source = lambda x: source(x, dbt_load_df_function=load_df_function)
        self.ref = lambda x: ref(x, dbt_load_df_function=load_df_function)
        self.config = config
        self.this = this()
        self.is_incremental = False

# COMMAND ----------

# This is specific to **this** runtime, but each Python engine
# can customize the behavior for running model code in their
# own runtime.

def main(credentials, target_relation):
    db_adapter = create_adapter(credentials, ...)
    dbt_context, session = DBTContext(db_adapter.read_table_as_dataframe), None
    model_as_dataframe = model(dbt_context, session)
    return db_adapter.write_table_as_dataframe(target_relation, model_as_dataframe)

Why Now?

Refactoring the dbt adapter system to allow multiple adapters for any type of engine is undoubtedly a big task. Several factors make the Python runtimes a good first step for a bigger refactor later.

First of all, Python models are new and there is very little existing code to maintain functionality. Besides that, most platforms still have their sql query engines and storage coupled to each other however Python runtimes usually work independently. Lastly, several popular platforms prefer the same Python runtime, Spark, as their query engine of choice. Iceberg, Hudi and Deltalake to name a few. All adapters would need to implement similar submit_python_job methods to support dbt Python models using the current dbt adapter architecture, abstracting out the Python runtime will allow multiple adapters to use the same Python runtime adapter.

Alternatives Considered

Make all the related logic responsibility of the Python Adapter

The responsibility of materializing dataframes (logic that was supposed to be implemented in read_table_as_dataframe/ write_table_as_dataframe methods of SQL adapters) can be assigned to the individual Python runtimes instead of having them in the SQL adapters.

This will result in a big coupling between SQL adapters and Python adapters. As they will share a non-specified API. All the Python runtimes now will depend on SQL adapters internal behavior.
This approach would also result in duplicate code in multiple Python adapters as the implementation for a database is probably very similar in all runtimes.
This option would introduce python adapters as another type of adapter like proposed above, however it won’t give additional responsibilities to the sql adapters to facilitate the materialization.

Creating hybrid adapters (sql + python) that can handle multiple databases and runtimes

Independent python runtimes can add support for python models alongside sql adapters by encapsulating the whole sql adapter. For example one can build a dbt-postgres-spark adapter that has the same sql adapter code with the dbt-postgres adapter and have an extra submit_python_job implementation.

This approach results in an explosion of adapters as a new adapter needs to be built for each python runtime sql database combination.
This option would require minimal changes in dbt-core as the system will still work with a single adapter. There wouldn’t be new keys in profile.yml, and there wouldn’t be a PythonAdapter

jtcohen6 · 2022-09-09T10:55:10Z

jtcohen6
Sep 9, 2022
Maintainer

@turbo1912 Thanks for starting the conversation! There's a lot in here that I agree with.

High-level premises

Adapters for Python runtimes are just another kind of adapter. (It does make sense to me that PythonAdapter could skip a lot of what's currently in BaseAdapter.) When the same runtime offers both SQL + Python support, it makes sense to bundle it into one adapter (Snowflake, Databricks). When it's actually different runtimes
dbt does not "move" data. No data should be downloaded into application memory from one platform, and then re-uploaded to another.
At the same time, we would be excited about living in a future where one dbt invocation / DAG could span multiple data platforms. To achieve that future, we need data platforms to be interoperable. dbt could play a role in teaching that interoperability.

Point of departure: multi-adapter

Rather than configuring a separate target and python-target in profiles.yml, I'd much rather build toward a future where multiple adapters can be configured and used within a single invocation.

This would require us to reimagine (and refactor) profiles.yml:

# profiles.yml
targets:
  - name: dev-pg
    type: postgres
    host: localhost
    port: 5432
    user: pguser
    ...
  - name: dev-py
    type: spark
    mode: local

envs:
  - name: dev
    targets:
      - dev-pg
      - dev-py
    default: dev-pg
  - name: prod
    targets:
      ...

My project would use the default target (Postgres) for most models, but I can also override it to use specific adapters for specific models:

models:
  - name: my_sql_model
    config:
      adapter: postgres   # or should this be 'target: dev-pg',
                          # to allow for multiple postgres targets?
                          # how would that work across envs?
  - name: my_python_model
    config:
      adapter: spark

As we've discussed, and as you caught in the roadmap update, there's a nontrivial amount of work to make all that possible. I'd like it to be true by the end of next year (2023), but I can't make firm guarantees for now except to say that it's very very interesting to us. So long as we're broadly aligned on that as the long-term vision, it feels like there could be a shorter-term version of Python adapter support that's directionally correct for getting there.

Big question: data platform interop

If we want this to be possible in the general case, we can't wait for every single data platform to proactively integrate with every single other one, in order to read and write data between them. (In some cases, it's in their interest to support this; in other cases, it really isn't.)

There are two patterns for doing this, and I see them as highly divergent choices:

Option A: External storage. dbt adapters implement methods for reading data from storage, and writing back to storage. In many cases, this should just be a simple wrapper around a database's supported SQL or API operations for interacting with those platforms. For instance, Snowflake makes it quite easy to perform common operations in S3, GCS, and Azure Blob (depending on the cloud it's running in). The same goes for Redshift vis-à-vis S3, BigQuery with GCS, and so on. All writing and reading should be in columnar file formats where column schema could be auto-detected (Parquet, Delta, ...). The problem today: If two platforms do not actually share a common metastore, how do they know to look in the same places for the same table? Could dbt play a role simply by recording, in its own metadata, the bucket/file location for a Snowflake-produced table exported to S3, and then tell another adapter to read that model from the same bucket/file location? I like the potential overlap here with the discussion around cross-project lineage (#5244), where we're thinking about "publishing" public models from one (upstream) project, such that they could be used as the inputs to another. The limitations here are the limitations of cloud storage: managing permissions can be tricky; and you have to be on the same cloud. A Redshift + BigQuery DAG is pretty hard to imagine.

Option B: In-memory transfer. dbt adapters implement methods for in-memory data transfer, ideally using a performant standard such as Arrow. In many cases, this should just be a simple wrapper around the database client's own Arrow + Pandas support. (From what I can tell, snowflake-connector-python and BigQuery's storage API have both supported egress in Arrow format for some time (Snowflake blog, BigQuery blog). There's less-native support for loading Arrow data back into the warehouse, short of batched conversions to Pandas.) The big catch here, which you called out, is complying with premise 2 above: make it possible to leverage these methods for adapter interoperability, while preventing them from being used to load tons of data into the dbt application, and thereby mixing up the "control" and "data" planes.

I have some other questions about this pseudo-code, which are fairly vanilla by comparison:

def main(credentials, target_relation):
    db_adapter = create_adapter(credentials, ...)
    dbt_context, session = DBTContext(db_adapter.read_table_as_dataframe), None

We'd need a stable internal API for dbt adapters and contexts as Python objects, if they're going to be instantiated within a Python runtime. That's a thorny mess today, though to be improved by behind-the-scenes work we've got underway.
We'd need a secure way to ship up those credentials. We're talking here about sharing dbt's own Snowflake auth (ideally the short-lived token) with another external service (Python runtime). Honestly, this part goes above my head. I'm definitely open to hearing that there are good patterns for doing it.

There might be good reason to want both options, to treat each as a valid backend implementation with pros & cons. Even so, I bring up the divergence now, because I think it's worthwhile for us to figure out which one we'd want to build first.

0 replies

turbo1912 · 2022-09-12T18:50:27Z

turbo1912
Sep 12, 2022
Author

Thanks @jtcohen6! I will focus on the configuration changes in this response and later follow it up by another response about data platform interop and credentials!

Rather than configuring a separate target and python-target in profiles.yml, I'd much rather build toward a future where multiple adapters can be configured and used within a single invocation.

We are definitely excited about multi-adapter support and it’s clear that the steps we take along the way should bring us closer to the ultimate goal. Nevertheless, we are a little concerned that a complete overhaul of profiles.yml would increase the scope of this proposal too much. Our initial intent was to keep the existing profiles.yml mostly backwards compatible and just add an extra property. We want to keep the surface area of this proposal as small as possible.

A similar alternative to our initial proposal with minimal changes to the current profiles.yml format could look like:

test:
  target: dev-pg
  outputs:
    dev-pg:
      type: postgres
      host: localhost
      port: 5432
      user: pguser
      # ...
      python-engine: dev-py

    dev-py:
      type: spark
      user: asdfasdf
      pass: 123123

For a more general solution, there is some extra complexity we need to consider. That is, each adapter comes with different capabilities. In simple terms, these capabilities are: storage, SQL and Python. The dbt-snowflake adapter has all 3 three, dbt-postgres only has storage and SQL, a PySpark python adapter might only have the Python capability. To be able to use each other’s functionalities, adapters would have to define dependencies on other adapters. The python-engine property mentioned in the above example, is a simple way of configuring a Python capability for a postgres adapter. A more complete solution is needed for the multi-adapter future.

In the override example you posted above, assuming spark adapter doesn’t come with any storage, either the model config has to specify where to write the model results or the Postgres database needs to specify which Python engine to use.

  - name: my_python_model
    config:
      adapter: spark
      storage: ??

0 replies

burkaygur · 2022-09-13T01:11:10Z

burkaygur
Sep 13, 2022

… and here is our thoughts on Data Interop

To achieve that future, we need data platforms to be interoperable. dbt could play a role in teaching that interoperability.

It is great to hear that dbt-core is open to building a standardized way of doing data interop! As the fal team, we would love to contribute anything I mention in the rest of this post.

Let’s focus on Option A as it has some nice properties. With this approach, we could introduce a new protocol, let’s call it dbt Teleport.

teleport-config:
  outputs:
    - name: dev-data-share
      type: s3
      bucket: s3://
      aws_key:
      aws_secret:
      ...

    - name: prod-data-share
      type: gcs
      bucket: my-prod-bucket
      service_account_json:
      ...

A teleport config defines a common staging area that models can share data with each other. Every adapter would get read/write access to this shared teleport bucket (could be made more capable over time in terms of access controls). In this world, a node is responsible for unloading data into teleport if it is referenced by another model with a different adapter. Similarly, a node is responsible for loading data from teleport if it is referring to an upstream model with a different adapter when it is being executed.

This pattern leads us into a world where adapters do not need to share credentials or delegate access to each other (which we may want to avoid anyways due to access isolation principles).

The unload and load implementation of each adapter would live inside the adapter itself. The checks for whether a given model needs to be unloaded or a load from would live in a transaction inside the materialization code.

dbt teleport metadata storage

Once a model is unloaded into external storage (s3, gcs.. ) several different models with potentially different adapters need to access and load that model. dbt-core needs to pass a pointer to the location of the data in external storage to all the adapters that need to load the data. There are several places where this information can be stored in; either in external storage (s3/gcs..) or in the dbt build logs (run_results, or similar).

What are the interfaces that needs to be implemented by adapters?

To provide full isolation and no dependencies between adapters, we could come up with “named formats” such as parquet, csv. With these standards, an implementation would mean the same thing across adapters.

A given adapter could implement load and unload methods for a named format that it supports. dbt-core would then resolve which implementations would be used during compilation by matching the unload-load pairs across dependent models. dbt-core would also implement the necessary guardrails and error out during compile time when there are no matching unload-load pairs.

P.S. For Option B (which may be a more efficient way of doing data transfer in certain cases, such as loading batches of training data for a deep learning model), we would have to share credentials across adapters, or dbt would facilitate handing over a secure connection that is pre-authenticated to a given downstream adapter. This could be explored further once there is enough demand for the batch data loading approach.

1 reply

chamini2 Sep 14, 2022

In this world, a node is responsible for unloading data into teleport if it is referenced by another model with a different adapter. Similarly, a node is responsible for loading data from teleport if it is referring to an upstream model with a different adapter when it is being executed.

I think it would actually be a node's responsibility to make available any data that it will need.

So if we have a model a referencing a model b in a different adapter, and we run dbt run --select a, dbt-core would make sure to unload the model b data into teleport (S3) and pass the pointer (link) to it to model a, so that it can download it whenever it needs it.

This makes each model take care of itself and not depend on past runs to have the correct data available.

turbo1912 · 2022-09-29T19:55:21Z

turbo1912
Sep 29, 2022
Author

Wanted to share some of the progress we have made with python adapters with the community.

Loom Demo: https://www.loom.com/share/26c9da8814d4435cb763cfb4eb3ab5dc
Notion for Setup Instructions: https://featuresandlabels.notion.site/Public-Set-up-dbt-fal-Act-1-568b7b0692514f52be161e1129d7dcc8

Let us know what you think. We are especially happy about how this turned out because the changes on dbt-core are minimal.

0 replies

ismailsimsek · 2024-12-17T12:29:37Z

ismailsimsek
Dec 17, 2024

This is now also enabled/possible by opendbt

here is an example which uses dlt to import data, an advanced python meterlization, execution
another article

one advantage with opendbt is user dont need to define any other connection or dbt profile. It completely uses existing adapter and confing

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`PythonAdapter` Proposal #5758

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

PythonAdapter Proposal #5758

turbo1912 Sep 2, 2022

What is the problem you are trying to solve?

Requirements

Proposed Solution

Specifying a Python runtime in the profiles.yml

PythonAdapter interface/class

Changes to the SQL adapters to materialize models for python runtimes

Moving Data Between Adapters

Why Now?

Alternatives Considered

Make all the related logic responsibility of the Python Adapter

Creating hybrid adapters (sql + python) that can handle multiple databases and runtimes

Replies: 5 comments · 1 reply

jtcohen6 Sep 9, 2022 Maintainer

High-level premises

Point of departure: multi-adapter

Big question: data platform interop

turbo1912 Sep 12, 2022 Author

burkaygur Sep 13, 2022

… and here is our thoughts on Data Interop

dbt teleport metadata storage

What are the interfaces that needs to be implemented by adapters?

chamini2 Sep 14, 2022

turbo1912 Sep 29, 2022 Author

ismailsimsek Dec 17, 2024

`PythonAdapter` Proposal #5758

turbo1912
Sep 2, 2022

Specifying a Python runtime in the `profiles.yml`

`PythonAdapter` interface/class

Replies: 5 comments 1 reply

jtcohen6
Sep 9, 2022
Maintainer

turbo1912
Sep 12, 2022
Author

burkaygur
Sep 13, 2022

turbo1912
Sep 29, 2022
Author

ismailsimsek
Dec 17, 2024