Replies: 5 comments 1 reply
-
@turbo1912 Thanks for starting the conversation! There's a lot in here that I agree with.

High-level premises
Point of departure: multi-adapter

Rather than configuring a separate Python runtime alongside each SQL target, I imagine a single dbt project that registers multiple targets (one per adapter) and picks between them per model. This would require us to reimagine (and refactor) `profiles.yml`:

```yaml
# profiles.yml
targets:
  - name: dev-pg
    type: postgres
    host: localhost
    port: 5432
    user: pguser
    ...
  - name: dev-py
    type: spark
    mode: local
envs:
  - name: dev
    targets:
      - dev-pg
      - dev-py
    default: dev-pg
  - name: prod
    targets:
      ...
```

My project would use the default target (Postgres) for most models, but I can also override it to use specific adapters for specific models:

```yaml
models:
  - name: my_sql_model
    config:
      adapter: postgres # or should this be 'target: dev-pg',
                        # to allow for multiple postgres targets?
                        # how would that work across envs?
  - name: my_python_model
    config:
      adapter: spark
```

As we've discussed, and as you caught in the roadmap update, there's a nontrivial amount of work to make all that possible. I'd like it to be true by the end of next year (2023), but I can't make firm guarantees for now except to say that it's very very interesting to us. So long as we're broadly aligned on that as the long-term vision, it feels like there could be a shorter-term version of Python adapter support that's directionally correct for getting there.

Big question: data platform interop

If we want this to be possible in the general case, we can't wait for every single data platform to proactively integrate with every single other one, in order to read and write data between them. (In some cases, it's in their interest to support this; in other cases, it really isn't.) There are two patterns for doing this, and I see them as highly divergent choices:

Option A: External storage. dbt adapters implement methods for reading data from storage, and writing back to storage. In many cases, this should just be a simple wrapper around a database's supported SQL or API operations for interacting with those platforms. For instance, Snowflake makes it quite easy to perform common operations in S3, GCS, and Azure Blob (depending on the cloud it's running in). The same goes for Redshift vis-à-vis S3, BigQuery with GCS, and so on. All writing and reading should be in columnar file formats where column schema could be auto-detected (Parquet, Delta, ...).

The problem today: If two platforms do not actually share a common metastore, how do they know to look in the same places for the same table? Could dbt play a role simply by recording, in its own metadata, the bucket/file location for a Snowflake-produced table exported to S3, and then tell another adapter to read that model from the same bucket/file location? I like the potential overlap here with the discussion around cross-project lineage (#5244), where we're thinking about "publishing" public models from one (upstream) project, such that they could be used as the inputs to another.

The limitations here are the limitations of cloud storage: managing permissions can be tricky, and you have to be on the same cloud. A Redshift + BigQuery DAG is pretty hard to imagine.

Option B: In-memory transfer. dbt adapters implement methods for in-memory data transfer, ideally using a performant standard such as Arrow. In many cases, this should just be a simple wrapper around the database client's own Arrow + Pandas support. (From what I can tell, …)

I have some other questions about this pseudo-code, which are fairly vanilla by comparison:

```python
def main(credentials, target_relation):
    db_adapter = create_adapter(credentials, ...)
    dbt_context, session = DBTContext(db_adapter.read_table_as_dataframe), None
```
There might be good reason to want both options, to treat each as a valid backend implementation with pros & cons. Even so, I bring up the divergence now, because I think it's worthwhile for us to figure out which one we'd want to build first.
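To make Option B slightly more concrete, here's a minimal sketch of an adapter-side helper that wraps the database client's own Arrow support. The mixin and method names are made up for illustration, and it assumes the adapter exposes a DB-API-style connection; this is not an existing dbt interface:

```python
import pyarrow as pa

class ArrowTransferMixin:
    """Hypothetical mixin for in-memory transfer; not an existing dbt class."""

    def read_relation_as_arrow(self, relation: str) -> pa.Table:
        # Assumes the adapter exposes a DB-API-style `self.connection`.
        cursor = self.connection.cursor()
        cursor.execute(f"select * from {relation}")
        # Some clients (e.g. Snowflake's Python connector) can hand back Arrow
        # directly; otherwise, build a table from the fetched rows.
        if hasattr(cursor, "fetch_arrow_all"):
            return cursor.fetch_arrow_all()
        columns = [desc[0] for desc in cursor.description]
        rows = cursor.fetchall()
        return pa.Table.from_pylist([dict(zip(columns, row)) for row in rows])
```

The point being: for many warehouses, Option B would largely be a thin wrapper over capabilities the client libraries already expose.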
-
Thanks @jtcohen6! I will focus on the configuration changes in this response and follow it up later with another response about data platform interop and credentials!
We are definitely excited about multi-adapter support, and it's clear that the steps we take along the way should bring us closer to that ultimate goal. Nevertheless, we are a little concerned about a complete overhaul of `profiles.yml`. A similar alternative to our initial proposal, with minimal changes to the current `profiles.yml` format, could look like:
For a more general solution, there is some extra complexity we need to consider: each adapter comes with different capabilities. In simple terms, these capabilities are storage, SQL, and Python. The dbt-snowflake adapter has all three, dbt-postgres only has storage and SQL, and a PySpark Python adapter might only have the Python capability. To be able to use each other's functionality, adapters would have to define dependencies on other adapters. In the override example you posted above, assuming the Spark adapter doesn't come with any storage, either the model config has to specify where to write the model results, or the Postgres database needs to specify which Python engine to use.
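To illustrate what we mean (purely a sketch; none of these names exist in dbt today), capabilities could be modeled as simple flags that dbt-core checks when pairing adapters:

```python
from enum import Flag, auto

class Capability(Flag):
    STORAGE = auto()
    SQL = auto()
    PYTHON = auto()

# Hypothetical declarations for the adapters mentioned above.
SNOWFLAKE = Capability.STORAGE | Capability.SQL | Capability.PYTHON
POSTGRES = Capability.STORAGE | Capability.SQL
PYSPARK_RUNTIME = Capability.PYTHON

def valid_pairing(engine: Capability, database: Capability) -> bool:
    """A Python-only engine has to be paired with an adapter that has storage."""
    return Capability.PYTHON in engine and Capability.STORAGE in database
```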
-
… and here are our thoughts on Data Interop.
It is great to hear that dbt-core is open to building a standardized way of doing data interop! As the fal team, we would love to contribute to anything we mention in the rest of this post.

Let's focus on Option A, as it has some nice properties. With this approach, we could introduce a new protocol, let's call it teleport. A teleport-config could look like:

```yaml
outputs:
  - name: dev-data-share
    type: s3
    bucket: s3://
    aws_key:
    aws_secret:
    ...
  - name: prod-data-share
    type: gcs
    bucket: my-prod-bucket
    service_account_json:
    ...
```

A teleport config defines a common staging area through which models can share data with each other. Every adapter would get read/write access to this shared teleport bucket (this could be made more capable over time in terms of access controls). In this world, a node is responsible for unloading data into teleport if it is referenced by another model with a different adapter. Similarly, a node is responsible for loading data from teleport if it refers to an upstream model with a different adapter when it is being executed. This pattern leads us into a world where adapters do not need to share credentials or delegate access to each other (which we may want to avoid anyway due to access-isolation principles). The unload and load implementation of each adapter would live inside the adapter itself. The checks for whether a given model needs to be unloaded, or a load needs to happen, would live in a transaction inside the materialization code.

dbt teleport metadata storage

Once a model is unloaded into external storage (S3, GCS, ...), several different models with potentially different adapters need to access and load that model. dbt-core needs to pass a pointer to the location of the data in external storage to all the adapters that need to load the data. There are several places where this information could be stored: either in external storage (S3/GCS/...) or in the dbt build logs (run_results, or similar).

What are the interfaces that need to be implemented by adapters?

To provide full isolation and no dependencies between adapters, we could come up with "named formats" that both the unloading and the loading adapter understand (columnar formats, for example). A given adapter could then implement the unload and load operations for the formats it supports.

P.S. For Option B (which may be a more efficient way of doing data transfer in certain cases, such as loading batches of training data for a deep learning model), we would have to share credentials across adapters, or dbt would facilitate handing over a secure connection that is pre-authenticated to a given downstream adapter. This could be explored further once there is enough demand for the batch data loading approach.
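As a rough illustration of those interfaces (the method names and exact shape here are placeholders, not a settled API), a teleport-capable adapter might expose something like:

```python
from typing import Protocol

class TeleportCapable(Protocol):
    """Hypothetical interface an adapter implements to participate in teleport."""

    def supported_formats(self) -> list[str]:
        """Named formats this adapter can read and write, e.g. a columnar format."""
        ...

    def unload_to_teleport(self, relation: str, fmt: str, location: str) -> None:
        """Export a materialized model to the shared teleport staging area."""
        ...

    def load_from_teleport(self, location: str, fmt: str, relation: str) -> None:
        """Create or replace a relation from data staged in teleport."""
        ...
```

dbt-core would then only need to record a (location, format) pair per unloaded model and hand it to whichever downstream adapters reference that model.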
-
Wanted to share some of the progress we have made with Python adapters with the community.

Loom demo: https://www.loom.com/share/26c9da8814d4435cb763cfb4eb3ab5dc

Let us know what you think. We are especially happy about how this turned out because the changes to dbt-core are minimal.
-
This is now also enabled/possible by opendbt.
One advantage with opendbt is that users don't need to define any other connection or dbt profile. It completely uses the existing adapter and config.
-
What is the problem you are trying to solve?
dbt adapters are all-in-one solutions that connect dbt-core with a database and a Python runtime (new!). While some platforms, like Snowflake, offer both a database and a Python runtime, a large number of databases don't support Python. For example, there is no way for a user with a dbt-postgres adapter to work with Python models.
A more modular dbt adapter system can abstract away a lot of complexity from data architectures with multiple vendors.
Requirements
Proposed Solution
This post proposes a new way for dbt users to configure their Python runtimes separately from their databases. To enable this, we propose a new abstraction over the `BaseAdapter` class, called `PythonAdapter` (just like `SQLAdapter`). This new abstraction will facilitate a modular adapter system that allows pairing SQL-only databases with different Python runtimes. Users will be able to mix and match their Python runtime adapters with the existing SQL adapters.

The proposed solution has 3 main parts:

1. `profiles.yml`
2. `PythonAdapter` interface
3. `SQLAdapter` interface to facilitate Python runtimes to materialize referenced SQL models

Specifying a Python runtime in the profiles.yml
dbt users need a way to configure the Python runtime separately from their main database. This is done with a new YML dictionary key, `python_engines` (alternatively called `python_runtimes`). Similar to default SQL targets, users can also specify a default Python target. Notice that `python_engines` is not a required field; when it is omitted, dbt Python models are executed by the adapter listed under the `outputs` field (the current behavior).
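For illustration only, a profiles.yml using the new key could look roughly like this; the exact field names and shape are not settled, so treat it as a sketch rather than a spec:

```yaml
my_profile:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      ...
  python_engines:        # optional; omit to keep the current behavior
    - name: local-spark
      type: spark
      default: true
```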
PythonAdapter interface/class

Each platform adapter that intends to run Python models implements the `PythonAdapter` class. Adapter authors have the option to inherit from both or just one of the `SQLAdapter` and `PythonAdapter` classes. For example, platforms like Snowflake that support both Python and SQL query engines inherit from both classes, while SQL-only databases like Postgres will keep using just the `SQLAdapter` class.

Most of the SQL-specific methods and macros that are present in a SQL adapter are not relevant for a Python adapter. The `PythonAdapter` class consists of a few methods, the most relevant being `submit_python_job`.
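A minimal sketch of what that class could look like; the method signature loosely mirrors the existing `submit_python_job` on `BaseAdapter`, and everything else here is illustrative rather than part of the proposal text:

```python
from abc import ABC, abstractmethod
from typing import Any

class PythonAdapter(ABC):
    """Base class for adapters that can execute dbt Python models."""

    @abstractmethod
    def submit_python_job(self, parsed_model: dict, compiled_code: str) -> Any:
        """Send the compiled Python model to the runtime and wait for the result."""
        ...

# Adapter authors could then mix and match, as described above, e.g.:
#   class SnowflakeAdapter(SQLAdapter, PythonAdapter): ...  # SQL + Python
#   class PostgresAdapter(SQLAdapter): ...                  # SQL only
```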
method listed below.After
PythonAdapter
is introduced, thesubmit_python_job
in theBaseAdapter
is also removed. One of the reasons to introduce a newPythonAdapter
class, instead of building the same functionality with theBaseAdapter
is to limit the surface area of the second adapter in dbt-core.Changes to the SQL adapters to materialize models for python runtimes
Python runtimes need additional help from their SQL adapters in order to materialize the actual datasets into Python-level objects (dataframes) and vice versa (consider `ref` and `model`'s return value) when executing the model.

Consider the example below: `dim_all_learners` is a SQL model materialized by the SQL target. The Python runtime needs to connect to the SQL target, query the materialization, and turn the `dim_all_learners` model into a dataframe.
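For instance, a Python model along these lines (the model name, body, and column are made up; `dbt.ref` is the existing Python-model API) is where that conversion would happen:

```python
# models/active_learners.py -- illustrative model
def model(dbt, session):
    # Under this proposal, dbt.ref() would call the SQL adapter's
    # read_table_as_dataframe to fetch the materialized SQL model.
    learners = dbt.ref("dim_all_learners")
    return learners[learners["is_active"]]  # "is_active" is a made-up column
```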
This proposal suggests that the dataframe conversion, and writing a dataframe back to the SQL target, be optionally located in each SQL target adapter package. Just like the `submit_python_job` method in the current `BaseAdapter`, SQL adapters may choose not to implement them if it is not trivial for their platform.

A dataframe type other than pandas could be used to pass data between adapters. Possible alternatives include Arrow and Agate, but neither is as broadly used as pandas. Arrow would be the most performant choice, but it is not as stable as pandas.
Moving Data Between Adapters
The two methods mentioned in the section above, `read_table_as_dataframe` and `write_dataframe_as_table`, define how to move data in and out of the SQL database. They are defined in the SQL adapter code, so all the Python runtimes can share the same way to move data. These methods are only allowed to be called in the Python runtime, not inside the adapter plugin code. Each platform can implement these methods however is efficient for it; for example, the dbt-snowflake adapter might choose to dump the data to S3.

The code snippet below illustrates how the `read_table_as_dataframe` and `write_dataframe_as_table` methods could be executed inside a Python runtime.
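The original snippet did not survive in this copy of the post; the following is a hedged reconstruction based on the fragment quoted earlier in the thread (create_adapter, DBTContext, and the wiring are pseudo-code, not an existing dbt API):

```python
def main(credentials, target_relation):
    # Build the SQL adapter inside the Python runtime so the runtime can move
    # data through the adapter-defined methods.
    db_adapter = create_adapter(credentials)
    dbt = DBTContext(read_table=db_adapter.read_table_as_dataframe)
    session = None  # e.g. a SparkSession for Spark-based runtimes

    # The user's model function receives the dbt context; dbt.ref() resolves to
    # read_table_as_dataframe under the hood.
    df = model(dbt, session)

    # Write the resulting dataframe back to the SQL target.
    db_adapter.write_dataframe_as_table(df, target_relation)
```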
Why Now?

Refactoring the dbt adapter system to allow multiple adapters for any type of engine is undoubtedly a big task. Several factors make the Python runtimes a good first step for a bigger refactor later.
First of all, Python models are new, so there is very little existing code whose functionality needs to be maintained. Besides that, most platforms still have their SQL query engines and storage coupled to each other, whereas Python runtimes usually work independently. Lastly, several popular platforms prefer the same Python runtime, Spark, as their query engine of choice (Iceberg, Hudi, and Delta Lake, to name a few). Using the current dbt adapter architecture, all adapters would need to implement similar `submit_python_job` methods to support dbt Python models; abstracting out the Python runtime will allow multiple adapters to use the same Python runtime adapter.

Alternatives Considered
Make all the related logic the responsibility of the Python Adapter
The responsibility of materializing dataframes (logic that was supposed to be implemented in the `read_table_as_dataframe`/`write_dataframe_as_table` methods of the SQL adapters) could instead be assigned to the individual Python runtimes.

This would result in strong coupling between SQL adapters and Python adapters, as they would share a non-specified API. All the Python runtimes would then depend on the SQL adapters' internal behavior.
This approach would also result in duplicate code across multiple Python adapters, as the implementation for a given database is probably very similar in all runtimes.
This option would still introduce Python adapters as another type of adapter, as proposed above; however, it would not give additional responsibilities to the SQL adapters to facilitate the materialization.
Creating hybrid adapters (SQL + Python) that can handle multiple databases and runtimes
Independent Python runtimes could add support for Python models alongside SQL adapters by encapsulating the whole SQL adapter. For example, one could build a dbt-postgres-spark adapter that reuses the SQL adapter code from the dbt-postgres adapter and adds an extra `submit_python_job` implementation.
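Purely as an illustration of this alternative (the package and class names are hypothetical; only `submit_python_job` comes from the proposal), such a hybrid adapter might look like:

```python
# A made-up dbt-postgres-spark adapter: reuse the Postgres SQL adapter and
# bolt a Spark-backed submit_python_job onto it.
from dbt.adapters.postgres import PostgresAdapter  # assumes dbt-postgres is installed

class PostgresSparkAdapter(PostgresAdapter):
    def submit_python_job(self, parsed_model: dict, compiled_code: str):
        # Hand the compiled Python model off to a Spark runtime here
        # (session management and result handling omitted from this sketch).
        raise NotImplementedError("sketch only")
```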