[590] Add RunCatalogSync utility for synchronizing tables across catalogs #591

Open
wants to merge 1 commit into
base: 590-CatalogSync-API

Conversation

vinishjail97
Contributor

@vinishjail97 vinishjail97 commented Dec 6, 2024

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

Introduces the RunCatalogSync utility, which unblocks syncing tables from a source catalog to multiple target catalogs. A sample configuration is available at xtable-utilities/src/test/resources/catalogConfig.yaml. At a high level, the utility does the following:

  1. Start off with SourceCatalog.
  2. User configures multiple TargetCatalogs where tables will be synced.
  3. XTable generates the table format metadata in storage first.
  4. Syncs the table format metadata to TargetCatalog.
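
The ordering in steps 3 and 4 can be sketched as follows. This is a minimal illustration only: `TargetCatalog`, `generateMetadataInStorage`, and `syncTable` are hypothetical names invented for this sketch and are not the interfaces added by this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: these types are hypothetical stand-ins, not XTable classes. */
final class CatalogSyncFlowSketch {

  /** Hypothetical target catalog; the PR's real client interface may look different. */
  interface TargetCatalog {
    String name();

    void registerTable(String tableName, String metadataLocation);
  }

  /** Step 3 stand-in: pretends to write format metadata and returns its location. */
  static String generateMetadataInStorage(String basePath, String targetFormat) {
    return basePath + "/" + targetFormat.toLowerCase(Locale.ROOT) + "-metadata";
  }

  /** Steps 3 and 4: write metadata to storage first, then register it in every target. */
  static List<String> syncTable(
      String tableName, String basePath, String targetFormat, List<TargetCatalog> targets) {
    String metadataLocation = generateMetadataInStorage(basePath, targetFormat);
    List<String> syncedCatalogs = new ArrayList<>();
    for (TargetCatalog target : targets) {
      target.registerTable(tableName, metadataLocation);
      syncedCatalogs.add(target.name());
    }
    return syncedCatalogs;
  }
}
```

The point of the sketch is the ordering: the metadata location is produced once, before any catalog is touched, and each target catalog then receives the same location.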

Brief change log

(for example:)

  • Add interface for CatalogSyncClient and CatalogSync
  • Add ability to synchronize tables between source and target catalogs

Verify this pull request

(Please pick either of the following options)

This change added tests and can be verified as follows:

(example:)

  • org.apache.xtable.spi.sync.TestCatalogSync
  • org.apache.xtable.spi.sync.TestCatalogUtils

@vinishjail97 vinishjail97 marked this pull request as draft December 6, 2024 22:41
@vinishjail97 vinishjail97 changed the title [590] Add interface for CatalogSyncClient and CatalogSyncOperations [590] Add interface for CatalogSyncClient and CatalogSync Dec 10, 2024
@vinishjail97 vinishjail97 marked this pull request as ready for review December 10, 2024 02:16
@vinishjail97
Contributor Author

vinishjail97 commented Dec 10, 2024

I'm pushing another PR for the client side changes for CatalogSync.
https://github.com/apache/incubator-xtable/pull/597/files

@vinishjail97
Contributor Author

vinishjail97 commented Dec 10, 2024

After putting more thought into it, I think we can keep the cross catalog sync as a separate function and not integrate it with TableFormatSync.

  1. Start off with SourceCatalog. (This can be StorageCatalog as well).
  2. The user can choose multiple TargetCatalogs, and each catalog has an option to sync multiple table formats.
  3. XTable generates the table format metadata in storage first.
  4. Syncs the table format metadata to TargetCatalog.

This will be a separate utility option in RunSync to configure a catalogConfig.yaml file.

/** This class represents the unique identifier for a table in a catalog. */
@Value
@Builder
public class CatalogTableIdentifier {
Contributor

  1. What do you think of calling this class TableIdentifier?
  2. Also, we should consider making the naming more generic, since it should represent all types of table identifiers. For instance, while databaseName is a popular namespace, there's also schema. In some scenarios, databaseName is synonymous with catalogName. The current two-part naming based on table and databaseName seems a bit restrictive to me.

Would you mind sharing the use cases this naming caters to?

Contributor Author

1 is okay. I included the prefix Catalog because TableIdentifier is overloaded in the dependent projects (Iceberg, Hudi, Delta, etc.) and I didn't want to add another identifier with the same name.

For 2, agreed that each catalog or system has a different term for a "logical grouping of tables".
I looked this up in different systems and found that databaseName, namespace, and schema are the popular ones. Would a more generic name such as tableCollection or tableGroup work? If we can't find a generic name, choosing databaseName may be okay? Regardless of the name we choose, the conversion interfaces for each catalog are responsible for translating the catalog table definition to this representation.
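
A rough sketch of what that translation could look like is below. It assumes a two-part identifier with `databaseName` and `tableName` fields (the second field name is a guess), and both converter methods are hypothetical. Dropping the catalog part in the three-part case is just one possible choice, not something this PR specifies.

```java
import java.util.Objects;

/** Illustrative only: a two-part identifier plus hypothetical per-catalog converters. */
final class CatalogTableIdentifierSketch {

  /** Assumed shape: a logical grouping name and a table name. */
  static final class Identifier {
    final String databaseName;
    final String tableName;

    Identifier(String databaseName, String tableName) {
      this.databaseName = Objects.requireNonNull(databaseName, "databaseName");
      this.tableName = Objects.requireNonNull(tableName, "tableName");
    }

    @Override
    public String toString() {
      return databaseName + "." + tableName;
    }
  }

  /** Hypothetical converter for a catalog that calls its grouping a "namespace". */
  static Identifier fromNamespace(String namespace, String table) {
    return new Identifier(namespace, table);
  }

  /** Hypothetical converter for catalog.schema.table naming; the catalog part is dropped here. */
  static Identifier fromCatalogSchemaTable(String catalog, String schema, String table) {
    return new Identifier(schema, table);
  }
}
```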

assertDoesNotThrow(() -> RunCatalogSync.main(args));
}

public static class TestCatalogImpl implements CatalogConversionSource, CatalogSyncClient {
Contributor

Would it be simpler to use a mock object here?

Contributor Author

CatalogConversionFactory contains methods that use reflection to load the classImpl provided by the user, so mocking it wouldn't make for a good assertion from a unit-test perspective.
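
To illustrate why a concrete class is needed, here is a minimal sketch of reflection-based loading. `SyncClient`, `InMemorySyncClient`, and `load` are hypothetical names for this example. The actual factory methods in CatalogConversionFactory may differ.

```java
/** Illustrative only: loads an implementation by class name, as a reflective factory might. */
final class ReflectiveLoaderSketch {

  /** Hypothetical client interface standing in for the real one. */
  interface SyncClient {
    String catalogName();
  }

  /** A concrete test implementation; reflection needs a real class with a no-arg constructor. */
  public static final class InMemorySyncClient implements SyncClient {
    public InMemorySyncClient() {}

    @Override
    public String catalogName() {
      return "in-memory";
    }
  }

  /** Resolves the class by name and instantiates it through its no-arg constructor. */
  static SyncClient load(String className) {
    try {
      Class<?> clazz = Class.forName(className);
      return clazz.asSubclass(SyncClient.class).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot load catalog implementation " + className, e);
    }
  }
}
```

A mock created at runtime has no stable class name to put in a configuration file, which is why a named test class exercises this path more faithfully.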

Contributor

I saw a similar class in another test. Should we just have one test implementation that we reuse for these?

String getCatalogImpl();

/** Returns the storage location of the table synced to the catalog. */
String getStorageDescriptorLocation(TABLE table);
Contributor

What composes a storage descriptor? Is it just a path?

Contributor Author

Yes, it's the storage path for the table in the catalog.

Contributor

Should we just call it getStorageLocation then?

@ashvina
Contributor

ashvina commented Dec 18, 2024

Hi @vinishjail97, this PR seems to cover two distinct features: refactoring to add CatalogSyncClient and enabling syncing to multiple catalogs. The combination of these features might be contributing to the PR's size. What do you think about splitting the PR along these feature lines?

@@ -34,14 +35,21 @@ public class ConversionConfig {
@NonNull SourceTable sourceTable;
// One or more targets to sync the table metadata to
List<TargetTable> targetTables;
// Each target table can be synced to multiple target catalogs, this is map from
// targetTableIdentifier to target catalogs.
Map<String, List<TargetCatalogConfig>> targetCatalogs;
Contributor

I am curious about the expected behavior if XTable fails to update a subset of the TargetCatalogs. Does that impact the way incremental sync works?
This may have been covered elsewhere in the PR and I might have missed it.

Contributor Author

It doesn't change the behavior of the existing incremental sync. Existing XTable users can still use incremental sync without configuring source/target catalogs, and neither the existing RunSync class in utilities nor the sync function in ConversionController is changing. The failure is returned as part of this:

// The sync status for each catalog.
List<CatalogSyncStatus> catalogSyncStatusList; 
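
One way such a per-catalog status list could be populated is sketched below. The fields on `CatalogSyncStatus` and the `CatalogSyncAction` interface are assumptions made for this example. The PR's actual class may carry different fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: records one status per catalog instead of failing fast. */
final class CatalogSyncStatusSketch {

  /** Assumed shape of a per-catalog status; the real class may differ. */
  static final class CatalogSyncStatus {
    final String catalogId;
    final boolean success;
    final String errorMessage; // null when the sync succeeded

    CatalogSyncStatus(String catalogId, boolean success, String errorMessage) {
      this.catalogId = catalogId;
      this.success = success;
      this.errorMessage = errorMessage;
    }
  }

  /** Hypothetical unit of work that syncs one catalog and may fail. */
  interface CatalogSyncAction {
    void run() throws Exception;
  }

  /** Runs every action; a failure in one catalog is recorded and the loop continues. */
  static List<CatalogSyncStatus> syncAll(Map<String, CatalogSyncAction> actionsByCatalogId) {
    List<CatalogSyncStatus> statuses = new ArrayList<>();
    for (Map.Entry<String, CatalogSyncAction> entry : actionsByCatalogId.entrySet()) {
      try {
        entry.getValue().run();
        statuses.add(new CatalogSyncStatus(entry.getKey(), true, null));
      } catch (Exception e) {
        statuses.add(new CatalogSyncStatus(entry.getKey(), false, e.getMessage()));
      }
    }
    return statuses;
  }
}
```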

@@ -44,4 +44,8 @@ public TargetTable(
this.metadataRetention =
metadataRetention == null ? Duration.of(7, ChronoUnit.DAYS) : metadataRetention;
}

public String getId() {
Contributor

What is the contract of Id? Could you please add Javadoc for the public methods of the core classes?

public class CatalogTableIdentifier {
/**
* Catalogs have the ability to group tables logically, databaseName is the identifier for such
* logical classification. The alternate names for this field include namespace, schemaName etc.
Contributor

I agree that the nesting varies depending on the database. However, three-part naming is quite common: the convention includes the name of the database, the schema, and the table or view. I am wondering if it would be feasible to keep the table Id as a string instead of an object whose structure can vary a lot.
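
As a sketch of the string-based alternative, the snippet below parses a dotted Id with one to three parts. The class and method names are hypothetical, and the sketch does not handle quoted identifiers that contain dots, which a real implementation would need to consider.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: treats a table Id as a dotted string such as "db.schema.table". */
final class DottedTableIdSketch {

  /** Splits the Id into one to three non-empty parts; anything else is rejected. */
  static List<String> parse(String id) {
    String[] parts = id.split("\\.", -1);
    if (parts.length > 3) {
      throw new IllegalArgumentException("Expected at most three parts: " + id);
    }
    for (String part : parts) {
      if (part.isEmpty()) {
        throw new IllegalArgumentException("Empty part in table Id: " + id);
      }
    }
    return Collections.unmodifiableList(Arrays.asList(parts));
  }

  /** The table name is always the last part, whatever the nesting depth. */
  static String tableName(String id) {
    List<String> parts = parse(id);
    return parts.get(parts.size() - 1);
  }
}
```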

@ashvina
Contributor

ashvina commented Dec 18, 2024

Hi @vinishjail97, This change significantly impacts XTable usage. The current description doesn't provide enough context for me. Could we set up a call or create a document to discuss this change in detail?

@vinishjail97
Contributor Author

vinishjail97 commented Dec 18, 2024

@ashvina I will create two separate PRs to avoid the confusion.

  1. Interfaces for CatalogSyncClient and CatalogSync.
  2. Integration of catalog syncs in XTable conversion controller using RunCatalogSync.

We are not changing the way incremental sync works for table formats; the sync method in ConversionController still exists. The purpose of syncTableAcrossCatalogs is to synchronize a table in the source catalog to the target catalogs. If a tableFormat sync is needed, we synchronize the table format first; otherwise we skip it. After that, the targetTable's metadata is synchronized to the target catalogs using the catalogTableIdentifier provided. I will also add a small RFC document in the second PR for clarity.

@vinishjail97 vinishjail97 changed the base branch from main to 590-CatalogSync-API December 18, 2024 10:07
@vinishjail97 vinishjail97 changed the title [590] Add interface for CatalogSyncClient and CatalogSync [590] Add RunCatalogSync utility for synchronizing tables across catalogs Dec 18, 2024
@vinishjail97
Contributor Author

@ashvina I have split this change into two PRs for clarity.

  1. [590] Add interfaces for CatalogSyncClient and CatalogSync #603
  2. [590] Add RunCatalogSync utility for synchronizing tables across catalogs #591

I will push the PR for an RFC doc tomorrow, but I have replied to most of your comments. Regarding the class CatalogTableIdentifier, let's discuss it on the RFC. If we document a compatibility matrix between CatalogTableIdentifier and the naming conventions of the major catalogs, users can easily understand it.

@@ -1,178 +0,0 @@
/*
Contributor Author

Revert this change.

3 participants