-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[590] Add RunCatalogSync utility for synchronizing tables across catalogs #591
base: 590-CatalogSync-API
Are you sure you want to change the base?
Conversation
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSyncClient.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/model/exception/CatalogRefreshException.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/model/sync/SyncResult.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogUtils.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogUtils.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/model/catalog/CatalogType.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalCatalog.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalCatalog.java
Outdated
Show resolved
Hide resolved
7f307d4
to
06501c1
Compare
I'm pushing another PR for the client side changes for CatalogSync. |
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalCatalog.java
Outdated
Show resolved
Hide resolved
After putting more thought into it, I think we can keep the cross catalog sync as a separate function and not integrate it with TableFormatSync.
This will be separate utility option in RunSync to configure a catalogConfig.yaml file. |
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalTable.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalTable.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/TargetCatalog.java
Outdated
Show resolved
Hide resolved
/** This class represents the unique identifier for a table in a catalog. */ | ||
@Value | ||
@Builder | ||
public class CatalogTableIdentifier { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- What do you think of calling this class
TableIdentifier
. - Also, we should consider making the naming more generic, since it should represent all types of table identifiers. For instance, while
databaseName
is a popular namespace, there's alsoschema
. In some scenarios,databaseName
is synonymous withcatalogName
. The current two-part naming based ontable
anddatabaseName
seems a bit restrictive to me.
Would you mind sharing the use cases this naming caters to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 is okay, included the prefix Catalog because TableIdentifier
is overloaded in the dependent projects (iceberg, hudi, delta etc.) and didn't want to add another identifier with the same name.
For 2, agreed that each catalog or system has a different term for what's called a "logical grouping of tables".
I have looked up this name in different systems and have found databaseName, namespace, schema are the popular ones, a more generic name that comes to my mind is tableCollection
or tableGroup
? If we can't find a generic name, choosing databaseName maybe okay ? Regardless of the name we choose the conversion interfaces for each catalog are responsible for translating the catalog table definition to this representation.
xtable-api/src/main/java/org/apache/xtable/model/catalog/CatalogTableIdentifier.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/conversion/ConversionController.java
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/extractor/ConversionSource.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSyncClient.java
Outdated
Show resolved
Hide resolved
assertDoesNotThrow(() -> RunCatalogSync.main(args)); | ||
} | ||
|
||
public static class TestCatalogImpl implements CatalogConversionSource, CatalogSyncClient { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be simpler to use a mock object here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CatalogConversionFactory
contains methods which use reflection to load the classImpl
provided by the user, mocking this won't be a good assertion from UT's perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw a similar class in another test, should we just have one test implementation we reuse for these?
String getCatalogImpl(); | ||
|
||
/** Returns the storage location of the table synced to the catalog. */ | ||
String getStorageDescriptorLocation(TABLE table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What composes a storage descriptor? is it just a path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's the storage path for the table in the catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just call it getStorageLocation
then?
xtable-core/src/main/java/org/apache/xtable/conversion/ConversionController.java
Outdated
Show resolved
Hide resolved
xtable-api/src/test/java/org/apache/xtable/spi/sync/TestCatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/conversion/ConversionController.java
Show resolved
Hide resolved
Hi @vinishjail97 , this PR seems to cover two distinct features: refactoring to add |
@@ -34,14 +35,21 @@ public class ConversionConfig { | |||
@NonNull SourceTable sourceTable; | |||
// One or more targets to sync the table metadata to | |||
List<TargetTable> targetTables; | |||
// Each target table can be synced to multiple target catalogs, this is map from | |||
// targetTableIdentifier to target catalogs. | |||
Map<String, List<TargetCatalogConfig>> targetCatalogs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious about the expected behavior if XTable
fails to update a subset of the TargetCatalogs
and if that impacts the way incremental sync works?
This may have been covered elsewhere in the PR that I might have missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't change the behavior of existing incremental sync, the failure is returned back as part of this, btw existing XTable users can sync use incremental sync without configuring source/target catalogs, the existing RunSync class in utilities or sync function in ConversionController
is not changing.
// The sync status for each catalog.
List<CatalogSyncStatus> catalogSyncStatusList;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -44,4 +44,8 @@ public TargetTable( | |||
this.metadataRetention = | |||
metadataRetention == null ? Duration.of(7, ChronoUnit.DAYS) : metadataRetention; | |||
} | |||
|
|||
public String getId() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the contract of Id? Could you please add javadoc for public methods of the core classes ?
public class CatalogTableIdentifier { | ||
/** | ||
* Catalogs have the ability to group tables logically, databaseName is the identifier for such | ||
* logical classification. The alternate names for this field include namespace, schemaName etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the nesting vary depending on the database. However, the usage of a 3-part naming is quite common. The convention includes name of database, schema, and table or view. I am wondering if it would be feasible to keep the table Id as a string instead of an object whose structure can vary a lot.
Hi @vinishjail97, This change significantly impacts XTable usage. The current description doesn't provide enough context for me. Could we set up a call or create a document to discuss this change in detail? |
xtable-api/src/main/java/org/apache/xtable/conversion/TargetCatalogConfig.java
Show resolved
Hide resolved
@ashvina I will create two separate PR's to avoid the confusion.
We are not changing the way incremental sync works for table formats, the sync method in |
7ab5659
to
1c07c07
Compare
@ashvina I have split this change two PR's for clarity.
I will push the PR for an RFC doc for this tomorrow but have replied to most of your comments. Regarding the class |
@@ -1,178 +0,0 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert this change.
Important Read
What is the purpose of the pull request
Introduced
RunCatalogSync
utility which does the following on a high level. This unblocks the ability to sync tables from a source catalog to multiple target catalogs, you can look at the sample configuration herextable-utilities/src/test/resources/catalogConfig.yaml
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This change added tests and can be verified as follows:
(example:)