0.11.0 (2024-05-27)
Breaking Changes
Some connection behavior may change because of version upgrades. We mark these changes as breaking, although most users will not see any difference.
- Update Clickhouse JDBC driver to latest version (#249):
  - Package was renamed: ru.yandex.clickhouse:clickhouse-jdbc → com.clickhouse:clickhouse-jdbc.
  - Package version changed: 0.3.2 → 0.6.0-patch5.
  - Driver name changed: ru.yandex.clickhouse.ClickHouseDriver → com.clickhouse.jdbc.ClickHouseDriver.

  This brings several fixes for Spark <-> Clickhouse type compatibility, and also Clickhouse cluster support.

  Warning
  New JDBC driver has stricter behavior regarding types:
  - Old JDBC driver applied max(1970-01-01T00:00:00, value) for Timestamp values, as this is the minimal supported value of the DateTime32 Clickhouse type. New JDBC driver doesn't.
  - Old JDBC driver rounded values with higher precision than target column during write. New JDBC driver doesn't.
  - Old JDBC driver replaced NULLs as input for non-Nullable columns with column's DEFAULT value. New JDBC driver doesn't. To enable previous behavior, pass Clickhouse(extra={"nullsAsDefault": 2}) (see documentation).
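
If a pipeline relied on the old clamping of Timestamp values, one option is to apply the same rule before writing. The sketch below shows that rule in plain Python on single values. The function name clamp_to_epoch is hypothetical and is not part of onETL or the JDBC driver, and values are assumed to be naive datetimes.

```python
from datetime import datetime

# Lower bound the old driver applied to Timestamp values (per the note above).
EPOCH = datetime(1970, 1, 1, 0, 0, 0)


def clamp_to_epoch(value: datetime) -> datetime:
    """Mimic the old driver's max(1970-01-01T00:00:00, value) rule (hypothetical helper)."""
    return max(EPOCH, value)


before_epoch = datetime(1969, 12, 31, 23, 59, 59)
after_epoch = datetime(2024, 5, 27, 12, 0, 0)

print(clamp_to_epoch(before_epoch))  # value before the epoch is replaced by the epoch
print(clamp_to_epoch(after_epoch))   # later value is left unchanged
```

In a Spark job the same rule would more likely be written as a column expression applied before the write; that is left out here to keep the sketch dependency-free.
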

- Update other JDBC drivers to latest versions:

- Update MongoDB connector to latest version: 10.1.1 → 10.3.0 (#255, #283).

  This brings Spark 3.5 support.

- Update XML package to latest version: 0.17.0 → 0.18.0 (#259).

  This brings a few bugfixes for datetime format handling.

- For JDBC connections add new SQLOptions class for DB.sql(query, options=...) method (#272).

  Firstly, to keep naming more consistent.

  Secondly, some options are not supported by the DB.sql(...) method, but are supported by DBReader. For example, SQLOptions does not support partitioning_mode and requires explicit definition of lower_bound and upper_bound when num_partitions is greater than 1. ReadOptions does support partitioning_mode and allows skipping lower_bound and upper_bound values.

  This requires some code changes. Before:

      from onetl.connection import Postgres

      postgres = Postgres(...)
      df = postgres.sql(
          """
          SELECT *
          FROM some.mytable
          WHERE key = 'something'
          """,
          options=Postgres.ReadOptions(
              partitioning_mode="range",
              partition_column="id",
              num_partitions=10,
          ),
      )

  After:

      from onetl.connection import Postgres

      postgres = Postgres(...)
      df = postgres.sql(
          """
          SELECT *
          FROM some.mytable
          WHERE key = 'something'
          """,
          options=Postgres.SQLOptions(
              # partitioning_mode is not supported!
              partition_column="id",
              num_partitions=10,
              lower_bound=0,  # <-- set explicitly
              upper_bound=1000,  # <-- set explicitly
          ),
      )
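
To see what the explicit bounds are used for, it helps to look at how a range is split across partitions. The sketch below approximates how Spark's JDBC reader is generally understood to turn lower_bound, upper_bound and num_partitions into per-partition filters. It is a simplified illustration in plain Python, not onETL or Spark code, and the helper name partition_predicates is hypothetical.

```python
from typing import List


def partition_predicates(
    column: str,
    lower_bound: int,
    upper_bound: int,
    num_partitions: int,
) -> List[str]:
    """Split [lower_bound, upper_bound) into equal strides, one filter per partition.

    Simplified assumption: integer column and integer stride.
    The first and last partitions are open-ended, so rows outside the bounds
    are still read; the bounds only decide how the work is spread.
    """
    if num_partitions <= 1:
        return []  # a single partition needs no filter

    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None

        if lower is None:
            # First partition also picks up NULLs and values below lower_bound.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            # Last partition picks up everything above the final boundary.
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for predicate in partition_predicates("id", lower_bound=0, upper_bound=1000, num_partitions=10):
    print(predicate)
```

With the values from the example above (0, 1000, 10 partitions) each partition covers a stride of 100 ids. If the real ids fall mostly outside the chosen bounds, most rows end up in the first or last partition, which is why the bounds are worth choosing deliberately.
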

  For now, DB.sql(query, options=...) can accept ReadOptions to keep backward compatibility, but emits a deprecation warning. The support will be removed in v1.0.0.

- Split up JDBCOptions class into FetchOptions and ExecuteOptions (#274).

  New classes are used by DB.fetch(query, options=...) and DB.execute(query, options=...) methods respectively. This is mostly to keep naming more consistent.

  This requires some code changes. Before:

      from onetl.connection import Postgres

      postgres = Postgres(...)
      df = postgres.fetch(
          "SELECT * FROM some.mytable WHERE key = 'something'",
          options=Postgres.JDBCOptions(
              fetchsize=1000,
              query_timeout=30,
          ),
      )

      postgres.execute(
          "UPDATE some.mytable SET value = 'new' WHERE key = 'something'",
          options=Postgres.JDBCOptions(query_timeout=30),
      )

  After:

      from onetl.connection import Postgres

      # Using FetchOptions for fetching data
      postgres = Postgres(...)
      df = postgres.fetch(
          "SELECT * FROM some.mytable WHERE key = 'something'",
          options=Postgres.FetchOptions(  # <-- change class name
              fetchsize=1000,
              query_timeout=30,
          ),
      )

      # Using ExecuteOptions for executing statements
      postgres.execute(
          "UPDATE some.mytable SET value = 'new' WHERE key = 'something'",
          options=Postgres.ExecuteOptions(query_timeout=30),  # <-- change class name
      )

  For now, DB.fetch(query, options=...) and DB.execute(query, options=...) can accept JDBCOptions to keep backward compatibility, but emit a deprecation warning. The old class will be removed in v1.0.0.

- Serialize ColumnDatetimeHWM to Clickhouse's DateTime64(6) (precision up to microseconds) instead of DateTime (precision up to seconds) (#267).

  In previous onETL versions, the ColumnDatetimeHWM value was rounded to the second, so some rows that had already been read in previous runs were read again, producing duplicates.

  For Clickhouse versions below 21.1, comparing a column of type DateTime with a value of type DateTime64 is not supported and returns an empty dataframe. To avoid this, replace:

      DBReader(
          ...,
          hwm=DBReader.AutoDetectHWM(
              name="my_hwm",
              expression="hwm_column",  # <--
          ),
      )

  with:

      DBReader(
          ...,
          hwm=DBReader.AutoDetectHWM(
              name="my_hwm",
              expression="CAST(hwm_column AS DateTime64)",  # <-- add explicit CAST
          ),
      )
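
The duplicate problem described above can be reproduced with plain datetimes. This sketch assumes an incremental read selects rows whose hwm_column is strictly greater than the stored HWM value (an assumption about the filter, not something stated in this changelog). It drops the fractional part to stand in for the old second-level precision, and the row timestamps are made up.

```python
from datetime import datetime

# Made-up timestamps of rows that the first run has already read.
rows = [
    datetime(2024, 5, 27, 12, 0, 0, 100_000),
    datetime(2024, 5, 27, 12, 0, 0, 700_000),
]

# After the first run, the HWM is the maximum value seen so far.
hwm = max(rows)

# Old behavior (approximated): the value used in the filter loses its microseconds.
hwm_seconds = hwm.replace(microsecond=0)

# Second run: select rows newer than the HWM.
reread_with_seconds = [row for row in rows if row > hwm_seconds]
reread_with_micros = [row for row in rows if row > hwm]

print(len(reread_with_seconds))  # rows read again when the HWM has second precision
print(len(reread_with_micros))   # rows read again when the HWM keeps microseconds
```

Both sample rows fall within the same second as the HWM, so a filter built from the second-precision value selects them again, while the microsecond-precision value does not.
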

- Pass JDBC connection extra params as properties dict instead of URL with query part (#268).

  This allows passing custom connection parameters like Clickhouse(extra={"custom_http_options": "option1=value1,option2=value2"}) without the need to urlencode the parameter value, like option1%3Dvalue1%2Coption2%3Dvalue2.
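
For comparison, this is the encoding step that is no longer needed. The sketch uses only Python's standard urllib.parse on the option string from the last entry above; it does not call onETL.

```python
from urllib.parse import quote, unquote

raw_value = "option1=value1,option2=value2"

# The manual step needed when extra params were embedded into the JDBC URL:
# "=" and "," inside the value had to be percent-encoded.
encoded_value = quote(raw_value, safe="")
print(encoded_value)

# Decoding restores the original string, which can now be passed as-is.
decoded_value = unquote(encoded_value)
print(decoded_value == raw_value)
```
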
Features
Improve user experience with Kafka messages and Database tables with serialized columns, like JSON/XML.
- Allow passing custom package version as argument for DB.get_packages(...) method of several DB connectors:
  - Clickhouse.get_packages(package_version=..., apache_http_client_version=...) (#249).
  - MongoDB.get_packages(scala_version=..., spark_version=..., package_version=...) (#255).
  - MySQL.get_packages(package_version=...) (#253).
  - MSSQL.get_packages(java_version=..., package_version=...) (#254).
  - Oracle.get_packages(java_version=..., package_version=...) (#252).
  - Postgres.get_packages(package_version=...) (#251).
  - Teradata.get_packages(package_version=...) (#256).

  Now users can downgrade or upgrade the connection package without waiting for the next onETL release. Previously only Kafka and Greenplum supported this feature.

- Add FileFormat.parse_column(...) method to several classes:
  - Avro.parse_column(col) (#265).
  - JSON.parse_column(col, schema=...) (#257).
  - CSV.parse_column(col, schema=...) (#258).
  - XML.parse_column(col, schema=...) (#269).

  This allows parsing data in the value field of a Kafka message, or in a string/binary column of some table, as a nested Spark structure.

- Add FileFormat.serialize_column(...) method to several classes:
  - Avro.serialize_column(col) (#265).
  - JSON.serialize_column(col) (#257).
  - CSV.serialize_column(col) (#258).

  This allows saving Spark nested structures or arrays to the value field of a Kafka message, or to a string/binary column of some table.
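
As a rough plain-Python analogy for what these two methods do to a column, the sketch below parses one serialized JSON value into a nested structure and serializes it back. It uses only the standard json module, not onETL or Spark, and the sample message is made up.

```python
import json

# A made-up Kafka message value: serialized JSON stored as bytes.
message_value = b'{"id": 1, "user": {"name": "alice", "tags": ["a", "b"]}}'

# "Parsing" direction: serialized bytes -> nested structure.
parsed = json.loads(message_value)
print(parsed["user"]["tags"])

# "Serializing" direction: nested structure -> bytes again.
serialized = json.dumps(parsed).encode("utf-8")
print(serialized)
```

The onETL methods apply the same idea to whole Spark columns, with the nested structure described by a Spark schema where one is required.
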
Improvements
A few documentation improvements.
- Replace all assert in documentation with doctest syntax. This should make documentation more readable (#273).

- Add generic Troubleshooting guide (#275).

- Improve Kafka documentation:
  - Add "Prerequisites" page describing different aspects of connecting to Kafka.
  - Improve "Reading from" and "Writing to" pages of Kafka documentation, add more examples and usage notes.
  - Add "Troubleshooting" page (#276).

- Improve Hive documentation:
  - Add "Prerequisites" page describing different aspects of connecting to Hive.
  - Improve "Reading from" and "Writing to" pages of Hive documentation, add more examples and recommendations.
  - Improve "Executing statements in Hive" page of Hive documentation (#278).

- Add "Prerequisites" page describing different aspects of using SparkHDFS and SparkS3 connectors (#279).

- Add note about connecting to Clickhouse cluster (#280).

- Add notes about versions when specific class/method/attribute/argument was added, renamed or changed behavior (#282).
Bug Fixes
- Fix missing pysmb package after running pip install onetl[files].