Releases · NVIDIA/spark-rapids-ml

21 Nov 07:24

YanxuanLiu

v24.10

f82cfac

v24.10.0 release Latest


Release notes as follows:

Migrated cuML based ivf-flat and ivf-pq to cuVS and added support for cosine distance.
Added support for sparse data in UMAP.
Added support for NNDescent based k-NN graph building for UMAP.
Updated AWS EMR examples to EMR version 7.3.
Updated RAPIDS dependencies to 24.10.
Dropped support for Python 3.9 (transitive from RAPIDS).
Multiple bug and documentation fixes for data generation, CrossValidator, UMAP, DBScan, KMeans, and approximate k-NN implementations.
Known issues:
- LogisticRegression hangs when fitting sparse data with all-zero features on a GPU.
- Various CUDA errors occur when spark.rapids.ml.uvm.enabled or spark.python.worker.reuse is set to true with multiple GPUs per executor. The workaround is to set either of those configs to false in clusters with multiple GPUs per executor.
- Multi-class RandomForest fit raises an error when one GPU does not see all class label values.
- A CUDA error occurs when the number of probes is fewer than k in the ivflat-pq ANN algorithm.
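
The multi-GPU workaround above can be sketched as a small helper that builds `spark-submit` flags. This is a minimal sketch, not an official script: the two config keys come from the known issue, the helper name `to_spark_submit_args` is hypothetical, and it sets both configs to false, which should satisfy either reading of "either" in the workaround.

```python
# Config keys are taken from the known issue above.
# Assumption: both are set to false, the conservative reading of the workaround.
MULTI_GPU_WORKAROUND = {
    "spark.rapids.ml.uvm.enabled": "false",
    "spark.python.worker.reuse": "false",
}


def to_spark_submit_args(conf):
    """Flatten a {key: value} mapping into ['--conf', 'key=value', ...].

    Hypothetical helper for illustration; keys are emitted in sorted order.
    """
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Prints the flags to append to a spark-submit command line.
    print(" ".join(to_spark_submit_args(MULTI_GPU_WORKAROUND)))
```

The same key/value pairs could instead be passed when building the Spark session; the flag form is shown only because it needs no Spark installation to inspect.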

pip package available at https://pypi.org/project/spark-rapids-ml/24.10.0/


19 Sep 07:54

YanxuanLiu

v24.08

7f8e779

v24.08.0 release

Release notes:

Removed MAXINT limit on number of non-zero inputs per GPU for sparse logistic regression.
IVF-PQ and Cagra were added to the suite of supported approximate nearest neighbor algorithms.
Extended benchmarking scripts to be compatible with Databricks runtime 13.3 with the spark-rapids plugin and 14.3 and 15.4 without the plugin.
Included an experimental CLI for no-import-statement-change acceleration of pyspark.ml applications.
Fixed a slow down for inputs having a large number of columns when type conversion is required.
Updated RAPIDS dependencies to 24.08.
Known issues to be fixed in the next release:
- For sparse logistic regression fit, a low-level C++/CUDA exception is raised if a partition has no non-zero data.
- Array type inputs with int dtypes are not converted to float, leading to errors in some algorithms (e.g. cagra ann).
- In ivf-pq based Cagra, the intermediate graph degree must be <= 128 or a low-level C++ exception is raised.
- The test_sparse_int64 test requires 256GB of host memory to run, not the 128GB stated in the comments.
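
Until the Cagra limit above is fixed, a pre-flight check in user code could turn the low-level C++ exception into a readable Python error. The following is a sketch under stated assumptions: the function `check_cagra_params` is hypothetical, and the parameter keys `build_algo` and `intermediate_graph_degree` (with the value `"ivf_pq"`) are assumed names for illustration, so they should be checked against the library's documentation before use.

```python
MAX_INTERMEDIATE_GRAPH_DEGREE = 128  # limit stated in the known issue above


def check_cagra_params(algo_params):
    """Validate a Cagra parameter dict before it reaches the GPU code.

    Assumed keys (not confirmed by the release notes):
      "build_algo"                 - graph build method, e.g. "ivf_pq"
      "intermediate_graph_degree"  - degree of the intermediate graph
    Returns the dict unchanged when it passes the check.
    """
    if algo_params.get("build_algo") != "ivf_pq":
        # The limit is only documented for ivf-pq based builds.
        return algo_params
    degree = algo_params.get("intermediate_graph_degree")
    if degree is not None and degree > MAX_INTERMEDIATE_GRAPH_DEGREE:
        raise ValueError(
            f"intermediate_graph_degree={degree} exceeds "
            f"{MAX_INTERMEDIATE_GRAPH_DEGREE} for ivf-pq based Cagra"
        )
    return algo_params
```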

pip package available at https://pypi.org/project/spark-rapids-ml/24.08.0/


22 Jul 01:33

YanxuanLiu

v24.06.0

c7becc2

v24.06.0 release

Release notes:

Double precision support for GPU accelerated logistic regression.
Added GPU accelerated IVF-Flat Approximate Nearest Neighbor (ANN) to benchmarking scripts.
Improved throughput of GPU accelerated IVF-Flat ANN for large data sets.
Update of RAPIDS dependencies to 24.06.

NOTE: For a large number of feature/input columns in float64 type, please use VectorUDT or array type (as opposed to multiple scalar columns) for all algorithms due to a performance issue. This will be resolved in our 24.08 release.
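
One way to follow the note above is to pack the scalar feature columns into a single array column before fitting. This is a minimal sketch assuming a PySpark DataFrame: the helper name `pack_feature_columns` and the default output column name `features` are hypothetical, and the PySpark import is deferred so the argument check runs even where Spark is not installed.

```python
def pack_feature_columns(df, feature_cols, output_col="features"):
    """Return df with the scalar feature columns replaced by one array column.

    df           - a pyspark.sql.DataFrame
    feature_cols - names of the scalar feature columns to pack
    output_col   - name of the resulting array column (hypothetical default)
    """
    if not feature_cols:
        raise ValueError("feature_cols must name at least one column")

    # Imported lazily so the check above works without a Spark install.
    from pyspark.sql import functions as F

    # Keep every non-feature column (labels, ids, ...) as it is.
    other_cols = [c for c in df.columns if c not in feature_cols]
    packed = F.array(*[F.col(c) for c in feature_cols]).alias(output_col)
    return df.select(*other_cols, packed)
```

The resulting column can then be named as the features column of an estimator in place of the list of scalar columns.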

pip package available at https://pypi.org/project/spark-rapids-ml/24.06.0/


16 May 04:05

YanxuanLiu

v24.04.0

df01b39

v24.04.0 release

Release notes:

Feature standardization in logistic regression for sparse vectors.
GPU accelerated Density Based Spatial Clustering for Applications with Noise (DBSCAN) algorithm with example notebook.
GPU accelerated IVF-Flat Approximate Nearest Neighbor algorithm with example notebook.
Stage level scheduling support for Yarn and K8s.
Update of RAPIDS dependencies to 24.04.

pip package available at https://pypi.org/project/spark-rapids-ml/24.04.0/


21 Mar 23:45

YanxuanLiu

v24.02.0

e0f644d

v24.02.0 release

Release notes:

Support feature standardization in logistic regression for dense vectors.
Add large scale synthetic sparse data generation for logistic regression testing.
Fix tol=0 in KMeans.
Add sparse vectors to logistic regression notebook example.
Update RAPIDS dependencies to 24.02.
Known Issue: RandomForest training will throw an exception if the label column takes on only a single value. This will be fixed in 24.04.

pip package available at https://pypi.org/project/spark-rapids-ml/24.02.0/


17 Jan 06:10

YanxuanLiu

v23.12.0

e8d138b

v23.12.0 release

Release notes:

Match Spark's logistic regression fit behavior when data set has only one label value.
Support sparse vector based computations through cuML layer in logistic regression fit, transform, and cross validation.
Update dataproc benchmark script.
Update Azure Databricks instructions.
Update RAPIDS dependencies to 23.12.

pip package available at https://pypi.org/project/spark-rapids-ml/23.12.0/


16 Nov 04:16

pxLi

v23.10.0

5f77d4b

v23.10.0 release

Release Notes:

L1 and elastic net regularization for GPU accelerated distributed LogisticRegression, with notebook example.
More than 2 classes for GPU accelerated distributed LogisticRegression, with notebook example.
Optimized fitMultiple api for LogisticRegression.
Accelerated cross validation for LogisticRegression and log loss.
Output raw prediction column for logistic regression.
Updated Databricks init scripts and benchmarking scripts.
Improved api docs.
Updated RAPIDS dependencies to 23.10.

NOTE: While the runtime is compatible with Spark versions >= 3.3, some scripts in python/tests/ are not compatible with Spark 3.3. This is addressed in 23.12.

pip package available at https://pypi.org/project/spark-rapids-ml/23.10.0/


13 Sep 05:48

pxLi

v23.08.0

5dab107

v23.08.0 release

Release Notes:

GPU accelerated distributed Logistic Regression with L2 regularization fit and transform, along with benchmarking and Jupyter notebook examples.
GPU accelerated distributed Uniform Manifold Approximation and Projection (UMAP) fit and transform for non-linear dimensionality reduction along with benchmarking and Jupyter notebook examples.
Stage level scheduling for training on stand-alone clusters.
Improved logging.
Preserve input column types during transform.
Default to float32 inputs to cuML layer.
Support conversion of GPU Logistic Regression models to pySpark ML CPU.
Improved local benchmarking script.
Updated RAPIDS and RAPIDS Accelerator for Spark dependencies to 23.08.

pip package available at https://pypi.org/project/spark-rapids-ml/23.8.0/


13 Jul 07:25

pxLi

v23.06.0

04dffdf

v23.06.0 release

Release Notes:

GPU accelerated CrossValidator for RandomForestClassifier, RandomForestRegressor and LinearRegression, with example notebook
Support for CUDA unified virtual memory to allow over-subscription of GPU memory
Benchmarking scripts and instructions for AWS EMR
Distributed synthetic data generation
RandomForest example notebooks
Support Spark ML parameters in constructors
Improved API docs
Updated RAPIDS dependencies to 23.06

pip package available at https://pypi.org/project/spark-rapids-ml/23.6.0/
