Releases: NVIDIA/spark-rapids-ml

v24.10.0 release

21 Nov 07:24
f82cfac

Release notes:

  • Migrated cuML based ivf-flat and ivf-pq to cuVS and added support for cosine distance.
  • Added support for sparse data in UMAP.
  • Added support for NNDescent based k-NN graph building for UMAP.
  • Updated AWS EMR examples to EMR version 7.3.
  • Updated RAPIDS dependencies to 24.10.
  • Dropped support for Python 3.9 (transitive from RAPIDS).
  • Multiple bug and documentation fixes for data generation, CrossValidator, UMAP, DBScan, KMeans, and approximate k-NN implementations.
  • Known issues:
    • LogisticRegression hangs when fitting sparse data with all-zero features in a GPU.
    • Various CUDA errors occur when spark.rapids.ml.uvm.enabled or spark.python.worker.reuse is set to true with multiple GPUs per executor. The workaround is to set either of those configs to false in clusters with multiple GPUs per executor.
    • error in multi-class RandomForest fit when one GPU does not see all class label values.
    • CUDA error when fewer probes than k in ivflat-pq ANN algorithm.
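
The multi-GPU workaround in the known issues above can be applied when assembling job settings. The following is a minimal sketch, assuming the settings are passed as `--conf` flags to `spark-submit`. The helper names `multi_gpu_workaround_conf` and `to_submit_args` are hypothetical and not part of spark-rapids-ml; only the two config keys come from the release notes.

```python
# Hypothetical helpers; only the two config keys are taken from the release notes.
WORKAROUND_KEYS = {
    "uvm": "spark.rapids.ml.uvm.enabled",
    "worker_reuse": "spark.python.worker.reuse",
}


def multi_gpu_workaround_conf(disable="uvm"):
    """Return a Spark conf override that sets one of the two settings to false."""
    if disable not in WORKAROUND_KEYS:
        raise ValueError(f"disable must be one of {sorted(WORKAROUND_KEYS)}")
    return {WORKAROUND_KEYS[disable]: "false"}


def to_submit_args(conf):
    """Flatten a conf dict into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(to_submit_args(multi_gpu_workaround_conf("uvm")))
```

Per the note, turning off either setting is enough, so the helper disables one at a time rather than both.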

pip package available at https://pypi.org/project/spark-rapids-ml/24.10.0/

v24.08.0 release

19 Sep 07:54
7f8e779

Release notes:

  • Removed MAXINT limit on number of non-zero inputs per GPU for sparse logistic regression.
  • IVF-PQ and Cagra were added to the suite of supported approximate nearest neighbor algorithms.
  • Extended benchmarking scripts to be compatible with Databricks runtime 13.3 with the spark-rapids plugin and 14.3 and 15.4 without the plugin.
  • Included an experimental CLI for no-import-statement-change acceleration of pyspark.ml applications.
  • Fixed a slow down for inputs having a large number of columns when type conversion is required.
  • Updated RAPIDS dependencies to 24.08.
  • Known issues to be fixed in next release:
    • for sparse logistic regression fit a low-level C++/CUDA exception is raised if a partition has no non-zero data.
    • array type inputs with int dtypes are not converted to float leading to errors in some algorithms (e.g. cagra ann)
    • in ivf-pq based Cagra the intermediate graph degree must be <= 128 or a low-level C++ exception is raised
    • test_sparse_int64 test requires 256GB host memory to run, not the 128GB stated in the comments
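
Until these issues are fixed, two of them can be caught on the caller's side before the GPU code runs. The sketch below is illustrative only: the function names are hypothetical, and the parameter keys `build_algo` and `intermediate_graph_degree` are assumptions about how the Cagra settings are passed, not something these notes confirm. Only the 128 limit and the int-to-float problem come from the list above.

```python
MAX_INTERMEDIATE_GRAPH_DEGREE = 128  # limit stated in the known issue


def check_cagra_params(algo_params):
    """Fail early with a readable error instead of a low-level C++ exception.

    The keys "build_algo" and "intermediate_graph_degree" are assumed names.
    """
    degree = algo_params.get("intermediate_graph_degree")
    if (
        algo_params.get("build_algo") == "ivf_pq"
        and degree is not None
        and degree > MAX_INTERMEDIATE_GRAPH_DEGREE
    ):
        raise ValueError(
            f"intermediate_graph_degree={degree} exceeds "
            f"{MAX_INTERMEDIATE_GRAPH_DEGREE} for ivf-pq based Cagra"
        )
    return algo_params


def as_float_rows(rows):
    """Convert int-valued feature arrays to floats before passing them on."""
    return [[float(x) for x in row] for row in rows]


if __name__ == "__main__":
    check_cagra_params({"build_algo": "ivf_pq", "intermediate_graph_degree": 64})
    print(as_float_rows([[1, 2], [3, 4]]))
```

In a Spark job the cast would normally be done on the DataFrame column itself; the plain-Python version here only shows the intent.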

pip package available at https://pypi.org/project/spark-rapids-ml/24.08.0/

v24.06.0 release

22 Jul 01:33
c7becc2

Release notes:

  • Double precision support for GPU accelerated logistic regression.
  • Added GPU accelerated IVF-Flat Approximate Nearest Neighbor (ANN) to benchmarking scripts.
  • Improved throughput of GPU accelerated IVF-Flat ANN for large data sets.
  • Update of RAPIDS dependencies to 24.06.

NOTE: For a large number of feature/input columns in float64 type, please use VectorUDT or array type (as opposed to multiple scalar columns) for all algorithms due to a performance issue. This will be resolved in our 24.08 release.

pip package available at https://pypi.org/project/spark-rapids-ml/24.06.0/

v24.04.0 release

16 May 04:05
df01b39

Release notes:

  • Feature standardization in logistic regression for sparse vectors.
  • GPU accelerated Density Based Spatial Clustering for Applications with Noise (DBSCAN) algorithm with example notebook.
  • GPU accelerated IVF-Flat Approximate Nearest Neighbor algorithm with example notebook.
  • Stage level scheduling support for Yarn and K8s.
  • Update of RAPIDS dependencies to 24.04.

pip package available at https://pypi.org/project/spark-rapids-ml/24.04.0/

v24.02.0 release

21 Mar 23:45
e0f644d

Release notes:

  • Support feature standardization in logistic regression for dense vectors.
  • Add large scale synthetic sparse data generation for logistic regression testing.
  • Fix tol=0 in KMeans.
  • Add sparse vectors to logistic regression notebook example.
  • Update RAPIDS dependencies to 24.02.
  • Known Issue: RandomForest training will throw an exception if the label column takes on only a single value. This will be fixed in 24.04.

pip package available at https://pypi.org/project/spark-rapids-ml/24.02.0/

v23.12.0 release

17 Jan 06:10
e8d138b

Release notes:

  • Match Spark's logistic regression fit behavior when data set has only one label value.
  • Support sparse vector based computations through cuML layer in logistic regression fit, transform, and cross validation.
  • Update dataproc benchmark script.
  • Update Azure Databricks instructions.
  • Update RAPIDS dependencies to 23.12.

pip package available at https://pypi.org/project/spark-rapids-ml/23.12.0/

v23.10.0 release

16 Nov 04:16
5f77d4b

Release Notes:

  • L1 and elastic net regularization for GPU accelerated distributed LogisticRegression, with notebook example.
  • More than 2 classes for GPU accelerated distributed LogisticRegression, with notebook example.
  • Optimized fitMultiple api for LogisticRegression.
  • Accelerated cross validation for LogisticRegression and log loss.
  • Output raw prediction column for logistic regression.
  • Updated Databricks init scripts and benchmarking scripts.
  • Improved api docs.
  • Updated RAPIDS dependencies to 23.10.
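
For readers unfamiliar with the L1 and elastic net options, the penalty term can be sketched as below. This follows the convention documented for Spark ML, where `regParam` scales the penalty and `elasticNetParam` mixes L1 (value 1) and L2 (value 0). The function name is hypothetical, and the sketch makes no claim about how the GPU implementation computes the term internally.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).

    alpha is elastic_net_param: 1.0 gives pure L1, 0.0 gives pure L2.
    """
    if not 0.0 <= elastic_net_param <= 1.0:
        raise ValueError("elastic_net_param must be in [0, 1]")
    l1 = sum(abs(w) for w in weights)   # L1 norm of the coefficients
    l2_sq = sum(w * w for w in weights)  # squared L2 norm of the coefficients
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) * 0.5 * l2_sq
    )


if __name__ == "__main__":
    w = [1.0, -2.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, elastic_net_penalty(w, reg_param=0.1, elastic_net_param=alpha))
```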

NOTE: While the runtime is compatible with Spark versions >= 3.3, some scripts in python/tests/ are not compatible with Spark 3.3. This is addressed in 23.12.

pip package available at https://pypi.org/project/spark-rapids-ml/23.10.0/

v23.08.0 release

13 Sep 05:48
5dab107

Release Notes:

  • GPU accelerated distributed Logistic Regression with L2 regularization fit and transform, along with benchmarking and Jupyter notebook examples.
  • GPU accelerated distributed Uniform Manifold Approximation and Projection (UMAP) fit and transform for non-linear dimensionality reduction along with benchmarking and Jupyter notebook examples.
  • Stage level scheduling for training on stand-alone clusters.
  • Improved logging.
  • Preserve input column types during transform.
  • Default to float32 inputs to cuML layer.
  • Support conversion of GPU Logistic Regression models to pySpark ML CPU.
  • Improved local benchmarking script.
  • Updated RAPIDS and RAPIDS Accelerator for Spark dependencies to 23.08.

pip package available at https://pypi.org/project/spark-rapids-ml/23.8.0/

v23.06.0 release

13 Jul 07:25
04dffdf

Release Notes:

  • GPU accelerated CrossValidator for RandomForestClassifier, RandomForestRegressor and LinearRegression, with example notebook
  • Support for CUDA unified virtual memory to allow over-subscription of GPU memory
  • Benchmarking scripts and instructions for AWS EMR
  • Distributed synthetic data generation
  • RandomForest example notebooks
  • Support Spark ML parameters in constructors
  • Improved API docs
  • Updated RAPIDS dependencies to 23.06

pip package available at https://pypi.org/project/spark-rapids-ml/23.6.0/

v23.04.0 release

03 May 19:03
b251734

This release includes:

  • Getting started guide and benchmarking scripts on GCP dataproc
  • Getting started guide on AWS EMR
  • cpu method to convert Spark RAPIDS ML generated models to Spark ML models
  • Eliminating the need for CUDA on the driver node
  • Example notebook for k-NN
  • Spark 3.4 compatibility
  • Updating RAPIDS dependencies to 23.04

pip package available at https://pypi.org/project/spark-rapids-ml/23.4.0/