Releases: NVIDIA/spark-rapids-ml

v25.08.0 release

10 Sep 04:49
884c451

Release notes as follows:

  • Updates RAPIDS dependencies to 25.08
  • Bug fixes in UMAP (Hellinger distance) and RandomForest
  • Drops support for CUDA 11

Known issues:

  • CrossValidator for RandomForest over Spark Connect will fail in Spark 4.0. Fix pending in Spark 4.1

pip package available at https://pypi.org/project/spark-rapids-ml/25.08.0/

v25.06.0 release

25 Jul 01:00
577e5c0

Release notes as follows:

  • Updates RAPIDS dependencies to 25.06
  • Spark Rapids Connect ML plugin improvements:
    • Extends Spark Rapids Connect ML plugin to support accelerated KMeans, LinearRegression, RandomForest regression and classification, and PCA.
    • Adds runtime Spark configs for verbose, float32_inputs, and num_workers so that these can be set over Spark Connect when using the accelerated plugin.
    • Improves transfer of RandomForest models from Python to the JVM.
    • Bundles plugin jar for Spark 4.0 in pip package.
  • Bug fixes in UMAP and in LogisticRegression on large datasets
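The runtime configs listed above can be applied to a live session through PySpark's `spark.conf.set(key, value)`. The following is a minimal sketch, not the plugin's documented interface: the full key names are an assumption (the notes name only verbose, float32_inputs and num_workers; the `spark.rapids.ml.` prefix is borrowed from `spark.rapids.ml.uvm.enabled` mentioned in the 24.10 notes), and `build_runtime_confs` and `apply_runtime_confs` are hypothetical helpers.

```python
from typing import Any, Dict

# ASSUMPTION: the release notes name the settings but not their full keys.
# The prefix below should be checked against the project documentation.
ASSUMED_PREFIX = "spark.rapids.ml."


def build_runtime_confs(verbose: bool, float32_inputs: bool, num_workers: int) -> Dict[str, str]:
    """Return the three settings as string-valued Spark conf entries."""
    if num_workers < 1:
        raise ValueError("num_workers must be a positive integer")
    return {
        ASSUMED_PREFIX + "verbose": str(verbose).lower(),
        ASSUMED_PREFIX + "float32_inputs": str(float32_inputs).lower(),
        ASSUMED_PREFIX + "num_workers": str(num_workers),
    }


def apply_runtime_confs(session: Any, confs: Dict[str, str]) -> None:
    """Set each entry through the session's runtime config (session.conf.set)."""
    for key, value in confs.items():
        session.conf.set(key, value)
```

With a real Spark Connect session this would be called as `apply_runtime_confs(spark, build_runtime_confs(True, False, 4))` before fitting an estimator.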

Known issues:

  • RandomForest inference:
    • may fail on nodes with multiple GPUs. As a workaround, convert the model via the cpu() API for CPU-based inference.
    • may fail for very wide inputs (e.g. > 10000 features).
    • CrossValidator for RandomForest over Spark Connect will fail in Spark 4.0. Fix pending in Spark 4.1

pip package available at https://pypi.org/project/spark-rapids-ml/25.06.0/

v25.04.0 release

19 May 01:03
9ccd267

Release notes as follows:

  • Updates RAPIDS dependencies to 25.04
  • Adds initial version of accelerated Pipeline for special case of VectorAssembler with columnar inputs and any accelerated estimator.
  • Adds cpu fallback for estimator fit() invocations with unsupported parameters.
  • Adds option to eliminate dataset copies during LogisticRegression data loading.
  • Adds a Spark Connect ML Plugin interface implementation targeting Spark 4.0, with support for accelerated LogisticRegression fit, transform and model saving and loading.

pip package available at https://pypi.org/project/spark-rapids-ml/25.04.0/

v25.02.0 release

14 Mar 08:38
34a7b3a

Release notes as follows:

  • Adds pyspark-rapids cli for zero-code change accelerated pyspark shell and Jupyter notebook.
  • Adds example init scripts for setting up zero-code change accelerated notebook environments in cloud Spark clusters.
  • Updates RAPIDS dependencies to 25.02.01
    • Note: Not fully compatible with RAPIDS 25.02.00 so please use the .01 patch release.
  • Fixes UMAP precomputed KNN error message.

pip package available at https://pypi.org/project/spark-rapids-ml/25.02.0/

v24.12.0 release

23 Jan 01:17
5ec5020

Release notes as follows:

  • Enables saving models to cloud storage and precomputed k-NN argument in UMAP.
  • Uses improved precision GPU kernels for mean and variance in logistic regression.
  • Updates RAPIDS dependencies to 24.12.
  • Updates Dataproc notebook and benchmark examples.
  • Multiple bug fixes for multi-gpu nodes, ivf_pq with cagra build, logistic regression training and estimator copy.

pip package available at https://pypi.org/project/spark-rapids-ml/24.12.0/

Known issues:

  • Enabling UVM for DBSCAN and KNN may cause segfaults on some multi-GPU instances.
  • NCCL hangs in some algorithms on some multi-GPU instances.
  • Supplying both the sample fraction param and a precomputed kNN to UMAP can trigger an obscure CUDA error.
  • Model copy with parameter value update results in an error.

v24.10.0 release

21 Nov 07:24
f82cfac

Release notes as follows:

  • Migrated cuML based ivf-flat and ivf-pq to cuVS and added support for cosine distance.
  • Added support for sparse data in UMAP.
  • Added support for NNDescent based k-NN graph building for UMAP.
  • Updated AWS EMR examples to EMR version 7.3.
  • Updated RAPIDS dependencies to 24.10.
  • Dropped support for Python 3.9 (transitive from RAPIDS).
  • Multiple bug and documentation fixes for data generation, CrossValidator, UMAP, DBSCAN, KMeans, and approximate k-NN implementations.
  • Known issues:
    • LogisticRegression hangs when fitting sparse data with all-zero features on a GPU.
    • Various CUDA errors when spark.rapids.ml.uvm.enabled or spark.python.worker.reuse is set to true with multiple GPUs per executor. The workaround is to set either of those configs to false in clusters with multiple GPUs per executor.
    • Error in multi-class RandomForest fit when one GPU does not see all class label values.
    • CUDA error when fewer probes than k in ivflat-pq ANN algorithm.
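The multi-GPU workaround above can be applied when assembling a job's configuration. The following is a minimal sketch: `apply_multi_gpu_workaround` and `to_submit_args` are hypothetical helpers, `spark.executor.resource.gpu.amount` is the standard Spark resource setting used here to detect multiple GPUs per executor, and disabling both configs by default is a conservative reading of the note (pass a single key in `disable` to turn off only one).

```python
from typing import Dict, Iterable, List

GPU_AMOUNT_KEY = "spark.executor.resource.gpu.amount"  # standard Spark resource config
WORKAROUND_KEYS = ("spark.rapids.ml.uvm.enabled", "spark.python.worker.reuse")


def apply_multi_gpu_workaround(
    conf: Dict[str, str], disable: Iterable[str] = WORKAROUND_KEYS
) -> Dict[str, str]:
    """Return a copy of conf with the chosen keys forced to "false"
    when more than one GPU is assigned per executor."""
    patched = dict(conf)
    gpus_per_executor = float(conf.get(GPU_AMOUNT_KEY, "0"))
    if gpus_per_executor > 1:
        for key in disable:
            patched[key] = "false"
    return patched


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render conf entries as spark-submit --conf arguments, sorted by key."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

The input dictionary is left unmodified, so the original settings remain available for single-GPU executors.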

pip package available at https://pypi.org/project/spark-rapids-ml/24.10.0/

v24.08.0 release

19 Sep 07:54
7f8e779

Release notes:

  • Removed MAXINT limit on number of non-zero inputs per GPU for sparse logistic regression.
  • IVF-PQ and Cagra were added to the suite of supported approximate nearest neighbor algorithms.
  • Extended benchmarking scripts to be compatible with Databricks runtime 13.3 with the spark-rapids plugin and 14.3 and 15.4 without the plugin.
  • Included an experimental CLI for no-import-statement-change acceleration of pyspark.ml applications.
  • Fixed a slowdown for inputs having a large number of columns when type conversion is required.
  • Updated RAPIDS dependencies to 24.08.
  • Known issues to be fixed in next release:
    • For sparse logistic regression fit, a low-level C++/CUDA exception is raised if a partition has no non-zero data.
    • Array type inputs with int dtypes are not converted to float, leading to errors in some algorithms (e.g. cagra ann).
    • In ivf-pq based Cagra, the intermediate graph degree must be <= 128 or a low-level C++ exception is raised.
    • The test_sparse_int64 test requires 256GB of host memory to run, not the 128GB stated in the comments.
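Until the intermediate graph degree issue is fixed, the limit can be checked before fitting so that the failure surfaces as a readable Python error. The following is a minimal sketch: `check_cagra_params` is a hypothetical helper, and the parameter dictionary keys (`build_algo`, `intermediate_graph_degree`) and the `"ivf_pq"` value are assumptions about how the Cagra parameters are passed, not confirmed by these notes.

```python
from typing import Any, Mapping

MAX_INTERMEDIATE_GRAPH_DEGREE = 128  # limit quoted in the known issue


def check_cagra_params(algo_params: Mapping[str, Any]) -> None:
    """Raise ValueError if an ivf-pq based Cagra build exceeds the degree limit.

    ASSUMPTION: the key names and the "ivf_pq" value are illustrative.
    """
    if algo_params.get("build_algo") != "ivf_pq":
        return  # the known issue is only described for the ivf-pq based build
    degree = algo_params.get("intermediate_graph_degree")
    if degree is not None and degree > MAX_INTERMEDIATE_GRAPH_DEGREE:
        raise ValueError(
            f"intermediate_graph_degree={degree} exceeds "
            f"{MAX_INTERMEDIATE_GRAPH_DEGREE} for the ivf-pq based build"
        )
```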

pip package available at https://pypi.org/project/spark-rapids-ml/24.08.0/

v24.06.0 release

22 Jul 01:33
c7becc2

Release notes:

  • Double precision support for GPU accelerated logistic regression.
  • Added GPU accelerated IVF-Flat Approximate Nearest Neighbor (ANN) to benchmarking scripts.
  • Improved throughput of GPU accelerated IVF-Flat ANN for large data sets.
  • Update of RAPIDS dependencies to 24.06.

NOTE: For a large number of feature/input columns in float64 type, please use VectorUDT or array type (as opposed to multiple scalar columns) for all algorithms due to a performance issue. This will be resolved in our 24.08 release.

pip package available at https://pypi.org/project/spark-rapids-ml/24.06.0/

v24.04.0 release

16 May 04:05
df01b39

Release notes:

  • Feature standardization in logistic regression for sparse vectors.
  • GPU accelerated Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm with example notebook.
  • GPU accelerated IVF-Flat Approximate Nearest Neighbor algorithm with example notebook.
  • Stage level scheduling support for Yarn and K8s.
  • Update of RAPIDS dependencies to 24.04.

pip package available at https://pypi.org/project/spark-rapids-ml/24.04.0/

v24.02.0 release

21 Mar 23:45
e0f644d

Release notes:

  • Support feature standardization in logistic regression for dense vectors.
  • Add large scale synthetic sparse data generation for logistic regression testing.
  • Fix handling of tol=0 in KMeans.
  • Add sparse vectors to logistic regression notebook example.
  • Update RAPIDS dependencies to 24.02.
  • Known Issue: RandomForest training will throw an exception if the label column takes on only a single value. This will be fixed in 24.04.
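Until the single-value label issue is fixed, the label column can be checked before calling fit. The following is a minimal sketch on a plain Python iterable of label values; `has_multiple_label_values` is a hypothetical helper, and on a Spark DataFrame the equivalent check would be a distinct count over the label column.

```python
from typing import Hashable, Iterable


def has_multiple_label_values(labels: Iterable[Hashable]) -> bool:
    """Return True as soon as a second distinct label value is seen."""
    seen = set()
    for value in labels:
        seen.add(value)
        if len(seen) > 1:
            return True  # stop early; no need to scan the remaining labels
    return False
```

A caller would skip or redirect RandomForest training when this returns False.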

pip package available at https://pypi.org/project/spark-rapids-ml/24.02.0/