sparsekmeans Package

The sparsekmeans package provides an efficient implementation of the K-means clustering algorithm optimized for sparse data sets. It is designed to handle high-dimensional and sparse data commonly found in text mining, recommender systems, and bioinformatics. By leveraging appropriate storage format and sparse matrix multiplication operations, our package ensures significant speedup in running time while maintaining consistency for clustering results compared with scikit-learn. Besides, the design of the package allows users to easily extend and customize, making it suitable for research or integrating into large-scale machine learning systems.

Installation

Use the following command to install sparsekmeans with python >= 3.10.

pip install sparsekmeans

Usage

We support two popular K-means algorithms: Lloyd's method and Elkan's method. While both require distance calculations between data points and centroids, Elkan's method reduces some computations by exploiting information from the previous iteration, specifically using the triangle inequality to maintain two distance-related bounds.

When the data set is small (e.g., the number of samples is less than 10,000), it is recommended to use Lloyd's. Lloyd's relies primarily on sparse matrix multiplications, whereas Elkan's mechanism to reduce distance calculations may cause non-negligible overhead for small problems. Conversely, for large data sets or the situation of a huge number of clusters (K), Elkan's method may significantly outperforms Lloyd's.

Use the following steps to cluster a data set:

from sparsekmeans import LloydKmeans, ElkanKmeans

kmeans = LloydKmeans(n_clusters=100)
kmeans = ElkanKmeans(n_clusters=100)
labels = kmeans.fit(X)

Users can specify the following parameters while creating the kmeans object.

n_clusters : int, default=8
            The predefined number of clusters

n_threads : int, default=max(1, os.cpu_count() // 2)
            The predefined number of threads to use 

max_iter : int, default=300
            Maximum number of iterations of the k-means algorithm.

tol : float, default=1e-4
            Relative tolerance to declare convergence.

random_state : int, RandomState instance or None, default=None
            Determines random number generation for centroid initialization.

To cluster additional data by using previously obtained centroids, use the following way.

test_labels = kmeans.predict(Xtest)

In case users have defined their own centroids and want to use them for clustering data, our package provides a function to do that:

from sparsekmeans import kmeans_predict
test_labels = kmeans_predict(Xtest, user_defined_centroids)

Example

We show an example to cluster a real-world data set.

from sparsekmeans import LloydKmeans
from libsvm.svmutil import svm_read_problem
import io
import urllib.request
import bz2

with urllib.request.urlopen("https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/news20_tfidf_train.svm.bz2") as r:
    with bz2.open(r, 'rt') as f:
        _, X = svm_read_problem(f, return_scipy=True)

kmeans = LloydKmeans(n_clusters=100)
labels = kmeans.fit(X)

Benchmark

Running time comparison for Lloyd's algorithm

Data sets	𝑲	sklearn	sparsekmeans	Speedup
Wiki-500K	100	24,617.7	2,957.3	8.32x
	500	> $10^5$	26,104.9	> 3.83x
Amazon-670K	100	670.9	123.4	5.43x
	500	7,170.1	776.2	9.23x
Url	100	799.9	163.5	4.89x
	500	5,888.4	987.2	5.96x
Amazon-3M	100	26,207.5	2,359.1	11.10x
	500	> $10^5$	39,346.1	> 2.54x

Running time comparison for Elkan's algorithm

Data sets	𝑲	sklearn	sparsekmeans	Speedup
Wiki-500K	100	4,042.7	2,382.1	1.69x
	500	91,441.8	5,061.0	18.06x
Amazon-670K	100	248.5	141.8	1.75x
	500	1,248.3	685.1	1.82x
Url	100	719.8	296.0	2.43x
	500	4,687.1	1,989.7	2.35x
Amazon-3M	100	2,965.2	1,743.4	1.70x
	500	13,340.0	5,517.9	2.41x

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
sparsekmeans		sparsekmeans
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

sparsekmeans Package

Installation

Usage

Example

Benchmark

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

cjlin1/sparsekmeans

Folders and files

Latest commit

History

Repository files navigation

sparsekmeans Package

Installation

Usage

Example

Benchmark

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages