sparsekmeans Package

The sparsekmeans package provides an efficient implementation of the K-means clustering algorithm optimized for sparse data sets. It is designed to handle the high-dimensional, sparse data commonly found in text mining, recommender systems, and bioinformatics. By using an appropriate storage format and sparse matrix multiplication, the package achieves significant speedups in running time while producing clustering results consistent with scikit-learn. In addition, the package is designed to be easy to extend and customize, making it suitable for research and for integration into large-scale machine learning systems.

Installation

Use the following command to install sparsekmeans (Python >= 3.10 is required).

pip install sparsekmeans

Usage

We support two popular K-means algorithms: Lloyd's method and Elkan's method. While both require distance calculations between data points and centroids, Elkan's method reduces some computations by exploiting information from the previous iteration, specifically using the triangle inequality to maintain two distance-related bounds.
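As a conceptual illustration (a sketch of the idea, not the package's internal code), the snippet below shows how the two bounds are relaxed by the triangle inequality after the centroids move, and how they let a point skip the exact distance computation to a centroid that cannot be closer than its current one; all names (upper, lower, assign, centroid_shift) are illustrative.

import numpy as np

def elkan_prune_sketch(upper, lower, assign, centroid_shift):
    """Decide which point-to-centroid distances must be recomputed (sketch).

    upper[i]          : upper bound on dist(x_i, its assigned centroid)
    lower[i, j]       : lower bound on dist(x_i, centroid_j)
    assign[i]         : index of the centroid currently assigned to x_i
    centroid_shift[j] : how far centroid_j moved in the last iteration
    """
    # Relax both bounds via the triangle inequality after the centroids move.
    upper = upper + centroid_shift[assign]
    lower = np.maximum(lower - centroid_shift[None, :], 0.0)
    # dist(x_i, centroid_j) needs to be recomputed only when the upper bound
    # to x_i's own centroid exceeds the lower bound to centroid_j; otherwise
    # centroid_j cannot be closer than the current assignment.
    must_check = upper[:, None] > lower
    return upper, lower, must_check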

When the data set is small (e.g., fewer than 10,000 samples), Lloyd's method is recommended. Lloyd's relies primarily on sparse matrix multiplications, whereas Elkan's mechanism for reducing distance calculations may incur non-negligible overhead on small problems. Conversely, for large data sets or a very large number of clusters (K), Elkan's method may significantly outperform Lloyd's.
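A rough way to encode this guidance in code is shown below (the 10,000-sample threshold is only the heuristic from the paragraph above, not a rule enforced by the package):

from sparsekmeans import LloydKmeans, ElkanKmeans

def make_kmeans(X, n_clusters):
    # Heuristic only: Lloyd's for small data sets, Elkan's for large data
    # sets or a large number of clusters.
    if X.shape[0] < 10_000:
        return LloydKmeans(n_clusters=n_clusters)
    return ElkanKmeans(n_clusters=n_clusters)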

Use the following steps to cluster a data set:

from sparsekmeans import LloydKmeans, ElkanKmeans

# Create a clustering object with either algorithm (choose one).
kmeans = LloydKmeans(n_clusters=100)    # Lloyd's method
# kmeans = ElkanKmeans(n_clusters=100)  # Elkan's method

# X is a sparse data matrix of shape (n_samples, n_features).
labels = kmeans.fit(X)

Users can specify the following parameters when creating the kmeans object; a short example follows the list.

n_clusters : int, default=8
            The number of clusters to form.

n_threads : int, default=max(1, os.cpu_count() // 2)
            The number of threads to use.

max_iter : int, default=300
            Maximum number of iterations of the K-means algorithm.

tol : float, default=1e-4
            Relative tolerance to declare convergence.

random_state : int, RandomState instance or None, default=None
            Determines random number generation for centroid initialization.
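
For example, an object can be created with all parameters specified (the values below are arbitrary illustrations, and X is a sparse data matrix as in the previous snippet):

from sparsekmeans import ElkanKmeans

kmeans = ElkanKmeans(
    n_clusters=100,    # number of clusters
    n_threads=8,       # number of threads
    max_iter=300,      # iteration limit
    tol=1e-4,          # relative tolerance for convergence
    random_state=42,   # reproducible centroid initialization
)
labels = kmeans.fit(X)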

To cluster additional data with previously obtained centroids, use the predict method:

test_labels = kmeans.predict(Xtest)

If users have defined their own centroids and want to use them to cluster data, the package provides a standalone function:

from sparsekmeans import kmeans_predict
test_labels = kmeans_predict(Xtest, user_defined_centroids)
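
For instance, assuming the centroids are supplied as an array with one row per centroid and the same number of columns (features) as the data (a sketch only; the random centroids below are purely illustrative):

import numpy as np
from sparsekmeans import kmeans_predict

# Illustrative centroids: 100 clusters x n_features, matching Xtest's columns.
user_defined_centroids = np.random.rand(100, Xtest.shape[1])
test_labels = kmeans_predict(Xtest, user_defined_centroids)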

Example

The following example clusters a real-world data set (news20 with TF-IDF features) downloaded from the LIBSVM data sets page.

from sparsekmeans import LloydKmeans
from libsvm.svmutil import svm_read_problem
import urllib.request
import bz2

# Download the news20 (TF-IDF) training set from the LIBSVM data sets page
# and read it as a sparse matrix.
url = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/news20_tfidf_train.svm.bz2"
with urllib.request.urlopen(url) as r:
    with bz2.open(r, 'rt') as f:
        _, X = svm_read_problem(f, return_scipy=True)

# Cluster the data into 100 groups with Lloyd's method.
kmeans = LloydKmeans(n_clusters=100)
labels = kmeans.fit(X)
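
After fitting, the returned labels and the fitted object can be used directly, for example to inspect cluster sizes or to assign further data (a small follow-up sketch, assuming the labels are integer cluster indices):

import numpy as np

# Number of points assigned to each of the 100 clusters.
cluster_sizes = np.bincount(np.asarray(labels), minlength=100)
print(cluster_sizes)

# Assign data to the learned centroids; here we simply reuse X for illustration.
new_labels = kmeans.predict(X)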

Benchmark

Running time comparison for Lloyd's algorithm

| Data set | K | sklearn | sparsekmeans | Speedup |
|---|---|---|---|---|
| Wiki-500K | 100 | 24,617.7 | 2,957.3 | 8.32x |
| | 500 | > $10^5$ | 26,104.9 | > 3.83x |
| Amazon-670K | 100 | 670.9 | 123.4 | 5.43x |
| | 500 | 7,170.1 | 776.2 | 9.23x |
| Url | 100 | 799.9 | 163.5 | 4.89x |
| | 500 | 5,888.4 | 987.2 | 5.96x |
| Amazon-3M | 100 | 26,207.5 | 2,359.1 | 11.10x |
| | 500 | > $10^5$ | 39,346.1 | > 2.54x |

Running time comparison for Elkan's algorithm

| Data set | K | sklearn | sparsekmeans | Speedup |
|---|---|---|---|---|
| Wiki-500K | 100 | 4,042.7 | 2,382.1 | 1.69x |
| | 500 | 91,441.8 | 5,061.0 | 18.06x |
| Amazon-670K | 100 | 248.5 | 141.8 | 1.75x |
| | 500 | 1,248.3 | 685.1 | 1.82x |
| Url | 100 | 719.8 | 296.0 | 2.43x |
| | 500 | 4,687.1 | 1,989.7 | 2.35x |
| Amazon-3M | 100 | 2,965.2 | 1,743.4 | 1.70x |
| | 500 | 13,340.0 | 5,517.9 | 2.41x |
