The objective of this project was to implement a simplified Mini-Batch K-Means algorithm on the RCV1 dataset, which consists of more than 800k articles. Each article is characterized by around 50k features, so the full dataset exceeds 250 GB in size. To handle data of this scale, we leveraged the Dask library in Python and three Virtual Machines on the CloudVeneto cluster, after some initial data-reduction steps, based on Natural Language Processing considerations, that led to our choice of how to filter the dataset (see Section 1.5).
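To make the workflow concrete, below is a minimal sketch of the pipeline just described: connect to a Dask cluster, stream the data one row-chunk at a time through a mini-batch clustering model, and predict labels chunk by chunk. It uses scikit-learn's `MiniBatchKMeans` as a stand-in for our simplified implementation, and a random Dask array in place of the preprocessed RCV1 matrix; the cluster address, array shapes, and hyperparameters are illustrative placeholders, not the project's actual values.

```python
import numpy as np
import dask.array as da
from dask.distributed import Client
from sklearn.cluster import MiniBatchKMeans

# Spin up a local Dask cluster; on CloudVeneto one would instead pass the
# scheduler address, e.g. Client("tcp://<scheduler-ip>:8786") (placeholder).
client = Client()

# Stand-in for the filtered RCV1 matrix: a random Dask array chunked by rows
# (the shapes here are illustrative, not the real dataset's).
X = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))

# Stream one row-chunk at a time into MiniBatchKMeans via partial_fit,
# so the full matrix never has to fit in memory at once.
model = MiniBatchKMeans(n_clusters=10, batch_size=1024, random_state=0)
for i in range(X.numblocks[0]):
    batch = X.blocks[i].compute()  # materialize a single chunk in memory
    model.partial_fit(batch)

# Predict chunk by chunk as well, then stitch the label arrays together.
labels = np.concatenate(
    [model.predict(X.blocks[i].compute()) for i in range(X.numblocks[0])]
)
print(labels.shape)  # (100000,)
```

The key design choice this sketch illustrates is that only one chunk is ever materialized at a time, which is what makes a 250 GB dataset tractable on a handful of modest virtual machines.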
The project notebook is structured as follows:
- Introduction:
  - 1.1. Start cluster
  - 1.2. Dataset
  - 1.3. Dask Array
  - 1.4. Preprocessing
  - 1.5. Filtering the dataset
  - 1.6. Load filtered data
  - 1.7. Dataset exploration
  - 1.8. Silhouette analysis
  - 1.9. Metrics - Mini-Batch Clustering
- 2.1. Architecture and Motivations
- 2.2. Toy Example - Results
- 3.1. Mini-Batch size graphs
- 3.2. Mini-Batch and predict rechunk graphs
- 3.3. Number of workers graphs
- 3.4. Is the Euclidean Distance the best objective function for the RCV1 dataset?
This project was developed for the "Management and Analysis of Physics Dataset Module B" course. It was undertaken by Group 19, composed of:
- Golan Rodrigo
- Guercio Tommaso
- Zoppellari Elena