Qmeans updates #216


Merged
merged 3 commits into demiangomez:master on Apr 23, 2025

Conversation

@espg (Collaborator) commented Apr 23, 2025

Major simplification of qmeans-- both the min_size and opt_size parameters are gone, leaving this as a single-parameter clustering algorithm (max_size). Performance is better at the algorithm level... although since we have a post-processing pipeline, the overall performance increase is less dramatic.

@espg (Collaborator, Author) commented Apr 23, 2025

Here is a view of what the algorithm is doing for a reference dataset-- the first line (with brackets) shows the bisection split; we start with 1,008 points, which are split into two clusters of 405 and 603 (respectively). The next line shows which cluster is selected for bisection next; previous versions used inertia for this, but now we always select the largest cluster to bisect... in practice, this switch means very little, since inertia is still used to determine the cluster groupings, and switching the order in which we process cluster leaves doesn't change the final clustering output-- but it does make the code shorter and easier to understand and debug.

The output below switches back and forth between the bisection results (brackets showing the sizes of [leafA leafB]) and the cluster selection. Since we always select the next-largest cluster to bisect, sometimes this means going up the hierarchy to a larger node. Using 'L' and 'R' for the node levels, you can see we start with the right node R, then RR, then L (from the first-level leaf)... and so on.

Because the individual splits are visible, you can see where the small clusters get created (e.g., a size-7 cluster gets cleaved from a size-65 cluster to make leaves of 58 and 7).
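To make the trace below easier to follow, here is a minimal sketch of the selection loop it comes from-- illustrative only (scikit-learn's KMeans stands in for the actual bisection step, and X is assumed to be the station coordinate array), not the real qmeans code:

import numpy as np
from sklearn.cluster import KMeans

def bisect_largest(X, max_size=25, seed=42):
    # start with a single leaf holding every point index
    leaves = [np.arange(len(X))]
    first = True
    while True:
        # always select the largest remaining leaf for bisection
        leaves.sort(key=len, reverse=True)
        largest = leaves[0]
        if not first:
            print(len(largest))            # the "cluster selection" lines below
        first = False
        if len(largest) <= max_size:
            print('Stop!!!!')              # everything is at or under max_size
            break
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[largest])
        counts = np.bincount(km.labels_, minlength=2)
        print(str(counts) + 'counts')      # the "[leafA leafB]counts" lines below
        leaves = leaves[1:] + [largest[km.labels_ == 0], largest[km.labels_ == 1]]
    return leaves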

[405 603]counts
603
[158 445]counts
445
[286 159]counts
405
[173 232]counts
286
[246  40]counts
246
[ 65 181]counts
232
[186  46]counts
186
[118  68]counts
181
[105  76]counts
173
[89 84]counts
159
[101  58]counts
158
[68 90]counts
118
[ 18 100]counts
105
[45 60]counts
101
[27 74]counts
100
[26 74]counts
90
[65 25]counts
89
[77 12]counts
84
[ 7 77]counts
77
[68  9]counts
77
[22 55]counts
76
[35 41]counts
74
[40 34]counts
74
[31 43]counts
68
[14 54]counts
68
[37 31]counts
68
[39 29]counts
65
[58  7]counts
65
[31 34]counts
60
[24 36]counts
58
[25 33]counts
58
[18 40]counts
55
[51  4]counts
54
[49  5]counts
51
[24 27]counts
49
[ 2 47]counts
47
[28 19]counts
46
[15 31]counts
45
[33 12]counts
43
[19 24]counts
41
[20 21]counts
40
[17 23]counts
40
[20 20]counts
40
[ 4 36]counts
39
[18 21]counts
37
[18 19]counts
36
[11 25]counts
36
[22 14]counts
35
[16 19]counts
34
[22 12]counts
34
[16 18]counts
33
[25  8]counts
33
[22 11]counts
31
[23  8]counts
31
[ 8 23]counts
31
[ 1 30]counts
31
[16 15]counts
30
[17 13]counts
29
[11 18]counts
28
[14 14]counts
27
[17 10]counts
27
[ 8 19]counts
26
[ 4 22]counts
25
Stop!!!!

Since the cluster leaves are always processed in order of overall size, we terminate when we hit the max_size parameter, which here is set to 25. The reason is that evaluating splits of any cluster at or under our max_size setting is unlikely to improve our results... if we have opt_size=16 and max_size=25, then once we only have clusters of size 25 left, any bisection split will definitionally leave us with at least one degenerate cluster (i.e., 25 split toward an optimum of 16 leaves a cluster of size 9). For clusters that are over the max_size parameter, checking whether a split gives an optimum child is irrelevant; the size-65 cluster above, which was split to a non-optimum child node of 7, needs to be split regardless in order to shrink it!
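A quick check of that arithmetic, with the values used here:

max_size, opt_size = 25, 16
# splitting a max_size cluster toward the optimum leaves a degenerate sibling:
print(max_size - opt_size)          # 9 -> under-sized leaf
# opt_size can only constrain a split if both children can reach it:
print(opt_size <= max_size / 2)     # False -> opt_size carries no information here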

In other words, for the opt_size parameter to have any meaning at all, opt_size would need to be <= (max_size / 2), which it usually isn't for our runs. This is why it was eliminated. Here's the cluster size distribution using the single-parameter version above:

output.sum(axis=1)
array([14,  2, 14, 14, 19,  5,  9, 12,  7, 22, 24, 17, 10,  4, 18,  4, 22,
       17, 23, 22, 12, 18, 19, 23,  8, 15,  8, 23, 18, 21, 11, 18, 25, 25,
        8,  7, 25,  1, 17, 13, 16, 18, 22, 11, 12, 24, 11, 25, 16, 19, 20,
       21, 20, 20,  8, 19, 16, 15, 19, 24, 18,  4, 22, 14])

and here's a version where the opt_size parameter is included and used as an additional termination condition:

array([14,  2, 14, 14, 19,  5,  9, 12,  7, 22, 24, 17, 10,  4, 18,  4, 22,
       17, 23, 22, 12, 18, 19, 23,  8, 15,  8, 23, 18, 21, 11, 18, 20,  5,
       11, 14,  8,  7,  8, 17,  1, 17, 13, 16, 18, 22, 11, 12, 24, 11,  6,
       19, 16, 19, 20, 21, 20, 20,  8, 19, 16, 15, 19, 24, 18,  4, 22, 14])

(Note that these are raw outputs from qmeans, with no overclustering or pruning applied)
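(As a quick sanity check on the first distribution-- assuming output is the cluster-by-station membership matrix, so that output.sum(axis=1) gives the per-cluster sizes:)

sizes = output.sum(axis=1)
assert sizes.max() <= 25        # no cluster exceeds max_size
print(len(sizes), sizes.sum())  # 64 clusters covering the 1,008 stations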

@espg (Collaborator, Author) commented Apr 23, 2025

Here's our output from qmeans, using max_size=25, with our current best overcluster parameters (method=linear, neighbors=10, overlap_points=5), i.e., each cluster is expanded by exactly 10 stations:

b.sum(axis=1), len(b), np.sum(b), np.var(b.sum(axis=1))

# individual station counts per cluster
(array([25, 13, 25, 25, 30, 16, 20, 23, 18, 33, 35, 28, 21, 15, 29, 15, 33,
        28, 34, 33, 23, 29, 30, 34, 19, 26, 19, 34, 29, 32, 22, 29, 36, 36,
        19, 18, 36, 12, 28, 24, 27, 29, 33, 22, 23, 35, 22, 36, 27, 30, 31,
        32, 31, 31, 19, 30, 27, 26, 30, 35, 29, 15, 33, 25]),
 64, # number of clusters
 1712, # number of stations
 41.1875) # cluster size variance

...with the plot:

mpl.pyplot.hist(b.sum(axis=1),20, density=2)

[figure: histogram of cluster station counts]
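For anyone curious what the expansion step looks like in spirit, here's a hedged sketch-- not the repo's actual overcluster implementation, and expand is a stand-in for however neighbors/overlap_points map to the number of stations added per cluster:

import numpy as np
from scipy.spatial import cKDTree

def expand_clusters(X, leaves, expand=10):
    # append each cluster's `expand` nearest outside stations to it
    tree = cKDTree(X)
    expanded = []
    for idx in leaves:
        centroid = X[idx].mean(axis=0)
        # query enough neighbors to be able to skip the cluster's own members
        _, nn = tree.query(centroid, k=min(len(idx) + expand, len(X)))
        inside = set(idx.tolist())
        outside = [j for j in nn if j not in inside][:expand]
        expanded.append(np.concatenate([idx, np.array(outside, dtype=idx.dtype)]))
    return expanded

Querying outward from the centroid is just one plausible rule for picking the extra stations; whatever method=linear actually does, the point is that every cluster gains a fixed number of overlap stations before pruning.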

For comparison, here's what I had for output using the same overcluster parameters (method=linear, neighbors=10, overlap_points=5), and "close" parameters in our previous qmeans (min_size=1, max_size=25, opt_size=18):

np.sum(b), np.var(b.sum(axis=1)), mpl.pyplot.hist(b.sum(axis=1),20, density=2) # overlap=2

# individual station counts per cluster
(array([21, 23, 13, 20, 23, 27, 21, 16, 23, 16, 26, 15, 22, 18, 25, 15, 27,
        17, 28, 18, 18, 22, 21, 19, 23, 25, 18, 22, 19, 21, 27, 14, 27, 27,
        26, 23, 23, 25, 20, 16, 14, 28, 25, 24, 19, 13, 15, 25, 15, 17, 25,
        19, 18, 18, 24, 22, 19, 25, 20, 28, 12, 28, 25, 17, 20, 21, 27, 16,
        23, 28, 25, 21, 20, 26, 22, 23, 18, 12, 21, 23, 18, 24, 24, 18, 19,
        28, 24, 22, 18, 28, 21, 20, 25, 15, 27, 19, 25, 20]),
 98,  # number of clusters
 2086,  # number of stations
 18.02040816326531) # cluster size variance

...with the plot:

mpl.pyplot.hist(b.sum(axis=1),20, density=2)

[figure: histogram of cluster station counts]

So, the previous version uses around 370 more stations (2,086 vs. 1,712) across an extra 34 clusters (98 vs. 64).

@demiangomez demiangomez merged commit d0ea797 into demiangomez:master Apr 23, 2025
1 check passed