Qmeans updates #216


Merged
merged 3 commits into demiangomez:master on Apr 23, 2025

Conversation

@espg (Collaborator) commented Apr 23, 2025

Major simplification of qmeans-- both the min_size and opt_size parameters are gone, leaving this as a single-parameter clustering algorithm (max_size). Performance is better at the algorithm level... although since we have a post-processing pipeline, the overall performance increase is less dramatic.

@espg (Collaborator, Author) commented Apr 23, 2025

Here is a view of what the algorithm is doing for a reference dataset-- the first line (with brackets) shows the bisection split; we start with 1,008 points, which are split into two clusters of 405 and 603 (respectively). The next line shows which cluster is selected for bisection next; previous versions used inertia for this, but now we always select the largest cluster to bisect... in practice, this switch means very little, since inertia is still used to determine the cluster groupings, and switching the order in which we process cluster leaves doesn't change the final clustering output-- but it does make the code shorter and easier to understand and debug.

The output below switches back and forth between the bisection results (brackets showing the sizes of [leafA leafB]) and the cluster selection. Since we always select the next-largest cluster to bisect, sometimes this means going up the hierarchy to a larger node. Using 'L' and 'R' for the node levels, you can see we start with the right node R, then RR, then L (from the first-level leaf)... and so on.

Because the individual splits are visible, you can see where the small clusters get created (e.g., a size-7 cluster gets cleaved from a size-65 cluster to make leaves of 58 and 7).
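To make the trace below easier to follow, here is a minimal sketch of the selection loop it comes from-- illustrative only (scikit-learn's KMeans stands in for the actual bisection step, and X is assumed to be the station coordinate array), not the real qmeans code:

import numpy as np
from sklearn.cluster import KMeans

def bisect_largest(X, max_size=25, seed=42):
    # start with a single leaf holding every point index
    leaves = [np.arange(len(X))]
    first = True
    while True:
        # always select the largest remaining leaf for bisection
        leaves.sort(key=len, reverse=True)
        largest = leaves[0]
        if not first:
            print(len(largest))            # the "cluster selection" lines below
        first = False
        if len(largest) <= max_size:
            print('Stop!!!!')              # everything is at or under max_size
            break
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[largest])
        counts = np.bincount(km.labels_, minlength=2)
        print(str(counts) + 'counts')      # the "[leafA leafB]counts" lines below
        leaves = leaves[1:] + [largest[km.labels_ == 0], largest[km.labels_ == 1]]
    return leaves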

[405 603]counts
603
[158 445]counts
445
[286 159]counts
405
[173 232]counts
286
[246  40]counts
246
[ 65 181]counts
232
[186  46]counts
186
[118  68]counts
181
[105  76]counts
173
[89 84]counts
159
[101  58]counts
158
[68 90]counts
118
[ 18 100]counts
105
[45 60]counts
101
[27 74]counts
100
[26 74]counts
90
[65 25]counts
89
[77 12]counts
84
[ 7 77]counts
77
[68  9]counts
77
[22 55]counts
76
[35 41]counts
74
[40 34]counts
74
[31 43]counts
68
[14 54]counts
68
[37 31]counts
68
[39 29]counts
65
[58  7]counts
65
[31 34]counts
60
[24 36]counts
58
[25 33]counts
58
[18 40]counts
55
[51  4]counts
54
[49  5]counts
51
[24 27]counts
49
[ 2 47]counts
47
[28 19]counts
46
[15 31]counts
45
[33 12]counts
43
[19 24]counts
41
[20 21]counts
40
[17 23]counts
40
[20 20]counts
40
[ 4 36]counts
39
[18 21]counts
37
[18 19]counts
36
[11 25]counts
36
[22 14]counts
35
[16 19]counts
34
[22 12]counts
34
[16 18]counts
33
[25  8]counts
33
[22 11]counts
31
[23  8]counts
31
[ 8 23]counts
31
[ 1 30]counts
31
[16 15]counts
30
[17 13]counts
29
[11 18]counts
28
[14 14]counts
27
[17 10]counts
27
[ 8 19]counts
26
[ 4 22]counts
25
Stop!!!!

Since the cluster leaves are always processed in order of overall size, we terminate when we hit the max_size parameter, which here is set to 25. The reason is that evaluating splits of any cluster at or under our max_size setting is unlikely to improve our results... if we have opt_size=16 and max_size=25, then once we only have clusters of size 25 left, any bisection split will definitionally leave us with at least one degenerate cluster (i.e., 25 split toward an optimum of 16 leaves a cluster of size 9). For clusters that are over the max_size parameter, checking whether a split gives an optimum child is irrelevant; the size-65 cluster above, which was split to a non-optimum child node of 7, needs to be split regardless in order to shrink it!
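A quick check of that arithmetic, with the values used here:

max_size, opt_size = 25, 16
# splitting a max_size cluster toward the optimum leaves a degenerate sibling:
print(max_size - opt_size)          # 9 -> under-sized leaf
# opt_size can only constrain a split if both children can reach it:
print(opt_size <= max_size / 2)     # False -> opt_size carries no information here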

In other words, for the opt_size parameter to have any meaning at all, opt_size would need to be <= (max_size / 2), which it usually isn't for our runs. This is why it was eliminated. Here's the cluster size distribution using the single-parameter version above:

output.sum(axis=1)
array([14,  2, 14, 14, 19,  5,  9, 12,  7, 22, 24, 17, 10,  4, 18,  4, 22,
       17, 23, 22, 12, 18, 19, 23,  8, 15,  8, 23, 18, 21, 11, 18, 25, 25,
        8,  7, 25,  1, 17, 13, 16, 18, 22, 11, 12, 24, 11, 25, 16, 19, 20,
       21, 20, 20,  8, 19, 16, 15, 19, 24, 18,  4, 22, 14])

and here's a version where the opt_size parameter is included and used as an additional termination condition:

array([14,  2, 14, 14, 19,  5,  9, 12,  7, 22, 24, 17, 10,  4, 18,  4, 22,
       17, 23, 22, 12, 18, 19, 23,  8, 15,  8, 23, 18, 21, 11, 18, 20,  5,
       11, 14,  8,  7,  8, 17,  1, 17, 13, 16, 18, 22, 11, 12, 24, 11,  6,
       19, 16, 19, 20, 21, 20, 20,  8, 19, 16, 15, 19, 24, 18,  4, 22, 14])

(Note that these are raw outputs from qmeans, with no overclustering or pruning applied)
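(As a quick sanity check on the first distribution-- assuming output is the cluster-by-station membership matrix, so that output.sum(axis=1) gives the per-cluster sizes:)

sizes = output.sum(axis=1)
assert sizes.max() <= 25        # no cluster exceeds max_size
print(len(sizes), sizes.sum())  # 64 clusters covering the 1,008 stations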

@espg (Collaborator, Author) commented Apr 23, 2025

Here's our output from qmeans, using max_size=25, with our current best overcluster parameters (method=linear, neighbors=10, overlap_points=5), i.e., each cluster is expanded by exactly 10 stations:

b.sum(axis=1), len(b), np.sum(b), np.var(b.sum(axis=1))

# individual station counts per cluster
(array([25, 13, 25, 25, 30, 16, 20, 23, 18, 33, 35, 28, 21, 15, 29, 15, 33,
        28, 34, 33, 23, 29, 30, 34, 19, 26, 19, 34, 29, 32, 22, 29, 36, 36,
        19, 18, 36, 12, 28, 24, 27, 29, 33, 22, 23, 35, 22, 36, 27, 30, 31,
        32, 31, 31, 19, 30, 27, 26, 30, 35, 29, 15, 33, 25]),
 64, # number of clusters
 1712, # number of stations
 41.1875) # cluster size variance

...with the plot:

mpl.pyplot.hist(b.sum(axis=1),20, density=2)

[figure: histogram of cluster station counts]
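For anyone curious what the expansion step looks like in spirit, here's a hedged sketch-- not the repo's actual overcluster implementation, and expand is a stand-in for however neighbors/overlap_points map to the number of stations added per cluster:

import numpy as np
from scipy.spatial import cKDTree

def expand_clusters(X, leaves, expand=10):
    # append each cluster's `expand` nearest outside stations to it
    tree = cKDTree(X)
    expanded = []
    for idx in leaves:
        centroid = X[idx].mean(axis=0)
        # query enough neighbors to be able to skip the cluster's own members
        _, nn = tree.query(centroid, k=min(len(idx) + expand, len(X)))
        inside = set(idx.tolist())
        outside = [j for j in nn if j not in inside][:expand]
        expanded.append(np.concatenate([idx, np.array(outside, dtype=idx.dtype)]))
    return expanded

Querying outward from the centroid is just one plausible rule for picking the extra stations; whatever method=linear actually does, the point is that every cluster gains a fixed number of overlap stations before pruning.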

For comparison, here's what I had for output using the same overcluster parameters (method=linear, neighbors=10, overlap_points=5), and "close" parameters in our previous qmeans (min_size=1, max_size=25, opt_size=18):

np.sum(b), np.var(b.sum(axis=1)), mpl.pyplot.hist(b.sum(axis=1),20, density=2) # overlap=2

# individual station counts per cluster
(array([21, 23, 13, 20, 23, 27, 21, 16, 23, 16, 26, 15, 22, 18, 25, 15, 27,
        17, 28, 18, 18, 22, 21, 19, 23, 25, 18, 22, 19, 21, 27, 14, 27, 27,
        26, 23, 23, 25, 20, 16, 14, 28, 25, 24, 19, 13, 15, 25, 15, 17, 25,
        19, 18, 18, 24, 22, 19, 25, 20, 28, 12, 28, 25, 17, 20, 21, 27, 16,
        23, 28, 25, 21, 20, 26, 22, 23, 18, 12, 21, 23, 18, 24, 24, 18, 19,
        28, 24, 22, 18, 28, 21, 20, 25, 15, 27, 19, 25, 20]),
 98,  # number of clusters
 2086,  # number of stations
 18.02040816326531) # cluster size variance

...with the plot:

mpl.pyplot.hist(b.sum(axis=1),20, density=2)

[figure: histogram of cluster station counts]

So, the previous version uses around 370 more stations (2,086 vs. 1,712) across an extra 34 clusters (98 vs. 64).

@demiangomez demiangomez merged commit d0ea797 into demiangomez:master Apr 23, 2025
1 check passed