Qmeans updates #216
Here is a view of what the algorithm is doing for a reference dataset. The first line (with brackets) shows the bisection split: we start with 1,008 points, which are split into two clusters of 405 and 603, respectively. The next line shows which cluster is selected for the next bisection; previous versions used inertia for this, but now we always select the largest cluster to bisect. In practice this switch means very little, since inertia is still used to determine the cluster groupings, and switching the order in which we process cluster leaves doesn't change the final clustering output -- but it does make the code shorter and easier to understand and debug. The output below alternates between the bisection results (brackets showing the sizes of [leafA leafB]) and the cluster selection. Since we always select the next largest cluster to bisect, sometimes this means going back up the hierarchy to a larger node. Using 'L' and 'R' for the node levels, you can see we start with the right node R, then RR, then L (from the first-level leaf), and so on. Because we can see the individual splits, you can see where the small clusters get created (e.g., a size-7 cluster gets cleaved from a size-65 cluster to make leaves of 58 and 7).
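The select-largest-then-bisect loop described above can be sketched as follows. This is a minimal, numpy-only illustration (function names and the plain 2-means splitter are mine, not the actual qmeans internals):

```python
import numpy as np

def two_means(points, n_iter=20, seed=0):
    """Split one cluster into two with a plain 2-means (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), 2, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest of the two centers
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def bisect_largest(points, max_size):
    """Bisecting clustering that always splits the largest leaf first."""
    leaves = [points]
    while True:
        # select the largest remaining leaf, wherever it sits in the hierarchy
        i = max(range(len(leaves)), key=lambda j: len(leaves[j]))
        if len(leaves[i]) <= max_size:
            break  # the largest leaf is already small enough, so all are
        big = leaves.pop(i)
        labels = two_means(big)
        if (labels == 0).any() and (labels == 1).any():
            leaves.append(big[labels == 0])
            leaves.append(big[labels == 1])
        else:
            # degenerate split (e.g., duplicate points); fall back to halving
            half = len(big) // 2
            leaves.append(big[:half])
            leaves.append(big[half:])
    return leaves
```

Because the loop always pulls the current largest leaf, the stopping test only ever needs to look at that one leaf, which is what lets the later cleanup drop the extra parameters.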
Since the cluster leaves are always processed in order of overall size, we terminate when we hit the max_size threshold.
Here's a version that terminates early when the opt_size parameter is supplied as a termination condition:
(Note that these are raw outputs from qmeans, with no overclustering or pruning applied.)
Here's our output from qmeans, using
...with the plot:
For comparison, here's what I had for output using the same overcluster parameters (
...with the plot:
So, around 300 more stations across an extra 30-something clusters.
Major simplification of qmeans -- both the `min_size` and `opt_size` parameters are gone, leaving this as a single-parameter clustering algorithm (`max_size`). Performance is better at the algorithm level, although since we have a post-processing pipeline, the overall performance increase is less dramatic.
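One consequence of always bisecting the largest leaf is that the termination test collapses to a single comparison. A hypothetical sketch (not the real qmeans code):

```python
def should_stop(leaf_sizes, max_size):
    """With max_size as the only parameter, we stop as soon as the
    largest remaining leaf is already within the size limit."""
    return max(leaf_sizes) <= max_size
```

For example, with leaves of sizes 58, 7, and 40 and max_size=60, no further bisection is needed; a single leaf of 65 would still be split.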