Skip to content

Conversation

@ishansurdi
Copy link

Context

The _euclidean_distance function in python/pathway/stdlib/ml/classifiers/_knn_lsh.py was missing the square root operation, causing it to return squared Euclidean distances instead of actual Euclidean distances.

Problem:

  • Function name promises Euclidean distance but returns squared distance
  • Euclidean distance formula is: √((x₂-x₁)² + (y₂-y₁)²)
  • Old implementation: np.sum((data_table - query_table) ** 2, axis=1) (missing sqrt)
  • Example: Distance from [0,0] to [3,4] returned 25.0 instead of 5.0

Impact:

  • Affects knn_lsh_classifier_train and knn_lsh_euclidean_classifier_train
  • Distance values returned to users are mathematically incorrect
  • While k-NN neighbor selection still works (ordering is preserved), users receive wrong distance values

How has this been tested?

  1. Created comprehensive test files:

    • test_old_buggy_code.py - Demonstrates the bug with multiple test cases
    • test_new_fixed_code.py - Verifies the fix returns correct distances
  2. Test cases validated:

    • Distance [0,0] → [3,4]: Expected 5.0 (was returning 25.0) ✓
    • Distance [0,0] → [1,1]: Expected 1.4142 (was returning 2.0) ✓
    • Distance [0,0] → [5,12]: Expected 13.0 (was returning 169.0) ✓
    • Distance [0,0] → [1,0]: Expected 1.0 (was returning 1.0) ✓
  3. Code review:

    • Verified the fix is a single-line change adding np.sqrt() wrapper
    • Confirmed no breaking changes to k-NN neighbor selection logic
    • The relative ordering of distances remains the same

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature or improvement (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Related issue(s):

  1. None (bug discovered during code review)

Checklist:

  • My code follows the code style of this project
  • My change requires a change to the documentation
  • I described the modification in the CHANGELOG.md file

The _euclidean_distance function was returning squared Euclidean distance
instead of actual Euclidean distance. This caused k-NN classifiers to
return incorrect distance values to users.

- Added np.sqrt() to compute actual Euclidean distance
- Distances returned from [0,0] to [3,4] now correctly return 5.0 instead of 25.0
- Affects knn_lsh_classifier_train and knn_lsh_euclidean_classifier_train
@CLAassistant
Copy link

CLAassistant commented Jan 1, 2026

CLA assistant check
All committers have signed the CLA.

@zxqfd555 zxqfd555 self-assigned this Jan 5, 2026
@zxqfd555
Copy link
Collaborator

zxqfd555 commented Jan 5, 2026

Hi. I’ve reviewed the bug report, and the observation itself is valid. Indeed, we currently compute the Euclidean distance as the sum of squared coordinate differences, without taking the square root.

However, this behavior is intentional. Computing the square root is a relatively expensive operation, and in our case, it provides no additional benefit because the distance values are only ever compared against each other, not used as absolute metrics. For comparisons, the squared distance and actual distance are strictly equivalent.

While we could technically add the square root, in my opinion, doing so would not improve the codebase. On the contrary, it would introduce unnecessary overhead in performance-critical paths. Even though the returned values are squared distances rather than true Euclidean distances, the primary use case of this code is clustering for nearest neighbor querying, where this distinction does not matter.

That said, I believe there is room for improvement, but in a different direction:

  1. We should clarify this behavior explicitly in the documentation. Namely, that when a distance is returned by certain operators under the Euclidean metric, the value represents the squared distance.
  2. Optionally, we could apply the square root only as a final step, when computing distances for the nearest neighbors, rather than during index construction or general distance evaluation. This would preserve performance for index building and queries, while still allowing users to obtain true distances when they are actually needed.

Given that the current behavior appears to be a deliberate design decision, I’m going to close this pull request.

@zxqfd555 zxqfd555 closed this Jan 5, 2026
@dxtrous
Copy link
Member

dxtrous commented Jan 6, 2026

@zxqfd555 given this is an internal (underscored) function, perhaps it might make sense simply to rename (refactor) it to something like _euclidean_distance_squared?

@zxqfd555
Copy link
Collaborator

zxqfd555 commented Jan 6, 2026

@dxtrous I think it will only solve the problem partly: IIUC, the nearest neighbor query method will return the squared distances for the neighbors, if the with_distances flag is used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants