Skip to content

Conversation

mihow
Copy link
Collaborator

@mihow mihow commented May 15, 2025

Summary

Updates to the post-processing and usage of clusters in the Antenna database & interface.

List of Changes

  • Calculate a relevant score to use for cluster members ("silhouette score")
  • Allow cluster scores to be less than species scores but still take precedence as the occurrence's determination
  • Add data from species classifications to derived cluster classifications
  • Update automatic naming of clusters
  • Automatically place cluster under a likely Taxon parent
  • Show a representative occurrence image for Taxa, based on highest score.

Related Issues

If the PR closes or is related to an issue, reference it here.
For example, "Closes #<ISSUE_NUMBER>", "Fixes #<ISSUE_NUMBER>" or "Relates to #<ISSUE_NUMBER>" .

See Github Keywords
for more information on this

Detailed Description

.

How to Test the Changes

.

Screenshots

.

Deployment Notes

.

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

@Copilot Copilot AI review requested due to automatic review settings May 15, 2025 10:06
Copy link

netlify bot commented May 15, 2025

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit 4353b25
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/68276c693e092100085f7e7f

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves the representation of generated clusters in the Antenna database and interface by updating post‐processing logic and refining cluster usage. Key changes include:

  • Calculation and use of silhouette scores and automatic naming/parent assignment for clusters.
  • Updates to management commands (update_taxa, update_occurrence_determinations, import_taxa) and API endpoints to support clustering.
  • Enhancements in admin actions and job configuration for detection clustering.

Reviewed Changes

Copilot reviewed 104 out of 104 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
ami/main/tests/init.py Added basic test package initialization
ami/main/migrations/* New migrations adding fields and data population logic for taxa, occurrences, and clustering
ami/main/management/commands/update_taxa.py CSV update command improvements and data normalization for taxa
ami/main/management/commands/update_occurrence_determinations.py Bulk update of occurrence determination fields
ami/main/management/commands/import_taxa.py Enhanced taxon creation logic with additional identifiers
ami/main/api/views.py Added cluster detections action and updated filtering parameters
ami/main/api/serializers.py Extended serializers with cover image and feature fields
ami/main/admin.py Introduced admin actions and URL view helpers for classification and detection
ami/jobs/models.py Updated detection clustering job logic with revised batching and duplicate status updates
ami/base/filters.py Added ThresholdFilter for numeric filtering in API endpoints
.github/workflows/test.frontend.yml, .github/workflows/test.backend.yml Updated branch filters for CI pipelines

Comment on lines 19 to 20
reader = csv.DictReader(open(fname))
taxa = [row for row in reader]
Copy link
Preview

Copilot AI May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider using a with-statement (e.g., with open(fname) as f:) when opening the CSV file to ensure that the file is properly closed.

Suggested change
reader = csv.DictReader(open(fname))
taxa = [row for row in reader]
with open(fname) as file:
reader = csv.DictReader(file)
taxa = [row for row in reader]

Copilot uses AI. Check for mistakes.

# Look up existing taxon by name or gbif_taxon_key
# If the taxon already exists, use it and maybe update it
taxon = None
matches = Taxon.objects.filter(Q(name=name) | Q(gbif_taxon_key=gbif_taxon_key))
Copy link
Preview

Copilot AI May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When gbif_taxon_key is None, this filter may unintentionally match records with null values. Consider checking that gbif_taxon_key is not None before including it in the filter.

Suggested change
matches = Taxon.objects.filter(Q(name=name) | Q(gbif_taxon_key=gbif_taxon_key))
query = Q(name=name)
if gbif_taxon_key is not None:
query |= Q(gbif_taxon_key=gbif_taxon_key)
matches = Taxon.objects.filter(query)

Copilot uses AI. Check for mistakes.


@admin.display(description="Classification")
def view_classification(self, obj):
url = f"/admin/main/classification/{obj.pk}/change/"
Copy link
Preview

Copilot AI May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider using Django's reverse() function to generate the admin change URL instead of hardcoding the URL path.

Suggested change
url = f"/admin/main/classification/{obj.pk}/change/"
from django.urls import reverse
url = reverse("admin:main_classification_change", args=[obj.pk])

Copilot uses AI. Check for mistakes.


@classmethod
def run(cls, job: "Job"):
job.update_status(JobState.STARTED)
Copy link
Preview

Copilot AI May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] There are redundant calls to update_status(JobState.STARTED) and setting of started_at in the DetectionClusteringJob.run method; removing the duplicate assignments can simplify the code.

Copilot uses AI. Check for mistakes.

@mihow mihow changed the base branch from main to deployments/ood.antenna.insectai.org May 15, 2025 10:13
@mihow mihow requested a review from Copilot May 15, 2025 10:13
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances cluster post-processing by calculating and tracking silhouette scores, refining how clusters are represented and named, and surfacing representative images in the API.

  • Introduce ClusterMember to carry detection, classification, score, and features through clustering
  • Update agglomerative clustering to return silhouette scores and adjust cluster_detections to create taxa/classifications using those scores
  • Annotate and serialize a representative detection image (cover_image_url) for taxa in API views and serializers

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
ami/utils/schemas.py Add __hash__ to enums so they can be used in sets
ami/ml/tests.py Rename test variables for clarity (cluster_members, cluster_member)
ami/ml/clustering_algorithms/cluster_detections.py Add ClusterMember dataclass, include silhouette scores, classification linkage, and improved naming
ami/ml/clustering_algorithms/agglomerative.py Return (cluster_ids, silhouette_scores) instead of just IDs
ami/main/models.py Refactor prediction logic to prioritize derived classifications, adjust scoring methods, add find_common_ancestor_taxon
ami/main/api/views.py Annotate best_detection_image_path on taxa queryset
ami/main/api/serializers.py Add cover_image_url field/method to Taxon serializers
Comments suppressed due to low confidence (6)

ami/ml/clustering_algorithms/cluster_detections.py:76

  • New behavior wraps detections in ClusterMember with scores/features; consider adding or updating tests to validate that each member's score and classification are correctly populated.
def cluster_detections(

ami/ml/clustering_algorithms/cluster_detections.py:1

  • The module uses np (e.g. np.asarray) and references np.ndarray in type hints but never imports numpy. Add import numpy as np at the top.
import dataclasses

ami/main/models.py:2441

  • There's a typo in the docstring: complext should be complex.
In the past, this was a more complext query that returned a single result

ami/main/models.py:3105

  • The parameter name ignore_missing_parents is documented as ignore_rootless. Update the docstring to match the actual argument name.
Find the common ancestor taxon for a list of taxa.

ami/main/models.py:2432

  • Previously the query limited results by max-score per algorithm; now it returns all classifications, which may degrade performance. Consider reintroducing filtering or pagination if result sets can grow large.
Retrieve all classifications for this occurrence in chronological order.

ami/main/api/serializers.py:549

  • [nitpick] Minor grammar: use "a QuerySet" instead of "an QuerySet".
# This attribute is added by an QuerySet annotation

@mihow
Copy link
Collaborator Author

mihow commented May 15, 2025

Several tests are failing because we changed how determination scores are calculated for human identifications. The tests need to be updated.

Copy link

netlify bot commented May 15, 2025

Deploy Preview for antenna-ood canceled.

Name Link
🔨 Latest commit 4353b25
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ood/deploys/68276c69b48ae40008538229

mihow added 3 commits May 15, 2025 19:31
…nickLab/antenna into feat/better-cluster-data
…nickLab/antenna into feat/better-cluster-data
…nickLab/antenna into feat/better-cluster-data
parts = [
f"Cluster {cluster_id}",
f"(Job {job.pk})" if job else "",
f"{taxon.name}?" if taxon else "",
Copy link
Member

@annavik annavik May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we skip making the taxon name part of the cluster name for now, I think the auto generated cluster names already look quite messy to begin with! We can let users assign better names from UI?

I like that we attempt to assign a parent though (visible below the cluster name).

Screenshot 2025-05-16 at 11 23 54
Suggested change
f"{taxon.name}?" if taxon else "",

@annavik
Copy link
Member

annavik commented May 16, 2025

After the classification step, seems all occurrences will get machine prediction score 0 after this change. I think this is problematic both for visualization and for score based sorting and filtering.

Screenshot 2025-05-16 at 10 38 00

@annavik
Copy link
Member

annavik commented May 16, 2025

Also thanks for the updates here @mihow!

I like the attempt to assign a parent based on common ancestors in the cluster. I guess the problem is that in practice, the clusters member parents seem to be quite varied 😅 Maybe because we are going directly from binary classification to species level classification (I'm thinking since all cluster members have a low classification score, the parent information might also be bit uncertain)? In any case, I think it's helpful to actually see this! There is also a problem with parent information gaps in our taxa database, making a common ancestor sometimes impossible to find. In practice, this is how my clusters ended up after these updates:

Screenshot 2025-05-16 at 12 12 26

For the score, I think we have some problems to resolve for regular occurrences as mentioned above, but for clustered occurrences it's very helpful to see the machine prediction score now instead of 1:

Screenshot 2025-05-16 at 11 50 47

@annavik
Copy link
Member

annavik commented May 16, 2025

Question: how is this "relevant score" calculated? I see it's different from the previous machine prediction score?

Screenshot 2025-05-16 at 11 59 45

@mihow mihow merged commit a4dd173 into deployments/ood.antenna.insectai.org May 16, 2025
6 checks passed
@mihow mihow deleted the feat/better-cluster-data branch May 16, 2025 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants