-
Notifications
You must be signed in to change notification settings - Fork 5
Improve representation of generated clusters #849
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve representation of generated clusters #849
Conversation
✅ Deploy Preview for antenna-preview canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the representation of generated clusters in the Antenna database and interface by updating post‐processing logic and refining cluster usage. Key changes include:
- Calculation and use of silhouette scores and automatic naming/parent assignment for clusters.
- Updates to management commands (update_taxa, update_occurrence_determinations, import_taxa) and API endpoints to support clustering.
- Enhancements in admin actions and job configuration for detection clustering.
Reviewed Changes
Copilot reviewed 104 out of 104 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
ami/main/tests/init.py | Added basic test package initialization |
ami/main/migrations/* | New migrations adding fields and data population logic for taxa, occurrences, and clustering |
ami/main/management/commands/update_taxa.py | CSV update command improvements and data normalization for taxa |
ami/main/management/commands/update_occurrence_determinations.py | Bulk update of occurrence determination fields |
ami/main/management/commands/import_taxa.py | Enhanced taxon creation logic with additional identifiers |
ami/main/api/views.py | Added cluster detections action and updated filtering parameters |
ami/main/api/serializers.py | Extended serializers with cover image and feature fields |
ami/main/admin.py | Introduced admin actions and URL view helpers for classification and detection |
ami/jobs/models.py | Updated detection clustering job logic with revised batching and duplicate status updates |
ami/base/filters.py | Added ThresholdFilter for numeric filtering in API endpoints |
.github/workflows/test.frontend.yml, .github/workflows/test.backend.yml | Updated branch filters for CI pipelines |
reader = csv.DictReader(open(fname)) | ||
taxa = [row for row in reader] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider using a with-statement (e.g., with open(fname) as f:) when opening the CSV file to ensure that the file is properly closed.
reader = csv.DictReader(open(fname)) | |
taxa = [row for row in reader] | |
with open(fname) as file: | |
reader = csv.DictReader(file) | |
taxa = [row for row in reader] |
Copilot uses AI. Check for mistakes.
# Look up existing taxon by name or gbif_taxon_key | ||
# If the taxon already exists, use it and maybe update it | ||
taxon = None | ||
matches = Taxon.objects.filter(Q(name=name) | Q(gbif_taxon_key=gbif_taxon_key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When gbif_taxon_key is None, this filter may unintentionally match records with null values. Consider checking that gbif_taxon_key is not None before including it in the filter.
matches = Taxon.objects.filter(Q(name=name) | Q(gbif_taxon_key=gbif_taxon_key)) | |
query = Q(name=name) | |
if gbif_taxon_key is not None: | |
query |= Q(gbif_taxon_key=gbif_taxon_key) | |
matches = Taxon.objects.filter(query) |
Copilot uses AI. Check for mistakes.
|
||
@admin.display(description="Classification") | ||
def view_classification(self, obj): | ||
url = f"/admin/main/classification/{obj.pk}/change/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider using Django's reverse() function to generate the admin change URL instead of hardcoding the URL path.
url = f"/admin/main/classification/{obj.pk}/change/" | |
from django.urls import reverse | |
url = reverse("admin:main_classification_change", args=[obj.pk]) |
Copilot uses AI. Check for mistakes.
|
||
@classmethod | ||
def run(cls, job: "Job"): | ||
job.update_status(JobState.STARTED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] There are redundant calls to update_status(JobState.STARTED) and setting of started_at in the DetectionClusteringJob.run method; removing the duplicate assignments can simplify the code.
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances cluster post-processing by calculating and tracking silhouette scores, refining how clusters are represented and named, and surfacing representative images in the API.
- Introduce
ClusterMember
to carry detection, classification, score, and features through clustering - Update agglomerative clustering to return silhouette scores and adjust
cluster_detections
to create taxa/classifications using those scores - Annotate and serialize a representative detection image (
cover_image_url
) for taxa in API views and serializers
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
ami/utils/schemas.py |
Add __hash__ to enums so they can be used in sets |
ami/ml/tests.py |
Rename test variables for clarity (cluster_members , cluster_member ) |
ami/ml/clustering_algorithms/cluster_detections.py |
Add ClusterMember dataclass, include silhouette scores, classification linkage, and improved naming |
ami/ml/clustering_algorithms/agglomerative.py |
Return (cluster_ids, silhouette_scores) instead of just IDs |
ami/main/models.py |
Refactor prediction logic to prioritize derived classifications, adjust scoring methods, add find_common_ancestor_taxon |
ami/main/api/views.py |
Annotate best_detection_image_path on taxa queryset |
ami/main/api/serializers.py |
Add cover_image_url field/method to Taxon serializers |
Comments suppressed due to low confidence (6)
ami/ml/clustering_algorithms/cluster_detections.py:76
- New behavior wraps detections in
ClusterMember
with scores/features; consider adding or updating tests to validate that each member's score and classification are correctly populated.
def cluster_detections(
ami/ml/clustering_algorithms/cluster_detections.py:1
- The module uses
np
(e.g.np.asarray
) and referencesnp.ndarray
in type hints but never imports numpy. Addimport numpy as np
at the top.
import dataclasses
ami/main/models.py:2441
- There's a typo in the docstring:
complext
should becomplex
.
In the past, this was a more complext query that returned a single result
ami/main/models.py:3105
- The parameter name
ignore_missing_parents
is documented asignore_rootless
. Update the docstring to match the actual argument name.
Find the common ancestor taxon for a list of taxa.
ami/main/models.py:2432
- Previously the query limited results by max-score per algorithm; now it returns all classifications, which may degrade performance. Consider reintroducing filtering or pagination if result sets can grow large.
Retrieve all classifications for this occurrence in chronological order.
ami/main/api/serializers.py:549
- [nitpick] Minor grammar: use "a QuerySet" instead of "an QuerySet".
# This attribute is added by an QuerySet annotation
Several tests are failing because we changed how determination scores are calculated for human identifications. The tests need to be updated. |
✅ Deploy Preview for antenna-ood canceled.
|
…nickLab/antenna into feat/better-cluster-data
…nickLab/antenna into feat/better-cluster-data
…nickLab/antenna into feat/better-cluster-data
parts = [ | ||
f"Cluster {cluster_id}", | ||
f"(Job {job.pk})" if job else "", | ||
f"{taxon.name}?" if taxon else "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we skip making the taxon name part of the cluster name for now, I think the auto generated cluster names already look quite messy to begin with! We can let users assign better names from UI?
I like that we attempt to assign a parent though (visible below the cluster name).

f"{taxon.name}?" if taxon else "", |
Also thanks for the updates here @mihow! I like the attempt to assign a parent based on common ancestors in the cluster. I guess the problem is that in practice, the clusters member parents seem to be quite varied 😅 Maybe because we are going directly from binary classification to species level classification (I'm thinking since all cluster members have a low classification score, the parent information might also be bit uncertain)? In any case, I think it's helpful to actually see this! There is also a problem with parent information gaps in our taxa database, making a common ancestor sometimes impossible to find. In practice, this is how my clusters ended up after these updates: ![]() For the score, I think we have some problems to resolve for regular occurrences as mentioned above, but for clustered occurrences it's very helpful to see the machine prediction score now instead of 1: ![]() |
Summary
Updates to the post-processing and usage of clusters in the Antenna database & interface.
List of Changes
Related Issues
If the PR closes or is related to an issue, reference it here.
For example, "Closes #<ISSUE_NUMBER>", "Fixes #<ISSUE_NUMBER>" or "Relates to #<ISSUE_NUMBER>" .
See Github Keywords
for more information on this
Detailed Description
.
How to Test the Changes
.
Screenshots
.
Deployment Notes
.
Checklist