feat(map): Add normalized Average Precision (experimental) #98
Open
shntnu wants to merge 4 commits into main from normalizedAP
Conversation
…ent scoring

- Add normalization.py module with expected AP calculation using harmonic numbers
- Implement AP normalization using (AP - μ₀)/(1 - μ₀) formula for scale independence
- Update average_precision() to always compute normalized AP alongside raw scores
- Update mean_average_precision() to include mean normalized AP in results
- Add comprehensive unit tests for expected AP and normalization functions
- Add integration tests verifying end-to-end pipeline functionality
- Update dependencies: add pytest and scikit-learn to dev group
- Update .gitignore to exclude .claude/ directory

The normalization addresses prevalence bias in raw AP scores, enabling fair comparison across different experimental conditions. Normalized scores range from -1 (worse than random) to 1 (perfect), with 0 representing random performance.

Based on theoretical work from https://github.com/shntnu/expected_ap

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
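For context, a minimal sketch of the normalization described above. It is illustrative only: the function names are hypothetical (not necessarily what normalization.py exposes), and it assumes the standard closed-form expected AP of a uniformly random ranking, computed from harmonic numbers, which may differ in detail from the PR's implementation.

import numpy as np

def harmonic(n: int) -> float:
    # H_n = 1 + 1/2 + ... + 1/n
    return float(np.sum(1.0 / np.arange(1, n + 1)))

def expected_average_precision(n_pos: int, n_total: int) -> float:
    # Expected AP of a uniformly random ranking with n_pos positives among n_total items.
    # Standard closed form (R = n_pos, N = n_total):
    #   E[AP] = H_N / N + (R - 1) * (N - H_N) / (N * (N - 1))
    if n_total == 1:
        return 1.0
    h_n = harmonic(n_total)
    return h_n / n_total + (n_pos - 1) * (n_total - h_n) / (n_total * (n_total - 1))

def normalize_ap(ap: float, n_pos: int, n_total: int) -> float:
    # Rescale so that 0 is random-level performance and 1 is perfect: (AP - mu0) / (1 - mu0)
    mu0 = expected_average_precision(n_pos, n_total)
    return (ap - mu0) / (1.0 - mu0)

# Example: a raw AP of 0.30 with 5 positives among 100 candidates.
print(normalize_ap(0.30, n_pos=5, n_total=100))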
More inspection notes (Claude Opus 4.1, UNVERIFIED)
Code """
Inspection notebook for normalized vs raw mAP
This script explores various cases to understand when and how normalization helps.
"""
import pandas as pd
import numpy as np
import ast
# Load the data
print("Loading data...")
df = pd.read_csv('/Users/shsingh/Documents/GitHub/jump/jump_production/data/processed/copairs/runs/consistency/compound_no_source7__feat_all__consistency_no_target2__all_sources__repurposing/results/consistency_map_results.csv')
df['n_samples'] = df['indices'].apply(lambda x: len(ast.literal_eval(x)))
print("\n" + "="*80)
print("BASIC STATISTICS")
print("="*80)
print(f"\nTotal targets: {len(df)}")
print(f"Sample size range: {df['n_samples'].min()}-{df['n_samples'].max()}")
print(f"Mean sample size: {df['n_samples'].mean():.1f}")
print(f"Median sample size: {df['n_samples'].median():.0f}")
print("\nCorrelations:")
print(f" mAP vs normalized mAP: {df['mean_average_precision'].corr(df['mean_normalized_average_precision']):.4f}")
print(f" mAP vs -log10(p): {df['mean_average_precision'].corr(df['-log10(p-value)']):.4f}")
print(f" normalized mAP vs -log10(p): {df['mean_normalized_average_precision'].corr(df['-log10(p-value)']):.4f}")
print(f" n_samples vs mAP: {df['n_samples'].corr(df['mean_average_precision']):.4f}")
print(f" n_samples vs normalized mAP: {df['n_samples'].corr(df['mean_normalized_average_precision']):.4f}")
print("\n" + "="*80)
print("INSPECTION 1: Targets with same raw mAP but different sample sizes")
print("="*80)
# Group by rounded mAP to find similar values
df['mAP_rounded'] = (df['mean_average_precision'] * 20).round() / 20 # Round to 0.05 increments
for map_val in [0.05, 0.10, 0.15, 0.20]:
    similar = df[(df['mAP_rounded'] == map_val) & (df['n_samples'] > 0)]
    if len(similar) > 2:
        print(f"\nTargets with mAP ≈ {map_val:.2f}:")
        # Show min and max n_samples for this mAP level
        min_n = similar.nsmallest(1, 'n_samples').iloc[0]
        max_n = similar.nlargest(1, 'n_samples').iloc[0]
        if min_n['n_samples'] != max_n['n_samples']:
            print(f" {min_n['Metadata_repurposing_target']:10s}: n={min_n['n_samples']:2.0f}, mAP={min_n['mean_average_precision']:.3f}, norm={min_n['mean_normalized_average_precision']:.3f}")
            print(f" {max_n['Metadata_repurposing_target']:10s}: n={max_n['n_samples']:2.0f}, mAP={max_n['mean_average_precision']:.3f}, norm={max_n['mean_normalized_average_precision']:.3f}")
            print(f" → Normalized mAP difference: {abs(min_n['mean_normalized_average_precision'] - max_n['mean_normalized_average_precision']):.3f}")
print("\n" + "="*80)
print("INSPECTION 2: Distribution by sample size groups")
print("="*80)
# Bin the data by sample size
bins = [0, 3, 5, 10, 20, 100]
labels = ['2-3', '4-5', '6-10', '11-20', '21+']
df['size_group'] = pd.cut(df['n_samples'], bins=bins, labels=labels, include_lowest=True)
print("\n{:<10} {:>6} {:>10} {:>10} {:>10} {:>10}".format(
"Size Group", "Count", "Mean mAP", "Mean Norm", "Min Norm", "Max Norm"
))
print("-" * 70)
for group in labels:
subset = df[df['size_group'] == group]
if len(subset) > 0:
print("{:<10} {:>6} {:>10.3f} {:>10.3f} {:>10.3f} {:>10.3f}".format(
group,
len(subset),
subset['mean_average_precision'].mean(),
subset['mean_normalized_average_precision'].mean(),
subset['mean_normalized_average_precision'].min(),
subset['mean_normalized_average_precision'].max()
))
print("\n" + "="*80)
print("INSPECTION 3: Cases where raw mAP and normalized mAP disagree most")
print("="*80)
# Calculate percentile ranks for both metrics
df['rank_mAP'] = df['mean_average_precision'].rank(pct=True)
df['rank_norm'] = df['mean_normalized_average_precision'].rank(pct=True)
df['rank_diff'] = abs(df['rank_mAP'] - df['rank_norm'])
print("\nTargets with largest rank differences (percentile):")
print("{:<15} {:>4} {:>7} {:>7} {:>10} {:>10} {:>10}".format(
"Target", "n", "mAP", "norm", "mAP rank%", "norm rank%", "diff%"
))
print("-" * 80)
for _, row in df.nlargest(10, 'rank_diff').iterrows():
print("{:<15} {:>4.0f} {:>7.3f} {:>7.3f} {:>10.1f} {:>10.1f} {:>10.1f}".format(
row['Metadata_repurposing_target'][:15],
row['n_samples'],
row['mean_average_precision'],
row['mean_normalized_average_precision'],
row['rank_mAP'] * 100,
row['rank_norm'] * 100,
row['rank_diff'] * 100
))
print("\n" + "="*80)
print("INSPECTION 4: Threshold analysis - what gets selected?")
print("="*80)
thresholds = [0.05, 0.10, 0.15, 0.20]
print("\nUsing different mAP thresholds:")
print("{:<10} {:>15} {:>15} {:>15} {:>15}".format(
"Threshold", "Raw mAP", "Normalized mAP", "Both", "Either"
))
print("-" * 70)
for t in thresholds:
raw_selected = df[df['mean_average_precision'] > t]
norm_selected = df[df['mean_normalized_average_precision'] > t]
both = set(raw_selected['Metadata_repurposing_target']) & set(norm_selected['Metadata_repurposing_target'])
either = set(raw_selected['Metadata_repurposing_target']) | set(norm_selected['Metadata_repurposing_target'])
print("{:<10.2f} {:>15} {:>15} {:>15} {:>15}".format(
t,
len(raw_selected),
len(norm_selected),
len(both),
len(either)
))
print("\n" + "="*80)
print("INSPECTION 5: P-value vs effect size discrepancies")
print("="*80)
# High significance but low effect
high_sig_low_effect = df[(df['-log10(p-value)'] > 3) & (df['mean_normalized_average_precision'] < 0.1)]
print(f"\nHigh significance (-log10(p) > 3) but low effect (norm < 0.1): {len(high_sig_low_effect)} targets")
print("\nExamples:")
print("{:<15} {:>4} {:>7} {:>7} {:>10} {:>10}".format(
"Target", "n", "mAP", "norm", "-log10(p)", "p-value"
))
print("-" * 70)
for _, row in high_sig_low_effect.head(5).iterrows():
print("{:<15} {:>4.0f} {:>7.3f} {:>7.3f} {:>10.2f} {:>10.4f}".format(
row['Metadata_repurposing_target'][:15],
row['n_samples'],
row['mean_average_precision'],
row['mean_normalized_average_precision'],
row['-log10(p-value)'],
row['p_value']
))
# Low significance but high effect
low_sig_high_effect = df[(df['corrected_p_value'] > 0.05) & (df['mean_normalized_average_precision'] > 0.2)]
print(f"\nLow significance (p > 0.05) but high effect (norm > 0.2): {len(low_sig_high_effect)} targets")
print("\nExamples:")
print("{:<15} {:>4} {:>7} {:>7} {:>10} {:>10}".format(
"Target", "n", "mAP", "norm", "-log10(p)", "p-value"
))
print("-" * 70)
for _, row in low_sig_high_effect.head(5).iterrows():
print("{:<15} {:>4.0f} {:>7.3f} {:>7.3f} {:>10.2f} {:>10.4f}".format(
row['Metadata_repurposing_target'][:15],
row['n_samples'],
row['mean_average_precision'],
row['mean_normalized_average_precision'],
row['-log10(p-value)'],
row['p_value']
))
print("\n" + "="*80)
print("INSPECTION 6: Expected vs observed improvement patterns")
print("="*80)
# For different sample sizes, what's the typical improvement?
print("\nImprovement over random by sample size:")
print("{:<15} {:>10} {:>15} {:>15} {:>15}".format(
"Sample Size", "Count", "Mean norm mAP", "Median norm mAP", "% positive"
))
print("-" * 75)
for n in sorted(df['n_samples'].unique())[:10]: # First 10 sample sizes
subset = df[df['n_samples'] == n]
if len(subset) >= 3: # Only show if we have at least 3 examples
pct_positive = (subset['mean_normalized_average_precision'] > 0).mean() * 100
print("{:<15.0f} {:>10} {:>15.3f} {:>15.3f} {:>15.1f}%".format(
n,
len(subset),
subset['mean_normalized_average_precision'].mean(),
subset['mean_normalized_average_precision'].median(),
pct_positive
))
print("\n" + "="*80)
print("INSPECTION 7: Selection comparison with combined criteria")
print("="*80)
# Different selection strategies
strategies = {
    'Raw mAP > 0.1': df[df['mean_average_precision'] > 0.1],
    'Norm mAP > 0.1': df[df['mean_normalized_average_precision'] > 0.1],
    'P < 0.05': df[df['corrected_p_value'] < 0.05],
    'Raw > 0.1 & P < 0.05': df[(df['mean_average_precision'] > 0.1) & (df['corrected_p_value'] < 0.05)],
    'Norm > 0.1 & P < 0.05': df[(df['mean_normalized_average_precision'] > 0.1) & (df['corrected_p_value'] < 0.05)]
}
print("\nSelection strategy comparison:")
print("{:<25} {:>10} {:>15} {:>15}".format(
"Strategy", "Selected", "Mean n_samples", "Mean norm mAP"
))
print("-" * 70)
for name, selected in strategies.items():
if len(selected) > 0:
print("{:<25} {:>10} {:>15.1f} {:>15.3f}".format(
name,
len(selected),
selected['n_samples'].mean(),
selected['mean_normalized_average_precision'].mean()
))
print("\n" + "="*80)
print("INSPECTION 8: Borderline cases around common thresholds")
print("="*80)
# Cases near the threshold boundaries
threshold = 0.1
margin = 0.02
print(f"\nTargets near the {threshold:.1f} threshold (within ±{margin}):")
print("\nJust below threshold (would be excluded):")
print("{:<15} {:>4} {:>7} {:>7} {:>10}".format(
"Target", "n", "mAP", "norm", "Decision"
))
print("-" * 55)
just_below = df[(df['mean_normalized_average_precision'] < threshold) &
(df['mean_normalized_average_precision'] > threshold - margin)]
for _, row in just_below.head(5).iterrows():
raw_pass = "Raw: YES" if row['mean_average_precision'] > threshold else "Raw: NO"
print("{:<15} {:>4.0f} {:>7.3f} {:>7.3f} {:>10}".format(
row['Metadata_repurposing_target'][:15],
row['n_samples'],
row['mean_average_precision'],
row['mean_normalized_average_precision'],
raw_pass
))
print("\nJust above threshold (would be included):")
print("{:<15} {:>4} {:>7} {:>7} {:>10}".format(
"Target", "n", "mAP", "norm", "Decision"
))
print("-" * 55)
just_above = df[(df['mean_normalized_average_precision'] > threshold) &
(df['mean_normalized_average_precision'] < threshold + margin)]
for _, row in just_above.head(5).iterrows():
raw_pass = "Raw: YES" if row['mean_average_precision'] > threshold else "Raw: NO"
print("{:<15} {:>4.0f} {:>7.3f} {:>7.3f} {:>10}".format(
row['Metadata_repurposing_target'][:15],
row['n_samples'],
row['mean_average_precision'],
row['mean_normalized_average_precision'],
raw_pass
))
print("\n" + "="*80)
print("END OF INSPECTION")
print("="*80) CSV: Full hydra folder: |
Summary
Adds normalized Average Precision to enable scale-independent comparison across different prevalences.
What's new
Normalized AP is computed as (AP - μ₀) / (1 - μ₀), where μ₀ is the expected AP under a random ranking.
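For intuition (numbers here are illustrative, not from the PR): with μ₀ = 0.10 and a raw AP of 0.30, the normalized score is (0.30 - 0.10) / (1 - 0.10) ≈ 0.22. A raw AP equal to μ₀ maps to 0, and a raw AP below μ₀ becomes negative.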
Files changed

- src/copairs/map/normalization.py: Core normalization functions
- src/copairs/map/average_precision.py: Always computes normalized AP
- src/copairs/map/map.py: Always computes normalized mAP
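As a usage sketch (column names are taken from the inspection notebook above; the exact results-frame schema is assumed, not confirmed here), downstream selection could combine the new normalized column with the existing corrected p-values:

import pandas as pd

# consistency_map_results.csv is the mAP results file referenced in the inspection notes
results = pd.read_csv("consistency_map_results.csv")
hits = results[
    (results["mean_normalized_average_precision"] > 0.1)
    & (results["corrected_p_value"] < 0.05)
]
print(len(hits), "targets pass both the effect-size and significance filters")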
Next steps

Need to validate the practical utility of normalized scores on real datasets before merging.
🤖 Generated with Claude Code
Co-Authored-By: Claude noreply@anthropic.com