
Conversation

shntnu (Member) commented Sep 7, 2025

Summary

Adds normalized Average Precision (AP) to enable comparison of scores across conditions with different prevalences.

⚠️ Experimental: This feature is experimental and hasn't been validated in practice yet. Keeping unmerged for now pending further testing.

What's new

  • Normalized AP scores using (AP - μ₀) / (1 - μ₀), where μ₀ is the expected AP under random ranking (see the sketch after this list)
  • Based on the exact formula from https://github.com/shntnu/expected_ap
  • Always computed alongside raw AP (minimal overhead)
  • Scale: -∞ to 1, where 0 = random and 1 = perfect; we explicitly clip to [-1, 1] for simplicity (and numerical stability)
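
To make the formula concrete, here is a minimal sketch of the computation (illustrative names only, not necessarily the API in normalization.py), assuming k positive pairs among n ranked candidates and the standard harmonic-number closed form for expected AP under random ranking:

```python
import numpy as np

def expected_random_ap(n_pos: int, n_total: int) -> float:
    """Exact expected AP (mu_0) of a uniformly random ranking.

    n_pos positives among n_total ranked items; uses the closed form
    E[AP] = (k - 1)/(n - 1) + H_n * (n - k) / (n * (n - 1)),
    with H_n the n-th harmonic number.
    """
    k, n = n_pos, n_total
    if not (1 <= k < n):
        raise ValueError("need 1 <= n_pos < n_total")
    h_n = np.sum(1.0 / np.arange(1, n + 1))  # harmonic number H_n
    return (k - 1) / (n - 1) + h_n * (n - k) / (n * (n - 1))

def normalize_ap(ap: float, n_pos: int, n_total: int) -> float:
    """(AP - mu_0) / (1 - mu_0), clipped to [-1, 1]: 0 = random, 1 = perfect."""
    mu0 = expected_random_ap(n_pos, n_total)
    return float(np.clip((ap - mu0) / (1.0 - mu0), -1.0, 1.0))

# The same raw AP is further above chance when positives are rare:
print(normalize_ap(0.30, n_pos=2, n_total=100))   # ≈ 0.25
print(normalize_ap(0.30, n_pos=20, n_total=100))  # ≈ 0.09
```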

Files changed

  • src/copairs/map/normalization.py: Core normalization functions
  • src/copairs/map/average_precision.py: Always computes normalized AP
  • src/copairs/map/map.py: Always computes normalized mAP
  • Tests added for validation

Next steps

Need to validate the practical utility of normalized scores on real datasets before merging.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

shntnu and others added 3 commits September 7, 2025 18:32
…ent scoring

- Add normalization.py module with expected AP calculation using harmonic numbers
- Implement AP normalization using (AP - μ₀)/(1 - μ₀) formula for scale independence
- Update average_precision() to always compute normalized AP alongside raw scores
- Update mean_average_precision() to include mean normalized AP in results
- Add comprehensive unit tests for expected AP and normalization functions (a property-test sketch follows below)
- Add integration tests verifying end-to-end pipeline functionality
- Update dependencies: add pytest and scikit-learn to dev group
- Update .gitignore to exclude .claude/ directory

The normalization addresses prevalence bias in raw AP scores, enabling fair
comparison across different experimental conditions. After clipping, normalized scores
range from -1 (worse than random) to 1 (perfect), with 0 representing random performance.

Based on theoretical work from https://github.com/shntnu/expected_ap
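
As a rough illustration of the properties such unit tests could pin down (hypothetical helper names, repeated here so the snippet runs standalone; the actual tests are the ones added in this PR):

```python
import numpy as np
import pytest

def mu0(k: int, n: int) -> float:
    """Closed-form expected AP of a random ranking: k positives among n items."""
    return (k - 1) / (n - 1) + np.sum(1.0 / np.arange(1, n + 1)) * (n - k) / (n * (n - 1))

def norm_ap(ap: float, k: int, n: int) -> float:
    """(AP - mu0) / (1 - mu0), clipped to [-1, 1]."""
    return float(np.clip((ap - mu0(k, n)) / (1 - mu0(k, n)), -1, 1))

def test_random_baseline_maps_to_zero():
    assert norm_ap(mu0(5, 50), 5, 50) == pytest.approx(0.0)

def test_perfect_ranking_maps_to_one():
    assert norm_ap(1.0, 5, 50) == pytest.approx(1.0)

def test_scores_are_clipped_to_minus_one():
    # AP far below the chance baseline is clipped rather than running off toward -inf
    assert norm_ap(0.0, 49, 50) == -1.0
```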

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
shntnu (Member, Author) commented Sep 8, 2025

More inspection notes

(Claude Opus 4.1, UNVERIFIED)

  1. High correlation (0.9996) between raw and normalized mAP, but this masks important differences at decision boundaries
  2. Same raw mAP, different meaning: Targets with similar raw mAP values show different normalized values depending on sample size (e.g., mAP ≈ 0.20 shows norm mAP ranging from 0.139 to 0.218; see the toy demo after this list)
  3. Threshold selection matters: At 0.1 threshold, raw and normalized select the same 83 targets, but borderline cases show differences (e.g., ADRB2 with raw=0.118 has norm=0.098, just below threshold)
  4. Sample size patterns: Smaller sample sizes (2-3) have more variable results with only 35-48% showing positive normalized mAP, while larger samples (6+) are almost always positive
  5. Statistical vs practical significance: 14 targets show high statistical significance (-log10(p) > 3) but low practical effect (norm < 0.1)
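
A toy calculation of the mechanism behind point 2 (made-up pair counts, using the closed-form expected AP rather than the actual data): with more positive pairs per ranked list the chance baseline μ₀ rises, so the same raw AP normalizes to a smaller value.

```python
import numpy as np

def mu0(n_pos: int, n_total: int) -> float:
    """Closed-form expected AP of a random ranking (harmonic-number formula)."""
    h_n = np.sum(1.0 / np.arange(1, n_total + 1))
    return (n_pos - 1) / (n_total - 1) + h_n * (n_total - n_pos) / (n_total * (n_total - 1))

raw_ap = 0.20
for n_pos in (1, 5, 30):            # hypothetical positive-pair counts per query
    baseline = mu0(n_pos, 1_000)    # hypothetical list of 1,000 ranked candidates
    norm = (raw_ap - baseline) / (1 - baseline)
    print(f"n_pos={n_pos:>2}  mu0={baseline:.3f}  normalized AP={norm:.3f}")
# n_pos= 1  mu0=0.007  normalized AP=0.194
# n_pos= 5  mu0=0.011  normalized AP=0.191
# n_pos=30  mu0=0.036  normalized AP=0.170
```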

================================================================================
BASIC STATISTICS
================================================================================

Total targets: 322
Sample size range: 2-31
Mean sample size: 5.4
Median sample size: 3

Correlations:
  mAP vs normalized mAP: 0.9996
  mAP vs -log10(p): 0.6953
  normalized mAP vs -log10(p): 0.6818
  n_samples vs mAP: 0.0782
  n_samples vs normalized mAP: 0.0509

================================================================================
INSPECTION 1: Targets with same raw mAP but different sample sizes
================================================================================

Targets with mAP ≈ 0.05:
  PAK1      : n= 2, mAP=0.049, norm=0.041
  EGFR      : n=31, mAP=0.066, norm=0.022
  → Normalized mAP difference: 0.018

Targets with mAP ≈ 0.10:
  CDK6      : n= 2, mAP=0.110, norm=0.101
  HTR2C     : n=20, mAP=0.107, norm=0.078
  → Normalized mAP difference: 0.023

Targets with mAP ≈ 0.15:
  HDAC10    : n= 4, mAP=0.132, norm=0.122
  MAPK14    : n=18, mAP=0.135, norm=0.110
  → Normalized mAP difference: 0.013

Targets with mAP ≈ 0.20:
  METAP2    : n= 2, mAP=0.225, norm=0.218
  HTR2A     : n=29, mAP=0.176, norm=0.139
  → Normalized mAP difference: 0.079

================================================================================
INSPECTION 2: Distribution by sample size groups
================================================================================

Size Group  Count   Mean mAP  Mean Norm   Min Norm   Max Norm
----------------------------------------------------------------------
2-3           173      0.077      0.068     -0.008      1.000
4-5            55      0.201      0.192     -0.008      1.000
6-10           49      0.129      0.114     -0.006      0.887
11-20          38      0.116      0.094     -0.007      0.339
21+             7      0.189      0.157      0.007      0.357

================================================================================
INSPECTION 3: Cases where raw mAP and normalized mAP disagree most
================================================================================

Targets with largest rank differences (percentile):
Target             n     mAP    norm  mAP rank% norm rank%      diff%
--------------------------------------------------------------------------------
MET               11   0.013  -0.007       41.6        9.9       31.7
CYP3A4            14   0.021  -0.004       46.9       29.5       17.4
PIM1               4   0.004  -0.008       17.1        0.3       16.8
FGFR1              6   0.008  -0.006       31.1       16.5       14.6
PRKCE              4   0.004  -0.007       21.4        8.7       12.7
FLT1               6   0.009  -0.005       33.2       22.7       10.6
KDR               22   0.041   0.007       55.3       45.3        9.9
NR1I2              3   0.003  -0.007       14.6        5.3        9.3
EGFR              31   0.066   0.022       64.0       55.0        9.0
LRRK2              3   0.003  -0.007       14.3        6.2        8.1

================================================================================
INSPECTION 4: Threshold analysis - what gets selected?
================================================================================

Using different mAP thresholds:
Threshold          Raw mAP  Normalized mAP            Both          Either
----------------------------------------------------------------------
0.05                   132             118             118             132
0.10                    92              83              83              92
0.15                    73              68              68              73
0.20                    55              49              49              55

================================================================================
INSPECTION 5: P-value vs effect size discrepancies
================================================================================

High significance (-log10(p) > 3) but low effect (norm < 0.1): 14 targets

Examples:
Target             n     mAP    norm  -log10(p)    p-value
----------------------------------------------------------------------
ADRA1A            17   0.099   0.073       4.28     0.0000
ADRA1B            14   0.086   0.063       4.28     0.0000
ADRA1D            10   0.085   0.067       4.28     0.0000
ADRA2A            12   0.064   0.043       4.01     0.0000
ADRA2C             7   0.105   0.090       3.40     0.0001

Low significance (p > 0.05) but high effect (norm > 0.2): 0 targets

Examples:
Target             n     mAP    norm  -log10(p)    p-value
----------------------------------------------------------------------

================================================================================
INSPECTION 6: Expected vs observed improvement patterns
================================================================================

Improvement over random by sample size:
Sample Size          Count   Mean norm mAP Median norm mAP      % positive
---------------------------------------------------------------------------
2                      107           0.077          -0.005            35.5%
3                       66           0.054          -0.000            48.5%
4                       41           0.174           0.069            80.5%
5                       14           0.245           0.082            78.6%
6                       12           0.113           0.023            75.0%
7                        9           0.092           0.030           100.0%
8                        8           0.160           0.139           100.0%
9                        8           0.087           0.078           100.0%
10                      12           0.118           0.057           100.0%
11                       6           0.087           0.023            83.3%

================================================================================
INSPECTION 7: Selection comparison with combined criteria
================================================================================

Selection strategy comparison:
Strategy                    Selected  Mean n_samples   Mean norm mAP
----------------------------------------------------------------------
Raw mAP > 0.1                     92             7.5           0.327
Norm mAP > 0.1                    83             7.3           0.352
P < 0.05                         137             7.9           0.237
Raw > 0.1 & P < 0.05              92             7.5           0.327
Norm > 0.1 & P < 0.05             83             7.3           0.352

================================================================================
INSPECTION 8: Borderline cases around common thresholds
================================================================================

Targets near the 0.1 threshold (within ±0.02):

Just below threshold (would be excluded):
Target             n     mAP    norm   Decision
-------------------------------------------------------
ADRA2C             7   0.105   0.090   Raw: YES
ADRB2             12   0.118   0.098   Raw: YES
CDK2              17   0.109   0.084   Raw: YES
DRD3              12   0.112   0.092   Raw: YES
KDM1A              2   0.091   0.083    Raw: NO

Just above threshold (would be included):
Target             n     mAP    norm   Decision
-------------------------------------------------------
ADRA2B             7   0.116   0.101   Raw: YES
CDK6               2   0.110   0.101   Raw: YES
DRD4               9   0.131   0.115   Raw: YES
MAOA               3   0.120   0.111   Raw: YES
MAPK11             8   0.123   0.107   Raw: YES

================================================================================
END OF INSPECTION
================================================================================

Code

"""
Inspection notebook for normalized vs raw mAP
This script explores various cases to understand when and how normalization helps.
"""

import pandas as pd
import numpy as np
import ast

# Load the data
print("Loading data...")
df = pd.read_csv('/Users/shsingh/Documents/GitHub/jump/jump_production/data/processed/copairs/runs/consistency/compound_no_source7__feat_all__consistency_no_target2__all_sources__repurposing/results/consistency_map_results.csv')
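# 'indices' is a stringified list per target; its length is taken as that target's
# sample size (n_samples) throughout the inspections below.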
df['n_samples'] = df['indices'].apply(lambda x: len(ast.literal_eval(x)))

print("\n" + "="*80)
print("BASIC STATISTICS")
print("="*80)

print(f"\nTotal targets: {len(df)}")
print(f"Sample size range: {df['n_samples'].min()}-{df['n_samples'].max()}")
print(f"Mean sample size: {df['n_samples'].mean():.1f}")
print(f"Median sample size: {df['n_samples'].median():.0f}")

print("\nCorrelations:")
print(f"  mAP vs normalized mAP: {df['mean_average_precision'].corr(df['mean_normalized_average_precision']):.4f}")
print(f"  mAP vs -log10(p): {df['mean_average_precision'].corr(df['-log10(p-value)']):.4f}")
print(f"  normalized mAP vs -log10(p): {df['mean_normalized_average_precision'].corr(df['-log10(p-value)']):.4f}")
print(f"  n_samples vs mAP: {df['n_samples'].corr(df['mean_average_precision']):.4f}")
print(f"  n_samples vs normalized mAP: {df['n_samples'].corr(df['mean_normalized_average_precision']):.4f}")

print("\n" + "="*80)
print("INSPECTION 1: Targets with same raw mAP but different sample sizes")
print("="*80)

# Group by rounded mAP to find similar values
df['mAP_rounded'] = (df['mean_average_precision'] * 20).round() / 20  # Round to 0.05 increments

for map_val in [0.05, 0.10, 0.15, 0.20]:
    similar = df[(df['mAP_rounded'] == map_val) & (df['n_samples'] > 0)]
    if len(similar) > 2:
        print(f"\nTargets with mAP ≈ {map_val:.2f}:")
        # Show min and max n_samples for this mAP level
        min_n = similar.nsmallest(1, 'n_samples').iloc[0]
        max_n = similar.nlargest(1, 'n_samples').iloc[0]
        if min_n['n_samples'] != max_n['n_samples']:
            print(f"  {min_n['Metadata_repurposing_target']:10s}: n={min_n['n_samples']:2.0f}, mAP={min_n['mean_average_precision']:.3f}, norm={min_n['mean_normalized_average_precision']:.3f}")
            print(f"  {max_n['Metadata_repurposing_target']:10s}: n={max_n['n_samples']:2.0f}, mAP={max_n['mean_average_precision']:.3f}, norm={max_n['mean_normalized_average_precision']:.3f}")
            print(f"  → Normalized mAP difference: {abs(min_n['mean_normalized_average_precision'] - max_n['mean_normalized_average_precision']):.3f}")

print("\n" + "="*80)
print("INSPECTION 2: Distribution by sample size groups")
print("="*80)

# Bin the data by sample size
bins = [0, 3, 5, 10, 20, 100]
labels = ['2-3', '4-5', '6-10', '11-20', '21+']
df['size_group'] = pd.cut(df['n_samples'], bins=bins, labels=labels, include_lowest=True)

print("\n{:<10} {:>6} {:>10} {:>10} {:>10} {:>10}".format(
    "Size Group", "Count", "Mean mAP", "Mean Norm", "Min Norm", "Max Norm"
))
print("-" * 70)

for group in labels:
    subset = df[df['size_group'] == group]
    if len(subset) > 0:
        print("{:<10} {:>6} {:>10.3f} {:>10.3f} {:>10.3f} {:>10.3f}".format(
            group,
            len(subset),
            subset['mean_average_precision'].mean(),
            subset['mean_normalized_average_precision'].mean(),
            subset['mean_normalized_average_precision'].min(),
            subset['mean_normalized_average_precision'].max()
        ))

print("\n" + "="*80)
print("INSPECTION 3: Cases where raw mAP and normalized mAP disagree most")
print("="*80)

# Calculate percentile ranks for both metrics
df['rank_mAP'] = df['mean_average_precision'].rank(pct=True)
df['rank_norm'] = df['mean_normalized_average_precision'].rank(pct=True)
df['rank_diff'] = abs(df['rank_mAP'] - df['rank_norm'])

print("\nTargets with largest rank differences (percentile):")
print("{:<15} {:>4} {:>7} {:>7} {:>10} {:>10} {:>10}".format(
    "Target", "n", "mAP", "norm", "mAP rank%", "norm rank%", "diff%"
))
print("-" * 80)

for _, row in df.nlargest(10, 'rank_diff').iterrows():
    print("{:<15} {:>4.0f} {:>7.3f} {:>7.3f} {:>10.1f} {:>10.1f} {:>10.1f}".format(
        row['Metadata_repurposing_target'][:15],
        row['n_samples'],
        row['mean_average_precision'],
        row['mean_normalized_average_precision'],
        row['rank_mAP'] * 100,
        row['rank_norm'] * 100,
        row['rank_diff'] * 100
    ))

print("\n" + "="*80)
print("INSPECTION 4: Threshold analysis - what gets selected?")
print("="*80)

thresholds = [0.05, 0.10, 0.15, 0.20]

print("\nUsing different mAP thresholds:")
print("{:<10} {:>15} {:>15} {:>15} {:>15}".format(
    "Threshold", "Raw mAP", "Normalized mAP", "Both", "Either"
))
print("-" * 70)

for t in thresholds:
    raw_selected = df[df['mean_average_precision'] > t]
    norm_selected = df[df['mean_normalized_average_precision'] > t]
    both = set(raw_selected['Metadata_repurposing_target']) & set(norm_selected['Metadata_repurposing_target'])
    either = set(raw_selected['Metadata_repurposing_target']) | set(norm_selected['Metadata_repurposing_target'])
    
    print("{:<10.2f} {:>15} {:>15} {:>15} {:>15}".format(
        t,
        len(raw_selected),
        len(norm_selected),
        len(both),
        len(either)
    ))

print("\n" + "="*80)
print("INSPECTION 5: P-value vs effect size discrepancies")
print("="*80)

# High significance but low effect
high_sig_low_effect = df[(df['-log10(p-value)'] > 3) & (df['mean_normalized_average_precision'] < 0.1)]
print(f"\nHigh significance (-log10(p) > 3) but low effect (norm < 0.1): {len(high_sig_low_effect)} targets")
print("\nExamples:")
print("{:<15} {:>4} {:>7} {:>7} {:>10} {:>10}".format(
    "Target", "n", "mAP", "norm", "-log10(p)", "p-value"
))
print("-" * 70)
for _, row in high_sig_low_effect.head(5).iterrows():
    print("{:<15} {:>4.0f} {:>7.3f} {:>7.3f} {:>10.2f} {:>10.4f}".format(
        row['Metadata_repurposing_target'][:15],
        row['n_samples'],
        row['mean_average_precision'],
        row['mean_normalized_average_precision'],
        row['-log10(p-value)'],
        row['p_value']
    ))

# Low significance but high effect
low_sig_high_effect = df[(df['corrected_p_value'] > 0.05) & (df['mean_normalized_average_precision'] > 0.2)]
print(f"\nLow significance (p > 0.05) but high effect (norm > 0.2): {len(low_sig_high_effect)} targets")
print("\nExamples:")
print("{:<15} {:>4} {:>7} {:>7} {:>10} {:>10}".format(
    "Target", "n", "mAP", "norm", "-log10(p)", "p-value"
))
print("-" * 70)
for _, row in low_sig_high_effect.head(5).iterrows():
    print("{:<15} {:>4.0f} {:>7.3f} {:>7.3f} {:>10.2f} {:>10.4f}".format(
        row['Metadata_repurposing_target'][:15],
        row['n_samples'],
        row['mean_average_precision'],
        row['mean_normalized_average_precision'],
        row['-log10(p-value)'],
        row['p_value']
    ))

print("\n" + "="*80)
print("INSPECTION 6: Expected vs observed improvement patterns")
print("="*80)

# For different sample sizes, what's the typical improvement?
print("\nImprovement over random by sample size:")
print("{:<15} {:>10} {:>15} {:>15} {:>15}".format(
    "Sample Size", "Count", "Mean norm mAP", "Median norm mAP", "% positive"
))
print("-" * 75)

for n in sorted(df['n_samples'].unique())[:10]:  # First 10 sample sizes
    subset = df[df['n_samples'] == n]
    if len(subset) >= 3:  # Only show if we have at least 3 examples
        pct_positive = (subset['mean_normalized_average_precision'] > 0).mean() * 100
        print("{:<15.0f} {:>10} {:>15.3f} {:>15.3f} {:>15.1f}%".format(
            n,
            len(subset),
            subset['mean_normalized_average_precision'].mean(),
            subset['mean_normalized_average_precision'].median(),
            pct_positive
        ))

print("\n" + "="*80)
print("INSPECTION 7: Selection comparison with combined criteria")
print("="*80)

# Different selection strategies
strategies = {
    'Raw mAP > 0.1': df[df['mean_average_precision'] > 0.1],
    'Norm mAP > 0.1': df[df['mean_normalized_average_precision'] > 0.1],
    'P < 0.05': df[df['corrected_p_value'] < 0.05],
    'Raw > 0.1 & P < 0.05': df[(df['mean_average_precision'] > 0.1) & (df['corrected_p_value'] < 0.05)],
    'Norm > 0.1 & P < 0.05': df[(df['mean_normalized_average_precision'] > 0.1) & (df['corrected_p_value'] < 0.05)]
}

print("\nSelection strategy comparison:")
print("{:<25} {:>10} {:>15} {:>15}".format(
    "Strategy", "Selected", "Mean n_samples", "Mean norm mAP"
))
print("-" * 70)

for name, selected in strategies.items():
    if len(selected) > 0:
        print("{:<25} {:>10} {:>15.1f} {:>15.3f}".format(
            name,
            len(selected),
            selected['n_samples'].mean(),
            selected['mean_normalized_average_precision'].mean()
        ))

print("\n" + "="*80)
print("INSPECTION 8: Borderline cases around common thresholds")
print("="*80)

# Cases near the threshold boundaries
threshold = 0.1
margin = 0.02

print(f"\nTargets near the {threshold:.1f} threshold (within ±{margin}):")
print("\nJust below threshold (would be excluded):")
print("{:<15} {:>4} {:>7} {:>7} {:>10}".format(
    "Target", "n", "mAP", "norm", "Decision"
))
print("-" * 55)

just_below = df[(df['mean_normalized_average_precision'] < threshold) & 
                 (df['mean_normalized_average_precision'] > threshold - margin)]
for _, row in just_below.head(5).iterrows():
    raw_pass = "Raw: YES" if row['mean_average_precision'] > threshold else "Raw: NO"
    print("{:<15} {:>4.0f} {:>7.3f} {:>7.3f} {:>10}".format(
        row['Metadata_repurposing_target'][:15],
        row['n_samples'],
        row['mean_average_precision'],
        row['mean_normalized_average_precision'],
        raw_pass
    ))

print("\nJust above threshold (would be included):")
print("{:<15} {:>4} {:>7} {:>7} {:>10}".format(
    "Target", "n", "mAP", "norm", "Decision"
))
print("-" * 55)

just_above = df[(df['mean_normalized_average_precision'] > threshold) & 
                 (df['mean_normalized_average_precision'] < threshold + margin)]
for _, row in just_above.head(5).iterrows():
    raw_pass = "Raw: YES" if row['mean_average_precision'] > threshold else "Raw: NO"
    print("{:<15} {:>4.0f} {:>7.3f} {:>7.3f} {:>10}".format(
        row['Metadata_repurposing_target'][:15],
        row['n_samples'],
        row['mean_average_precision'],
        row['mean_normalized_average_precision'],
        raw_pass
    ))

print("\n" + "="*80)
print("END OF INSPECTION")
print("="*80)

CSV:

consistency_map_results.csv

Full hydra folder:

results.tar.gz
