Skip to content

Conversation

@ShigrafS
Copy link

@ShigrafS ShigrafS commented Apr 20, 2025

Closes #264

Flexible and Canonicalized Column Handling for dedup and sort

Overview

This pull request enhances the flexibility and robustness of column handling in pairtools, with a primary focus on improving CLI usability, internal consistency, and resilience to variations in input column names.

Note: This PR also fixes some flake8 linting issues.

Key Enhancements

✅ 1. Unified Column Lookup via headerops

  • Introduced headerops.get_column_index() to allow CLI options like --c1, --c2, --p1, --p2, --s1, --s2, and --pt to accept both:
    • Integer indices (e.g., 1, 3) — with bounds and type checks.
    • String names (e.g., "chr1", "chrom1") — supporting canonicalization and case-insensitivity.
  • Introduced headerops.canonicalize_columns() to standardize commonly used aliases (e.g., chr1chrom1, ptpair_type) across all CLI tools and internal logic.

✅ 2. Improved dedup and sort CLI Behavior

  • Replaced static string-based column name assumptions with dynamic lookups via get_column_index().
  • Enabled seamless handling of extra_col_pair and extra_col options with warnings for missing columns instead of hard failures.
  • Made the --pt (pair_type) option optional in sort, skipping it gracefully when not present in the header.
  • Fixed incorrect help text for --c2 in dedup (was "Chrom 1 column", now corrected to "Chrom 2 column").

✅ 3. Column Defaults Remain String-Based

  • CLI defaults (--c1, --c2, etc.) are still defined using canonical string names (e.g., "chr1", "pos1"), not integer indices as initially planned.
  • However, the backend now supports either form due to get_column_index()'s flexibility.
  • Suggested follow-up improvement: update default values to integers to fully align with the original plan.

✅ 4. Code Cleanup and Readability

  • Replaced unclear variable names (e.g., lline) for better readability across modules.
  • Removed unused imports and deprecated warnings (e.g., cython backend placeholder).
  • Refactored import ordering and string formatting for consistency.

✅ 5. Comprehensive Testing

  • Added unit tests in test_headerops.py to validate:
    • Canonicalization of various column aliases.
    • Accurate index lookup from mixed input types (ints, strings, canonicalized names).
    • Proper error handling and edge case coverage (e.g., negative indices, invalid columns).
    • Integration with header extraction utilities.

Summary

This PR lays the groundwork for robust and user-friendly CLI interactions in pairtools, reducing the brittleness of column name handling and allowing greater flexibility for users working with varied input formats. It introduces modular utilities (canonicalize_columns, get_column_index) that can be reused across future tools and extensions.


Follow-Up Considerations

  • [ ] Update CLI defaults (--c1, --c2, etc.) to use integer indices as per the original plan.

@ShigrafS ShigrafS marked this pull request as ready for review April 20, 2025 13:57
@ShigrafS
Copy link
Author

@agalitsyna I've added a warning when sorting by columns not defined in pairsam_format.DTYPES_PAIRSAM, defaulting them to string type, and updated the --extra-col help text to reflect this behavior. I've also added a test to verify warnings for undefined custom columns
The PR is ready to be merged.
Kindly review it.

@ShigrafS
Copy link
Author

@golobor @agalitsyna @Phlya
This PR is ready. Kindly review it.




def canonicalize_columns(columns):
Copy link
Member

@agalitsyna agalitsyna Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"canonicalize" -> "standardize"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it a function with a single argument.


def canonicalize_columns(columns):
"""Convert between common column name variants."""
canonical_map = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove identities (when key equals the value)

}
return [canonical_map.get(col.lower(), col) for col in columns]

def get_column_index(column_names, column_spec):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"column_spec" -> "column/col"

'pt': 'pair_type',
'pair_type': 'pair_type'
}
return [canonical_map.get(col.lower(), col) for col in columns]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

canonical_map.get(col.lower(), col.lower())

except ValueError:
pass

# Try case-insensitive
Copy link
Member

@agalitsyna agalitsyna Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the case-insensittive law or explain why it's needed (we have not found the usecase; Open2C mtg 30 June 2025)


# Get column indices with fallbacks
try:
col1 = headerops.get_column_index(column_names, c1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the names like "col1/col2/cols1" are not descriptive enough, so maybe use "col_c1/col_c2/col_s1" instead: (1) use underscore, (2) preserve the recognizable name that was used before (c1/c2/p1/p2/etc.).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

chrom1, chrom2 and pair_type fields are now required in pairs file header

2 participants