Fixed CLI errors. #268

ShigrafS · 2025-04-20T09:50:04Z

Closes #264

Flexible and Canonicalized Column Handling for dedup and sort

Overview

This pull request enhances the flexibility and robustness of column handling in pairtools, with a primary focus on improving CLI usability, internal consistency, and resilience to variations in input column names.

Note: This PR also fixes some flake8 linting issues.

Key Enhancements

✅ 1. Unified Column Lookup via `headerops`

Introduced headerops.get_column_index() to allow CLI options like --c1, --c2, --p1, --p2, --s1, --s2, and --pt to accept both:
- Integer indices (e.g., 1, 3), with bounds and type checks.
- String names (e.g., "chr1", "chrom1"), with canonicalization and case-insensitive matching.

Introduced headerops.canonicalize_columns() to standardize commonly used aliases (e.g., chr1 → chrom1, pt → pair_type) across all CLI tools and internal logic.
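
A minimal sketch of how the two helpers could fit together. The function names and the `column_names, column_spec` signature come from this PR; everything else is an assumption made for illustration: the alias table contents, 0-based indexing, and the exception types may all differ from the actual code in `headerops`.

```python
# Illustrative alias table; the real mapping in the PR may contain other entries.
CANONICAL_MAP = {
    "chr1": "chrom1",
    "chr2": "chrom2",
    "pt": "pair_type",
}


def canonicalize_columns(columns):
    """Map known aliases to canonical names; leave other names unchanged."""
    return [CANONICAL_MAP.get(col.lower(), col) for col in columns]


def get_column_index(column_names, column_spec):
    """Resolve an integer index or a column name to a position (assumed 0-based)."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(column_spec, int) and not isinstance(column_spec, bool):
        if not 0 <= column_spec < len(column_names):
            raise ValueError(f"Column index {column_spec} is out of range")
        return column_spec
    if not isinstance(column_spec, str):
        raise TypeError(f"Expected int or str, got {type(column_spec).__name__}")

    canonical_names = canonicalize_columns(column_names)
    (target,) = canonicalize_columns([column_spec])

    # Exact match on canonical names first.
    try:
        return canonical_names.index(target)
    except ValueError:
        pass

    # Fall back to a case-insensitive comparison.
    lowered = [name.lower() for name in canonical_names]
    try:
        return lowered.index(target.lower())
    except ValueError:
        raise ValueError(f"Column '{column_spec}' not found in header") from None


header = ["readID", "chr1", "pos1", "chr2", "pos2", "strand1", "strand2", "pair_type"]
print(get_column_index(header, "chrom1"))  # 1
print(get_column_index(header, "pt"))      # 7
print(get_column_index(header, 3))         # 3
```

In this sketch both the header and the requested name are canonicalized before comparison, so a header that says chr1 still matches a request for chrom1.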

✅ 2. Improved `dedup` and `sort` CLI Behavior

- Replaced static string-based column name assumptions with dynamic lookups via get_column_index().
- Enabled handling of the extra_col_pair and extra_col options, with warnings for missing columns instead of hard failures.
- Made the --pt (pair_type) option optional in sort; it is skipped when the column is not present in the header.
- Fixed incorrect help text for --c2 in dedup (was "Chrom 1 column", now corrected to "Chrom 2 column").
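
The "warn instead of fail" behavior for optional columns can be sketched as follows. `resolve_optional_columns` is a hypothetical helper written for this illustration, not a function from pairtools, and the warning text is made up.

```python
import warnings


def resolve_optional_columns(column_names, requested):
    """Return indices of the requested columns that exist; warn about the rest."""
    found = []
    for name in requested:
        if name in column_names:
            found.append(column_names.index(name))
        else:
            # A missing optional column is reported but does not abort the run.
            warnings.warn(f"Column '{name}' not found in header; skipping it.")
    return found


header = ["chrom1", "pos1", "mapq1"]
print(resolve_optional_columns(header, ["mapq1", "pair_type"]))  # [2], plus one warning
```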

✅ 3. Column Defaults Remain String-Based

- CLI defaults (--c1, --c2, etc.) are still defined as string column names (e.g., "chr1", "pos1"), not integer indices as initially planned.
- The backend nevertheless accepts either form, because get_column_index() resolves both integers and names.
- Suggested follow-up: update the default values to integers to fully align with the original plan.

✅ 4. Code Cleanup and Readability

- Replaced unclear variable names (e.g., l → line) for better readability across modules.
- Removed unused imports and deprecated warnings (e.g., cython backend placeholder).
- Refactored import ordering and string formatting for consistency.

✅ 5. Comprehensive Testing

Added unit tests in test_headerops.py to validate:
- Canonicalization of various column aliases.
- Accurate index lookup from mixed input types (ints, strings, canonicalized names).
- Proper error handling and edge case coverage (e.g., negative indices, invalid columns).
- Integration with header extraction utilities.

Summary

This PR lays the groundwork for robust and user-friendly CLI interactions in pairtools, reducing the brittleness of column name handling and allowing greater flexibility for users working with varied input formats. It introduces modular utilities (canonicalize_columns, get_column_index) that can be reused across future tools and extensions.

Follow-Up Considerations

- [ ] Update CLI defaults (--c1, --c2, etc.) to use integer indices as per the original plan.

pairtools/cli/sort.py

pairtools/cli/__init__.py

ShigrafS · 2025-04-23T11:11:49Z

@agalitsyna I've added a warning when sorting by columns that are not defined in pairsam_format.DTYPES_PAIRSAM; such columns now default to the string type. I've also updated the --extra-col help text to reflect this behavior and added a test that verifies the warning for undefined custom columns.
The PR is ready to be merged.
Kindly review it.
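
For illustration, the dtype fallback described above might look roughly like this. `KNOWN_DTYPES` is a stand-in for pairsam_format.DTYPES_PAIRSAM with made-up contents, and `dtype_for_sort_column` is a hypothetical name; the actual change in sort.py may be structured differently.

```python
import warnings

# Stand-in for pairsam_format.DTYPES_PAIRSAM; the entries here are illustrative.
KNOWN_DTYPES = {"chrom1": str, "pos1": int, "chrom2": str, "pos2": int}


def dtype_for_sort_column(name, known_dtypes=KNOWN_DTYPES):
    """Return the dtype used to sort a column, defaulting to str with a warning."""
    if name not in known_dtypes:
        warnings.warn(
            f"Column '{name}' has no defined dtype; treating it as a string."
        )
        return str
    return known_dtypes[name]


print(dtype_for_sort_column("pos1"))  # <class 'int'>
```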

ShigrafS · 2025-06-29T11:51:11Z

@golobor @agalitsyna @Phlya
This PR is ready. Kindly review it.

agalitsyna · 2025-06-30T18:54:02Z

pairtools/lib/headerops.py

+def canonicalize_columns(columns):

"canonicalize" -> "standardize"

Make it a function with a single argument.

agalitsyna · 2025-06-30T18:55:01Z

pairtools/lib/headerops.py


+def canonicalize_columns(columns):
+    """Convert between common column name variants."""
+    canonical_map = {

Move the dictionary to https://github.com/open2c/pairtools/blob/master/pairtools/lib/pairsam_format.py

Remove identity entries (where the key equals the value).

agalitsyna · 2025-06-30T19:02:11Z

pairtools/lib/headerops.py

+    }
+    return [canonical_map.get(col.lower(), col) for col in columns]
+
+def get_column_index(column_names, column_spec):


"column_spec" -> "column/col"

agalitsyna · 2025-06-30T19:05:29Z

pairtools/lib/headerops.py

+        'pt': 'pair_type',
+        'pair_type': 'pair_type'
+    }
+    return [canonical_map.get(col.lower(), col) for col in columns]


canonical_map.get(col.lower(), col.lower())
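
The difference between the two fallbacks only shows up for names that are not in the map. A small comparison, using an illustrative subset of aliases (`ALIASES`, `keep_case`, and `lower_unknown` are names invented for this sketch):

```python
ALIASES = {"chr1": "chrom1", "pt": "pair_type"}  # illustrative subset


def keep_case(columns):
    # Current behavior: unknown names are returned exactly as given.
    return [ALIASES.get(col.lower(), col) for col in columns]


def lower_unknown(columns):
    # Suggested behavior: unknown names are lowercased as well.
    return [ALIASES.get(col.lower(), col.lower()) for col in columns]


header = ["CHR1", "MapQ1"]
print(keep_case(header))      # ['chrom1', 'MapQ1']
print(lower_unknown(header))  # ['chrom1', 'mapq1']
```

With the suggested form, every returned name is lowercase, so later lookups would not need a separate case-insensitive step.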

agalitsyna · 2025-06-30T19:10:38Z

pairtools/lib/headerops.py

+    except ValueError:
+        pass
+
+    # Try case-insensitive

Remove the case-insensitive rule or explain why it's needed (we have not found a use case; Open2C mtg 30 June 2025).

agalitsyna · 2025-06-30T19:13:11Z

pairtools/cli/dedup.py

+
+    # Get column indices with fallbacks
+    try:
+        col1 = headerops.get_column_index(column_names, c1)


Names like "col1/col2/cols1" are not descriptive enough, so maybe use "col_c1/col_c2/col_s1" instead: (1) use an underscore, and (2) preserve the recognizable names that were used before (c1/c2/p1/p2/etc.).

Fixed CLI errors.

7331123

ShigrafS marked this pull request as ready for review April 20, 2025 13:57

agalitsyna requested changes Apr 21, 2025


ShigrafS requested a review from agalitsyna April 30, 2025 15:52

ShigrafS force-pushed the cli-fix branch from a1be6d2 to 7331123 Compare June 29, 2025 11:47

Merge branch 'master' into cli-fix

313f0f9

agalitsyna reviewed Jun 30, 2025

2 participants



