-
Couldn't load subscription status.
- Fork 36
Fixed CLI errors. #268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fixed CLI errors. #268
Conversation
|
@agalitsyna I've added a warning when sorting by columns not defined in pairsam_format.DTYPES_PAIRSAM, defaulting them to string type, and updated the --extra-col help text to reflect this behavior. I've also added a test to verify warnings for undefined custom columns |
|
@golobor @agalitsyna @Phlya |
|
|
||
|
|
||
|
|
||
| def canonicalize_columns(columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"canonicalize" -> "standardize"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make it a function with a single argument.
|
|
||
| def canonicalize_columns(columns): | ||
| """Convert between common column name variants.""" | ||
| canonical_map = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the dictionary to https://github.com/open2c/pairtools/blob/master/pairtools/lib/pairsam_format.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove identities (when key equals the value)
| } | ||
| return [canonical_map.get(col.lower(), col) for col in columns] | ||
|
|
||
| def get_column_index(column_names, column_spec): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"column_spec" -> "column/col"
| 'pt': 'pair_type', | ||
| 'pair_type': 'pair_type' | ||
| } | ||
| return [canonical_map.get(col.lower(), col) for col in columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
canonical_map.get(col.lower(), col.lower())
| except ValueError: | ||
| pass | ||
|
|
||
| # Try case-insensitive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the case-insensittive law or explain why it's needed (we have not found the usecase; Open2C mtg 30 June 2025)
|
|
||
| # Get column indices with fallbacks | ||
| try: | ||
| col1 = headerops.get_column_index(column_names, c1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the names like "col1/col2/cols1" are not descriptive enough, so maybe use "col_c1/col_c2/col_s1" instead: (1) use underscore, (2) preserve the recognizable name that was used before (c1/c2/p1/p2/etc.).
Closes #264
Flexible and Canonicalized Column Handling for
dedupandsortOverview
This pull request enhances the flexibility and robustness of column handling in
pairtools, with a primary focus on improving CLI usability, internal consistency, and resilience to variations in input column names.Note: This PR also fixes some flake8 linting issues.
Key Enhancements
✅ 1. Unified Column Lookup via
headeropsheaderops.get_column_index()to allow CLI options like--c1,--c2,--p1,--p2,--s1,--s2, and--ptto accept both:1,3) — with bounds and type checks."chr1","chrom1") — supporting canonicalization and case-insensitivity.headerops.canonicalize_columns()to standardize commonly used aliases (e.g.,chr1→chrom1,pt→pair_type) across all CLI tools and internal logic.✅ 2. Improved
dedupandsortCLI Behaviorget_column_index().extra_col_pairandextra_coloptions with warnings for missing columns instead of hard failures.--pt(pair_type) option optional insort, skipping it gracefully when not present in the header.--c2indedup(was "Chrom 1 column", now corrected to "Chrom 2 column").✅ 3. Column Defaults Remain String-Based
--c1,--c2, etc.) are still defined using canonical string names (e.g.,"chr1","pos1"), not integer indices as initially planned.get_column_index()'s flexibility.✅ 4. Code Cleanup and Readability
l→line) for better readability across modules.✅ 5. Comprehensive Testing
test_headerops.pyto validate:Summary
This PR lays the groundwork for robust and user-friendly CLI interactions in
pairtools, reducing the brittleness of column name handling and allowing greater flexibility for users working with varied input formats. It introduces modular utilities (canonicalize_columns,get_column_index) that can be reused across future tools and extensions.Follow-Up Considerations
--c1,--c2, etc.) to use integer indices as per the original plan.