-
Notifications
You must be signed in to change notification settings - Fork 1
feat: rename transcript ids to gene names #141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughA new transcript-to-gene mapping step was added that converts Changes
Sequence Diagram(s)sequenceDiagram
participant Counts as merged/all_counts.tsv
participant GFF as references/standardized_genomic.gff
participant Rule as transcriptid_to_gene (rule)
participant Script as transcriptid_to_gene.py
participant Outputs as merged/all_counts_gene.tsv\n+ plot
Note over Counts,GFF: Inputs to mapping step
Counts->>Rule: start rule
GFF->>Rule: provide annotations
Rule->>Script: invoke mapping script
Script->>Script: parse GFF → build ID→gene map
Script->>Script: clean transcript IDs (regex)
Script->>Script: map References → gene-prefixed IDs
Script->>Outputs: write `merged/all_counts_gene.tsv` and plot
Outputs->>Downstream: downstream rules now consume gene-level counts
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
cmeesters
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just the changes within the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (1)
workflow/scripts/transcriptid_to_gene.py (1)
37-41: Tab-splitting is correct for GFF3; keep as-is.Spec mandates tab-delimited columns; whitespace split would break attribute fields containing spaces.
🧹 Nitpick comments (5)
workflow/scripts/get_de_genes.py (1)
30-31: Consider vectorized normalization for speed and NaN-safety.You can avoid Python-level apply and preserve NaN.
-df["gene"] = df["gene"].apply(original_id) +df["gene"] = df["gene"].where( + df["gene"].isna(), + df["gene"].astype(str).str.split("::", n=1).str[-1], +)workflow/rules/quantification.smk (2)
47-58: Verify annotation path and aggregation intent.
- Input annotation "references/standardized_genomic.gff": confirm this file is produced/available in the DAG for all configs (or derive via get_reference_files/config).
- Output name all_counts_gene.tsv suggests aggregation by gene, but the script only prefixes "gene::". If aggregation is not intended, consider renaming to reduce confusion.
4-7: Do not force local execution unless required.Keeping transcriptid_to_gene in localrules may hinder cluster scheduling. If not necessary, drop it or gate via config.
workflow/scripts/transcriptid_to_gene.py (2)
16-16: Avoid forcing all columns to string.Only coerce the identifier column; leave counts numeric to reduce surprises downstream.
-counts = pd.read_csv(counts, sep="\t", dtype=str) +counts = pd.read_csv(counts, sep="\t") +if "Reference" not in counts.columns: + raise ValueError("Input counts missing 'Reference' column.") +counts["Reference"] = counts["Reference"].astype(str)
19-21: Minor: fix typos in comments ("dictionary").Nit, but helps readability.
-# Parse gff attributes into a {key: value} dictionairy +# Parse GFF attributes into a {key: value} dictionary @@ -# Build mapping dictionairy {ID: gene} +# Build mapping dictionary {ID: gene}Also applies to: 29-31
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
workflow/rules/commons.smk(1 hunks)workflow/rules/diffexp.smk(1 hunks)workflow/rules/quantification.smk(2 hunks)workflow/scripts/get_de_genes.py(1 hunks)workflow/scripts/transcriptid_to_gene.py(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.
📚 Learning: 2025-09-10T07:06:30.765Z
Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.
Applied to files:
workflow/rules/diffexp.smkworkflow/rules/quantification.smkworkflow/rules/commons.smkworkflow/scripts/transcriptid_to_gene.py
🪛 Ruff (0.14.3)
workflow/scripts/transcriptid_to_gene.py
6-6: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
12-12: Undefined name snakemake
(F821)
🔇 Additional comments (2)
workflow/rules/diffexp.smk (1)
9-9: Input switch to gene-tagged counts looks fine; confirm downstream expectations.deseq2-init.R must accept IDs like "GENE::TRANSCRIPT" and not require unique gene rows. If true gene-level DE is intended, consider aggregating counts by gene upstream; otherwise this is a pure rename.
workflow/rules/commons.smk (1)
169-169: Wiring to new gene-tagged counts is consistent.This ensures the DAG pulls transcriptid_to_gene. Please confirm that the producing rule exists in all profiles and that no targets still rely on merged/all_counts.tsv.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
workflow/scripts/get_de_genes.py (1)
23-29: Consider the trade-offs of strict validation vs. graceful degradation.The function now raises
ValueErroron invalid inputs rather than handling them gracefully. While strict validation ensures data integrity, it may halt the pipeline on edge cases (e.g., if the upstream data contains unexpected NaN values). Consider whether:
- Invalid IDs should fail fast (current approach), or
- They should be logged/warned and passed through unchanged
If strict validation is intended, the implementation is correct. However, note the static analysis issues on Line 25:
- Remove the
fprefix since the string has no placeholders- Consider a custom exception class or shorter message to follow best practices
Apply this diff to address the static analysis issues:
def original_id(ref): if not isinstance(ref, str) or pd.isna(ref): - raise ValueError(f"Invalid reference ID encountered.") + raise ValueError("Invalid reference ID encountered.") if "::" in ref: return ref.split("::", 1)[1] else: return refworkflow/scripts/transcriptid_to_gene.py (2)
83-102: Add validation and fix warning stacklevel.The mapping and statistics are well-implemented, but consider two improvements:
- Validate the Reference column exists before applying the mapping to provide clearer error messages
- Add stacklevel to the warning as flagged by static analysis
Apply this diff:
+# Validate expected column exists +if "Reference" not in counts.columns: + raise ValueError("Input counts missing 'Reference' column.") + # Replace transcript IDs in "Reference" column with mapping results counts["Reference"] = counts["Reference"].apply(map_id) # Write output counts.to_csv(output, sep="\t", index=False) # stats total = len(counts) mapped = counts["Reference"].str.contains("::", regex=True).sum() pct = round(mapped / total * 100, 2) if total > mapped: - warnings.warn("Some transcripts could not be named.") + warnings.warn("Some transcripts could not be named.", stacklevel=2)
104-116: Fix spacing in expressions.The pie chart generation is well-implemented, but fix the spacing issues in the calculations for consistency.
Apply this diff:
# Generate pie chart sizes = [mapped, total - mapped] -labels = [f"Named ({pct}%)", f"Unnamed ({round(100- pct,2)}%)"] +labels = [f"Named ({pct}%)", f"Unnamed ({round(100 - pct, 2)}%)"] plt.figure(figsize=(5, 5)) plt.pie(sizes, autopct="%1.1f%%", colors=["#4CAF50", "#F44366"]) plt.legend( - [f"Named ({mapped})", f"Unnamed ({total- mapped})"], + [f"Named ({mapped})", f"Unnamed ({total - mapped})"], loc="upper right", title="Transcripts", ) plt.savefig(output_plot, format="svg") plt.close()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
workflow/report/name_stats.rst(1 hunks)workflow/rules/quantification.smk(2 hunks)workflow/scripts/get_de_genes.py(1 hunks)workflow/scripts/transcriptid_to_gene.py(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- workflow/report/name_stats.rst
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.
📚 Learning: 2025-09-10T07:06:30.765Z
Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.
Applied to files:
workflow/scripts/transcriptid_to_gene.pyworkflow/rules/quantification.smkworkflow/scripts/get_de_genes.py
🪛 Ruff (0.14.4)
workflow/scripts/transcriptid_to_gene.py
8-8: Undefined name snakemake
(F821)
12-12: Undefined name snakemake
(F821)
13-13: Undefined name snakemake
(F821)
14-14: Undefined name snakemake
(F821)
15-15: Undefined name snakemake
(F821)
94-94: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
workflow/scripts/get_de_genes.py
25-25: Avoid specifying long messages outside the exception class
(TRY003)
25-25: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Formatting
🔇 Additional comments (9)
workflow/scripts/get_de_genes.py (1)
32-32: Applying strict validation to all gene entries.The
.apply(original_id)call will raise and halt processing if any gene entry is invalid (non-string or NaN). This fail-fast behavior is appropriate if data quality is guaranteed. If you anticipate edge cases, consider adding a try-except block around this line to provide clearer error context indicating which gene ID caused the failure.workflow/rules/quantification.smk (2)
4-6: LGTM!Adding
transcriptid_to_geneto localrules is appropriate for this data transformation and plotting step.
47-68: Rule is properly configured with all dependencies verified.The
transcriptid_to_generule has valid inputs, outputs, and well-documented plot metadata. Verification confirms that the conda environment (workflow/envs/pydeseq2.yml) contains all required packages (pandas via anndata, matplotlib via seaborn) and the report caption file (workflow/report/name_stats.rst) exists. No issues identified.workflow/scripts/transcriptid_to_gene.py (6)
1-10: LGTM!The imports and logging setup are appropriate. Explicitly managing the log file handle allows proper cleanup at the end.
12-20: Good choice to preserve ID integrity.Loading counts with
dtype=strensures transcript IDs are not misinterpreted as numeric values, which is important for identifiers that may look numeric.
22-29: LGTM!The
parse_attributesfunction correctly handles GFF3 attribute format. Silently skipping malformed parts is acceptable for robustness.
32-59: Excellent fallback strategy for gene name extraction.The implementation now tries multiple attribute keys (
gene,gene_name,Name,gene_id) and derives fromParentwhen needed. This addresses the past review comment and significantly improves compatibility with different GFF annotation formats.
62-67: LGTM!The
clean_idfunction is well-documented and correctly handles region-specific suffixes in transcript IDs.
70-80: LGTM!The two-stage matching strategy (exact then cleaned) is robust. The
gene::transcriptformat is consistent with the downstream processing inget_de_genes.py.
cmeesters
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accepting on the ground of all discussions. Assuming, it is working.
Summary by CodeRabbit
New Features
Documentation