Skip to content

Conversation

@yeising
Copy link
Collaborator

@yeising yeising commented Nov 7, 2025

Summary by CodeRabbit

  • New Features

    • Gene-level quantification step added to convert transcript counts to gene-level counts, and downstream analyses now use gene-level data.
    • Generates a quality-control plot showing proportion of transcripts mapped to genes and emits a warning if many transcripts remain unnamed.
  • Documentation

    • Added a report section describing the transcript-to-gene mapping and the QC plot.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

A new transcript-to-gene mapping step was added that converts merged/all_counts.tsv into merged/all_counts_gene.tsv. Downstream Snakemake rules were updated to consume the gene-level counts; supporting scripts and a report fragment were added or modified accordingly.

Changes

Cohort / File(s) Summary
Snakemake rules updated to use gene-level counts
workflow/rules/commons.smk, workflow/rules/diffexp.smk
Inputs updated to reference merged/all_counts_gene.tsv instead of merged/all_counts.tsv.
New transcript-to-gene mapping rule
workflow/rules/quantification.smk
Added transcriptid_to_gene rule: inputs merged/all_counts.tsv, references/standardized_genomic.gff; outputs merged/all_counts_gene.tsv, plot merged/transcriptid_to_gene_plot.svg; logs to logs/transcriptid_to_gene.log; uses conda env ../envs/pydeseq2.yml and script ../scripts/transcriptid_to_gene.py. Registered as a local rule.
New transcript ID mapping script
workflow/scripts/transcriptid_to_gene.py
New script that parses GFF attributes, builds ID→gene mapping, normalizes transcript IDs (removing region-specific suffixes), replaces Reference column with gene-prefixed IDs, writes transformed counts, and emits a pie-chart QC plot and stats.
Gene ID normalization in DE util
workflow/scripts/get_de_genes.py
Added original_id(ref) utility and apply it when building gene sets to normalize gene identifiers by stripping leading prefixes up to ::.
Report fragment
workflow/report/name_stats.rst
New RST describing the transcript→gene mapping and referencing the generated pie chart.

Sequence Diagram(s)

sequenceDiagram
    participant Counts as merged/all_counts.tsv
    participant GFF as references/standardized_genomic.gff
    participant Rule as transcriptid_to_gene (rule)
    participant Script as transcriptid_to_gene.py
    participant Outputs as merged/all_counts_gene.tsv\n+ plot

    Note over Counts,GFF: Inputs to mapping step
    Counts->>Rule: start rule
    GFF->>Rule: provide annotations
    Rule->>Script: invoke mapping script
    Script->>Script: parse GFF → build ID→gene map
    Script->>Script: clean transcript IDs (regex)
    Script->>Script: map References → gene-prefixed IDs
    Script->>Outputs: write `merged/all_counts_gene.tsv` and plot
    Outputs->>Downstream: downstream rules now consume gene-level counts
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas to focus:
    • workflow/scripts/transcriptid_to_gene.py GFF attribute parsing and ID normalization correctness.
    • Consistency of downstream rules (commons.smk, diffexp.smk) now using merged/all_counts_gene.tsv.
    • Interaction between original_id() in get_de_genes.py and existing gene-matching logic.

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • cmeesters

Poem

🐰 I hopped through GFF and counts so keen,
I trimmed the prefixes, made names clean,
Transcripts folded into genes so bright,
A pie chart twinkles in the night,
Workflow hopped forward — one small leap! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: rename transcript ids to gene names' accurately summarizes the main change: adding a new rule to map transcript IDs to gene names and updating downstream rules to use the renamed output.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat-rename_id_to_gene

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Collaborator

@cmeesters cmeesters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just the changes within the code

@yeising yeising marked this pull request as ready for review November 7, 2025 14:08
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
workflow/scripts/transcriptid_to_gene.py (1)

37-41: Tab-splitting is correct for GFF3; keep as-is.

Spec mandates tab-delimited columns; whitespace split would break attribute fields containing spaces.

🧹 Nitpick comments (5)
workflow/scripts/get_de_genes.py (1)

30-31: Consider vectorized normalization for speed and NaN-safety.

You can avoid Python-level apply and preserve NaN.

-df["gene"] = df["gene"].apply(original_id)
+df["gene"] = df["gene"].where(
+    df["gene"].isna(),
+    df["gene"].astype(str).str.split("::", n=1).str[-1],
+)
workflow/rules/quantification.smk (2)

47-58: Verify annotation path and aggregation intent.

  • Input annotation "references/standardized_genomic.gff": confirm this file is produced/available in the DAG for all configs (or derive via get_reference_files/config).
  • Output name all_counts_gene.tsv suggests aggregation by gene, but the script only prefixes "gene::". If aggregation is not intended, consider renaming to reduce confusion.

4-7: Do not force local execution unless required.

Keeping transcriptid_to_gene in localrules may hinder cluster scheduling. If not necessary, drop it or gate via config.

workflow/scripts/transcriptid_to_gene.py (2)

16-16: Avoid forcing all columns to string.

Only coerce the identifier column; leave counts numeric to reduce surprises downstream.

-counts = pd.read_csv(counts, sep="\t", dtype=str)
+counts = pd.read_csv(counts, sep="\t")
+if "Reference" not in counts.columns:
+    raise ValueError("Input counts missing 'Reference' column.")
+counts["Reference"] = counts["Reference"].astype(str)

19-21: Minor: fix typos in comments ("dictionary").

Nit, but helps readability.

-# Parse gff attributes into a {key: value} dictionairy
+# Parse GFF attributes into a {key: value} dictionary
@@
-# Build mapping dictionairy {ID: gene}
+# Build mapping dictionary {ID: gene}

Also applies to: 29-31

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 841cead and 03f3ab6.

📒 Files selected for processing (5)
  • workflow/rules/commons.smk (1 hunks)
  • workflow/rules/diffexp.smk (1 hunks)
  • workflow/rules/quantification.smk (2 hunks)
  • workflow/scripts/get_de_genes.py (1 hunks)
  • workflow/scripts/transcriptid_to_gene.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.
📚 Learning: 2025-09-10T07:06:30.765Z
Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.

Applied to files:

  • workflow/rules/diffexp.smk
  • workflow/rules/quantification.smk
  • workflow/rules/commons.smk
  • workflow/scripts/transcriptid_to_gene.py
🪛 Ruff (0.14.3)
workflow/scripts/transcriptid_to_gene.py

6-6: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


12-12: Undefined name snakemake

(F821)

🔇 Additional comments (2)
workflow/rules/diffexp.smk (1)

9-9: Input switch to gene-tagged counts looks fine; confirm downstream expectations.

deseq2-init.R must accept IDs like "GENE::TRANSCRIPT" and not require unique gene rows. If true gene-level DE is intended, consider aggregating counts by gene upstream; otherwise this is a pure rename.

workflow/rules/commons.smk (1)

169-169: Wiring to new gene-tagged counts is consistent.

This ensures the DAG pulls transcriptid_to_gene. Please confirm that the producing rule exists in all profiles and that no targets still rely on merged/all_counts.tsv.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
workflow/scripts/get_de_genes.py (1)

23-29: Consider the trade-offs of strict validation vs. graceful degradation.

The function now raises ValueError on invalid inputs rather than handling them gracefully. While strict validation ensures data integrity, it may halt the pipeline on edge cases (e.g., if the upstream data contains unexpected NaN values). Consider whether:

  1. Invalid IDs should fail fast (current approach), or
  2. They should be logged/warned and passed through unchanged

If strict validation is intended, the implementation is correct. However, note the static analysis issues on Line 25:

  • Remove the f prefix since the string has no placeholders
  • Consider a custom exception class or shorter message to follow best practices

Apply this diff to address the static analysis issues:

 def original_id(ref):
     if not isinstance(ref, str) or pd.isna(ref):
-        raise ValueError(f"Invalid reference ID encountered.")
+        raise ValueError("Invalid reference ID encountered.")
     if "::" in ref:
         return ref.split("::", 1)[1]
     else:
         return ref
workflow/scripts/transcriptid_to_gene.py (2)

83-102: Add validation and fix warning stacklevel.

The mapping and statistics are well-implemented, but consider two improvements:

  1. Validate the Reference column exists before applying the mapping to provide clearer error messages
  2. Add stacklevel to the warning as flagged by static analysis

Apply this diff:

+# Validate expected column exists
+if "Reference" not in counts.columns:
+    raise ValueError("Input counts missing 'Reference' column.")
+
 # Replace transcript IDs in "Reference" column with mapping results
 counts["Reference"] = counts["Reference"].apply(map_id)
 
 # Write output
 counts.to_csv(output, sep="\t", index=False)
 
 # stats
 total = len(counts)
 mapped = counts["Reference"].str.contains("::", regex=True).sum()
 pct = round(mapped / total * 100, 2)
 if total > mapped:
-    warnings.warn("Some transcripts could not be named.")
+    warnings.warn("Some transcripts could not be named.", stacklevel=2)

104-116: Fix spacing in expressions.

The pie chart generation is well-implemented, but fix the spacing issues in the calculations for consistency.

Apply this diff:

 # Generate pie chart
 sizes = [mapped, total - mapped]
-labels = [f"Named ({pct}%)", f"Unnamed ({round(100- pct,2)}%)"]
+labels = [f"Named ({pct}%)", f"Unnamed ({round(100 - pct, 2)}%)"]
 
 plt.figure(figsize=(5, 5))
 plt.pie(sizes, autopct="%1.1f%%", colors=["#4CAF50", "#F44366"])
 plt.legend(
-    [f"Named ({mapped})", f"Unnamed ({total- mapped})"],
+    [f"Named ({mapped})", f"Unnamed ({total - mapped})"],
     loc="upper right",
     title="Transcripts",
 )
 plt.savefig(output_plot, format="svg")
 plt.close()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03f3ab6 and 8bf39bb.

📒 Files selected for processing (4)
  • workflow/report/name_stats.rst (1 hunks)
  • workflow/rules/quantification.smk (2 hunks)
  • workflow/scripts/get_de_genes.py (1 hunks)
  • workflow/scripts/transcriptid_to_gene.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • workflow/report/name_stats.rst
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.
📚 Learning: 2025-09-10T07:06:30.765Z
Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.

Applied to files:

  • workflow/scripts/transcriptid_to_gene.py
  • workflow/rules/quantification.smk
  • workflow/scripts/get_de_genes.py
🪛 Ruff (0.14.4)
workflow/scripts/transcriptid_to_gene.py

8-8: Undefined name snakemake

(F821)


12-12: Undefined name snakemake

(F821)


13-13: Undefined name snakemake

(F821)


14-14: Undefined name snakemake

(F821)


15-15: Undefined name snakemake

(F821)


94-94: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

workflow/scripts/get_de_genes.py

25-25: Avoid specifying long messages outside the exception class

(TRY003)


25-25: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Formatting
🔇 Additional comments (9)
workflow/scripts/get_de_genes.py (1)

32-32: Applying strict validation to all gene entries.

The .apply(original_id) call will raise and halt processing if any gene entry is invalid (non-string or NaN). This fail-fast behavior is appropriate if data quality is guaranteed. If you anticipate edge cases, consider adding a try-except block around this line to provide clearer error context indicating which gene ID caused the failure.

workflow/rules/quantification.smk (2)

4-6: LGTM!

Adding transcriptid_to_gene to localrules is appropriate for this data transformation and plotting step.


47-68: Rule is properly configured with all dependencies verified.

The transcriptid_to_gene rule has valid inputs, outputs, and well-documented plot metadata. Verification confirms that the conda environment (workflow/envs/pydeseq2.yml) contains all required packages (pandas via anndata, matplotlib via seaborn) and the report caption file (workflow/report/name_stats.rst) exists. No issues identified.

workflow/scripts/transcriptid_to_gene.py (6)

1-10: LGTM!

The imports and logging setup are appropriate. Explicitly managing the log file handle allows proper cleanup at the end.


12-20: Good choice to preserve ID integrity.

Loading counts with dtype=str ensures transcript IDs are not misinterpreted as numeric values, which is important for identifiers that may look numeric.


22-29: LGTM!

The parse_attributes function correctly handles GFF3 attribute format. Silently skipping malformed parts is acceptable for robustness.


32-59: Excellent fallback strategy for gene name extraction.

The implementation now tries multiple attribute keys (gene, gene_name, Name, gene_id) and derives from Parent when needed. This addresses the past review comment and significantly improves compatibility with different GFF annotation formats.


62-67: LGTM!

The clean_id function is well-documented and correctly handles region-specific suffixes in transcript IDs.


70-80: LGTM!

The two-stage matching strategy (exact then cleaned) is robust. The gene::transcript format is consistent with the downstream processing in get_de_genes.py.

@cmeesters cmeesters self-requested a review November 19, 2025 10:32
Copy link
Collaborator

@cmeesters cmeesters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accepting on the ground of all discussions. Assuming, it is working.

@cmeesters cmeesters merged commit f3e643f into main Nov 19, 2025
7 checks passed
@cmeesters cmeesters deleted the feat-rename_id_to_gene branch November 19, 2025 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants