feat: rename transcript ids to gene names #141

yeising · 2025-11-07T10:08:52Z

Summary by CodeRabbit

New Features
- Gene-level quantification step added to convert transcript counts to gene-level counts, and downstream analyses now use gene-level data.
- Generates a quality-control plot showing proportion of transcripts mapped to genes and emits a warning if many transcripts remain unnamed.
Documentation
- Added a report section describing the transcript-to-gene mapping and the QC plot.

… annotation

…criptome.fa entries

coderabbitai · 2025-11-07T10:09:10Z

Walkthrough

A new transcript-to-gene mapping step was added that converts merged/all_counts.tsv into merged/all_counts_gene.tsv. Downstream Snakemake rules were updated to consume the gene-level counts; supporting scripts and a report fragment were added or modified accordingly.

Changes

Cohort / File(s)	Summary
Snakemake rules updated to use gene-level counts `workflow/rules/commons.smk`, `workflow/rules/diffexp.smk`	Inputs updated to reference `merged/all_counts_gene.tsv` instead of `merged/all_counts.tsv`.
New transcript-to-gene mapping rule `workflow/rules/quantification.smk`	Added `transcriptid_to_gene` rule: inputs `merged/all_counts.tsv`, `references/standardized_genomic.gff`; outputs `merged/all_counts_gene.tsv`, plot `merged/transcriptid_to_gene_plot.svg`; logs to `logs/transcriptid_to_gene.log`; uses conda env `../envs/pydeseq2.yml` and script `../scripts/transcriptid_to_gene.py`. Registered as a local rule.
New transcript ID mapping script `workflow/scripts/transcriptid_to_gene.py`	New script that parses GFF attributes, builds ID→gene mapping, normalizes transcript IDs (removing region-specific suffixes), replaces Reference column with gene-prefixed IDs, writes transformed counts, and emits a pie-chart QC plot and stats.
Gene ID normalization in DE util `workflow/scripts/get_de_genes.py`	Added `original_id(ref)` utility and apply it when building gene sets to normalize gene identifiers by stripping leading prefixes up to `::`.
Report fragment `workflow/report/name_stats.rst`	New RST describing the transcript→gene mapping and referencing the generated pie chart.

Sequence Diagram(s)

sequenceDiagram
    participant Counts as merged/all_counts.tsv
    participant GFF as references/standardized_genomic.gff
    participant Rule as transcriptid_to_gene (rule)
    participant Script as transcriptid_to_gene.py
    participant Outputs as merged/all_counts_gene.tsv\n+ plot

    Note over Counts,GFF: Inputs to mapping step
    Counts->>Rule: start rule
    GFF->>Rule: provide annotations
    Rule->>Script: invoke mapping script
    Script->>Script: parse GFF → build ID→gene map
    Script->>Script: clean transcript IDs (regex)
    Script->>Script: map References → gene-prefixed IDs
    Script->>Outputs: write `merged/all_counts_gene.tsv` and plot
    Outputs->>Downstream: downstream rules now consume gene-level counts

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Areas to focus:
- workflow/scripts/transcriptid_to_gene.py GFF attribute parsing and ID normalization correctness.
- Consistency of downstream rules (commons.smk, diffexp.smk) now using merged/all_counts_gene.tsv.
- Interaction between original_id() in get_de_genes.py and existing gene-matching logic.

Possibly related PRs

feat: Implement DESeq2 in R instead of pyDESeq2 #138 — modifies DE pipeline inputs/expectations for merged/all_counts.tsv, likely overlapping with this PR's change to gene-level counts.

Suggested labels

enhancement

Suggested reviewers

cmeesters

Poem

🐰 I hopped through GFF and counts so keen,
I trimmed the prefixes, made names clean,
Transcripts folded into genes so bright,
A pie chart twinkles in the night,
Workflow hopped forward — one small leap! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: rename transcript ids to gene names' accurately summarizes the main change: adding a new rule to map transcript IDs to gene names and updating downstream rules to use the renamed output.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat-rename_id_to_gene

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

cmeesters

just the changes within the code

workflow/scripts/transcriptid_to_gene.py

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

workflow/scripts/transcriptid_to_gene.py (1)

37-41: Tab-splitting is correct for GFF3; keep as-is.

Spec mandates tab-delimited columns; whitespace split would break attribute fields containing spaces.

🧹 Nitpick comments (5)

workflow/scripts/get_de_genes.py (1)
30-31: Consider vectorized normalization for speed and NaN-safety.

You can avoid Python-level apply and preserve NaN.
-df["gene"] = df["gene"].apply(original_id)
+df["gene"] = df["gene"].where(
+    df["gene"].isna(),
+    df["gene"].astype(str).str.split("::", n=1).str[-1],
+)
workflow/rules/quantification.smk (2)

47-58: Verify annotation path and aggregation intent.

Input annotation "references/standardized_genomic.gff": confirm this file is produced/available in the DAG for all configs (or derive via get_reference_files/config).

Output name all_counts_gene.tsv suggests aggregation by gene, but the script only prefixes "gene::". If aggregation is not intended, consider renaming to reduce confusion.

4-7: Do not force local execution unless required.

Keeping transcriptid_to_gene in localrules may hinder cluster scheduling. If not necessary, drop it or gate via config.
workflow/scripts/transcriptid_to_gene.py (2)
16-16: Avoid forcing all columns to string.

Only coerce the identifier column; leave counts numeric to reduce surprises downstream.
-counts = pd.read_csv(counts, sep="\t", dtype=str)
+counts = pd.read_csv(counts, sep="\t")
+if "Reference" not in counts.columns:
+    raise ValueError("Input counts missing 'Reference' column.")
+counts["Reference"] = counts["Reference"].astype(str)
19-21: Minor: fix typos in comments ("dictionary").

Nit, but helps readability.
-# Parse gff attributes into a {key: value} dictionairy
+# Parse GFF attributes into a {key: value} dictionary
@@
-# Build mapping dictionairy {ID: gene}
+# Build mapping dictionary {ID: gene}
Also applies to: 29-31

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 841cead and 03f3ab6.

📒 Files selected for processing (5)

workflow/rules/commons.smk (1 hunks)
workflow/rules/diffexp.smk (1 hunks)
workflow/rules/quantification.smk (2 hunks)
workflow/scripts/get_de_genes.py (1 hunks)
workflow/scripts/transcriptid_to_gene.py (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.

📚 Learning: 2025-09-10T07:06:30.765Z

Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.

Applied to files:

workflow/rules/diffexp.smk
workflow/rules/quantification.smk
workflow/rules/commons.smk
workflow/scripts/transcriptid_to_gene.py

🪛 Ruff (0.14.3)

workflow/scripts/transcriptid_to_gene.py

6-6: Undefined name snakemake

(F821)

10-10: Undefined name snakemake

(F821)

11-11: Undefined name snakemake

(F821)

12-12: Undefined name snakemake

(F821)

🔇 Additional comments (2)

workflow/rules/diffexp.smk (1)

9-9: Input switch to gene-tagged counts looks fine; confirm downstream expectations.

deseq2-init.R must accept IDs like "GENE::TRANSCRIPT" and not require unique gene rows. If true gene-level DE is intended, consider aggregating counts by gene upstream; otherwise this is a pure rename.

workflow/rules/commons.smk (1)

169-169: Wiring to new gene-tagged counts is consistent.

This ensures the DAG pulls transcriptid_to_gene. Please confirm that the producing rule exists in all profiles and that no targets still rely on merged/all_counts.tsv.

workflow/scripts/get_de_genes.py

workflow/scripts/transcriptid_to_gene.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

workflow/scripts/get_de_genes.py (1)
23-29: Consider the trade-offs of strict validation vs. graceful degradation.

The function now raises ValueError on invalid inputs rather than handling them gracefully. While strict validation ensures data integrity, it may halt the pipeline on edge cases (e.g., if the upstream data contains unexpected NaN values). Consider whether:

Invalid IDs should fail fast (current approach), or

They should be logged/warned and passed through unchanged

If strict validation is intended, the implementation is correct. However, note the static analysis issues on Line 25:

Remove the f prefix since the string has no placeholders

Consider a custom exception class or shorter message to follow best practices

Apply this diff to address the static analysis issues:
 def original_id(ref):
     if not isinstance(ref, str) or pd.isna(ref):
-        raise ValueError(f"Invalid reference ID encountered.")
+        raise ValueError("Invalid reference ID encountered.")
     if "::" in ref:
         return ref.split("::", 1)[1]
     else:
         return ref
workflow/scripts/transcriptid_to_gene.py (2)
83-102: Add validation and fix warning stacklevel.

The mapping and statistics are well-implemented, but consider two improvements:

Validate the Reference column exists before applying the mapping to provide clearer error messages

Add stacklevel to the warning as flagged by static analysis

Apply this diff:
+# Validate expected column exists
+if "Reference" not in counts.columns:
+    raise ValueError("Input counts missing 'Reference' column.")
+
 # Replace transcript IDs in "Reference" column with mapping results
 counts["Reference"] = counts["Reference"].apply(map_id)
 
 # Write output
 counts.to_csv(output, sep="\t", index=False)
 
 # stats
 total = len(counts)
 mapped = counts["Reference"].str.contains("::", regex=True).sum()
 pct = round(mapped / total * 100, 2)
 if total > mapped:
-    warnings.warn("Some transcripts could not be named.")
+    warnings.warn("Some transcripts could not be named.", stacklevel=2)
104-116: Fix spacing in expressions.

The pie chart generation is well-implemented, but fix the spacing issues in the calculations for consistency.

Apply this diff:
 # Generate pie chart
 sizes = [mapped, total - mapped]
-labels = [f"Named ({pct}%)", f"Unnamed ({round(100- pct,2)}%)"]
+labels = [f"Named ({pct}%)", f"Unnamed ({round(100 - pct, 2)}%)"]
 
 plt.figure(figsize=(5, 5))
 plt.pie(sizes, autopct="%1.1f%%", colors=["#4CAF50", "#F44366"])
 plt.legend(
-    [f"Named ({mapped})", f"Unnamed ({total- mapped})"],
+    [f"Named ({mapped})", f"Unnamed ({total - mapped})"],
     loc="upper right",
     title="Transcripts",
 )
 plt.savefig(output_plot, format="svg")
 plt.close()

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03f3ab6 and 8bf39bb.

📒 Files selected for processing (4)

workflow/report/name_stats.rst (1 hunks)
workflow/rules/quantification.smk (2 hunks)
workflow/scripts/get_de_genes.py (1 hunks)
workflow/scripts/transcriptid_to_gene.py (1 hunks)

✅ Files skipped from review due to trivial changes (1)

workflow/report/name_stats.rst

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.

📚 Learning: 2025-09-10T07:06:30.765Z

Learnt from: cmeesters
Repo: snakemake-workflows/rna-longseq-de-isoform PR: 133
File: workflow/rules/quantification.smk:11-11
Timestamp: 2025-09-10T07:06:30.765Z
Learning: In the rna-longseq-de-isoform Snakemake workflow, the correct_transcriptome rule in datamod.smk converts transcriptome/transcriptome.fa to transcriptome/corrected_transcriptome.fa by replacing spaces with underscores. References to transcriptome/transcriptome.fa in this context are legitimate inputs to the conversion rule, not stale references.

Applied to files:

workflow/scripts/transcriptid_to_gene.py
workflow/rules/quantification.smk
workflow/scripts/get_de_genes.py

🪛 Ruff (0.14.4)

workflow/scripts/transcriptid_to_gene.py

8-8: Undefined name snakemake

(F821)

12-12: Undefined name snakemake

(F821)

13-13: Undefined name snakemake

(F821)

14-14: Undefined name snakemake

(F821)

15-15: Undefined name snakemake

(F821)

94-94: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

workflow/scripts/get_de_genes.py

25-25: Avoid specifying long messages outside the exception class

(TRY003)

25-25: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Formatting

🔇 Additional comments (9)

workflow/scripts/get_de_genes.py (1)

32-32: Applying strict validation to all gene entries.

The .apply(original_id) call will raise and halt processing if any gene entry is invalid (non-string or NaN). This fail-fast behavior is appropriate if data quality is guaranteed. If you anticipate edge cases, consider adding a try-except block around this line to provide clearer error context indicating which gene ID caused the failure.

workflow/rules/quantification.smk (2)

4-6: LGTM!

Adding transcriptid_to_gene to localrules is appropriate for this data transformation and plotting step.

47-68: Rule is properly configured with all dependencies verified.

The transcriptid_to_gene rule has valid inputs, outputs, and well-documented plot metadata. Verification confirms that the conda environment (workflow/envs/pydeseq2.yml) contains all required packages (pandas via anndata, matplotlib via seaborn) and the report caption file (workflow/report/name_stats.rst) exists. No issues identified.

workflow/scripts/transcriptid_to_gene.py (6)

1-10: LGTM!

The imports and logging setup are appropriate. Explicitly managing the log file handle allows proper cleanup at the end.

12-20: Good choice to preserve ID integrity.

Loading counts with dtype=str ensures transcript IDs are not misinterpreted as numeric values, which is important for identifiers that may look numeric.

22-29: LGTM!

The parse_attributes function correctly handles GFF3 attribute format. Silently skipping malformed parts is acceptable for robustness.

32-59: Excellent fallback strategy for gene name extraction.

The implementation now tries multiple attribute keys (gene, gene_name, Name, gene_id) and derives from Parent when needed. This addresses the past review comment and significantly improves compatibility with different GFF annotation formats.

62-67: LGTM!

The clean_id function is well-documented and correctly handles region-specific suffixes in transcript IDs.

70-80: LGTM!

The two-stage matching strategy (exact then cleaned) is robust. The gene::transcript format is consistent with the downstream processing in get_de_genes.py.

cmeesters

accepting on the ground of all discussions. Assuming, it is working.

yeising added 4 commits November 7, 2025 10:58

feat: rename transcript IDs of count data to gene names obtained from…

0a2dc78

… annotation

chore: update downstream diffexp to use new named input

af6adad

fix: updated script to remove gene name so transcript_ids match trans…

5e6c27e

…criptome.fa entries

style: snakefmt reformatting

63afb3e

chore: update rule all input with gnee named merged count data

bf4d7a4

cmeesters requested changes Nov 7, 2025

View reviewed changes

workflow/scripts/transcriptid_to_gene.py Show resolved Hide resolved

workflow/scripts/transcriptid_to_gene.py Show resolved Hide resolved

docs: elaborated script commentary

03f3ab6

yeising marked this pull request as ready for review November 7, 2025 14:08

coderabbitai bot reviewed Nov 7, 2025

View reviewed changes

workflow/scripts/get_de_genes.py Show resolved Hide resolved

workflow/scripts/transcriptid_to_gene.py Show resolved Hide resolved

workflow/scripts/transcriptid_to_gene.py Show resolved Hide resolved

yeising added 4 commits November 10, 2025 10:21

feat: broadened gene name extraction by expanding attribute keys

9d09ec3

style: snakefmt formatting

8f790ff

feat: introduced missing id handling

cfde63d

feat: adds warning and stats for report

8bf39bb

coderabbitai bot reviewed Nov 12, 2025

View reviewed changes

cmeesters self-requested a review November 19, 2025 10:32

cmeesters approved these changes Nov 19, 2025

View reviewed changes

cmeesters merged commit f3e643f into main Nov 19, 2025
7 checks passed

cmeesters deleted the feat-rename_id_to_gene branch November 19, 2025 10:33

github-actions bot mentioned this pull request Nov 19, 2025

chore(main): release 2.5.0 #137

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: rename transcript ids to gene names #141

feat: rename transcript ids to gene names #141

Uh oh!

yeising commented Nov 7, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 7, 2025 •

edited

Loading

Uh oh!

cmeesters left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

cmeesters left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: rename transcript ids to gene names #141

feat: rename transcript ids to gene names #141

Uh oh!

Conversation

yeising commented Nov 7, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

cmeesters left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

cmeesters left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yeising commented Nov 7, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 7, 2025 •

edited

Loading