Merge pull request #599 from d4straub/pr2-species-assignment

d4straub · web-flow · commit 311548869f8c · 2023-06-26T14:21:38.000+02:00
PR2 exact species assignment now without taxa ending with sp.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -21,6 +21,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [#563](https://github.com/nf-core/ampliseq/pull/563) - Renamed DADA2 taxonomic classification files to include the chosen reference taxonomy abbreviation.
 - [#567](https://github.com/nf-core/ampliseq/pull/567) - Renamed `--dada_tax_agglom_min` and `--qiime_tax_agglom_min` to `--tax_agglom_min` and `--dada_tax_agglom_max` and `--qiime_tax_agglom_max` to `--tax_agglom_max`
 - [#598](https://github.com/nf-core/ampliseq/pull/598) - Updated Workflow figure with SINTAX and phylogenetic placement
+- [#599](https://github.com/nf-core/ampliseq/pull/599) - For exact species assignment (DADA2's addSpecies), the PR2 taxonomy database (e.g. `--dada_ref_taxonomy pr2`) now excludes any taxa whose names end with " sp.".
 
 ### `Fixed`
 
diff --git a/bin/taxref_reformat_pr2.sh b/bin/taxref_reformat_pr2.sh
@@ -7,4 +7,5 @@ gunzip -c *dada2.fasta.gz > assignTaxonomy.fna
 
 # For addSpecies(), the UTAX file is downloaded and reformatted to contain only the id and species.
 # The second and third sed calls replace "_" with a space only in the species name and not in the last part of the id (overdoing it a bit, as I don't think the id actually matters as long as it's unique).
-gunzip -c *UTAX.fasta.gz | sed '/^>/s/>\([^;]*\);.*,s:\(.*\)/>\1 \2/' | sed 's/_/ /g' | sed 's/ \([A-Z]\) /_\1 /' > addSpecies.fna
+# The awk part removes any entries (sequence name and sequence) that have a sequence name ending with " sp."
+gunzip -c *UTAX.fasta.gz | sed '/^>/s/>\([^;]*\);.*,s:\(.*\)/>\1 \2/' | sed 's/_/ /g' | sed 's/ \([A-Z]\) /_\1 /' | awk '!/ sp\.\n/' RS=">" ORS=">" > addSpecies.fna
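
To see what the record-wise awk filter does, here is a minimal sketch on a toy FASTA stream. It treats everything between two `>` characters as one record (header plus sequence) and drops records whose header ends with " sp.". The function name `filter_sp` and the toy ids and names are hypothetical and not part of the pipeline. Escaping the dot (so only a literal period matches) and skipping the empty record before the first `>` (`NR > 1`) are assumptions of this sketch, not a statement about what the script must do.

```shell
# Hypothetical helper for illustration: read FASTA on stdin and drop every
# record (header + sequence) whose header line ends with " sp."
filter_sp() {
  # RS=">" makes each FASTA entry one awk record; the text before the first
  # ">" is an empty record, which NR > 1 skips. ORS is emptied because the
  # ">" is re-added explicitly in front of each kept record.
  awk 'BEGIN { RS = ">"; ORS = "" } NR > 1 && !/ sp\.\n/ { print ">" $0 }'
}

# Toy input (made-up ids and names): the id2 entry should be removed.
printf '>id1 Genus species\nACGT\n>id2 Genus sp.\nTTGA\n>id3 Other name\nGGCC\n' | filter_sp
```

Because sequence lines contain no spaces, the pattern " sp." followed by a newline can only match at the end of a header line, so matching against the whole record is enough here.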