From 51db9011805d5f26d97ef51f455bd48354152731 Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Fri, 16 May 2025 13:24:31 +0200 Subject: [PATCH 01/15] rough plan for metadata training --- docs/side_quests/metadata.md | 84 ++++++++++++++++++++++++++++++++++++ 1 file changed, 84 insertions(+) create mode 100644 docs/side_quests/metadata.md diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md new file mode 100644 index 0000000000..a1d190c0f9 --- /dev/null +++ b/docs/side_quests/metadata.md @@ -0,0 +1,84 @@ +# Metadata + +introduction: + +what is meta data +why is it important + +## 0. Warmup + +check material + +### 0.1 Prerequisites + +### 0.2 Starting Point + +## 1. Read in samplesheet + +### 1.1. Read in samplesheet with splitCsv + +read in file with splitCsv + +Straight up the same content as splitting and grouping + +### Takeaway + +In this section, you've learned: + +- **Reading in a samplesheet**: How to read in a samplesheet with `splitCsv` +- Assign columns to fields in meta map. why is this useful (Objective: Learn about maps and how to add values) + +--- + +## 2. Create a new meta map key + +Run a module; maybe something conceptually like a QC module +Assign a new value in the meta map following the QC module (Objective: tweak the map with computed values) + +### Takeaway + +In this section, you've learned: + +- **Creating custom keys** + +--- + +## 3. Group data with matching value in the meta map + + +Repeat a grouping example, using this new value as grouping or branching decider (Objective: use the meta map to decide on workflow paths) +Run a module with the groups + +### Takeaway + +In this section, you've learned: + +- **Extracting an arbitray value to group or filter on** + +--- + +## 4. 
Publishing location based on meta map value + +tweak the publishing directory based on a field in the meta map (Objective: use the meta map in the module) + +### Takeaway + +In this section, you've learned: + +- **Tweaking directives using meta values** + +--- + +## 5. Tweak tool arguments based on meta map value + +tweak the publishing directory based on a field in the meta map (Objective: use the meta map in the module) + +### Takeaway + +In this section, you've learned: + +- **Tweaking script section based on meta values** + +--- + +## Summary From 8fcc9987e6cabf4bd48a8da59651fdc104f87856 Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Tue, 20 May 2025 16:26:44 +0200 Subject: [PATCH 02/15] solution for training --- side-quests/metadata/main.nf | 70 ++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100644 side-quests/metadata/main.nf diff --git a/side-quests/metadata/main.nf b/side-quests/metadata/main.nf new file mode 100644 index 0000000000..d22cc54e39 --- /dev/null +++ b/side-quests/metadata/main.nf @@ -0,0 +1,70 @@ +/* + * Use echo to print 'Hello World!' 
to a file + */ +process IDENTIFY_LANGUAGE { + publishDir 'results', mode: 'copy' + tag "${meta.id}" + + container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff' + + input: + tuple val(meta), path(greeting) + + output: + tuple val(meta), stdout + + script: + """ + langid < ${greeting} -l en,de,fr,es,it | sed -E "s/.*\\('([a-z]+)'.*/\\1/" | tr -d '\\n' + """ +} + +/* + * Generate ASCII art with cowpy +*/ +process COWPY { + tag "${meta.id}" + + publishDir "results/${meta.lang}", mode: 'copy' + + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' + + input: + tuple val(meta), path(input_file) + val character + + output: + tuple val(meta), path("cowpy-${input_file}") + + script: + """ + cat $input_file | cowpy -c "$character" > cowpy-${input_file} + """ + +} + +workflow { + + files = Channel.fromPath("./data/*.txt").map { file -> [ [id:file.getName()], file] } + + ch_prediction = IDENTIFY_LANGUAGE(files) + + ch_language_groups = files.join(ch_prediction) + //Uses meta map to join + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + //generates a new key in the meta map + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 
'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } + + //uses meta map field to filter the channel to only include languages in the romanic group + romanic_languages = ch_language_groups.filter { meta, file -> + meta.lang_group == 'romanic' + } + + COWPY(romanic_languages, params.character) + +} From bdcb3199b1f5eae5d31b9228e331c60f7848f398 Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Tue, 20 May 2025 16:54:43 +0200 Subject: [PATCH 03/15] update solution --- side-quests/metadata/data/bonjour.txt | 2 ++ side-quests/metadata/data/ciao.txt | 2 ++ side-quests/metadata/data/guten_tag.txt | 2 ++ side-quests/metadata/data/hallo.txt | 2 ++ side-quests/metadata/data/hello.txt | 2 ++ side-quests/metadata/data/hola.txt | 2 ++ side-quests/metadata/data/salut.txt | 2 ++ side-quests/metadata/data/samplesheet.csv | 8 ++++++++ side-quests/metadata/main.nf | 13 ++++++++----- side-quests/metadata/nextflow.config | 1 + 10 files changed, 31 insertions(+), 5 deletions(-) create mode 100644 side-quests/metadata/data/bonjour.txt create mode 100644 side-quests/metadata/data/ciao.txt create mode 100644 side-quests/metadata/data/guten_tag.txt create mode 100644 side-quests/metadata/data/hallo.txt create mode 100644 side-quests/metadata/data/hello.txt create mode 100644 side-quests/metadata/data/hola.txt create mode 100644 side-quests/metadata/data/salut.txt create mode 100644 side-quests/metadata/data/samplesheet.csv create mode 100644 side-quests/metadata/nextflow.config diff --git a/side-quests/metadata/data/bonjour.txt b/side-quests/metadata/data/bonjour.txt new file mode 100644 index 0000000000..e43b286664 --- /dev/null +++ b/side-quests/metadata/data/bonjour.txt @@ -0,0 +1,2 @@ +Bonjour +Salut, à demain \ No newline at end of file diff --git a/side-quests/metadata/data/ciao.txt b/side-quests/metadata/data/ciao.txt new file mode 100644 index 0000000000..180527c14f --- /dev/null +++ b/side-quests/metadata/data/ciao.txt @@ -0,0 +1,2 @@ +Ciao +Ci vediamo domani 
\ No newline at end of file diff --git a/side-quests/metadata/data/guten_tag.txt b/side-quests/metadata/data/guten_tag.txt new file mode 100644 index 0000000000..42cb3eb17a --- /dev/null +++ b/side-quests/metadata/data/guten_tag.txt @@ -0,0 +1,2 @@ +Guten Tag, wie geht es dir? +Auf Wiedersehen, bis morgen \ No newline at end of file diff --git a/side-quests/metadata/data/hallo.txt b/side-quests/metadata/data/hallo.txt new file mode 100644 index 0000000000..c7c5ff2489 --- /dev/null +++ b/side-quests/metadata/data/hallo.txt @@ -0,0 +1,2 @@ +Hallo +Tschüss, bis morgen diff --git a/side-quests/metadata/data/hello.txt b/side-quests/metadata/data/hello.txt new file mode 100644 index 0000000000..e19e49cce7 --- /dev/null +++ b/side-quests/metadata/data/hello.txt @@ -0,0 +1,2 @@ +Hello +Bye, see you tomorrow diff --git a/side-quests/metadata/data/hola.txt b/side-quests/metadata/data/hola.txt new file mode 100644 index 0000000000..1b1b35ca2d --- /dev/null +++ b/side-quests/metadata/data/hola.txt @@ -0,0 +1,2 @@ +Hola +Adiós, hasta mañana \ No newline at end of file diff --git a/side-quests/metadata/data/salut.txt b/side-quests/metadata/data/salut.txt new file mode 100644 index 0000000000..2b5a68b1d5 --- /dev/null +++ b/side-quests/metadata/data/salut.txt @@ -0,0 +1,2 @@ +Salut, ça va? 
+à plus dans le bus \ No newline at end of file diff --git a/side-quests/metadata/data/samplesheet.csv b/side-quests/metadata/data/samplesheet.csv new file mode 100644 index 0000000000..8c3db104f6 --- /dev/null +++ b/side-quests/metadata/data/samplesheet.csv @@ -0,0 +1,8 @@ +id,animal,recording +sampleA,squirrel,/workspaces/training/side-quests/metadata/data/bonjour.txt +sampleB,tux,/workspaces/training/side-quests/metadata/data/guten_tag.txt +sampleC,sheep,/workspaces/training/side-quests/metadata/data/hallo.txt +sampleD,turkey,/workspaces/training/side-quests/metadata/data/hello.txt +sampleE,stegosaurus,/workspaces/training/side-quests/metadata/data/hola.txt +sampleF,moose,/workspaces/training/side-quests/metadata/data/salut.txt +sampleG,turtle,/workspaces/training/side-quests/metadata/data/ciao.txt \ No newline at end of file diff --git a/side-quests/metadata/main.nf b/side-quests/metadata/main.nf index d22cc54e39..408bace8c3 100644 --- a/side-quests/metadata/main.nf +++ b/side-quests/metadata/main.nf @@ -2,7 +2,7 @@ * Use echo to print 'Hello World!' 
to a file */ process IDENTIFY_LANGUAGE { - publishDir 'results', mode: 'copy' + tag "${meta.id}" container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff' @@ -31,21 +31,24 @@ process COWPY { input: tuple val(meta), path(input_file) - val character output: tuple val(meta), path("cowpy-${input_file}") script: """ - cat $input_file | cowpy -c "$character" > cowpy-${input_file} + cat $input_file | cowpy -c ${meta.animal} > cowpy-${input_file} """ } workflow { - files = Channel.fromPath("./data/*.txt").map { file -> [ [id:file.getName()], file] } + files = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, animal:row.animal], row.recording ] + } ch_prediction = IDENTIFY_LANGUAGE(files) @@ -65,6 +68,6 @@ workflow { meta.lang_group == 'romanic' } - COWPY(romanic_languages, params.character) + COWPY(romanic_languages) } diff --git a/side-quests/metadata/nextflow.config b/side-quests/metadata/nextflow.config new file mode 100644 index 0000000000..e685c23b0d --- /dev/null +++ b/side-quests/metadata/nextflow.config @@ -0,0 +1 @@ +docker.enabled=true \ No newline at end of file From 14ae39d93f0b5f44ca0eb4f0babb3b0710cb2923 Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Wed, 21 May 2025 10:33:52 +0200 Subject: [PATCH 04/15] first half of the text --- docs/side_quests/metadata.md | 560 ++++++++++++++++++++++++++++++++++- 1 file changed, 550 insertions(+), 10 deletions(-) diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md index a1d190c0f9..477666bc07 100644 --- a/docs/side_quests/metadata.md +++ b/docs/side_quests/metadata.md @@ -4,46 +4,586 @@ introduction: what is meta data why is it important +sample specific, not something that is the same for all samples ## 0. 
Warmup -check material - ### 0.1 Prerequisites +Before taking on this side quest you should: + +- Complete the [Hello Nextflow](../hello_nextflow/README.md) tutorial +- Understand basic Nextflow concepts (processes, channels, operators) + ### 0.2 Starting Point +Let's move into the project directory. + +```bash +cd side-quests/metadata +``` + +You'll find a `data` directory containing a samplesheet and a main workflow file. + +```console title="Directory contents" +> tree +. +├── data +│ ├── bonjour.txt +│ ├── ciao.txt +│ ├── guten_tag.txt +│ ├── hallo.txt +│ ├── hello.txt +│ ├── hola.txt +│ ├── salut.txt +│ └── samplesheet.csv +├── main.nf +└── nextflow.config +``` + +The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. In particular, the samplesheet has 3 columns: + + - `id`: self-explanatory, an ID given to the sample + - `character`: a character name, that we will use later to draw different creatures + - `data`: paths to `.txt` files that contain phrases in different languages + + ```console title="samplesheet.csv" +id,character,recording +sampleA,squirrel,/workspaces/training/side-quests/metadata/data/bonjour.txt +sampleB,tux,/workspaces/training/side-quests/metadata/data/guten_tag.txt +sampleC,sheep,/workspaces/training/side-quests/metadata/data/hallo.txt +sampleD,turkey,/workspaces/training/side-quests/metadata/data/hello.txt +sampleE,stegosaurus,/workspaces/training/side-quests/metadata/data/hola.txt +sampleF,moose,/workspaces/training/side-quests/metadata/data/salut.txt +sampleG,turtle,/workspaces/training/side-quests/metadata/data/ciao.txt +``` + ## 1. Read in samplesheet ### 1.1. Read in samplesheet with splitCsv -read in file with splitCsv +Let's start by reading in the samplesheet with `splitCsv`. In the main workflow file, you'll see that we've already started the workflow. 
+```groovy title="main.nf" linenums="1"
+workflow {
+
+    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+        .splitCsv(header: true)
+
+}
+```
+
+!!! note
+
+    Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels.
+
+=== "After"
+
+    ```groovy title="main.nf" linenums="2" hl_lines="3"
+    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+        .splitCsv(header: true)
+        .view()
+    ```
+
+=== "Before"
+
+    ```groovy title="main.nf" linenums="2"
+    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+        .splitCsv(header: true)
+    ```
+
+We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file.
+
+The `header: true` option tells Nextflow to use the first row of the CSV file as the header row, whose fields will be used as keys for the values. Let's see what Nextflow sees after reading the file with `splitCsv`. To do this, we can use the `view` operator.
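+To make the element shape concrete before running anything, here is a small Groovy sketch of what a single row emitted by `splitCsv(header: true)` looks like. The map literal below is illustrative only, with a shortened path:
+
+```groovy
+// One channel element per CSV row; the keys come from the header line.
+def row = [id: 'sampleA', character: 'squirrel', recording: '/data/bonjour.txt']
+
+// Fields can be read with property-style or subscript access:
+assert row.id == 'sampleA'
+assert row['recording'] == '/data/bonjour.txt'
+```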
+Run the pipeline:
+
+```bash title="Read the samplesheet"
+nextflow run main.nf
+```
+
+```console title="Read samplesheet with splitCsv"
+ N E X T F L O W ~ version 24.10.4
+
+Launching `main.nf` [exotic_albattani] DSL2 - revision: c0d03cec83
+
+[id:sampleA, character:squirrel, recording:/workspaces/training/side-quests/metadata/data/bonjour.txt]
+[id:sampleB, character:tux, recording:/workspaces/training/side-quests/metadata/data/guten_tag.txt]
+[id:sampleC, character:sheep, recording:/workspaces/training/side-quests/metadata/data/hallo.txt]
+[id:sampleD, character:turkey, recording:/workspaces/training/side-quests/metadata/data/hello.txt]
+[id:sampleE, character:stegosaurus, recording:/workspaces/training/side-quests/metadata/data/hola.txt]
+[id:sampleF, character:moose, recording:/workspaces/training/side-quests/metadata/data/salut.txt]
+[id:sampleG, character:turtle, recording:/workspaces/training/side-quests/metadata/data/ciao.txt]
+```
+
+We can see that each row from the CSV file has been converted into a map with keys matching the header row. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby.
+
+Each map contains:
+
+- `id`: an ID given to the sample
+- `character`: a character name that we will use later to draw different creatures
+- `recording`: the path to a `.txt` file that contains phrases in a particular language
+
+This format makes it easy to access specific fields from each sample. For example, we can access the sample ID with `row.id` or the txt file path with `row.recording`. Now that we've successfully read in the samplesheet and have access to the data in each row, we can begin implementing our pipeline logic.
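+As a quick sketch (not part of the final workflow), the same field access works inside operator closures, which is what we will rely on next:
+
+```groovy
+// Build a label from two fields of each row; add this inside the
+// workflow block if you want to try it out.
+ch_samplesheet
+    .map { row -> "Sample ${row.id} will be drawn by the '${row.character}' character" }
+    .view()
+```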
+ +### 1.2 Separate meta data and data + +In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map: + +=== "After" + + ```groovy title="main.nf" linenums="3" hl_lines="5-8" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="3" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .view() + ``` + +```bash title="View meta map" +nextflow run main.nf +``` + +```console title="View meta map" + N E X T F L O W ~ version 24.10.4 + +Launching `main.nf` [lethal_booth] DSL2 - revision: 0d8f844c07 + +[[id:sampleA, character:squirrel], /workspaces/training/side-quests/metadata/data/bonjour.txt] +[[id:sampleB, character:tux], /workspaces/training/side-quests/metadata/data/guten_tag.txt] +[[id:sampleC, character:sheep], /workspaces/training/side-quests/metadata/data/hallo.txt] +[[id:sampleD, character:turkey], /workspaces/training/side-quests/metadata/data/hello.txt] +[[id:sampleE, character:stegosaurus], /workspaces/training/side-quests/metadata/data/hola.txt] +[[id:sampleF, character:moose], /workspaces/training/side-quests/metadata/data/salut.txt] +[[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt] +``` + +We have successfully separate our meta data into its own map to keep it next to the file data. Each of our channel elements now has the shape `[ metamap, file ]`. 
+### Takeaway
+
+In this section, you've learned:
+
+- **Reading in a samplesheet**: How to read in a samplesheet with `splitCsv`
+- **Creating a meta map**: How to move the columns with meta information into a separate data structure that is kept next to the input data
+
+---
+
+## 2. Create new meta map keys
+
+### 2.1 Passing the meta map through a process
+
+Now we want to process our samples. These samples are language samples, but we don't know which language each one is in. Let's add a process definition before the `workflow` block that can identify the language in each file:
+
+=== "After"
+
+    ```groovy title="main.nf" linenums="1" hl_lines="1-19"
+    /*
+     * Use langid to predict the language of each input file
+     */
+    process IDENTIFY_LANGUAGE {
+
+        container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff'
+
+        input:
+        tuple val(meta), path(greeting)
+
+        output:
+        tuple val(meta), stdout
+
+        script:
+        """
+        langid < ${greeting} -l en,de,fr,es,it | sed -E "s/.*\\('([a-z]+)'.*/\\1/" | tr -d '\\n'
+        """
+    }
+    ```
+
+=== "Before"
+
+    ```groovy title="main.nf" linenums="1"
+    workflow {
+        ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+            .splitCsv(header: true)
+            .map { row ->
+                [ [id:row.id, character:row.character], row.recording ]
+            }
+            .view()
+    }
+    ```
+
+The tool [langid](https://github.com/saffsd/langid.py) is a language identification tool that comes pre-trained on a set of languages. For a given phrase, it prints a language guess and a probability score for that guess to the console. In the `script` section, we remove the probability score, clean up the string by removing the newline character, and return only the language guess. Since the result is printed directly to the console, we use Nextflow's [`stdout` output qualifier](https://www.nextflow.io/docs/latest/process.html#outputs) to pass the string on as an output.
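+To see what that clean-up does, here is a small Groovy sketch. The raw string below is hard-coded for illustration and assumes langid prints tuples of the form `('<lang>', <score>)`:
+
+```groovy
+// Hypothetical raw langid output for a French phrase:
+def raw = "('fr', -54.3)\n"
+
+// Equivalent of the sed/tr clean-up: keep only the language code.
+def lang = (raw =~ /\('([a-z]+)'/)[0][1]
+assert lang == 'fr'
+```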
+Let's include the process, run it, and view the output:
+
+=== "After"
+
+    ```groovy title="main.nf" linenums="20" hl_lines="9-10"
+    workflow {
+
+        ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+            .splitCsv(header: true)
+            .map { row ->
+                [ [id:row.id, character:row.character], row.recording ]
+            }
+
+        ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet)
+        ch_prediction.view()
+
+    }
+    ```
+
+=== "Before"
+
+    ```groovy title="main.nf" linenums="20"
+    workflow {
+
+        ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+            .splitCsv(header: true)
+            .map { row ->
+                [ [id:row.id, character:row.character], row.recording ]
+            }
+            .view()
+    }
+    ```
+
+```bash title="Identify languages"
+nextflow run main.nf
+```
+
+```console title="Identify languages"
+ N E X T F L O W ~ version 24.10.4
+
+Launching `main.nf` [voluminous_mcnulty] DSL2 - revision: f9bcfebabb
+
+executor >  local (7)
+[2c/888abb] IDENTIFY_LANGUAGE (7) [100%] 7 of 7 ✔
+[[id:sampleA, character:squirrel], fr]
+[[id:sampleB, character:tux], de]
+[[id:sampleC, character:sheep], de]
+[[id:sampleD, character:turkey], en]
+[[id:sampleE, character:stegosaurus], es]
+[[id:sampleF, character:moose], fr]
+[[id:sampleG, character:turtle], it]
+```
+
+Neat: for each of our samples, we now have a predicted language. You may have noticed something else: the metadata of each sample stayed associated with the new piece of information. We achieved this by adding the `meta` map to the output tuple of the process:
+
+```groovy title="main.nf" linenums="12"
+output:
+    tuple val(meta), stdout
+```
+
+This is a useful pattern to ensure that sample-specific meta information stays connected to any new information that is generated.
+
+### 2.2 Associate the language prediction with the input file
+
+At the moment, our sample files and their language predictions live in two different channels: `ch_samplesheet` and `ch_prediction`.
But both channels have the same meta information associated with the interesting data points. We can use the meta map to combine our channels back together. + +Nextflow includes many methods for combining channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). If you are familiar with SQL, it acts like the `JOIN` operation, where we specify the key to join on and the type of join to perform. + +If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) documentation, we can see that it joins two channels based on a defined item, by default the first item in each tuple. If you don't have the console output still available, let's run the pipeline to check our data structures. If you removed the `view()` operator from the `ch_samplesheet` add it back in: + +=== "After" + + ``` groovy title="main.nf" linenums="20" hl_lines="27" + workflow { + + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + .view() + + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction.view() + + } -Run a module; maybe something conceptually like a QC module -Assign a new value in the meta map following the QC module (Objective: tweak the map with computed values) + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="20" + workflow { + + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction.view() + + } + ``` + +```bash title="View samplesheet and prediction channel content" +nextflow run main.nf +``` + +```console title="View samplesheet and prediction channel content" + N E X T F L O W ~ version 24.10.4 + +Launching `main.nf` [trusting_blackwell] DSL2 - revision: de90745ea4 + +executor > local 
(7) +[d6/0f2efd] IDENTIFY_LANGUAGE (7) [100%] 7 of 7 ✔ +[[id:sampleA, character:squirrel], /workspaces/training/side-quests/metadata/data/bonjour.txt] +[[id:sampleB, character:tux], /workspaces/training/side-quests/metadata/data/guten_tag.txt] +[[id:sampleC, character:sheep], /workspaces/training/side-quests/metadata/data/hallo.txt] +[[id:sampleD, character:turkey], /workspaces/training/side-quests/metadata/data/hello.txt] +[[id:sampleE, character:stegosaurus], /workspaces/training/side-quests/metadata/data/hola.txt] +[[id:sampleF, character:moose], /workspaces/training/side-quests/metadata/data/salut.txt] +[[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt] +[[id:sampleB, character:tux], de] +[[id:sampleA, character:squirrel], fr] +[[id:sampleC, character:sheep], de] +[[id:sampleD, character:turkey], en] +[[id:sampleE, character:stegosaurus], es] +[[id:sampleF, character:moose], fr] +[[id:sampleG, character:turtle], it] +``` + +We can see that the meta map is the first element in each map and the map is the same for both channels. 
We can simply use the `join` operator to combine the two channels: + +=== "After" + + ``` groovy title="main.nf" linenums="20" hl_lines="27" + workflow { + + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + + ch_languages = ch_samplesheet.join(ch_prediction) + .view() + } + + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="20" + workflow { + + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + .view() + + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction.view() + + } + ``` + +```bash title="View joined channel" +nextflow run main.nf +``` + +```console title="View joined channel" +[[id:sampleA, character:squirrel], /workspaces/training/side-quests/metadata/data/bonjour.txt, fr] +[[id:sampleB, character:tux], /workspaces/training/side-quests/metadata/data/guten_tag.txt, de] +[[id:sampleC, character:sheep], /workspaces/training/side-quests/metadata/data/hallo.txt, de] +[[id:sampleD, character:turkey], /workspaces/training/side-quests/metadata/data/hello.txt, en] +[[id:sampleE, character:stegosaurus], /workspaces/training/side-quests/metadata/data/hola.txt, es] +[[id:sampleF, character:moose], /workspaces/training/side-quests/metadata/data/salut.txt, fr] +[[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt, it] +``` + +It is becoming a bit hard to see, but if you look all the way on the right side, you can see that now each of our language predictions is associated with our input files. + +!!! warning + + The `join` operator will discard any un-matched tuples. 
+    In this example, every sample appears in both channels, so no tuples are discarded. If that is not guaranteed for your data, use the `remainder: true` option to keep the unmatched tuples. Check the [documentation](https://www.nextflow.io/docs/latest/operator.html#join) for more details.
+
+### 2.3 Add the language prediction to the meta map
+
+Given that the prediction is more data about the files, let's add it to our meta map. We can use the [`map` operator](https://www.nextflow.io/docs/latest/operator.html#map) to create a new key `lang` and set its value to the prediction:
+
+=== "After"
+
+    ```groovy title="main.nf" linenums="20" hl_lines="12-14"
+    workflow {
+
+        ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+            .splitCsv(header: true)
+            .map { row ->
+                [ [id:row.id, character:row.character], row.recording ]
+            }
+
+        ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet)
+
+        ch_languages = ch_samplesheet.join(ch_prediction)
+            .map { meta, file, lang ->
+                [ meta + [lang:lang], file ]
+            }
+            .view()
+
+    }
+    ```
+
+=== "Before"
+
+    ```groovy title="main.nf" linenums="20"
+    workflow {
+
+        ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+            .splitCsv(header: true)
+            .map { row ->
+                [ [id:row.id, character:row.character], row.recording ]
+            }
+
+        ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet)
+
+        ch_languages = ch_samplesheet.join(ch_prediction)
+            .view()
+    }
+    ```
+
+```bash title="View new meta map key"
+nextflow run main.nf -resume
+```
+
+```console title="View new meta map key"
+
+ N E X T F L O W ~ version 24.10.4
+
+Launching `main.nf` [cheeky_fermat] DSL2 - revision: d096281ee4
+
+[da/652cc6] IDENTIFY_LANGUAGE (7) [100%] 7 of 7, cached: 7 ✔
+[[id:sampleA, character:squirrel, lang:fr], /workspaces/training/side-quests/metadata/data/bonjour.txt]
+[[id:sampleB, character:tux, lang:de], /workspaces/training/side-quests/metadata/data/guten_tag.txt]
+[[id:sampleC, character:sheep, lang:de], /workspaces/training/side-quests/metadata/data/hallo.txt]
+[[id:sampleD, 
character:turkey, lang:en], /workspaces/training/side-quests/metadata/data/hello.txt] +[[id:sampleE, character:stegosaurus, lang:es], /workspaces/training/side-quests/metadata/data/hola.txt] +[[id:sampleF, character:moose, lang:fr], /workspaces/training/side-quests/metadata/data/salut.txt] +[[id:sampleG, character:turtle, lang:it], /workspaces/training/side-quests/metadata/data/ciao.txt] +``` + +Nice, we expanded our meta map with new information we gathered in the pipeline. + +### 2.4 Assign a language group using a ternary operator + +Alright, now that we have our language predictions, let's use the information to assign them into new groups. In our example data, we have provided data sets that belong either to `germanic` (either English or German) or `romanic` (French, Spanish, Italian) languages. + +We can use the `map` operator and an [ternary operator](https://groovy-lang.org/operators.html#_ternary_operator) to assign either group. The ternary operator, is a short cut to an if/else clause. It says: + +``` +variable = ? 'Value' : 'Default' +``` + +and is the same as: + +``` +if (){ + variable = 'Value' +} else { + variable = 'Default' +} +``` + +=== "After" + + ``` groovy title="main.nf" linenums="20" hl_lines="34-37" + workflow { + + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 
'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } + .view() + + } + + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="20" + workflow { + + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .view() + + } + ``` + +Let's rerun it + +```bash title="View language groups" +nextflow run main.nf -resume +``` + +```console title="View language groups" + N E X T F L O W ~ version 24.10.4 + +Launching `main.nf` [wise_almeida] DSL2 - revision: 46778c3cd0 + +[da/652cc6] IDENTIFY_LANGUAGE (7) [100%] 7 of 7, cached: 7 ✔ +[[id:sampleA, character:squirrel, lang:fr, lang_group:romanic], /workspaces/training/side-quests/metadata/data/bonjour.txt] +[[id:sampleB, character:tux, lang:de, lang_group:germanic], /workspaces/training/side-quests/metadata/data/guten_tag.txt] +[[id:sampleC, character:sheep, lang:de, lang_group:germanic], /workspaces/training/side-quests/metadata/data/hallo.txt] +[[id:sampleD, character:turkey, lang:en, lang_group:germanic], /workspaces/training/side-quests/metadata/data/hello.txt] +[[id:sampleE, character:stegosaurus, lang:es, lang_group:romanic], /workspaces/training/side-quests/metadata/data/hola.txt] +[[id:sampleF, character:moose, lang:fr, lang_group:romanic], /workspaces/training/side-quests/metadata/data/salut.txt] +[[id:sampleG, character:turtle, lang:it, lang_group:romanic], /workspaces/training/side-quests/metadata/data/ciao.txt] +``` ### Takeaway In this section, you've learned: -- **Creating custom keys** +- **Merging on meta maps**: You used `join` to combine two channels based on their meta maps +- **Creating custom keys**: You created two new keys in your meta map. 
One based on a computed value from a process, and one based on a condition you set in the `map` operator.
+
+Both of these allow you to associate new and existing metadata with files as you progress through your pipeline.

 ---

-## 3. Group data with matching value in the meta map
+## 3. Filter data with certain values in the meta map


 Repeat a grouping example, using this new value as grouping or branching decider (Objective: use the meta map to decide on workflow paths)

From 3296c1caf9929101701f3f923131a7ae50500e75 Mon Sep 17 00:00:00 2001
From: FriederikeHanssen
Date: Wed, 21 May 2025 15:17:56 +0200
Subject: [PATCH 05/15] add intro and second half of training

---
 docs/side_quests/metadata.md              | 335 ++++++++++++++++++++--
 side-quests/metadata/data/samplesheet.csv |   4 +-
 side-quests/metadata/main.nf              |  70 -----
 3 files changed, 321 insertions(+), 88 deletions(-)

diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md
index 477666bc07..f4aa65f290 100644
--- a/docs/side_quests/metadata.md
+++ b/docs/side_quests/metadata.md
@@ -1,10 +1,23 @@
 # Metadata

-introduction:
+Metadata is crucial information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions, and processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.

-what is meta data
-why is it important
-sample specific, not something that is the same for all samples
+Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata).
+
+In Nextflow pipelines, metadata helps us:
+
+* Track sample-specific information throughout the workflow
+* Configure processes based on sample characteristics
+* Group related samples for joint analysis
+
+In this side quest, we'll explore how to handle metadata effectively in Nextflow workflows. Starting with a simple samplesheet containing basic sample information, you'll learn how to:
+
+* Read and parse sample metadata from CSV files
+* Create and manipulate metadata maps
+* Add new metadata fields during workflow execution
+* Use metadata to customize process behavior
+
+These skills will help you build more robust and flexible pipelines that can handle complex sample relationships and processing requirements.
+
+Let's dive in and see how metadata can make our workflows smarter and more maintainable!

 ## 0. Warmup

@@ -583,40 +596,330 @@ Both of these allow you to associated new and existing meta data with files as y

 ---

-## 3. Filter data with certain values in the meta map
+## 3. Filter data based on meta map values

+We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator.html#filter) to filter the data based on a condition. Let's say we only want to process romanic language samples further. We can do this by filtering the data based on the `lang_group` field.
Let's create a new channel that only contains romanic languages and `view` it:
+
+=== "After"
+
+    ``` groovy title="main.nf" linenums="20" hl_lines="38-42"
+    workflow {
+
+        ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+            .splitCsv(header: true)
+            .map { row ->
+                [ [id:row.id, character:row.character], row.recording ]
+            }
+
+        ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet)
+
+        ch_languages = ch_samplesheet.join(ch_prediction)
+            .map { meta, file, lang ->
+                [ meta + [lang:lang], file ]
+            }
+            .map{ meta, file ->
+                def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'
+                [ meta + [lang_group:lang_group], file ]
+            }
+
+        romanic_languages = ch_languages.filter { meta, file ->
+            meta.lang_group == 'romanic'
+        }
+        .view()
+    }
+
+    ```
+
+=== "Before"
+
+    ```groovy title="main.nf" linenums="20"
+    workflow {
+
+        ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+            .splitCsv(header: true)
+            .map { row ->
+                [ [id:row.id, character:row.character], row.recording ]
+            }
+
+        ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet)
+
+        ch_languages = ch_samplesheet.join(ch_prediction)
+            .map { meta, file, lang ->
+                [ meta + [lang:lang], file ]
+            }
+            .map{ meta, file ->
+                def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'
+                [ meta + [lang_group:lang_group], file ]
+            }
+            .view()

+    }
+    ```

-## 4. Publishing location based on meta map value
+Let's rerun it
+
+```bash title="View romanic samples"
+nextflow run main.nf -resume
+```
+
+```console title="View romanic samples"
+ N E X T F L O W   ~  version 24.10.4
+
+Launching `main.nf` [drunk_brattain] DSL2 - revision: 453fdd4e91
+
+[da/652cc6] IDENTIFY_LANGUAGE (7) [100%] 7 of 7, cached: 7 ✔
+[[id:sampleA, character:squirrel, lang:fr, lang_group:romanic], /workspaces/training/side-quests/metadata/data/bonjour.txt]
+[[id:sampleE, character:stegosaurus, lang:es, lang_group:romanic], /workspaces/training/side-quests/metadata/data/hola.txt]
+[[id:sampleF, character:moose, lang:fr, lang_group:romanic], /workspaces/training/side-quests/metadata/data/salut.txt]
+[[id:sampleG, character:turtle, lang:it, lang_group:romanic], /workspaces/training/side-quests/metadata/data/ciao.txt]
+```
+
+We have successfully filtered the data to only include romanic samples. Let's recap how this works. The `filter` operator takes a closure that is applied to each element in the channel. If the closure returns `true`, the element is included in the output channel; if it returns `false`, the element is excluded.

-tweak the publishing directory based on a field in the meta map (Objective: use the meta map in the module)
+In this case, we want to keep only the samples where `meta.lang_group == 'romanic'`. In the closure, we know that our channel elements all have the shape `[meta, file]`, so we can access the individual keys of the meta map. We then check whether `meta.lang_group` equals `'romanic'`: if it does, the sample is kept in the output channel; if not, it is excluded.
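
If you want to keep the germanic samples on their own path instead of discarding them, the [`branch` operator](https://www.nextflow.io/docs/latest/operator.html#branch) is a useful alternative to `filter`. Here is a minimal sketch, assuming the `ch_languages` channel built above; the `ch_grouped` name is just illustrative and not part of this training's solution:

```groovy
// Sketch: instead of discarding the germanic samples, route each
// [meta, file] tuple onto a named output based on its lang_group key.
ch_languages
    .branch { meta, file ->
        romanic: meta.lang_group == 'romanic'
        germanic: meta.lang_group == 'germanic'
    }
    .set { ch_grouped }

// ch_grouped.romanic and ch_grouped.germanic are now separate
// channels that can feed different downstream processes.
```

Each element is emitted on the first branch whose condition it matches, so unlike `filter`, no sample is dropped.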
+ +```groovy title="main.nf" linenums="4" +.filter { meta,file -> meta.lang_group == 'romanic' } +``` ### Takeaway In this section, you've learned: -- **Tweaking directives using meta values** +- How to filter data with `filter` + +we now have only the romanic language samples left and can process those further. Next we want to make characters say the phrases. --- -## 5. Tweak tool arguments based on meta map value +## 4. Customize a process with meta map + +Let's let the characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. + +We will re-use a process from there. + +Copy in the process before your workflow block: + +=== "After" + + ``` groovy title="main.nf" linenums="20" + /* + * Generate ASCII art with cowpy + */ + process COWPY { + + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' + + input: + tuple val(meta), path(input_file) + + output: + tuple val(meta), path("cowpy-${input_file}") + + script: + """ + cat $input_file | cowpy -c "stegosaurus" > cowpy-${input_file} + """ + + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="20" + workflow{ + ... 
+ } + ``` + +### 4.1 Add a custom publishing location -tweak the publishing directory based on a field in the meta map (Objective: use the meta map in the module) +Let's run our romanic languages through `COWPY` and remove our `view` statement: + +=== "After" + + ``` groovy title="main.nf" linenums="40" hl_lines="61-63" + workflow { + + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } + + romanic_languages = ch_languages.filter { meta, file -> + meta.lang_group == 'romanic' + } + + COWPY(romanic_languages) + + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="20" + workflow { + + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } + + romanic_languages = ch_languages.filter { meta, file -> + meta.lang_group == 'romanic' + }.view() + } + ``` + +We are still missing a publishing location. Given we have been trying to figure out what languages our samples were in, let's group the samples by language in the output directory. Earlier, we added the predicted language to the `meta` map. 
We can access this `key` in the process and use it in the `publishDir` directive: + +=== "After" + + ``` groovy title="main.nf" linenums="23" hl_lines="25" + /* + * Generate ASCII art with cowpy + */ + process COWPY { + + publishDir "results/${meta.lang}", mode: 'copy' + + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' + + ``` + + +=== "Before" + + ```groovy title="main.nf" linenums="23" + process COWPY { + + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' + + ``` + +Let's run this: + +```bash title="Use cowpy" +nextflow run main.nf -resume +``` + +You should now see a new folder called `results`: + +```console title="Results folder" +results/ +├── es +│ └── cowpy-hola.txt +├── fr +│ ├── cowpy-bonjour.txt +│ └── cowpy-salut.txt +└── it + └── cowpy-ciao.txt +``` + +Success! All our phrases are correctly sorted and we now see which of them correspond to which language. + +Let's take a look at `cowpy-salut.txt`: + +```console title="cowpy-salut.txt" + ____________________ +/ Salut, ça va? \ +\ à plus dans le bus / + -------------------- +\ . . + \ / `. .' " + \ .---. < > < > .---. + \ | \ \ - ~ ~ - / / | + _____ ..-~ ~-..-~ + | | \~~~\.' `./~~~/ + --------- \__/ \__/ + .' O \ / / \ " + (_____, `._.' | } \/~~~/ + `----. / } | / \__/ + `-. | / | / `. ,~~| + ~-.__| /_ - ~ ^| /- _ `..-' + | / | / ~-. `-. _ _ _ + |_____| |_____| ~ - . _ _ _ _ _> +``` + +Look through the other files. All phrases should be spoken by the fashionable stegosaurus. + +### 4.2 Customize the character + +In our samplesheet, we have another column: `character`. To tailor the tool parameters per sample, we can also access information from the `meta` map in the script section. This is really useful in cases were a tool should have different parameters for each sample. 
+ +Let's customize the characters by changing the `cowpy` command: + +=== "After" + + ``` groovy title="main.nf" linenums="35" hl_lines="37" + script: + """ + cat $input_file | cowpy -c ${meta.character} > cowpy-${input_file} + """ + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="23" + script: + """ + cat $input_file | cowpy -c "stegosaurus" > cowpy-${input_file} + """ + ``` + +Let's run this: + +```bash title="Use cowpy" +nextflow run main.nf -resume +``` + +and take another look at our french phrase: + +```console title="fr/cowpy-salut.txt" + ____________________ +/ Salut, ça va? \ +\ à plus dans le bus / + -------------------- + \ + \ \_\_ _/_/ + \ \__/ + (oo)\_______ + (__)\ )\/\ + ||----w | + || || +``` + +This is a subtle difference to other parameters that we have set in the pipelines in previous trainings. A parameter that is passed as part of the `params` object is generally applied to all samples. When a more surgical approache is necessary, using the sample specific information is a good alternative. 
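
The two approaches can also be combined. Below is a small sketch of a script block that prefers the sample-specific value but falls back to a pipeline-wide default; `params.default_character` is a hypothetical parameter, not part of this training's solution:

```groovy
    script:
    // Prefer the per-sample character from the meta map; fall back to a
    // hypothetical pipeline-wide default when the key is missing.
    def character = meta.character ?: params.default_character
    """
    cat $input_file | cowpy -c ${character} > cowpy-${input_file}
    """
```

The Groovy Elvis operator `?:` returns the left-hand side unless it is null or empty, so samples without a `character` entry would still run.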
### Takeaway In this section, you've learned: +- **Tweaking directives using meta values** - **Tweaking script section based on meta values** --- diff --git a/side-quests/metadata/data/samplesheet.csv b/side-quests/metadata/data/samplesheet.csv index 8c3db104f6..358a328d92 100644 --- a/side-quests/metadata/data/samplesheet.csv +++ b/side-quests/metadata/data/samplesheet.csv @@ -1,8 +1,8 @@ -id,animal,recording +id,character,recording sampleA,squirrel,/workspaces/training/side-quests/metadata/data/bonjour.txt sampleB,tux,/workspaces/training/side-quests/metadata/data/guten_tag.txt sampleC,sheep,/workspaces/training/side-quests/metadata/data/hallo.txt sampleD,turkey,/workspaces/training/side-quests/metadata/data/hello.txt sampleE,stegosaurus,/workspaces/training/side-quests/metadata/data/hola.txt sampleF,moose,/workspaces/training/side-quests/metadata/data/salut.txt -sampleG,turtle,/workspaces/training/side-quests/metadata/data/ciao.txt \ No newline at end of file +sampleG,turtle,/workspaces/training/side-quests/metadata/data/ciao.txt diff --git a/side-quests/metadata/main.nf b/side-quests/metadata/main.nf index 408bace8c3..581b081ea3 100644 --- a/side-quests/metadata/main.nf +++ b/side-quests/metadata/main.nf @@ -1,73 +1,3 @@ -/* - * Use echo to print 'Hello World!' 
to a file - */ -process IDENTIFY_LANGUAGE { - - tag "${meta.id}" - - container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff' - - input: - tuple val(meta), path(greeting) - - output: - tuple val(meta), stdout - - script: - """ - langid < ${greeting} -l en,de,fr,es,it | sed -E "s/.*\\('([a-z]+)'.*/\\1/" | tr -d '\\n' - """ -} - -/* - * Generate ASCII art with cowpy -*/ -process COWPY { - tag "${meta.id}" - - publishDir "results/${meta.lang}", mode: 'copy' - - container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' - - input: - tuple val(meta), path(input_file) - - output: - tuple val(meta), path("cowpy-${input_file}") - - script: - """ - cat $input_file | cowpy -c ${meta.animal} > cowpy-${input_file} - """ - -} - workflow { - files = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, animal:row.animal], row.recording ] - } - - ch_prediction = IDENTIFY_LANGUAGE(files) - - ch_language_groups = files.join(ch_prediction) - //Uses meta map to join - .map { meta, file, lang -> - [ meta + [lang:lang], file ] - } - //generates a new key in the meta map - .map{ meta, file -> - def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 
'germanic' : 'romanic' - [ meta + [lang_group:lang_group], file ] - } - - //uses meta map field to filter the channel to only include languages in the romanic group - romanic_languages = ch_language_groups.filter { meta, file -> - meta.lang_group == 'romanic' - } - - COWPY(romanic_languages) - } From d9204a011edb359a46eeb3b98cdb6b8e93dd67ae Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Wed, 21 May 2025 15:26:11 +0200 Subject: [PATCH 06/15] add to side bar --- docs/side_quests/index.md | 3 + docs/side_quests/metadata.md | 2 +- docs/side_quests/orientation.md | 2 + mkdocs.yml | 6 +- side-quests/solutions/metadata/main.nf | 68 +++++++++++++++++++ .../solutions/metadata/nextflow.config | 1 + 6 files changed, 79 insertions(+), 3 deletions(-) create mode 100644 side-quests/solutions/metadata/main.nf create mode 100644 side-quests/solutions/metadata/nextflow.config diff --git a/docs/side_quests/index.md b/docs/side_quests/index.md index 759fb30a92..85ae2f3aa4 100644 --- a/docs/side_quests/index.md +++ b/docs/side_quests/index.md @@ -29,7 +29,10 @@ Otherwise, select a side quest from the menu below. ## Menu of Side Quests - [Introduction to nf-core](./nf-core.md) +- [Metadata](./metadata.md) +- [Splitting and Grouping](./splitting_and_grouping.md) - [Testing with nf-test](./nf-test.md) - [Workflows of workflows](./workflows_of_workflows.md) + Let us know what other domains and use cases you'd like to see covered here by posting in the [Training section](https://community.seqera.io/c/training/) of the community forum. 
diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md index f4aa65f290..744e630399 100644 --- a/docs/side_quests/metadata.md +++ b/docs/side_quests/metadata.md @@ -15,7 +15,7 @@ In this side quest, we'll explore how to handle metadata effectively in Nextflow * Add new metadata fields during workflow execution * Use metadata to customize process behavior -These skills will help you build more robust and flexible pipelines that can handle complex sample relationships and processing requirements. +These skills will help you build more robust and flexible pipelines that can handle complex sample relationships and processing requirements. Let's dive in and see how metadata can make our workflows smarter and more maintainable! diff --git a/docs/side_quests/orientation.md b/docs/side_quests/orientation.md index b404f7a56c..aa132dd06d 100644 --- a/docs/side_quests/orientation.md +++ b/docs/side_quests/orientation.md @@ -24,9 +24,11 @@ If you run this inside `side-quests`, you should see the following output: ```console title="Directory contents" . 
+├── metadata ├── nf-core ├── nf-test ├── solutions +├── splitting_and_grouping └── workflows_of_workflows ``` diff --git a/mkdocs.yml b/mkdocs.yml index cecb40a284..b2c5515a16 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -39,10 +39,12 @@ nav: - Side Quests: - side_quests/index.md - side_quests/orientation.md - - side_quests/workflows_of_workflows.md - - side_quests/splitting_and_grouping.md + - side_quests/metadata.md - side_quests/nf-test.md - side_quests/nf-core.md + - side_quests/splitting_and_grouping.md + - side_quests/workflows_of_workflows.md + - Fundamentals Training: - basic_training/index.md - basic_training/orientation.md diff --git a/side-quests/solutions/metadata/main.nf b/side-quests/solutions/metadata/main.nf new file mode 100644 index 0000000000..124e874ead --- /dev/null +++ b/side-quests/solutions/metadata/main.nf @@ -0,0 +1,68 @@ +/* + * Use langid to predict the language of each input file + */ +process IDENTIFY_LANGUAGE { + + container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff' + + input: + tuple val(meta), path(greeting) + + output: + tuple val(meta), stdout + + script: + """ + langid < ${greeting} -l en,de,fr,es,it | sed -E "s/.*\\('([a-z]+)'.*/\\1/" | tr -d '\\n' + """ + +} + +/* + * Generate ASCII art with cowpy +*/ +process COWPY { + + publishDir "results/${meta.lang}", mode: 'copy' + + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' + + input: + tuple val(meta), path(input_file) + + output: + tuple val(meta), path("cowpy-${input_file}") + + script: + """ + cat $input_file | cowpy -c ${meta.character} > cowpy-${input_file} + """ + +} + +workflow { + + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .map{ meta, 
file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } + + romanic_languages = ch_languages.filter { meta, file -> + meta.lang_group == 'romanic' + } + + COWPY(romanic_languages) + +} diff --git a/side-quests/solutions/metadata/nextflow.config b/side-quests/solutions/metadata/nextflow.config new file mode 100644 index 0000000000..e685c23b0d --- /dev/null +++ b/side-quests/solutions/metadata/nextflow.config @@ -0,0 +1 @@ +docker.enabled=true \ No newline at end of file From 36507b09fb6e866be7c3f93487aeddf606c4d1d0 Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Wed, 21 May 2025 15:29:37 +0200 Subject: [PATCH 07/15] linting fixes --- docs/side_quests/metadata.md | 358 +++++++++--------- side-quests/metadata/data/bonjour.txt | 2 +- side-quests/metadata/data/ciao.txt | 2 +- side-quests/metadata/data/guten_tag.txt | 2 +- side-quests/metadata/data/hola.txt | 2 +- side-quests/metadata/data/salut.txt | 2 +- side-quests/metadata/nextflow.config | 2 +- .../solutions/metadata/nextflow.config | 2 +- 8 files changed, 185 insertions(+), 187 deletions(-) diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md index 744e630399..c32fc35556 100644 --- a/docs/side_quests/metadata.md +++ b/docs/side_quests/metadata.md @@ -4,18 +4,18 @@ Metadata is crucial information that describes and gives context to your data. I Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). 
In Nextflow pipelines, metadata helps us: -* Track sample-specific information throughout the workflow -* Configure processes based on sample characteristics -* Group related samples for joint analysis +- Track sample-specific information throughout the workflow +- Configure processes based on sample characteristics +- Group related samples for joint analysis In this side quest, we'll explore how to handle metadata effectively in Nextflow workflows. Starting with a simple samplesheet containing basic sample information, you'll learn how to: -* Read and parse sample metadata from CSV files -* Create and manipulate metadata maps -* Add new metadata fields during workflow execution -* Use metadata to customize process behavior +- Read and parse sample metadata from CSV files +- Create and manipulate metadata maps +- Add new metadata fields during workflow execution +- Use metadata to customize process behavior -These skills will help you build more robust and flexible pipelines that can handle complex sample relationships and processing requirements. +These skills will help you build more robust and flexible pipelines that can handle complex sample relationships and processing requirements. Let's dive in and see how metadata can make our workflows smarter and more maintainable! @@ -56,11 +56,11 @@ You'll find a `data` directory containing a samplesheet and a main workflow file The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. 
In particular, the samplesheet has 3 columns: - - `id`: self-explanatory, an ID given to the sample - - `character`: a character name, that we will use later to draw different creatures - - `data`: paths to `.txt` files that contain phrases in different languages +- `id`: self-explanatory, an ID given to the sample +- `character`: a character name, that we will use later to draw different creatures +- `data`: paths to `.txt` files that contain phrases in different languages - ```console title="samplesheet.csv" +```console title="samplesheet.csv" id,character,recording sampleA,squirrel,/workspaces/training/side-quests/metadata/data/bonjour.txt sampleB,tux,/workspaces/training/side-quests/metadata/data/guten_tag.txt @@ -132,9 +132,9 @@ We can see that each row from the CSV file has been converted into a map with ke Each map contains: - - `id`: an ID given to the sample - - `character`: a character name, that we will use later to draw different creatures - - `data`: paths to `.txt` files that contain phrases in different languages +- `id`: an ID given to the sample +- `character`: a character name, that we will use later to draw different creatures +- `data`: paths to `.txt` files that contain phrases in different languages This format makes it easy to access specific fields from each sample. For example, we could access the sample ID with `id` or the txt file path with `data`. The output above shows each row from the CSV file converted into a map with keys matching the header row. Now that we've successfully read in the samplesheet and have access to the data in each row, we can begin implementing our pipeline logic. @@ -198,27 +198,27 @@ Now we want to process our samples. 
These samples are language samples, but we d === "After" - ``` groovy title="main.nf" linenums="1" hl_lines="1-19" - /* - * Use langid to predict the language of each input file - */ - process IDENTIFY_LANGUAGE { +```groovy title="main.nf" linenums="1" hl_lines="1-19" +/* + * Use langid to predict the language of each input file + */ +process IDENTIFY_LANGUAGE { - container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff' + container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff' - input: - tuple val(meta), path(greeting) + input: + tuple val(meta), path(greeting) - output: - tuple val(meta), stdout + output: + tuple val(meta), stdout - script: - """ - langid < ${greeting} -l en,de,fr,es,it | sed -E "s/.*\\('([a-z]+)'.*/\\1/" | tr -d '\\n' - """ - } + script: + """ + langid < ${greeting} -l en,de,fr,es,it | sed -E "s/.*\\('([a-z]+)'.*/\\1/" | tr -d '\\n' + """ +} - ``` +``` === "Before" @@ -239,21 +239,21 @@ Let's include the process, run, and view it: === "After" - ``` groovy title="main.nf" linenums="20" hl_lines="27-29" - workflow { +```groovy title="main.nf" linenums="20" hl_lines="27-29" +workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_prediction.view() + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction.view() - } + } - ``` +``` === "Before" @@ -308,22 +308,22 @@ If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) === "After" - ``` groovy title="main.nf" linenums="20" hl_lines="27" - workflow { +```groovy title="main.nf" linenums="20" hl_lines="27" +workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: 
true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } - .view() + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + .view() - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_prediction.view() + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction.view() - } + } - ``` +``` === "Before" @@ -373,22 +373,22 @@ We can see that the meta map is the first element in each map and the map is the === "After" - ``` groovy title="main.nf" linenums="20" hl_lines="27" - workflow { +```groovy title="main.nf" linenums="20" hl_lines="27" +workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_languages = ch_samplesheet.join(ch_prediction) - .view() - } + ch_languages = ch_samplesheet.join(ch_prediction) + .view() +} - ``` +``` === "Before" @@ -432,28 +432,27 @@ It is becoming a bit hard to see, but if you look all the way on the right side, Given that this is more data about the files, let's add it to our meta map. 
We can use the [`map` operator](https://www.nextflow.io/docs/latest/operator.html#map) to create a new key `lang` and set the value to prediction: - === "After" - ``` groovy title="main.nf" linenums="20" hl_lines="31-33" - workflow { +```groovy title="main.nf" linenums="20" hl_lines="31-33" +workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_languages = ch_samplesheet.join(ch_prediction) - .map { meta, file, lang -> - [ meta + [lang:lang], file ] - } - .view() + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .view() - } - ``` +} +``` === "Before" @@ -517,30 +516,30 @@ if (){ === "After" - ``` groovy title="main.nf" linenums="20" hl_lines="34-37" - workflow { +```groovy title="main.nf" linenums="20" hl_lines="34-37" +workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_languages = ch_samplesheet.join(ch_prediction) - .map { meta, file, lang -> - [ meta + [lang:lang], file ] - } - .map{ meta, file -> - def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 
'germanic' : 'romanic' - [ meta + [lang_group:lang_group], file ] - } - .view() + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } + .view() - } +} - ``` +``` === "Before" @@ -602,33 +601,33 @@ We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator. === "After" - ``` groovy title="main.nf" linenums="20" hl_lines="38-42" - workflow { +```groovy title="main.nf" linenums="20" hl_lines="38-42" +workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_languages = ch_samplesheet.join(ch_prediction) - .map { meta, file, lang -> - [ meta + [lang:lang], file ] - } - .map{ meta, file -> - def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' - [ meta + [lang_group:lang_group], file ] - } + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 
'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } - romanic_languages = ch_language_groups.filter { meta, file -> - meta.lang_group == 'romanic' - } - .view() - } +romanic_languages = ch_language_groups.filter { meta, file -> + meta.lang_group == 'romanic' + } + .view() +} - ``` +``` === "Before" @@ -676,7 +675,7 @@ Launching `main.nf` [drunk_brattain] DSL2 - revision: 453fdd4e91 We have successfully filtered the data to only include romanic samples. Let's recap how this works. The `filter` operator takes a closure that is applied to each element in the channel. If the closure returns `true`, the element is included in the output channel. If the closure returns `false`, the element is excluded from the output channel. -In this case, we want to keep only the samples where `meta.lang_group == 'romanic'`. In the closure, we first know that our channel elements are all of shape `[meta, file]` and we can then access the individual keys of the meta map. We then check if `meta.lang_group` is equal to `'romanic'`. If it is, the sample is included in the output channel. If it is not, the sample is excluded from the output channel. +In this case, we want to keep only the samples where `meta.lang_group == 'romanic'`. In the closure, we first know that our channel elements are all of shape `[meta, file]` and we can then access the individual keys of the meta map. We then check if `meta.lang_group` is equal to `'romanic'`. If it is, the sample is included in the output channel. If it is not, the sample is excluded from the output channel. 
```groovy title="main.nf" linenums="4" .filter { meta,file -> meta.lang_group == 'romanic' } @@ -702,27 +701,27 @@ Copy in the process before your workflow block: === "After" - ``` groovy title="main.nf" linenums="20" - /* - * Generate ASCII art with cowpy - */ - process COWPY { +```groovy title="main.nf" linenums="20" +/* + * Generate ASCII art with cowpy +*/ +process COWPY { - container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' - input: - tuple val(meta), path(input_file) + input: + tuple val(meta), path(input_file) - output: - tuple val(meta), path("cowpy-${input_file}") + output: + tuple val(meta), path("cowpy-${input_file}") - script: - """ - cat $input_file | cowpy -c "stegosaurus" > cowpy-${input_file} - """ + script: + """ + cat $input_file | cowpy -c "stegosaurus" > cowpy-${input_file} + """ - } - ``` +} +``` === "Before" @@ -738,34 +737,34 @@ Let's run our romanic languages through `COWPY` and remove our `view` statement: === "After" - ``` groovy title="main.nf" linenums="40" hl_lines="61-63" - workflow { - - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } - - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - - ch_languages = ch_samplesheet.join(ch_prediction) - .map { meta, file, lang -> - [ meta + [lang:lang], file ] - } - .map{ meta, file -> - def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 
'germanic' : 'romanic' - [ meta + [lang_group:lang_group], file ] - } +```groovy title="main.nf" linenums="40" hl_lines="61-63" +workflow { - romanic_languages = ch_languages.filter { meta, file -> - meta.lang_group == 'romanic' - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } + + romanic_languages = ch_languages.filter { meta, file -> + meta.lang_group == 'romanic' + } - COWPY(romanic_languages) + COWPY(romanic_languages) - } - ``` +} +``` === "Before" @@ -799,18 +798,17 @@ We are still missing a publishing location. Given we have been trying to figure === "After" - ``` groovy title="main.nf" linenums="23" hl_lines="25" - /* - * Generate ASCII art with cowpy - */ - process COWPY { +```groovy title="main.nf" linenums="23" hl_lines="25" +/* + * Generate ASCII art with cowpy +*/ +process COWPY { - publishDir "results/${meta.lang}", mode: 'copy' + publishDir "results/${meta.lang}", mode: 'copy' - container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' - - ``` + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' +``` === "Before" @@ -875,12 +873,12 @@ Let's customize the characters by changing the `cowpy` command: === "After" - ``` groovy title="main.nf" linenums="35" hl_lines="37" - script: - """ - cat $input_file | cowpy -c ${meta.character} > cowpy-${input_file} - """ - ``` +```groovy title="main.nf" linenums="35" hl_lines="37" +script: +""" +cat $input_file | cowpy -c ${meta.character} > cowpy-${input_file} +""" +``` === "Before" diff --git 
a/side-quests/metadata/data/bonjour.txt b/side-quests/metadata/data/bonjour.txt index e43b286664..fb961df833 100644 --- a/side-quests/metadata/data/bonjour.txt +++ b/side-quests/metadata/data/bonjour.txt @@ -1,2 +1,2 @@ Bonjour -Salut, à demain \ No newline at end of file +Salut, à demain diff --git a/side-quests/metadata/data/ciao.txt b/side-quests/metadata/data/ciao.txt index 180527c14f..375f1a38e3 100644 --- a/side-quests/metadata/data/ciao.txt +++ b/side-quests/metadata/data/ciao.txt @@ -1,2 +1,2 @@ Ciao -Ci vediamo domani \ No newline at end of file +Ci vediamo domani diff --git a/side-quests/metadata/data/guten_tag.txt b/side-quests/metadata/data/guten_tag.txt index 42cb3eb17a..43eb608710 100644 --- a/side-quests/metadata/data/guten_tag.txt +++ b/side-quests/metadata/data/guten_tag.txt @@ -1,2 +1,2 @@ Guten Tag, wie geht es dir? -Auf Wiedersehen, bis morgen \ No newline at end of file +Auf Wiedersehen, bis morgen diff --git a/side-quests/metadata/data/hola.txt b/side-quests/metadata/data/hola.txt index 1b1b35ca2d..b3be4059af 100644 --- a/side-quests/metadata/data/hola.txt +++ b/side-quests/metadata/data/hola.txt @@ -1,2 +1,2 @@ Hola -Adiós, hasta mañana \ No newline at end of file +Adiós, hasta mañana diff --git a/side-quests/metadata/data/salut.txt b/side-quests/metadata/data/salut.txt index 2b5a68b1d5..92d8678a4d 100644 --- a/side-quests/metadata/data/salut.txt +++ b/side-quests/metadata/data/salut.txt @@ -1,2 +1,2 @@ Salut, ça va? 
-à plus dans le bus \ No newline at end of file +à plus dans le bus diff --git a/side-quests/metadata/nextflow.config b/side-quests/metadata/nextflow.config index e685c23b0d..684e04e89e 100644 --- a/side-quests/metadata/nextflow.config +++ b/side-quests/metadata/nextflow.config @@ -1 +1 @@ -docker.enabled=true \ No newline at end of file +docker.enabled=true diff --git a/side-quests/solutions/metadata/nextflow.config b/side-quests/solutions/metadata/nextflow.config index e685c23b0d..684e04e89e 100644 --- a/side-quests/solutions/metadata/nextflow.config +++ b/side-quests/solutions/metadata/nextflow.config @@ -1 +1 @@ -docker.enabled=true \ No newline at end of file +docker.enabled=true From 23d5e1facee86ab98050479532831228de0109b4 Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Wed, 21 May 2025 15:49:39 +0200 Subject: [PATCH 08/15] linting fixes --- docs/side_quests/index.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/side_quests/index.md b/docs/side_quests/index.md index 85ae2f3aa4..c208e10f0c 100644 --- a/docs/side_quests/index.md +++ b/docs/side_quests/index.md @@ -34,5 +34,4 @@ Otherwise, select a side quest from the menu below. - [Testing with nf-test](./nf-test.md) - [Workflows of workflows](./workflows_of_workflows.md) - Let us know what other domains and use cases you'd like to see covered here by posting in the [Training section](https://community.seqera.io/c/training/) of the community forum. 
From 3299b51870bc36c69ad26f52d2487193c612939a Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Wed, 21 May 2025 16:22:52 +0200 Subject: [PATCH 09/15] add sumary and key concepts --- docs/side_quests/metadata.md | 353 ++++++++++++++++++++++------------- 1 file changed, 221 insertions(+), 132 deletions(-) diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md index c32fc35556..113fb0b254 100644 --- a/docs/side_quests/metadata.md +++ b/docs/side_quests/metadata.md @@ -1,6 +1,6 @@ # Metadata -Metadata is crucial information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions, and processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics. +Metadata is crucial information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics. Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata helps us: @@ -77,7 +77,7 @@ sampleG,turtle,/workspaces/training/side-quests/metadata/data/ciao.txt Let's start by reading in the samplesheet with `splitCsv`. In the main workflow file, you'll see that we've already started the workflow. -```groovy title="main.nf" linenums="1"s +```groovy title="main.nf" linenums="1" workflow { ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") @@ -198,27 +198,27 @@ Now we want to process our samples. 
These samples are language samples, but we d === "After" -```groovy title="main.nf" linenums="1" hl_lines="1-19" -/* - * Use langid to predict the language of each input file - */ -process IDENTIFY_LANGUAGE { + ```groovy title="main.nf" linenums="1" + /* + * Use langid to predict the language of each input file + */ + process IDENTIFY_LANGUAGE { - container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff' + container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff' - input: - tuple val(meta), path(greeting) + input: + tuple val(meta), path(greeting) - output: - tuple val(meta), stdout + output: + tuple val(meta), stdout - script: - """ - langid < ${greeting} -l en,de,fr,es,it | sed -E "s/.*\\('([a-z]+)'.*/\\1/" | tr -d '\\n' - """ -} + script: + """ + langid < ${greeting} -l en,de,fr,es,it | sed -E "s/.*\\('([a-z]+)'.*/\\1/" | tr -d '\\n' + """ + } -``` + ``` === "Before" @@ -239,21 +239,21 @@ Let's include the process, run, and view it: === "After" -```groovy title="main.nf" linenums="20" hl_lines="27-29" -workflow { + ```groovy title="main.nf" linenums="20" hl_lines="27-29" + workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_prediction.view() + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction.view() - } + } -``` + ``` === "Before" @@ -434,25 +434,25 @@ Given that this is more data about the files, let's add it to our meta map. 
We c === "After" -```groovy title="main.nf" linenums="20" hl_lines="31-33" -workflow { + ```groovy title="main.nf" linenums="20" hl_lines="31-33" + workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_languages = ch_samplesheet.join(ch_prediction) - .map { meta, file, lang -> - [ meta + [lang:lang], file ] - } - .view() + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .view() -} -``` + } + ``` === "Before" @@ -516,30 +516,30 @@ if (){ === "After" -```groovy title="main.nf" linenums="20" hl_lines="34-37" -workflow { + ```groovy title="main.nf" linenums="20" hl_lines="34-37" + workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_languages = ch_samplesheet.join(ch_prediction) - .map { meta, file, lang -> - [ meta + [lang:lang], file ] - } - .map{ meta, file -> - def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' - [ meta + [lang_group:lang_group], file ] - } - .view() + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 
'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } + .view() -} + } -``` + ``` === "Before" @@ -601,33 +601,33 @@ We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator. === "After" -```groovy title="main.nf" linenums="20" hl_lines="38-42" -workflow { + ```groovy title="main.nf" linenums="20" hl_lines="38-42" + workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_languages = ch_samplesheet.join(ch_prediction) - .map { meta, file, lang -> - [ meta + [lang:lang], file ] - } - .map{ meta, file -> - def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' - [ meta + [lang_group:lang_group], file ] - } - -romanic_languages = ch_language_groups.filter { meta, file -> - meta.lang_group == 'romanic' - } - .view() -} + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 
'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } -``` + romanic_languages = ch_language_groups.filter { meta, file -> + meta.lang_group == 'romanic' + } + .view() + } + + ``` === "Before" @@ -701,27 +701,27 @@ Copy in the process before your workflow block: === "After" -```groovy title="main.nf" linenums="20" -/* - * Generate ASCII art with cowpy -*/ -process COWPY { + ```groovy title="main.nf" linenums="20" + /* + * Generate ASCII art with cowpy + */ + process COWPY { - container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' - input: - tuple val(meta), path(input_file) + input: + tuple val(meta), path(input_file) - output: - tuple val(meta), path("cowpy-${input_file}") + output: + tuple val(meta), path("cowpy-${input_file}") - script: - """ - cat $input_file | cowpy -c "stegosaurus" > cowpy-${input_file} - """ + script: + """ + cat $input_file | cowpy -c "stegosaurus" > cowpy-${input_file} + """ -} -``` + } + ``` === "Before" @@ -737,34 +737,34 @@ Let's run our romanic languages through `COWPY` and remove our `view` statement: === "After" -```groovy title="main.nf" linenums="40" hl_lines="61-63" -workflow { + ```groovy title="main.nf" linenums="40" hl_lines="61-63" + workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_languages = ch_samplesheet.join(ch_prediction) - .map { meta, file, lang -> - [ meta + [lang:lang], file ] - } - .map{ meta, file -> - def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 
'germanic' : 'romanic' - [ meta + [lang_group:lang_group], file ] - } - - romanic_languages = ch_languages.filter { meta, file -> - meta.lang_group == 'romanic' + ch_languages = ch_samplesheet.join(ch_prediction) + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] } - COWPY(romanic_languages) + romanic_languages = ch_languages.filter { meta, file -> + meta.lang_group == 'romanic' + } -} -``` + COWPY(romanic_languages) + + } + ``` === "Before" @@ -798,17 +798,17 @@ We are still missing a publishing location. Given we have been trying to figure === "After" -```groovy title="main.nf" linenums="23" hl_lines="25" -/* - * Generate ASCII art with cowpy -*/ -process COWPY { + ```groovy title="main.nf" linenums="23" hl_lines="25" + /* + * Generate ASCII art with cowpy + */ + process COWPY { - publishDir "results/${meta.lang}", mode: 'copy' + publishDir "results/${meta.lang}", mode: 'copy' - container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' -``` + ``` === "Before" @@ -923,3 +923,92 @@ In this section, you've learned: --- ## Summary + +In this side quest, you've explored how to effectively work with metadata in Nextflow workflows. Here's what you've learned: + +1. **Reading and Structuring Metadata**: + + - Using splitCsv to read samplesheets + - Creating structured meta maps from CSV data + - Keeping metadata associated with files through tuples + +2. **Expanding Metadata During Workflow**: + + - Adding process outputs (language detection) to meta maps + - Creating derived metadata (language groups) using conditional logic + - Using join to merge new metadata with existing records + +3. **Filtering Based on Metadata**: + + - Creating subsets of data based on metadata properties + +4. 
**Customizing Process Behavior**: + + - Using metadata to configure output directories + - Adjusting process parameters based on sample properties + - Creating sample-specific outputs + +This approach offers several advantages over hardcoding sample information: + +- Sample metadata stays associated with files throughout the workflow +- Process behavior can be customized per sample +- Output organization can reflect sample properties +- Sample information can be expanded during pipeline execution +- Filtering and grouping become more intuitive + +### Key Concepts + +- **Reading Samplesheets & creating meta maps** + + ```nextflow + Channel.fromPath('samplesheet.csv') + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + ``` + +- **Adding new keys to the meta map** + +1. based on process output: + + ```nextflow + .map { meta, file, lang -> + [ meta + [lang:lang], file ] + } + ``` + +2. and using a conditional clause + + ```nextflow + .map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] + } + ``` + +- **Filtering on meta values** + + ```nextflow + .filter { meta, file -> + meta.lang_group == 'romanic' + } + ``` + +- **Using Meta in Process Directives** + + ```nextflow + publishDir "results/${meta.lang}", mode: 'copy' + ``` + +- **Adapting tool parameters for individual samples** + + ```nextflow + cat $input_file | cowpy -c ${meta.character} > cowpy-${input_file} + ``` + +## Resources + +- [filter](https://www.nextflow.io/docs/latest/operator.html#filter) +- [map](https://www.nextflow.io/docs/latest/operator.html#map) +- [join](https://www.nextflow.io/docs/latest/operator.html#join) From 9f15d58838c967b747c6bedde5ad2ee2ded3f87f Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Wed, 21 May 2025 16:25:14 +0200 Subject: [PATCH 10/15] linting. again. 
--- docs/side_quests/metadata.md | 42 ++++++++++++++++++------------------ 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md index 113fb0b254..7639fb31ff 100644 --- a/docs/side_quests/metadata.md +++ b/docs/side_quests/metadata.md @@ -928,25 +928,25 @@ In this side quest, you've explored how to effectively work with metadata in Nex 1. **Reading and Structuring Metadata**: - - Using splitCsv to read samplesheets - - Creating structured meta maps from CSV data - - Keeping metadata associated with files through tuples +- Using splitCsv to read samplesheets +- Creating structured meta maps from CSV data +- Keeping metadata associated with files through tuples 2. **Expanding Metadata During Workflow**: - - Adding process outputs (language detection) to meta maps - - Creating derived metadata (language groups) using conditional logic - - Using join to merge new metadata with existing records +- Adding process outputs (language detection) to meta maps +- Creating derived metadata (language groups) using conditional logic +- Using join to merge new metadata with existing records 3. **Filtering Based on Metadata**: - - Creating subsets of data based on metadata properties +- Creating subsets of data based on metadata properties 4. **Customizing Process Behavior**: - - Using metadata to configure output directories - - Adjusting process parameters based on sample properties - - Creating sample-specific outputs +- Using metadata to configure output directories +- Adjusting process parameters based on sample properties +- Creating sample-specific outputs This approach offers several advantages over hardcoding sample information: @@ -972,20 +972,20 @@ This approach offers several advantages over hardcoding sample information: 1. 
based on process output: - ```nextflow - .map { meta, file, lang -> - [ meta + [lang:lang], file ] - } - ``` +```nextflow +.map { meta, file, lang -> + [ meta + [lang:lang], file ] +} +``` 2. and using a conditional clause - ```nextflow - .map{ meta, file -> - def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' - [ meta + [lang_group:lang_group], file ] - } - ``` +```nextflow +.map{ meta, file -> + def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic' + [ meta + [lang_group:lang_group], file ] +} +``` - **Filtering on meta values** From e24e2e528367c4275a59d8e53db66cd675f2b4a6 Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Wed, 21 May 2025 16:31:26 +0200 Subject: [PATCH 11/15] fix line highlights --- docs/side_quests/metadata.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md index 7639fb31ff..541cc3a170 100644 --- a/docs/side_quests/metadata.md +++ b/docs/side_quests/metadata.md @@ -144,7 +144,7 @@ In the samplesheet, we have both the input files and data about the input files === "After" - ```groovy title="main.nf" linenums="3" hl_lines="5-8" + ```groovy title="main.nf" linenums="3" hl_lines="3-6" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .map { row -> @@ -239,7 +239,7 @@ Let's include the process, run, and view it: === "After" - ```groovy title="main.nf" linenums="20" hl_lines="27-29" + ```groovy title="main.nf" linenums="20" hl_lines="9-10" workflow { ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") From 0d53fd6c19f675755d272299190499e2ca059e5f Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Wed, 21 May 2025 16:42:30 +0200 Subject: [PATCH 12/15] fix line highlights, formatting --- docs/side_quests/metadata.md | 79 +++++++++++++++++------------------- 1 file changed, 38 insertions(+), 41 deletions(-) diff --git 
a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md index 541cc3a170..137300c825 100644 --- a/docs/side_quests/metadata.md +++ b/docs/side_quests/metadata.md @@ -308,22 +308,22 @@ If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) === "After" -```groovy title="main.nf" linenums="20" hl_lines="27" -workflow { + ```groovy title="main.nf" linenums="20" hl_lines="8,11" + workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } - .view() + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } + .view() - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_prediction.view() + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction.view() - } + } -``` + ``` === "Before" @@ -373,22 +373,22 @@ We can see that the meta map is the first element in each map and the map is the === "After" -```groovy title="main.nf" linenums="20" hl_lines="27" -workflow { + ```groovy title="main.nf" linenums="20" hl_lines="11-12" + workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .map { row -> - [ [id:row.id, character:row.character], row.recording ] - } + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + [ [id:row.id, character:row.character], row.recording ] + } - ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) + ch_prediction = IDENTIFY_LANGUAGE(ch_samplesheet) - ch_languages = ch_samplesheet.join(ch_prediction) - .view() -} + ch_languages = ch_samplesheet.join(ch_prediction) + .view() + } -``` + ``` === "Before" @@ -434,7 +434,7 @@ Given that this is more data about the files, let's add it to our meta map. 
We c === "After" - ```groovy title="main.nf" linenums="20" hl_lines="31-33" + ```groovy title="main.nf" linenums="20" hl_lines="12-14" workflow { ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") @@ -500,13 +500,13 @@ Alright, now that we have our language predictions, let's use the information to We can use the `map` operator and an [ternary operator](https://groovy-lang.org/operators.html#_ternary_operator) to assign either group. The ternary operator, is a short cut to an if/else clause. It says: -``` +```console title="Ternary" variable = ? 'Value' : 'Default' ``` and is the same as: -``` +```console title="If/else" if (){ variable = 'Value' } else { @@ -516,7 +516,7 @@ if (){ === "After" - ```groovy title="main.nf" linenums="20" hl_lines="34-37" + ```groovy title="main.nf" linenums="20" hl_lines="15-18" workflow { ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") @@ -601,7 +601,7 @@ We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator. === "After" - ```groovy title="main.nf" linenums="20" hl_lines="38-42" + ```groovy title="main.nf" linenums="20" hl_lines="20-23" workflow { ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") @@ -737,7 +737,7 @@ Let's run our romanic languages through `COWPY` and remove our `view` statement: === "After" - ```groovy title="main.nf" linenums="40" hl_lines="61-63" + ```groovy title="main.nf" linenums="40" hl_lines="24" workflow { ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") @@ -798,10 +798,7 @@ We are still missing a publishing location. 
Given we have been trying to figure
 
 === "After"
 
-    ```groovy title="main.nf" linenums="23" hl_lines="25"
-    /*
-    * Generate ASCII art with cowpy
-    */
+    ```groovy title="main.nf" linenums="24" hl_lines="3"
     process COWPY {
 
         publishDir "results/${meta.lang}", mode: 'copy'
@@ -873,16 +870,16 @@ Let's customize the characters by changing the `cowpy` command:
 
 === "After"
 
-```groovy title="main.nf" linenums="35" hl_lines="37"
-script:
-"""
-cat $input_file | cowpy -c ${meta.character} > cowpy-${input_file}
-"""
-```
+    ```groovy title="main.nf" linenums="38" hl_lines="3"
+    script:
+    """
+    cat $input_file | cowpy -c ${meta.character} > cowpy-${input_file}
+    """
+    ```
 
 === "Before"
 
-    ```groovy title="main.nf" linenums="23"
+    ```groovy title="main.nf" linenums="38"
     script:
    """
     cat $input_file | cowpy -c "stegosaurus" > cowpy-${input_file}

From 909e36cc2a34c10cfec6d1e55c7796a109ac850a Mon Sep 17 00:00:00 2001
From: FriederikeHanssen
Date: Wed, 21 May 2025 16:43:51 +0200
Subject: [PATCH 13/15] fix indents

---
 docs/side_quests/metadata.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md
index 137300c825..c9fa3beacb 100644
--- a/docs/side_quests/metadata.md
+++ b/docs/side_quests/metadata.md
@@ -925,25 +925,25 @@ In this side quest, you've explored how to effectively work with metadata in Nex
 
 1. **Reading and Structuring Metadata**:
 
-- Using splitCsv to read samplesheets
-- Creating structured meta maps from CSV data
-- Keeping metadata associated with files through tuples
+    - Using splitCsv to read samplesheets
+    - Creating structured meta maps from CSV data
+    - Keeping metadata associated with files through tuples
 
 2. 
**Expanding Metadata During Workflow**: -- Adding process outputs (language detection) to meta maps -- Creating derived metadata (language groups) using conditional logic -- Using join to merge new metadata with existing records + - Adding process outputs (language detection) to meta maps + - Creating derived metadata (language groups) using conditional logic + - Using join to merge new metadata with existing records 3. **Filtering Based on Metadata**: -- Creating subsets of data based on metadata properties + - Creating subsets of data based on metadata properties 4. **Customizing Process Behavior**: -- Using metadata to configure output directories -- Adjusting process parameters based on sample properties -- Creating sample-specific outputs + - Using metadata to configure output directories + - Adjusting process parameters based on sample properties + - Creating sample-specific outputs This approach offers several advantages over hardcoding sample information: @@ -967,7 +967,7 @@ This approach offers several advantages over hardcoding sample information: - **Adding new keys to the meta map** -1. based on process output: + 1. based on process output: ```nextflow .map { meta, file, lang -> @@ -975,7 +975,7 @@ This approach offers several advantages over hardcoding sample information: } ``` -2. and using a conditional clause +2. 
and using a conditional clause ```nextflow .map{ meta, file -> From b50a976452c94263271334a7c8b8d55ab88bd5f2 Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Wed, 21 May 2025 18:27:10 +0200 Subject: [PATCH 14/15] flesh out some of the explanations --- docs/side_quests/metadata.md | 110 +++++++++++++++---------- side-quests/solutions/metadata/main.nf | 8 +- 2 files changed, 69 insertions(+), 49 deletions(-) diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md index c9fa3beacb..d374ecfdc4 100644 --- a/docs/side_quests/metadata.md +++ b/docs/side_quests/metadata.md @@ -128,7 +128,19 @@ Launching `main.nf` [exotic_albattani] DSL2 - revision: c0d03cec83 [id:sampleG, character:turtle, recording:/workspaces/training/side-quests/metadata/data/ciao.txt] ``` -We can see that each row from the CSV file has been converted into a map with keys matching the header row. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. +We can see that each row from the CSV file has been converted into a map with keys matching the header row. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. For example: + +```groovy +// Groovy map +def meta = [id:'sample1', character:'squirrel'] +println meta.id // Prints: sample1 +``` + +```python +# Python equivalent dictionary +meta = {'id': 'sample1', 'character': 'squirrel'} +print(meta['id']) # Prints: sample1 +``` Each map contains: @@ -140,7 +152,16 @@ This format makes it easy to access specific fields from each sample. For exampl ### 1.2 Separate meta data and data -In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. 
To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map:
+In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and to make the input of our processes more robust, we can combine the meta information into its own key-value paired map.
+
+Think of a meta map like a label attached to your data that contains important information about that sample - similar to how a library catalog card provides essential details about a book without being the book itself. This separation makes it easier to:
+
+- Track sample information throughout the workflow
+- Add new metadata as you process samples
+- Keep process inputs/outputs clean and organized
+- Query and filter samples based on their properties
+
+Now let's put this into practice and separate our metadata from the file path. We'll use the `map` operator to restructure our channel elements into a tuple consisting of the meta map and file:
 
 === "After"
 
@@ -161,6 +182,8 @@ In the samplesheet, we have both the input files and data about the input files
         .view()
     ```
 
+Let's run it:
+
 ```bash title="View meta map"
 nextflow run main.nf
 ```
@@ -179,14 +202,17 @@ Launching `main.nf` [lethal_booth] DSL2 - revision: 0d8f844c07
 [[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt]
 ```
 
-We have successfully separate our meta data into its own map to keep it next to the file data. Each of our channel elements now has the shape `[ metamap, file ]`.
+We have successfully separated our meta data into its own map to keep it next to the file data. Each of our channel elements now has the shape `[ meta, file ]`.
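+
+To make that shape concrete in plain Groovy terms (the sample values and file name here are made up for illustration), each channel element is a two-item list whose first entry is the meta map:
+
+```groovy
+// Hypothetical sketch of a single channel element
+def element = [ [id:'sampleA', character:'squirrel'], 'bonjour.txt' ]
+def (meta, file) = element // destructure into meta map and file
+println meta.id            // prints: sampleA
+```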
### Takeaway In this section, you've learned: -- **Reading in a samplesheet**: How to read in a samplesheet with `splitCsv` -- **Creating a meta map**: Moving columns with meta information into a separate data structure and keep it next to the input data +- **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data +- **Creating a meta map**: + - Separating metadata from file data using tuple structure `[ [id:value, ...], file ]` + - Keeping sample information organized and accessible throughout the workflow + - Declaring meta map as input/output declarations in processes --- @@ -207,7 +233,7 @@ Now we want to process our samples. These samples are language samples, but we d container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff' input: - tuple val(meta), path(greeting) + tuple val(meta), path(file) output: tuple val(meta), stdout @@ -235,7 +261,7 @@ Now we want to process our samples. These samples are language samples, but we d The tool [langid](https://github.com/saffsd/langid.py) is a language identification tool. It is pre-trained on a set of languages. For a given phrase, it prints a language guess and a probability score for each guess to the console. In the `script` section, we are removing the probability score, clean up the string by removing a newline character and return the language guess. Since it is printed directly to the console, we are using Nextflow's [`stdout` output qualifier](https://www.nextflow.io/docs/latest/process.html#outputs), passing the string on as output. 
-Let's include the process, run, and view it: +Let's include the process, then run, and view it: === "After" @@ -252,7 +278,6 @@ Let's include the process, run, and view it: ch_prediction.view() } - ``` === "Before" @@ -308,7 +333,7 @@ If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) === "After" - ```groovy title="main.nf" linenums="20" hl_lines="8,11" + ```groovy title="main.nf" linenums="20" hl_lines="8 11" workflow { ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") @@ -322,7 +347,6 @@ If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) ch_prediction.view() } - ``` === "Before" @@ -387,7 +411,6 @@ We can see that the meta map is the first element in each map and the map is the ch_languages = ch_samplesheet.join(ch_prediction) .view() } - ``` === "Before" @@ -501,14 +524,14 @@ Alright, now that we have our language predictions, let's use the information to We can use the `map` operator and a [ternary operator](https://groovy-lang.org/operators.html#_ternary_operator) to assign either group. The ternary operator is a shortcut for an if/else clause. It says: ```console title="Ternary" -variable = <condition> ? 'Value' : 'Default' +variable = <condition> ? 'if-the-condition-is-true' : 'Default' ``` and is the same as: ```console title="If/else" if (<condition>){ - variable = 'Value' + variable = 'if-the-condition-is-true' } else { variable = 'Default' } @@ -584,20 +607,31 @@ Launching `main.nf` [wise_almeida] DSL2 - revision: 46778c3cd0 [[id:sampleG, character:turtle, lang:it, lang_group:romanic], /workspaces/training/side-quests/metadata/data/ciao.txt] ``` +Let's understand how this transformation works. The `map` operator takes a closure that processes each element in the channel. Inside the closure, we're using a ternary operator to create a new language group classification. + +The ternary expression `(meta.lang.equals('de') || meta.lang.equals('en')) ? 
'germanic' : 'romanic'` works like this: + +- First, it evaluates the condition before the `?`: checks if the language is either German ('de') or English ('en') +- If the condition is true (language is German or English), it returns 'germanic' +- If the condition is false (any other language), it returns 'romanic' + +We store this result in the `lang_group` variable and then add it to our meta map using `meta + [lang_group:lang_group]`. The resulting channel elements maintain their `[meta, file]` structure, but the meta map now includes this new classification. This allows us to group samples by their language family later in the workflow. + ### Takeaway In this section, you've learned: -- **Merging on meta maps**: You used `join` to combine two channels based on their meta maps -- **Creating custom keys**: You created two new keys in your meta map. One based on a computed value from a process, and one based on a condition you set in the `map` operator. +- **Merging on meta maps**: You used `join` to combine two channels based on their meta maps to maintain relationships across processes and channels +- **Creating custom keys**: You created two new keys in your meta map, adding them with `meta + [new_key:value]`. One based on a computed value from a process, and one based on a condition you set in the `map` operator. +- **Ternary operator**: You used the ternary operator to determine which language belongs to which group. -Both of these allow you to associated new and existing meta data with files as you progress through your pipeline. +These allow you to associate new and existing meta data with files as you progress through your pipeline. --- ## 3. Filter data based on meta map values -We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator.html#filter) to filter the data based on a condition. Let's say we only want to process romanic language samples firther. We can do this by filtering the data based on the `lang_group` field. 
Let's create a new channel that only contains romanic languages and `view` it: +We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator.html#filter) to filter the data based on a condition. Let's say we only want to process romanic language samples further. We can do this by filtering the data based on the `lang_group` field. Let's create a new channel that only contains romanic languages and `view` it: === "After" @@ -655,7 +689,7 @@ We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator. } ``` -Let's rerun it +Let's rerun it: ```bash title="View romanic samples" nextflow run main.nf -resume @@ -685,17 +719,15 @@ In this case, we want to keep only the samples where `meta.lang_group == 'romani In this section, you've learned: -- How to filter data with `filter` +- How to use `filter` to select samples based on metadata -we now have only the romanic language samples left and can process those further. Next we want to make characters say the phrases. +We now have only the romanic language samples left and can process those further. Next we want to make characters say the phrases. --- ## 4. Customize a process with meta map -Let's let the characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. - -We will re-use a process from there. +Let's let the characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there. Copy in the process before your workflow block: @@ -798,7 +830,7 @@ We are still missing a publishing location. 
Given we have been trying to figure === "After" - ```groovy title="main.nf" linenums="24" hl_lines=3"" + ```groovy title="main.nf" linenums="24" hl_lines="3" process COWPY { publishDir "results/${meta.lang}", mode: 'copy' @@ -819,7 +851,7 @@ We are still missing a publishing location. Given we have been trying to figure Let's run this: ```bash title="Use cowpy" -nextflow run main.nf -resume +nextflow run main.nf ``` You should now see a new folder called `results`: @@ -914,8 +946,9 @@ This is a subtle difference to other parameters that we have set in the pipeline In this section, you've learned: -- **Tweaking directives using meta values** -- **Tweaking script section based on meta values** +- **Tweaking directives using meta values**: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties + +- **Tweaking script section based on meta values**: Customizing tool parameters per sample using meta information in the `script` section --- @@ -923,27 +956,15 @@ In this section, you've learned: In this side quest, you've explored how to effectively work with metadata in Nextflow workflows. Here's what you've learned: -1. **Reading and Structuring Metadata**: - - - Using splitCsv to read samplesheets - - Creating structured meta maps from CSV data - - Keeping metadata associated with files through tuples - -2. **Expanding Metadata During Workflow**: - - - Adding process outputs (language detection) to meta maps - - Creating derived metadata (language groups) using conditional logic - - Using join to merge new metadata with existing records +1. **Reading and Structuring Metadata**: Reading CSV files and creating organized metadata maps that stay associated with your data files -3. **Filtering Based on Metadata**: +2. 
**Expanding Metadata During Workflow**: Adding new information to your metadata as your pipeline progresses by adding process outputs and deriving values through conditional logic - - Creating subsets of data based on metadata properties +3. **Joining based on Metadata**: Using metadata to join process outputs and existing channels -4. **Customizing Process Behavior**: +4. **Filtering Based on Metadata**: Using metadata values to create specific subsets of your data - - Using metadata to configure output directories - - Adjusting process parameters based on sample properties - - Creating sample-specific outputs +5. **Customizing Process Behavior**: Using metadata to adapt how processes handle different samples This approach offers several advantages over hardcoding sample information: @@ -951,7 +972,6 @@ This approach offers several advantages over hardcoding sample information: - Process behavior can be customized per sample - Output organization can reflect sample properties - Sample information can be expanded during pipeline execution -- Filtering and grouping become more intuitive ### Key Concepts diff --git a/side-quests/solutions/metadata/main.nf b/side-quests/solutions/metadata/main.nf index 124e874ead..49cf3258f1 100644 --- a/side-quests/solutions/metadata/main.nf +++ b/side-quests/solutions/metadata/main.nf @@ -6,7 +6,7 @@ process IDENTIFY_LANGUAGE { container 'community.wave.seqera.io/library/pip_langid:b2269f456a5629ff' input: - tuple val(meta), path(greeting) + tuple val(meta), path(file) output: tuple val(meta), stdout @@ -28,14 +28,14 @@ process COWPY { container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' input: - tuple val(meta), path(input_file) + tuple val(meta), path(file) output: - tuple val(meta), path("cowpy-${input_file}") + tuple val(meta), path("cowpy-${file}") script: """ - cat $input_file | cowpy -c ${meta.character} > cowpy-${input_file} + cat $file | cowpy -c ${meta.character} > cowpy-${file} """ } From 
54c9d4c347d3c1db2171a0b55dd3bccbe3e7ccb0 Mon Sep 17 00:00:00 2001 From: FriederikeHanssen Date: Wed, 21 May 2025 19:03:00 +0200 Subject: [PATCH 15/15] fix text --- docs/side_quests/metadata.md | 42 +++++++++++++++++++++--------------- 1 file changed, 25 insertions(+), 17 deletions(-) diff --git a/docs/side_quests/metadata.md b/docs/side_quests/metadata.md index d374ecfdc4..d4d9e6be0c 100644 --- a/docs/side_quests/metadata.md +++ b/docs/side_quests/metadata.md @@ -1,6 +1,6 @@ # Metadata -Metadata is crucial information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics. +Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples and experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics. Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata helps us: - Configure processes based on sample characteristics - Group related samples for joint analysis -In this side quest, we'll explore how to handle metadata effectively in Nextflow workflows. +We'll explore how to handle metadata in workflows. 
Starting with a simple samplesheet containing basic sample information, you'll learn how to: - Read and parse sample metadata from CSV files - Create and manipulate metadata maps @@ -17,7 +17,7 @@ In this side quest, we'll explore how to handle metadata effectively in Nextflow These skills will help you build more robust and flexible pipelines that can handle complex sample relationships and processing requirements. -Let's dive in and see how metadata can make our workflows smarter and more maintainable! +Let's dive in! ## 0. Warmup @@ -54,7 +54,7 @@ You'll find a `data` directory containing a samplesheet and a main workflow file └── nextflow.config ``` -The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. In particular, the samplesheet has 3 columns: +The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data file contains greetings in different languages, but we don't know what language they are in. In particular, the samplesheet has 3 columns: - `id`: self-explanatory, an ID given to the sample - `character`: a character name, that we will use later to draw different creatures @@ -142,11 +142,11 @@ meta = {'id': 'sample1', 'character': 'squirrel'} print(meta['id']) # Prints: sample1 ``` -Each map contains: +Each map entry corresponds to a column: -- `id`: an ID given to the sample -- `character`: a character name, that we will use later to draw different creatures -- `data`: paths to `.txt` files that contain phrases in different languages +- `id` +- `character` +- `data` This format makes it easy to access specific fields from each sample. For example, we could access the sample ID with `id` or the txt file path with `data`. The output above shows each row from the CSV file converted into a map with keys matching the header row. 
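Continuing the Python analogy used above, `splitCsv(header: true)` behaves much like Python's `csv.DictReader`: one map per row, keyed by the header columns (illustrative sketch with invented values):

```python
# Illustrative Python analogue of splitCsv(header: true): each CSV row
# becomes a dict keyed by the header columns.
import csv
import io

text = "id,character,data\nsampleA,squirrel,data/bonjour.txt\n"
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["id"])  # sampleA
```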
Now that we've successfully read in the samplesheet and have access to the data in each row, we can begin implementing our pipeline logic. @@ -154,7 +154,7 @@ This format makes it easy to access specific fields from each sample. For exampl In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map. -Think of a meta map like a label attached to your data that contains important information about that sample - similar to how a library catalog card provides essential details about a book without being the book itself. This separation makes it easier to: +This separation makes it easier to: - Track sample information throughout the workflow - Add new metadata as you process samples @@ -209,10 +209,7 @@ We have successfully separate our meta data into its own map to keep it next to In this section, you've learned: - **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data -- **Creating a meta map**: - - Separating metadata from file data using tuple structure `[ [id:value, ...], file ]` - - Keeping sample information organized and accessible throughout the workflow - - Declaring meta map as input/output declarations in processes +- **Creating a meta map**: Separating metadata from file data using tuple structure `[ [id:value, ...], file ]` --- @@ -321,7 +318,7 @@ output: tuple val(meta), stdout ``` -This is a useful tool to ensure the sample-specific meta information stays connected with any new information that is generated. +This is a useful way to ensure the sample-specific meta information stays connected with any new information that is generated. 
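The `tuple val(meta), stdout` pattern can be pictured in plain Python (illustrative only — the lookup table below is a stand-in for the real `langid` call): the function returns its result together with the untouched metadata, so the prediction never loses track of which sample it belongs to:

```python
# Stand-in for a process that emits tuple val(meta), stdout: the
# metadata dict travels through unchanged next to the new value.
def identify_language(meta, text):
    fake_prediction = {"Bonjour": "fr", "Ciao": "it"}[text]  # stand-in for langid
    return (meta, fake_prediction)

meta, lang = identify_language({"id": "sampleA", "character": "squirrel"}, "Bonjour")
print(meta["id"], lang)  # sampleA fr
```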
### 2.2 Associate the language prediction with the input file @@ -453,7 +450,7 @@ It is becoming a bit hard to see, but if you look all the way on the right side, ### 2.3 Add the language prediction to the meta map -Given that this is more data about the files, let's add it to our meta map. We can use the [`map` operator](https://www.nextflow.io/docs/latest/operator.html#map) to create a new key `lang` and set the value to prediction: +Given that this is more data about the files, let's add it to our meta map. We can use the [`map` operator](https://www.nextflow.io/docs/latest/operator.html#map) to create a new key `lang` and set the value to the predicted language: === "After" @@ -515,7 +512,15 @@ Launching `main.nf` [cheeky_fermat] DSL2 - revision: d096281ee4 [[id:sampleG, character:turtle, lang:it], /workspaces/training/side-quests/metadata/data/ciao.txt] ``` -Nice, we expanded our meta map with new information we gathered in the pipeline. +Nice, we expanded our meta map with new information we gathered in the pipeline. Let's take a look at what happened here: + +After joining our channels, each element looks like this: + +```console +[meta, file, lang] // e.g. [[id:sampleA, character:squirrel], bonjour.txt, fr] +``` + +The `map` operator takes each channel element and processes it to create a modified version. Inside the closure `{ meta, file, lang -> ... }`, we then take the existing `meta` map, create a new map `[lang:lang]`, and merge both together using `+`, Groovy's way of combining maps. ### 2.4 Assign a language group using a ternary operator @@ -894,6 +899,8 @@ Let's take a look at `cowpy-salut.txt`: Look through the other files. All phrases should be spoken by the fashionable stegosaurus. +How did this work? 
The `publishDir` directive is evaluated at runtime when the process executes. Each process task gets its own meta map from the input tuple. When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample, creating dynamic paths like `results/fr`. + ### 4.2 Customize the character In our samplesheet, we have another column: `character`. To tailor the tool parameters per sample, we can also access information from the `meta` map in the script section. This is really useful in cases where a tool should have different parameters for each sample. @@ -1012,7 +1019,7 @@ This approach offers several advantages over hardcoding sample information: } ``` -- **Using Meta in Process Directives** +- **Using meta values in Process Directives** ```nextflow publishDir "results/${meta.lang}", mode: 'copy' @@ -1029,3 +1036,4 @@ This approach offers several advantages over hardcoding sample information: - [filter](https://www.nextflow.io/docs/latest/operator.html#filter) - [map](https://www.nextflow.io/docs/latest/operator.html#map) - [join](https://www.nextflow.io/docs/latest/operator.html#join) +- [stdout](https://www.nextflow.io/docs/latest/process.html#outputs)
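As a recap of the metadata patterns covered in this patch series, here is a compact Python sketch (illustrative only, with invented sample values) combining the conditional grouping, the map merge, and the filter step:

```python
# Recap (illustrative Python, not Nextflow): derive lang_group with a
# conditional expression, merge it into the meta dict (like meta + [...]),
# then filter on it (like the filter operator).
samples = [
    ({"id": "sampleA", "lang": "fr"}, "bonjour.txt"),
    ({"id": "sampleB", "lang": "en"}, "hello.txt"),
]

labelled = [
    ({**meta, "lang_group": "germanic" if meta["lang"] in ("de", "en") else "romanic"}, f)
    for meta, f in samples
]

romanic = [(meta, f) for meta, f in labelled if meta["lang_group"] == "romanic"]
print([meta["id"] for meta, f in romanic])  # ['sampleA']
```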