Metadata Training #611
Conversation
There's some heavy overlap with splitting and joining, I might be tempted to reduce it and stick with using meta values.
Simplify the ternary stuff, I think it takes attention away from metamaps.
@@ -0,0 +1,1039 @@
# Metadata

Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.
Suggested change:

```diff
- Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.
+ Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples and experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.
```
Maybe not "processing parameters", but I'm not sure what else to put...
Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata helps us: |
Suggested change:

```diff
- Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata helps us:
+ Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata can be used to:
```
### 1.2 Separate meta data and data

In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map.

This separation makes it easier to:

- Track sample information throughout the workflow
- Add new metadata as you process samples
- Keep process inputs/outputs clean and organized
- Query and filter samples based on their properties

Now let's use this and separate our metadata from the file path. We'll use the `map` operator to restructure our channel elements into a tuple consisting of the meta map and file:
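A minimal sketch of what that `map` call might look like (the field names `id`, `character`, and `recording` follow the samplesheet columns described elsewhere in the chapter; treat this as an illustration rather than the final solution):

```groovy
ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
    .splitCsv(header: true)
    .map { row ->
        // Reshape each row into [ meta, file ]: the meta map carries the
        // sample-level values, the file path stands on its own.
        def meta = [id: row.id, character: row.character]
        [ meta, file(row.recording) ]
    }
    .view()
```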
I think the motivation for this bit isn't very clear. Why do I want to separate the file from the values?
```
[[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt]
```

We have successfully separate our meta data into its own map to keep it next to the file data. Each of our channel elements now has the shape `[ meta, file ]`.
Suggested change:

```diff
- We have successfully separate our meta data into its own map to keep it next to the file data. Each of our channel elements now has the shape `[ meta, file ]`.
+ We have successfully separated our values into their own map to separate it from the file. Each of our channel elements now has the shape `[ meta, file ]`.
```
Then finish with a sentence saying why this is good.
In this section, you've learned:

- **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data |
Suggested change:

```diff
- - **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data
+ - **Why metadata is important?** Something here about preserving sample info
+ - **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data
```
### 2.2 Associate the language prediction with the input file

At the moment, our sample files and their language prediction are separated in two different channels: `ch_samplesheet` and `ch_predictions`. But both channels have the same meta information associated with the interesting data points. We can use the meta map to combine our channels back together.

Nextflow includes many methods for combining channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). If you are familiar with SQL, it acts like the `JOIN` operation, where we specify the key to join on and the type of join to perform.

If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) documentation, we can see that it joins two channels based on a defined item, by default the first item in each tuple. If you don't have the console output still available, let's run the pipeline to check our data structures. If you removed the `view()` operator from the `ch_samplesheet` add it back in:
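As a hedged sketch, joining the two channels on their shared meta map could look like this (the channel names follow the text above; the exact `[ meta, value ]` shapes are assumptions):

```groovy
// Both channels are shaped [ meta, ... ], so `join` matches on the
// first tuple element (the meta map) by default.
ch_joined = ch_samplesheet      // [ meta, file ]
    .join(ch_predictions)       // [ meta, lang ]
    .view()                     // emits [ meta, file, lang ]
```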
Do we want to do this or just pass the file through the process? It might add too much too early.
I agree, this should be a two-step thing:
- just pass the input file to the output (they've seen that pattern in earlier sections if they've done Genomics)
- actually, that's a bad plan, because X. What we usually do is keep them separate to keep channel structure simple, yada yada yada
### 2.4 Assign a language group using a ternary operator

Alright, now that we have our language predictions, let's use the information to assign them into new groups. In our example data, we have provided data sets that belong either to `germanic` (either English or German) or `romanic` (French, Spanish, Italian) languages.
In English we call them "romance" languages, but we should definitely check.
Definitely romance.
```groovy
    .map{ meta, file ->
        def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'
        [ meta + [lang_group:lang_group], file ]
    }
    .view()
```
This is a bit messy...can we do something cleaner?
We could either (A) write a function, (B) use a switch/case, or (C) use good old if statements.
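For instance, option (A) might look like this — a sketch only, not tested against the training code, and the channel name `ch_languages` is a placeholder:

```groovy
// Option (A): hide the classification behind a small named function,
// so the `map` closure stays focused on reshaping the channel.
def languageGroup(lang) {
    if (lang in ['de', 'en']) {
        return 'germanic'
    }
    return 'romance'
}

ch_languages
    .map { meta, file ->
        [ meta + [lang_group: languageGroup(meta.lang)], file ]
    }
    .view()
```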
The ternary expression `(meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'` works like this:

- First, it evaluates the condition before the `?`: checks if the language is either German ('de') or English ('en')
- If the condition is true (language is German or English), it returns 'germanic'
- If the condition is false (any other language), it returns 'romanic'
Simplifying or removing ternaries will keep this more focused on metadata.
Look through the other files. All phrases should be spoken by the fashionable stegosaurus.

How did this work? The `publishDir` directive is evaluated at runtime when the process executes.Each process task gets its own meta map from the input tuple When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample creating the dynamic paths like `results/fr`.
Suggested change:

```diff
- How did this work? The `publishDir` directive is evaluated at runtime when the process executes.Each process task gets its own meta map from the input tuple When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample creating the dynamic paths like `results/fr`.
+ How did this work? The `publishDir` directive is evaluated at runtime when the process executes. Each process task gets its own meta map from the input tuple When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample creating the dynamic paths like `results/fr`.
```
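A sketch of how such a dynamic `publishDir` might be declared — the process name, output pattern, and `cowpy` invocation here are illustrative, not the training repo's exact code:

```groovy
process COWPY {
    // `meta` comes from the input tuple below, so each task resolves
    // ${meta.lang} to its own sample's language, e.g. results/fr
    publishDir "results/${meta.lang}", mode: 'copy'

    input:
    tuple val(meta), path(phrase)

    output:
    tuple val(meta), path("cowpy-${phrase}")

    script:
    """
    cat ${phrase} | cowpy > cowpy-${phrase}
    """
}
```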
Nice work great simple subject to teach with! In general:
- Having the 'after' before the 'before' in examples makes me go a bit cross-eyed
- You don't need to include such large sections of text in the 'before', just enough to anchor the reader in the text. Copying whole blocks repeatedly is hard to parse, and hard for us to maintain!
- It definitely is 'romance' rather than 'romanic'.
```
└── nextflow.config
```

The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data files contains greetings in different languages, but we don't know what language they are in. In particular, the samplesheet has 3 columns:
Suggested change:

```diff
- The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data files contains greetings in different languages, but we don't know what language they are in. In particular, the samplesheet has 3 columns:
+ The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data file contains greetings in different languages, but we don't know what language they are in. In particular, the samplesheet has 3 columns:
```
Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels.

=== "After"
Did you mean to show the after before the 'before'?
```groovy
ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
```

We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file.
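In outline, that step might look like this (a sketch; `header: true` tells `splitCsv` to use the first row of the CSV as the map keys):

```groovy
ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
    .splitCsv(header: true)   // each row becomes a map keyed by the header
    .view()
```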
Maybe we should actually explain the concept of a map/dict/hash, even if just in a note. I don't think we really stress it a lot anywhere else right now, and not all of the audience will be coders familiar with the concept.
```
[id:sampleG, character:turtle, recording:/workspaces/training/side-quests/metadata/data/ciao.txt]
```

We can see that each row from the CSV file has been converted into a map with keys matching the header row. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. For example:
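For readers new to the concept, a Groovy map literal could be sketched like this (the values are illustrative):

```groovy
// A map pairs keys with values and is looked up by key:
def meta = [id: 'sampleG', character: 'turtle']
println meta.id            // sampleG
println meta['character']  // turtle

// Adding entries with `+` returns a new, merged map,
// leaving the original untouched:
def extended = meta + [lang: 'it']
```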
Ahh, you do the map explanation here, it just needs moving up to when you first use the concept.
### 1.2 Separate meta data and data

In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map.
Suggested change:

```diff
- In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map.
+ In the samplesheet, we have both data (input files) and associated metadata (`id`, `character`). We'll be referencing and adding to the metadata, but it will be the data files themselves that form the core of our processing. As such, it's important to separate data and metadata early on.
```
Just my take on a revised motivation, disregard at will.
Let's understand how this transformation works. The `map` operator takes a closure that processes each element in the channel. Inside the closure, we're using a ternary operator to create a new language group classification.

The ternary expression `(meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'` works like this:

- First, it evaluates the condition before the `?`: checks if the language is either German ('de') or English ('en')
- If the condition is true (language is German or English), it returns 'germanic'
- If the condition is false (any other language), it returns 'romanic'
I say just scrap this bit. You talk about ternaries above and point to docs, that's sufficient.
In this section, you've learned:

- **Merging on meta maps**: You used `join` to combine two channels based on their meta maps to maintain relationships across processes and channels
- **Creating custom keys**: You created two new keys in your meta map, adding them with `meta + [new_key:value]`. One based on a computed value from a process, and one based on a condition you set in the `map` operator.
- **Ternary operator**: You used the ternary operator to determine which language belongs to which group.
Suggested change:

```diff
- In this section, you've learned:
- - **Merging on meta maps**: You used `join` to combine two channels based on their meta maps to maintain relationships across processes and channels
- - **Creating custom keys**: You created two new keys in your meta map, adding them with `meta + [new_key:value]`. One based on a computed value from a process, and one based on a condition you set in the `map` operator.
- - **Ternary operator**: You used the ternary operator to determine which language belongs to which group.
+ In this section, you've learned how to:
+ - **Merge on meta maps**: You used `join` to combine two channels based on their meta maps to maintain relationships across processes and channels
+ - **Create custom keys**: You created two new keys in your meta map, adding them with `meta + [new_key:value]`. One based on a computed value from a process, and one based on a condition you set in the `map` operator.
+ - **Use a ternary operator**: You used the ternary operator to determine which language belongs to which group.
```
```groovy title="main.nf" linenums="20" hl_lines="20-23"
workflow {

    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
```
You don't need to repeat such big sections, it just detracts from the message
## 4. Customize a process with meta map

Let's let the characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there.
Suggested change:

```diff
- Let's let the characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there.
+ Let's let some fun characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there.
```
'the' implies we know what characters you're referring to, which we don't yet.
In this section, you've learned:

- **Tweaking directives using meta values**: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties

- **Tweaking script section based on meta values**: Customizing tool parameters per sample using meta information in the `script` section
Suggested change:

```diff
- In this section, you've learned:
- - **Tweaking directives using meta values**: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties
- - **Tweaking script section based on meta values**: Customizing tool parameters per sample using meta information in the `script` section
+ In this section, you've learned how to:
+ - **Tweak directives using meta values**: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties
+ - **Tweak the script section based on meta values**: Customizing tool parameters per sample using meta information in the `script` section
```
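As an illustrative sketch of the second point, per-sample customization in the `script` block might look like this (the `character` key follows the samplesheet; the process name, output pattern, and `cowpy -c` flag are assumptions, not the training repo's exact code):

```groovy
process COWPY {
    input:
    tuple val(meta), path(phrase)

    output:
    tuple val(meta), path("cowpy-${phrase}")

    script:
    // Pick the ASCII-art character from the meta map, one per sample
    """
    cat ${phrase} | cowpy -c ${meta.character} > cowpy-${phrase}
    """
}
```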
Closes #520