
Metadata Training #611


Open · FriederikeHanssen wants to merge 15 commits into master

Conversation

@FriederikeHanssen (Collaborator) commented May 16, 2025

Closes #520

  • This PR adds a training for meta maps. There is a solid overlap with the working-with-files training for things like parsing samplesheets, getting values from them, and filtering. As we come up with new courses we need to decide which of them comes first and can be used as a basis.
  • Uses maps a lot.
  • A bit on the fence about using the ternary operator here. Maybe it is too much; on the other hand, it is absolutely everywhere, so we have to include it at some point.
  • Sorts the side quests alphabetically and adds a missing one to the overview page.


netlify bot commented May 16, 2025

Deploy Preview for nextflow-training ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 54c9d4c |
| 🔍 Latest deploy log | https://app.netlify.com/projects/nextflow-training/deploys/682e0757693fe40008bc6010 |
| 😎 Deploy Preview | https://deploy-preview-611--nextflow-training.netlify.app |

@FriederikeHanssen marked this pull request as ready for review May 21, 2025 17:07
@FriederikeHanssen requested review from adamrtalbot and vdauwera and removed the request for adamrtalbot May 21, 2025 17:07
@adamrtalbot (Collaborator) left a comment

There's some heavy overlap with splitting and joining; I might be tempted to reduce it and stick with using meta values.

Simplify the ternary stuff; I think it takes attention away from meta maps.

@@ -0,0 +1,1039 @@
# Metadata

Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.

Suggested change
Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.
Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples and experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.

Maybe not "processing parameters", but I'm not sure what else to put...


Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.

Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata helps us:

Suggested change
Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata helps us:
Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata can be used to:

Comment on lines +153 to +164
### 1.2 Separate meta data and data

In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map.

This separation makes it easier to:

- Track sample information throughout the workflow
- Add new metadata as you process samples
- Keep process inputs/outputs clean and organized
- Query and filter samples based on their properties

Now let's use this and separate our metadata from the file path. We'll use the `map` operator to restructure our channel elements into a tuple consisting of the meta map and file:
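
For readers following along, a minimal sketch of what such a `map` step could look like, assuming the `id`, `character`, and `recording` samplesheet columns shown in the example output elsewhere in this thread (the course's actual code may differ):

```groovy
ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
    .splitCsv(header: true)
    .map { row ->
        // keep id and character as metadata, resolve the recording column to a file object
        def meta = [id: row.id, character: row.character]
        [ meta, file(row.recording) ]
    }

ch_samplesheet.view()
// e.g. [[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt]
```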

I think the motivation for this bit isn't very clear. Why do I want to separate the file from the values?

[[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt]
```

We have successfully separate our meta data into its own map to keep it next to the file data. Each of our channel elements now has the shape `[ meta, file ]`.

Suggested change
We have successfully separate our meta data into its own map to keep it next to the file data. Each of our channel elements now has the shape `[ meta, file ]`.
We have successfully separated our values into their own map to separate it from the file. Each of our channel elements now has the shape `[ meta, file ]`.

Then finish with a sentence saying why this is good.


In this section, you've learned:

- **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data

Suggested change
- **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data
- **Why metadata is important?** Something here about preserving sample info
- **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data

Comment on lines +323 to +329
### 2.2 Associate the language prediction with the input file

At the moment, our sample files and their language prediction are separated in two different channels: `ch_samplesheet` and `ch_predictions`. But both channels have the same meta information associated with the interesting data points. We can use the meta map to combine our channels back together.

Nextflow includes many methods for combining channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). If you are familiar with SQL, it acts like the `JOIN` operation, where we specify the key to join on and the type of join to perform.

If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) documentation, we can see that it joins two channels based on a defined item, by default the first item in each tuple. If you don't have the console output still available, let's run the pipeline to check our data structures. If you removed the `view()` operator from the `ch_samplesheet` add it back in:
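
For illustration, a small sketch of the `join` pattern being described, assuming `ch_samplesheet` carries `[ meta, file ]` tuples and `ch_predictions` carries `[ meta, lang ]` tuples (the channel shapes are assumptions, not the training's exact code):

```groovy
ch_joined = ch_samplesheet      // assumed shape: [ meta, file ]
    .join(ch_predictions)       // matches on the first tuple element, the meta map
    .view()                     // each matched element becomes [ meta, file, lang ]
```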

Do we want to do this or just pass the file through the process? It might add too much too early.


I agree, this should be a two-step thing:

  1. just pass the input file to the output (they've seen that pattern in earlier sections if they've done Genomics)
  2. actually, that's a bad plan, because X. What we usually do is keep them separate to keep channel structure simple, yada yada yada


### 2.4 Assign a language group using a ternary operator

Alright, now that we have our language predictions, let's use the information to assign them into new groups. In our example data, we have provided data sets that belong either to `germanic` (either English or German) or `romanic` (French, Spanish, Italian) languages.

In English we call them "romance" languages, but we should definitely check.


Definitely romance.

Comment on lines +562 to +566
.map{ meta, file ->
    def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'
    [ meta + [lang_group:lang_group], file ]
}
.view()

This is a bit messy...can we do something cleaner?

We could either (A) write a function, (B) use a switch statement, or (C) use good old if statements.
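
A rough sketch of option (A), assuming a joined channel named `ch_joined` with `[ meta, file ]` elements; the function name is made up, and 'romance' follows the naming correction discussed above:

```groovy
// good old if statements wrapped in a named function, instead of an inline ternary
def classifyLanguage(String lang) {
    if (lang in ['de', 'en']) {
        return 'germanic'
    }
    return 'romance'
}

ch_joined
    .map { meta, file ->
        [ meta + [lang_group: classifyLanguage(meta.lang)], file ]
    }
    .view()
```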

Comment on lines +617 to +621
The ternary expression `(meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'` works like this:

- First, it evaluates the condition before the `?`: checks if the language is either German ('de') or English ('en')
- If the condition is true (language is German or English), it returns 'germanic'
- If the condition is false (any other language), it returns 'romanic'

Simplifying or removing ternaries will keep this more focused on metadata.


Look through the other files. All phrases should be spoken by the fashionable stegosaurus.

How did this work? The `publishDir` directive is evaluated at runtime when the process executes.Each process task gets its own meta map from the input tuple When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample creating the dynamic paths like `results/fr`.

Suggested change
How did this work? The `publishDir` directive is evaluated at runtime when the process executes.Each process task gets its own meta map from the input tuple When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample creating the dynamic paths like `results/fr`.
How did this work? The `publishDir` directive is evaluated at runtime when the process executes. Each process task gets its own meta map from the input tuple When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample creating the dynamic paths like `results/fr`.
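
A minimal sketch of the runtime-evaluated `publishDir` pattern being discussed; the process name and script body are illustrative, not the training's actual process:

```groovy
process WRITE_GREETING {
    // ${meta.lang} is resolved per task from the meta map in the input tuple,
    // producing dynamic paths such as results/fr or results/de
    publishDir "results/${meta.lang}", mode: 'copy'

    input:
    tuple val(meta), path(recording)

    output:
    tuple val(meta), path("${meta.id}.txt")

    script:
    """
    cat ${recording} > ${meta.id}.txt
    """
}
```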

@pinin4fjords (Collaborator) left a comment

Nice work, great simple subject to teach with! In general:

  1. Having the 'after' before the 'before' in examples makes me go a bit cross-eyed
  2. You don't need to include such large sections of text in the 'before', just enough to anchor the reader in the text. Copying whole blocks repeatedly is hard to parse, and for us to maintain!
  3. It definitely is 'romance' rather than 'romanic'.

└── nextflow.config
```

The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data files contains greetings in different languages, but we don't know what language they are in. In particular, the samplesheet has 3 columns:

Suggested change
The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data files contains greetings in different languages, but we don't know what language they are in. In particular, the samplesheet has 3 columns:
The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data file contains greetings in different languages, but we don't know what language they are in. In particular, the samplesheet has 3 columns:


Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels.

=== "After"

Did you mean to show the after before the 'before'?

ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
```

We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file.

Maybe we should actually explain the concept of a map/dict/hash, even if just in a note. I don't think we really stress it a lot anywhere else right now, and not all of the audience will be coders familiar with the concept.

[id:sampleG, character:turtle, recording:/workspaces/training/side-quests/metadata/data/ciao.txt]
```

We can see that each row from the CSV file has been converted into a map with keys matching the header row. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. For example:
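
A tiny Groovy example of the map concept, along the lines of the note suggested above (values mirror the sample output and are purely illustrative):

```groovy
def meta = [id: 'sampleG', character: 'turtle']

println meta.id             // access by key: sampleG
println meta['character']   // bracket syntax also works: turtle

def extended = meta + [lang: 'it']   // adding entries returns a new, larger map
println extended            // [id:sampleG, character:turtle, lang:it]
```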

Ahh, you do the map explanation here, it just needs moving up to when you first use the concept.


### 1.2 Separate meta data and data

In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map.

Suggested change
In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map.
In the samplesheet, we have both data (input files) and associated metadata (`id`, `character`). We'll be referencing and adding to the metadata, but it will be the data files themselves that form the core of our processing. As such, it's important to separate data and metadata early on.

Just my take on a revised motivation, disregard at will.

Comment on lines +615 to +621
Let's understand how this transformation works. The `map` operator takes a closure that processes each element in the channel. Inside the closure, we're using a ternary operator to create a new language group classification.

The ternary expression `(meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'` works like this:

- First, it evaluates the condition before the `?`: checks if the language is either German ('de') or English ('en')
- If the condition is true (language is German or English), it returns 'germanic'
- If the condition is false (any other language), it returns 'romanic'

I say just scrap this bit. You talk about ternaries above and point to docs, that's sufficient.

Comment on lines +627 to +631
In this section, you've learned:

- **Merging on meta maps**: You used `join` to combine two channels based on their meta maps to maintain relationships across processes and channels
- **Creating custom keys**: You created two new keys in your meta map, adding them with `meta + [new_key:value]`. One based on a computed value from a process, and one based on a condition you set in the `map` operator.
- **Ternary operator**: You used the ternary operator to determine which language belongs to which group.

Suggested change
In this section, you've learned:
- **Merging on meta maps**: You used `join` to combine two channels based on their meta maps to maintain relationships across processes and channels
- **Creating custom keys**: You created two new keys in your meta map, adding them with `meta + [new_key:value]`. One based on a computed value from a process, and one based on a condition you set in the `map` operator.
- **Ternary operator**: You used the ternary operator to determine which language belongs to which group.
In this section, you've learned how to:
- **Merge on meta maps**: You used `join` to combine two channels based on their meta maps to maintain relationships across processes and channels
- **Create custom keys**: You created two new keys in your meta map, adding them with `meta + [new_key:value]`. One based on a computed value from a process, and one based on a condition you set in the `map` operator.
- **Use a ternary operator**: You used the ternary operator to determine which language belongs to which group.

```groovy title="main.nf" linenums="20" hl_lines="20-23"
workflow {

ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")

You don't need to repeat such big sections; it just detracts from the message.


## 4. Customize a process with meta map

Let's let the characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there.

Suggested change
Let's let the characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there.
Let's let some fun characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there.

'the' implies we know what characters you're referring to, which we don't yet.

Comment on lines +954 to +958
In this section, you've learned:

- **Tweaking directives using meta values**: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties

- **Tweaking script section based on meta values**: Customizing tool parameters per sample using meta information in the `script` section

Suggested change
In this section, you've learned:
- **Tweaking directives using meta values**: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties
- **Tweaking script section based on meta values**: Customizing tool parameters per sample using meta information in the `script` section
In this section, you've learned how to:
- **Tweak directives using meta values**: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties
- **Tweak the script section based on meta values**: Customizing tool parameters per sample using meta information in the `script` section
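
A sketch of the second bullet, using a meta value to customize the tool invocation in the `script` block; the process shape, file names, and the cowpy `-c` character flag are assumptions for illustration:

```groovy
process COWPY {
    publishDir "results/${meta.lang}", mode: 'copy'

    input:
    tuple val(meta), path(phrases)

    output:
    tuple val(meta), path("cowpy-${meta.id}.txt")

    script:
    // meta.character selects the ASCII character per sample
    """
    cat ${phrases} | cowpy -c ${meta.character} > cowpy-${meta.id}.txt
    """
}
```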


Successfully merging this pull request may close these issues.

Side Quest idea: Metadata handling
3 participants