Metadata Training #611
Conversation
There's some heavy overlap with splitting and joining, I might be tempted to reduce it and stick with using meta values.
Simplify the ternary stuff, I think it takes attention away from metamaps.
@@ -0,0 +1,1039 @@
# Metadata

Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.
Suggested change:

```diff
- Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples, experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.
+ Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples and experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.
```
Maybe not "processing parameters", but I'm not sure what else to put...
Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata helps us: |
Suggested change:

```diff
- Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata helps us:
+ Think of it like a library catalog: while books contain the actual content (raw data), the catalog cards provide essential information about each book - when it was published, who wrote it, where to find it (metadata). In Nextflow pipelines, metadata can be used to:
```
### 1.2 Separate meta data and data

In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map.

This separation makes it easier to:

- Track sample information throughout the workflow
- Add new metadata as you process samples
- Keep process inputs/outputs clean and organized
- Query and filter samples based on their properties

Now let's use this and separate our metadata from the file path. We'll use the `map` operator to restructure our channel elements into a tuple consisting of the meta map and file:
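A minimal sketch of what that `map` call might look like (the field names `id`, `character`, and `recording` follow the samplesheet columns described elsewhere in the chapter; treat this as an illustration rather than the final solution):

```groovy
ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
    .splitCsv(header: true)
    .map { row ->
        // Reshape each row into [ meta, file ]: the meta map carries the
        // sample-level values, the file path stands on its own.
        def meta = [id: row.id, character: row.character]
        [ meta, file(row.recording) ]
    }
    .view()
```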
I think the motivation for this bit isn't very clear. Why do I want to separate the file from the values?
```
[[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt]
```

We have successfully separate our meta data into its own map to keep it next to the file data. Each of our channel elements now has the shape `[ meta, file ]`.
Suggested change:

```diff
- We have successfully separate our meta data into its own map to keep it next to the file data. Each of our channel elements now has the shape `[ meta, file ]`.
+ We have successfully separated our values into their own map to separate it from the file. Each of our channel elements now has the shape `[ meta, file ]`.
```
Then finish with a sentence saying why this is good.
In this section, you've learned:

- **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data |
Suggested change:

```diff
- - **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data
+ - **Why metadata is important?** Something here about preserving sample info
+ - **Reading in a samplesheet**: Using `splitCsv` to read CSV files with header information and transform rows into structured data
```
### 2.2 Associate the language prediction with the input file

At the moment, our sample files and their language prediction are separated in two different channels: `ch_samplesheet` and `ch_predictions`. But both channels have the same meta information associated with the interesting data points. We can use the meta map to combine our channels back together.

Nextflow includes many methods for combining channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). If you are familiar with SQL, it acts like the `JOIN` operation, where we specify the key to join on and the type of join to perform.

If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) documentation, we can see that it joins two channels based on a defined item, by default the first item in each tuple. If you don't have the console output still available, let's run the pipeline to check our data structures. If you removed the `view()` operator from the `ch_samplesheet` add it back in:
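As a hedged sketch, joining the two channels on their shared meta map could look like this (the channel names follow the text above; the exact `[ meta, value ]` shapes are assumptions):

```groovy
// Both channels are shaped [ meta, ... ], so `join` matches on the
// first tuple element (the meta map) by default.
ch_joined = ch_samplesheet      // [ meta, file ]
    .join(ch_predictions)       // [ meta, lang ]
    .view()                     // emits [ meta, file, lang ]
```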
Do we want to do this or just pass the file through the process? It might add too much too early.
I agree, this should be a two-step thing:
- just pass the input file to the output (they've seen that pattern in earlier sections if they've done Genomics)
- actually, that's a bad plan, because X. What we usually do is keep them separate to keep channel structure simple, yada yada yada
### 2.4 Assign a language group using a ternary operator

Alright, now that we have our language predictions, let's use the information to assign them into new groups. In our example data, we have provided data sets that belong either to `germanic` (either English or German) or `romanic` (French, Spanish, Italian) languages.
In English we call them "romance" languages, but we should definitely check.
Definitely romance.
```groovy
    .map{ meta, file ->
        def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'
        [ meta + [lang_group:lang_group], file ]
    }
    .view()
```
This is a bit messy...can we do something cleaner?
We could either (A) write a function, (B) use a switch/case, or (C) use good old if statements.
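For instance, option (A) might look like this — a sketch only, not tested against the training code, and the channel name `ch_languages` is a placeholder:

```groovy
// Option (A): hide the classification behind a small named function,
// so the `map` closure stays focused on reshaping the channel.
def languageGroup(lang) {
    if (lang in ['de', 'en']) {
        return 'germanic'
    }
    return 'romance'
}

ch_languages
    .map { meta, file ->
        [ meta + [lang_group: languageGroup(meta.lang)], file ]
    }
    .view()
```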
The ternary expression `(meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'` works like this:

- First, it evaluates the condition before the `?`: checks if the language is either German ('de') or English ('en')
- If the condition is true (language is German or English), it returns 'germanic'
- If the condition is false (any other language), it returns 'romanic'
Simplifying or removing ternaries will keep this more focused on metadata.
Look through the other files. All phrases should be spoken by the fashionable stegosaurus.

How did this work? The `publishDir` directive is evaluated at runtime when the process executes.Each process task gets its own meta map from the input tuple When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample creating the dynamic paths like `results/fr`.
Suggested change:

```diff
- How did this work? The `publishDir` directive is evaluated at runtime when the process executes.Each process task gets its own meta map from the input tuple When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample creating the dynamic paths like `results/fr`.
+ How did this work? The `publishDir` directive is evaluated at runtime when the process executes. Each process task gets its own meta map from the input tuple When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample creating the dynamic paths like `results/fr`.
```
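A sketch of how such a dynamic `publishDir` might be declared — the process name, output pattern, and `cowpy` invocation here are illustrative, not the training repo's exact code:

```groovy
process COWPY {
    // `meta` comes from the input tuple below, so each task resolves
    // ${meta.lang} to its own sample's language, e.g. results/fr
    publishDir "results/${meta.lang}", mode: 'copy'

    input:
    tuple val(meta), path(phrase)

    output:
    tuple val(meta), path("cowpy-${phrase}")

    script:
    """
    cat ${phrase} | cowpy > cowpy-${phrase}
    """
}
```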
Nice work great simple subject to teach with! In general:
- Having the 'after' before the 'before' in examples makes me go a bit cross-eyed
- You don't need to include such large sections of text in the 'before', just enough to anchor the reader in the text. Copying whole blocks repeatedly is hard to parse, and hard for us to maintain!
- It definitely is 'romance' rather than 'romanic'.
```
└── nextflow.config
```

The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data files contains greetings in different languages, but we don't know what language they are in. In particular, the samplesheet has 3 columns:
Suggested change:

```diff
- The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data files contains greetings in different languages, but we don't know what language they are in. In particular, the samplesheet has 3 columns:
+ The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data file contains greetings in different languages, but we don't know what language they are in. In particular, the samplesheet has 3 columns:
```
Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels.

=== "After"
Did you mean to show the after before the 'before'?
```groovy
ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
```

We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file.
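In outline, that step might look like this (a sketch; `header: true` tells `splitCsv` to use the first row of the CSV as the map keys):

```groovy
ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
    .splitCsv(header: true)   // each row becomes a map keyed by the header
    .view()
```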
Maybe we should actually explain the concept of a map/dict/hash, even if just in a note. I don't think we really stress it a lot anywhere else right now, and not all of the audience will be coders familiar with the concept.
```
[id:sampleG, character:turtle, recording:/workspaces/training/side-quests/metadata/data/ciao.txt]
```

We can see that each row from the CSV file has been converted into a map with keys matching the header row. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. For example:
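For readers new to the concept, a Groovy map literal could be sketched like this (the values are illustrative):

```groovy
// A map pairs keys with values and is looked up by key:
def meta = [id: 'sampleG', character: 'turtle']
println meta.id            // sampleG
println meta['character']  // turtle

// Adding entries with `+` returns a new, merged map,
// leaving the original untouched:
def extended = meta + [lang: 'it']
```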
Ahh, you do the map explanation here, it just needs moving up to when you first use the concept.
### 1.2 Separate meta data and data

In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map.
Suggested change:

```diff
- In the samplesheet, we have both the input files and data about the input files (`id`, `character`), the meta data. As we progress through the workflow, we generate more meta data about each sample. To avoid having to keep track of how many fields we have at any point in time and making the input of our processes more robust, we can combine the meta information into its own key-value paired map.
+ In the samplesheet, we have both data (input files) and associated metadata (`id`, `character`). We'll be referencing and adding to the metadata, but it will be the data files themselves that form the core of our processing. As such, it's important to separate data and metadata early on.
```
Just my take on a revised motivation, disregard at will.
Let's understand how this transformation works. The `map` operator takes a closure that processes each element in the channel. Inside the closure, we're using a ternary operator to create a new language group classification.

The ternary expression `(meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'` works like this:

- First, it evaluates the condition before the `?`: checks if the language is either German ('de') or English ('en')
- If the condition is true (language is German or English), it returns 'germanic'
- If the condition is false (any other language), it returns 'romanic'
I say just scrap this bit. You talk about ternaries above and point to docs, that's sufficient.
In this section, you've learned:

- **Merging on meta maps**: You used `join` to combine two channels based on their meta maps to maintain relationships across processes and channels
- **Creating custom keys**: You created two new keys in your meta map, adding them with `meta + [new_key:value]`. One based on a computed value from a process, and one based on a condition you set in the `map` operator.
- **Ternary operator**: You used the ternary operator to determine which language belongs to which group.
Suggested change:

```diff
- In this section, you've learned:
- - **Merging on meta maps**: You used `join` to combine two channels based on their meta maps to maintain relationships across processes and channels
- - **Creating custom keys**: You created two new keys in your meta map, adding them with `meta + [new_key:value]`. One based on a computed value from a process, and one based on a condition you set in the `map` operator.
- - **Ternary operator**: You used the ternary operator to determine which language belongs to which group.
+ In this section, you've learned how to:
+ - **Merge on meta maps**: You used `join` to combine two channels based on their meta maps to maintain relationships across processes and channels
+ - **Create custom keys**: You created two new keys in your meta map, adding them with `meta + [new_key:value]`. One based on a computed value from a process, and one based on a condition you set in the `map` operator.
+ - **Use a ternary operator**: You used the ternary operator to determine which language belongs to which group.
```
```groovy title="main.nf" linenums="20" hl_lines="20-23"
workflow {

    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
```
You don't need to repeat such big sections, it just detracts from the message
## 4. Customize a process with meta map

Let's let the characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there.
Suggested change:

```diff
- Let's let the characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there.
+ Let's let some fun characters say the phrases that we have passed in. In the [hello-nextflow training](../hello_nextflow/05_hello_containers.md), you already encountered the `cowpy` package, a python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there.
```
'the' implies we know what characters you're referring to, which we don't yet.
In this section, you've learned:

- **Tweaking directives using meta values**: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties

- **Tweaking script section based on meta values**: Customizing tool parameters per sample using meta information in the `script` section
Suggested change:

```diff
- In this section, you've learned:
- - **Tweaking directives using meta values**: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties
- - **Tweaking script section based on meta values**: Customizing tool parameters per sample using meta information in the `script` section
+ In this section, you've learned how to:
+ - **Tweak directives using meta values**: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties
+ - **Tweak the script section based on meta values**: Customizing tool parameters per sample using meta information in the `script` section
```
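As an illustrative sketch of the second point, per-sample customization in the `script` block might look like this (the `character` key follows the samplesheet; the process name, output pattern, and `cowpy -c` flag are assumptions, not the training repo's exact code):

```groovy
process COWPY {
    input:
    tuple val(meta), path(phrase)

    output:
    tuple val(meta), path("cowpy-${phrase}")

    script:
    // Pick the ASCII-art character from the meta map, one per sample
    """
    cat ${phrase} | cowpy -c ${meta.character} > cowpy-${phrase}
    """
}
```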
Closes #520