Add working with files side quest #601

Merged
FriederikeHanssen merged 32 commits into intermediate_training from side_quest_working_with_files
Jul 11, 2025

@adamrtalbot (Collaborator) commented Apr 11, 2025

Adds a side quest for working with files, loosely based on the Metadata propagation section of the advanced training.

Story:

  • Create file object from string
  • Look at file object attributes
  • Extract sample metadata from filename
  • Use Channel.fromPath to create a channel of files
  • Extract sample metadata from filename within map operator
  • Use Channel.fromFilePairs to create a channel of file pairs
  • Use publishDir to save results

Should help introduce the concept of handling files better.
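
The "extract sample metadata from filename" step can be illustrated outside Nextflow. The following is a minimal Python sketch of the parsing logic only, not the training's Nextflow code: the helper name `parse_fastq_name` and the output field names are assumptions made for illustration, and the filename layout is taken from the example data listed in this PR.

```python
from pathlib import Path


def parse_fastq_name(path):
    """Split a name like sampleA_rep1_normal_R1_001.fastq.gz into metadata.

    Hypothetical helper: the field names are illustrative assumptions.
    """
    # Keep only the part of the file name before the first dot,
    # which drops the whole ".fastq.gz" suffix.
    name = Path(path).name.split(".", 1)[0]
    # The example data uses five underscore-separated tokens.
    sample_id, replicate, sample_type, read, _suffix = name.split("_")
    return {
        "id": sample_id,
        "replicate": int(replicate.replace("rep", "")),
        "type": sample_type,
        "read": read,
    }


meta = parse_fastq_name("data/sampleA_rep1_normal_R1_001.fastq.gz")
print(meta)
```

Keeping the parsing in one small function mirrors the idea of doing the same work inside a `map` operator: one file in, one metadata record out.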

Problems:

  • Doesn't ram home the "always use files as inputs!!!" message. We could do that in the final section on processes?

Notes

FASTQ files generated with:

while IFS= read -r name; do
    # Generate a random number between 1 and 10
    random_num=$((1 + RANDOM % 10))
    # Write a paired R1/R2 FASTQ for this sample name
    fq generate -n "${random_num}" --read-length 10 "data/${name}_R1_001.fastq.gz" "data/${name}_R2_001.fastq.gz"
done < scripts/samples.txt

Where samples.txt contains:

sampleA_rep1_normal
sampleA_rep1_tumor
sampleA_rep2_normal
sampleA_rep2_tumor
sampleB_rep1_normal
sampleB_rep1_tumor
sampleC_rep1_normal
sampleC_rep1_tumor
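
As a quick cross-check of the data set, the file names that the loop above should produce can be listed from the sample names. This is a small Python sketch under the assumption that the loop completes for every line of samples.txt; the helper `expected_fastqs` is hypothetical and not part of the PR.

```python
SAMPLES = [
    "sampleA_rep1_normal",
    "sampleA_rep1_tumor",
    "sampleA_rep2_normal",
    "sampleA_rep2_tumor",
    "sampleB_rep1_normal",
    "sampleB_rep1_tumor",
    "sampleC_rep1_normal",
    "sampleC_rep1_tumor",
]


def expected_fastqs(names, data_dir="data"):
    """Return the R1/R2 paths the generation loop writes for each name."""
    return [
        f"{data_dir}/{name}_R{read}_001.fastq.gz"
        for name in names
        for read in (1, 2)
    ]


files = expected_fastqs(SAMPLES)
print(len(files))
print(files[0])
```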

netlify bot commented Apr 11, 2025

Deploy Preview for nextflow-training ready!

Latest commit: de22980
Latest deploy log: https://app.netlify.com/sites/nextflow-training/deploys/6819fd3073f79600084979da
Deploy Preview: https://deploy-preview-601--nextflow-training.netlify.app
@adamrtalbot adamrtalbot requested a review from vdauwera April 11, 2025 15:28
@FriederikeHanssen FriederikeHanssen self-requested a review April 14, 2025 11:48
@FriederikeHanssen (Collaborator) left a comment

  • Needs to be added to the "Side Quests" sidebar and the "Menu of side quests".
  • IIRC, in the other trainings we started indenting code blocks by the amount needed in the target file, so readers don't have to re-indent after copying them.

Reviewed up to approximately the end of section 3.

@FriederikeHanssen (Collaborator) left a comment

I am mostly just thinking out loud here:
I am wondering how this will mix with any metadata module. I think at one point we discussed pulling that section out of the nf-core module and making it separate. It feels too small to be its own side quest, and I think it could fit in here quite well. But I also wouldn't want to complicate this module with extra material about hashmaps and keys.

If the flattening of the metadata is not hugely important, we could replace that with a map and collapse the content?


Wait, we have a problem. We have 2 replicates for sampleA, but only 1 output file! We are overwriting the output file each time.

### 5.4. Make the published files unique

Collaborator:

I suppose this is here to use something from the meta info as an example of how it can be used in the workflow. Should we do something similar to the nf-core module, or change the nf-core module (where we used branch to change the execution path)? That way we revisit the same concepts again.

Author (Collaborator):

That's not a bad idea. We could use filter (from splitting and grouping) instead of branch and use a closure with a filename to unique-ify the published file in config.
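
The overwrite problem and the proposed fix can be shown with a short Python sketch. It assumes, purely for illustration, that the published file name is built from the metadata; the `.txt` names and the helper functions are hypothetical and do not come from the training material.

```python
from collections import Counter

# Two replicates of the same sample, as in the example data.
metas = [
    {"id": "sampleA", "replicate": 1, "type": "normal"},
    {"id": "sampleA", "replicate": 2, "type": "normal"},
]


def name_by_id(meta):
    """Name built from the id only: replicates share one output name."""
    return f"{meta['id']}.txt"


def name_unique(meta):
    """Name built from id, replicate and type: one output per input."""
    return f"{meta['id']}_rep{meta['replicate']}_{meta['type']}.txt"


def collisions(names):
    """Return every name that would be written more than once."""
    return [name for name, count in Counter(names).items() if count > 1]


print(collisions([name_by_id(m) for m in metas]))
print(collisions([name_unique(m) for m in metas]))
```

Any output name derived from fewer fields than are needed to tell the inputs apart will collide in the same way, which is why the fix is to fold the replicate (or another distinguishing field) into the name.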


---

## Summary

Collaborator:

The indents of the learnings render weirdly.

@FriederikeHanssen (Collaborator) commented:

Copying the discussion from Slack here, so it won't get lost:

How about we rename the "Working with Files" chapter to "Working with Input Data" and cover both files and metadata explicitly? To bring things close to nf-core & bridge the gap to grouping/splitting, we could add a section in 3.5 on how to use a map instead of a list.
That would also remove a little complexity in the grouping chapters and focus more on just the data shuffling.

@adamrtalbot (Collaborator, Author) commented:

> I am mostly just thinking out loud here: I am wondering how this will mix with any metadata module. I think at one point we discussed pulling that section out of the nf-core module and making it separate. It feels too small to be its own side quest, and I think it could fit in here quite well. But I also wouldn't want to complicate this module with extra material about hashmaps and keys.
>
> If the flattening of the metadata is not hugely important, we could replace that with a map and collapse the content?

I've added a metamap here: 6998ddf

but I'm not sure about it, we might be introducing too much too early?

@adamrtalbot (Collaborator, Author) commented:

@FriederikeHanssen I've addressed most of your comments now - take a second look.

@FriederikeHanssen (Collaborator) left a comment

Love it!

  • Schedule wise, we should probably run this before nf4-science, since we use the file() objects there already.

Launching `main.nf` [infallible_swartz] DSL2 - revision: 7f4e68c0cb

[[id:sampleA, replicate:1, type:normal, readNum:R2], /workspaces/training/side-quests/working_with_files/data/sampleA_rep1_normal_R2_001.fastq.gz]
[[id:sampleA, replicate:1, type:normal, readNum:R1], /workspaces/training/side-quests/working_with_files/data/sampleA_rep1_normal_R1_001.fastq.gz]

Collaborator:

I like this addition. The only thing that could be confusing, I think, is that in 3.2 (Extracting Metadata from Filenames) we already had a map and a file, but then flattened it.

Therefore, it's easier if the input channel is flat instead of the nested structure we have here.

Maybe we need some justification for why the file is not part of the map.

Author (Collaborator):

In 3.2 it's a list (array) and a file, do you think that's different enough to be obvious?

[[sampleA, rep1, normal, R1, 001], /workspaces/training/side-quests/working_with_files/data/sampleA_rep1_normal_R1_001.fastq.gz]
[[sampleA, rep1, normal, R2, 001], /workspaces/training/side-quests/working_with_files/data/sampleA_rep1_normal_R2_001.fastq.gz]
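
The difference between the list form and the map form can be made concrete with a small Python sketch. It is only an analogy for the two Nextflow outputs shown in this thread; the key names are the ones printed in the map output (`id`, `replicate`, `type`, `readNum`), and the conversion itself is an illustrative assumption.

```python
# List form, as in section 3.2: values are identified by position only.
meta_list = ["sampleA", "rep1", "normal", "R1", "001"]

# Map form: the same values, each labelled with a key.
meta_map = {
    "id": meta_list[0],
    "replicate": int(meta_list[1].replace("rep", "")),
    "type": meta_list[2],
    "readNum": meta_list[3],
}

# Positional access requires remembering that index 2 is the sample type.
print(meta_list[2])
# Keyed access states what is being read.
print(meta_map["type"])
```

Both lookups return the same value; the map simply makes the meaning of each field explicit, which is the distinction a reader needs to notice between the two outputs.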

Author (Collaborator):

As for why the file isn't in the map - good point. I don't have a great answer for that other than 'because Nextflow says so'...hmmm not sure what to do.


Launching `main.nf` [prickly_stonebraker] DSL2 - revision: f62ab10a3f

[[id:sampleA, replicate:1, type:normal, readNum:R], [/workspaces/training/side-quests/working_with_files/data/sampleA_rep1_normal_R1_001.fastq.gz, /workspaces/training/side-quests/working_with_files/data/sampleA_rep1_normal_R2_001.fastq.gz]]

Collaborator:

I get the same results, but do we want readNum:R? I would have expected 1 or 2.

Author (Collaborator):

Ahhhhhhhhhhh fromFilePairs strips the contents of the {}...hmm we might need to rethink this.
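
The effect described here can be reproduced with a simplified Python model. This is not Nextflow's implementation of `fromFilePairs`, and the exact key Nextflow produces depends on the glob pattern; the sketch only assumes that the part of the name selected by the braces is dropped from the grouping key, which is enough to explain `readNum:R`. The helper names are hypothetical.

```python
import re
from collections import defaultdict


def pair_key(filename):
    """Simplified stand-in for a pairing key: drop the read digit.

    Assumption: the digit chosen by R{1,2} is removed along with the
    rest of the file name, leaving a key that ends in "_R".
    """
    return re.sub(r"_R[12]_001\.fastq\.gz$", "_R", filename)


def group_pairs(filenames):
    """Group file names that share the same pairing key."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[pair_key(name)].append(name)
    return dict(groups)


files = [
    "sampleA_rep1_normal_R2_001.fastq.gz",
    "sampleA_rep1_normal_R1_001.fastq.gz",
]
pairs = group_pairs(files)
key = next(iter(pairs))
# Parsing the key the same way as a single file name yields "R", not "R1".
read_num = key.split("_")[3]
print(key, read_num)
```

Because the key describes the pair rather than either file, a read number is not meaningful at this level, so dropping the field from the metadata is one possible way out.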


!!! note

We are calling our map '`meta`'. This is the first introduction of a concept called `metamap` which we will cover later!

Collaborator:

I am not sure what we want to cover beyond this tbh. The next step would maybe be adding fields via nf-schema, but I think that would be a training focused on input validation

@FriederikeHanssen FriederikeHanssen changed the base branch from master to intermediate_training June 5, 2025 11:06
@FriederikeHanssen FriederikeHanssen merged commit d8b576d into intermediate_training Jul 11, 2025
0 of 3 checks passed
@FriederikeHanssen FriederikeHanssen deleted the side_quest_working_with_files branch July 11, 2025 09:36
pinin4fjords added a commit that referenced this pull request Aug 13, 2025
* Add working with files side quest

* Metadata Training (#611)

* rough plan for metadata training

* solution for training

* update solution

* first half of the text

* add intro and second half of training

* add to side bar

* linting fixes

* linting fixes

* add sumary and key concepts

* linting. again.

* fix line highlights

* fix line highlights, formatting

* fix indents

* flesh out some of the explanations

* fix text

* Add working with files side quest (#601)

* Add working with files side quest

* fixups and make paths relative to codespaces

* Add real FASTQ data

* Reduce code blocks for clarity

* Typo in splitting and grouping path

* Remove splitting and grouping (wrong branch)

* Clarify the difference between strings and files in Nextflow

* Clarify files in Nextflow

* Correct summary sentence to reflect code, it was copy+pasted wrong

* Update docs/side_quests/working_with_files.md

Co-authored-by: Friederike Hanssen <friederike.hanssen@seqera.io>

* Clarify why we want to flatten tuples

* Simplify the map operation part

* Remove reference to Groovy methods

* Fix indentation

* Conver data to metamap to introduce metamaps early

* Refine summary points

* Use markdown numbering

* Add link to file properties documentation

* Explain multiple assignment better

* Update docs/side_quests/working_with_files.md

Co-authored-by: Maxime U Garcia <maxime.garcia@seqera.io>

* Add Channel.fromFilePairs docs as link to fromFilePairs section

* Added glob explanation as a note

* Added glob explanation as a note

* Use bullet points instead of numbers for key concepts of working with files

* working with files use before-after syntax correctly

* working with files add highlighting

* working with files highlighting fixup

* Fixups: Respond to review comments

---------

Co-authored-by: Friederike Hanssen <friederike.hanssen@seqera.io>
Co-authored-by: Maxime U Garcia <maxime.garcia@seqera.io>

* Intermediate training Reorganisation (Meta maps, working with files, splitting & grouping) (#636)

* Bump NXF_VER to latest stable (25.04.3) (#618)

* Bump NXF_VER to latest stable (25.04.3)

* Set devcontainer image from base:ubuntu to base:dev-ubuntu-24.04

---------

Co-authored-by: Marcel Ribeiro-Dantas <marcel@seqera.io>

* Hello Nextflow: Transcripts (#615)

* Add transcripts for Hello Nextflow videos
* Check headings: ignore transcript files

---------

Co-authored-by: Geraldine Van der Auwera <geraldinevdauwera@gmail.com>

* Small fixes to Hello Nextflow outputs (#620)

* Update outputs
* Update script to match training content

---------

Co-authored-by: Marcel Ribeiro-Dantas <marcel@seqera.io>

* Minor link fixes (#621)

* Minor link fixes
* Delete nf-develop mention

* Update French translation for the Home and Help pages (#622)

* Fix buildx error when image contains attestation manifest (#623)

Co-authored-by: Marcel Ribeiro-Dantas <marcel@seqera.io>

* Address #491 - Selecting a Nextflow version and Environment variables (#627)

* Address #491 - Selecting a Nextflow version and Environment variables

* removing Environment section

* Prettify orientation file

---------

Co-authored-by: Marcel Ribeiro-Dantas <marcel.ribeirodantas@seqera.io>

* expand intro to go from samplespecific data to the term meta data

* rephrase intro

* fix wording to romance

* remove ternary operator to focus on meta maps, instead highlight merging of maps

* remove ternary operator from learnings

* move map explanation up

* First pass of the Nextflow Run course (#626)

Abridged version of Hello Nextflow focused on running rather than developing.

 - Run basic operations (Hello World level)
 - Run pipelines (channels for multiple inputs, multi-step example, modules, containers)
 - Configuration

* remove filtering from the metamap training.

* remove join, filter, fix language, line numbers

* fix numbering

* add samplesheet parsing into starter file

* update solution with simplified training

* abbreviate title

* add trailing dot to make the linter happy

* very minor refactoring, simplifying the quests to how to get to a meta data map

* fix linting

* closing message

* ad solution

* use meta maps

* Update docs/side_quests/working_with_files.md

Co-authored-by: Adam Talbot <12817534+adamrtalbot@users.noreply.github.com>

* remove duplication from sidebar

* add working with files to side quest overview

* remove more duplications

* remove section 5.2/5.3, add solution

* linting

* prettier linting

* simplify if/else more

* formatting

* linting

* add meta data training as prereq

* Adding posthog (#640)

* Adding GTM

* Add custom GTM integration for MkDocs

- Create custom GTM analytics partial to avoid posthog-js dependency issues
- Maintain GTM integration while using Material for MkDocs supported approach
- Remove empty package-lock.json that was causing build confusion

* Updating just posthog

* Updating just posthog

* Fix linting error

---------

Co-authored-by: Marcel Ribeiro-Dantas <marcel@seqera.io>

* Add a proper intro to IDE features (#632)

* Add ide intro

* Correct heading number

* Fix numbering

* Add a bunch of images

* Prettier

* Add to index

* Crop some images for renamed file

* Minor fixups including:
 - Clarifying ctrl/cmd key usage
 - Fixing some language
 - Removing IDE from title

* Some more fixups and adjustments based on a run through

* fixup numbering

* Couple more feedback tweaks

* Resove extensions panel opening via icon

* typo

* Move Cmd admonition, add files icon, tidy up Cntrl/ Cmd

* Update syntax showcase image

* Fix operators autocomplete image

* Fix task autocomplete

* Fix config autocomplete

* correct link following

* Better resolve linking / process inspecting

* Misc fixes

* Fix script

* Text improvement

* More minor text fixes

* Add source control screenshot

* Strip AI coverage in IDE section due to plugin bug

* Clarify terminal shortcut

* Add note on structure

* Fix numbering

* Update docs/side_quests/ide_features.md

---------

Co-authored-by: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com>
Co-authored-by: Friederike Hanssen <friederike.hanssen@seqera.io>

---------

Co-authored-by: Geraldine Van der Auwera <geraldinevdauwera@gmail.com>
Co-authored-by: Marcel Ribeiro-Dantas <marcel@seqera.io>
Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Co-authored-by: Marcel Ribeiro-Dantas <mribeirodantas@seqera.io>
Co-authored-by: Kristina Gagalova <kristina.gagalova@curtin.edu.au>
Co-authored-by: Marcel Ribeiro-Dantas <marcel.ribeirodantas@seqera.io>
Co-authored-by: Adam Talbot <12817534+adamrtalbot@users.noreply.github.com>
Co-authored-by: mavi-sqr <marta.vidal@seqera.io>
Co-authored-by: Jonathan Manning <jonathan.manning@seqera.io>

* docs(side_quests): add remote file usage section to working with files guide

- Add comprehensive documentation on using remote files with URIs
- Include examples for HTTP, S3, Azure, and Google Cloud Storage protocols
- Demonstrate seamless switching between local and remote data
- Update summary and key concepts sections with remote file examples
- Show how Nextflow automatically handles file staging and caching

* refactor(docs): clean up side quests navigation and remove duplicate files

- Remove duplicate navigation entries for nf-test.md and nf-core.md
- Delete outdated splitting-and-grouping.md file
- Reorder side quests navigation for better organization
- Fix navigation structure to eliminate redundancy

* Apply suggestions from code review

Co-authored-by: Jonathan Manning <jonathan.manning@seqera.io>

* replace sample with file or data in the main text, use consistent spelling for datasheet and metadata

* fix indentation & format solution. Add fix for changed if clause to adhere to language server

* add in problem statement

* adding @pinin4fjords suggestions on emphasis

* add map example

* replace attributes/properties with metadata

* add publishDir right away

* add in reference to splitting and grouping suggested by @pinin4fjords

* tweak wording suggested by @pinin4fjords

* Update docs/side_quests/metadata.md

Co-authored-by: Jonathan Manning <jonathan.manning@seqera.io>

* add a follow on sentence on the balance between meta maps and explicit inputs

* fix linting

* sample -> patient

* Promote remote files coverage

* Add object type coverage

* Do better with strings vs Path objects

* link out to path docs

* Add example to illustrate file handling in a process

* Formatting fix

* Update new example

* Formatting fixes

* Formatting fixes

* update title

* Correct highlights

* Remove duplicate content

* Explain debug

* fix path

* More formatting fixes

* More formatting fixes

* Cover string class

* main.nf -> file_operations.nf

* main.nf -> file_operations.nf

* Fix link

* Continuity fixes

* Apply minor suggestions from code review

* Link to 'working with metadata'

* Fix title

* Fix takeaways

* Remove lingering 'sample' references

* Update learning outcomes

* First batch of edits from final review

* Minor fixes, and we need pipefail

* fix nextflow versions

* Remove messy bolds

* Declare files inside workflow block

* attempt indent fixes

* More fixes

* Files fixes including summary

* Prettier

* Initial cleanup of splitting/ grouping

* syntax fixes

* Convert to file object and link out

* Fix csv name and nextflow versions

* Intro tweaks

* misc fixes

* Add map explanation

* More SnG tweaks

* More SnG tweaks

* More SnG tweaks

* More SnG tweaks

* Fix line nums

* Fix line nums

* tumour -> tumor

* Fix line nums

* More SnG tweaks

* Misc fixes

* Misc fixes

* More fixes

* Fix up takeaways

* AI-assisted prose cleanup

* Fix up code block titles

* Tweak summary

* Smooth transitions

* Update solution

* Prettier

---------

Co-authored-by: Friederike Hanssen <friederike.hanssen@seqera.io>
Co-authored-by: Maxime U Garcia <maxime.garcia@seqera.io>
Co-authored-by: Geraldine Van der Auwera <geraldinevdauwera@gmail.com>
Co-authored-by: Marcel Ribeiro-Dantas <marcel@seqera.io>
Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Co-authored-by: Marcel Ribeiro-Dantas <mribeirodantas@seqera.io>
Co-authored-by: Kristina Gagalova <kristina.gagalova@curtin.edu.au>
Co-authored-by: Marcel Ribeiro-Dantas <marcel.ribeirodantas@seqera.io>
Co-authored-by: mavi-sqr <marta.vidal@seqera.io>
Co-authored-by: Jonathan Manning <jonathan.manning@seqera.io>
5 participants