Commit a1ea602

Update Hello Modules (#416)
* Updated intro and warmup section
* Updated workflow code and final files
* Tweaked the directory names for Config and Modules + relocated solutions (renamed both `scripts` and `intermediates` to `solutions` for more clarity)
1 parent 7bda265 commit a1ea602

61 files changed: 507 additions and 619 deletions

docs/hello_nextflow/01_orientation.md

Lines changed: 4 additions & 4 deletions
@@ -20,7 +20,7 @@ This directory contains all the code files, test data and accessory files you wi
2020
tree . -L 2
2121
```
2222

23-
You should see the following output:
23+
You should see the following output: **TODO: UPDATE**
2424

2525
```console title="Directory contents"
2626
/workspace/gitpod/hello-nextflow
@@ -35,7 +35,7 @@ You should see the following output:
3535
├── hello-nf-test.nf
3636
├── hello-world.nf
3737
├── nextflow.config
38-
└── scripts
38+
└── solutions
3939
├── hello-config-1.config
4040
├── hello-config-2.config
4141
├── hello-config-3.config
@@ -66,7 +66,7 @@ You should see the following output:
6666
```
6767

6868
**The `data` directory** contains the input data we'll use in Part 3: Hello Genomics, which uses an example from genomics to demonstrate how to build a simple analysis pipeline.
69-
The data are described in detail in that section of the course.
69+
The dataset is described in detail in that section of the course.
7070

7171
**The file `nextflow.config`** is a configuration file that sets minimal environment properties.
7272

@@ -77,4 +77,4 @@ In its initial state, it is NOT a functional workflow script.
7777

7878
**The remaining `.nf` files** are functional workflow scripts that serve as starting points for the corresponding parts of the course.
7979

80-
**The `scripts` directory** contains the completed workflow scripts that result from each step of the course. They are intended to be used as a reference to check your work and troubleshoot any issues. The name and number in the filename correspond to the step of the relevant part of the course. For example, the file `hello-world-4.nf` is the expected result of completing steps 1 through 4 of Part 1: Hello World.
80+
**The `solutions` directory** contains the completed workflow scripts and other files that you will generate in each part of the course. They are intended to be used as a reference to check your work and troubleshoot any issues. The name and number in the filename correspond to the step of the relevant part of the course. For example, the file `hello-world-4.nf` is the expected result of completing steps 1 through 4 of Part 1: Hello World.

docs/hello_nextflow/06_hello_config.md

Lines changed: 24 additions & 26 deletions
@@ -11,23 +11,22 @@ So far we've been working with a very loose structure, with just one workflow co
1111
However, we're now moving into the phase of this training series that is more focused on code development and maintenance practices.
1212

1313
As part of that, we're going to adopt a formal project structure.
14-
We're going to work inside a dedicated project directory called `projectC` (C for configuration), and we've renamed the workflow file `main.nf` to match the recommended Nextflow convention.
14+
We're going to work inside a dedicated project directory called `hello-config`, and we've renamed the workflow file `main.nf` to match the recommended Nextflow convention.
1515

16-
### 0.1. Explore the `projectC` directory
16+
### 0.1. Explore the `hello-config` directory
1717

18-
We want to launch the workflow from inside the `projectC` directory, so let's move into it now.
18+
We want to launch the workflow from inside the `hello-config` directory, so let's move into it now.
1919

2020
```bash
21-
cd projectC
21+
cd hello-config
2222
```
2323

2424
Let's take a look at the contents.
2525
You can use the file explorer or the terminal; here we're using the output of `tree` to display the top-level directory contents.
2626

2727
```console title="Directory contents"
28-
projectC
28+
hello-config
2929
├── demo-params.json
30-
├── intermediates
3130
├── main.nf
3231
└── nextflow.config
3332
```
@@ -56,14 +55,12 @@ projectC
5655
- **`demo-params.json`** is a parameter file intended for supplying parameter values to a workflow.
5756
We will use it in section 5 of this tutorial.
5857

59-
- **`intermediates/`** is a directory containing the intermediate forms of the workflow and configuration files for each section of this tutorial.
60-
6158
The one thing that's missing is a way to point to the original data without making a copy of it or updating the file paths wherever they're specified.
6259
The simplest solution is to link to the data location.
6360

6461
### 0.2. Create a symbolic link to the data
6562

66-
Run this command from inside the `projectC` directory:
63+
Run this command from inside the `hello-config` directory:
6764

6865
```bash
6966
ln -s ../data data
@@ -72,10 +69,10 @@ ln -s ../data data
7269
This creates a symbolic link called `data` pointing to the data directory, which allows us to avoid having to change anything to how the file paths are set up.
7370

7471
```console title="Directory contents"
75-
projectC
72+
hello-config
7673
├── data -> ../data
7774
├── demo-params.json
78-
├── intermediates
75+
├── solutions
7976
├── main.nf
8077
└── nextflow.config
8178
```
@@ -105,7 +102,7 @@ executor > local (7)
105102
[ee/2c7855] GATK_JOINTGENOTYPING [100%] 1 of 1 ✔
106103
```
107104

108-
There will now be a `work` directory and a `results_genomics` directory inside your current `projectC` directory.
105+
There will now be a `work` directory and a `results_genomics` directory inside your `hello-config` directory.
109106

110107
### Takeaway
111108

@@ -145,7 +142,7 @@ Let's see what happens if we run that.
145142

146143
### 1.2. Run the workflow without Docker
147144

148-
We are now launching the `main.nf` workflow from inside the `projectC` directory.
145+
We are now launching the `main.nf` workflow from inside the `hello-config` directory.
149146

150147
```bash
151148
nextflow run main.nf
@@ -156,7 +153,7 @@ As expected, the run fails with an error message that looks like this:
156153
```console title="Output"
157154
N E X T F L O W ~ version 24.02.0-edge
158155

159-
┃ Launching `projectC/main.nf` [silly_ramanujan] DSL2 - revision: 9129bc4618
156+
┃ Launching `hello-config/main.nf` [silly_ramanujan] DSL2 - revision: 9129bc4618
160157

161158
executor > local (3)
162159
[93/4417d0] SAMTOOLS_INDEX (1) [ 0%] 0 of 3
@@ -319,7 +316,7 @@ This will take a bit longer than usual the first time, and you might see the con
319316
[- ] SAMTOOLS_INDEX -
320317
[- ] GATK_HAPLOTYPECALLER -
321318
[- ] GATK_JOINTGENOTYPING -
322-
Creating env using conda: bioconda::samtools=1.20 [cache /workspace/gitpod/hello-nextflow/projectC/work/conda/env-6684ea23d69ceb1742019ff36904f612]
319+
Creating env using conda: bioconda::samtools=1.20 [cache /workspace/gitpod/hello-nextflow/hello-config/work/conda/env-6684ea23d69ceb1742019ff36904f612]
323320
```
324321

325322
That's because Nextflow has to retrieve the Conda packages and create the environment, which takes a bit of work behind the scenes. The good news is that you don't need to deal with any of it yourself!
@@ -401,7 +398,7 @@ Let's try running the workflow with Conda.
401398
nextflow run main.nf -profile conda_on
402399
```
403400

404-
It works!
401+
It works! Convenient, isn't it?
405402

406403
```
407404
N E X T F L O W ~ version 24.02.0-edge
@@ -491,7 +488,7 @@ nextflow
491488
ERROR ~ Error executing process > 'SAMTOOLS_INDEX (3)'
492489

493490
Caused by:
494-
java.io.IOException: Cannot run program "sbatch" (in directory "/workspace/gitpod/hello-nextflow/projectC/work/eb/2962ce167b3025a41ece6ce6d7efc2"): error=2, No such file or directory
491+
java.io.IOException: Cannot run program "sbatch" (in directory "/workspace/gitpod/hello-nextflow/hello-config/work/eb/2962ce167b3025a41ece6ce6d7efc2"): error=2, No such file or directory
495492

496493
Command executed:
497494

@@ -500,7 +497,7 @@ Command executed:
500497

501498
However, it did produce what we are looking for: the `.command.run` file that Nextflow tried to submit to Slurm via the `sbatch` command.
502499

503-
Let's take a look inside. **TODO: UPDATE NEXTFLOW VERSION SO WE CAN HAVE THIS SWEET OUTPUT**
500+
Let's take a look inside. <!-- **TODO: UPDATE NEXTFLOW VERSION SO WE CAN HAVE THIS SWEET OUTPUT** -->
504501

505502
```bash title=".command.run" linenums="1"
506503
#!/bin/bash
@@ -723,19 +720,20 @@ nextflow run main.nf -profile my_laptop -with-report report-config-1.html
723720
```
724721

725722
The report is an html file, which you can download and open in your browser.
723+
726724
Take a few minutes to look through the report and see if you can identify some opportunities for adjusting resources.
727725
Make sure to click on the tabs that show the utilization results as a percentage of what was allocated.
728726
There is some [documentation](https://www.nextflow.io/docs/latest/reports.html) describing all the available features.
729727

730-
**TODO: insert images**
728+
<!-- TODO: insert images -->
731729

732730
One observation is that the `GATK_JOINTGENOTYPING` seems to be very hungry for CPU, which makes sense since it performs a lot of complex calculations.
733731
So we could try boosting that and see if it cuts down on runtime.
734732

735733
However, we seem to have overshot the mark with the memory allocations; all processes are only using a fraction of what we're giving them.
736734
We should dial that back down and save some resources.
737735

738-
### 4.3. Adjust resource allocations for a specific process
736+
### 4.4. Adjust resource allocations for a specific process
739737

740738
We can specify resource allocations for a given process using the `withName` directive.
741739
The syntax looks like this when it's by itself in a process block:
@@ -765,7 +763,7 @@ process {
765763
With that specified, the default settings will apply to all processes **except** the `GATK_JOINTGENOTYPING` process, which is a special snowflake that gets a lot more CPU.
766764
Hopefully that should have an effect.
767765

768-
### 4.4. Run again with the modified configuration
766+
### 4.5. Run again with the modified configuration
769767

770768
Let's run the workflow again with the modified configuration and with the reporting flag turned on, but notice we're giving the report a different name so we can differentiate them.
771769

@@ -778,7 +776,7 @@ Once again, you probably won't notice a substantial difference in runtime, becau
778776
However, the second report shows that our resource utilization is more balanced now, and the runtime of the `GATK_JOINTGENOTYPING` process has been cut in half.
779777
We probably didn't need to go all the way to 8 CPUs, but since there's only one call to that process, it's not a huge drain.
780778

781-
**TODO: screenshots?**
779+
<!-- **TODO: screenshots?** -->
782780

783781
As you can see, this approach is useful when your processes have different resource requirements. It empowers you to right-size the resource allocations you set up for each process based on actual data, not guesswork.
784782

@@ -792,7 +790,7 @@ As you can see, this approach is useful when your processes have different resou
792790

793791
That being said, there may be some constraints on what you can (or must) allocate depending on what computing executor and compute infrastructure you're using. For example, your cluster may require you to stay within certain limits that don't apply when you're running elsewhere.
794792

795-
### 4.5. Add resource limits to an HPC profile
793+
### 4.6. Add resource limits to an HPC profile
796794

797795
You can use the `resourceLimits` directive to set the relevant limitations. The syntax looks like this when it's by itself in a process block:
798796

@@ -982,7 +980,7 @@ executor > local (7)
982980

983981
However, you may be thinking, well, did we really override the configuration? How would we know, since those were the same files?
984982

985-
### 5.6. Remove or generalize default values from `nextflow.config`
983+
### 5.5. Remove or generalize default values from `nextflow.config`
986984

987985
Let's strip out all the file paths from the `params` block in `nextflow.config`, replacing them with `null`, and replace the `cohort_name` value with something more generic.
988986

@@ -1029,7 +1027,7 @@ This is great because, with the parameter file in hand, we'll now be able to pro
10291027

10301028
That being said, it was nice to be able to demo the workflow without having to keep track of filenames and such. Let's see if we can use a profile to replicate that behavior.
10311029

1032-
### 5.7. Create a demo profile
1030+
### 5.6. Create a demo profile
10331031

10341032
Yes we can! We just need to retrieve the default parameter declarations as they were written in the original workflow (with the `params.*` syntax) and copy them into a new profile that we'll call `demo`.
10351033

@@ -1100,7 +1098,7 @@ profiles {
11001098

11011099
As long as we distribute the data bundle with the workflow code, this will enable anyone to quickly try out the workflow without having to supply their own inputs or pointing to the parameter file.
11021100

1103-
### 5.8. Run with the demo profile
1101+
### 5.7. Run with the demo profile
11041102

11051103
Let's try that out:
11061104
