Version 2.5.0

armintoepfer · armintoepfer · commit 7ec72896533d · 2022-02-23T13:09:50.000+01:00
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -6,7 +6,17 @@ nav_order: 99
 
 # Version changelog
 
- * **2.4.0**:
+ * **2.5.0**:
+   * Upcoming SMRT Link release
+   * Add [`lima-undo` functionality](/faq/undo)
+   * Support methylation tag clipping
+   * Add progress and ETA for `--log-level INFO`
+   * Rename `--preset` to [`--hifi-preset`](/faq/hifi-presets)
+   * Add barcoded adapter `--hifi-preset SYMMETRIC-ADAPTERS`
+   * Fixes to support stranded HiFi BAM input
+   * Do not abort on empty input, but warn only
+
+ * 2.4.0:
    * Fix fasta/q input and `--guess`
    * Output empty files for missing barcode pairs `--output-missing-pairs`
    * Output each barcode into its own sub-directory `--split-subdirs`
diff --git a/docs/faq/Speed.md b/docs/faq/Speed.md
@@ -5,22 +5,19 @@ title: Speed
 ---
 
 ## How fast is fast?
-Example: 200 barcodes, asymmetric mode (try each barcode forward and
-reverse-complement), 300,000 CCS reads. On my 2014 iMac with 4 cores + HT:
+Example: 64 barcodes / asymmetric mode / 1.9M HiFi reads on a dual 64c EPYC system:
 
-    503.57s user 11.74s system 725% cpu 1:11.01 total
+    Processed : 1912155
+    Throughput: 2393135/min
+    Run Time  : 48s 306ms
+    CPU Time  : 2h 14m
 
-Those 1:11 minutes translate into 0.233 milliseconds per ZMW,
-1.16 microseconds per barcode for both sides aligning forward and reverse-complement,
-and 291 nanoseconds per alignment. This includes IO.
+That's 2.4M HiFi reads processed per minute on 128 physical CPU cores, including
+IO.
 
-## Why doesn't *lima* utilize the maximum number of provided cores?
-This might be a simple IO bottleneck. With a barcode.fasta containing only a few
-barcodes, most of the time is spent reading and writing BAM files, as the barcode
-identification is too fast. Starting version 2.2.0, you can enable multi-threaded
-BAM reading by setting the number of threads via an environment variable
+## Is there a way to show the progress?
+Yes, please use `--log-level INFO`. If there is a `.pbi` file present, the
+estimated time will be shown. Otherwise, it will show progress as number of
+reads every 5 seconds.
 
-    export PB_BAMREADER_THREADS=2
 
-## Is there a way to show the progress?
-No. Please run `wc -l prefix.report` to get the number of already processed ZMWs.
diff --git a/docs/faq/barcoded-adapter.md b/docs/faq/barcoded-adapter.md
@@ -0,0 +1,20 @@
+---
+layout: default
+parent: FAQ
+title: Barcoded Adapter
+---
+
+## Barcoded Adapter
+The most convenient way to barcode a sample is the use of barcoded adapters, as
+depicted in the [barcode design overview](barcode-design). One minor
+disadvantage is that the ligation might not be as efficient as with standard
+SMRTbell adapters, leaving some molecules with only one adapter. As barcoded
+adapter designs are inherently symmetric, we implemented ways to recover the
+demultiplexed yield from one-sided barcoded molecules with ease.
+
+As the first step, generate HiFi data with *ccs* v6.3.0 or later. This version
+will store [additional tags per
+record](https://ccs.how/faq/missing-adapters.html), indicating if the molecule
+has missing adapters on either side. The second step is to use the new
+`--hifi-preset SYMMETRIC-ADAPTERS` introduced with *lima* v2.5.0, [described
+here](/faq/hifi-presets). That's it.
diff --git a/docs/faq/biosample.md b/docs/faq/biosample.md
@@ -22,3 +22,18 @@ relevant. Example:
 Provide this CSV to lima via `--biosample-csv input.csv`.
 
 This will associate the bio sample name to the read group using the `SM` tag.
+
+## UUID passthrough
+Since *lima* v2.5.0, the functionality has been enhanced to allow specifying
+UUIDs for the resulting XML files; for this, use `--reuse-uuids` in addition to
+the extended CSV for `--biosample-csv`. Example:
+
+    Barcodes,UUID,Bio Sample
+    bc1001--bc1001,11111111-1111-1aaa-0111-111111111111,Alfred
+    bc1002--bc1002,22222222-2222-2bbb-8222-222222222222,Berthold
+    bc1003--bc1003,33333333-3333-3ccc-9222-333333333333,Constantin
+    bc1008--bc1008,e04f12c9-7b2e-45fd-ab49-1bc2f75d653a,Holger
+
+Ensure that each UUID matches the regex:
+
+    [0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[089ab][0-9a-f]{3}-[0-9a-f]{12}
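The UUID check added to `docs/faq/biosample.md` above can be tried out before running *lima*. The following is a minimal Python sketch, not part of *lima*: the helper name `invalid_uuids` is hypothetical, and it assumes that the whole UUID field must match the pattern and that the CSV uses the `Barcodes` and `UUID` column headers shown in the example.

```python
import csv
import io
import re

# Pattern copied verbatim from the documentation above (lowercase hex only).
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[089ab][0-9a-f]{3}-[0-9a-f]{12}"
)


def invalid_uuids(csv_text):
    """Return (barcode pair, UUID) tuples whose UUID does not fully match.

    Hypothetical helper for illustration; assumes a full-string match is required.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["Barcodes"], row["UUID"])
        for row in reader
        if UUID_RE.fullmatch(row["UUID"]) is None
    ]


# The example CSV from the documentation above.
EXAMPLE = """\
Barcodes,UUID,Bio Sample
bc1001--bc1001,11111111-1111-1aaa-0111-111111111111,Alfred
bc1002--bc1002,22222222-2222-2bbb-8222-222222222222,Berthold
bc1003--bc1003,33333333-3333-3ccc-9222-333333333333,Constantin
bc1008--bc1008,e04f12c9-7b2e-45fd-ab49-1bc2f75d653a,Holger
"""

if __name__ == "__main__":
    # Prints the rows that would need fixing; an empty list means all UUIDs match.
    print(invalid_uuids(EXAMPLE))
```

To check a real file, pass its contents, for example `invalid_uuids(open("input.csv").read())`.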
diff --git a/docs/faq/hifi-presets.md b/docs/faq/hifi-presets.md
@@ -0,0 +1,22 @@
+---
+layout: default
+parent: FAQ
+title: HiFi Presets
+---
+
+## HiFi presets
+With v2.5.0 we introduced the concept of recommended parameter presets called
+`--hifi-preset`. All presets use
+
+    --ccs --min-score 80 --min-end-score 50 --min-ref-span 0.75
+
+In addition, they differ as follows:
+
+|        Preset        |              Definition               |
+| -------------------- | ------------------------------------- |
+| `SYMMETRIC`          | `--same`                              |
+| `SYMMETRIC-ADAPTERS` | `--same --ignore-missing-adapters`    |
+| `ASYMMETRIC`         | `--different --min-scoring-regions 2` |
+
+For barcoded adapter libraries, `SYMMETRIC-ADAPTERS` will increase demultiplexed
+yield. More info under [barcoded adapter FAQ](/faq/barcoded-adapter).
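To make the preset definitions in `docs/faq/hifi-presets.md` concrete, here is a minimal Python sketch that expands a preset name into the flags listed above. It only mirrors the documented table; `expand_preset` is a hypothetical helper and says nothing about how *lima* implements presets internally.

```python
# Flags shared by all presets, as listed in the documentation above.
COMMON_FLAGS = [
    "--ccs",
    "--min-score", "80",
    "--min-end-score", "50",
    "--min-ref-span", "0.75",
]

# Preset-specific flags, copied from the table above.
PRESET_FLAGS = {
    "SYMMETRIC": ["--same"],
    "SYMMETRIC-ADAPTERS": ["--same", "--ignore-missing-adapters"],
    "ASYMMETRIC": ["--different", "--min-scoring-regions", "2"],
}


def expand_preset(name):
    """Hypothetical helper: return the documented flag list for a preset name."""
    if name not in PRESET_FLAGS:
        raise ValueError(f"unknown preset: {name!r}")
    return COMMON_FLAGS + PRESET_FLAGS[name]


if __name__ == "__main__":
    # Show the documented expansion of every preset.
    for preset in PRESET_FLAGS:
        print(preset, "=>", " ".join(expand_preset(preset)))
```

Keeping the shared and preset-specific flags in separate structures makes it easy to see that the three presets differ only in their last few options.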
diff --git a/docs/faq/how-to-run.md b/docs/faq/how-to-run.md
@@ -20,18 +20,15 @@ Run on CCS / HiFi data:
     $ lima <movie>.ccs.bam <barcodes>.fasta <demux>.bam
     $ lima <movie>.consensusreadset.xml <barcodes>.barcodeset.xml <demux>.consensusreadset.xml
 
-If you do not need to import the demultiplexed data into SMRT Link, it is advised
-to use `--no-pbi`, omit the pbi index file, to minimize time to result.
-
 ### *Symmetric* or *Tailed* options
 
     CLR: --same
-    CCS: --same --ccs
+    CCS: --hifi-preset SYMMETRIC
 
 ### *Asymmetric* options
 
     CLR: --different
-    CCS: --different --ccs
+    CCS: --hifi-preset ASYMMETRIC
 
 ### Example execution
 
diff --git a/docs/faq/primer.md b/docs/faq/primer.md
@@ -5,4 +5,5 @@ title: Primer removal
 ---
 
 ## Can I remove PCR primers after demultiplexing?
-Yes! After demultiplexing, just lima on the output again with your PCR primer(s).
+Yes! After demultiplexing, just call *lima* on the output again with your PCR
+primer(s).
diff --git a/docs/faq/split-output.md b/docs/faq/split-output.md
@@ -9,7 +9,7 @@ You can either iterate over the `prefix.bam` file N times or use
 `--split-bam`. Each barcode has its own BAM file called
 `prefix.idxBest--idxCombined.bam`, e.g., `prefix.0--0.bam`.
 
-The optional parameter `--split-bam-named`, names the files by their barcode names instead
+The optional parameter `--split-named` names the files by their barcode names instead
 of their barcode indices. Non-word characters, anything except [A-Za-z0-9_],
 in barcode names are replaced with an underscore in the file name.
 
@@ -26,3 +26,11 @@ sequence is barcode `0` and the last barcode `numBarcodes - 1`.
 If you use output BAM splitting, it can happen that you get a lot of output files.
 Using `--files-per-directory N` creates subdirectories and outputs at most `N`
 barcodes per directory.
+
+## Split barcodes into own sub-directories
+Since v2.5.0 each barcode can be stored in its own sub-directory: `--split-subdirs`.
+A parent XML will point to all of the barcoded files.
+
+## Output missing barcodes
+If you have provided bio samples with barcode pairs, option `--output-missing-pairs`
+creates empty barcode files in all split modes.
diff --git a/docs/faq/undo.md b/docs/faq/undo.md
@@ -0,0 +1,43 @@
+---
+layout: default
+parent: FAQ
+title: Undo
+---
+
+## Undo demultiplexing
+With the introduction of *lima* v2.5.0, it is possible to undo all
+demultiplexing steps for **HiFi data**. For this, the bioconda package contains a
+new `lima-undo` binary.
+
+Example:
+
+    lima movie.hifi_reads.bam demux.consensusreadset.xml --hifi-preset SYMMETRIC --store-unbarcoded
+    lima-undo demux.consensusreadset.xml undo.bam
+
+Let's unroll what's happening. In the first line, we explicitly request to store
+the unbarcoded reads. Without this, we would not be able to recover unbarcoded
+reads. The `XML` contains all the file paths to the `BAM` files. The second call is
+to the new *lima-undo* binary that takes an `XML` or `BAM` file as input and
+output.
+
+Optionally, you can also provide multiple input `BAM` files with one output `BAM`:
+
+    lima-undo demux.bam demux.unbarcoded.bam undo.bam
+
+This also works with split BAM files:
+
+    lima-undo demux.bc1001--bc1001.bam demux.bc1002--bc1002.bam demux.unbarcoded.bam undo.bam
+
+## How does it work?
+*lima* v2.5.0 and later stores everything that got clipped in an internal binary
+structure in the `ls` tag. Multiple demultiplexing rounds are supported. Once
+*lima-undo* gets called, for each read the individual demultiplexing steps get
+reverted until the read is identical to the original HiFi read.
+
+## How can I check if undo results are correct?
+Sort the undo output by the `zm` tag and diff it against the original HiFi reads:
+
+    samtools sort --no-PG -t "zm" undo.bam -o sorted.undo.bam
+    samtools view --no-PG sorted.undo.bam > undo.sam
+    samtools view --no-PG movie.hifi_reads.bam > original.sam
+    diff original.sam undo.sam
diff --git a/docs/get-started.md b/docs/get-started.md
@@ -73,11 +73,11 @@ For CCS / HiFi data, use following compatibility matrix:
 
 HiFi run from *BAM* with **symmetric** barcodes:
 
-    lima <movie>.hifi_reads.bam barcodes.fasta <movie>.demux.bam --same --ccs --min-score 80
+    lima <movie>.hifi_reads.bam barcodes.fasta <movie>.demux.bam --hifi-preset SYMMETRIC
 
 HiFi run from *FASTQ* with **asymmetric** barcodes:
 
-    lima <movie>.hifi_reads.fq.gz barcodes.fasta <movie>.demux.fastq --different --ccs --min-score 80
+    lima <movie>.hifi_reads.fq.gz barcodes.fasta <movie>.demux.fastq --hifi-preset ASYMMETRIC
 
 CLR run from *XML* with **symmetric** barcodes:
 
diff --git a/docs/img/lima_card_2022.png b/docs/img/lima_card_2022.png
diff --git a/docs/index.md b/docs/index.md
@@ -7,7 +7,7 @@ permalink: /
 ---
 
 <p align="center">
-  <img src="img/lima_card.png" alt="lima logo" width="650px"/>
+  <img src="img/lima_card_2022.png" alt="lima logo" width="650px"/>
 </p>
 
 ***
@@ -23,11 +23,11 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie
 for information on Installation, Support, License, Copyright, and Disclaimer.
 
 ## Latest Version
-Version **2.4.0**: [Full changelog here](/changelog)
+Version **2.5.0**: [Full changelog here](/changelog)
 
-## What's new!
-New documentation is up, a 1:1 port from the original GitHub docs with minor
-enhancements. Expect major enhancements in upcoming releases.
+## What's new
+ * Recommended parameters via [`--hifi-preset`](/faq/hifi-presets)
+ * Undo demultiplexing via [`lima-undo`](/faq/undo)
 
 ## Get started
 If you are new to demultiplexing barcoded samples, check out the [Get Started guide](/get-started).
diff --git a/docs/output/removed.md b/docs/output/removed.md
@@ -1,9 +1,9 @@
 ---
 layout: default
 parent: Output files
-title: Removed
+title: Unbarcoded
 ---
 
-## Removed records
-Using the option `--dump-removed`, records that did not pass provided thresholds
+## Unbarcoded records
+Using the option `--store-unbarcoded`, records that did not pass provided thresholds
 or are without barcodes, are stored in the file `prefix.removed.bam`.