PacificBiosciences
diff --git a/‎README.md
Lines changed: 76 additions & 20 deletions b/‎README.md
Lines changed: 76 additions & 20 deletions
diff --git a/‎img/ccs_only_scores.png
163 KB b/‎img/ccs_only_scores.png
163 KB
diff --git a/‎img/ccs_only_subread_scores.png
147 KB b/‎img/ccs_only_subread_scores.png
147 KB
diff --git a/‎img/subread_only_scores.png
168 KB b/‎img/subread_only_scores.png
168 KB
diff --git a/‎img/venn_diagramm_1.png
77.1 KB b/‎img/venn_diagramm_1.png
77.1 KB
diff --git a/‎img/venn_diagramm_2.png
75.6 KB b/‎img/venn_diagramm_2.png
75.6 KB
@@ -35,7 +35,7 @@ by PCR or ligation:
 In addition, there are three different barcode library designs.
 In order to describe a barcode library design, one can view it
 from a SMRTbell or read perspective.
-As *lima* supports raw subread and CCS read demultiplexing,
+As *lima* supports CLR subread and CCS read demultiplexing,
 the following terminology is based on the per (sub-)read view.
 
 <img src="img/barcode_overview.png" width="886px">
@@ -63,28 +63,28 @@ The sort order is defined by the barcode indices, lowest first.
 ## Features
 
 *Lima* offers the following features:
- * Process both, raw subreads and CCS reads
+ * Process both, CLR subreads and CCS reads
  * BAM in- and output
  * Extensive reports that allow in-depth quality control
  * Clip barcode sequences and annotate `bq` and `bc` tags
  * Agnostic of input barcode sequence orientation
  * Split output BAM files by barcode
- * No scraps.bam needed
  * Full PacBio dataset support
  * Peek into the first N ZMWs and get average barcode score
  * Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
  * Enhanced filtering options to remove ambiguous calls
+ * Double demux to remove PCR primers after barcode demultiplexing
 
 ## Latest Version
-Version **1.10.0**: [Full changelog here](#full-changelog)
+Version **1.11.0**: [Full changelog here](#full-changelog)
 
 ## Execution
 
 **Note:** Any existing output files will be overwritten after execution.
 
 **Note:** Always use `--peek-guess` to remove spurious barcode hits.
 
-Run on raw subread data:
+Run on CLR subread data:
 
     lima movie.subreads.bam barcodes.fasta prefix.bam
     lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
@@ -99,12 +99,12 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.
 
 ### *Symmetric* or *Tailed* options
 
-    Raw: --same
+    CLR: --same
     CCS: --same --ccs
 
 ### *Asymmetric* options
 
-    Raw: --different
+    CLR: --different
     CCS: --different --ccs
 
 ### Example execution
@@ -114,7 +114,7 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.
 
 
 ## Input data
-Input data is either raw unaligned subreads, straight from a Sequel, or
+Input data is either CLR unaligned subreads, straight from a Sequel I/II, or
 unaligned CCS reads, generated by [CCS](https://github.com/PacificBiosciences/ccs);
 both in the PacBio enhanced BAM format. If you want to demux RSII data, first
 use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json`
@@ -189,24 +189,24 @@ how ZMWs many are *same/different*, and how many reads have been filtered.
     Below min passes              : 0 (0%)
     Below min score lead          : 11656 (32%)
     Below min ref span            : 3124 (8%)
-    Without adapter               : 25094 (68%)
+    Without SMRTbell adapter      : 25094 (68%)
     With bad adapter              : 10349 (28%) <- Only with --bad-adapter-ratio
     Undesired hybrids             : xxx (xx%) <- Only with --peek-guess
-    Undesired same barcode pairs  : xxx (xx%) <- Only with --different
-    Undesired diff barcode pairs  : xxx (xx%) <- Only with --same
+    Undesired same pairs          : xxx (xx%) <- Only with --different
+    Undesired diff pairs          : xxx (xx%) <- Only with --same
     Undesired 5p--5p pairs        : xxx (xx%) <- Only with --isoseq
     Undesired 3p--3p pairs        : xxx (xx%) <- Only with --isoseq
     Undesired single side         : xxx (xx%) <- Only with --isoseq
     Undesired no hit              : xxx (xx%) <- Only with --isoseq
 
     ZMWs for (B):
-    With same barcode             : 162244 (92%)
-    With different barcodes       : 14112 (8%)
+    With same pair                : 162244 (92%)
+    With different pair           : 14112 (8%)
     Coefficient of correlation    : 32.79%
 
     ZMWs for (A):
-    Allow diff barcode pair       : 157264 (74%)
-    Allow same barcode pair       : 188026 (88%)
+    Allow diff pair               : 157264 (74%)
+    Allow same pair               : 188026 (88%)
     Bad adapter yield loss        : 10112 (5%) <- Only with --bad-adapter-ratio
     Bad adapter impurity          : 10348 (5%) <- Only without --bad-adapter-ratio
 
@@ -446,7 +446,7 @@ Only use reads flanked by adapters on both sides for barcode identification,
 full-pass reads.
 
 ### `--keep-idx-order`
-Per default, the two identified barcode idx are sorted ascending, as in raw data,
+Per default, the two identified barcode idx are sorted ascending, as in CLR data,
 the correct order cannot be determined. This affects the `bc` tag, `prefix.counts`
 file, and `--split-bam` file names; `prefix.report` columns `IdxLowest`,
 `IdxHighest`, `IdxLowestNamed`, `IdxHighestNamed` will have the same order as
@@ -455,7 +455,7 @@ such as CCS.
 
 If you are using an asymmetric barcode design with `NxN` pairs
 and your input is CCS, you can use `--keep-idx-order` to preserve
-the order. If your input is raw subreads and you use `NxN` asymmetric pairs,
+the order. If your input is CLR subreads and you use `NxN` asymmetric pairs,
 there is no way to distinguish between pairs `bc1001--bc1002` and `bc1002--bc1001`.
 
 ### `--per-read`
@@ -970,7 +970,7 @@ The score lead measures how close the best barcode call is to the second best.
 Possible solutions without seeing your data:
  * Is that sample actually barcoded?
  * Are your barcode sequences genetically too close for SMRT sequencing?
-   Try CCS2 calling first and demultiplex with `--ccs`.
+   Try CCS calling first and demultiplex with `--ccs`.
  * Are the synthesized products clean and not degenerate?
  * Did the sequencing run perform optimally, is the accuracy in the expected range?
  * Did you run lima twice, first on the original and then on the already
@@ -982,7 +982,7 @@ false positives.
 
 ### What is different in *lima* to *bam2bam*?
  * CCS read support
- * Barcodes of every adapter gets scored for raw subreads
+ * Barcodes of every adapter gets scored for CLR subreads
  * Does not enforce symmetric barcode pairing, which increases PPV
  * For asymmetric barcodes, `lima` can report the identified order, instead of
    ascending sorting
@@ -1013,10 +1013,66 @@ then your XML input contains BioSamples with different barcode names than the
 provided `barcode.fasta` file. Please check that you've used the correct
 barcodes. You can ignore barcodes specified in the XML with `--ignore-biosamples`.
 
+### CCS or demux first?
+Many people have been wondering, what is the recommended order for a multiplexed
+HiFi pool:
+  1) first ccs and then demux
+  2) first demux and then ccs
+
+#### Experiment
+Use 2k ecoli amplicons with barcoded overhang adapters, symmetric. Workflow steps:
+  1) Generate CCS
+  2) Demuxe subreads and whiteliste on CCS hole numbers
+  3) Demuxe CCS
+  4) Compare both sets of hole numbers
+
+#### Results
+Verbatim results for one chip:
+
+    Generated CCS reads : 274185
+    Demuxed CCS reads   : 269919 (98.44%)
+    Demuxed subreads    : 271068 (98.86%)
+
+Venn diagrams for two chips:
+
+<img src="img/venn_diagramm_1.png" width="400px">
+<img src="img/venn_diagramm_2.png" width="400px">
+
+Just based on those numbers, one would say, pick subread demuxing.
+Here comes the but. Demuxing subreads is very IO heavy and takes ~100x longer
+than demuxing CCS.
+For the sake of time to result and disk space,
+**perform CCS first and demux afterwards**.
+
+#### Discussion
+Q: Is there any systematic reason for reads that get correctly called by subread demux but not ccs or vice versa?
+
+Let's plot subread barcode scores, grouped by if they were only called in subreads (blue) or not (red)
+<img src="img/subread_only_scores.png" width="600px">
+
+Majority of what is subread output only is on the verge of being called at all.
+The problem with the current CCS draft stage is that it sometimes trims a few
+bases, which is generally not a big issue for demuxing, but if the barcode is
+molecularly damaged, too short or of low quality, a few missing bases lead to
+being uncallable.
+
+Vice versa, only called by CCS and not in subreads:
+<img src="img/ccs_only_scores.png" width="600px">
+
+Again something that is on the verge being called. The reason for the ~300 reads
+at 100 score, no idea so far. In general, this is 0.1% of the data.
+Let's investigate those ~300 calls and plot their subread demux barcode scores.
+<img src="img/ccs_only_subread_scores.png" width="600px">
+
+It's curious why they didn't get called, but for 0.1% not worth changing
+any parameters now, but worth future investigation.
 
 ## Full Changelog
 
- * **1.10.0**:
+ * **1.11.0**:
+   * Add barcode to read groups, use one barcode pair per RG
+   * Fix double demux, used to clip wrongly for the second round of demuxing
+ * 1.10.0:
    * Output N barcodes per subdirectory with `--files-per-directory N` and output splitting
    * BioSample awareness for XML input and split output and allow ignoring them with `--ignore-biosamples`
    * Increase `--window-size-mult` to `3` to allow longer spacers