Commit d49e93f

Version 1.11.0
1 parent 1aa407e commit d49e93f

6 files changed

+76
-20
lines changed

README.md

Lines changed: 76 additions & 20 deletions
@@ -35,7 +35,7 @@ by PCR or ligation:
3535
In addition, there are three different barcode library designs.
3636
In order to describe a barcode library design, one can view it
3737
from a SMRTbell or read perspective.
38-
As *lima* supports raw subread and CCS read demultiplexing,
38+
As *lima* supports CLR subread and CCS read demultiplexing,
3939
the following terminology is based on the per (sub-)read view.
4040

4141
<img src="img/barcode_overview.png" width="886px">
@@ -63,28 +63,28 @@ The sort order is defined by the barcode indices, lowest first.
6363
## Features
6464

6565
*Lima* offers the following features:
66-
* Process both, raw subreads and CCS reads
66+
* Process both CLR subreads and CCS reads
6767
* BAM in- and output
6868
* Extensive reports that allow in-depth quality control
6969
* Clip barcode sequences and annotate `bq` and `bc` tags
7070
* Agnostic of input barcode sequence orientation
7171
* Split output BAM files by barcode
72-
* No scraps.bam needed
7372
* Full PacBio dataset support
7473
* Peek into the first N ZMWs and get average barcode score
7574
* Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
7675
* Enhanced filtering options to remove ambiguous calls
76+
* Double demux to remove PCR primers after barcode demultiplexing
7777

7878
## Latest Version
79-
Version **1.10.0**: [Full changelog here](#full-changelog)
79+
Version **1.11.0**: [Full changelog here](#full-changelog)
8080

8181
## Execution
8282

8383
**Note:** Any existing output files will be overwritten after execution.
8484

8585
**Note:** Always use `--peek-guess` to remove spurious barcode hits.
8686

87-
Run on raw subread data:
87+
Run on CLR subread data:
8888

8989
lima movie.subreads.bam barcodes.fasta prefix.bam
9090
lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
@@ -99,12 +99,12 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.
9999

100100
### *Symmetric* or *Tailed* options
101101

102-
Raw: --same
102+
CLR: --same
103103
CCS: --same --ccs
104104

105105
### *Asymmetric* options
106106

107-
Raw: --different
107+
CLR: --different
108108
CCS: --different --ccs
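
The option combinations above can be assembled programmatically. The following is a minimal sketch, not part of *lima* itself: `build_lima_command` is a hypothetical helper, and it assumes that options may follow the three positional arguments. It only uses flags that appear in this README (`--same`, `--different`, `--ccs`, `--peek-guess`).

```python
from typing import List


def build_lima_command(
    reads: str,
    barcodes: str,
    output: str,
    design: str,
    ccs: bool = False,
    peek_guess: bool = True,
) -> List[str]:
    """Assemble a lima argument list (hypothetical helper, not a lima API).

    design: "same" for symmetric/tailed libraries, "different" for asymmetric.
    ccs: True for CCS input, False for CLR subreads.
    """
    if design not in ("same", "different"):
        raise ValueError("design must be 'same' or 'different'")
    cmd = ["lima", reads, barcodes, output, f"--{design}"]
    if ccs:
        cmd.append("--ccs")
    if peek_guess:
        # The README recommends always using --peek-guess.
        cmd.append("--peek-guess")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) only on a machine where lima is installed.
    print(" ".join(build_lima_command(
        "movie.ccs.bam", "barcodes.fasta", "prefix.bam", "different", ccs=True)))
```

Building the command as a list rather than a single string avoids shell quoting problems if file names contain spaces.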
109109

110110
### Example execution
@@ -114,7 +114,7 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.
114114

115115

116116
## Input data
117-
Input data is either raw unaligned subreads, straight from a Sequel, or
117+
Input data is either CLR unaligned subreads, straight from a Sequel I/II, or
118118
unaligned CCS reads, generated by [CCS](https://github.com/PacificBiosciences/ccs);
119119
both in the PacBio enhanced BAM format. If you want to demux RSII data, first
120120
use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json`
@@ -189,24 +189,24 @@ how ZMWs many are *same/different*, and how many reads have been filtered.
189189
Below min passes : 0 (0%)
190190
Below min score lead : 11656 (32%)
191191
Below min ref span : 3124 (8%)
192-
Without adapter : 25094 (68%)
192+
Without SMRTbell adapter : 25094 (68%)
193193
With bad adapter : 10349 (28%) <- Only with --bad-adapter-ratio
194194
Undesired hybrids : xxx (xx%) <- Only with --peek-guess
195-
Undesired same barcode pairs : xxx (xx%) <- Only with --different
196-
Undesired diff barcode pairs : xxx (xx%) <- Only with --same
195+
Undesired same pairs : xxx (xx%) <- Only with --different
196+
Undesired diff pairs : xxx (xx%) <- Only with --same
197197
Undesired 5p--5p pairs : xxx (xx%) <- Only with --isoseq
198198
Undesired 3p--3p pairs : xxx (xx%) <- Only with --isoseq
199199
Undesired single side : xxx (xx%) <- Only with --isoseq
200200
Undesired no hit : xxx (xx%) <- Only with --isoseq
201201

202202
ZMWs for (B):
203-
With same barcode : 162244 (92%)
204-
With different barcodes : 14112 (8%)
203+
With same pair : 162244 (92%)
204+
With different pair : 14112 (8%)
205205
Coefficient of correlation : 32.79%
206206

207207
ZMWs for (A):
208-
Allow diff barcode pair : 157264 (74%)
209-
Allow same barcode pair : 188026 (88%)
208+
Allow diff pair : 157264 (74%)
209+
Allow same pair : 188026 (88%)
210210
Bad adapter yield loss : 10112 (5%) <- Only with --bad-adapter-ratio
211211
Bad adapter impurity : 10348 (5%) <- Only without --bad-adapter-ratio
212212

@@ -446,7 +446,7 @@ Only use reads flanked by adapters on both sides for barcode identification,
446446
full-pass reads.
447447

448448
### `--keep-idx-order`
449-
Per default, the two identified barcode idx are sorted ascending, as in raw data,
449+
By default, the two identified barcode indices are sorted in ascending order because, in CLR data,
450450
the correct order cannot be determined. This affects the `bc` tag, `prefix.counts`
451451
file, and `--split-bam` file names; `prefix.report` columns `IdxLowest`,
452452
`IdxHighest`, `IdxLowestNamed`, `IdxHighestNamed` will have the same order as
@@ -455,7 +455,7 @@ such as CCS.
455455

456456
If you are using an asymmetric barcode design with `NxN` pairs
457457
and your input is CCS, you can use `--keep-idx-order` to preserve
458-
the order. If your input is raw subreads and you use `NxN` asymmetric pairs,
458+
the order. If your input is CLR subreads and you use `NxN` asymmetric pairs,
459459
there is no way to distinguish between pairs `bc1001--bc1002` and `bc1002--bc1001`.
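
The effect of the default ordering can be illustrated with a small sketch. This is not *lima*'s implementation; `reported_pair` is a hypothetical function, and the indices are made-up stand-ins for two barcodes.

```python
from typing import Tuple


def reported_pair(first_idx: int, second_idx: int,
                  keep_idx_order: bool = False) -> Tuple[int, int]:
    """Return the barcode index pair as it would be reported.

    Illustrative only: by default the pair is sorted ascending, so the
    observed order is lost; with keep_idx_order the observed order is kept.
    """
    if keep_idx_order:
        return (first_idx, second_idx)
    low, high = sorted((first_idx, second_idx))
    return (low, high)


if __name__ == "__main__":
    # Hypothetical indices 0 and 1 standing in for bc1001 and bc1002.
    print(reported_pair(1, 0))                       # default: sorted
    print(reported_pair(1, 0, keep_idx_order=True))  # observed order kept
```

With the default, the pairs `(0, 1)` and `(1, 0)` collapse into the same label, which is why the two orientations cannot be told apart without `--keep-idx-order` on CCS input.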
460460

461461
### `--per-read`
@@ -970,7 +970,7 @@ The score lead measures how close the best barcode call is to the second best.
970970
Possible solutions without seeing your data:
971971
* Is that sample actually barcoded?
972972
* Are your barcode sequences genetically too close for SMRT sequencing?
973-
Try CCS2 calling first and demultiplex with `--ccs`.
973+
Try CCS calling first and demultiplex with `--ccs`.
974974
* Are the synthesized products clean and not degenerate?
975975
* Did the sequencing run perform optimally, is the accuracy in the expected range?
976976
* Did you run lima twice, first on the original and then on the already
@@ -982,7 +982,7 @@ false positives.
982982

983983
### What is different in *lima* to *bam2bam*?
984984
* CCS read support
985-
* Barcodes of every adapter gets scored for raw subreads
985+
* Barcodes of every adapter get scored for CLR subreads
986986
* Does not enforce symmetric barcode pairing, which increases PPV
987987
* For asymmetric barcodes, `lima` can report the identified order, instead of
988988
ascending sorting
@@ -1013,10 +1013,66 @@ then your XML input contains BioSamples with different barcode names than the
10131013
provided `barcode.fasta` file. Please check that you've used the correct
10141014
barcodes. You can ignore barcodes specified in the XML with `--ignore-biosamples`.
10151015

1016+
### CCS or demux first?
1017+
Many people have asked what the recommended order is for a multiplexed
1018+
HiFi pool:
1019+
1) first ccs and then demux
1020+
2) first demux and then ccs
1021+
1022+
#### Experiment
1023+
Use 2 kb *E. coli* amplicons with symmetric barcoded overhang adapters. Workflow steps:
1024+
1) Generate CCS
1025+
2) Demux subreads and whitelist on CCS hole numbers
1026+
3) Demux CCS
1027+
4) Compare both sets of hole numbers
1028+
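
The final comparison step can be sketched with plain set operations. This is a minimal illustration, not the script used for the experiment; `compare_hole_numbers` is a hypothetical helper and the hole numbers in the example are made up.

```python
from typing import Dict, Iterable, Set


def compare_hole_numbers(subread_zmws: Iterable[int],
                         ccs_zmws: Iterable[int]) -> Dict[str, Set[int]]:
    """Split demuxed ZMW hole numbers into the three Venn diagram regions."""
    subread_set = set(subread_zmws)
    ccs_set = set(ccs_zmws)
    return {
        "both": subread_set & ccs_set,          # demuxed in both workflows
        "subread_only": subread_set - ccs_set,  # demuxed only from subreads
        "ccs_only": ccs_set - subread_set,      # demuxed only from CCS reads
    }


if __name__ == "__main__":
    # Toy hole numbers, purely illustrative.
    regions = compare_hole_numbers([10, 11, 12, 13], [11, 12, 13, 14])
    for name, zmws in regions.items():
        print(name, len(zmws))
```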
1029+
#### Results
1030+
Verbatim results for one chip:
1031+
1032+
Generated CCS reads : 274185
1033+
Demuxed CCS reads : 269919 (98.44%)
1034+
Demuxed subreads : 271068 (98.86%)
1035+
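
The percentages in the results above are the demuxed read counts relative to the number of generated CCS reads. A short sketch to reproduce them follows; `yield_percent` is a hypothetical helper name, and the counts are taken from the results listed above.

```python
def yield_percent(demuxed: int, generated: int) -> float:
    """Percentage of generated CCS reads that were demultiplexed."""
    if generated <= 0:
        raise ValueError("generated must be positive")
    return round(100.0 * demuxed / generated, 2)


if __name__ == "__main__":
    generated_ccs = 274185
    print("Demuxed CCS reads:", yield_percent(269919, generated_ccs))
    print("Demuxed subreads :", yield_percent(271068, generated_ccs))
```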
1036+
Venn diagrams for two chips:
1037+
1038+
<img src="img/venn_diagramm_1.png" width="400px">
1039+
<img src="img/venn_diagramm_2.png" width="400px">
1040+
1041+
Based on those numbers alone, one would pick subread demuxing.
1042+
However, demuxing subreads is very I/O heavy and takes ~100x longer
1043+
than demuxing CCS.
1044+
For the sake of time to result and disk space,
1045+
**perform CCS first and demux afterwards**.
1046+
1047+
#### Discussion
1048+
Q: Is there any systematic reason why reads get correctly called by subread demux but not by CCS demux, or vice versa?
1049+
1050+
Let's plot subread barcode scores, grouped by whether they were called only in subreads (blue) or not (red):
1051+
<img src="img/subread_only_scores.png" width="600px">
1052+
1053+
The majority of reads called only in subreads are on the verge of not being called at all.
1054+
The problem with the current CCS draft stage is that it sometimes trims a few
1055+
bases, which is generally not a big issue for demuxing, but if the barcode is
1056+
molecularly damaged, too short, or of low quality, a few missing bases make it
1057+
uncallable.
1058+
1059+
Conversely, reads called only in CCS and not in subreads:
1060+
<img src="img/ccs_only_scores.png" width="600px">
1061+
1062+
Again, these reads are on the verge of being called. The reason for the ~300 reads
1063+
with a score of 100 is unknown so far. Overall, this is 0.1% of the data.
1064+
Let's investigate those ~300 calls and plot their subread demux barcode scores.
1065+
<img src="img/ccs_only_subread_scores.png" width="600px">
1066+
1067+
It is unclear why they did not get called. For 0.1% of the data, it is not worth changing
1068+
any parameters now, but it merits future investigation.
10161069

10171070
## Full Changelog
10181071

1019-
* **1.10.0**:
1072+
* **1.11.0**:
1073+
* Add barcode to read groups, use one barcode pair per RG
1074+
* Fix double demux, used to clip wrongly for the second round of demuxing
1075+
* 1.10.0:
10201076
* Output N barcodes per subdirectory with `--files-per-directory N` and output splitting
10211077
* BioSample awareness for XML input and split output and allow ignoring them with `--ignore-biosamples`
10221078
* Increase `--window-size-mult` to `3` to allow longer spacers

img/ccs_only_scores.png

163 KB

img/ccs_only_subread_scores.png

147 KB

img/subread_only_scores.png

168 KB

img/venn_diagramm_1.png

77.1 KB

img/venn_diagramm_2.png

75.6 KB
