@@ -35,7 +35,7 @@ by PCR or ligation:
In addition, there are three different barcode library designs.
In order to describe a barcode library design, one can view it
from a SMRTbell or read perspective.
- As *lima* supports raw subread and CCS read demultiplexing,
+ As *lima* supports CLR subread and CCS read demultiplexing,
the following terminology is based on the per (sub-)read view.

<img src="img/barcode_overview.png" width="886px">
@@ -63,28 +63,28 @@ The sort order is defined by the barcode indices, lowest first.
## Features

*Lima* offers the following features:
- * Process both, raw subreads and CCS reads
+ * Process both, CLR subreads and CCS reads
* BAM in- and output
* Extensive reports that allow in-depth quality control
* Clip barcode sequences and annotate `bq` and `bc` tags
* Agnostic of input barcode sequence orientation
* Split output BAM files by barcode
- * No scraps.bam needed
* Full PacBio dataset support
* Peek into the first N ZMWs and get average barcode score
* Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
* Enhanced filtering options to remove ambiguous calls
+ * Double demux to remove PCR primers after barcode demultiplexing

## Latest Version
- Version **1.10.0**: [Full changelog here](#full-changelog)
+ Version **1.11.0**: [Full changelog here](#full-changelog)

## Execution

**Note:** Any existing output files will be overwritten after execution.

**Note:** Always use `--peek-guess` to remove spurious barcode hits.

- Run on raw subread data:
+ Run on CLR subread data:

lima movie.subreads.bam barcodes.fasta prefix.bam
lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
@@ -99,12 +99,12 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.

### *Symmetric* or *Tailed* options

- Raw: --same
+ CLR: --same
CCS: --same --ccs

### *Asymmetric* options

- Raw: --different
+ CLR: --different
CCS: --different --ccs
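
The two option tables above combine a read type (CLR or CCS) with a barcode design (symmetric/tailed or asymmetric). A minimal Python sketch of that mapping follows. `lima_command` and `DESIGN_FLAGS` are hypothetical helpers introduced here for illustration and are not part of lima; the file names are placeholders, and placing the flags after the positional arguments is an assumption. Only flags quoted in this README are used.

```python
from typing import List

# Flag per barcode design, as listed in the option tables of this README.
DESIGN_FLAGS = {
    "symmetric": "--same",        # also applies to tailed designs
    "asymmetric": "--different",
}


def lima_command(reads: str, barcodes: str, output: str,
                 design: str, ccs: bool) -> List[str]:
    """Assemble a lima argument list (hypothetical helper, not a lima API)."""
    if design not in DESIGN_FLAGS:
        raise ValueError(f"unknown barcode design: {design!r}")
    cmd = ["lima", reads, barcodes, output, DESIGN_FLAGS[design]]
    if ccs:
        # CCS input takes --ccs in addition to the design flag.
        cmd.append("--ccs")
    return cmd


# Placeholder file names; nothing is executed here, the command is only printed.
print(" ".join(lima_command("movie.subreads.bam", "barcodes.fasta",
                            "prefix.bam", "symmetric", ccs=False)))
```

The returned list could be handed to `subprocess.run` once the inputs exist; the sketch stops short of that so it runs without lima installed.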

### Example execution
@@ -114,7 +114,7 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.


## Input data
- Input data is either raw unaligned subreads, straight from a Sequel, or
+ Input data is either CLR unaligned subreads, straight from a Sequel I/II, or
unaligned CCS reads, generated by [CCS](https://github.com/PacificBiosciences/ccs);
both in the PacBio enhanced BAM format. If you want to demux RSII data, first
use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json`
@@ -189,24 +189,24 @@ how ZMWs many are *same/different*, and how many reads have been filtered.
Below min passes : 0 (0%)
Below min score lead : 11656 (32%)
Below min ref span : 3124 (8%)
- Without adapter : 25094 (68%)
+ Without SMRTbell adapter : 25094 (68%)
With bad adapter : 10349 (28%) <- Only with --bad-adapter-ratio
Undesired hybrids : xxx (xx%) <- Only with --peek-guess
- Undesired same barcode pairs : xxx (xx%) <- Only with --different
- Undesired diff barcode pairs : xxx (xx%) <- Only with --same
+ Undesired same pairs : xxx (xx%) <- Only with --different
+ Undesired diff pairs : xxx (xx%) <- Only with --same
Undesired 5p--5p pairs : xxx (xx%) <- Only with --isoseq
Undesired 3p--3p pairs : xxx (xx%) <- Only with --isoseq
Undesired single side : xxx (xx%) <- Only with --isoseq
Undesired no hit : xxx (xx%) <- Only with --isoseq

ZMWs for (B):
- With same barcode : 162244 (92%)
- With different barcodes : 14112 (8%)
+ With same pair : 162244 (92%)
+ With different pair : 14112 (8%)
Coefficient of correlation : 32.79%

ZMWs for (A):
- Allow diff barcode pair : 157264 (74%)
- Allow same barcode pair : 188026 (88%)
+ Allow diff pair : 157264 (74%)
+ Allow same pair : 188026 (88%)
Bad adapter yield loss : 10112 (5%) <- Only with --bad-adapter-ratio
Bad adapter impurity : 10348 (5%) <- Only without --bad-adapter-ratio

@@ -446,7 +446,7 @@ Only use reads flanked by adapters on both sides for barcode identification,
full-pass reads.

### `--keep-idx-order`
- Per default, the two identified barcode idx are sorted ascending, as in raw data,
+ Per default, the two identified barcode idx are sorted ascending, as in CLR data,
the correct order cannot be determined. This affects the `bc` tag, `prefix.counts`
file, and `--split-bam` file names; `prefix.report` columns `IdxLowest`,
`IdxHighest`, `IdxLowestNamed`, `IdxHighestNamed` will have the same order as
@@ -455,7 +455,7 @@ such as CCS.

If you are using an asymmetric barcode design with `NxN` pairs
and your input is CCS, you can use `--keep-idx-order` to preserve
- the order. If your input is raw subreads and you use `NxN` asymmetric pairs,
+ the order. If your input is CLR subreads and you use `NxN` asymmetric pairs,
there is no way to distinguish between pairs `bc1001--bc1002` and `bc1002--bc1001`.
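
The effect of the default ascending sort can be illustrated with a small Python sketch. It models only the behavior described in this section, not lima's internals; `reported_pair` and its parameters are hypothetical names, and the barcode indices are made-up values.

```python
from typing import Tuple


def reported_pair(leading_idx: int, trailing_idx: int,
                  keep_idx_order: bool = False) -> Tuple[int, int]:
    """Return the barcode index pair as it would be reported (illustrative only)."""
    if keep_idx_order:
        # Order as observed on the read, meaningful for oriented input such as CCS.
        return (leading_idx, trailing_idx)
    # Default: ascending sort, so the observed order is lost.
    low, high = sorted((leading_idx, trailing_idx))
    return (low, high)


# Without the flag, both orientations collapse to the same pair.
print(reported_pair(0, 1), reported_pair(1, 0))
# With the flag, the two orientations stay distinct.
print(reported_pair(0, 1, keep_idx_order=True),
      reported_pair(1, 0, keep_idx_order=True))
```

With sorting, `(0, 1)` and `(1, 0)` are reported identically, which is why the two orientations of an `NxN` asymmetric pair cannot be told apart for CLR subreads.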

### `--per-read`
@@ -970,7 +970,7 @@ The score lead measures how close the best barcode call is to the second best.
Possible solutions without seeing your data:
* Is that sample actually barcoded?
* Are your barcode sequences genetically too close for SMRT sequencing?
- Try CCS2 calling first and demultiplex with `--ccs`.
+ Try CCS calling first and demultiplex with `--ccs`.
* Are the synthesized products clean and not degenerate?
* Did the sequencing run perform optimally, is the accuracy in the expected range?
* Did you run lima twice, first on the original and then on the already
@@ -982,7 +982,7 @@ false positives.

### What is different in *lima* to *bam2bam*?
* CCS read support
- * Barcodes of every adapter gets scored for raw subreads
+ * Barcodes of every adapter gets scored for CLR subreads
* Does not enforce symmetric barcode pairing, which increases PPV
* For asymmetric barcodes, `lima` can report the identified order, instead of
ascending sorting
@@ -1013,10 +1013,66 @@ then your XML input contains BioSamples with different barcode names than the
provided `barcode.fasta` file. Please check that you've used the correct
barcodes. You can ignore barcodes specified in the XML with `--ignore-biosamples`.

+ ### CCS or demux first?
+ Many people have asked which order is recommended for a multiplexed
+ HiFi pool:
+ 1) first ccs and then demux
+ 2) first demux and then ccs
+
+ #### Experiment
+ Use 2k E. coli amplicons with symmetric, barcoded overhang adapters. Workflow steps:
+ 1) Generate CCS
+ 2) Demux subreads and whitelist on CCS hole numbers
+ 3) Demux CCS
+ 4) Compare both sets of hole numbers
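
The comparison in the last workflow step amounts to set operations on ZMW hole numbers. A minimal Python sketch follows, assuming the hole numbers of both demux runs have already been extracted into plain integer collections; `compare_hole_numbers` is a hypothetical helper, and the example hole numbers are made up, not taken from the experiment.

```python
from typing import Dict, Iterable, Set


def compare_hole_numbers(subread_zmws: Iterable[int],
                         ccs_zmws: Iterable[int]) -> Dict[str, Set[int]]:
    """Split hole numbers into the three regions of a two-set Venn diagram."""
    subread_set, ccs_set = set(subread_zmws), set(ccs_zmws)
    return {
        "both": subread_set & ccs_set,           # demuxed in both workflows
        "subread_only": subread_set - ccs_set,   # called only by subread demux
        "ccs_only": ccs_set - subread_set,       # called only by CCS demux
    }


# Toy hole numbers for illustration only.
regions = compare_hole_numbers([10, 11, 12, 13], [11, 12, 13, 14])
for name, zmws in regions.items():
    print(name, len(zmws), sorted(zmws))
```

The sizes of the three sets correspond to the three regions of a two-set Venn diagram.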
+
+ #### Results
+ Verbatim results for one chip:
+
+ Generated CCS reads : 274185
+ Demuxed CCS reads : 269919 (98.44%)
+ Demuxed subreads : 271068 (98.86%)
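
The percentages above are each demuxed count divided by the number of generated CCS reads. A short Python check of that arithmetic, using only the three counts quoted above; `yield_percent` is a hypothetical helper name.

```python
# Counts quoted in the results for one chip.
GENERATED_CCS = 274185
DEMUXED_CCS = 269919
DEMUXED_SUBREADS = 271068


def yield_percent(demuxed: int, total: int) -> float:
    """Demux yield as a percentage of the total, rounded to two decimals."""
    return round(100.0 * demuxed / total, 2)


print("Demuxed CCS reads :", yield_percent(DEMUXED_CCS, GENERATED_CCS))
print("Demuxed subreads  :", yield_percent(DEMUXED_SUBREADS, GENERATED_CCS))
print("Difference in reads:", DEMUXED_SUBREADS - DEMUXED_CCS)
```

Subread demuxing recovers 1149 more reads on this chip, a difference of roughly 0.4 percentage points.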
+
+ Venn diagrams for two chips:
+
+ <img src="img/venn_diagramm_1.png" width="400px">
+ <img src="img/venn_diagramm_2.png" width="400px">
+
+ Based on those numbers alone, one would pick subread demuxing.
+ There is a catch, though: demuxing subreads is very IO heavy and takes ~100x longer
+ than demuxing CCS.
+ For the sake of time to result and disk space,
+ **perform CCS first and demux afterwards**.
+
+ #### Discussion
+ Q: Is there any systematic reason for reads that get correctly called by subread demux but not by CCS demux, or vice versa?
+
+ Let's plot subread barcode scores, grouped by whether they were called only in subreads (blue) or not (red):
+ <img src="img/subread_only_scores.png" width="600px">
+
+ The majority of reads called only by subread demux are on the verge of being called at all.
+ The problem with the current CCS draft stage is that it sometimes trims a few
+ bases, which is generally not a big issue for demuxing, but if the barcode is
+ molecularly damaged, too short, or of low quality, a few missing bases make
+ it uncallable.
+
+ Vice versa, reads called only by CCS demux and not in subreads:
+ <img src="img/ccs_only_scores.png" width="600px">
+
+ Again, most of these are on the verge of being called. The reason for the ~300 reads
+ with a score of 100 is unknown so far. Overall, they make up 0.1% of the data.
+ Let's investigate those ~300 calls and plot their subread demux barcode scores.
+ <img src="img/ccs_only_subread_scores.png" width="600px">
+
+ It's curious that they didn't get called, but at 0.1% of the data it is not worth changing
+ any parameters now; it is worth future investigation, though.

## Full Changelog

- * **1.10.0**:
+ * **1.11.0**:
+   * Add barcode to read groups, use one barcode pair per RG
+   * Fix double demux, used to clip wrongly for the second round of demuxing
+ * 1.10.0:
  * Output N barcodes per subdirectory with `--files-per-directory N` and output splitting
  * BioSample awareness for XML input and split output and allow ignoring them with `--ignore-biosamples`
  * Increase `--window-size-mult` to `3` to allow longer spacers