Version 2.0.0

armintoepfer · armintoepfer · commit fe09b8fccc6a · 2020-09-23T11:58:06.000+02:00
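
Reviewer sketch, not part of the commit: the in- and output compatibility rules and the FASTA/Q header layout that this change documents can be illustrated with a few lines of Python. The names `CCS_MATRIX`, `CLR_TYPES`, `is_supported`, and `parse_header` are hypothetical and do not come from lima. The sketch assumes file types are passed as plain strings and that header tags never contain whitespace, as in the README example; it is not how lima itself implements these checks.

```python
# Hypothetical illustration of the rules described in the README diff below.
# Rows are input types, columns are output types (CCS data only).
CCS_MATRIX = {
    "XML":   {"XML": True,  "BAM": True,  "FASTQ": True,  "FASTA": True},
    "BAM":   {"XML": True,  "BAM": True,  "FASTQ": True,  "FASTA": True},
    "FASTQ": {"XML": False, "BAM": False, "FASTQ": True,  "FASTA": True},
    "FASTA": {"XML": False, "BAM": False, "FASTQ": False, "FASTA": True},
}

# For CLR data, only XML and BAM are valid in- and output types.
CLR_TYPES = {"XML", "BAM"}


def is_supported(in_type: str, out_type: str, ccs: bool) -> bool:
    """Return True if the input/output type pair is listed as valid."""
    in_type, out_type = in_type.upper(), out_type.upper()
    if not ccs:
        return in_type in CLR_TYPES and out_type in CLR_TYPES
    return CCS_MATRIX.get(in_type, {}).get(out_type, False)


def parse_header(header: str) -> tuple[str, dict[str, str]]:
    """Split a FASTA/Q header into the read name and its key=value tags.

    Assumption: fields are whitespace-separated and the first field is
    the read name, as in the example header in the README.
    """
    name, *fields = header.lstrip("@>").split()
    tags = {}
    for field in fields:
        key, sep, value = field.partition("=")  # split on the first '=' only
        if sep:
            tags[key] = value
    return name, tags


if __name__ == "__main__":
    print(is_supported("FASTQ", "FASTA", ccs=True))
    print(is_supported("FASTQ", "BAM", ccs=True))
    print(parse_header("@m54006_171006_044150/4588126/ccs bc=3,3 bq=100 cx=12"))
```

Splitting each field on the first `=` only keeps values intact when a quality string such as `qt` itself contains `=`.
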
diff --git a/README.md b/README.md
@@ -64,19 +64,19 @@ The sort order is defined by the barcode indices, lowest first.
 
 *Lima* offers the following features:
  * Process both, CLR subreads and CCS reads
- * BAM in- and output
+ * BAM, FASTA, and FASTQ in- and output
  * Extensive reports that allow in-depth quality control
  * Clip barcode sequences and annotate `bq` and `bc` tags
  * Agnostic of input barcode sequence orientation
- * Split output BAM files by barcode
+ * Split output files by barcode
  * Full PacBio dataset support
  * Peek into the first N ZMWs and get average barcode score
  * Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
  * Enhanced filtering options to remove ambiguous calls
  * Double demux to remove PCR primers after barcode demultiplexing
 
 ## Latest Version
-Version **1.11.0**: [Full changelog here](#full-changelog)
+Version **2.0.0**: [Full changelog here](#full-changelog)
 
 ## Execution
 
@@ -86,13 +86,13 @@ Version **1.11.0**: [Full changelog here](#full-changelog)
 
 Run on CLR subread data:
 
-    lima movie.subreads.bam barcodes.fasta prefix.bam
-    lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
+    $ lima movie.subreads.bam barcodes.fasta prefix.bam
+    $ lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
 
 Run on CCS data:
 
-    lima --ccs movie.ccs.bam barcodes.fasta prefix.bam
-    lima --ccs movie.consensusreadset.xml barcodes.barcodeset.xml prefix.consensusreadset.xml
+    $ lima --ccs movie.ccs.bam barcodes.fasta prefix.bam
+    $ lima --ccs movie.consensusreadset.xml barcodes.barcodeset.xml prefix.consensusreadset.xml
 
 If you do not need to import the demultiplexed data into SMRT Link, it is advised
 to use `--no-pbi`, omit the pbi index file, to minimize time to result.
@@ -109,8 +109,8 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.
 
 ### Example execution
 
-    lima m54317_180718_075644.subreadset.xml Sequel_RSII_384_barcodes_v1.barcodeset.xml \
-         m54317_180718_075644.demux.subreadset.xml --different --peek-guess
+    $ lima m54317_180718_075644.subreadset.xml Sequel_RSII_384_barcodes_v1.barcodeset.xml \
+           m54317_180718_075644.demux.subreadset.xml --different --peek-guess
 
 
 ## Input data
@@ -119,6 +119,8 @@ unaligned CCS reads, generated by [CCS](https://github.com/PacificBiosciences/cc
 both in the PacBio enhanced BAM format. If you want to demux RSII data, first
 use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json`
 with one file entry, either a SubreadSet or ConsensusReadSet, is also allowed.
+CCS reads are also supported as FASTA or FASTQ input, optionally
+gzipped.
 
 Barcodes are provided as a FASTA file, one entry per barcode sequence,
 **no duplicate** sequences, only upper-case bases,
@@ -159,14 +161,46 @@ prefix as the output file, omitting suffixes `.bam`, `.subreadset.xml`, and
 `.consensusreadset.xml`. The report infix is `lima`.
 Example:
 
-    lima m54007_170702_064558.subreads.bam barcode.fasta /my/path/m54007_170702_064558_demux.subreadset.xml
+    $ lima m54007_170702_064558.subreads.bam barcode.fasta /my/path/m54007_170702_064558_demux.subreadset.xml
 
 For all output files, the prefix will be `/my/path/m54007_170702_064558_demux.`
 
 ### BAM
 The first file `prefix.bam` contains clipped records, annotated with
 barcode tags, that passed filters.
 
+### FASTA/Q
+Alternatively, if the output file is FASTA or FASTQ, the header of each sequence
+contains all tags that would be present in the BAM format, separated by a
+single whitespace character. Example FASTQ header:
+
+    @m54006_171006_044150/4588126/ccs bc=3,3 bl=CGCGCGTGTGTGCGTG bq=100 bt=CGCGCGTGTGTGCGTG bx=16,16 cx=12 qe=2235 ql=p\tttropqorrtnnH qs=16 qt=G^\IGR]K8S>>^\^p
+
+### In- and output compatibility matrix
+
+For CLR data, only XML and BAM are valid in- and output file types.
+
+For CCS data, use the following compatibility matrix:
+
+| In/Out | XML | BAM | FASTQ | FASTA |
+| ------ | :-: | :-: | :---: | :---: |
+| XML    | YES | YES |  YES  |  YES  |
+| BAM    | YES | YES |  YES  |  YES  |
+| FASTQ  | no  | no  |  YES  |  YES  |
+| FASTA  | no  | no  |  no   |  YES  |
+
+This means you can use CCS FASTQ reads as input and FASTA as output, but
+not BAM as output.
+
+Working example:
+
+    $ lima movie.Q20.fastq Sequel_RSII_384_barcodes_v1.fasta demuxed.fastq --same
+
+Failing example:
+
+    $ lima movie.Q20.fastq Sequel_RSII_384_barcodes_v1.fasta demuxed.bam --same
+    FATAL -|- Unsupported combination of FASTQ input and BAM output.
+
 ### Report
 The second file is `prefix.lima.report`, a tab-separated file about each ZMW, unfiltered.
 This report contains any information necessary to investigate the demultiplexing
@@ -1069,7 +1103,10 @@ any parameters now, but worth future investigation.
 
 ## Full Changelog
 
- * **1.11.0**:
+ * **2.0.0**:
+   * Add support for FASTA and FASTQ
+   * Fix `-k` with by-strand HiFi reads
+ * 1.11.0:
    * Add barcode to read groups, use one barcode pair per RG
    * Fix double demux, used to clip wrongly for the second round of demuxing
  * 1.10.0: