Hello, hope you are well!
I am very excited to try out the low-memory and fast searches created by RocksDB :) (Also, I will definitely be making use of pairwise!)
On my way there, I encountered some unexpected behavior. I had an enormous sequence file (e.g. UniRef50, 65M protein sequences) and cut it up into chunks of 100k sequences so that sourmash scripts manysketch -p protein,scaled=1,k=10,abund could run on each chunk without running out of resources.
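For context, the chunking step can be sketched roughly as below. This is only an illustration of the idea, not the exact command I ran: the function name split_fasta, the chunk file naming scheme, and the input file name in the usage note are all hypothetical, and it assumes an uncompressed FASTA file where every record starts with a ">" header line.

```shell
# Hypothetical helper (not part of sourmash): split a FASTA file into
# chunks holding at most N sequences each.
# Usage: split_fasta INPUT.fasta SEQS_PER_CHUNK OUTPUT_PREFIX
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Open a new chunk file every n sequence headers.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%05d.fasta", prefix, count / n)
            }
            count++
        }
        # Write every line (header or sequence) to the current chunk;
        # anything before the first header is skipped.
        out != "" { print > out }
    ' "$1"
}
```

Something like split_fasta uniref50.fasta 100000 chunks/uniref50 would then write chunks/uniref50_00000.fasta, chunks/uniref50_00001.fasta, and so on (the chunks directory has to exist first), and each chunk gets its own row in a manysketch input CSV.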
I then wanted to index these many sketch files before searching them, but sourmash scripts index didn't work on a text file listing them (a list of manifest files, as far as I can tell).
Here's a minimal reproduction, using the data in src/python/tests/test-data:
# Make input CSV files (printf expands \n in any shell; plain echo does not in bash)
printf 'name,genome_filename,protein_filename\nshort,short.fa,\n' > short.csv
printf 'name,genome_filename,protein_filename\nshort,short2.fa,\n' > short2.csv
printf 'name,genome_filename,protein_filename\nshort,short3.fa,\n' > short3.csv
# Make sketches
sourmash scripts manysketch short.csv -o short.fa.zip -p dna,k=31,scaled=1
sourmash scripts manysketch short2.csv -o short2.fa.zip -p dna,k=31,scaled=1
sourmash scripts manysketch short3.csv -o short3.fa.zip -p dna,k=31,scaled=1
# Make list of sketches (but they're actually manifests?)
for ZIP in short*.zip; do echo "$ZIP"; done > short_siglist.txt
Then sourmash scripts index fails:
$ sourmash scripts index --ksize 31 --scaled 1 -o short_index.rocksdb short_siglist.txt
== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
ksize: 31 / scaled: 1 / moltype: DNA
indexing all sketches in 'short_siglist.txt'
Loading siglist
Reading signature(s) from: 'short_siglist.txt'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short2.fa.zip'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short.fa.zip'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short3.fa.zip'
No valid signatures found in signature pathlist 'short_siglist.txt'
WARNING: 3 signature paths failed to load. See error messages above.
Error: Signatures failed to load. Exiting.
I realize now that the short*.zip files are manifests and not sigs, but I was confused that sourmash scripts index wasn't able to work with them, because all the parameters match in the output of sourmash sig describe:
$ sourmash sig describe short.fa.zip
== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
---
signature filename: /Users/olgabot/code/sourmash_plugin_branchwater/src/python/tests/test-data/short.fa.zip
signature: short
source file: short.fa
md5: 9191284a3a23a913d8d410f3d53ce8f0
k=31 molecule=DNA num=0 scaled=1 seed=42 track_abundance=0
size: 970
sum hashes: 970
signature license: CC0
loaded 1 signatures total, from 1 files
The workaround is to combine the signatures into one file with sourmash sig cat, but I was hoping to put off combining anything until index creation, since the input files are so large.
sourmash sig cat short*.zip -o combined_short.zip
sourmash scripts index combined_short.zip --ksize 31 --scaled 1 -o short_index.rocksdb
Let me know if I'm not thinking about this problem correctly and there's a better way to do it.
Hope this was informative! Thank you!