index doesn't work with a text file list of manifests #347

@olgabot

Hello, hope you are well!

I am very excited to try out the low-memory and fast searches created by RocksDB :) (Also, I will definitely be making use of pairwise!)

On my way there, I encountered some unexpected behavior. I had an enormous sequence file (e.g. UniRef50, 65M protein sequences) and cut it up into chunks of 100k sequences so that I could run sourmash scripts manysketch -p protein,scaled=1,k=10,abund without running out of resources.

Then, I wanted to index these many files before searching them, but sourmash scripts index didn't work on a list of manifest files.

Here's a minimal reproduction, using the data in src/python/tests/test-data:

# Make input csv files
# printf is used instead of echo because not every shell's echo expands \n
printf 'name,genome_filename,protein_filename\nshort,short.fa,\n' > short.csv
printf 'name,genome_filename,protein_filename\nshort,short2.fa,\n' > short2.csv
printf 'name,genome_filename,protein_filename\nshort,short3.fa,\n' > short3.csv

# Make sketches
sourmash scripts manysketch short.csv -o short.fa.zip -p dna,k=31,scaled=1 
sourmash scripts manysketch short2.csv -o short2.fa.zip -p dna,k=31,scaled=1
sourmash scripts manysketch short3.csv -o short3.fa.zip -p dna,k=31,scaled=1

# Make list of sketches (but they're actually manifests?)
for ZIP in short*.zip; do echo "$ZIP" >> short_siglist.txt; done
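
A side note on the list-building step: because the loop appends with >>, rerunning the reproduction adds duplicate entries to short_siglist.txt. A minimal sketch of a rerun-safe variant follows; write_siglist is a hypothetical helper name used only for illustration, not part of sourmash or the branchwater plugin.

```shell
# Hypothetical helper (not part of sourmash): write one existing path per
# line, truncating the list first so that reruns do not append duplicates.
write_siglist() {
    local out="$1"
    shift
    : > "$out"    # start from an empty list file
    local f
    for f in "$@"; do
        # Skip an unmatched glob pattern, which is passed through literally
        if [ -e "$f" ]; then
            printf '%s\n' "$f" >> "$out"
        fi
    done
}

# Usage in the reproduction directory:
#   write_siglist short_siglist.txt short*.zip
```

This only changes how the list file is written; it does not affect whether sourmash scripts index accepts the listed zip files.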

Then, sourmash scripts index fails:

$ sourmash scripts index --ksize 31 --scaled 1 -o short_index.rocksdb short_siglist.txt   

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1 / moltype: DNA 
indexing all sketches in 'short_siglist.txt'
Loading siglist
Reading signature(s) from: 'short_siglist.txt'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short2.fa.zip'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short.fa.zip'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short3.fa.zip'
No valid signatures found in signature pathlist 'short_siglist.txt'
WARNING: 3 signature paths failed to load. See error messages above.
Error: Signatures failed to load. Exiting.

I'm realizing now that the short*.zip files are manifests and not sigs, but I was confused that sourmash scripts index wasn't able to work with them, because all the parameters matched when I ran sourmash sig describe:

$ sourmash sig describe short.fa.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /Users/olgabot/code/sourmash_plugin_branchwater/src/python/tests/test-data/short.fa.zip
signature: short
source file: short.fa
md5: 9191284a3a23a913d8d410f3d53ce8f0
k=31 molecule=DNA num=0 scaled=1 seed=42 track_abundance=0
size: 970
sum hashes: 970
signature license: CC0

loaded 1 signatures total, from 1 files
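
The "expected value at line 1 column 1" message in the index output resembles a JSON parse error, which would fit each listed path being read as a JSON signature file rather than as a zip archive. That is a guess and has not been confirmed against the plugin source. A rough way to see what kind of file each list entry is would be to check for the zip magic bytes "PK"; is_zip and check_siglist below are hypothetical helper names, not sourmash commands.

```shell
# Hypothetical helper: true if the file starts with the zip magic bytes "PK".
is_zip() {
    [ "$(head -c 2 "$1" 2>/dev/null)" = "PK" ]
}

# Hypothetical helper: print each path in a list file with a tab-separated
# label. Missing or unreadable files are also reported as "not-zip".
check_siglist() {
    local path
    while IFS= read -r path; do
        [ -n "$path" ] || continue    # skip blank lines
        if is_zip "$path"; then
            printf '%s\tzip\n' "$path"
        else
            printf '%s\tnot-zip\n' "$path"
        fi
    done < "$1"
}

# Usage:
#   check_siglist short_siglist.txt
```

If every entry is reported as zip, the list holds zip containers rather than plain signature files, which is the case that failed to load here.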

The workaround is to use sourmash sig cat to combine the signatures into one file, but I was hoping to put off combining them until index creation, since the input files are so big.

sourmash sig cat short*.zip -o combined_short.zip 
sourmash scripts index combined_short.zip --ksize 31 --scaled 1 -o short_index.rocksdb 

Let me know if I'm not thinking about this problem correctly and there's a better way to do it.

Hope this was informative! Thank you!
