Skip to content

Pairwise sequence identity matrix is not symmetric #971

@weitse-hsu

Description

@weitse-hsu

As instructed by the user manual and this blogpost, I used the following script to compute pairwise sequence identity for a dataset (called bindingnet here):

fake_pref() {
QDB="$1"
TDB="$2"
RES="$3"

# create link to data file which contains a list of all targets that should be aligned
ln -s "${TDB}.index" "${RES}"

# create new index repeatedly pointing to same entry
INDEX_SIZE="$(echo $(wc -c < "${TDB}.index"))"
awk -v size=$INDEX_SIZE '{ print $1"\t0\t"size; }' "${QDB}.index" > "${RES}.index"

# create dbtype (7)
awk 'BEGIN { printf("%c%c%c%c",7,0,0,0); exit; }' > "${RES}.dbtype"
}
mmseqs createdb bindingnet.fasta bindingnet --shuffle 0
fake_pref bindingnet bindingnet allvsallpref
mmseqs align bindingnet bindingnet allvsallpref allvsallaln -e inf --alignment-mode 3 --seq-id-mode 1 --cov-mode 5
mmseqs convertalis bindingnet bindingnet allvsallaln results.m8

However, I found that some pairs did not have the same sequence identities, i.e. the pairwise sequence identity matrix was not symmetric. For example, in results.m8 I had:

bindingnet_target_235	bindingnet_target_470	0.011	6	5	0	328	333	134	139	5.815E+03	16
...
bindingnet_target_470	bindingnet_target_235	0.033	48	42	0	173	216	444	491	2.536E+03	16

which showed different alignment lengths (6 v.s. 48) and different identity values (0.011 v.s. 0.033).

I took a closer look at the 235-470 pair with the following script:

mmseqs createdb target_235.fasta DB235
mmseqs createdb target_470.fasta DB470

ln -s DB470.index allvsallpref
awk -v size=$(wc -c < DB470.index) '{print $1"\t0\t"size}' DB235.index > allvsallpref.index
awk 'BEGIN {printf("%c%c%c%c",7,0,0,0)}' > allvsallpref.dbtype

mmseqs align DB235 DB470 allvsallpref DB235_vs_DB470_aln -e inf --alignment-mode 3 --seq-id-mode 1
mmseqs convertalis DB235 DB470 DB235_vs_DB470_aln result_235_vs_470.m8

# Repeat in reverse direction to verify symmetry explicitly
ln -s DB235.index allvsallpref_rev
mmseqs align DB470 DB235 allvsallpref_rev DB470_vs_DB235_aln -e inf --alignment-mode 3 --seq-id-mode 1
mmseqs convertalis DB470 DB235 DB235_vs_DB470_aln result_470_vs_235.m8

And this time I got the following (collected from the two output m8 files)

bindingnet_target_235	bindingnet_target_470	0.011	6	5	0	328	333	134	139	5.038E+00	16
bindingnet_target_470	bindingnet_target_235	0.011	6	5	0	328	333	134	139	5.038E+00	16

I am a little lost as I thought the two approaches should be equivalent, except that in the first case bindingnet.fasta contains multiple sequences from the dataset. For your reference, here I am attaching bindingnet.fasta I used (bindingnet.txt).

I am still fairly new to mmseqs2 and I wonder if I did anything wrong when processing the dataset. Any suggestion would be highly appreciated!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions