Pairwise sequence identity matrix is not symmetric

As instructed by the user manual and this [blogpost](https://www.blopig.com/blog/2023/05/pairwise-sequence-identity-and-tanimoto-similarity-in-pdbbind/), I used the following script to compute pairwise sequence identity for a dataset (called bindingnet here):
```
fake_pref() {
QDB="$1"
TDB="$2"
RES="$3"

# create link to data file which contains a list of all targets that should be aligned
ln -s "${TDB}.index" "${RES}"

# create new index repeatedly pointing to same entry
INDEX_SIZE="$(echo $(wc -c < "${TDB}.index"))"
awk -v size=$INDEX_SIZE '{ print $1"\t0\t"size; }' "${QDB}.index" > "${RES}.index"

# create dbtype (7)
awk 'BEGIN { printf("%c%c%c%c",7,0,0,0); exit; }' > "${RES}.dbtype"
}
mmseqs createdb bindingnet.fasta bindingnet --shuffle 0
fake_pref bindingnet bindingnet allvsallpref
mmseqs align bindingnet bindingnet allvsallpref allvsallaln -e inf --alignment-mode 3 --seq-id-mode 1 --cov-mode 5
mmseqs convertalis bindingnet bindingnet allvsallaln results.m8
```
However, I found that some pairs did not have the same sequence identities, i.e. the pairwise sequence identity matrix was not symmetric. For example, in `results.m8` I had:
```
bindingnet_target_235	bindingnet_target_470	0.011	6	5	0	328	333	134	139	5.815E+03	16
...
bindingnet_target_470	bindingnet_target_235	0.033	48	42	0	173	216	444	491	2.536E+03	16
```
which showed different alignment lengths (6 v.s. 48) and different identity values (0.011 v.s. 0.033). 

I took a closer look at the 235-470 pair with the following script:
```
mmseqs createdb target_235.fasta DB235
mmseqs createdb target_470.fasta DB470

ln -s DB470.index allvsallpref
awk -v size=$(wc -c < DB470.index) '{print $1"\t0\t"size}' DB235.index > allvsallpref.index
awk 'BEGIN {printf("%c%c%c%c",7,0,0,0)}' > allvsallpref.dbtype

mmseqs align DB235 DB470 allvsallpref DB235_vs_DB470_aln -e inf --alignment-mode 3 --seq-id-mode 1
mmseqs convertalis DB235 DB470 DB235_vs_DB470_aln result_235_vs_470.m8

# Repeat in reverse direction to verify symmetry explicitly
ln -s DB235.index allvsallpref_rev
mmseqs align DB470 DB235 allvsallpref_rev DB470_vs_DB235_aln -e inf --alignment-mode 3 --seq-id-mode 1
mmseqs convertalis DB470 DB235 DB235_vs_DB470_aln result_470_vs_235.m8
```
And this time I got the following (collected from the two output `m8` files)
```
bindingnet_target_235	bindingnet_target_470	0.011	6	5	0	328	333	134	139	5.038E+00	16
bindingnet_target_470	bindingnet_target_235	0.011	6	5	0	328	333	134	139	5.038E+00	16
```
I am a little lost as I thought the two approaches should be equivalent, except that in the first case `bindingnet.fasta` contains multiple sequences from the dataset. For your reference, here I am attaching `bindingnet.fasta` I used ([bindingnet.txt](https://github.com/user-attachments/files/19233592/bindingnet.txt)).

I am still fairly new to mmseqs2 and I wonder if I did anything wrong when processing the dataset. Any suggestion would be highly appreciated!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Pairwise sequence identity matrix is not symmetric #971

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Pairwise sequence identity matrix is not symmetric #971

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions