-
Notifications
You must be signed in to change notification settings - Fork 265
Description
As instructed by the user manual and this blogpost, I used the following script to compute pairwise sequence identity for a dataset (called bindingnet here):
fake_pref() {
QDB="$1"
TDB="$2"
RES="$3"
# create link to data file which contains a list of all targets that should be aligned
ln -s "${TDB}.index" "${RES}"
# create new index repeatedly pointing to same entry
INDEX_SIZE="$(echo $(wc -c < "${TDB}.index"))"
awk -v size=$INDEX_SIZE '{ print $1"\t0\t"size; }' "${QDB}.index" > "${RES}.index"
# create dbtype (7)
awk 'BEGIN { printf("%c%c%c%c",7,0,0,0); exit; }' > "${RES}.dbtype"
}
mmseqs createdb bindingnet.fasta bindingnet --shuffle 0
fake_pref bindingnet bindingnet allvsallpref
mmseqs align bindingnet bindingnet allvsallpref allvsallaln -e inf --alignment-mode 3 --seq-id-mode 1 --cov-mode 5
mmseqs convertalis bindingnet bindingnet allvsallaln results.m8
However, I found that some pairs did not have the same sequence identities, i.e. the pairwise sequence identity matrix was not symmetric. For example, in results.m8 I had:
bindingnet_target_235 bindingnet_target_470 0.011 6 5 0 328 333 134 139 5.815E+03 16
...
bindingnet_target_470 bindingnet_target_235 0.033 48 42 0 173 216 444 491 2.536E+03 16
which showed different alignment lengths (6 v.s. 48) and different identity values (0.011 v.s. 0.033).
I took a closer look at the 235-470 pair with the following script:
mmseqs createdb target_235.fasta DB235
mmseqs createdb target_470.fasta DB470
ln -s DB470.index allvsallpref
awk -v size=$(wc -c < DB470.index) '{print $1"\t0\t"size}' DB235.index > allvsallpref.index
awk 'BEGIN {printf("%c%c%c%c",7,0,0,0)}' > allvsallpref.dbtype
mmseqs align DB235 DB470 allvsallpref DB235_vs_DB470_aln -e inf --alignment-mode 3 --seq-id-mode 1
mmseqs convertalis DB235 DB470 DB235_vs_DB470_aln result_235_vs_470.m8
# Repeat in reverse direction to verify symmetry explicitly
ln -s DB235.index allvsallpref_rev
mmseqs align DB470 DB235 allvsallpref_rev DB470_vs_DB235_aln -e inf --alignment-mode 3 --seq-id-mode 1
mmseqs convertalis DB470 DB235 DB235_vs_DB470_aln result_470_vs_235.m8
And this time I got the following (collected from the two output m8 files)
bindingnet_target_235 bindingnet_target_470 0.011 6 5 0 328 333 134 139 5.038E+00 16
bindingnet_target_470 bindingnet_target_235 0.011 6 5 0 328 333 134 139 5.038E+00 16
I am a little lost as I thought the two approaches should be equivalent, except that in the first case bindingnet.fasta contains multiple sequences from the dataset. For your reference, here I am attaching bindingnet.fasta I used (bindingnet.txt).
I am still fairly new to mmseqs2 and I wonder if I did anything wrong when processing the dataset. Any suggestion would be highly appreciated!