Potential optimization #968

yanj14jy15 · 2025-03-06T05:39:27Z

Hi, would it be possible to add a deduplication step before MSAs are calculated in colabfold? When generating MSAs for a large batch of alphafold2-multimer-v3 analyses, I noticed that many proteins are shared across different protein:protein pairs, and the MSA of each shared protein is recalculated every time it appears. For example, if I duplicate proteinA:proteinB 1000 times, colabfold will use mmseqs to calculate the MSA of proteinA 1000 times and the MSA of proteinB 1000 times, even though generating one MSA for each of proteinA and proteinB should suffice.
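To make the request concrete, here is a minimal sketch of the kind of deduplication I have in mind. It is not colabfold's actual code: `unique_chains` is a hypothetical helper, and it only assumes that each complex is written as chain sequences joined by `:`. It collects each distinct chain once and records which unique chains every complex uses, so an MSA would only need to be computed per unique chain and then reused.

```python
from typing import Dict, List, Tuple


def unique_chains(complexes: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Deduplicate chains across complexes (hypothetical helper, not a colabfold API).

    Each complex is assumed to be a string of chain sequences joined by ':'.
    Returns the unique chains in first-seen order, plus, for every complex,
    the indices of its chains in that unique list.
    """
    index: Dict[str, int] = {}      # chain sequence -> position in `unique`
    unique: List[str] = []          # chains that actually need an MSA
    mapping: List[List[int]] = []   # per complex: which unique chains it uses

    for query in complexes:
        ids = []
        for chain in query.split(":"):
            chain = chain.strip().upper()  # normalize so trivial variants match
            if chain not in index:
                index[chain] = len(unique)
                unique.append(chain)
            ids.append(index[chain])
        mapping.append(ids)
    return unique, mapping


# 1000 copies of the same pair plus one pair that shares a chain (toy sequences)
complexes = ["MKT:GAV"] * 1000 + ["MKT:PLW"]
unique, mapping = unique_chains(complexes)
print(len(complexes), "complexes,", len(unique), "unique chains")
```

With something like this, the MSA search would run on `unique` only, and `mapping` would be used to reassemble the per-complex inputs afterwards. Any pairing step for the multimer MSAs would still have to happen per complex.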

Additionally, when using mmseqs with multiple GPUs, I noticed that the precalculated indices are loaded and split across the GPUs. On an 80 GB A100 or H100, the entire database fits comfortably on a single GPU. So I wonder whether the way the databases are loaded could be adjusted based on the amount of GPU memory. For example, would it be possible to keep a full copy of the database on each A100/H100 to reduce communication time, especially when the GPUs are not connected by NVLink? Thanks!
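As a rough illustration of the decision I am asking about, the sketch below picks between keeping a full copy per GPU and splitting the database. Everything in it is an assumption for illustration: `choose_placement`, the `headroom` fraction, and the "replicate"/"shard" labels are hypothetical and do not correspond to any existing mmseqs option.

```python
from typing import List


def choose_placement(db_bytes: int, gpu_mem_bytes: List[int], headroom: float = 0.8) -> str:
    """Pick a database placement strategy (hypothetical, not an mmseqs API).

    db_bytes:       size of the precalculated index to load
    gpu_mem_bytes:  memory of each available GPU
    headroom:       assumed fraction of GPU memory usable for the database;
                    the rest is left for working buffers
    """
    if not gpu_mem_bytes:
        raise ValueError("at least one GPU is required")
    # Every GPU needs room for a full copy, so the smallest one decides.
    usable = min(gpu_mem_bytes) * headroom
    return "replicate" if db_bytes <= usable else "shard"


GIB = 1024 ** 3
# Toy sizes only; real index sizes depend on the database being searched.
print(choose_placement(50 * GIB, [80 * GIB, 80 * GIB]))
print(choose_placement(70 * GIB, [80 * GIB, 80 * GIB]))
```

The idea is simply that replication avoids cross-GPU traffic whenever each card can hold the whole index, with splitting kept as the fallback for smaller cards or larger databases.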
