Potential optimization #968

Open
yanj14jy15 opened this issue Mar 6, 2025 · 0 comments

Hi, would it be possible to add a deduplication step before ColabFold calculates MSAs? When generating MSAs for a large batch of alphafold2-multimer-v3 analyses, I noticed that many proteins are shared across different protein:protein pairs, and the MSA of each shared protein is recalculated every time it appears. For example, if I duplicate proteinA:proteinB 1000 times, ColabFold uses mmseqs to calculate the MSA of proteinA 1000 times and the MSA of proteinB 1000 times, even though one MSA for each of proteinA and proteinB should suffice.
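To make the idea concrete, here is a minimal sketch of the deduplication step, assuming queries are given as colon-separated chain sequences. The function name `unique_chains` and the example sequences are made up for illustration; this is not ColabFold's actual code, and it does not cover how paired MSAs for each complex would be assembled afterwards.

```python
from collections import OrderedDict


def unique_chains(queries):
    """Map each distinct chain sequence to the indices of the queries that use it.

    `queries` is a list of strings such as "SEQA:SEQB" (hypothetical input
    format for this sketch). Insertion order of first appearance is preserved.
    """
    seen = OrderedDict()
    for idx, query in enumerate(queries):
        for chain in query.split(":"):
            key = chain.strip().upper()
            hits = seen.setdefault(key, [])
            # Record each query once per chain, even for homo-oligomers (A:A).
            if not hits or hits[-1] != idx:
                hits.append(idx)
    return seen


# Made-up sequences: the same pair repeated 1000 times plus one new partner.
queries = ["MKTAYIAK:GSHMLEDP"] * 1000 + ["MKTAYIAK:AAAAGGGG"]
chains = unique_chains(queries)
print(len(queries) * 2, "chain occurrences,", len(chains), "unique chains to search")
```

With a mapping like this, the per-chain search would only need to run once per key, and the results could be handed back to every query index listed for that chain.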

Additionally, when using mmseqs with multiple GPUs, I noticed that the precalculated indices are loaded and split across the GPUs. On an 80 GB A100 or H100, the entire database fits comfortably on a single GPU. Would it be possible to adjust how the databases are loaded based on the amount of GPU memory available? For example, each A100/H100 could keep its own full copy of the database to reduce communication time, especially when the GPUs are not connected by NVLink. Thanks!
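A rough sketch of the placement rule described above, purely illustrative: `choose_placement`, the `headroom` fraction, and the sizes in the example are assumptions for this sketch, not mmseqs options or measured database sizes.

```python
def choose_placement(db_bytes, gpu_mem_bytes, headroom=0.8):
    """Pick a database placement strategy from the database and GPU memory sizes.

    Returns "replicate" if every GPU can hold a full copy, "shard" if the
    database only fits when split across all GPUs, and "host" otherwise.
    `headroom` is the assumed fraction of each GPU's memory usable for the
    database; the rest is left for working buffers.
    """
    if not gpu_mem_bytes:
        return "host"
    budgets = [int(mem * headroom) for mem in gpu_mem_bytes]
    if all(db_bytes <= budget for budget in budgets):
        return "replicate"  # one full copy per GPU, no cross-GPU traffic
    if db_bytes <= sum(budgets):
        return "shard"      # split across GPUs, as happens today
    return "host"           # does not fit on the GPUs at all


GIB = 1024 ** 3
# Placeholder sizes chosen only to exercise each branch.
print(choose_placement(60 * GIB, [80 * GIB] * 4))
print(choose_placement(60 * GIB, [40 * GIB] * 4))
```

The point of the rule is that sharding would stay the fallback for smaller cards, while 80 GB cards would each get a full copy.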
