Open-Omics-Acceleration-Framework-3.0

@sanchit-misra sanchit-misra released this 13 Dec 11:58
· 4 commits to main since this release
3cfa489

This v3.0 release expands the footprint of Open Omics to several drug discovery tasks, including protein design, molecular docking, and de novo drug molecule design, while also adding more tools for transcriptomics and protein folding. In total, it adds nine GenAI-based methods for various drug discovery tasks, one aligner for RNA-seq reads, and two molecular docking tools. To ensure smooth building and deployment, we provide Dockerfiles for all the workloads that rely on multiple packages. More specifically, this release adds the following new tools and pipelines:

  • Transcriptomics:
    • STAR aligner v2.7.11b: STAR is a popular RNA-seq read aligner. It takes paired fastq(.gz) file(s) with RNA reads as input, aligns those reads to the reference genome, and outputs the alignments in a SAM file.
  • Protein folding – containerized versions of:
    • AlphaFold2 multimer v2.3.2: takes as input the sequences of one or more protein complexes in fasta files and outputs their predicted structures in pdb files, along with auxiliary outputs in pkl files.
    • ESMFold v1.0.3: takes one or more proteins in fasta files as input and outputs their predicted structures in pdb files.
  • Protein Design – containerized versions of:
    • RFDiffusion v1.1.0: this diffusion-based computational tool generates de novo protein structures. It takes as input protein structure specifications in a pdb file and outputs the generated structure in a pdb file.
    • ESM-2 embedding v1.0.3: takes one or more protein sequences in fasta files and outputs their generated embeddings in pt files for downstream analysis.
    • LM-design v1.0.3: uses language models trained on sequences of natural proteins to generate de novo proteins. This includes two tasks: (i) Fixed-backbone design: generates protein sequences for a given structure provided in a pdb file, and (ii) Free generation: takes the sequence length as input and generates a sequence of that length.
    • ESM2-inv v1.0.3: This inverse folding model is designed to predict protein sequences from protein structure backbone. This includes two tasks: (i) Sequence Design: Generates protein sequences for a given structure. The input can be either a pdb file or a cif file, and the output is saved as a fasta file. (ii) Sequence Scoring: Evaluates and scores sequences for compatibility with a given structure. The input requires protein structure in a pdb file and a sequence in a fasta file, and the output is saved as a csv file containing the scores.
    • ProtGPT2: ProtGPT2 is a popular deep language model that generates de novo protein sequences. The code and the model are hosted on HuggingFace. We used commit #4425556 as a base for our optimizations. It generates the user-provided number of sequences with the specified sequence length as the output in a fasta file.
    • ProteinMPNN v1.0.1: ProteinMPNN offers multiple functionalities, e.g. (i) generating amino acid sequences for a given protein structure backbone, and (ii) enabling the design of de novo proteins and the optimization of existing ones. It takes the protein structure in pdb file format and generates the corresponding amino acid sequence(s) in fasta file format.
  • Molecular Docking – containerized versions of:
    • AutoDock v1.4: AutoDock is a tool for predicting how ligand molecules bind to a protein receptor of known 3D structure. It takes a protein map in an fld file and a ligand in a pdbqt file as input and generates an output dlg file that contains the final docked pose and its energy value.
    • AutoDock-Vina v1.2.2: AutoDock-Vina is an improved version of AutoDock that doesn’t require protein map fld files as input. It takes protein pdbqt file, ligand pdbqt file, and dimensions of the box where the docking is to be performed as input and generates a docking result pdbqt file that contains multiple ranked docked poses.
  • De novo drug molecule search – containerized version of:
    • MoFlow v1.0: MoFlow is a flow-based graph generative model designed to generate chemically valid molecular graphs efficiently and accurately. It supports the following tasks: molecular graph generation and reconstruction, visualization of the continuous latent space, property optimization, and constrained property optimization. It takes as input a set of parameters and gives the molecular graph as output.
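
As an illustration of the AutoDock-Vina output described above (a result pdbqt file containing multiple ranked docked poses), the following is a minimal Python sketch that lists each pose with its reported binding affinity. It assumes the `MODEL` / `REMARK VINA RESULT:` layout that Vina writes into its output pdbqt files; the function name `parse_vina_poses` and the sample text are hypothetical and are not part of this release.

```python
# Illustrative sample only: a trimmed-down Vina-style result with two poses.
# Real output files also contain the ligand atom records for every pose.
SAMPLE_PDBQT = """\
MODEL 1
REMARK VINA RESULT:    -7.5      0.000      0.000
ENDMDL
MODEL 2
REMARK VINA RESULT:    -7.1      1.532      2.874
ENDMDL
"""


def parse_vina_poses(pdbqt_text):
    """Return a list of (model_number, affinity) tuples in file order.

    Assumes each pose starts with a 'MODEL <n>' line and carries a
    'REMARK VINA RESULT:' line whose first value is the affinity (kcal/mol).
    """
    poses = []
    model = None
    for line in pdbqt_text.splitlines():
        if line.startswith("MODEL"):
            model = int(line.split()[1])
        elif line.startswith("REMARK VINA RESULT:") and model is not None:
            fields = line.split(":", 1)[1].split()
            poses.append((model, float(fields[0])))
    return poses


if __name__ == "__main__":
    for model, affinity in parse_vina_poses(SAMPLE_PDBQT):
        print(f"pose {model}: {affinity} kcal/mol")
```

To use it on an actual docking result, read the output file into a string and pass it to `parse_vina_poses`; poses are returned in the order they appear in the file.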