Future improvements #414

@ChristopherMancuso

Description

This list contains items that won't be worked on in the immediate future but might be implemented later. If an item is going to be worked on, remove it from this list and create a new issue.

  • Add a clustering method that works directly on the embedding (data) matrix like dbscan.
  • When clustering is run, add a set of results that aggregates the results across clusters (gene predictions, pre-trained similarities, etc)
    • This would use tao-rank or a max/avg probability calculation. The result would probably be treated as another model, alongside All-Genes and Cluster-01, and be called Cluster-Integration
  • To ascertain how good a pre-trained model is, when generating weights also get a CV metric. These could be added to similarity tables.
  • For DOMINO add functionality to add genes back in
  • Re-introduce single species networks
  • Re-introduce adjacency matrix implementation
  • Remove IMP as it is very large and doesn't really support many species. Then maybe have STRING and BioGRID include more species.
  • Think about a different metric besides APOP
  • figure out what we need to get geneplexus on a conda channel and not just pip
  • add parallel processing to some function calls, as with many clusters it can be pretty slow
  • consider another place for backend data besides Zenodo
    • Zenodo can be very slow and spotty, but other places require paying per download
  • Add permutation tests for deriving a p-value in probability calculations
  • for backend edgelists, maybe remove the edge weight threshold (need to drop IMP), or figure out a good way to set it
  • can we give a user a numerical value for the connectivity of their input gene set
    • how to do this if an embedding is chosen? would this score be interpretable?
  • when creating the similarity tables, is there a way to highlight/show the genes in each term that are most important in the context of the original set? For embeddings, could integrate info from the data matrix as well.
    • If two terms aren't similar, does that matter?
  • See if calculating the cosine similarity matrix on the fly would be feasible to do
  • Right now extra pytest data (- Edgelist__Human__STRING.edg) is saved in test folder, but should be moved to Zenodo with rest of pytest data and remove shutil.copy from test_geneplexus.py
  • maybe fix things with tests
    • does GitHub action need both pull request and push?
    • do tox and GitHub Actions do redundant things?
    • can we store pytest data in github cache so it doesn't need to be downloaded every time
  • make more biologically relevant examples for different species and clustering. Save these examples on Zenodo so users can pull them
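
For the Cluster-Integration item above, a minimal sketch of the max/avg probability option could look like the following. The function name `integrate_cluster_probs` and the nested-dict input layout are assumptions made for illustration, not part of the current geneplexus API, and the tao-rank option is not covered.

```python
from statistics import mean


def integrate_cluster_probs(probs_by_cluster, method="max"):
    """Combine per-cluster gene probabilities into one score per gene.

    probs_by_cluster: hypothetical layout, {cluster_name: {gene_id: probability}}.
    method: "max" keeps the highest probability, "avg" takes the mean.
    Genes missing from a cluster are ignored for that cluster.
    """
    combiners = {"max": max, "avg": mean}
    if method not in combiners:
        raise ValueError(f"method must be one of {sorted(combiners)}")
    combine = combiners[method]

    # Collect every probability observed for each gene across clusters.
    per_gene = {}
    for probs in probs_by_cluster.values():
        for gene, prob in probs.items():
            per_gene.setdefault(gene, []).append(prob)

    return {gene: combine(values) for gene, values in per_gene.items()}


example = {
    "Cluster-01": {"GENE_A": 0.9, "GENE_B": 0.2},
    "Cluster-02": {"GENE_A": 0.5, "GENE_C": 0.4},
}
max_scores = integrate_cluster_probs(example, method="max")
avg_scores = integrate_cluster_probs(example, method="avg")
```

The combined scores could then be reported as one more model next to All-Genes and the per-cluster models.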
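
For the permutation-test item above, a generic label-permutation sketch of an empirical p-value is shown below. How the null should be defined for GenePlexus probabilities is not settled in this issue, so `permutation_p_value`, `mean_difference`, and the toy data are all illustrative assumptions rather than a proposed implementation.

```python
import random


def permutation_p_value(score_fn, values, labels, n_perm=1000, seed=0):
    """One-sided empirical p-value for score_fn under shuffled labels.

    Uses the (hits + 1) / (n_perm + 1) estimator so the p-value is never 0.
    """
    rng = random.Random(seed)
    observed = score_fn(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Count permutations scoring at least as high as the real labels.
        if score_fn(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def mean_difference(values, labels):
    """Toy statistic: mean of positives minus mean of negatives."""
    pos = [v for v, lab in zip(values, labels) if lab == 1]
    neg = [v for v, lab in zip(values, labels) if lab == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


# Toy data: positives (label 1) clearly score higher than negatives (label 0).
values = [0.9, 0.8, 0.85, 0.95, 0.1, 0.2, 0.15, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
p_value = permutation_p_value(mean_difference, values, labels)
```

Because each permutation is independent, this loop is also a natural candidate for the parallel-processing item in the list.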

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request

