
Efficient computation of gene similarity matrix for large datasets in SLICE #10

@francescopatane96

Thank you for the great work on SLICE. I'm currently applying getEntropy() to a Seurat object with ~9,000 cells and ~30,000 genes. As suggested, I'm attempting to use a gene similarity matrix (km) as input instead of precomputed clusters.

However, computing a full gene-gene similarity matrix (e.g., kappa or Jaccard) over 30,000 genes is computationally infeasible for me: a 30,000 × 30,000 matrix has roughly 900 million entries (about 7 GB as a dense double-precision matrix), which exceeds my memory and time budget. I have a few questions:

1. What is the most efficient way to compute the similarity matrix (km) in this context?
2. Is there a recommended method for approximating kappa similarity (e.g., sparse binary matrices, nearest neighbors)? A vectorized kappa sketch follows this list.
3. Can SLICE work with a partial or sparsified matrix (e.g., only the top-N neighbors per gene)?
4. Is it possible to use HVGs (highly variable genes) directly, to reduce the matrix size without losing biological meaning?
5. Would it be acceptable to use Jaccard similarity instead of kappa?
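
To make question 2 concrete, below is the vectorized kappa computation I have been experimenting with on an HVG-restricted binary matrix. It is only a sketch and makes several assumptions: `seu` is my Seurat object with a sparse "RNA" counts matrix, FindVariableFeatures() has already been run, and getEntropy() will accept a plain gene-by-gene matrix as km (please correct me if that is not the case).

```r
library(Seurat)
library(Matrix)

## Restrict to highly variable genes and binarize: a gene is "on" in a
## cell if it has at least one count. GetAssayData(..., slot = "counts")
## is the Seurat v4 call; in Seurat v5 the argument is layer = "counts".
hvgs   <- VariableFeatures(seu)
counts <- GetAssayData(seu, assay = "RNA", slot = "counts")[hvgs, ]
bin    <- counts
bin@x[bin@x > 0] <- 1          # assumes a sparse dgCMatrix

## Cohen's kappa between every pair of binary gene vectors, using
## cross-products instead of an explicit loop over gene pairs.
kappa_matrix <- function(bin) {
  n  <- ncol(bin)                           # number of cells
  A  <- as.matrix(Matrix::tcrossprod(bin))  # cells where both genes are "on"
  r  <- Matrix::rowSums(bin)                # cells where each gene is "on"
  po <- (n - outer(r, r, "+") + 2 * A) / n  # observed agreement (both on + both off)
  pe <- (outer(r, r) + outer(n - r, n - r)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)   # NaN/Inf if a gene is "on" in all cells or none
}

km <- kappa_matrix(bin)  # ~2,000 HVGs gives a 2,000 x 2,000 dense matrix
```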
At the moment I am using proxyC::simil() on binarized expression, which is fast but may not match what SLICE expects; my current call is shown below.
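
For comparison, this is roughly my current proxyC call on the same binary matrix built above. The rank and min_simil arguments are how I keep only the strongest neighbors per gene (question 3); I do not know whether SLICE tolerates such a sparsified km, so I convert it to a dense matrix before passing it on.

```r
library(proxyC)

## Gene-gene Jaccard similarity on the binary matrix built above
## (genes are rows, hence margin = 1). rank / min_simil drop weak
## entries so only the top neighbors per gene are kept.
km_jaccard <- proxyC::simil(bin, margin = 1, method = "jaccard",
                            rank = 50, min_simil = 0.1)

## Convert to a plain dense matrix in case getEntropy() does not
## accept a sparse object.
km_jaccard <- as.matrix(km_jaccard)
```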
I’d appreciate any advice or best practices you could share for scaling SLICE to large single-cell datasets.

Best regards,

Francesco
