
Efficient computation of gene similarity matrix for large datasets in SLICE #10

@francescopatane96

Thank you for the great work on SLICE. I'm currently applying getEntropy() to a Seurat object with ~9,000 cells and ~30,000 genes. As suggested, I'm attempting to use a gene similarity matrix (km) as input instead of precomputed clusters.

However, computing a full gene-gene similarity matrix (e.g., kappa or Jaccard) over 30,000 genes is computationally infeasible for me: a 30,000 × 30,000 matrix has roughly 900 million entries (about 7 GB as a dense double-precision matrix), which exceeds my memory and time budget. I have a few questions:

1. What is the most efficient way to compute the similarity matrix (km) in this context?
2. Is there a recommended method for approximating kappa similarity (e.g., sparse binary matrices, nearest neighbors)? A vectorized kappa sketch follows this list.
3. Can SLICE work with a partial or sparsified matrix (e.g., only the top-N neighbors per gene)?
4. Is it possible to use HVGs (highly variable genes) directly, to reduce the matrix size without losing biological meaning?
5. Would it be acceptable to use Jaccard similarity instead of kappa?
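
To make question 2 concrete, below is the vectorized kappa computation I have been experimenting with on an HVG-restricted binary matrix. It is only a sketch and makes several assumptions: `seu` is my Seurat object with a sparse "RNA" counts matrix, FindVariableFeatures() has already been run, and getEntropy() will accept a plain gene-by-gene matrix as km (please correct me if that is not the case).

```r
library(Seurat)
library(Matrix)

## Restrict to highly variable genes and binarize: a gene is "on" in a
## cell if it has at least one count. GetAssayData(..., slot = "counts")
## is the Seurat v4 call; in Seurat v5 the argument is layer = "counts".
hvgs   <- VariableFeatures(seu)
counts <- GetAssayData(seu, assay = "RNA", slot = "counts")[hvgs, ]
bin    <- counts
bin@x[bin@x > 0] <- 1          # assumes a sparse dgCMatrix

## Cohen's kappa between every pair of binary gene vectors, using
## cross-products instead of an explicit loop over gene pairs.
kappa_matrix <- function(bin) {
  n  <- ncol(bin)                           # number of cells
  A  <- as.matrix(Matrix::tcrossprod(bin))  # cells where both genes are "on"
  r  <- Matrix::rowSums(bin)                # cells where each gene is "on"
  po <- (n - outer(r, r, "+") + 2 * A) / n  # observed agreement (both on + both off)
  pe <- (outer(r, r) + outer(n - r, n - r)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)   # NaN/Inf if a gene is "on" in all cells or none
}

km <- kappa_matrix(bin)  # ~2,000 HVGs gives a 2,000 x 2,000 dense matrix
```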
At the moment I am using proxyC::simil() on binarized expression, which is fast but may not match what SLICE expects; my current call is shown below.
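
For comparison, this is roughly my current proxyC call on the same binary matrix built above. The rank and min_simil arguments are how I keep only the strongest neighbors per gene (question 3); I do not know whether SLICE tolerates such a sparsified km, so I convert it to a dense matrix before passing it on.

```r
library(proxyC)

## Gene-gene Jaccard similarity on the binary matrix built above
## (genes are rows, hence margin = 1). rank / min_simil drop weak
## entries so only the top neighbors per gene are kept.
km_jaccard <- proxyC::simil(bin, margin = 1, method = "jaccard",
                            rank = 50, min_simil = 0.1)

## Convert to a plain dense matrix in case getEntropy() does not
## accept a sparse object.
km_jaccard <- as.matrix(km_jaccard)
```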
I’d appreciate any advice or best practices you could share for scaling SLICE to large single-cell datasets.

Best regards,

Francesco
