Finish implementing indexer #11

Open
@cmdevries

Ideally, the indexer would output integer-valued document vectors of term frequencies. These could optionally be written to disk in a compressed format using https://github.com/lemire/FastPFOR, which would make it easy to experiment with different representation approaches. The indexer would also output collection-level term statistics.
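As a rough illustration of the output described above, here is a minimal Python sketch, not the project's actual indexer. The function name `index_documents`, the layout of the statistics dictionary, and the assumption that documents arrive already tokenized are all hypothetical. Compression is left out: the sorted integer pairs are simply the kind of data an integer codec such as FastPFOR could take as input.

```python
from collections import Counter


def index_documents(docs):
    """Build sparse integer term-frequency vectors plus collection statistics.

    `docs` is a list of token lists (tokenization is assumed to happen
    elsewhere). All names and the output layout are hypothetical.
    """
    vocab = {}             # term -> integer term id, assigned in order of first use
    doc_freq = Counter()   # term id -> number of documents containing the term
    coll_freq = Counter()  # term id -> total occurrences across the collection
    vectors = []           # per document: list of (term id, tf) sorted by term id

    for tokens in docs:
        tf = Counter()
        for tok in tokens:
            term_id = vocab.setdefault(tok, len(vocab))
            tf[term_id] += 1
        vectors.append(sorted(tf.items()))
        for term_id, count in tf.items():
            doc_freq[term_id] += 1
            coll_freq[term_id] += count

    stats = {
        "num_docs": len(docs),
        "avg_doc_len": sum(len(d) for d in docs) / len(docs) if docs else 0.0,
        "doc_freq": dict(doc_freq),
        "coll_freq": dict(coll_freq),
    }
    return vocab, vectors, stats
```

Keeping term ids sorted within each vector is a deliberate choice in this sketch: sorted ids can be delta-encoded into small gaps, which is the form integer compression schemes generally handle best.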

This would make it quick to convert the raw term-frequency vectors into weighted vectors (TF-IDF, BM25, etc.) or into signatures.
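To show why raw term frequencies plus collection statistics are enough for this conversion, here is a hedged Python sketch of two common weightings. The function names, the `(term id, tf)` vector layout, and the `doc_freq` mapping are assumptions for illustration. Both TF-IDF and BM25 have many variants; the forms below (plain `tf * log(N / df)` and a BM25 with a smoothed, non-negative IDF) are just one choice.

```python
import math


def tfidf_weights(vector, doc_freq, num_docs):
    """Weight a sparse (term id, tf) vector with tf * log(N / df)."""
    return [(t, tf * math.log(num_docs / doc_freq[t])) for t, tf in vector]


def bm25_weights(vector, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Weight a sparse (term id, tf) vector with one common BM25 variant.

    k1 and b are the usual free parameters; the defaults here are
    conventional starting points, not tuned values.
    """
    doc_len = sum(tf for _, tf in vector)
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    weighted = []
    for t, tf in vector:
        df = doc_freq[t]
        # Smoothed IDF that stays non-negative even for very common terms.
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        weighted.append((t, idf * tf * (k1 + 1.0) / (tf + norm)))
    return weighted
```

Because neither function needs the original text, different weightings can be tried by re-reading the stored integer vectors rather than re-indexing the collection.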

It would be great to do the same for bigrams, and to invent or reuse a weighting scheme that incorporates pointwise mutual information (http://nlpwp.org/book/chap-ngrams.xhtml#chap-ngrams-bigrams).
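For reference, the pointwise mutual information of a bigram compares its observed probability with what independence of the two words would predict: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The Python sketch below estimates it from raw counts. The function name `bigram_pmi` and the choice not to form bigrams across document boundaries are assumptions, and how PMI should be combined with term frequency in a final weight is left open, as in the description above.

```python
import math
from collections import Counter


def bigram_pmi(docs):
    """Estimate PMI (in bits) for every adjacent word pair in `docs`.

    `docs` is a list of token lists. Probabilities are maximum-likelihood
    estimates from unigram and bigram counts; no smoothing is applied.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        # Pair each token with its successor inside the same document only.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi
```

Unsmoothed PMI is known to overrate pairs of rare words, so a practical scheme would likely need a minimum count threshold or some form of discounting.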
