Open
Description
It would be ideal for the indexer to output integer-valued document vectors containing term frequencies. These could optionally be written to disk in a compressed format using https://github.com/lemire/FastPFOR, which would allow easy experimentation with different representation approaches. The indexer should also output collection-level term statistics.
This would allow quick conversion of the vectors to weighted vectors (TF-IDF, BM25, etc.) or to signatures.
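To make the idea concrete, here is a rough Python sketch of what the raw output and the later reweighting could look like. Everything in it is an assumption for illustration: the function names (`index`, `tfidf`, `bm25`), the whitespace tokenizer, the dict-based vector layout, and the particular TF-IDF and BM25 variants are not taken from the existing indexer. Compression with FastPFOR is left out entirely.

```python
import math
from collections import Counter


def index(docs):
    """Build integer term-frequency vectors plus collection statistics.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    vocab = {}            # term -> integer term id
    vectors = []          # one {term_id: term_frequency} dict per document
    doc_freq = Counter()  # term_id -> number of documents containing the term
    for text in docs:
        tf = Counter()
        for token in text.lower().split():
            term_id = vocab.setdefault(token, len(vocab))
            tf[term_id] += 1
        vectors.append(dict(tf))
        doc_freq.update(tf.keys())
    stats = {
        "num_docs": len(docs),
        "doc_freq": dict(doc_freq),
        "avg_doc_len": sum(sum(v.values()) for v in vectors) / len(docs),
    }
    return vocab, vectors, stats


def tfidf(vector, stats):
    """Weight one raw vector as tf * log(N / df)."""
    n = stats["num_docs"]
    return {t: tf * math.log(n / stats["doc_freq"][t]) for t, tf in vector.items()}


def bm25(vector, stats, k1=1.2, b=0.75):
    """Weight one raw vector with a common BM25 variant (k1 and b are assumed defaults)."""
    n = stats["num_docs"]
    doc_len = sum(vector.values())
    norm = k1 * (1 - b + b * doc_len / stats["avg_doc_len"])
    weights = {}
    for t, tf in vector.items():
        df = stats["doc_freq"][t]
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        weights[t] = idf * tf * (k1 + 1) / (tf + norm)
    return weights
```

Because the stored vectors hold only integer counts, trying a different weighting scheme is a single pass over the vectors and the statistics, with no need to re-index the text.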
It would also be great to do the same with bigrams, and to invent or reuse a weighting scheme that incorporates pointwise mutual information (http://nlpwp.org/book/chap-ngrams.xhtml#chap-ngrams-bigrams).
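As a starting point for discussion, plain bigram PMI can be computed from unigram and bigram counts as in the sketch below. The function name `bigram_pmi` and the maximum-likelihood probability estimates are assumptions, and this is only the PMI term itself, not a complete weighting scheme.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent token pair.

    PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) ), with probabilities
    estimated as relative frequencies (no smoothing).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return pmi
```

Unsmoothed PMI tends to overrate rare pairs, so a real scheme would likely need a frequency cutoff or some discounting before the scores are used as weights.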