removal_least_used parameter is improperly used in document-similarity-s1-rank_filter #430

@marekhorst

After comparing the two ranking scripts, document-similarity-s1-rank_filter.pig and document-similarity-s1-ship-rank_filter.pig, and inspecting document-similarity-s1-rank_filter.pig more closely, it seems that removal_least_used is used improperly: it should be compared against the number of referenced docs ($1) instead of the rank position ($0).

Because of this bug, far fewer terms are currently filtered out than intended. In most cases only terms referenced once are discarded: the rank index is not dense and there are almost always more than 20 terms with a single document reference, so those tied terms share the lowest rank and the next rank already exceeds the threshold. In the current OpenAIRE document similarity configuration removal_least_used is set to 20, so all terms referenced in fewer than 20 documents should be filtered out.
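
To illustrate the effect described above, here is a minimal Python sketch (not the Pig script itself). It assumes the rank is assigned in ascending order of document count with gaps after ties, and that the buggy filter keeps rows whose rank exceeds removal_least_used. The exact comparison operator in the script is an assumption, and the term names, counts, and helper functions are hypothetical.

```python
def competition_rank(doc_counts):
    """Return (rank, term, count) rows ranked ascending by count.

    Ties share a rank and the following rank skips ahead (non-dense ranking).
    """
    ordered = sorted(doc_counts.items(), key=lambda kv: kv[1])
    rows = []
    prev_count, prev_rank = None, 0
    for position, (term, count) in enumerate(ordered, start=1):
        rank = prev_rank if count == prev_count else position
        rows.append((rank, term, count))
        prev_count, prev_rank = count, rank
    return rows


def keep_by_rank(rows, removal_least_used):
    # Assumed buggy behaviour: threshold compared against the rank position ($0).
    return [term for rank, term, count in rows if rank > removal_least_used]


def keep_by_doc_count(rows, removal_least_used):
    # Intended behaviour: threshold compared against the document count ($1).
    return [term for rank, term, count in rows if count >= removal_least_used]


# Hypothetical data: 25 terms seen in one document, plus a few more frequent ones.
doc_counts = {f"rare{i}": 1 for i in range(25)}
doc_counts.update({"uncommon": 2, "moderate": 5, "common": 20, "frequent": 40})

rows = competition_rank(doc_counts)
buggy = keep_by_rank(rows, 20)
fixed = keep_by_doc_count(rows, 20)

print(buggy)
print(fixed)
```

With this made-up data, the rank-based filter drops only the 25 single-reference terms and keeps "uncommon" and "moderate", while the count-based filter keeps only the terms referenced in at least 20 documents.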
