`backend/tests/regression/search_quality/README.md`

This Python script evaluates the search results for a list of queries.

This script will likely be refactored into an API endpoint in the future.
In the meantime, it is used to evaluate search quality using locally ingested documents.
The key differentiating factor from `answer_quality` is that it can evaluate results without an explicit "ground truth", using the reranker as a reference.

## Usage

```
cd path/to/onyx/backend/tests/regression/search_quality
```

5. Copy `test_queries.json.template` to `test_queries.json` and add/remove test queries in it (see the example entry below). The possible fields are:

- `question: str`: the query
- `question_keyword: Optional[str]`: a modified query used specifically for the retriever
- `ground_truth: Optional[list[GroundTruth]]`: a ranked list of expected search results, each with fields:
  - `doc_source: str`: the document source (e.g., Web, Drive, Linear), currently unused
  - `doc_link: str`: a link associated with the document, used to find the corresponding document in the local index
- `categories: Optional[list[str]]`: a list of categories, used to aggregate evaluation results
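
A minimal illustrative entry, assuming `test_queries.json` is a JSON list of query objects (only the field names come from the list above; the values, link, and top-level structure are made-up examples):

```
[
  {
    "question": "How do I reset my password?",
    "question_keyword": "reset password",
    "ground_truth": [
      {
        "doc_source": "Web",
        "doc_link": "https://example.com/docs/reset-password"
      }
    ],
    "categories": ["account"]
  }
]
```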

6. Copy `search_eval_config.yaml.template` to `search_eval_config.yaml` and specify the search and eval parameters
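
As a rough, hypothetical sketch of what such a config could contain, using only parameters referenced elsewhere in this README (`eval_topk`, `skip_rerank`, `num_returned_hits`); the template defines the actual keys and structure, so copy from it rather than from this sketch:

```
# hypothetical illustration -- copy the real keys from search_eval_config.yaml.template
eval_topk: 10           # evaluation window for the topk metrics
skip_rerank: false      # if true, the reranker-based soft truth is not computed
num_returned_hits: 100  # number of hits retrieved per query
```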

7. Run `run_search_eval.py` to run the search and evaluate the results

```
python run_search_eval.py
```

8. Optionally, save the generated `test_queries.json` in the export folder to reuse the generated `question_keyword`, and rerun the search evaluation with alternative search parameters.

## Metrics
There are two main metrics currently implemented:
- `ratio_topk`: the ratio of documents in the comparison set that appear in the topk search results (higher is better, 0~1)
- `avg_rank_delta`: the average rank difference between the comparison set and the search results (lower is better, 0~inf)

`ratio_topk` gives a general idea of whether the most relevant documents appear first in the search results. Decreasing `eval_topk` makes this metric stricter, requiring relevant documents to appear within a narrower window.

`avg_rank_delta` gives insight into the performance of documents outside the topk search results. If none of the comparison documents are in the topk, `ratio_topk` will simply be 0, whereas `avg_rank_delta` grows larger the worse the search results get.
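
As a minimal sketch of one plausible reading of these definitions (the function names and the handling of missing documents are assumptions, not the actual `run_search_eval.py` implementation):

```
# Illustrative only -- not the actual evaluation code.
def ratio_topk(comparison: list[str], search_results: list[str], eval_topk: int) -> float:
    """Fraction of the comparison set found in the topk search results (0~1)."""
    if not comparison:
        return 0.0
    top = set(search_results[:eval_topk])
    return sum(doc in top for doc in comparison) / len(comparison)


def avg_rank_delta(comparison: list[str], search_results: list[str], miss_rank: int) -> float:
    """Average absolute difference between a document's rank in the comparison
    set and its rank in the search results; missing documents get miss_rank."""
    if not comparison:
        return 0.0
    deltas = []
    for expected_rank, doc in enumerate(comparison):
        found_rank = search_results.index(doc) if doc in search_results else miss_rank
        deltas.append(abs(found_rank - expected_rank))
    return sum(deltas) / len(deltas)
```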

Note that all of these metrics are affected by very narrow search results.
For example, if the topk is 20 but there is only one relevant document, the other 19 documents could be ordered arbitrarily, resulting in a lower score.
Furthermore, there are two versions of the metrics: ground truth and soft truth.

The ground truth includes documents explicitly listed as relevant in the test dataset. The ground truth metrics are only computed if a ground truth set is provided for the question and the documents exist in the index.

The soft truth is built on top of the ground truth (if provided), filling the remaining entries with results from the reranker. The soft truth metrics are only computed if `skip_rerank` is false. Computing the soft truth metrics can be extremely slow, especially for a large `num_returned_hits`. However, the soft truth provides a good basis when there are many relevant documents in no particular order, or for running quick tests without having to explicitly specify which documents are relevant.
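
Conceptually, building the soft truth set amounts to something like the sketch below (the function name and the fixed `size` parameter are assumptions for illustration, not the actual implementation):

```
# Conceptual sketch of the soft truth construction described above.
def build_soft_truth(ground_truth: list[str], reranked: list[str], size: int) -> list[str]:
    """Start from the explicit ground truth (if any), then fill the remaining
    slots with the highest-ranked reranker results not already present."""
    soft_truth = list(ground_truth)[:size]
    for doc in reranked:
        if len(soft_truth) >= size:
            break
        if doc not in soft_truth:
            soft_truth.append(doc)
    return soft_truth
```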