adding percentile and percentile rank docs #10201

Open
wants to merge 10 commits into
base: main
299 changes: 271 additions & 28 deletions _aggregations/metric/percentile-ranks.md
@@ -9,64 +9,307 @@

# Percentile rank aggregations

The `percentile_ranks` aggregation estimates the percentage of observed values that fall at or below each of the given thresholds. For example, if a value is greater than or equal to 80% of the observed values, it has a percentile rank of 80. This is useful for understanding the relative standing of a particular value within a distribution.

Check warning on line 12 in _aggregations/metric/percentile-ranks.md (GitHub Actions / style-job)

[vale] reported by reviewdog 🐶 [OpenSearch.DirectionAboveBelow] Use 'following or later' instead of 'below' for versions or orientation within a document. Use 'above' and 'below' only for physical space or screen descriptions.

For example, if you want to know how a transaction amount of `45` compares to other transaction values in a dataset, a percentile rank aggregation will return a value like `82.3`, which means 82.3% of transactions were less than or equal to `45`.

## Parameters

The `percentile_ranks` aggregation takes the following parameters.

| Parameter | Data type | Required/Optional | Description |
| ---------------------------------------- | ---------------- | ----------------- | ----------- |
| `field` | String | Required | The numeric field on which to compute percentile ranks. Can be omitted when a `script` is provided instead. |
| `values` | Array of doubles | Required | The threshold values for which to calculate percentile ranks. |
| `keyed` | Boolean | Optional | If set to `false`, returns results as an array. Otherwise, returns results as a JSON object. Default is `true`. |
| `tdigest.compression` | Double | Optional | Controls the accuracy and memory usage of the `tdigest` algorithm. See [Precision tuning with tdigest](#precision-tuning-with-tdigest). |
| `hdr.number_of_significant_value_digits` | Integer | Optional | The precision setting for the HDR histogram. See [HDR histogram](#hdr-histogram). |
| `missing` | Number | Optional | The default value used when the target field is missing in a document. |
| `script` | Object | Optional | The script used to compute custom values instead of using a field. Supports inline or stored scripts. |

Check failure on line 25 in _aggregations/metric/percentile-ranks.md (GitHub Actions / style-job)

[vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: tdigest. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.


## Examples

The following examples demonstrate several ways to use `percentile_ranks`.

### Add sample data

First, create a sample index:

```json
PUT /transaction_data
{
  "mappings": {
    "properties": {
      "amount": {
        "type": "double"
      }
    }
  }
}
```
{% include copy-curl.html %}

Add sample numeric values to illustrate percentile rank calculations:

```json
POST /transaction_data/_bulk
{ "index": {} }
{ "amount": 10 }
{ "index": {} }
{ "amount": 20 }
{ "index": {} }
{ "amount": 30 }
{ "index": {} }
{ "amount": 40 }
{ "index": {} }
{ "amount": 50 }
{ "index": {} }
{ "amount": 60 }
{ "index": {} }
{ "amount": 70 }
```
{% include copy-curl.html %}

### Percentile rank aggregation

Run a `percentile_ranks` aggregation to calculate how certain values compare to the overall distribution:

```json
GET /transaction_data/_search
{
  "size": 0,
  "aggs": {
    "rank_check": {
      "percentile_ranks": {
        "field": "amount",
        "values": [25, 55]
      }
    }
  }
}
```
{% include copy-curl.html %}

The response shows that 28.6% of the values are less than or equal to `25` and 71.4% are less than or equal to `55`:

```json
{
  ...
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "rank_check": {
      "values": {
        "25.0": 28.57142857142857,
        "55.0": 71.42857142857143
      }
    }
  }
}
```
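
As a sanity check on these numbers, the following minimal Python sketch computes exact percentile ranks for the seven sample amounts. It is not how OpenSearch produces the result (the aggregation uses an approximation algorithm, `tdigest` by default), and `percentile_rank` is a hypothetical helper name used only for illustration.

```python
def percentile_rank(values, threshold):
    """Return the percentage of values that are less than or equal to threshold."""
    if not values:
        raise ValueError("values must not be empty")
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


# Sample amounts indexed earlier in this section.
amounts = [10, 20, 30, 40, 50, 60, 70]

rank_25 = percentile_rank(amounts, 25)  # 2 of 7 values are <= 25
rank_55 = percentile_rank(amounts, 55)  # 5 of 7 values are <= 55
print(round(rank_25, 2), round(rank_55, 2))  # 28.57 71.43
```

On a dataset this small, the exact ranks match the aggregation output; on large datasets, the approximate result may differ slightly.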

### Keyed response

You can change the format of the aggregation response by setting the `keyed` parameter to `false`:

```json
GET /transaction_data/_search
{
  "size": 0,
  "aggs": {
    "rank_check": {
      "percentile_ranks": {
        "field": "amount",
        "values": [25, 55],
        "keyed": false
      }
    }
  }
}
{% include copy-curl.html %}

The response includes an array instead of an object:

```json
{
  ...
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "rank_check": {
      "values": [
        {
          "key": 25,
          "value": 28.57142857142857
        },
        {
          "key": 55,
          "value": 71.42857142857143
        }
      ]
    }
  }
}
```

Comment on lines +144 to +163

Contributor: The response here should not be keyed, right?

Contributor Author: Addressed here
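
Client code that consumes this aggregation may need to handle both response shapes. The following Python sketch, assuming the response has already been parsed into dictionaries, normalizes either shape into a mapping from threshold to percentile rank. The `ranks_as_dict` helper is a hypothetical name, not part of any OpenSearch client.

```python
def ranks_as_dict(aggregation):
    """Map each threshold (as a float) to its percentile rank for either response shape."""
    values = aggregation["values"]
    if isinstance(values, dict):
        # Keyed shape (default): {"25.0": 28.57, ...}
        return {float(key): rank for key, rank in values.items()}
    # Unkeyed shape ("keyed": false): [{"key": 25, "value": 28.57}, ...]
    return {float(item["key"]): item["value"] for item in values}


keyed = {"values": {"25.0": 28.57142857142857, "55.0": 71.42857142857143}}
unkeyed = {
    "values": [
        {"key": 25, "value": 28.57142857142857},
        {"key": 55, "value": 71.42857142857143},
    ]
}

# Both shapes carry the same information.
assert ranks_as_dict(keyed) == ranks_as_dict(unkeyed)
print(ranks_as_dict(unkeyed))
```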

In this example, the value `25` is at approximately the 28.6th percentile and the value `55` is at approximately the 71.4th percentile.
### Precision tuning with tdigest

Check failure on line 169 in _aggregations/metric/percentile-ranks.md (GitHub Actions / style-job)

[vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: tdigest. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

Percentile ranks are calculated using the `tdigest` algorithm by default. As with the `percentiles` aggregation, you can control the trade-off between accuracy and memory usage by adjusting the optional `tdigest.compression` setting. Higher values provide better accuracy but require more heap memory. The default value is `100`. For more information about how `tdigest` works, see [Precision tuning with tdigest]({{site.url}}{{site.baseurl}}/aggregations/metric/percentile/#precision-tuning-with-tdigest).

Check failure on line 171 in _aggregations/metric/percentile-ranks.md (GitHub Actions / style-job)

[vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: tdigest. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

The following example sets `tdigest.compression` to `200`:

```json
GET /transaction_data/_search
{
  "size": 0,
  "aggs": {
    "rank_check": {
      "percentile_ranks": {
        "field": "amount",
        "values": [25, 55],
        "tdigest": {
          "compression": 200
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

### HDR histogram

As an alternative to `tdigest`, you can use the High Dynamic Range (HDR) histogram algorithm, which is better suited for large numbers of buckets and fast processing. For more information about how the HDR histogram works, see [HDR histogram]({{site.url}}{{site.baseurl}}/aggregations/metric/percentile/#hdr-histogram).

You should use HDR if you:

* Are aggregating across many buckets.
* Don't require extreme precision in the tail percentiles.
* Have sufficient memory available.

You should avoid HDR if:

* Tail accuracy is important.
* You're analyzing skewed or sparse data distributions.

The following example is configured with `hdr.number_of_significant_value_digits` set to `3`:

```json
GET /transaction_data/_search
{
  "size": 0,
  "aggs": {
    "rank_check": {
      "percentile_ranks": {
        "field": "amount",
        "values": [25, 55],
        "hdr": {
          "number_of_significant_value_digits": 3
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

### Missing values

If some documents are missing the target field, you can instruct the query to use a fallback value by setting the `missing` parameter. The following example treats documents without an `amount` field as if the value were `0` and includes them in the percentile rank computation:

```json
GET /transaction_data/_search
{
  "size": 0,
  "aggs": {
    "rank_check": {
      "percentile_ranks": {
        "field": "amount",
        "values": [25, 55],
        "missing": 0
      }
    }
  }
}
```
{% include copy-curl.html %}
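
To see how a fallback value shifts the result, consider the following Python sketch. It uses exact arithmetic rather than the approximation OpenSearch applies, and it assumes one extra document without an `amount` field, which is hypothetical because the sample index in this section contains no such document.

```python
def percentile_rank(values, threshold):
    """Return the percentage of values that are less than or equal to threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


docs = [{"amount": a} for a in (10, 20, 30, 40, 50, 60, 70)]
docs.append({})  # hypothetical document with no "amount" field

MISSING = 0  # mirrors "missing": 0 in the request

# Without a fallback, documents lacking the field are left out.
ignored = [doc["amount"] for doc in docs if "amount" in doc]
# With a fallback, they are counted using the substitute value.
substituted = [doc.get("amount", MISSING) for doc in docs]

rank_ignored = percentile_rank(ignored, 25)          # 2 of 7 values are <= 25
rank_substituted = percentile_rank(substituted, 25)  # 3 of 8 values are <= 25
print(round(rank_ignored, 2), rank_substituted)  # 28.57 37.5
```

Because the substitute value `0` is lower than both thresholds, it raises each percentile rank.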

### Script

Instead of specifying a field, you can dynamically compute the value using a script. This is useful when you need to apply transformations, such as converting currencies or applying weights.

#### Inline script

The following example uses an inline script to calculate the percentile ranks of the thresholds `30` and `60` against the values of the `amount` field increased by 10% (multiplied by `1.1`):

```json
GET /transaction_data/_search
{
  "size": 0,
  "aggs": {
    "rank_check": {
      "percentile_ranks": {
        "values": [30, 60],
        "script": {
          "source": "doc['amount'].value * 1.1"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
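
The effect of the script can be reproduced offline. The following Python sketch applies the same multiplication to the sample amounts and computes exact ranks for the two thresholds. It is an illustration under the assumption of exact counting, so the aggregation's approximate result may differ slightly on larger datasets.

```python
def percentile_rank(values, threshold):
    """Return the percentage of values that are less than or equal to threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


amounts = [10, 20, 30, 40, 50, 60, 70]

# Mirrors the Painless expression doc['amount'].value * 1.1.
transformed = [amount * 1.1 for amount in amounts]

rank_30 = percentile_rank(transformed, 30)  # 11 and 22 are <= 30
rank_60 = percentile_rank(transformed, 60)  # 11, 22, 33, 44, and 55 are <= 60
print(round(rank_30, 2), round(rank_60, 2))  # 28.57 71.43
```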

#### Stored script

You can also use a stored script. First, create the script using the following request:

```json
POST _scripts/percentile_script
{
  "script": {
    "lang": "painless",
    "source": "doc[params.field].value * params.multiplier"
  }
}
```
{% include copy-curl.html %}

Use the stored script in the `percentile_ranks` aggregation:

```json
GET /transaction_data/_search
{
  "size": 0,
  "aggs": {
    "rank_check": {
      "percentile_ranks": {
        "values": [30, 60],
        "script": {
          "id": "percentile_script",
          "params": {
            "field": "amount",
            "multiplier": 1.1
          }
        }
      }
    }
  }
}
```
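
The stored script differs from the inline version only in that the field name and multiplier arrive as parameters. The following Python sketch is a stand-in for the Painless source (`run_stored_script` is a hypothetical name). It shows that, with the parameters used in the preceding request, both approaches produce the same per-document values.

```python
def run_stored_script(doc, params):
    """Python stand-in for doc[params.field].value * params.multiplier."""
    return doc[params["field"]] * params["multiplier"]


params = {"field": "amount", "multiplier": 1.1}
docs = [{"amount": a} for a in (10, 20, 30, 40, 50, 60, 70)]

stored = [run_stored_script(doc, params) for doc in docs]
inline = [doc["amount"] * 1.1 for doc in docs]

# The parameterized script yields the same values as the inline script.
assert stored == inline
```

Parameterizing the script lets you reuse it with a different field or multiplier without creating a new script.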