diff --git a/_aggregations/metric/percentile-ranks.md b/_aggregations/metric/percentile-ranks.md index 11ad5bb895..124192ecdd 100644 --- a/_aggregations/metric/percentile-ranks.md +++ b/_aggregations/metric/percentile-ranks.md @@ -9,20 +9,78 @@ redirect_from: # Percentile rank aggregations -Percentile rank is the percentile of values at or below a threshold grouped by a specified value. For example, if a value is greater than or equal to 80% of the values, it has a percentile rank of 80. +The `percentile_ranks` aggregation estimates the percentage of observed values that fall below or at given thresholds. This is useful for understanding the relative standing of a particular value within a distribution of values. + +For example, you can use a percentile rank aggregation to learn how a transaction amount of `45` compares to other transaction values in a dataset. The percentile rank aggregation returns a value like `82.3`, which means 82.3% of transactions are less than or equal to `45`. + +## Parameters + +The `percentile_ranks` aggregation takes the following parameters. + +| Parameter | Data type | Required/Optional | Description | +| ---------------------------------------- | ---------------- | ----------------- | ----------------------------------------------------------------------------------------------------------------------------------- | +| `field` | String | Required | The numeric field used to compute percentile ranks. | +| `values` | Array of doubles | Required | The values used to calculate percentile ranks. | +| `keyed` | Boolean | Optional | If set to `false`, returns results as an array. Otherwise returns results as a JSON object. Default is `true`. | +| `tdigest.compression` | Double | Optional | Controls accuracy and memory usage of the `tdigest` algorithm. See [Precision tuning with tdigest](#precision-tuning-with-tdigest). | +| `hdr.number_of_significant_value_digits` | Integer | Optional | The precision setting for the HDR histogram. 
See [HDR histogram](#hdr-histogram). | +| `missing` | Number | Optional | The default value used when the target field is missing in a document. | +| `script` | Object | Optional | The script used to compute custom values instead of using a field. Supports inline and stored scripts. | + + +## Example + + + +First, create a sample index: + +```json +PUT /transaction_data +{ + "mappings": { + "properties": { + "amount": { + "type": "double" + } + } + } +} +``` +{% include copy-curl.html %} + +Add sample numeric values to illustrate percentile rank calculations: + +```json +POST /transaction_data/_bulk +{ "index": {} } +{ "amount": 10 } +{ "index": {} } +{ "amount": 20 } +{ "index": {} } +{ "amount": 30 } +{ "index": {} } +{ "amount": 40 } +{ "index": {} } +{ "amount": 50 } +{ "index": {} } +{ "amount": 60 } +{ "index": {} } +{ "amount": 70 } +``` +{% include copy-curl.html %} + + +Run a `percentile_ranks` aggregation to calculate how certain values compare to the overall distribution: ```json -GET opensearch_dashboards_sample_data_ecommerce/_search +GET /transaction_data/_search { "size": 0, "aggs": { - "percentile_rank_taxful_total_price": { + "rank_check": { "percentile_ranks": { - "field": "taxful_total_price", - "values": [ - 10, - 15 - ] + "field": "amount", + "values": [25, 55] } } } @@ -30,40 +88,97 @@ GET opensearch_dashboards_sample_data_ecommerce/_search ``` {% include copy-curl.html %} -#### Example response +The response demonstrates that 28.6% of the values are less than or equal to `25` and 71.4% are less than or equal to `55`: ```json -... -"aggregations" : { - "percentile_rank_taxful_total_price" : { - "values" : { - "10.0" : 0.055096056411283456, - "15.0" : 0.0830092961834656 +{ + ... 
+ "hits": { + "total": { + "value": 7, + "relation": "eq" + }, + "max_score": null, + "hits": [] + }, + "aggregations": { + "rank_check": { + "values": { + "25.0": 28.57142857142857, + "55.0": 71.42857142857143 + } } } - } } ``` -This response indicates that the value `10` is at the `5.5`th percentile and the value `15` is at the `8.3`rd percentile. +## Keyed response -As with the `percentiles` aggregation, you can control the level of approximation by setting the optional `tdigest.compression` field. A larger value increases the precision of the approximation but uses more heap space. The default value is 100. +You can change the format of the returned aggregation from a JSON object to a list of key-value pairs by setting the `keyed` parameter to `false`: -For example, use the following request to set `compression` to `200`: +```json +GET /transaction_data/_search +{ + "size": 0, + "aggs": { + "rank_check": { + "percentile_ranks": { + "field": "amount", + "values": [25, 55], + "keyed": false + } + } + } +} +``` +{% include copy-curl.html %} + +The response includes an array instead of an object: ```json -GET opensearch_dashboards_sample_data_ecommerce/_search +{ + ... + "hits": { + "total": { + "value": 7, + "relation": "eq" + }, + "max_score": null, + "hits": [] + }, + "aggregations": { + "rank_check": { + "values": [ + { + "key": 25, + "value": 28.57142857142857 + }, + { + "key": 55, + "value": 71.42857142857143 + } + ] + } + } +} +``` + +## Precision tuning with tdigest + +By default, percentile ranks are calculated using the `tdigest` algorithm. You can control the trade-off between accuracy and memory usage by specifying the `tdigest.compression` parameter. Higher values provide better accuracy but require more memory. For more information about how tdigest works, see [Precision tuning with tdigest]({{site.url}}{{site.baseurl}}/aggregations/metric/percentile/#precision-tuning-with-tdigest). 
+ +The following example is configured with `tdigest.compression` set to `200`: + +```json +GET /transaction_data/_search { "size": 0, "aggs": { - "percentile_rank_taxful_total_price": { + "rank_check": { "percentile_ranks": { - "field": "taxful_total_price", - "values": [ - 10, - 15 - ], - "tdigest": { + "field": "amount", + "values": [25, 55], + "tdigest": { "compression": 200 } } @@ -72,3 +187,126 @@ GET opensearch_dashboards_sample_data_ecommerce/_search } ``` {% include copy-curl.html %} + +### HDR histogram + +As an alternative to `tdigest`, you can use the High Dynamic Range (HDR) histogram algorithm, which is better suited for large numbers of buckets and fast processing. For more information about how the HDR histogram works, see [HDR histogram]({{site.url}}{{site.baseurl}}/aggregations/metric/percentile/#hdr-histogram). + +You should use HDR if you: + +* Are aggregating across many buckets. +* Don't require extreme precision in the tail percentiles. +* Have sufficient memory available. + +You should avoid HDR if: + +* Tail accuracy is important. +* You're analyzing skewed or sparse data distributions. + +The following example is configured with `hdr.number_of_significant_value_digits` set to `3`: + +```json +GET /transaction_data/_search +{ + "size": 0, + "aggs": { + "rank_check": { + "percentile_ranks": { + "field": "amount", + "values": [25, 55], + "hdr": { + "number_of_significant_value_digits": 3 + } + } + } + } +} +``` +{% include copy-curl.html %} + +### Missing values + +If some documents are missing the target field, you can instruct the query to use a fallback value by setting the `missing` parameter. 
The following example ensures that documents without an `amount` field are treated as if their values are `0` and are included in the percentile ranks computation: + +```json +GET /transaction_data/_search +{ + "size": 0, + "aggs": { + "rank_check": { + "percentile_ranks": { + "field": "amount", + "values": [25, 55], + "missing": 0 + } + } + } +} +``` +{% include copy-curl.html %} + +### Script + +Instead of specifying a field, you can dynamically compute the value using a script. This is useful when you need to apply transformations, such as converting currencies or applying weights. + +#### Inline script + +The following example uses an inline script to calculate the percentile ranks of the transformed values `30` and `60` against values from the `amount` field, increased by 10%: + +```json +GET /transaction_data/_search +{ + "size": 0, + "aggs": { + "rank_check": { + "percentile_ranks": { + "values": [30, 60], + "script": { + "source": "doc['amount'].value * 1.1" + } + } + } + } +} +``` +{% include copy-curl.html %} + +#### Stored script + + +To use a stored script, first create it using the following request: + +```json +POST _scripts/percentile_script +{ + "script": { + "lang": "painless", + "source": "doc[params.field].value * params.multiplier" + } +} +``` +{% include copy-curl.html %} + +Then use the stored script in the `percentile_ranks` aggregation: + +```json +GET /transaction_data/_search +{ + "size": 0, + "aggs": { + "rank_check": { + "percentile_ranks": { + "values": [30, 60], + "script": { + "id": "percentile_script", + "params": { + "field": "amount", + "multiplier": 1.1 + } + } + } + } + } +} +``` +{% include copy-curl.html %} diff --git a/_aggregations/metric/percentile.md b/_aggregations/metric/percentile.md index d9168e4539..e9d0d0b3cb 100644 --- a/_aggregations/metric/percentile.md +++ b/_aggregations/metric/percentile.md @@ -9,22 +9,121 @@ redirect_from: # Percentile aggregations -Percentile is the percentage of the data that's at or below a 
certain threshold value. +The `percentiles` aggregation estimates the value at a given percentile of a numeric field. This is useful for understanding distribution boundaries. -The `percentile` metric is a multi-value metric aggregation that lets you find outliers in your data or figure out the distribution of your data. +For example, a 95th percentile of `load_time` = `120ms` means that 95% of values are less than or equal to 120 ms. -Like the `cardinality` metric, the `percentile` metric is also approximate. +Similarly to the [`cardinality`]({{site.url}}{{site.baseurl}}/aggregations/metric/cardinality/) metric, the `percentile` metric is approximate. -The following example calculates the percentile in relation to the `taxful_total_price` field: +## Parameters + +The `percentiles` aggregation takes the following parameters. + +| Parameter | Data type | Required/Optional | Description | +| ---------------------------------------- | ---------------- | -------- | --------------------------------------------------------------------------------------------------------------------------- | +| `field` | String | Required | The numeric field used to compute percentiles. | +| `percents` | Array of doubles | Optional | The list of percentiles returned in the response. Default is `[1, 5, 25, 50, 75, 95, 99]`. | +| `keyed` | Boolean | Optional | If set to `false`, returns results as an array. Otherwise, returns results as a JSON object. Default is `true`. | +| `tdigest.compression` | Double | Optional | Controls accuracy and memory usage of the `tdigest` algorithm. See [Precision tuning with tdigest](#precision-tuning-with-tdigest). | +| `hdr.number_of_significant_value_digits` | Integer | Optional | The precision setting for the HDR histogram. See [HDR histogram](#hdr-histogram). | +| `missing` | Number | Optional | The default value used when the target field is missing in a document. 
| +| `script` | Object | Optional | The script used to compute custom values instead of using a field. Supports inline and stored scripts. | + +## Example + + + +First, create an index: + +```json +PUT /latency_data +{ + "mappings": { + "properties": { + "load_time": { + "type": "double" + } + } + } +} +``` +{% include copy-curl.html %} + +Add sample numeric values to illustrate percentile calculations: + +```json +POST /latency_data/_bulk +{ "index": {} } +{ "load_time": 20 } +{ "index": {} } +{ "load_time": 40 } +{ "index": {} } +{ "load_time": 60 } +{ "index": {} } +{ "load_time": 80 } +{ "index": {} } +{ "load_time": 100 } +{ "index": {} } +{ "load_time": 120 } +{ "index": {} } +{ "load_time": 140 } +``` + +{% include copy-curl.html %} + +### Percentiles aggregation + +The following example calculates the default set of percentiles for the `load_time` field: + +```json +GET /latency_data/_search +{ + "size": 0, + "aggs": { + "load_time_percentiles": { + "percentiles": { + "field": "load_time" + } + } + } +} +``` +{% include copy-curl.html %} + +By default, the 1st, 5th, 25th, 50th, 75th, 95th, and 99th percentiles are returned: + +```json +{ + ... + "aggregations": { + "load_time_percentiles": { + "values": { + "1.0": 20, + "5.0": 20, + "25.0": 40, + "50.0": 80, + "75.0": 120, + "95.0": 140, + "99.0": 140 + } + } + } +} +``` + +## Custom percentiles + +You can specify the exact percentiles using the `percents` array: ```json -GET opensearch_dashboards_sample_data_ecommerce/_search +GET /latency_data/_search { "size": 0, "aggs": { - "percentile_taxful_total_price": { + "load_time_percentiles": { "percentiles": { - "field": "taxful_total_price" + "field": "load_time", + "percents": [50, 90, 99] } } } @@ -32,39 +131,117 @@ GET opensearch_dashboards_sample_data_ecommerce/_search ``` {% include copy-curl.html %} -#### Example response +The response includes only the three requested percentile aggregations: ```json -... 
-"aggregations" : { - "percentile_taxful_total_price" : { - "values" : { - "1.0" : 21.984375, - "5.0" : 27.984375, - "25.0" : 44.96875, - "50.0" : 64.22061688311689, - "75.0" : 93.0, - "95.0" : 156.0, - "99.0" : 222.0 +{ + ... + "aggregations": { + "load_time_percentiles": { + "values": { + "50.0": 80, + "90.0": 140, + "99.0": 140 + } } } - } } ``` -You can control the level of approximation using the optional `tdigest.compression` field. A larger value indicates that the data structure that approximates percentiles is more accurate but uses more heap space. The default value is 100. +### Keyed response -For example, use the following request to set `compression` to `200`: +You can change the format of the returned aggregation from a JSON object to a list of key-value pairs by setting the `keyed` parameter to `false`: ```json -GET opensearch_dashboards_sample_data_ecommerce/_search +GET /latency_data/_search { "size": 0, "aggs": { - "percentile_taxful_total_price": { + "load_time_percentiles": { "percentiles": { - "field": "taxful_total_price", - "tdigest": { + "field": "load_time", + "keyed": false + } + } + } +} +``` +{% include copy-curl.html %} + +The response provides percentiles as an array of values: + +```json +{ + ... + "aggregations": { + "load_time_percentiles": { + "values": [ + { + "key": 1, + "value": 20 + }, + { + "key": 5, + "value": 20 + }, + { + "key": 25, + "value": 40 + }, + { + "key": 50, + "value": 80 + }, + { + "key": 75, + "value": 120 + }, + { + "key": 95, + "value": 140 + }, + { + "key": 99, + "value": 140 + } + ] + } + } +} +``` + +### Precision tuning with tdigest + +The `tdigest` algorithm is the default method used to calculate percentiles. It provides a memory-efficient way to estimate percentile ranks, especially when working with floating-point data such as response times or latencies. 
+ +Unlike exact percentile calculations, `tdigest` uses a probabilistic approach that groups values into _centroids_---small clusters that summarize the distribution. This method enables accurate estimates for most percentiles without needing to store all the raw data in memory. + +The algorithm is designed to be highly accurate near the tails of the distribution---the low percentiles (such as 1st) and high percentiles (such as 99th)---which are often the most important for performance analysis. You can control the precision of the results using the `compression` parameter. + +A higher `compression` value means that more centroids are used, which increases accuracy (especially in the tails) but requires more memory and CPU. A lower `compression` value reduces memory usage and speeds up execution, but the results may be less accurate. + + +Use `tdigest` when: + +* Your data includes floating-point values, such as response times, latency, or duration. +* You need accurate results in the extreme percentiles, for example, the 1st or 99th. + +Avoid `tdigest` when: + +* You are working only with integer data and want maximum speed. +* You care less about accuracy in the distribution tails and prefer faster aggregation (consider using [`hdr`](#hdr-histogram) instead). + + The following example sets `tdigest.compression` to `200`: + +```json +GET /latency_data/_search +{ + "size": 0, + "aggs": { + "load_time_percentiles": { + "percentiles": { + "field": "load_time", + "tdigest": { "compression": 200 } } @@ -72,18 +249,52 @@ GET opensearch_dashboards_sample_data_ecommerce/_search } } ``` +{% include copy-curl.html %} + +### HDR histogram + +The High Dynamic Range (HDR) histogram is an alternative to [`tdigest`](#precision-tuning-with-tdigest) for calculating percentiles. It is especially useful when dealing with large datasets and latency measurements. 
It is designed for speed and supports a wide dynamic range of values while maintaining a fixed, configurable level of precision. + +Unlike [`tdigest`](#precision-tuning-with-tdigest), which offers more accuracy in the tails of a distribution (extreme percentiles), HDR prioritizes speed and uniform accuracy across the range. It works best when the number of buckets is large and extreme precision in rare values is not required. -The default percentiles returned are `1, 5, 25, 50, 75, 95, 99`. You can specify other percentiles in the optional `percents` field. For example, to get the 99.9th and 99.99th percentiles, run the following request: +For example, if you're measuring response times ranging from 1 microsecond to 1 hour and configure HDR with 3 significant digits, it will record values with a precision of ±1 microsecond for values up to 1 millisecond and ±3.6 seconds for values near 1 hour. + +This trade-off makes HDR much faster and more memory-intensive than [`tdigest`](#precision-tuning-with-tdigest). + +The following table presents the breakdown of HDR significant digits. + +| Significant digits | Relative precision (max error) | +| ------------------ | ------------------------------ | +| 1 | 1 part in 10 = 10% | +| 2 | 1 part in 100 = 1% | +| 3 | 1 part in 1,000 = 0.1% | +| 4 | 1 part in 10,000 = 0.01% | +| 5 | 1 part in 100,000 = 0.001% | + +You should use HDR if you: + +* Are aggregating across many buckets. +* Don't require extreme precision in the tail percentiles. +* Have sufficient memory available. + +You should avoid HDR if: + +* Tail accuracy is important. +* You are analyzing skewed or sparse data distributions. 
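The precision table above can be sketched as a quick calculation. The following illustrative Python (not part of OpenSearch; the helper name is hypothetical) computes the worst-case absolute error for a value recorded at a given number of significant digits:

```python
def hdr_max_error(value, significant_digits):
    # HDR records values with a relative precision of 1 part in
    # 10**significant_digits, so the worst-case absolute error
    # scales with the magnitude of the value being recorded.
    return value * 10 ** -significant_digits

# With 3 significant digits, a value near 1 hour (3,600 s) is
# recorded within roughly +/-3.6 s, and a value near 1 ms
# (0.001 s) within roughly +/-1 microsecond.
print(hdr_max_error(3600.0, 3))
print(hdr_max_error(0.001, 3))
```

This reproduces the ±1 microsecond and ±3.6 second bounds from the response-time example above.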
+ +The following example is configured with `hdr.number_of_significant_value_digits` set to `3`: ```json -GET opensearch_dashboards_sample_data_ecommerce/_search +GET /latency_data/_search { "size": 0, "aggs": { - "percentile_taxful_total_price": { + "load_time_percentiles": { "percentiles": { - "field": "taxful_total_price", - "percents": [99.9, 99.99] + "field": "load_time", + "hdr": { + "number_of_significant_value_digits": 3 + } } } } @@ -91,4 +302,90 @@ GET opensearch_dashboards_sample_data_ecommerce/_search ``` {% include copy-curl.html %} -The specified value overrides the default percentiles, so only the percentiles you specify are returned. +### Missing values + +Use the `missing` setting to configure a fallback value for documents that do not contain the target field: + +```json +GET /latency_data/_search +{ + "size": 0, + "aggs": { + "load_time_percentiles": { + "percentiles": { + "field": "load_time", + "missing": 0 + } + } + } +} +``` +{% include copy-curl.html %} + +## Script + +Instead of specifying a field, you can dynamically compute the value using a script. This is useful when you need to apply transformations, such as converting currencies or applying weights. 
+
+### Inline script
+
+Use a script to compute derived values:
+
+```json
+GET /latency_data/_search
+{
+  "size": 0,
+  "aggs": {
+    "adjusted_percentiles": {
+      "percentiles": {
+        "script": {
+          "source": "doc['load_time'].value * 1.2"
+        },
+        "percents": [50, 95]
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+### Stored script
+
+First, create a sample script using the following request:
+
+```json
+POST _scripts/load_script
+{
+  "script": {
+    "lang": "painless",
+    "source": "doc[params.field].value * params.multiplier"
+  }
+}
+```
+{% include copy-curl.html %}
+
+Then use the stored script in the `percentiles` aggregation, providing the `params` required by the stored script:
+
+```json
+GET /latency_data/_search
+{
+  "size": 0,
+  "aggs": {
+    "adjusted_percentiles": {
+      "percentiles": {
+        "script": {
+          "id": "load_script",
+          "params": {
+            "field": "load_time",
+            "multiplier": 1.2
+          }
+        },
+        "percents": [50, 95]
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
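As a closing sanity check, the percentile-rank figures returned in the `percentile_ranks` examples above can be reproduced by hand: the rank of a threshold is simply the percentage of observed values at or below it. The following Python sketch is purely illustrative and is not part of OpenSearch:

```python
def percentile_rank(values, threshold):
    # Percentage of observed values less than or equal to the threshold.
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)

# The seven sample documents indexed into transaction_data above.
amounts = [10, 20, 30, 40, 50, 60, 70]

print(percentile_rank(amounts, 25))  # 28.57... (2 of 7 values are <= 25)
print(percentile_rank(amounts, 55))  # 71.42... (5 of 7 values are <= 55)
```

These match the `28.57142857142857` and `71.42857142857143` values shown in the example response; on larger datasets the aggregation's `tdigest` or `hdr` estimates may differ slightly from this exact calculation.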