explore-analyze/machine-learning/anomaly-detection/ml-delayed-data-detection.md
Lines changed: 64 additions & 0 deletions
@@ -17,3 +17,67 @@ When you create a {{dfeed}}, you can specify a [`query_delay`](https://www.elast
::::{important}
If you get an error that says `Datafeed missed XXXX documents due to ingest latency`, consider increasing the value of `query_delay`. If that doesn't help, investigate the ingest latency and its cause. You can do this by comparing event and ingest timestamps. High latency is often caused by bursts of ingested documents, misconfiguration of the ingest pipeline, or misalignment of system clocks.
::::
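
One way to compare event and ingest timestamps is sketched below in Python. The field name `ingested_at`, the helper `ingest_latency_seconds`, the 60-second `query_delay`, and the sample documents are all hypothetical; substitute whichever fields your ingest pipeline actually writes.

```python
from datetime import datetime


def ingest_latency_seconds(doc, event_field="@timestamp", ingest_field="ingested_at"):
    """Seconds between when an event happened and when it was ingested.

    Both field names are assumptions for this sketch, not fixed names.
    """
    event_time = datetime.fromisoformat(doc[event_field])
    ingest_time = datetime.fromisoformat(doc[ingest_field])
    return (ingest_time - event_time).total_seconds()


# Hypothetical sample documents with ISO 8601 timestamps.
docs = [
    {"@timestamp": "2024-01-01T00:00:00+00:00", "ingested_at": "2024-01-01T00:00:30+00:00"},
    {"@timestamp": "2024-01-01T00:01:00+00:00", "ingested_at": "2024-01-01T00:04:00+00:00"},
]

query_delay_seconds = 60  # assumed query_delay of 60s, for illustration only

latencies = [ingest_latency_seconds(d) for d in docs]
# Documents whose ingest latency exceeds the assumed query_delay.
late_docs = [d for d, lat in zip(docs, latencies) if lat > query_delay_seconds]
worst_latency = max(latencies)
```

If the worst latency regularly exceeds `query_delay`, documents are likely to arrive after the datafeed has already searched their time range.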

## Why worry about delayed data?

If data is delayed randomly (and consequently is missing from the analysis), the
results of certain types of functions are largely unaffected. In these
situations, the effect averages out because the delayed data is distributed
randomly. An example is a `mean` metric for a field in a large collection
of data. In this case, checking for delayed data may not provide much benefit.
If data is consistently delayed, however, {{anomaly-jobs}} with a `low_count`
function may produce false positives. In this situation, it is useful to
see whether data arrives after an anomaly is recorded so that you can determine
a next course of action.
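
The difference between the two cases can be illustrated with a small, made-up Python example. The bucket contents, the 20% delay rate, and the helper `bucket_stats` are assumptions chosen for illustration. Dropping every fifth document stands in for a delay that is unrelated to the field value.

```python
from statistics import mean


def bucket_stats(values):
    """Return (doc_count, mean) for one bucket of field values."""
    return len(values), mean(values)


# Hypothetical bucket: 1,000 documents whose field value cycles through 90..110.
full_bucket = [90 + (i % 21) for i in range(1000)]

# Assume every fifth document (20%) is delayed and so is missing at analysis time.
seen_at_analysis = [v for i, v in enumerate(full_bucket) if i % 5 != 0]

full_count, full_mean = bucket_stats(full_bucket)
seen_count, seen_mean = bucket_stats(seen_at_analysis)

# Relative change in the two quantities caused by the missing documents.
count_drop = 1 - seen_count / full_count
mean_shift = abs(seen_mean - full_mean) / full_mean
```

In this sketch the document count falls by 20% while the mean moves by less than 0.1%, which is why a count-based function is far more sensitive to missing data than a `mean` metric.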

## How do we detect delayed data?

In addition to the `query_delay` field, there is a delayed data check
configuration (`delayed_data_check_config`), which enables you to configure the
{{dfeed}} to look in the past for delayed data. Every 15 minutes or every
`check_window`, whichever is smaller, the {{dfeed}} triggers a document search
over the configured indices. This search covers a time span of length
`check_window` that ends with the latest finalized bucket. That time span is
partitioned into buckets whose length equals the bucket span of the associated
{{anomaly-job}}. The `doc_count` of each of those buckets is then compared with
the job's finalized analysis buckets to see whether any data has arrived since
the analysis. If data is indeed missing because of ingest delay, the end user is
notified. For example, you can see annotations in {{kib}}
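
The check described above can be sketched in plain Python. This is an illustration of the logic only, not the actual implementation. The function names, the use of epoch seconds, and the assumption that `check_window` is a whole multiple of the bucket span are all hypothetical.

```python
from datetime import timedelta


def check_interval(check_window):
    """How often the check runs: every 15 minutes or every check_window, whichever is smaller."""
    return min(timedelta(minutes=15), check_window)


def window_buckets(latest_finalized_end, check_window, bucket_span):
    """Bucket start times (epoch seconds) covering the check window.

    The window has length check_window and ends where the latest finalized
    bucket ends. Assumes check_window is a multiple of bucket_span.
    """
    start = latest_finalized_end - check_window
    return list(range(start, latest_finalized_end, bucket_span))


def find_delayed(search_counts, analyzed_counts):
    """Map bucket start -> number of documents that arrived after the analysis.

    search_counts: doc_count per bucket from a fresh search of the indices.
    analyzed_counts: doc_count per bucket as seen by the finalized analysis.
    """
    return {
        bucket: search_counts[bucket] - analyzed_counts.get(bucket, 0)
        for bucket in sorted(search_counts)
        if search_counts[bucket] > analyzed_counts.get(bucket, 0)
    }


# Example: a 2h check window with 15m buckets, ending at t=7200s.
buckets = window_buckets(latest_finalized_end=7200, check_window=7200, bucket_span=900)
delayed = find_delayed({0: 10, 900: 12}, {0: 10, 900: 9})
```

Here `delayed` reports three late documents for the bucket that starts at 900 seconds, which is the kind of discrepancy that would trigger a notification.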