Metric Correlations
The Metric Correlations feature helps you quickly identify metrics and charts relevant to a specific time window of interest, allowing for faster root cause analysis.
By filtering your standard Netdata dashboard to display only the most relevant charts, Metric Correlations makes it easier for you to pinpoint anomalies and investigate issues.
Since it leverages every available metric in your infrastructure with up to 1-second granularity, Metric Correlations provides you with highly accurate insights.
Using Metric Correlations
When viewing the Metrics tab or a single-node dashboard, you'll find the Metric Correlations button in the top-right corner.
To start:
- Click Metric Correlations.
- Highlight a selection of metrics on a single chart. The selected timeframe must be at least 15 seconds.
- The menu displays details about your selected area and reference baseline. Metric Correlations compares your highlighted window to a reference baseline, which is four times its length and precedes it immediately.
- Click Find Correlations. This button is only active if you've selected a valid timeframe.
- The process evaluates all your available metrics and returns a filtered Netdata dashboard showing only the most changed metrics between the baseline and your highlighted window.
- If needed, select another window and press Find Correlations again to refine your analysis.
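The relationship between the highlighted window and its reference baseline can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the steps above (a baseline four times as long that ends where the highlight starts, and a 15-second minimum selection). The `Window` type and the `baseline_for` helper are hypothetical names, not part of Netdata.

```python
from typing import NamedTuple


class Window(NamedTuple):
    after: int   # start of the window, Unix seconds
    before: int  # end of the window, Unix seconds


MIN_HIGHLIGHT_SECONDS = 15  # minimum selection length stated in the steps above
BASELINE_MULTIPLIER = 4     # the baseline is four times the highlight length


def baseline_for(highlight: Window) -> Window:
    """Return the reference baseline that immediately precedes the highlight."""
    length = highlight.before - highlight.after
    if length < MIN_HIGHLIGHT_SECONDS:
        raise ValueError(f"highlight must span at least {MIN_HIGHLIGHT_SECONDS} seconds")
    return Window(after=highlight.after - BASELINE_MULTIPLIER * length,
                  before=highlight.after)


highlight = Window(after=1684852500, before=1684852560)  # a 60-second selection
print(baseline_for(highlight))  # Window(after=1684852260, before=1684852500)
```

A 60-second highlight therefore gets a 240-second baseline, so the full comparison spans five minutes of data.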
Integration with Anomaly Detection
You can combine Metric Correlations with Anomaly Detection for powerful troubleshooting:
When you notice an anomaly in your system, use Metric Correlations with the Anomaly Rate data type to quickly identify which metrics are contributing to the anomalous behavior.
How to Use Together
This workflow helps you move from detecting that "something is wrong" to understanding exactly which components are behaving abnormally, significantly reducing your troubleshooting time.
API Access
You can access anomaly detection data and use it with metric correlations through Netdata's API:
Querying Anomaly Bits
To get the anomaly bits for any metric, add the options=anomaly-bit parameter to your API query:
https://your-netdata-node/api/v1/data?chart=system.cpu&dimensions=user&after=-60&options=anomaly-bit
Sample response:
{
  "labels": ["time", "user"],
  "data": [
    [1684852570, 0],
    [1684852569, 0],
    [1684852568, 0],
    [1684852567, 0],
    [1684852566, 0],
    [1684852565, 0],
    [1684852564, 0],
    [1684852563, 0],
    [1684852562, 0],
    [1684852561, 0]
  ]
}
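If you prefer to run this query from a script, the following minimal Python sketch builds the same URL and extracts the timestamps where the anomaly bit is set. It uses only the standard library. The helper names (`anomaly_bit_url`, `anomalous_timestamps`, `fetch`) are hypothetical, and `https://your-netdata-node` is the same placeholder host used in the query above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def anomaly_bit_url(host: str, chart: str, dimension: str, after: int = -60) -> str:
    """Build a /api/v1/data URL that requests anomaly bits instead of raw values."""
    query = urlencode({
        "chart": chart,
        "dimensions": dimension,
        "after": after,
        "options": "anomaly-bit",
    })
    return f"{host}/api/v1/data?{query}"


def anomalous_timestamps(payload: dict, dimension: str) -> list[int]:
    """Return the timestamps at which the given dimension's anomaly bit is set."""
    column = payload["labels"].index(dimension)
    return [row[0] for row in payload["data"] if row[column]]


def fetch(url: str) -> dict:
    """Download and decode the JSON response (needs a reachable Netdata node)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


url = anomaly_bit_url("https://your-netdata-node", "system.cpu", "user")
print(url)
# payload = fetch(url)  # uncomment once the host points at a real node
# print(anomalous_timestamps(payload, "user"))
```

Applied to the sample response above, `anomalous_timestamps` would return an empty list, because every bit is 0.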
Querying Anomaly Rates
For anomaly rates over a time window, use the same parameter but with aggregated data:
https://your-netdata-node/api/v1/data?chart=system.cpu&dimensions=user&after=-600&before=0&points=10&options=anomaly-bit
Sample response showing the percentage of time each metric was anomalous:
{
  "labels": ["time", "user"],
  "data": [
    [1684852770, 0],
    [1684852710, 20],
    [1684852650, 0],
    [1684852590, 10],
    [1684852530, 0],
    [1684852470, 0],
    [1684852410, 30],
    [1684852350, 0],
    [1684852290, 0],
    [1684852230, 0]
  ]
}
You can programmatically access this data to build custom dashboards or alerts based on anomaly patterns in your infrastructure.
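To see how anomaly bits relate to anomaly rates, the sketch below reduces a series of per-second bits to a smaller number of points, each holding the percentage of samples that were anomalous. It assumes the rate is simply the share of set bits in each bucket. The `anomaly_rates` helper is a hypothetical illustration of that arithmetic, not the Agent's own aggregation code.

```python
def anomaly_rates(bits: list[int], points: int) -> list[float]:
    """Split anomaly bits into `points` equal buckets and return the
    percentage of anomalous samples in each bucket."""
    if not bits or points <= 0 or len(bits) % points:
        raise ValueError("bits must be non-empty and divide evenly into points")
    size = len(bits) // points
    return [100.0 * sum(bits[i:i + size]) / size
            for i in range(0, len(bits), size)]


# 20 seconds of bits reduced to 2 points of 10 seconds each
bits = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0] + [0] * 10
print(anomaly_rates(bits, points=2))  # [20.0, 0.0]
```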
Metric Correlations Options
Metric Correlations offers adjustable parameters for deeper data exploration. Since different data types and incidents require different approaches, these settings allow for flexible analysis.
Method
Two algorithms are available for scoring metrics based on changes between the baseline and highlight windows:
- KS2 (Kolmogorov-Smirnov Test): A statistical method comparing distributions between the highlighted and baseline windows to detect significant changes.
- Volume: A heuristic approach based on percentage change in averages, designed to handle edge cases.
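To make the difference between the two methods concrete, here is a textbook-style Python sketch of both ideas. It is not Netdata's implementation: `ks2_statistic` computes the standard two-sample Kolmogorov-Smirnov statistic, and `volume_score` is a plain percentage change in averages without the edge-case handling mentioned above. Both function names, and the choice to return infinity for a zero baseline, are assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import fmean


def ks2_statistic(baseline: list[float], highlight: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two windows (0 = identical, 1 = fully separated)."""
    a, b = sorted(baseline), sorted(highlight)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def volume_score(baseline: list[float], highlight: list[float]) -> float:
    """Percentage change of the highlight average relative to the baseline average."""
    base_avg, high_avg = fmean(baseline), fmean(highlight)
    if base_avg == 0:
        # Assumption: a metric that "turns on" from zero counts as maximally changed.
        return 0.0 if high_avg == 0 else float("inf")
    return 100.0 * (high_avg - base_avg) / abs(base_avg)


# Same average, different spread: only the distribution test reacts.
baseline = [10.0] * 8
highlight = [5.0, 15.0]
print(ks2_statistic(baseline, highlight))  # 0.5
print(volume_score(baseline, highlight))   # 0.0
```

In this example the average is unchanged, so the average-based score is zero, while the distribution comparison still reports a clear difference.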
Aggregation
To accommodate different window lengths, Netdata aggregates your raw data as needed. The default aggregation method is Average, but you can also choose Median, Min, Max, or Stddev.
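The following Python sketch illustrates what aggregating raw samples into fewer points looks like for each of these methods. The lowercase method keys, the `aggregate` helper, and the use of the population standard deviation for Stddev are assumptions for illustration and may not match how Netdata computes them.

```python
from statistics import fmean, median, pstdev

AGGREGATIONS = {
    "average": fmean,
    "median": median,
    "min": min,
    "max": max,
    "stddev": pstdev,  # assumption: population standard deviation
}


def aggregate(samples: list[float], group: int, method: str = "average") -> list[float]:
    """Reduce each consecutive run of `group` samples to a single point."""
    reducer = AGGREGATIONS[method]
    return [reducer(samples[i:i + group]) for i in range(0, len(samples), group)]


samples = [1.0, 3.0, 2.0, 10.0, 4.0, 4.0]
print(aggregate(samples, group=3))                # [2.0, 6.0]
print(aggregate(samples, group=3, method="max"))  # [3.0, 10.0]
```

The second group contains a short spike to 10, which Max preserves and Average smooths down to 6. The aggregation you choose therefore affects which changes the scoring method can see.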
Data Type
Netdata assigns an Anomaly Bit to each of your metrics in real-time, flagging whether it deviates significantly from normal behavior. You can analyze either raw data or anomaly rates:
- Metrics: Runs Metric Correlations on your raw metric values.
- Anomaly Rate: Runs Metric Correlations on anomaly rates for each of your metrics.
Metric Correlations on the Agent
Metric Correlations (MC) requests to Netdata Cloud are handled in two ways:
- If MC is enabled on any of your nodes, the request is routed to the highest-level node (a Parent node or the node itself).
- If MC is not enabled on any of your nodes, Netdata Cloud processes the request by collecting data from your nodes and computing correlations on its backend.
Interpreting Combined Results
When you use Metric Correlations together with Anomaly Detection, you'll want to understand how to interpret the results:
High anomaly rates combined with significant metric changes often indicate genuine issues rather than false positives.
Here's how to interpret different scenarios:
| Anomaly Rate | Metric Correlation | Interpretation |
|---|---|---|
| High | Strong | Likely a significant issue affecting system behavior |
| High | Weak | Possible edge case or intermittent issue |
| Low | Strong | Normal but significant change in system behavior |
| Low | Weak | Likely normal system operation |
By examining both the anomaly rate and the correlation strength, you can prioritize your troubleshooting efforts more effectively.
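If you want to apply this table in a script, for example when triaging many metrics at once, it can be expressed as a simple lookup. The sketch below is only an illustration: the `interpret` helper is hypothetical, and both thresholds are placeholder values that you would need to tune for your own data, since this page does not define what counts as "high" or "strong".

```python
INTERPRETATION = {
    (True, True): "Likely a significant issue affecting system behavior",
    (True, False): "Possible edge case or intermittent issue",
    (False, True): "Normal but significant change in system behavior",
    (False, False): "Likely normal system operation",
}


def interpret(anomaly_rate: float, correlation_score: float,
              rate_threshold: float = 10.0, score_threshold: float = 0.5) -> str:
    """Map an anomaly rate (percent) and a correlation score to the table above.
    Both thresholds are hypothetical defaults, not values defined by Netdata."""
    key = (anomaly_rate >= rate_threshold, correlation_score >= score_threshold)
    return INTERPRETATION[key]


print(interpret(anomaly_rate=30.0, correlation_score=0.8))
# Likely a significant issue affecting system behavior
```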
Usage Tips
When running Metric Correlations from the Metrics tab across multiple nodes, refine your results by grouping by node:
- Run MC on all your nodes if you're unsure which ones are relevant.
- Group the most interesting charts by node to determine whether changes affect all your nodes or just a subset.
- If a subset of your nodes stands out, filter for those nodes and rerun MC to get more precise results.
Choose the Volume algorithm for sparse metrics (e.g., request latency with few requests). Otherwise, use KS2.
- KS2 is ideal for detecting complex distribution changes in your metrics, such as shifts in variance.
- Volume is better for detecting metrics that were inactive and then spiked (or vice versa).
Example:
- Volume can highlight network traffic suddenly turning on in your system.
- KS2 can detect entropy distribution changes in your data missed by Volume.
Combine Volume and Anomaly Rate to identify the most anomalous metrics within your selected timeframe. Expand the anomaly rate chart to visualize results more clearly.