The hash from which we query cpu metrics contains 'iowait' as well as
'wait'. The first is the total amount of time spent waiting on IO;
the second is the percentage of time spent waiting on IO within a
certain time frame.
For the metrics returned by the /cluster/metrics/export endpoint we want
the second one.
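
To illustrate the difference between the two values, here is a minimal
sketch of how a per-interval share can be derived from two samples of a
cumulative counter. The function and parameter names are hypothetical
and this is not necessarily how pvestatd computes 'wait'; it only shows
why the cumulative value is not what a metric consumer wants.

```python
def io_wait_ratio(prev_iowait, prev_total, cur_iowait, cur_total):
    """Share of CPU time spent waiting on IO between two samples.

    All arguments are cumulative counters (hypothetical names), in the
    same unit. The result is a fraction between 0.0 and 1.0; multiply
    by 100 for a percentage.
    """
    total_delta = cur_total - prev_total
    if total_delta <= 0:
        # No time elapsed (or a counter reset): nothing to report.
        return 0.0
    return (cur_iowait - prev_iowait) / total_delta
```

With samples (iowait=100, total=1000) and (iowait=150, total=2000), the
cumulative counter reads 150, while the share for that interval is
50 / 1000.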
Reported-by: Dominik Csapak <d.csapak@proxmox.com>
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
This new endpoint returns node, storage and guest metrics in JSON
format. The endpoint supports history/max-age parameters, allowing
the caller to query the recent metric history as recorded by the
PVE::PullMetric module.
The returned data format is quite simple, being an array of
metric records, including a value, a metric name, an id to identify
the object (e.g. qemu/100, node/foo), a timestamp and a type
('gauge', 'derive', ...). The type property makes the format
self-describing and aids the metric collector in choosing a
representation for storing the metric data.
    [
        ...
        {
            "metric": "cpu_avg1",
            "value": 0.12,
            "timestamp": 170053205,
            "id": "node/foo",
            "type": "gauge"
        },
        ...
    ]
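
As a rough sketch of how a collector might consume this flat format,
the following groups records into one time series per (id, metric,
type). It assumes the caller has already unwrapped any API response
envelope and passes the plain JSON array; group_metrics is a
hypothetical helper, not part of this patch.

```python
import json
from collections import defaultdict


def group_metrics(payload):
    """Group flat metric records into per-series lists of points.

    payload: JSON text containing an array of records with the keys
    'metric', 'value', 'timestamp', 'id' and 'type'.
    Returns a dict mapping (id, metric, type) to a list of
    (timestamp, value) tuples sorted by timestamp.
    """
    records = json.loads(payload)
    series = defaultdict(list)
    for rec in records:
        key = (rec["id"], rec["metric"], rec["type"])
        series[key].append((rec["timestamp"], rec["value"]))
    for points in series.values():
        points.sort()
    return dict(series)
```

Because every record carries its own id, timestamp and type, no
out-of-band schema is needed to rebuild the series.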
Some experiments were made with regard to making the format
more 'efficient', e.g. by grouping based on timestamps/ids, resulting
in a much more nested/complicated data format. While that
certainly reduces the size of the raw JSON response by quite a bit,
after GZIP compression the differences are negligible (the
simple, flat data format as described above compresses by a factor
of 25 for large clusters!). Also, the slightly increased CPU load
of compressing the larger amount of data when e.g. polling once a
minute is so small that it's indistinguishable from noise in relation
to a usual hypervisor workload. Thus the simpler, flat format was
chosen. One benefit of this format is that it is nearly the same
format as the one Prometheus uses, just in JSON, so adding a
Prometheus metric scraping endpoint should not be much work at all.
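
The size comparison described above can be reproduced along these
lines. This is only a sketch on synthetic data: the node names, metric
names, timestamps and the nested layout are assumptions made for
illustration, not the formats or data used in the original
experiments, so the resulting numbers will differ from those quoted.

```python
import gzip
import json


def build_flat(nodes, metrics, timestamps):
    """Build synthetic records in the flat format."""
    records = []
    for n, node in enumerate(nodes):
        for t, ts in enumerate(timestamps):
            for m, metric in enumerate(metrics):
                records.append({
                    "metric": metric,
                    # Deterministic placeholder value, not real data.
                    "value": ((n * 7 + t * 3 + m) % 100) / 100,
                    "timestamp": ts,
                    "id": f"node/{node}",
                    "type": "gauge",
                })
    return records


def build_nested(records):
    """One possible grouped layout: id -> timestamp -> metric."""
    nested = {}
    for rec in records:
        by_ts = nested.setdefault(rec["id"], {})
        by_metric = by_ts.setdefault(str(rec["timestamp"]), {})
        by_metric[rec["metric"]] = [rec["value"], rec["type"]]
    return nested


def sizes(obj):
    """Return (raw_bytes, gzip_bytes) for the compact JSON encoding."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


def compare_formats():
    nodes = [f"host{i}" for i in range(20)]
    metrics = [f"metric_{i}" for i in range(15)]
    timestamps = [1700000000 + 10 * i for i in range(7)]
    flat = build_flat(nodes, metrics, timestamps)
    flat_raw, flat_gzip = sizes(flat)
    nested_raw, nested_gzip = sizes(build_nested(flat))
    return {
        "flat_raw": flat_raw,
        "flat_gzip": flat_gzip,
        "nested_raw": nested_raw,
        "nested_gzip": nested_gzip,
    }
```

Calling compare_formats() returns the four byte counts, so the raw and
compressed sizes of both layouts can be compared side by side.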
The API endpoint collects metrics for the whole cluster by calling
the same endpoint for all cluster nodes. To avoid endless request
recursion, the 'local-only' request parameter is provided. If this
parameter is set, the endpoint implementation will only return metrics
for the local node, avoiding a loop.
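
The fan-out and the role of 'local-only' can be sketched as follows.
This is a simplified model, not the actual Perl implementation:
collect_metrics, local_metrics and fetch_remote are hypothetical names,
with fetch_remote standing in for the API call to another node.

```python
def collect_metrics(node, cluster_nodes, local_metrics, fetch_remote,
                    local_only=False):
    """Collect metrics for the cluster, or only for this node.

    node: name of the node handling the request.
    cluster_nodes: names of all nodes in the cluster.
    local_metrics: callable returning this node's metric records.
    fetch_remote: callable that performs the same request on another
        node; it is always invoked with local_only=True.
    """
    result = list(local_metrics(node))
    if local_only:
        # A peer asked us: answer for ourselves only, do not fan out.
        return result
    for other in cluster_nodes:
        if other == node:
            continue
        # Forwarded requests carry local_only, so the remote node
        # does not query the cluster again.
        result.extend(fetch_remote(other, local_only=True))
    return result
```

Without the flag, each forwarded request would itself fan out to all
other nodes, which in turn would fan out again.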
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
[WB: remove unused $start_time leftover from benchmarks]
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
This commit adds a new module PVE::PullMetric. This module allows
us to store the status data of various subsystems, including status
data for the most recent pvestatd update loops. Right now, we
store 6 old generations; together with the most recent values, that
gives 7 generations, i.e. 70 seconds of stat history (based on a
10 second pvestatd update loop interval).
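
The retention behaviour described above can be modelled with a bounded
queue. This is only an in-memory sketch with hypothetical names; it
does not reflect the API or the storage of proxmox-shared-cache.

```python
from collections import deque


class GenerationCache:
    """Keep the newest status snapshot plus a fixed number of old ones."""

    def __init__(self, old_generations=6):
        # 6 old generations + the most recent one = 7 entries.
        self._entries = deque(maxlen=old_generations + 1)

    def update(self, timestamp, data):
        """Store a new generation; the oldest one is dropped if full."""
        self._entries.append((timestamp, data))

    def history(self, now, max_age):
        """Return (timestamp, data) pairs not older than max_age seconds."""
        return [(ts, data) for ts, data in self._entries
                if now - ts <= max_age]

    def __len__(self):
        return len(self._entries)
```

With one update every 10 seconds, the cache never holds more than 7
generations, and a caller can request only a recent part of them via
max_age.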
This cache allows us to add support for pull-style metric collection
systems, be it Prometheus/OpenMetrics or some custom, JSON based
metric format.
This patch raises the required lib{proxmox,pve}-perl-rs versions,
since we need the new bindings for proxmox-shared-cache.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
[WB: actually bump *runtime* deps in d/control]
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>