Comparison of K8s Metric Collection Solutions
Table of Contents
- 1. Background
- 2. Solutions
- 2.1. MetricBeat
- 2.2 Telegraf
- 2.3 MetricServer
- 2.4 Kubelet cAdvisor
- 3. Comparison
1. Background
Megacloud Portal needs to add monitoring for K8S. The current requirement is to obtain the CPU/Memory metrics of Nodes and Pods in K8S and display the TopN after processing.
To achieve this, the server side does the following:
- Collects resource metrics of K8S Nodes and Pods
- Performs ETL processing and stores the collected data
- Implements an API for the front end to acquire the data
The key is **the collection and processing of metric data**. The following is a brief introduction to and comparison of the related collection solutions.
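The TopN step in the requirement can be sketched as a sort followed by a cut. This is only a minimal illustration: the `Usage` record and the `TopN` helper are hypothetical names, not part of the actual Megacloud Portal code, and the sample values are taken from the `kubectl top` output shown later in this article.

```go
package main

import (
	"fmt"
	"sort"
)

// Usage is a hypothetical, simplified record for one Node or Pod.
type Usage struct {
	Name        string
	CPUMilli    int64 // CPU usage in millicores
	MemoryBytes int64 // memory usage in bytes
}

// TopN returns the n records with the highest key value, in descending
// order. The input slice is copied first, so it is not modified.
func TopN(items []Usage, n int, key func(Usage) int64) []Usage {
	sorted := make([]Usage, len(items))
	copy(sorted, items)
	sort.SliceStable(sorted, func(i, j int) bool {
		return key(sorted[i]) > key(sorted[j])
	})
	if n < 0 {
		n = 0
	}
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	pods := []Usage{
		{Name: "coredns-f9fd979d6-jzv8q", CPUMilli: 4, MemoryBytes: 10 << 20},
		{Name: "etcd-tk01", CPUMilli: 14, MemoryBytes: 50 << 20},
		{Name: "kube-apiserver-tk01", CPUMilli: 31, MemoryBytes: 293 << 20},
	}
	byCPU := func(u Usage) int64 { return u.CPUMilli }
	for _, u := range TopN(pods, 2, byCPU) {
		fmt.Printf("%s %dm\n", u.Name, u.CPUMilli)
	}
}
```

Passing the sort key as a function lets the same helper serve both the CPU and the Memory ranking.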
2. Solutions
2.1. MetricBeat
MetricBeat is a metric collection tool provided by Elastic. It can collect metrics from many open source software projects, including Kubernetes, and can send the data to Elasticsearch, Kafka, Redis, or Logstash for processing or storage.
The following are the data formats MetricBeat collects from K8S Nodes, Pods, and containers.
- Node Data Format (fields other than CPU/Memory have been omitted)
{"@timestamp": "2017-04-06T15:29:27.150Z","beat": {"hostname": "beathost","name": "beathost","version": "6.0.0-alpha1"},"kubernetes": {"node": {"cpu": {"usage": {"core" : {"ns": 7247863769557035},"nanocores": 1662117892}},"memory": {"available": {"bytes": 134202847232},"majorpagefaults": 1044,"pagefaults": 83482928,"rss": {"bytes": 178053120},"usage": {"bytes": 67062091776},"workingset": {"bytes": 51496206336}},"name": "localhost","start_time": "2017-02-08T10:33:38Z"}},"metricset": {"host": "localhost:10255","module": "kubernetes","name": "node","rtt": 650741},"type": "metricsets"
}
{"beat": {"hostname": "X1","name": "X1","version": "6.0.0-alpha1"},"kubernetes": {"node": {"cpu": {"allocatable": {"cores": 2},"capacity": {"cores": 2}},"memory": {"allocatable": {"bytes": 2097786880},"capacity": {"bytes": 2097786880}},"name": "minikube","pod": {"allocatable": {"total": 110},"capacity": {"total": 110}},"status": {"ready": "true","unschedulable": false}}},"metricset": {"host": "192.168.99.100:18080"}
}
- Pod Data Format
{"@timestamp": "2017-04-06T15:29:27.150Z","beat": {"hostname": "beathost","name": "beathost","version": "6.0.0-alpha1"},"kubernetes": {"namespace": "ns","node": {"name": "localhost"},"pod": {"name": "nginx-3137573019-pcfzh","uid": "b89a812e-18cd-11e9-b333-080027190d51","network": {"rx": {"bytes": 18999261,"errors": 0},"tx": {"bytes": 28580621,"errors": 0}},"start_time": "2017-04-06T12:09:05Z"}},"metricset": {"host": "localhost:10255","module": "kubernetes","name": "pod","rtt": 636230},"type": "metricsets"
}
- Container Data Format
{"@timestamp": "2017-04-06T15:29:27.150Z","beat": {"hostname": "beathost","name": "beathost","version": "6.0.0-alpha1"},"kubernetes": {"container": {"cpu": {"usage": {"core": {"ns": 3305756719},"nanocores": 5992}},"memory": {"available": {"bytes": 0},"majorpagefaults": 47,"pagefaults": 2298,"rss": {"bytes": 1441792},"usage": {"bytes": 7643136},"workingset": {"bytes": 1466368}},"name": "nginx"},"namespace": "ns","node": {"name": "localhost"},"pod": {"name": "nginx-3137573019-pcfzh"}},"metricset": {"host": "localhost:10255","module": "kubernetes","name": "container","rtt": 650739},"type": "metricsets"
}
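An ETL tool consuming these events would first have to pull the CPU/Memory fields out of the nested JSON. The following is a minimal sketch of that step in Go, assuming the container event layout shown above. `ContainerUsage` and `ParseContainerEvent` are hypothetical names introduced only for illustration, and the sample input is a trimmed copy of the container event.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// containerEvent mirrors only the fields of the MetricBeat container
// event that matter for CPU/Memory; all other fields are ignored.
type containerEvent struct {
	Kubernetes struct {
		Namespace string `json:"namespace"`
		Pod       struct {
			Name string `json:"name"`
		} `json:"pod"`
		Container struct {
			Name string `json:"name"`
			CPU  struct {
				Usage struct {
					Nanocores int64 `json:"nanocores"`
				} `json:"usage"`
			} `json:"cpu"`
			Memory struct {
				Usage struct {
					Bytes int64 `json:"bytes"`
				} `json:"usage"`
			} `json:"memory"`
		} `json:"container"`
	} `json:"kubernetes"`
}

// ContainerUsage is a hypothetical flat record for downstream processing.
type ContainerUsage struct {
	Namespace    string
	Pod          string
	Container    string
	CPUNanoCores int64
	MemoryBytes  int64
}

// ParseContainerEvent decodes one event and flattens the CPU/Memory fields.
func ParseContainerEvent(raw []byte) (ContainerUsage, error) {
	var ev containerEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return ContainerUsage{}, err
	}
	k := ev.Kubernetes
	return ContainerUsage{
		Namespace:    k.Namespace,
		Pod:          k.Pod.Name,
		Container:    k.Container.Name,
		CPUNanoCores: k.Container.CPU.Usage.Nanocores,
		MemoryBytes:  k.Container.Memory.Usage.Bytes,
	}, nil
}

func main() {
	raw := []byte(`{"kubernetes": {"container": {"cpu": {"usage": {"nanocores": 5992}},
		"memory": {"usage": {"bytes": 7643136}}, "name": "nginx"},
		"namespace": "ns", "pod": {"name": "nginx-3137573019-pcfzh"}}}`)
	u, err := ParseContainerEvent(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s/%s/%s cpu=%dn mem=%dB\n", u.Namespace, u.Pod, u.Container, u.CPUNanoCores, u.MemoryBytes)
}
```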
MetricBeat only provides CPU/Memory metric data for Nodes and containers, not for Pods. If we use MetricBeat for collection, we need to do the following:
- Deploy MetricBeat on K8S, which may require a lot of manual operations
- We ne
2.2 Telegraf
Telegraf is open source software written in Go for metric collection. Like MetricBeat, it provides numerous plugins to collect data from multiple sources.
For Kubernetes, Telegraf provides a Kubernetes plugin to collect data. It gets the data through Kubelet's /stats/summary API. It can also be used together with the Prometheus plugin to collect more metric data.
The following are some of the metric data formats. Like MetricBeat, it does not provide Pod-level CPU/Memory statistics, so these have to be aggregated from the container data.
type NodeMetrics struct {
    NodeName         string             `json:"nodeName"`
    SystemContainers []ContainerMetrics `json:"systemContainers"`
    StartTime        time.Time          `json:"startTime"`
    CPU              CPUMetrics         `json:"cpu"`
    Memory           MemoryMetrics      `json:"memory"`
    Network          NetworkMetrics     `json:"network"`
    FileSystem       FileSystemMetrics  `json:"fs"`
    Runtime          RuntimeMetrics     `json:"runtime"`
}

// PodMetrics contains metric data on a given pod
type PodMetrics struct {
    PodRef     PodReference       `json:"podRef"`
    StartTime  *time.Time         `json:"startTime"`
    Containers []ContainerMetrics `json:"containers"`
    Network    NetworkMetrics     `json:"network"`
    Volumes    []VolumeMetrics    `json:"volume"`
}

// ContainerMetrics represents the metric data collect about a container from the kubelet
type ContainerMetrics struct {
    Name      string            `json:"name"`
    StartTime time.Time         `json:"startTime"`
    CPU       CPUMetrics        `json:"cpu"`
    Memory    MemoryMetrics     `json:"memory"`
    RootFS    FileSystemMetrics `json:"rootfs"`
    LogsFS    FileSystemMetrics `json:"logs"`
}
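The aggregation from container data to Pod-level totals can be sketched as a simple sum over a Pod's containers. The types below are cut-down stand-ins for the structs above: the `UsageNanoCores` and `UsageBytes` fields of `CPUMetrics` and `MemoryMetrics` are assumptions, since those two types are not shown in this article, and `PodUsage`, `SumContainers`, and the "sidecar" container are hypothetical.

```go
package main

import "fmt"

// CPUMetrics and MemoryMetrics are simplified stand-ins; their fields
// are assumed for illustration.
type CPUMetrics struct {
	UsageNanoCores int64 `json:"usageNanoCores"`
}

type MemoryMetrics struct {
	UsageBytes int64 `json:"usageBytes"`
}

// ContainerMetrics keeps only the fields needed for the aggregation.
type ContainerMetrics struct {
	Name   string        `json:"name"`
	CPU    CPUMetrics    `json:"cpu"`
	Memory MemoryMetrics `json:"memory"`
}

// PodUsage is a hypothetical Pod-level total.
type PodUsage struct {
	CPUNanoCores int64
	MemoryBytes  int64
}

// SumContainers adds up the CPU and memory usage of all containers in a Pod.
func SumContainers(containers []ContainerMetrics) PodUsage {
	var total PodUsage
	for _, c := range containers {
		total.CPUNanoCores += c.CPU.UsageNanoCores
		total.MemoryBytes += c.Memory.UsageBytes
	}
	return total
}

func main() {
	containers := []ContainerMetrics{
		{Name: "nginx", CPU: CPUMetrics{UsageNanoCores: 5992}, Memory: MemoryMetrics{UsageBytes: 7643136}},
		{Name: "sidecar", CPU: CPUMetrics{UsageNanoCores: 1008}, Memory: MemoryMetrics{UsageBytes: 1000000}},
	}
	pod := SumContainers(containers)
	fmt.Printf("cpu=%dn memory=%dB\n", pod.CPUNanoCores, pod.MemoryBytes)
}
```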
2.3 MetricServer
MetricServer also obtains metric data through the /stats/summary API provided by Kubelet. MetricServer stores the data in memory and exposes it externally through an API built on the kube-aggregator mechanism.
The fixed URL prefix of the API provided by MetricServer is /apis/metrics/v1alpha1/ (this is the prefix given in the Metrics API design document; current Metrics Server releases serve the API under /apis/metrics.k8s.io/v1beta1/). It is combined with the following paths for external access to metric data. All of them support only the GET method:
- /nodes - Get the metric data of all Nodes.
- /nodes/(node) - Get the metric data of the specified Node.
- /namespaces/(namespace)/pods - Get the metric data of all Pods under a certain namespace.
- /namespaces/(namespace)/pods/(pod) - Get the metric data of the specified Pod.
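How the prefix and the paths above combine can be shown with a few small helpers. This is an illustrative sketch only: the function names are hypothetical, and the prefix is passed in as a parameter because it depends on the API version the cluster actually serves.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// NodesPath returns the path listing the metrics of all Nodes.
func NodesPath(prefix string) string {
	return path.Join(prefix, "nodes")
}

// NodePath returns the path for the metrics of one Node.
func NodePath(prefix, node string) string {
	return path.Join(prefix, "nodes", url.PathEscape(node))
}

// PodsPath returns the path listing the metrics of all Pods in a namespace.
func PodsPath(prefix, namespace string) string {
	return path.Join(prefix, "namespaces", url.PathEscape(namespace), "pods")
}

// PodPath returns the path for the metrics of one Pod.
func PodPath(prefix, namespace, pod string) string {
	return path.Join(PodsPath(prefix, namespace), url.PathEscape(pod))
}

func main() {
	// The prefix is the one quoted in the text; adjust it to the cluster.
	prefix := "/apis/metrics/v1alpha1/"
	fmt.Println(NodesPath(prefix))
	fmt.Println(NodePath(prefix, "tk01"))
	fmt.Println(PodsPath(prefix, "kube-system"))
	fmt.Println(PodPath(prefix, "kube-system", "etcd-tk01"))
}
```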
In addition, we can view the CPU/Memory metrics of Nodes and Pods with the kubectl top command in a terminal.
$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
tk01 217m 10% 5296Mi 68%
vm-0-2-ubuntu 84m 4% 1189Mi 32%
$ kubectl top pods --all-namespaces
NAMESPACE NAME CPU(cores) MEMORY(bytes)
kube-system coredns-f9fd979d6-jzv8q 4m 10Mi
kube-system coredns-f9fd979d6-tx9m4 4m 10Mi
kube-system etcd-tk01 14m 50Mi
kube-system kube-apiserver-tk01 31m 293Mi
K8S provides libraries to access the above APIs. We have implemented the first version of metric-server-collector, which uses the MetricServer API to obtain the CPU/Memory metrics of Nodes and Pods and converts them into the data format we need.
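Part of that conversion is turning quantity strings such as `217m` or `5296Mi` into plain numbers. The sketch below is not the metric-server-collector code: it handles only a few common suffixes and no fractional values, and `ParseCPUMilli` and `ParseMemoryBytes` are hypothetical helpers. A real collector would more likely rely on the `resource.Quantity` type from `k8s.io/apimachinery`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseCPUMilli converts a CPU quantity to millicores. It understands
// nanocores ("n"), millicores ("m"), and whole cores (no suffix).
// Nanocore values are truncated to whole millicores.
func ParseCPUMilli(s string) (int64, error) {
	switch {
	case strings.HasSuffix(s, "n"):
		v, err := strconv.ParseInt(strings.TrimSuffix(s, "n"), 10, 64)
		return v / 1000000, err
	case strings.HasSuffix(s, "m"):
		return strconv.ParseInt(strings.TrimSuffix(s, "m"), 10, 64)
	default:
		v, err := strconv.ParseInt(s, 10, 64)
		return v * 1000, err
	}
}

// ParseMemoryBytes converts a memory quantity to bytes. It understands
// the binary suffixes Ki, Mi, and Gi, and plain byte counts.
func ParseMemoryBytes(s string) (int64, error) {
	units := []struct {
		suffix string
		factor int64
	}{{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}}
	for _, u := range units {
		if strings.HasSuffix(s, u.suffix) {
			v, err := strconv.ParseInt(strings.TrimSuffix(s, u.suffix), 10, 64)
			return v * u.factor, err
		}
	}
	return strconv.ParseInt(s, 10, 64)
}

func main() {
	cpu, err := ParseCPUMilli("217m")
	if err != nil {
		fmt.Println("cpu parse error:", err)
		return
	}
	mem, err := ParseMemoryBytes("5296Mi")
	if err != nil {
		fmt.Println("memory parse error:", err)
		return
	}
	fmt.Printf("cpu=%d millicores, memory=%d bytes\n", cpu, mem)
}
```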
2.4 Kubelet cAdvisor
Kubelet integrates cAdvisor to collect statistics on the CPU, memory, file system, and network usage of the containers on the node. It exposes this metric data externally through the /stats/summary API. The Telegraf and MetricServer solutions described above both obtain their metric data through this API. Therefore, we can access Kubelet's API directly to obtain metric data instead of using those tools.
For this solution, the following changes need to be made based on the current metric-server-collector:
- Rewrite the scraper code so that it accesses the Kubelet API instead of the MetricServer API to obtain metric data.
- Convert the data returned by Kubelet to the required data format.
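The conversion step can be sketched as decoding the /stats/summary response and flattening it into per-Node and per-Pod records. This is a minimal sketch under several assumptions: `Record` and `ConvertSummary` are hypothetical names, only the fields visible in the Appendix response are decoded, the working set is used as the memory figure (a design choice, not a requirement from this article), and fetching the response from Kubelet, including authentication, is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type cpuStats struct {
	UsageNanoCores int64 `json:"usageNanoCores"`
}

type memoryStats struct {
	WorkingSetBytes int64 `json:"workingSetBytes"`
}

// summary mirrors the subset of the /stats/summary response used here.
type summary struct {
	Node struct {
		NodeName string      `json:"nodeName"`
		CPU      cpuStats    `json:"cpu"`
		Memory   memoryStats `json:"memory"`
	} `json:"node"`
	Pods []struct {
		PodRef struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"podRef"`
		CPU    cpuStats    `json:"cpu"`
		Memory memoryStats `json:"memory"`
	} `json:"pods"`
}

// Record is a hypothetical target format for the collector.
type Record struct {
	Kind        string // "node" or "pod"
	Namespace   string // empty for Nodes
	Name        string
	CPUMilli    int64 // nanocores truncated to millicores
	MemoryBytes int64
}

// ConvertSummary turns one summary response into a Node record followed
// by one record per Pod.
func ConvertSummary(raw []byte) ([]Record, error) {
	var s summary
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	records := []Record{{
		Kind:        "node",
		Name:        s.Node.NodeName,
		CPUMilli:    s.Node.CPU.UsageNanoCores / 1000000,
		MemoryBytes: s.Node.Memory.WorkingSetBytes,
	}}
	for _, p := range s.Pods {
		records = append(records, Record{
			Kind:        "pod",
			Namespace:   p.PodRef.Namespace,
			Name:        p.PodRef.Name,
			CPUMilli:    p.CPU.UsageNanoCores / 1000000,
			MemoryBytes: p.Memory.WorkingSetBytes,
		})
	}
	return records, nil
}

func main() {
	// Trimmed copy of the response shown in the Appendix.
	raw := []byte(`{"node": {"nodeName": "tk01",
		"cpu": {"usageNanoCores": 1078216675}, "memory": {"workingSetBytes": 5632106496}},
		"pods": [{"podRef": {"name": "etcd-tk01", "namespace": "kube-system"},
		"cpu": {"usageNanoCores": 161540656}, "memory": {"workingSetBytes": 67014656}}]}`)
	records, err := ConvertSummary(raw)
	if err != nil {
		fmt.Println("convert error:", err)
		return
	}
	for _, r := range records {
		fmt.Printf("%s %s/%s cpu=%dm mem=%dB\n", r.Kind, r.Namespace, r.Name, r.CPUMilli, r.MemoryBytes)
	}
}
```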
3. Comparison
Based on the above information, the development, operation, and maintenance work of each solution compares as follows:
Solution | Development task | Development complexity | Deployment operation | Deployment complexity | Others |
---|---|---|---|---|---|
MetricBeat | Develop ETL tools to process data | ★ ★ ★ | MetricBeat + ETL tools + data transmission & storage components. | ★ ★ ★ | |
Telegraf | Develop Telegraf processor or related ETL tools to process data | ★ ★ ★ | Deploy Telegraf + ETL tools | ★ ★ ★ | |
MetricServer | No additional development required | ☆ | MetricServer + collector | ★ | The data is stored in memory, which consumes resources when the amount of data is large |
Kubelet cAdvisor | Re-implement the collector, expected to take one week | ★ ★ | Only need to deploy collector | ☆ | |
Based on the above comparison, the preliminary conclusions are as follows:
- The data format of MetricBeat does not meet the requirements and needs considerable additional processing, so it is not considered.
- If we only collect CPU/Memory metrics for Nodes and Pods in K8S, Telegraf is overkill and requires additional development, operation, and maintenance work. Unless more K8S metrics need to be collected, it can be set aside for now.
- Compared with the above two solutions, MetricServer is very simple to deploy on K8S, and combined with metric-server-collector it meets the requirements.
- The collector implementation based on the Kubelet API can be regarded as an optimization of the MetricServer-based implementation: it removes an unnecessary component and its resource consumption, and it gives us the raw data to convert on demand.
Reference
- MetricBeat Kubernetes Module
- Telegraf Kubernetes Input Plugin
- Metrics API design
- Metrics Server design
Appendix
- Kubelet /stats/summary API response data
{"node": {"nodeName": "tk01", "systemContainers": [{"name": "kubelet", "startTime": "2021-01-26T05:10:01Z", "cpu": {"time": "2021-01-26T05:10:22Z", "usageNanoCores": 108726826, "usageCoreNanoSeconds": 2009799168}, "memory": {"time": "2021-01-26T05:10:22Z", "usageBytes": 61022208, "workingSetBytes": 52633600, "rssBytes": 32174080, "pageFaults": 45899, "majorPageFaults": 253}}], "startTime": "2020-09-12T12:43:58Z", "cpu": {"time": "2021-01-26T05:10:27Z", "usageNanoCores": 1078216675, "usageCoreNanoSeconds": 1758203828797878}, "memory": {"time": "2021-01-26T05:10:27Z", "availableBytes": 2564227072, "usageBytes": 7543758848, "workingSetBytes": 5632106496, "rssBytes": 3926122496, "pageFaults": 2304667, "majorPageFaults": 859}, "network": {"time": "2021-01-26T05:10:27Z", "name": "eth0", "rxBytes": 166587714846, "rxErrors": 0, "txBytes": 192097080030, "txErrors": 0, "interfaces": [{"name": "eth0", "rxBytes": 166587714846, "rxErrors": 0, "txBytes": 192097080030, "txErrors": 0}]}, "fs": {"time": "2021-01-26T05:10:27Z", "availableBytes": 23964233728, "capacityBytes": 52776349696, "usedBytes": 26562187264, "inodesFree": 2916648, "inodes": 3276800, "inodesUsed": 360152}, "runtime": {"imageFs": {"time": "2021-01-26T05:10:27Z", "availableBytes": 23964233728, "capacityBytes": 52776349696, "usedBytes": 1377648552, "inodesFree": 2916648, "inodes": 3276800, "inodesUsed": 360152}}, "rlimit": {"time": "2021-01-26T05:10:32Z", "maxpid": 32768, "curproc": 905}}, "pods": [{"podRef": {"name": "etcd-tk01", "namespace": "kube-system", "uid": "2e8885329cb9c936db545fcd71666003"}, "startTime": "2021-01-26T05:10:08Z", "containers": [{"name": "etcd", "startTime": "2021-01-26T05:10:09Z", "cpu": {"time": "2021-01-26T05:10:22Z", "usageNanoCores": 208950881, "usageCoreNanoSeconds": 2582358831}, "memory": {"time": "2021-01-26T05:10:22Z", "usageBytes": 37478400, "workingSetBytes": 37081088, "rssBytes": 36110336, "pageFaults": 11531, "majorPageFaults": 10}, "rootfs": {"time": "2021-01-26T05:10:22Z", 
"availableBytes": 23964233728, "capacityBytes": 52776349696, "usedBytes": 36864, "inodesFree": 2916648, "inodes": 3276800, "inodesUsed": 8}, "logs": {"time": "2021-01-26T05:10:22Z", "availableBytes": 23964233728, "capacityBytes": 52776349696, "usedBytes": 28672, "inodesFree": 2916648, "inodes": 3276800, "inodesUsed": 360152}}], "cpu": {"time": "2021-01-26T05:10:24Z", "usageNanoCores": 161540656, "usageCoreNanoSeconds": 34928899852771}, "memory": {"time": "2021-01-26T05:10:24Z", "usageBytes": 71917568, "workingSetBytes": 67014656, "rssBytes": 36155392, "pageFaults": 0, "majorPageFaults": 0}, "network": {}, "ephemeral-storage": {"time": "2021-01-26T05:10:27Z", "availableBytes": 23964233728, "capacityBytes": 52776349696, "usedBytes": 65536, "inodesFree": 2916648, "inodes": 3276800, "inodesUsed": 8}, "process_stats": {"process_count": 0}}]
}