
KubeEdge Monitoring Trilogy: Monitoring Edge-Node Pod Metrics on KubeEdge with cAdvisor

For monitoring on KubeEdge, we have already covered node metrics via node-exporter and GPU metrics via gpu-exporter. Two gaps remain: 1. pod-level metrics on edge nodes are not yet collected; 2. bare-metal machines without a container runtime have no node monitoring at all.

See the earlier posts:

Monitoring KubeEdge Edge Nodes with Prometheus

Monitoring KubeEdge Edge Node (Jetson) CPU and GPU Status with Prometheus on KubeSphere

https://zhuyaguang-1308110266.cos.ap-shanghai.myqcloud.com/img/image-20250123152519472.png

1. Deploy cAdvisor and Wire It into Prometheus

Deploy the ServiceAccount, DaemonSet, Service, and ServiceMonitor with the following YAML:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: cadvisor-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: cadvisor
  namespace: kubesphere-monitoring-system
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: cadvisor-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: cadvisor
  namespace: kubesphere-monitoring-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: cadvisor-exporter
      app.kubernetes.io/part-of: kube-prometheus
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: cadvisor-exporter
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 1.0.0
    spec:
      containers:
        - image: tj.registry.com:5000/sys/cadvisor-arm64:v0.50.0
          imagePullPolicy: IfNotPresent
          name: cadvisor
          ports:
            - containerPort: 8080
              hostPort: 8080
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /rootfs
              name: root
            - mountPath: /var/run
              name: run
            - mountPath: /sys
              name: sys
            - mountPath: /var/lib/containerd
              name: containerd
            - mountPath: /etc/machine-id
              name: machineid
      dnsPolicy: ClusterFirst
      hostNetwork: true
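      # Note: nodeName below pins every pod of this DaemonSet to one node;
      # remove it (keeping the tolerations) to run cAdvisor on all edge nodes.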
      nodeName: star4-orin01
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: cadvisor
      serviceAccountName: cadvisor
      terminationGracePeriodSeconds: 30
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
      volumes:
        - hostPath:
            path: /
            type: ""
          name: root
        - hostPath:
            path: /var/run
            type: ""
          name: run
        - hostPath:
            path: /sys
            type: ""
          name: sys
        - hostPath:
            path: /mnt/cubefs/registry/containerd
            type: ""
          name: containerd
        - hostPath:
            path: /etc/machine-id
            type: ""
          name: machineid
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
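Once the DaemonSet is applied, confirm its pod is running on the edge node (the label selector matches the YAML above):

kubectl -n kubesphere-monitoring-system get ds cadvisor
kubectl -n kubesphere-monitoring-system get pods -o wide \
  -l app.kubernetes.io/name=cadvisor-exporter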
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: cadvisor-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: cadvisor-svc
  namespace: kubesphere-monitoring-system
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: metrics
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: cadvisor-exporter
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: None
  type: ClusterIP
apiVersion: v1
kind: Service
metadata:
  name: cadvisor-svc-nodeport
  namespace: kubesphere-monitoring-system
spec:
  ports:
  - port: 8080
    targetPort: 8080
    protocol: TCP
    nodePort: 32144
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: NodePort
  selector:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: cadvisor-exporter
    app.kubernetes.io/part-of: kube-prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: cadvisor-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/vendor: kubesphere
    app.kubernetes.io/version: 1.0.0
  name: cadvisor-exporter
  namespace: kubesphere-monitoring-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: cadvisor-exporter
      app.kubernetes.io/part-of: kube-prometheus
  endpoints:
    - port: metrics
      interval: 10s
      honorLabels: true
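Before checking the Prometheus UI, you can verify that the exporter answers from inside the cluster (a sketch; curlimages/curl is an assumed utility image and curl-test a throwaway pod name):

kubectl -n kubesphere-monitoring-system run curl-test --rm -i --restart=Never \
  --image=curlimages/curl -- curl -s http://cadvisor-svc:8080/metrics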

The job should then show up in the Prometheus UI:

https://zhuyaguang-1308110266.cos.ap-shanghai.myqcloud.com/img/1731569090318-e280f0ed-ce59-4c89-836f-88958310a4a4.png

You can also expose the cAdvisor port through the NodePort Service above and open the cAdvisor web UI:

URL Prometheus scrapes cAdvisor on:
http://10.107.104.8:32144/metrics

cAdvisor web UI:
http://10.107.104.8:32144/

2. Showing the Metrics in the KubeSphere UI

At this point the KubeSphere UI still shows no performance metrics for pods on the edge node.

A packet capture explains why: when ks-apiserver queries Prometheus, it filters on the kubelet job, but our self-deployed cAdvisor reports under the job name "cadvisor-svc", so the query matches nothing.

Capture file: query.pcap.txt (remove the .txt suffix before opening)

https://zhuyaguang-1308110266.cos.ap-shanghai.myqcloud.com/img/1731576285039-ae43b335-4bae-4a9e-95c9-f27dbc77943c.png

Bypassing ks-apiserver and querying Prometheus directly does return results.

For example:

https://zhuyaguang-1308110266.cos.ap-shanghai.myqcloud.com/img/1731576431535-aa7821d5-f6be-42f6-885a-f20289f764cf.png

The query is as follows (it joins cAdvisor's memory series with kube_pod_owner and kube_pod_info via * on (namespace, pod) group_left(...) to attach owner and node labels):

sum by (namespace, pod) (container_memory_working_set_bytes{job="cadvisor-svc", pod!="", image!=""}) * on (namespace, pod) group_left(owner_kind, owner_name) kube_pod_owner{} * on (namespace, pod) group_left(node) kube_pod_info{node="star4-orin01",pod=~"secspace|cadvisor-ncqtf|node-exporter-6srsh|dataagent-deployment-7fb44cc9c8-qgl8n|registry-ds-sfp4t|operator-deployment-7fc59c6c88-24v5l$"}
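If you want the KubeSphere UI itself to show these pods, one possible workaround (a sketch, not verified here) is to rewrite the job label in the cAdvisor ServiceMonitor so the scraped series carry the job name ks-apiserver queries; relabelings is a standard Prometheus Operator endpoint field. Be aware this may collide with series scraped from the real kubelet job on cloud nodes:

  endpoints:
    - port: metrics
      interval: 10s
      honorLabels: true
      relabelings:
        - action: replace
          targetLabel: job
          replacement: kubelet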

3. Deploy node-exporter on a Physical Machine and Wire It into the In-Cluster Prometheus

The deployment target is the on-board FPGA physical machine.

1. Deploy node-exporter

# Check the node-exporter image version used in the K8s cluster,
# then download the matching binary release
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-arm64.tar.gz

tar -xvzf node_exporter-1.3.1.linux-arm64.tar.gz
cp node_exporter-1.3.1.linux-arm64/node_exporter /usr/local/bin/

# Start node_exporter
node_exporter

# The metrics endpoint is now reachable
curl http://IP:9100/metrics

Configure it to start on boot.

References:

https://www.cnblogs.com/wal1317-59/p/12693309.html

https://blog.csdn.net/shb_derek1/article/details/8489112

Save the following script as /etc/init.d/node-exporter.sh:
#! /bin/sh
### BEGIN INIT INFO
# Provides:             spacecloud
# Required-Start:       
# Required-Stop:        
# Default-Start:        
# Default-Stop:         
# Short-Description:    spacecloud node-exporter
### END INIT INFO

nohup /usr/local/bin/node_exporter >/dev/null 2>&1 &

Then run:

# Make the script executable
chmod +x /etc/init.d/node-exporter.sh
# Register it in the rc list
cd /etc/init.d/
update-rc.d node-exporter.sh defaults 90

# Reboot for the change to take effect
reboot

To disable the boot-time startup:

# Remove it from the rc list
cd /etc/init.d/
update-rc.d -f node-exporter.sh remove

# Delete the script
rm /etc/init.d/node-exporter.sh

# Reboot for the change to take effect
reboot
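On machines that run systemd rather than SysV init, a unit file is the more idiomatic equivalent of the script above (a sketch; the unit path and name are conventional choices):

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target

Then enable and start it:

systemctl daemon-reload
systemctl enable --now node_exporter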

2. Wire it into Prometheus

Reference: https://segmentfault.com/a/1190000018940961

Create the Endpoints, Service, and ServiceMonitor:

apiVersion: v1
kind: Endpoints
metadata:
  name: fpga-exporter
  namespace: kubesphere-monitoring-system
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: fpga-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
subsets:
- addresses:
  - ip: 10.15.40.160
  ports:
  - name: metrics
    port: 9100
    protocol: TCP
apiVersion: v1
kind: Service
metadata:
  name: fpga-exporter
  namespace: kubesphere-monitoring-system
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: fpga-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
spec:
  type: ExternalName
  externalName: 10.15.40.160
  clusterIP: ""
  ports:
  - name: metrics
    port: 9100
    protocol: TCP
    targetPort: 9100
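Note: type: ExternalName is intended for DNS names rather than raw IPs. Since the Endpoints object above already carries the target address, the more conventional pairing is a selector-less headless Service (a sketch; reuse the metadata and labels from the Service above so the ServiceMonitor still matches):

apiVersion: v1
kind: Service
metadata:
  name: fpga-exporter
  namespace: kubesphere-monitoring-system
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: fpga-exporter
    app.kubernetes.io/part-of: kube-prometheus
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 9100
    protocol: TCP
    targetPort: 9100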
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fpga-exporter
  namespace: kubesphere-monitoring-system
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: fpga-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/vendor: kubesphere
    app.kubernetes.io/version: 1.0.0
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: fpga-exporter
      app.kubernetes.io/part-of: kube-prometheus
  endpoints:
  - port: metrics
    interval: 10s
    honorLabels: true
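Before moving on, it is worth confirming that the Endpoints resolve and the exporter answers (the IP matches the Endpoints above):

kubectl -n kubesphere-monitoring-system get endpoints fpga-exporter
curl -s http://10.15.40.160:9100/metrics | head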

Then expose the Prometheus Service. The following YAML creates a new Service that publishes Prometheus through a NodePort:

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.39.1
  name: prometheus-k8s-nodeport
  namespace: kubesphere-monitoring-system
spec:
  ports:
  - port: 9090
    targetPort: 9090
    protocol: TCP
    nodePort: 32143
  selector:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: NodePort

Log in at:

http://10.11.140.110:32143/
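A quick check that the NodePort is reachable is Prometheus's readiness endpoint:

curl -s http://10.11.140.110:32143/-/ready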

Add the following to the Prometheus rules:

kubectl edit  prometheusrules.monitoring.coreos.com prometheus-k8s-rules -n kubesphere-monitoring-system

---
    - expr: |
        sum by (cpu, instance, job) (node_cpu_seconds_total{job="fpga-exporter",mode=~"user|nice|system|iowait|irq|softirq"})
      record: fpga_cpu_used_seconds_total
    - expr: |
        avg by (instance) (irate(fpga_cpu_used_seconds_total{job="fpga-exporter"}[5m]))
      record: node:fpga_cpu_utilisation:avg1m
---
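After saving, the new records show up under Status → Rules in the Prometheus UI, or via the rules API:

curl -s http://10.11.140.110:32143/api/v1/rules | grep fpga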

This rule was modeled on the following existing node-exporter rule:

---
    - expr: |
        sum (node_cpu_seconds_total{job="node-exporter", mode=~"user|nice|system|iowait|irq|softirq"}) by (cpu, instance, job, namespace, pod, cluster)
      record: node_cpu_used_seconds_total
    - expr: |
        avg by (instance) (irate(node_cpu_used_seconds_total{job="node-exporter"}[5m]))
      record: node:node_cpu_utilisation:avg1m
---

After that, node:fpga_cpu_utilisation:avg1m can be queried to get the FPGA machine's CPU utilization.
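For example, through the Prometheus HTTP API on the NodePort exposed earlier:

curl -sG 'http://10.11.140.110:32143/api/v1/query' \
  --data-urlencode 'query=node:fpga_cpu_utilisation:avg1m'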