Prometheus on Kubernetes

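This project contains the Kubernetes definitions for deploying monitoring with Prometheus Operator. For a quick deployment, see the Quickstart below.

Image list and offline use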
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/prometheus-operator:v0.30.0
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/configmap-reload:v0.0.1
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/prometheus-config-reloader:v0.30.0
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/prometheus:v2.11.1
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/grafana:7.2.0
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/kafka_exporter:v0.0.1
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/kube-rbac-proxy:v0.4.1
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/kube-state-metrics:v1.5.0
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/addon-resizer:1.8.4
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/node-exporter:v1.3.1
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/statping:v0.0.5
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/alertmanager:v0.17.0
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/prometheus-webhook-dingtalk:1.4.0
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/ding2wechat:v0.1.5
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/cloudera_exporter:v0.0.1
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/yarn_exporter:v0.0.2
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/kudu_exporter:v0.0.4
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/prometheus-es-exporter:0.14.0
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/azure-scheduledevents-exporter:21.01.1
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/mysqld-exporter
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/elasticsearch_exporter:1.1.0rc1
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/mongodbexporter
  • registry.cn-hangzhou.aliyuncs.com/clab-docker/redis_exporter

Package Ver.20200810

oss://devops-components/prometheus/prometheus_20200810.images.gz

Usage

gzip -d -c prometheus_20200810.images.gz | docker load
newRepoPrefix=nexus.xsio.cn/test # replace with the prefix of your local registry
docker images --format "{{.Repository}}:{{.Tag}}" -f reference="registry.cn-hangzhou.aliyuncs.com/clab-docker/*" | while read -r i; do
  target=${i/registry.cn-hangzhou.aliyuncs.com\/clab-docker/$newRepoPrefix} # same image name under the new prefix
  echo "-> process $i" && docker tag "$i" "$target" && docker push "$target"
done
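
The prefix rewrite in the loop above can be tried on its own before anything is tagged or pushed. The following is a minimal sketch in POSIX shell; the function name `retarget_image` is hypothetical and not part of this project, and the registry prefixes in the usage comment are the ones shown in this README.

```shell
# Map an image reference from one registry prefix to another.
# $1: image reference, $2: old prefix, $3: new prefix.
# Prints the rewritten reference, or fails if the image is not under the old prefix.
retarget_image() {
  image=$1; old_prefix=$2; new_prefix=$3
  case "$image" in
    "$old_prefix"/*)
      # Strip the old prefix and the following slash, then prepend the new prefix.
      printf '%s/%s\n' "$new_prefix" "${image#"$old_prefix"/}"
      ;;
    *)
      printf 'skip: %s is not under %s\n' "$image" "$old_prefix" >&2
      return 1
      ;;
  esac
}

# Usage (prints the docker commands instead of running them):
# docker images --format '{{.Repository}}:{{.Tag}}' | while read -r i; do
#   t=$(retarget_image "$i" registry.cn-hangzhou.aliyuncs.com/clab-docker nexus.xsio.cn/test) &&
#     echo "docker tag $i $t && docker push $t"
# done
```

Printing the commands first makes it easy to review the target names, which matters for the images in the list that carry no explicit tag.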

Quickstart

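  1. Fetch this project

Download https://gitlab.cd.xsio.cn/devops/prometheus-operator/-/archive/master/prometheus-operator-master.tar.gz manually, then upload it to the remote machine and extract it.

Or

Add a Deploy Key to the project, then on the remote machine run git clone ssh://[email protected]:22222/devops/prometheus-operator.git

(Optional) Create a DingTalk robot whose keywords include at least service, FIRING, and RESOLVED, and keep its webhook URL for later use.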

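  2. Choose the node and data directories for prometheus/grafana

If you need multiple nodes or high availability, use separate storage such as NFS or cloud disks instead.

If you only want ephemeral storage, follow the comments in 3-prometheus/prometheus-prometheus.yaml and 7-grafana/grafana-deployment.yaml.

The following sets up local single-node storage: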

kubectl get nodes # pick one node
kubectl label nodes xxxxxxxx prometheus-data=true # label that node
ssh xxxxxxxx mkdir -p /opt/prometheus/{prometheus,grafana} # log in to the chosen node and create the directories
kubectl apply -f create-local-storage.yaml # create the corresponding PV and StorageClass

  3. Start prometheus-operator
kubectl apply -f create-monitoring-namespace.yaml
kubectl apply -f 1-custom-resource-definition
kubectl apply -f 2-prometheus-operator
kubectl -nmonitoring get pod -l apps.kubernetes.io/name=prometheus-operator # confirm the status is Running

  4. Start prometheus and the exporters
kubectl apply -f 3-prometheus
kubectl -nmonitoring get pod -l app=prometheus -o wide # confirm the status is Running and the node is the one chosen earlier
kubectl apply -f 6-exporter --recursive
# ingress-related errors can be ignored; on managed Kubernetes you can skip kube-scheduler and kube-controller-manager

You can now open http://<any node IP>:30900/targets to confirm that prometheus is working. Skip this check if the address is hard to reach.

  5. Start grafana
kubectl apply -f 7-grafana
kubectl -nmonitoring get pod -l app=grafana # confirm the status is Running

You can now open http://<any node IP>:30300 to confirm that grafana is working. The default credentials are admin/admin.
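
The `kubectl get pod` checks above are read by eye. If you would rather script them, the sketch below shows one possible approach; the helper names `all_running` and `retry` are hypothetical, and it assumes the default `kubectl get pod` column layout, where STATUS is the third column.

```shell
# Read `kubectl get pod --no-headers` output on stdin.
# Succeeds only if at least one pod is listed and every STATUS (column 3) is Running.
all_running() {
  awk '{ n++; if ($3 != "Running") bad++ } END { exit !(n > 0 && bad == 0) }'
}

# Retry a command up to $1 times, sleeping $2 seconds between attempts.
retry() {
  tries=$1; delay=$2; shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then sleep "$delay"; fi
  done
  return 1
}

# Usage (wait up to roughly 150 seconds for the prometheus pods):
# prometheus_running() {
#   kubectl -nmonitoring get pod -l app=prometheus --no-headers | all_running
# }
# retry 30 5 prometheus_running && echo "prometheus is Running"
```

An empty pod list counts as a failure here, so a wrong label selector does not pass silently.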

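  6. Start other monitoring
  • kafka (recommended 🌟)
vi 8-custom-exporter/kafka-exporter/kafka-exporter-deployment.yaml
# edit the kafka addresses; if there is only one node, delete the extra lines
kubectl apply -f 8-custom-exporter/kafka-exporter
kubectl -nmonitoring get pod -l app=kafka-exporter # confirm the status is Running
  • statping service monitoring (recommended 🌟)
vi 8-custom-exporter/statping/statping-configMap.yaml
# adjust the addresses to monitor under services as needed (remember to bulk-replace the domain and namespace), and set DINGTALK_URL to the DingTalk or WeCom robot URL
kubectl apply -f 8-custom-exporter/statping
kubectl -nmonitoring get pod -l app=statping # confirm the status is Running

You can now open http://<any node IP>:30800 to confirm that statping is working. The default credentials are admin/Convertlab0202.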

  • External machine monitoring (recommended 🌟)

To monitor machines outside Kubernetes, run the following on each machine:

sudo yum install -y https://devops-components.oss-cn-hangzhou.aliyuncs.com/prometheus/node_exporter-1.3.1-clab.el7.centos.x86_64.rpm
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

Then edit the configuration and apply it:

vi 8-custom-exporter/external-node-exporter/external-node-exporter-endpoints.yaml
# update the addresses field with the list of machine IPs to monitor
kubectl apply -f 8-custom-exporter/external-node-exporter

Monitoring external machines requires access to port 9100 on them, so make sure the firewall allows it.
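
With more than a handful of machines, typing the address entries by hand gets error-prone. The sketch below generates them from a list of IPs. It assumes the file is a standard Kubernetes Endpoints object whose `addresses` entries take the form `- ip: <address>`; the function name `endpoint_addresses` and the two-space indent are assumptions, so adjust the indent to match the file you are editing.

```shell
# Print one Endpoints address entry per IP given as an argument.
# Rejects arguments containing anything other than digits and dots
# (a loose sanity check, not full IPv4 validation).
endpoint_addresses() {
  for ip in "$@"; do
    case "$ip" in
      ''|*[!0-9.]*)
        printf 'not an IPv4 address: %s\n' "$ip" >&2
        return 1
        ;;
    esac
    printf '  - ip: %s\n' "$ip"
  done
}

# Usage:
# endpoint_addresses 10.0.0.11 10.0.0.12
# Paste the output under the addresses field of the endpoints file.
```

Before applying, `curl http://<machine IP>:9100/metrics` from a Kubernetes node is a quick way to confirm that node_exporter is reachable through the firewall.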

  • CDH (recommended 🌟)
vi 8-custom-exporter/cloudera-exporter/cloudera-exporter-configMap.yaml
# set host to the IP of CDH Manager and adjust user as appropriate; ideally create a dedicated read-only user
kubectl apply -f 8-custom-exporter/cloudera-exporter
kubectl -nmonitoring get pod -l app=cloudera-exporter # confirm the status is Running
  • Spring Boot Actuator

The Kubernetes Service YAML of each application to be monitored needs the following labels:

actuator_should_be_scraped: "true"
actuator_scrape_port: "9999"

Configuring the labels in the deployment script is recommended. If that is inconvenient, add them manually, for example:

kubectl -ntest label svc app actuator_should_be_scraped=true
kubectl -ntest label svc app actuator_scrape_port=9999

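The application to be monitored must include the spring-boot-starter-actuator and micrometer-registry-prometheus dependencies and configure them in application.yml. The recommended minimal configuration is: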

management:
  server:
    port: 9999
  endpoint:
    metrics:
      enabled: true
    prometheus:
      enabled: true
  endpoints:
    enabled-by-default: false
    web:
      exposure:
        include: 'prometheus'

This is included in 6-exporter by default. A fresh deployment needs no extra steps; for an existing deployment, run:

kubectl apply -f 6-exporter/spring-boot-actuator
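
When labels are added by hand, both can go in a single `kubectl label` call, which accepts several `key=value` pairs, and `--overwrite` allows re-running it after a port change. The helper below is a sketch that only prints the command so it can be reviewed first; the function name `actuator_label_cmd` is hypothetical.

```shell
# Print the kubectl command that sets both Actuator scrape labels on a Service.
# $1: namespace, $2: Service name, $3: management port (must be numeric).
actuator_label_cmd() {
  ns=$1; svc=$2; port=$3
  case "$port" in
    ''|*[!0-9]*)
      printf 'port must be numeric: %s\n' "$port" >&2
      return 1
      ;;
  esac
  printf 'kubectl -n %s label svc %s actuator_should_be_scraped=true actuator_scrape_port=%s --overwrite\n' \
    "$ns" "$svc" "$port"
}

# Usage:
# actuator_label_cmd test app 9999        # review the command
# actuator_label_cmd test app 9999 | sh   # run it
```

With the minimal configuration above, the application should serve metrics at `/actuator/prometheus` on the management port, assuming the Actuator base path has not been changed.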
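  • JMX (fallback for Spring Boot Actuator)

JMX monitoring is collected through the jmx-exporter provided by prometheus. It is a Java agent that starts together with the application and exposes a port for scraping.

To minimize the impact on existing applications, the monitoring chain is fairly involved. It works as follows:

I. Configure the ServiceMonitor on the Prometheus Operator side

This is included in 6-exporter by default; no extra steps are needed.

II. Enable the jmx-exporter Java agent in the ConfigMap

Search for JMX in the SRE global configuration, set the following, and update the configuration: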

JMX_EXPORTER_ENABLED: true

JMX_EXPORTER_BLACKLIST_OBJECT_NAMES: ["kafka.producer:type=producer-topic-metrics,*","kafka.consumer:type=consumer-fetch-manager-metrics,topic=*,*","kafka.consumer:type=consumer-coordinator-metrics,*"]

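For additional requirements, see here.

III. Confirm that the deployment script labels the Service

The Kubernetes Service YAML of each application to be monitored needs the following labels:

jmx_should_be_scraped: "true"
jmx_scrape_port: "7777"

After adding them, the Service looks like this, for example: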

apiVersion: v1
kind: Service
metadata:
  namespace: "{{ namespace }}"
  name: "{{ service_name }}"
  labels:
    jmx_should_be_scraped: "true"
    jmx_scrape_port: "7777"
spec:
  selector:
    app: "{{ service_name }}"
  ports:
  - name: web
    targetPort: web
    port: {{ server_port }}
{% for port in extra_ports %}
  - name: port{{ loop.index }}
    targetPort: {{ port }}
    port: {{ port }}
{% endfor %}

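Make sure that both service.j2 and reactflow_service.j2 in the deployment scripts for the current environment have been updated accordingly.

IV. Redeploy the service

After the application starts, wait about two minutes. If prometheus has not collected any data yet, troubleshoot as follows:

a. Enter the application container and run netstat -nltp to check whether port 7777 is listening.

b. If it is not listening, check whether the jmx_prometheus_javaagent directory exists under /opt/. If it does not, the base image has not been updated.

c. If the jmx_prometheus_javaagent directory exists, check whether start.sh contains the line . .init_jmx_exporter.sh. If it does not, the application uses a custom startup script.

d. If port 7777 is listening, run kubectl describe svc to check whether the Service labels are present. If they are not, check whether the deployment script works correctly.

e. If the labels are present, open the Prometheus web UI, go to Status-Targets, and search for jmx to check whether the scrape target is healthy.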
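
Checks b and c above can be scripted for use inside the application container. The following is a sketch only: the function name `check_jmx_agent` is hypothetical, and because this README does not say where start.sh lives, its path is passed in as a parameter.

```shell
# Run checks b and c against a filesystem root.
# $1: root directory (use / inside the container), $2: path to start.sh.
# Returns 1 if the agent directory is missing, 2 if start.sh does not
# source .init_jmx_exporter.sh, and 0 if both checks pass.
check_jmx_agent() {
  root=${1%/}; start_script=$2
  if [ ! -d "$root/opt/jmx_prometheus_javaagent" ]; then
    echo "agent directory missing: the base image has not been updated"
    return 1
  fi
  # -F matches the text literally, so the dots are not treated as wildcards.
  if ! grep -qF '. .init_jmx_exporter.sh' "$start_script"; then
    echo "start.sh does not source .init_jmx_exporter.sh: custom startup script"
    return 2
  fi
  echo "agent directory present and sourced from start.sh"
}

# Usage inside the container (the start.sh path is a placeholder):
# check_jmx_agent / /path/to/start.sh
```

If both checks pass but port 7777 is still not listening, the application's startup log is the next place to look.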

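  • Azure scheduled events (recommended on Azure 🌟, especially on Kudu nodes)

Retrieves upcoming Azure maintenance events. Kubernetes nodes are covered by default; on machines outside Kubernetes, run: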

sudo yum install -y https://devops-components.oss-cn-hangzhou.aliyuncs.com/tools/azure-scheduledevents-exporter/21.01.1/azure-scheduledevents-exporter-21.01.1-1.el7.x86_64.rpm
sudo systemctl enable azure-scheduledevents-exporter --now

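Then edit the configuration and apply it:

vi 8-custom-exporter/azure-scheduledevents-exporter/external-azure-scheduledevents-exporter-endpoints.yaml
# update the addresses field with the IPs of the machines outside Kubernetes to monitor
kubectl apply -f 8-custom-exporter/azure-scheduledevents-exporter

Monitoring machines outside Kubernetes requires access to port 9879 on them, so make sure the firewall allows it.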

  • yarn (optional)
vi 8-custom-exporter/yarn-exporter/yarn-exporter-deployment.yaml
# set the yarn resource manager address after --resource-manager.address (usually port 8088)
kubectl apply -f 8-custom-exporter/yarn-exporter
kubectl -nmonitoring get pod -l app=yarn-exporter # confirm the status is Running
  • kudu (optional)

Requires kudu 1.10 or later.

vi 8-custom-exporter/kudu-exporter/kudu-exporter-deployment.yaml
# set the kudu master address after --kudu.bootstrap-address (usually port 8051)
kubectl apply -f 8-custom-exporter/kudu-exporter
kubectl -nmonitoring get pod -l app=kudu-exporter # confirm the status is Running
  • Log error monitoring (optional)

ES is queried once every 5 minutes, so the results are not perfectly accurate.

vi 8-custom-exporter/elasticsearch-log-exporter/elasticsearch-log-exporter-deployment.yaml
# set the elastic address after --es-cluster; add the port if it is not 9200
vi 8-custom-exporter/elasticsearch-log-exporter/elasticsearch-log-exporter-configMap.yaml
# change the index name after QueryIndices to the prefix actually in use
kubectl apply -f 8-custom-exporter/elasticsearch-log-exporter
kubectl -nmonitoring get pod -l app=elasticsearch-log-exporter # confirm the status is Running
  • redis (optional)
vi 8-custom-exporter/redis-exporter/redis-exporter-deployment.yaml
# edit the redis address; fill in the password if there is one, otherwise leave it out
kubectl apply -f 8-custom-exporter/redis-exporter
kubectl -nmonitoring get pod -l app=redis-exporter # confirm the status is Running
  • mongodb (optional)
vi 8-custom-exporter/mongodb-exporter/mongodb-exporter-deployment.yaml
# edit the mongodb address (modify the demo entry in the file)
kubectl apply -f 8-custom-exporter/mongodb-exporter
kubectl -nmonitoring get pod -l app=mongodb-exporter # confirm the status is Running
  • mysql (optional)
vi 8-custom-exporter/mysql-exporter/mysql-exporter-deployment.yaml
# edit the mysql address (modify the demo entry in the file)
kubectl apply -f 8-custom-exporter/mysql-exporter
kubectl -nmonitoring get pod -l app=mysql-exporter # confirm the status is Running
  • elasticsearch (optional)
vi 8-custom-exporter/elasticsearch-exporter/elasticsearch-exporter-deployment.yaml
# edit the elasticsearch address (modify the demo entry in the file)
vi 8-custom-exporter/elasticsearch-exporter/elasticsearch-log-exporter-deployment.yaml
# edit the address of the elasticsearch that collects logs; if the same elasticsearch is shared, exclude this file when applying
kubectl apply -f 8-custom-exporter/elasticsearch-exporter
kubectl -nmonitoring get pod -l app=elasticsearch-exporter # confirm the status is Running

  7. Start alertmanager alerting (choose one)
  • DingTalk
vi 9-webhook/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk-deployment.yaml
# set the DingTalk robot URL in ding.profile
kubectl apply -f 5-alertmanager
kubectl apply -f 9-webhook/prometheus-webhook-dingtalk
  • WeCom (Enterprise WeChat)
vi 9-webhook/prometheus-webhook-qiyeweixin/prometheus-webhook-qiyeweixin-configMap.yaml
# set the WeCom group robot URL defined by url in the ding2wechat.yml field
kubectl apply -f 5-alertmanager
kubectl apply -f 9-webhook/prometheus-webhook-qiyeweixin