
A First Look at Prometheus Alerting (Part 1) [Repost]

global:
  scrape_interval: 15s
  evaluation_interval: 15s  # evaluate the alerting rules every 15 seconds
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]  # where alerts are pushed, normally the Alertmanager address
rule_files:
  - "test_rules.yml"  # alerting rule file
scrape_configs:
  - job_name: 'node'  # user-defined name of the monitoring job
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'CDG-MS'
    honor_labels: true
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['localhost:8089']
    relabel_configs:
      - target_label: env
        replacement: dev
  - job_name: 'eureka'
    file_sd_configs:
      - files:
          - "/app/enmonster/basic/prometheus/prometheus-2.2.1.linux-amd64/eureka.json"
    relabel_configs:
      - source_labels: [__job_name__]
        regex: (.*)
        target_label: job
        replacement: ${1}
      - target_label: env
        replacement: dev
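The relabel_configs entries in this file all use the default "replace" action. As a rough illustration of what such a rule does to a target's labels, here is a minimal Python sketch. The function name `apply_replace_rule` is hypothetical, and the sketch assumes that `__job_name__` is a label supplied by eureka.json; it models only the replace action, not the other relabel actions.

```python
import re


def apply_replace_rule(labels, target_label, replacement,
                       source_labels=(), regex="(.*)", separator=";"):
    """Apply one 'replace' relabel rule and return a new label dict."""
    # Source label values are joined with the separator; missing labels count as "".
    value = separator.join(labels.get(name, "") for name in source_labels)
    # The regex is anchored, so the whole joined value has to match.
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: labels are returned unchanged
    # Translate Go-style ${1} / $1 group references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


if __name__ == "__main__":
    # Hypothetical target labels as they might come from eureka.json.
    target = {"__job_name__": "RMS-MS", "__address__": "10.208.204.46:19999"}
    target = apply_replace_rule(target, "job", "${1}", source_labels=["__job_name__"])
    target = apply_replace_rule(target, "env", "dev")
    print(target)
```

With no source_labels, the joined value is an empty string, which the default regex still matches, so the rule simply sets a fixed label such as env=dev.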

The rule_files entry in prometheus.yml names the file that holds the alerting rules. A sample rule file:

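groups:
  - name: example  # name of the rule group
    rules:
      - alert: InstanceDown  # checks job health; if metrics stay unreachable for 1 minute, the alert is sent to Alertmanager
        expr: up == 0
        for: 1m  # duration: the alert fires only after scraping has failed for one full minute
        labels:
          serverity: page  # custom label
        annotations:
          summary: "Instance {{ $labels.instance }} down"  # custom summary
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."  # custom description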
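To make the effect of `for: 1m` combined with `evaluation_interval: 15s` concrete, the following Python sketch simulates how an alert moves from inactive to pending to firing. It is a simplified model for illustration, not Prometheus's actual implementation; `AlertState` and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    active_since: Optional[float] = None  # time the condition first became true
    state: str = "inactive"


def evaluate(alert: AlertState, condition_true: bool, now: float,
             hold_for: float = 60.0) -> str:
    """Run one rule evaluation and return the resulting alert state."""
    if not condition_true:
        # The expression (e.g. up == 0) no longer holds: reset the alert.
        alert.active_since = None
        alert.state = "inactive"
    else:
        if alert.active_since is None:
            alert.active_since = now
        elapsed = now - alert.active_since
        # The alert fires only once the condition has held for the full 'for' duration.
        alert.state = "firing" if elapsed >= hold_for else "pending"
    return alert.state


if __name__ == "__main__":
    alert = AlertState()
    for t in range(0, 75, 15):  # one evaluation every 15 seconds
        print(t, evaluate(alert, condition_true=True, now=t))
```

In this model the alert stays pending for the evaluations at 0, 15, 30 and 45 seconds and turns firing at 60 seconds. A single successful scrape in between resets it to inactive.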

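This is a very common alerting rule: it detects whether an application instance is down.

After changing the configuration, reload it through this endpoint: curl -X POST http://localhost:9090/-/reload

Prometheus must be started with the --web.enable-lifecycle flag, as in the following command; otherwise the configuration cannot be reloaded this way.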

./prometheus --config.file=prometheus.yml --web.enable-lifecycle
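The same reload call can be issued from a script, for example at the end of a deployment step. This is a small sketch using only the Python standard library; the function names are hypothetical and the base URL assumes the default localhost:9090 address used in this article.

```python
import urllib.request


def build_reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build the POST request for Prometheus's /-/reload endpoint."""
    return urllib.request.Request(base_url.rstrip("/") + "/-/reload",
                                  data=b"", method="POST")


def reload_prometheus(base_url: str = "http://localhost:9090",
                      timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Requires a running Prometheus started with --web.enable-lifecycle.
    print(reload_prometheus())
```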

To handle alerts with your own program, edit the alerting section of prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:17201"]  # address that alerts are pushed to

With this configuration, whenever there are alerts to deliver, Prometheus pushes them to the service at localhost:17201 as follows:

Endpoint: /api/v1/alerts

Sample code:

@RequestMapping(value = "/api/v1/alerts")
public String alert(@RequestBody String body) {
    // Log the raw JSON array of alerts pushed by Prometheus.
    log.info("/api/v1/alerts={}", body);
    return "success";
}
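For readers who want to try the receiver without a Spring project, here is a roughly equivalent sketch using only the Python standard library. It accepts POST requests on /api/v1/alerts, decodes the JSON array and answers "success". The names `parse_alerts`, `AlertHandler` and `serve` are hypothetical, and port 17201 is taken from the configuration in this article.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_alerts(body: bytes) -> list:
    """Decode the JSON array that Prometheus posts to /api/v1/alerts."""
    alerts = json.loads(body.decode("utf-8"))
    if not isinstance(alerts, list):
        raise ValueError("expected a JSON array of alerts")
    return alerts


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/v1/alerts":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            alerts = parse_alerts(self.rfile.read(length))
        except ValueError:  # also covers malformed JSON and bad encoding
            self.send_error(400)
            return
        print(f"/api/v1/alerts received {len(alerts)} alert(s)")
        payload = b"success"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 17201) -> None:
    """Listen on the port that prometheus.yml points to."""
    HTTPServer(("", port), AlertHandler).serve_forever()


if __name__ == "__main__":
    serve()
```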

Request body structure:

[
  {
    "labels": {
      "alertname": "InstanceDown",
      "env": "dev",
      "instance": "10.208.204.46:19999",
      "job": "RMS-MS",
      "serverity": "page"
    },
    "annotations": {
      "description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
      "summary": "Instance 10.208.204.46:19999 down"
    },
    "startsAt": "2018-06-19T17:07:54.140071559+08:00",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
  },
  {
    "labels": {
      "alertname": "InstanceDown",
      "env": "dev",
      "instance": "10.208.204.46:19999",
      "job": "RMS-MS",
      "serverity": "page"
    },
    "annotations": {
      "description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
      "summary": "Instance 10.208.204.46:19999 down"
    },
    "startsAt": "2018-06-19T17:07:54.140071559+08:00",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
  },
  {
    "labels": {
      "alertname": "InstanceDown",
      "env": "dev",
      "instance": "192.168.164.1:18093",
      "job": "RMS-MS",
      "serverity": "page"
    },
    "annotations": {
      "description": "192.168.164.1:18093 of job RMS-MS has been down for more than 5 minutes.",
      "summary": "Instance 192.168.164.1:18093 down"
    },
    "startsAt": "2018-06-19T17:07:54.140071559+08:00",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
  }
]

If all three RMS-MS machines go down, Prometheus sends the data above to the localhost:17201/api/v1/alerts endpoint.

From this data we can build our own alert notifications.
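As a starting point for such notifications, the sketch below turns the pushed JSON array into one short message per distinct alert. Because the sample payload contains the same instance twice, it deduplicates on the full label set. The function name `build_notifications` and the message format are assumptions made for illustration.

```python
import json


def build_notifications(body: str) -> list:
    """Return one notification line per distinct alert in the pushed array."""
    alerts = json.loads(body)
    seen = set()
    messages = []
    for alert in alerts:
        labels = alert.get("labels", {})
        # Two alerts with identical label sets are treated as the same alert.
        key = tuple(sorted(labels.items()))
        if key in seen:
            continue
        seen.add(key)
        annotations = alert.get("annotations", {})
        summary = annotations.get("summary", labels.get("alertname", "unknown alert"))
        env = labels.get("env", "?")
        started = alert.get("startsAt", "?")
        messages.append(f"[{env}] {summary} (since {started})")
    return messages


if __name__ == "__main__":
    sample = json.dumps([
        {"labels": {"alertname": "InstanceDown", "env": "dev",
                    "instance": "10.208.204.46:19999", "job": "RMS-MS"},
         "annotations": {"summary": "Instance 10.208.204.46:19999 down"},
         "startsAt": "2018-06-19T17:07:54.140071559+08:00"},
    ])
    for line in build_notifications(sample):
        print(line)
```

The resulting lines could then be handed to whatever channel the team uses, such as mail or chat.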

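The alternative is Prometheus's own alerting component. When an alert fires, Prometheus pushes the alert data to Alertmanager. After receiving it, Alertmanager applies its own rules and then sends out the notifications.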

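global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 30s
  # interval within a group
  group_interval: 5m  # alerts for the same group that arrive within 5 minutes are batched and sent together
  repeat_interval: 1s  # interval before an identical alert notification is sent again
  receiver: 'webhook'  # receiver to use
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://10.208.204.46:17210/test/alert2'  # receiving URL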
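The `group_by: ['job']` setting means that alerts sharing the same job label end up in one notification. The bucketing idea can be sketched in a few lines of Python. This is a simplified illustration only: it ignores group_wait, group_interval and repeat_interval, and `group_alerts` is a hypothetical name.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("job",)):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        # Alerts with equal values for every group_by label share a key.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"labels": {"job": "RMS-MS", "instance": "10.208.204.46:19999"}},
        {"labels": {"job": "RMS-MS", "instance": "192.168.164.1:18093"}},
        {"labels": {"job": "CDG-MS", "instance": "localhost:8089"}},
    ]
    for key, members in group_alerts(alerts).items():
        print(key, len(members))
```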

The webhook receiver is used here, so Alertmanager pushes alert notifications to http://10.208.204.46:17210/test/alert2

The payload structure is as follows:

{
  "receiver": "webhook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {},
      "annotations": {
        "description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
        "summary": "Instance 10.208.204.46:19999 down"
      },
      "startsAt": "2018-06-19T17:25:54.143824172+08:00",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
    },
    {
      "status": "firing",
      "labels": {
        "alertname": "InstanceDown",
        "env": "dev",
        "instance": "192.168.164.1:18093",
        "job": "RMS-MS",
        "serverity": "page"
      },
      "annotations": {
        "description": "192.168.164.1:18093 of job RMS-MS has been down for more than 5 minutes.",
        "summary": "Instance 192.168.164.1:18093 down"
      },
      "startsAt": "2018-06-19T17:25:54.143824172+08:00",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
    }
  ],
  "groupLabels": {
    "job": "RMS-MS"
  },
  "commonLabels": {
    "alertname": "InstanceDown",
    "env": "dev",
    "job": "RMS-MS",
    "serverity": "page"
  },
  "commonAnnotations": {},
  "externalURL": "http://localhost.localdomain:9093",
  "version": "4",
  "groupKey": "{}:{job=\"RMS-MS\"}"
}

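If all three machines of a cluster go down, Alertmanager aggregates the information for the three machines and sends it to the webhook endpoint in a single notification.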
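On the receiving side, the grouped payload can be condensed into one message per notification. The following sketch reads only fields that appear in the sample payload (status, groupLabels, commonLabels, alerts and their annotations). The function name `summarize_webhook` and the output layout are assumptions for illustration.

```python
import json


def summarize_webhook(body: str) -> str:
    """Condense one Alertmanager webhook payload into a short text message."""
    payload = json.loads(body)
    alerts = payload.get("alerts", [])
    group = ", ".join(f"{k}={v}"
                      for k, v in sorted(payload.get("groupLabels", {}).items()))
    name = payload.get("commonLabels", {}).get("alertname", "alert")
    status = payload.get("status", "unknown")
    # Header line: alert name, overall status, group labels and alert count.
    lines = [f"{name} [{status}] {group}: {len(alerts)} alert(s)"]
    for alert in alerts:
        summary = alert.get("annotations", {}).get("summary", "(no summary)")
        lines.append(f"- {summary}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = json.dumps({
        "status": "firing",
        "groupLabels": {"job": "RMS-MS"},
        "commonLabels": {"alertname": "InstanceDown"},
        "alerts": [
            {"annotations": {"summary": "Instance 10.208.204.46:19999 down"}},
            {"annotations": {"summary": "Instance 192.168.164.1:18093 down"}},
        ],
    })
    print(summarize_webhook(sample))
```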

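| Feature | Alertmanager | Custom-built alerting |
| --- | --- | --- |
| Grouping | Bundles alerts of the same group into one summarized notification | Must be built in-house |
| Inhibition | Once an alert has fired, stops sending the follow-on alerts for other errors caused by it | Must be built in-house |
| Silencing | A simple mechanism for muting notifications during a given period | Must be built in-house |
| Drawbacks | Not written in Java, so hard to understand in depth | High in-house development cost; rudimentary at first |
| Advantages | Mature technology | - |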
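To give a sense of what building one of these features in-house involves, here is a deliberately small sketch of silencing: an alert is muted when its labels match a silence and the current time falls inside the silence window. It is a hypothetical design with exact-match labels only, not a description of how Alertmanager implements silences; `Silence` and `filter_silenced` are made-up names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Silence:
    matchers: dict       # label name -> exact value that must match
    starts_at: datetime
    ends_at: datetime

    def mutes(self, labels: dict, now: datetime) -> bool:
        """True if the silence is active and every matcher equals the alert's label."""
        in_window = self.starts_at <= now < self.ends_at
        return in_window and all(labels.get(k) == v for k, v in self.matchers.items())


def filter_silenced(alerts, silences, now):
    """Return only the alerts that no active silence mutes."""
    return [a for a in alerts
            if not any(s.mutes(a.get("labels", {}), now) for s in silences)]


if __name__ == "__main__":
    silence = Silence({"job": "RMS-MS"},
                      datetime(2018, 6, 19, 17, 0), datetime(2018, 6, 19, 18, 0))
    alerts = [{"labels": {"job": "RMS-MS"}}, {"labels": {"job": "CDG-MS"}}]
    print(filter_silenced(alerts, [silence], datetime(2018, 6, 19, 17, 30)))
```

Even this minimal version leaves out storage, expiry handling and a user interface, which is part of the in-house cost noted in the table.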

The recommendation is to use Alertmanager as the first stage of alert notification and then have it push to our own program through a webhook.

Reposted from:

prometheus告警技術初探(一) | sharedCode https://www.shared-code.com/article/82