A First Look at Prometheus Alerting (Part 1) [Repost]
Alerting rules
global:
  scrape_interval: 15s
  evaluation_interval: 15s  # evaluate the alerting rules every 15 seconds
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]  # where alerts are pushed; normally the AlertManager address
rule_files:
  - "test_rules.yml"  # alerting rule file
scrape_configs:
  - job_name: 'node'  # user-defined name of the monitored job
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'CDG-MS'
    honor_labels: true
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['localhost:8089']
    relabel_configs:
      - target_label: env
        replacement: dev
  - job_name: 'eureka'
    file_sd_configs:
      - files:
          - "/app/enmonster/basic/prometheus/prometheus-2.2.1.linux-amd64/eureka.json"
    relabel_configs:
      - source_labels: [__job_name__]
        regex: (.*)
        target_label: job
        replacement: ${1}
      - target_label: env
        replacement: dev
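As the configuration above shows, rule_files names the file that holds the alerting rules. Here that file is test_rules.yml: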
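groups:
  - name: example  # name of the rule group
    rules:
      - alert: InstanceDown  # checks job status; if metrics cannot be scraped for 1 minute, the alert is sent to AlertManager
        expr: up == 0
        for: 1m  # duration: the alert fires once the target has been unreachable for a full minute
        labels:
          serverity: page  # custom label (conventionally spelled "severity")
        annotations:
          summary: "Instance {{ $labels.instance }} down"  # custom summary
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."  # custom detailed description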
The rule above is a very common one: it detects whether an application instance is down.
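To make the effect of the for clause concrete, here is a minimal Java sketch of how an alert might move from pending to firing across successive evaluations. It is a simplified model and not Prometheus's actual implementation. The class ForClauseSketch, its State enum, and the evaluate method are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;

class ForClauseSketch {
    enum State { INACTIVE, PENDING, FIRING }

    // Returns the alert state after each evaluation.
    // exprTrue[i] says whether the rule expression (e.g. up == 0) matched at evaluation i.
    // Simplified model (assumption): the alert fires once the expression has matched
    // continuously for at least forSeconds; a single non-match resets it.
    static State[] evaluate(boolean[] exprTrue, int intervalSeconds, int forSeconds) {
        State[] states = new State[exprTrue.length];
        int activeSince = -1; // time of the first matching evaluation, -1 while inactive
        for (int i = 0; i < exprTrue.length; i++) {
            int now = i * intervalSeconds;
            if (!exprTrue[i]) {
                activeSince = -1;
                states[i] = State.INACTIVE;
                continue;
            }
            if (activeSince < 0) {
                activeSince = now;
            }
            states[i] = (now - activeSince >= forSeconds) ? State.FIRING : State.PENDING;
        }
        return states;
    }

    public static void main(String[] args) {
        // evaluation_interval: 15s and for: 1m, target down at every evaluation
        boolean[] down = {true, true, true, true, true};
        System.out.println(Arrays.toString(evaluate(down, 15, 60)));
    }
}
```

In this model, with a 15-second evaluation interval and for: 1m, the alert stays pending for the first four evaluations and is first reported as firing at the fifth, 60 seconds after the expression first matched.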
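After changing the configuration, you can reload it through this endpoint: curl -X POST http://localhost:9090/-/reload
The endpoint is only available when Prometheus is started with --web.enable-lifecycle; without that flag the reload request is rejected:
./prometheus --config.file=prometheus.yml --web.enable-lifecycle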
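The same reload can be triggered from application code. The following is a minimal sketch using the JDK's built-in java.net.http client (Java 11 or later). The class name PrometheusReload and its methods are hypothetical, and the base URL is assumed to be the local Prometheus address used in the curl command.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class PrometheusReload {
    // Builds the empty-bodied POST request for the lifecycle reload endpoint.
    static HttpRequest buildReloadRequest(String baseUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/-/reload"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Sends the request and returns the HTTP status code of the response.
    static int reload(String baseUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildReloadRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) {
        try {
            // Assumed address; adjust to wherever Prometheus is listening.
            System.out.println("reload status: " + reload("http://localhost:9090"));
        } catch (IOException | InterruptedException e) {
            System.out.println("reload failed: " + e);
        }
    }
}
```

A non-2xx status here most likely means the server was started without --web.enable-lifecycle or the new configuration failed to load.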
Custom alert notification
Edit the prometheus.yml configuration file:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:17201"]  # address that alerts are pushed to
When there are alerts to deliver, Prometheus pushes them to the service at localhost:17201 configured above, as follows:
Endpoint path: /api/v1/alerts
Sample handler:
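// Spring MVC handler method; `log` is assumed to be an SLF4J logger declared on the controller class.
@RequestMapping(value = "/api/v1/alerts")
public String alert(@RequestBody String body) {
    // Log the raw JSON array of alerts pushed by Prometheus.
    log.info("/api/v1/alerts={}", body);
    return "success";
}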
Request body structure:
[
  {
    "labels": {
      "alertname": "InstanceDown",
      "env": "dev",
      "instance": "10.208.204.46:19999",
      "job": "RMS-MS",
      "serverity": "page"
    },
    "annotations": {
      "description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
      "summary": "Instance 10.208.204.46:19999 down"
    },
    "startsAt": "2018-06-19T17:07:54.140071559+08:00",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
  },
  {
    "labels": {
      "alertname": "InstanceDown",
      "env": "dev",
      "instance": "10.208.204.46:19999",
      "job": "RMS-MS",
      "serverity": "page"
    },
    "annotations": {
      "description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
      "summary": "Instance 10.208.204.46:19999 down"
    },
    "startsAt": "2018-06-19T17:07:54.140071559+08:00",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
  },
  {
    "labels": {
      "alertname": "InstanceDown",
      "env": "dev",
      "instance": "192.168.164.1:18093",
      "job": "RMS-MS",
      "serverity": "page"
    },
    "annotations": {
      "description": "192.168.164.1:18093 of job RMS-MS has been down for more than 5 minutes.",
      "summary": "Instance 192.168.164.1:18093 down"
    },
    "startsAt": "2018-06-19T17:07:54.140071559+08:00",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
  }
]
If, say, all three RMS-MS machines go down, Prometheus sends the data above to localhost:17201/api/v1/alerts,
and we can build alert notifications from that data.
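As a rough illustration of what building notifications from that data could look like, the sketch below turns already-parsed alerts into one message line per distinct alert. It assumes the JSON array has been deserialized elsewhere (for example by the JSON library of the web framework). The class AlertNotifier, its nested Alert type, and the message format are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AlertNotifier {
    // Minimal view of one entry of the JSON array posted to /api/v1/alerts.
    static final class Alert {
        final Map<String, String> labels;
        final Map<String, String> annotations;

        Alert(Map<String, String> labels, Map<String, String> annotations) {
            this.labels = labels;
            this.annotations = annotations;
        }
    }

    // Builds one message per distinct (alertname, instance) pair, so repeated
    // entries for the same instance produce a single notification line.
    static List<String> buildMessages(List<Alert> alerts) {
        Set<String> seen = new LinkedHashSet<>();
        List<String> messages = new ArrayList<>();
        for (Alert a : alerts) {
            String key = a.labels.get("alertname") + "|" + a.labels.get("instance");
            if (seen.add(key)) {
                messages.add("[" + a.labels.get("job") + "] " + a.annotations.get("summary"));
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        Alert first = new Alert(
                Map.of("alertname", "InstanceDown", "instance", "10.208.204.46:19999", "job", "RMS-MS"),
                Map.of("summary", "Instance 10.208.204.46:19999 down"));
        Alert second = new Alert(
                Map.of("alertname", "InstanceDown", "instance", "192.168.164.1:18093", "job", "RMS-MS"),
                Map.of("summary", "Instance 192.168.164.1:18093 down"));
        // The first alert is passed twice to show the de-duplication.
        buildMessages(List.of(first, first, second)).forEach(System.out::println);
    }
}
```

De-duplicating by alert name and instance is a design choice for this sketch; a real receiver would also have to decide how to batch, throttle, and route the messages, which is the part AlertManager already provides.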
AlertManager
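This approach uses AlertManager, the alerting component of the Prometheus project. When an alert fires, Prometheus pushes the alert data to AlertManager, which then sends out notifications according to its own routing rules.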
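global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 30s
  # interval for the same group
  group_interval: 5m  # interval between notifications for the same group; alerts that arrive for the group within these 5 minutes are batched and sent together
  repeat_interval: 1s  # interval before an identical alert notification is sent again
  receiver: 'webhook'  # name of the receiver to deliver to
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://10.208.204.46:17210/test/alert2'  # receiving endpoint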
This setup uses the webhook receiver: AlertManager pushes alert notifications to http://10.208.204.46:17210/test/alert2 .
The payload looks like this:
{
  "receiver": "webhook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {},
      "annotations": {
        "description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
        "summary": "Instance 10.208.204.46:19999 down"
      },
      "startsAt": "2018-06-19T17:25:54.143824172+08:00",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
    },
    {
      "status": "firing",
      "labels": {
        "alertname": "InstanceDown",
        "env": "dev",
        "instance": "192.168.164.1:18093",
        "job": "RMS-MS",
        "serverity": "page"
      },
      "annotations": {
        "description": "192.168.164.1:18093 of job RMS-MS has been down for more than 5 minutes.",
        "summary": "Instance 192.168.164.1:18093 down"
      },
      "startsAt": "2018-06-19T17:25:54.143824172+08:00",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
    }
  ],
  "groupLabels": {
    "job": "RMS-MS"
  },
  "commonLabels": {
    "alertname": "InstanceDown",
    "env": "dev",
    "job": "RMS-MS",
    "serverity": "page"
  },
  "commonAnnotations": {},
  "externalURL": "http://localhost.localdomain:9093",
  "version": "4",
  "groupKey": "{}:{job=\"RMS-MS\"}"
}
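If all three machines in a cluster go down, AlertManager aggregates the alerts for the three machines and sends them to the webhook endpoint as a single notification.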
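The aggregation is driven by group_by: ['job'] in the route. The following Java sketch illustrates the idea of grouping alerts by one label; it is only an illustration of the concept, not AlertManager's implementation, and the class AlertGrouping and method groupByLabel are hypothetical names. Alerts that lack the label are assumed to fall into a group keyed by the empty string.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class AlertGrouping {
    // Groups alert label sets by the value of a single label,
    // similar in spirit to group_by: ['job'] in the route above.
    static Map<String, List<Map<String, String>>> groupByLabel(
            List<Map<String, String>> alerts, String label) {
        return alerts.stream().collect(Collectors.groupingBy(
                labels -> labels.getOrDefault(label, ""), // missing label -> "" group (assumption)
                LinkedHashMap::new,                        // keep first-seen group order
                Collectors.toList()));
    }

    public static void main(String[] args) {
        List<Map<String, String>> alerts = List.of(
                Map.of("alertname", "InstanceDown", "job", "RMS-MS", "instance", "10.208.204.46:19999"),
                Map.of("alertname", "InstanceDown", "job", "RMS-MS", "instance", "192.168.164.1:18093"),
                Map.of("alertname", "InstanceDown", "job", "node", "instance", "localhost:9100"));
        // Each map entry corresponds to one aggregated notification.
        groupByLabel(alerts, "job").forEach((job, group) ->
                System.out.println(job + ": " + group.size() + " alert(s)"));
    }
}
```

Each resulting group corresponds to one notification, which is why the webhook payload above carries several alerts under a single groupLabels entry.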
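Comparison
Feature | AlertManager | Custom alerting |
---|---|---|
Grouping | Bundles alerts that belong to the same group into one aggregated notification | Must be built in-house |
Inhibition | Once an alert has fired, stops sending the further alerts that are consequences of it | Must be built in-house |
Silencing | A simple mechanism for muting notifications during a given period | Must be built in-house |
Drawbacks | Not written in Java, which makes it hard for us to study in depth | High development cost; rudimentary at first |
Advantages | Mature technology | - |
We recommend using AlertManager as the first stage of alert notification and then forwarding alerts to our own application via webhook.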
Reposted from
prometheus告警技術初探(一) | sharedCode https://www.shared-code.com/article/82