
Trafodion Troubleshooting-org.apache.zookeeper.KeeperException$NoNodeException

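Symptom

In a customer environment, every time Trafodion was started, the DcsServer went to the Down state immediately after dcsstart. The relevant dcs_master and dcs_server logs showed the following: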
–dcs master log

2017-08-29 13:05:10,895, ERROR, org.trafodion.dcs.master.ServerManager, Node Number: , CPU: , PID: , Process Name: , , ,exit code [1], stdout [server 21612. Stop it first.]
2017-08-29 13:05:10,895, INFO, org.trafodion.dcs.util.RetryCounter, Node Number: , CPU: , PID: , Process Name: , , ,Sleeping 2000ms before retry #1...
2017-08-29 13:05:12,896, INFO, org.trafodion.dcs.master.ServerManager, Node Number: , CPU: , PID: , Process Name: , , ,Restarting DcsServer [bdtest04.novalocal:1], script [

–dcs server log

2017-08-29 13:04:28,918, ERROR, org.trafodion.dcs.server.ServerManager, Node Number: , CPU: , PID: , Process Name: , , ,org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /trafodion/dcs/servers/registered/bdtest04.novalocal:1:4
2017-08-29 13:04:28,920, ERROR, org.trafodion.dcs.server.DcsServer, Node Number: , CPU: , PID: , Process Name: , , ,java.util.concurrent.ExecutionException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /trafodion/dcs/servers/registered/bdtest04.novalocal:1:4

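Analysis

Based on the errors above, the first suspicion was that the znode /trafodion/dcs/servers/registered/bdtest04.novalocal had not been created successfully in ZooKeeper. Checking with zkCli.sh, however, showed that the node already existed.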
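When the log is long, it can help to pull out the exact znode paths that ZooKeeper reported as missing before looking them up in zkCli.sh. The following is a minimal sketch, not part of Trafodion; the function name `missing_znodes` is hypothetical, and the regular expression assumes the message format shown in the dcs server log above.

```python
import re

# Captures the znode path that follows "NoNode for " in a ZooKeeper error message.
NO_NODE = re.compile(r"KeeperErrorCode = NoNode for (\S+)")


def missing_znodes(log_lines):
    """Return the distinct znode paths reported as NoNode, in first-seen order."""
    seen = []
    for line in log_lines:
        match = NO_NODE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# The first ERROR line from the dcs server log above.
sample = [
    "2017-08-29 13:04:28,918, ERROR, org.trafodion.dcs.server.ServerManager, "
    "Node Number: , CPU: , PID: , Process Name: , , ,"
    "org.apache.zookeeper.KeeperException$NoNodeException: "
    "KeeperErrorCode = NoNode for "
    "/trafodion/dcs/servers/registered/bdtest04.novalocal:1:4",
]

for path in missing_znodes(sample):
    print(path)
```

Each printed path can then be checked by hand from a zkCli.sh session.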

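Resolution

CDH Manager showed that, because a disk on one node had failed earlier, one of the ZooKeeper servers had been moved to a different host, but the "dcs.zookeeper.quorum" property in dcs-site.xml had not been updated to match. Updating dcs-site.xml, syncing it to all Trafodion nodes, and restarting Trafodion resolved the problem.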

<property>
<name>dcs.zookeeper.quorum</name>
<value>bdtest04.novalocal,bdtest06.novalocal,bdtest03.novalocal</value>
</property>
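A stale quorum entry like this one is easy to miss by eye, so a small check can be useful. The sketch below reads "dcs.zookeeper.quorum" from the text of a dcs-site.xml file and compares it with the hosts that actually run ZooKeeper. It is an illustration only: the function names are hypothetical, the `<configuration>` wrapper is an assumed Hadoop-style layout, the stale host `bdtest05.novalocal` is invented for the example (the original does not say which host was replaced), and the "actual" list is assumed to match the property value shown above.

```python
import xml.etree.ElementTree as ET


def read_quorum(xml_text, name="dcs.zookeeper.quorum"):
    """Return the host list configured for `name`, or [] if the property is absent."""
    root = ET.fromstring(xml_text)
    # iter() also visits the root, so a bare <property> fragment works too.
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            value = prop.findtext("value", "")
            return [host.strip() for host in value.split(",") if host.strip()]
    return []


def quorum_diff(configured, actual):
    """Return (stale, missing): hosts only in the config, hosts only in the ensemble."""
    stale = sorted(set(configured) - set(actual))
    missing = sorted(set(actual) - set(configured))
    return stale, missing


# Hypothetical pre-fix config: bdtest05.novalocal stands in for the replaced host.
STALE_XML = """
<configuration>
  <property>
    <name>dcs.zookeeper.quorum</name>
    <value>bdtest04.novalocal,bdtest05.novalocal,bdtest03.novalocal</value>
  </property>
</configuration>
"""

# Assumed current ensemble, taken from the property value shown above.
ACTUAL = ["bdtest04.novalocal", "bdtest06.novalocal", "bdtest03.novalocal"]

configured = read_quorum(STALE_XML)
stale, missing = quorum_diff(configured, ACTUAL)
print("stale hosts in dcs-site.xml:", stale)
print("ensemble hosts missing from dcs-site.xml:", missing)
```

Running the same comparison against the dcs-site.xml on every Trafodion node would also confirm that the corrected file was synced everywhere before the restart.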