Flink On Yarn 異常排除過程以及根據位元組碼名字獲取jar檔名字
最初學習Flink,寫了一個簡單的wordcount執行一下,發現報錯,異常資訊如下:
The program finished with the following exception: java.lang.RuntimeException: Error deploying the YARN cluster at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createCluster(FlinkYarnSessionCli.java:556) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createCluster(FlinkYarnSessionCli.java:72) at org.apache.flink.client.CliFrontend.createClient(CliFrontend.java:962) at org.apache.flink.client.CliFrontend.run(CliFrontend.java:243) at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086) at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133) at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130) at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40) at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1130) Caused by: java.lang.RuntimeException: Couldn't deploy Yarn cluster at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:443) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createCluster(FlinkYarnSessionCli.java:554) ... 12 more Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment. Diagnostics from YARN: Application application_1505452408837_0056 failed 1 times due to AM Container for appattempt_1505452408837_0056_000001 exited with exitCode: 31 For more detailed output, check application tracking page:http://node:8099/cluster/app/application_1505452408837_0056Then, click on links to logs of each attempt. Diagnostics: Exception from container-launch. Container id: container_1505452408837_0056_01_000001 Exit code: 31 Stack trace: ExitCodeException exitCode=31: at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Container exited with a non-zero exit code 31 Failing this attempt. Failing the application. If log aggregation is enabled on your cluster, use this command to further investigate the issue:
由於Flink資料比較少,google很久也沒有找到,後來一直以為是Flink Bug所致了,故而沒有太重視,後來隨著版本升級發現還是報這個錯誤,開始懷疑自己了。然後開始進行錯誤排查,只能說這個錯誤資訊真能矇蔽人。經過一點點探索,通過檢視yarn命令可以發現如下比較有意義的資訊:
命令列:
yarn logs -applicationId application_1505452408837_0056
異常資訊為:
2017-11-22 22:07:35,539 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to daxin (auth:SIMPLE) 2017-11-22 22:07:35,551 ERROR org.apache.flink.yarn.YarnApplicationMasterRunner - YARN Application Master initialization failed java.lang.NoSuchMethodError: org.apache.flink.runtime.util.ExecutorThreadFactory.<init>(Ljava/lang/String;)V at org.apache.flink.yarn.YarnApplicationMasterRunner.runApplicationMaster(YarnApplicationMasterRunner.java:223) at org.apache.flink.yarn.YarnApplicationMasterRunner$1.call(YarnApplicationMasterRunner.java:195) at org.apache.flink.yarn.YarnApplicationMasterRunner$1.call(YarnApplicationMasterRunner.java:192) at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40) at org.apache.flink.yarn.YarnApplicationMasterRunner.run(YarnApplicationMasterRunner.java:192) at org.apache.flink.yarn.YarnApplicationMasterRunner.main(YarnApplicationMasterRunner.java:116) End of LogType:jobmanager.log LogType:jobmanager.out Log Upload Time:Wed Nov 22 22:07:37 +0800 2017 LogLength:0 Log Contents: End of LogType:jobmanager.out
啊,這個異常資訊終於可以提示一些有意義的資訊了!如上異常提示ExecutorThreadFactory沒有帶有String型別的構造器,經過遠端debug發現也有該類,使用反射檢查構造器個數的確是0個。反覆解壓jar包,使用反編譯檢視class檔案都沒有發現問題,而且使用-verbose:class顯示class載入資訊也沒有發現問題。如果有人發現問題請指點。
由於程式jar包是由idea直接export的,不是maven打包。後來將原有程式不動,使用maven打包執行既然可以執行。
問題解決了,卻沒有找到原因的確挺尷尬的。但是收穫還是有的:
1:可以編碼方式顯示每一個class檔案來至於哪一個jar包,除了使用maven分析衝突之外,這種方式可以分析非maven專案的jar衝突。程式原始碼:
public static List<String> getJarName(String className) throws Exception {
//boot類載入器載入的jar路徑
String bootPath = System.getProperty("sun.boot.class.path");
//ext類載入器載入的jar路徑
String extPath = System.getProperty("java.ext.dirs");
//所有的jars
String allPath = System.getProperty("java.class.path");
//windows平臺是; linux平臺是: ,最好使用File.pathSeparatorChar做跨平臺
String[] jarnames = allPath.split(";");
List<String> jars = new ArrayList<>();
for (int i = 0; i < jarnames.length; i++) {
File file = new File(jarnames[i]);
if (file.isFile() && file.exists()) {
JarFile jar = new JarFile(file);
Enumeration<JarEntry> enums = jar.entries();
while (enums.hasMoreElements()) {
JarEntry entry = enums.nextElement();
String qulifierName = entry.getName().replace("/", ".");
if (qulifierName.equals(className)) {
jars.add("class name = "+qulifierName + " , jar filename = " + file.getName());
}
}
}
}
return jars;
}
例如呼叫:
getJarName("java.util.concurrent.ThreadPoolExecutor.class")
輸出:
class name = java.util.concurrent.ThreadPoolExecutor.class , jar filename = rt.jar
輸出部分資訊如:
3:關於Flink除錯JVM引數文件地址:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#common-options
配置引數的類:
org.apache.flink.configuration.ConfigConstants
org.apache.flink.configuration.CoreOptions
所有有關配置的工具類都在org.apache.flink.configuration包中。