1. 程式人生 > >Flink On Yarn 異常排除過程以及根據位元組碼名字獲取jar檔名字

Flink On Yarn 異常排除過程以及根據位元組碼名字獲取jar檔名字

最初學習Flink,寫了一個簡單的wordcount執行一下,發現報錯,異常資訊如下:

 The program finished with the following exception:

java.lang.RuntimeException: Error deploying the YARN cluster
	at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createCluster(FlinkYarnSessionCli.java:556)
	at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createCluster(FlinkYarnSessionCli.java:72)
	at org.apache.flink.client.CliFrontend.createClient(CliFrontend.java:962)
	at org.apache.flink.client.CliFrontend.run(CliFrontend.java:243)
	at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)
	at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)
	at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)
	at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
	at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1130)
Caused by: java.lang.RuntimeException: Couldn't deploy Yarn cluster
	at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:443)
	at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createCluster(FlinkYarnSessionCli.java:554)
	... 12 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment. 
Diagnostics from YARN: Application application_1505452408837_0056 failed 1 times due to AM Container for appattempt_1505452408837_0056_000001 exited with  exitCode: 31
For more detailed output, check application tracking page:http://node:8099/cluster/app/application_1505452408837_0056Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1505452408837_0056_01_000001
Exit code: 31
Stack trace: ExitCodeException exitCode=31: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)


Container exited with a non-zero exit code 31
Failing this attempt. Failing the application.
If log aggregation is enabled on your cluster, use this command to further investigate the issue:

由於Flink資料比較少,google很久也沒有找到,後來一直以為是Flink Bug所致了,故而沒有太重視,後來隨著版本升級發現還是報這個錯誤,開始懷疑自己了。然後開始進行錯誤排查,只能說這個錯誤資訊真能矇蔽人。經過一點點探索,通過檢視yarn命令可以發現如下比較有意義的資訊:

命令列:

yarn logs -applicationId application_1505452408837_0056

異常資訊為:
2017-11-22 22:07:35,539 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to daxin (auth:SIMPLE)
2017-11-22 22:07:35,551 ERROR org.apache.flink.yarn.YarnApplicationMasterRunner             - YARN Application Master initialization failed
java.lang.NoSuchMethodError: org.apache.flink.runtime.util.ExecutorThreadFactory.<init>(Ljava/lang/String;)V
	at org.apache.flink.yarn.YarnApplicationMasterRunner.runApplicationMaster(YarnApplicationMasterRunner.java:223)
	at org.apache.flink.yarn.YarnApplicationMasterRunner$1.call(YarnApplicationMasterRunner.java:195)
	at org.apache.flink.yarn.YarnApplicationMasterRunner$1.call(YarnApplicationMasterRunner.java:192)
	at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
	at org.apache.flink.yarn.YarnApplicationMasterRunner.run(YarnApplicationMasterRunner.java:192)
	at org.apache.flink.yarn.YarnApplicationMasterRunner.main(YarnApplicationMasterRunner.java:116)
End of LogType:jobmanager.log

LogType:jobmanager.out
Log Upload Time:Wed Nov 22 22:07:37 +0800 2017
LogLength:0
Log Contents:
End of LogType:jobmanager.out

啊,這個異常資訊終於可以提示一些有意義的資訊了!如上異常提示ExecutorThreadFactory沒有帶有String型別的構造器,經過遠端debug發現也有該類,使用反射檢查構造器個數的確是0個。反覆解壓jar包,使用反編譯檢視class檔案都沒有發現問題,而且使用-verbose:class顯示class載入資訊也沒有發現問題。如果有人發現問題請指點。

由於程式jar包是由idea直接export的,不是maven打包。後來將原有程式不動,使用maven打包執行既然可以執行。

問題解決了,卻沒有找到原因的確挺尷尬的。但是收穫還是有的:

1:可以編碼方式顯示每一個class檔案來至於哪一個jar包,除了使用maven分析衝突之外,這種方式可以分析非maven專案的jar衝突。程式原始碼:

public static List<String> getJarName(String className) throws Exception {
        //boot類載入器載入的jar路徑
        String bootPath = System.getProperty("sun.boot.class.path");
        //ext類載入器載入的jar路徑
        String extPath = System.getProperty("java.ext.dirs");
        //所有的jars
        String allPath = System.getProperty("java.class.path");
        //windows平臺是; linux平臺是:  ,最好使用File.pathSeparatorChar做跨平臺
        String[] jarnames = allPath.split(";");


        List<String> jars = new ArrayList<>();

        for (int i = 0; i < jarnames.length; i++) {
            File file = new File(jarnames[i]);
            if (file.isFile() && file.exists()) {
                JarFile jar = new JarFile(file);
                Enumeration<JarEntry> enums = jar.entries();
                while (enums.hasMoreElements()) {
                    JarEntry entry = enums.nextElement();
                    String qulifierName = entry.getName().replace("/", ".");
                    if (qulifierName.equals(className)) {
                        jars.add("class name =  "+qulifierName + " , jar filename = " + file.getName());
                    }
                }
            }
        }

        return jars;
    }

例如呼叫:

getJarName("java.util.concurrent.ThreadPoolExecutor.class")

輸出:

class name =  java.util.concurrent.ThreadPoolExecutor.class , jar filename = rt.jar

2:使用-verbose:class顯示程式載入class的詳細資訊:


輸出部分資訊如:


3:關於Flink除錯JVM引數文件地址:

https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#common-options

配置引數的類:

org.apache.flink.configuration.ConfigConstants
org.apache.flink.configuration.CoreOptions

所有有關配置的工具類都在org.apache.flink.configuration包中。