yarn关于app max attempt深度解析,针对长服务appmaster平滑重启
在YARN上开发长服务,需要注意fault-tolerance,本篇文章对appmaster的平滑重启的一个参数做了解析,如何设置可以有助于达到appmaster平滑重启。
在yarn-site.xml有个参数
/** * The maximum number of application attempts. * It's a global setting for all application masters. */ yarn.resourcemanager.am.max-attempts
一个全局的appmaster重试次数的限制,yarn提交应用时,还可以为单独一个应用设置最大重试次数
/** * Set the number of max attempts of the application to be submitted. WARNING: * it should be no larger than the global number of max attempts in the Yarn * configuration. * @param maxAppAttempts the number of max attempts of the application * to be submitted. */ @Public @Stable public abstract void setMaxAppAttempts(int maxAppAttempts);
当attempt失败时,如果设置keepContainersAcrossAppAttempts了,resource manager会决定上个attempt的container是否仍然保留着。
boolean keepContainersAcrossAppAttempts = false; switch (finalAttemptState) { case FINISHED: { appEvent = new RMAppFinishedAttemptEvent(applicationId, appAttempt.getDiagnostics()); } break; case KILLED: { // don't leave the tracking URL pointing to a non-existent AM appAttempt.setTrackingUrlToRMAppPage(); appAttempt.invalidateAMHostAndPort(); appEvent = new RMAppFailedAttemptEvent(applicationId, RMAppEventType.ATTEMPT_KILLED, "Application killed by user.", false); } break; case FAILED: { // don't leave the tracking URL pointing to a non-existent AM appAttempt.setTrackingUrlToRMAppPage(); appAttempt.invalidateAMHostAndPort(); if (appAttempt.submissionContext .getKeepContainersAcrossApplicationAttempts() && !appAttempt.submissionContext.getUnmanagedAM()) { // See if we should retain containers for non-unmanaged applications if (!appAttempt.shouldCountTowardsMaxAttemptRetry()) { // Premption, hardware failures, NM resync doesn't count towards // app-failures and so we should retain containers. keepContainersAcrossAppAttempts = true; } else if (!appAttempt.maybeLastAttempt) { // Not preemption, hardware failures or NM resync. // Not last-attempt too - keep containers. keepContainersAcrossAppAttempts = true; } } appEvent = new RMAppFailedAttemptEvent(applicationId, RMAppEventType.ATTEMPT_FAILED, appAttempt.getDiagnostics(), keepContainersAcrossAppAttempts); } }
关注appAttempt.maybeLastAttempt这个变量,rs如何判断是否这次attempt是最后一次呢?
private void createNewAttempt() { ApplicationAttemptId appAttemptId = ApplicationAttemptId.newInstance(applicationId, attempts.size() + 1); RMAppAttempt attempt = new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService, submissionContext, conf, // The newly created attempt maybe last attempt if (number of // previously failed attempts(which should not include Preempted, // hardware error and NM resync) + 1) equal to the max-attempt // limit. maxAppAttempts == (getNumFailedAppAttempts() + 1), amReq); attempts.put(appAttemptId, attempt); currentAttempt = attempt; }
在每次构造新的attempt时候,maxAppAttempts == (getNumFailedAppAttempts() + 1)会决定,已经失败的次数+1,是否已经达到了maxAppAttempts的限制了。
而maxAppAttempts这个参数是由global和individual两个配置取min,决定的。
int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); int individualMaxAppAttempts = submissionContext.getMaxAppAttempts(); if (individualMaxAppAttempts <= 0 || individualMaxAppAttempts > globalMaxAppAttempts) { this.maxAppAttempts = globalMaxAppAttempts; LOG.warn("The specific max attempts: " + individualMaxAppAttempts + " for application: " + applicationId.getId() + " is invalid, because it is out of the range [1, " + globalMaxAppAttempts + "]. Use the global max attempts instead."); } else { this.maxAppAttempts = individualMaxAppAttempts; }
总结:
如果希望appmaster可以达到不断重启,而且可以接管之前的container,需要把yarn.resourcemanager.am.max-attempts这个参数尽量调大,比如设置为10000,并且提交app时候设置submit context的最大次数,以及刷新窗口,这样基本就可以满足长服务应用在yarn上面的运行需求了。