It is possible for a malicious application master to request far more resources than the cluster can offer. Based on my observation, this starves the whole cluster, since the malicious application reserves all available resources while still waiting for more. YARN seems to eventually time out that application master. The property yarn.am.liveness-monitor.expiry-interval-ms seems relevant, but I don’t want legitimate long-running AMs to time out prematurely.
Similarly, I can submit thousands of applications to a YARN cluster at the same time, and each will launch an AM, which could produce a deadlock.
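To make the second scenario concrete, here is a toy model of what I mean by deadlock. This is not YARN code: the function name, the idea of fixed-size "slots", and the numbers are all made up for illustration. It assumes every AM holds its container while waiting for at least one task container, and that nothing caps how many AMs can start.

```python
def is_deadlocked(cluster_slots: int, submitted_apps: int,
                  am_slots: int = 1, task_slots: int = 1) -> bool:
    """Toy model: True if AMs are running but none can get a task container.

    cluster_slots  -- total capacity, in abstract equal-sized slots (hypothetical unit)
    submitted_apps -- number of applications submitted at the same time
    am_slots       -- slots each AM container occupies
    task_slots     -- slots the smallest task container needs
    """
    # Start as many AMs as fit; the rest stay queued.
    running_ams = min(submitted_apps, cluster_slots // am_slots)
    # Whatever is left is all the running AMs can share for real work.
    free = cluster_slots - running_ams * am_slots
    # Deadlock: AMs hold capacity, yet not even one task container fits.
    return running_ams > 0 and free < task_slots


if __name__ == "__main__":
    print(is_deadlocked(cluster_slots=100, submitted_apps=1000))
    print(is_deadlocked(cluster_slots=100, submitted_apps=10))
```

In this model, 1000 submissions against 100 slots leave every slot held by an AM and nothing free for tasks, while 10 submissions leave 90 slots free and work proceeds.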
What’s the best way to handle this type of malicious application?