I'm a student experimenting with YARN, and I have a couple of questions concerning the “availability mechanisms” in YARN. Maybe some of you can help me and give me some hints…
I wrote a simple demo application (comparable to the “Distributed Shell” example) and tested it on a single-node setup. That all worked great!
Now I want to investigate the “Failure Tolerance / Availability” topic. For that I examined several scenarios, but YARN did not behave as I expected:
(1) Start the YARN app (1 Client, 1 ApplicationMaster (AM) and 1 Worker/Task associated with the AM). Kill the ResourceManager (RM) process while the AM/Worker is running, to simulate a crashed RM node. I would expect the RM to be relaunched in a newly allocated container (if so, which component is responsible for that relaunch: ZooKeeper?). Is that assumption wrong? How must the AM react in order to reconnect to a newly launched RM? Or is the complete system down after an RM crash?
(2) If I kill the AM process, I expect the ApplicationsManager (ASM) to restart the AM, also in a newly allocated container, and to execute all tasks that haven't been executed so far.
(3) (Liveness protocols) If I set RM_AM_EXPIRY_INTERVAL_MS to 2 min and let the AM freeze (via a sleep command) for 3 min, nothing happens. I would expect the ASM to restart the AM in this case as well. But the job finishes without problems and without any log notification.
(4) The liveness management for the worker nodes is handled completely by the AM. So if a worker node crashes, must the AM itself restart all crashed tasks?
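To make my expectation in (3) concrete, here is a minimal sketch in plain Java of how I imagine the expiry check behaves. It is only a toy model and not Hadoop code: the class ExpiryMonitorSketch, its methods and the attempt id "am-attempt-1" are names I made up, and I am merely assuming that the RM remembers the time of the last AM heartbeat and compares it against the configured interval.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a heartbeat expiry check; hypothetical, not the YARN implementation. */
final class ExpiryMonitorSketch {
    private final long expiryIntervalMs;
    // Last heartbeat timestamp (ms) per attempt id.
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    ExpiryMonitorSketch(long expiryIntervalMs) {
        this.expiryIntervalMs = expiryIntervalMs;
    }

    /** Record a heartbeat from the given (hypothetical) attempt id. */
    void heartbeat(String attemptId, long nowMs) {
        lastHeartbeatMs.put(attemptId, nowMs);
    }

    /** True if the attempt is known and its last heartbeat is older than the interval. */
    boolean isExpired(String attemptId, long nowMs) {
        Long last = lastHeartbeatMs.get(attemptId);
        return last != null && nowMs - last > expiryIntervalMs;
    }

    public static void main(String[] args) {
        long twoMinutes = 2 * 60 * 1000L;   // the interval I configured
        long threeMinutes = 3 * 60 * 1000L; // how long my AM sleeps
        ExpiryMonitorSketch monitor = new ExpiryMonitorSketch(twoMinutes);
        monitor.heartbeat("am-attempt-1", 0L);
        // Exactly at the interval the attempt is still considered alive.
        System.out.println(monitor.isExpired("am-attempt-1", twoMinutes));
        // After three silent minutes I would expect it to be expired.
        System.out.println(monitor.isExpired("am-attempt-1", threeMinutes));
    }
}
```

Under this model the AM would count as expired after the 3 min sleep, which is why I expected a restart. Since that does not happen, either the model is wrong or heartbeats still reach the RM during the sleep.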
I base these behavioral assumptions on the design document “Architecture of Next Generation Apache Hadoop MapReduce Framework”. But maybe my way of exploring is completely wrong…
It would be great if someone could give me a few hints!
Thanks in advance,