Hadoop can use Kerberos to authenticate users and the processes running within a Hadoop cluster that act on their behalf. It is also used to authenticate services running within the Hadoop cluster itself, so that only authenticated HDFS Datanodes can join the HDFS filesystem, and only trusted Node Managers can heartbeat to the YARN Resource Manager and receive work.
*The exact means by which all this is done is one of the most complicated pieces of code to span the entire Hadoop codebase.*
Users of Hadoop do not need to worry about the implementation details, and, ideally, nor should the operations team.
Developers of core Hadoop code, anyone writing a YARN application, and anyone writing code to interact with a Hadoop cluster and applications running in it do need to know those details.
This is what this book attempts to cover.
Before going in there, here's a recurring question: why? Why Kerberos and not, say, some SSL-certificate-like system? Or OAuth?
Kerberos was written to support centrally managed accounts in a local area network, one in which administrators manage individual accounts. This is actually much simpler to manage than PKI-certificate-based systems: look at the effort it takes to revoke a certificate in a browser.