So far we've covered Kerberos and Kerberos Tickets. Hadoop complicates things by adding another form of delegated authentication, Hadoop Tokens. Why does it need these?
That's a good question, one developers ask on a regular basis —at least once every hour based on our limited experiments.
Hadoop clusters are some of the largest "single" distributed systems on the planet in terms of numbers of services: a YARN cluster of 10,000 nodes would have 10,000 hdfs principals, 10,000 yarn principals and the principals of the users running the applications. That's a lot of principals, all talking to each other, all having to talk to the KDC, having to re-authenticate all the time, and making calls to the KDC whenever they wish to talk to another principal in the system.
Tokens are wire-serializable objects issued by Hadoop services, which grant access to services. Some services issue tokens to callers which are then used by those callers to directly interact with other services without involving the KDC at all.
As an example, the HDFS NameNode has to give callers access to the blocks comprising a file. This isn't done in the DataNodes: all filenames and permissions are stored in the NN. All the DNs have is their set of blocks.
To get at these blocks, HDFS gives an authenticated caller a Block Token for every block they need to read in a file. The caller then requests the block from any of the DataNodes hosting it, including the Block Token in the request.
These HDFS Block Tokens do not contain any specific knowledge of the principal running the DataNodes; instead they declare that the caller has the stated access rights to the specific block, up until the token expires.
```java
public class BlockTokenIdentifier extends TokenIdentifier {
  static final Text KIND_NAME = new Text("HDFS_BLOCK_TOKEN");

  private long expiryDate;
  private int keyId;
  private String userId;
  private String blockPoolId;
  private long blockId;
  private final EnumSet<AccessMode> modes;
  private byte[] cache;
  ...
```
Alongside the fields covering the block and permissions, the cache field holds the serialized form of the token identifier.
Hadoop has two main classes of token:

Token | Function |
---|---|
Authentication Token | Directly authenticate a caller. |
Delegation Token | A token which can be passed to another process. |
Authentication Tokens are explicitly issued by services to allow the caller to keep interacting with the service without having to re-request tickets from the TGS using their TGT.
When an Authentication Token expires, the caller must request a new one from the service. If the Kerberos ticket needed to interact with the service has itself expired, this may include re-requesting a ticket from the TGS, or even re-logging in to Kerberos to obtain a new TGT.
As such, they are almost equivalent to Kerberos Tickets —except that it is the distributed services themselves issuing the Authentication Tokens, not the TGS.
A delegation token is requested by a client of a service; it can then be passed to other processes.
When the token expires, the original client must request a new delegation token and pass it on to the other process, again.
What is more important is:

- *delegation tokens can be renewed before they expire.*

This is a fundamental difference between Kerberos Tickets and Hadoop Delegation Tokens.
Holders of delegation tokens may renew them with a token-specific TokenRenewer service, so refreshing them without needing any Kerberos credentials or a new Kerberos login.
More subtly:
- The tokens must be renewed before they expire: once expired, a token is worthless.
- Token renewers can be implemented as a Hadoop RPC service, or by other means, including HTTP.
- Token renewal may simply be the updating of an expiry time in the server, without pushing out new tokens to the clients. This scales well when there are many processes across the cluster associated with a single application.
For the HDFS Client protocol, the client protocol itself is the token renewer. A client may talk to the NameNode using its current token and request a new one, so refreshing it.
In contrast, the YARN timeline service is a pure REST API, which implements its token renewal over HTTP/HTTPS. To refresh the token, the client must issue an HTTP request (a PUT operation, interestingly enough), receiving a new token as a response.
Other delegation token renewal mechanisms could be implemented alongside Hadoop RPC and HTTP; that is a detail which client applications do not need to care about. All that matters is that they have the code to refresh tokens, usually code which lives alongside the RPC/REST client, and that they keep renewing the tokens on a regular basis. Generally this is done by starting a thread in the background.
Delegation tokens can be revoked —such as when the YARN application which needed them completes.
In Kerberos, the client obtains a ticket off the KDC, then hands it to the service —a service which does not need to contain any information about the tickets issued to clients.
With delegation tokens, the specific services supporting them have to implement their own internal tracking of issued tokens. That comes with benefits as well as a cost.
The cost? The services now have to maintain some form of state, either locally or, in HA deployments, in some form of storage shared across the failover services.
The benefit: there's no need to involve the KDC in authenticating requests, yet short-lived access can be granted to applications running in the cluster. This explicitly avoids the problem of having 1000+ containers in a YARN application each trying to talk to the KDC. (Issue: surely tickets offer that feature?)
Imagine a user deploying a YARN application in a cluster, one which needs access to the user's data stored in HDFS. The user would be required to be authenticated with the KDC, and to have been granted a Ticket Granting Ticket, the ticket needed to work with the TGS.
The client-side launcher of the YARN application would be able to talk to HDFS and the YARN resource manager, because the user was logged in to Kerberos. This would be managed in the Hadoop RPC layer, requesting tickets to talk to the HDFS NameNode and YARN ResourceManager, if needed.
To give the YARN application the same rights to HDFS, the client-side application must request a Delegation Token to talk to HDFS, a key which is then passed to the YARN application in the ContainerLaunchContext within the ApplicationSubmissionContext used to define the application to launch: its required container resources, the artifacts to download ("localize"), the environment to set up and the command to run.
The YARN Resource Manager finds a location for the Application Master, and requests that host's Node Manager start the container/application.
The Node Manager uses the "delegated HDFS token" to download the launch-time resources into a local directory space, then executes the application.
Somehow, the HDFS token (and any other supplied tokens) are passed to the application that has been launched.
The launched application master can use this token to interact with HDFS as the original user.
The AM can also pass token(s) on to launched containers, so that they too have access to HDFS.
The Hadoop NameNode does not need to care whether the caller is the user themselves, the Node Manager localizing the container, the launched application or any launched containers. All it does is verify that any caller requesting access to the HDFS filesystem metadata or the contents of a file presents a ticket/token which declares them to be the specific user, and that the token is currently considered valid (based on its expiry time and the clock of the NameNode).
If you are writing an application, what does this mean?
You need to worry about tokens in servers if:
- You want an application deploying work into a YARN cluster to access your service as the user submitting the job.
- You want to support secure connections without requiring clients to re-authenticate with Kerberos whenever a ticket reaches the end of its maximum lifetime.
- You want to allow applications to delegate authority, such as to YARN applications, or other services. (For example, filesystem delegation tokens provided to a Hive Thrift server could be used to access the filesystem as that user.)
- You want a consistent client/server authentication and identification mechanism across secure and insecure clusters. This is exactly what YARN does: a token is issued by the YARN Resource Manager to an application instance's Application Manager at launch time; this is used in all communications from the AM to the RM. Using tokens always means there is no separate codepath between insecure and secure clusters.
You need to worry about tokens in client applications if you wish to interact with Hadoop services. If the client is required to run on a kerberos-authenticated account (e.g. kinit or keytab), then your main concern is simply making sure the principal is logged in.
If your application wishes to run code in the cluster using the YARN scheduler, you need to directly worry about Hadoop tokens. You will need to request delegation tokens from every service with which your application will interact, include them in the YARN launch information —and propagate them from your Application Master to all containers the application launches.
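As a sketch of what that client-side token collection can look like, loosely modelled on the YARN distributed shell client: the variable names, the use of a single default filesystem and the empty launch-context fields here are illustrative assumptions, not a definitive implementation.

```java
import java.nio.ByteBuffer;
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LaunchContextTokens {

  /** Collect HDFS delegation tokens and attach them to an AM launch context. */
  public static ContainerLaunchContext buildAmContext(Configuration conf)
      throws Exception {
    Credentials credentials = new Credentials();

    // The RM principal is named as the renewer, so the RM can keep the
    // tokens alive for as long as the application runs.
    String renewer = conf.get(YarnConfiguration.RM_PRINCIPAL);

    // Ask the default filesystem (and any aggregate filesystems behind it)
    // for delegation tokens; they are collected into the Credentials object.
    FileSystem fs = FileSystem.get(conf);
    fs.addDelegationTokens(renewer, credentials);

    // Marshal the credentials into the byte buffer YARN expects.
    DataOutputBuffer dob = new DataOutputBuffer();
    credentials.writeTokenStorageToStream(dob);
    ByteBuffer tokens = ByteBuffer.wrap(dob.getData(), 0, dob.getLength());

    // Local resources, environment and commands are left empty here;
    // a real client fills them in.
    return ContainerLaunchContext.newInstance(
        Collections.emptyMap(),   // local resources
        Collections.emptyMap(),   // environment
        Collections.emptyList(),  // commands
        null,                     // service data
        tokens,                   // the delegation tokens
        null);                    // ACLs
  }
}
```

The resulting context goes into the ApplicationSubmissionContext; YARN then hands the tokens to the Node Manager and the Application Master as described above.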
(from Owen's design document)
Namenode Token:

```
TokenID = {ownerID, renewerID, issueDate, maxDate, sequenceNumber}
TokenAuthenticator = HMAC-SHA1(masterKey, TokenID)
Delegation Token = {TokenID, TokenAuthenticator}
```

The TokenID is used in messages from the client to identify the client; the service can rebuild the TokenAuthenticator from it; this is the secret used for DIGEST-MD5 signing of requests.
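To make the HMAC step concrete, here is a small sketch of deriving an authenticator from a serialized TokenID using the JDK's javax.crypto classes; the byte arrays are placeholders, and this is not the NameNode's actual key-management code.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class TokenAuthenticatorSketch {

  /** HMAC-SHA1(masterKey, TokenID), as in the formula above. */
  public static byte[] authenticator(byte[] masterKey, byte[] serializedTokenId)
      throws Exception {
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(new SecretKeySpec(masterKey, "HmacSHA1"));
    return mac.doFinal(serializedTokenId);
  }
}
```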
Token renewal: the caller asks the service provider for a token to be renewed. The server updates the expiry date in its local table to min(maxDate, now() + renew_period). A non-HA NN can use these renewal requests to actually rebuild its token table —provided the master key has been persisted.
What is inside a Hadoop Token? Whatever marshallable data the service wishes to supply.
A token is treated as a byte array to be passed in communications, such as when setting up an IPC connection, or as data to include in an HTTP header while negotiating with a remote REST endpoint.
The code on the server which issues tokens, the SecretManager, is free to fill its byte arrays with structures of its choice. Sometimes serialized Java objects are used; more recent code, such as that in YARN, serializes data as a protobuf structure and provides that in the byte array (for example, NMTokenIdentifier).
The abstract class org.apache.hadoop.yarn.api.records.Token is used to represent a token in Java code; it contains:

field | type | role |
---|---|---|
identifier | ByteBuffer | the service-specific data within a token |
password | ByteBuffer | a password |
tokenKind | String | token kind for looking up tokens |
Implementations of the abstract class org.apache.hadoop.security.token.TokenIdentifier must contain everything needed for a caller to be validated by a service which uses tokens for authentication.
The base class has:

field | type | role |
---|---|---|
kind | Text | Unique type of token identifier |

The kind field must be unique across all tokens, as it is used in bonding to tokens.
The abstract class org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier is the base of all Delegation Tokens in HDFS and elsewhere. It extends TokenIdentifier with:

field | type | role |
---|---|---|
owner | Text | owner of the token |
renewer | Text | principal permitted to renew the token |
realUser | Text | the real user, for tokens issued by a proxy user on another user's behalf |
It is straightforward to extend this with more data, adding information into a token which can then be read by the applications which load the tokens. This has been used in HADOOP-14556 to pass Amazon AWS login secrets as if they were a filesystem token.
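As a sketch of what such an extension can look like: the class name, token kind and extra "endpoint" field below are purely illustrative, not any identifier that actually ships with Hadoop.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier;

public class ExampleTokenIdentifier extends AbstractDelegationTokenIdentifier {

  // Hypothetical token kind; must be unique across all token kinds.
  public static final Text KIND = new Text("EXAMPLE_TOKEN");

  // Extra, service-specific payload carried inside the identifier.
  private String endpoint = "";

  public ExampleTokenIdentifier() {
  }

  @Override
  public Text getKind() {
    return KIND;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    super.write(out);               // owner, renewer, realUser, dates, ...
    Text.writeString(out, endpoint);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    super.readFields(in);
    endpoint = Text.readString(in);
  }
}
```

As with any token identifier, such a class would also need to be registered in the META-INF/services resource file covered later in this chapter.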
Every server which handles tokens through Hadoop RPC should implement a org.apache.hadoop.security.token.SecretManager subclass.
In org.apache.hadoop.ipc.Server and org.apache.hadoop.security.SaslRpcServer, a specific implementation of the class SecretManager<T extends TokenIdentifier> creates instances of the TokenIdentifier. This contains a "secret" (generated by the javax.crypto libraries), adding serialization and equality checks. Because of this the keys can be persisted (as HDFS does) or sent over a secure channel. Uses crop up in YARN's ZKRMStateStore, the MapReduce History Server and the YARN Application Timeline Service.
A first step is determining the Kerberos Principal for a service:

- The service name is derived from the URI (see SecurityUtil.buildDTServiceName); different services on the same host have different service names.
- Every service has a protocol (usually defined by the RPC protocol API).
- To find a token for a service, the client enumerates all SecurityInfo instances; these return information about the provider. One class, AnnotatedSecurityInfo, examines the annotations on the class to determine these values, including looking in the Hadoop configuration to determine the kerberos principal declared for that service (see IPC for specifics).
Token renewal is the second part of token work. If you are implementing token support, it is easiest to postpone this until the core issuing is working.
Implement a subclass of org.apache.hadoop.security.token.TokenRenewer:
```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;

public abstract class TokenRenewer {

  /**
   * Does this renewer handle this kind of token?
   * @param kind the kind of the token
   * @return true if this renewer can renew it
   */
  public abstract boolean handleKind(Text kind);

  /**
   * Is the given token managed? Only managed tokens may be renewed or
   * cancelled.
   * @param token the token being checked
   * @return true if the token may be renewed or cancelled
   * @throws IOException
   */
  public abstract boolean isManaged(Token<?> token) throws IOException;

  /**
   * Renew the given token.
   * @return the new expiration time
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long renew(Token<?> token,
      Configuration conf) throws IOException, InterruptedException;

  /**
   * Cancel the given token
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract void cancel(Token<?> token,
      Configuration conf) throws IOException, InterruptedException;
}
```
- In handleKind(), verify the token kind is that which the renewer supports.
- In isManaged(), return true, unless the ability to renew a token is decided on a token-by-token basis.
- In renew(), renew the credential by notifying the bonded service that the token is to be renewed.
- In cancel(), notify the bonded service that the token should be cancelled.
Finally, declare the class in the resource META-INF/services/org.apache.hadoop.security.token.TokenRenewer.
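Here is a minimal sketch of such a renewer. The token kind and the stubbed-out service calls are illustrative assumptions: the real renewal and cancellation calls depend entirely on the service the token is bonded to.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenRenewer;

public class ExampleTokenRenewer extends TokenRenewer {

  // Hypothetical token kind; must match the TokenIdentifier's kind field.
  private static final Text KIND = new Text("EXAMPLE_TOKEN");

  @Override
  public boolean handleKind(Text kind) {
    return KIND.equals(kind);
  }

  @Override
  public boolean isManaged(Token<?> token) throws IOException {
    return true;  // all tokens of this kind may be renewed/cancelled
  }

  @Override
  public long renew(Token<?> token, Configuration conf)
      throws IOException, InterruptedException {
    // A real renewer opens a client connection (RPC or REST) to the bonded
    // service, asks it to extend the token's lifetime, and returns the new
    // expiry time the service reports.
    return renewWithService(token, conf);
  }

  @Override
  public void cancel(Token<?> token, Configuration conf)
      throws IOException, InterruptedException {
    // Likewise, tell the bonded service to revoke the token.
    cancelWithService(token, conf);
  }

  // Placeholders for the service-specific calls.
  private long renewWithService(Token<?> token, Configuration conf)
      throws IOException {
    throw new UnsupportedOperationException("wire this up to the token's service");
  }

  private void cancelWithService(Token<?> token, Configuration conf)
      throws IOException {
    throw new UnsupportedOperationException("wire this up to the token's service");
  }
}
```

The single line in the META-INF/services/org.apache.hadoop.security.token.TokenRenewer resource would then be the fully-qualified name of ExampleTokenRenewer.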
DTs can be serialized; that is done when issued/renewed.
When making requests over Hadoop RPC, you don't need to include the DT; simply include the hash to indicate that you have it (Revisit: what does this mean?).
Every token which is deserialized at the far end must have a service declaration in META-INF/services/org.apache.hadoop.security.token.TokenIdentifier. Here is the one in hadoop-common as of Hadoop 3.2:

```
org.apache.hadoop.crypto.key.kms.KMSDelegationToken$KMSDelegationTokenIdentifier
```
When Hadoop is looking for an implementation of a token, it enumerates all available token identifiers registered this way through the Java ServiceLoader.load(class) API. Token.decodeIdentifier() is then invoked to extract the identifier information.
Important: all Java classes directly invoked in a token implementation class must be on the classpath. If this condition is not met, the identifier cannot be loaded.
YARN applications depend on delegation tokens to gain access to cluster resources and data on behalf of the principal. It is the task of the client-side launcher code to collect the tokens needed, and pass them to the launch context used to launch the Application Master.
Proxy users are a feature which was included in the Hadoop security model for services such as Oozie: a service which needs to be able to execute work on behalf of a user.
Because the time at which Oozie would execute future work cannot be determined, delegation tokens cannot be used to authenticate requests issued by Oozie on behalf of a user. Kerberos keytabs are a possible solution here, but it would require every user submitting work to Oozie to have a keytab and to pass it to Oozie.
See Proxy user - Superusers Acting On Behalf Of Other Users.
There is also a clarification in HADOOP-15758:
You cannot mimic a proxy user. A proxy user is a specific construct; there is no substitute. A proxy user is a ugi that lacks its own authentication credentials, thus it explicitly encapsulates a "real" ugi that does contain kerberos credentials. The real ugi's user must be specifically configured on the target service to allow impersonation of the proxied user.
There is no correlation between a proxy user and a ticket cache. The real ugi can supply ticket cache or keytab based credentials. All that matters is that the real user has credentials.
There's also a special bit of advice for service implementors:
An impersonating service should never, ever be ticket-cache based. Use a keytab. Otherwise you may be very surprised when your proxy user service morphs into a different user if/when someone/something does a kinit as a different user.
- In HDFS, any compromised DataNode can create block tokens.
- Possession of the tokens is sufficient to impersonate a user. This means it is critical to transport tokens over the network in an encrypted form. Typically, this is done by SASL-encrypting the Hadoop IPC channel.
This is relevant for people implementing or using the FileSystem APIs.
- Any FileSystem instance may issue zero or more delegation tokens when asked.
- These can be used to grant time-limited access to the filesystem in different processes.
- The reason more than one can be issued is to support aggregate filesystems where multiple filesystems may eventually be invoked (e.g. ViewFS), or when tokens for multiple services need to be included (e.g. KMS access tokens).
- Clients must invoke FileSystem.addDelegationTokens() to get an array of tokens which may then be written to persistent storage, marshalled, and loaded later.
- Each filesystem must have a unique Canonical Name —a URI which will refer only to the FS instance to which the token matches.
- When working with YARN, container and application launch contexts may include a list of tokens. The FS tokens should be included here along with any needed to talk to the YARN RM and, for Hadoop 3+, docker tokens. See org.apache.hadoop.yarn.applications.distributedshell.Client for an example of this.
In the remote application, these tokens can be unmarshalled from the byte array, the tokens for specific services retrieved, and then used for authentication.
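A sketch of what this can look like inside a launched container or Application Master follows; it assumes the Node Manager has, as usual, made the marshalled credentials available to the process so that they show up in the current UGI.

```java
import java.io.IOException;

import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class ContainerTokens {

  /** List the tokens which were passed to this container. */
  public static void listTokens() throws IOException {
    Credentials credentials =
        UserGroupInformation.getCurrentUser().getCredentials();
    for (Token<?> token : credentials.getAllTokens()) {
      // The kind identifies the TokenIdentifier class; the service field
      // identifies which endpoint the token is for.
      System.out.println(token.getKind() + " for " + token.getService());
    }
  }
}
```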
A TokenIdentifier is a Writable object; it can contain arbitrary data. For this reason it can be used to propagate secrets from a client to deployed YARN applications and containers, even when there is no actual service which reads the token identifiers. All that is required is that the far end implements the token identifier and declares it as such.