How to access webhdfs via knox

Ahmad Nouri edited this page Sep 24, 2019 · 13 revisions

The Apache Knox Gateway is a service that provides a single point of authentication and access for Apache Hadoop services in a cluster.
https://knox.apache.org/books/knox-1-4-0/user-guide.html

The streamsx.hdfs toolkit uses Knox authentication to access HDFS through the WebHDFS REST API.
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

Benefits of Knox authentication

  • It is secure.
  • It is not necessary to install any Hadoop client.
  • It is not necessary to copy any XML configuration file to the client server.
  • It is not necessary to set any environment variable like HADOOP_HOME.
  • It provides a single access point for all REST and HTTP interactions with the Apache Hadoop cluster.
  • Kerberos encapsulation. Kerberos requires a client-side library and complex client-side configuration. Because Knox encapsulates Kerberos, it is not necessary to copy any keytab or configuration files to the client server.

https://www.cloudera.com/products/open-source/apache-hadoop/apache-knox.html

KNOX Configuration

Before you start with the SPL demo application, configure and start the Knox service on your Hadoop server.

Add new LDAP user

Log in as root on your Hadoop server and change to the Knox home directory ({GATEWAY_HOME}). In this example, {GATEWAY_HOME} is /usr/hdp/current/knox-server/:

cd /usr/hdp/current/knox-server/

Back up the LDAP user file.

cp ./conf/users.ldif ./conf/users.ldif1

Edit the LDAP user file (./conf/users.ldif):
For security reasons, remove the predefined demo users such as guest, sam, and tom from the LDIF file.
Then add your own user and password to the users.ldif file.

Here is an LDIF example for the user hdfs.

# Please replace with site specific values
dn: dc=hadoop,dc=apache,dc=org
objectclass: organization
objectclass: dcObject
o: Hadoop
dc: hadoop

# Entry for a sample people container
# Please replace with site specific values
dn: ou=people,dc=hadoop,dc=apache,dc=org
objectclass:top
objectclass:organizationalUnit
ou: people

dn: uid=hdfs,ou=people,dc=hadoop,dc=apache,dc=org
objectclass:top
objectclass:person
objectclass:organizationalPerson
objectclass:inetOrgPerson
cn: Hdfs
sn: Hdfs
uid: hdfs
userPassword:hdfs-password
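
If several users have to be added, the user entry above can be generated instead of typed by hand. The following is a minimal Python sketch, not part of Knox or the toolkit: the helper name ldif_user_entry is hypothetical, the base DN is the one from the example above, and it assumes plain ASCII values (LDIF requires base64 encoding for values with special characters, which this sketch does not handle).

```python
def ldif_user_entry(uid, password,
                    base_dn="ou=people,dc=hadoop,dc=apache,dc=org"):
    """Return one LDIF user entry shaped like the hdfs example above.

    Hypothetical helper; assumes uid and password are plain ASCII.
    """
    display_name = uid.capitalize()  # "hdfs" -> "Hdfs", as in the example
    lines = [
        f"dn: uid={uid},{base_dn}",
        "objectclass:top",
        "objectclass:person",
        "objectclass:organizationalPerson",
        "objectclass:inetOrgPerson",
        f"cn: {display_name}",
        f"sn: {display_name}",
        f"uid: {uid}",
        f"userPassword:{password}",
    ]
    return "\n".join(lines) + "\n"


print(ldif_user_entry("hdfs", "hdfs-password"))
```

The printed entry can be appended to users.ldif, separated from the previous entry by an empty line.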

Restart LDAP

cd /usr/hdp/current/knox-server/
bin/ldap.sh stop
Stopping LDAP with PID 3705 succeeded.
bin/ldap.sh start
Starting LDAP succeeded with PID 28870.
bin/ldap.sh status
LDAP is running with PID 28870.

Kerberos configuration

If your Hadoop cluster is kerberized, check if the following configuration parameters
are set in the hdfs-site.xml file:

<name>dfs.web.authentication.kerberos.keytab</name>
<name>dfs.web.authentication.kerberos.principal</name>
<name>dfs.webhdfs.enabled</name>

Then set the value of the following property in your core-site.xml file to *:

<property>
   <name>hadoop.proxyuser.knox.groups</name>
   <value>*</value>
</property>
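
The check for the three hdfs-site.xml parameters can be scripted. The sketch below uses only the Python standard library; the function name missing_properties and the sample XML are illustrative assumptions, and it only reports whether a property is present, not whether its value is correct.

```python
import xml.etree.ElementTree as ET

# Property names taken from the list above.
REQUIRED = (
    "dfs.web.authentication.kerberos.keytab",
    "dfs.web.authentication.kerberos.principal",
    "dfs.webhdfs.enabled",
)


def missing_properties(site_xml, required=REQUIRED):
    """Return the required property names that do not appear in the XML text."""
    root = ET.fromstring(site_xml)
    present = {
        (prop.findtext("name") or "").strip()
        for prop in root.iter("property")
    }
    return [name for name in required if name not in present]


# Illustrative fragment, not a real cluster configuration.
SAMPLE = """<configuration>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>"""

print(missing_properties(SAMPLE))
```

For a real file, read its content first (for example with open(path).read()) and pass the text to the function. The location of hdfs-site.xml depends on your installation.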

Start the Knox service

su -l knox -c "/usr/hdp/current/knox-server/bin/gateway.sh start"

It is also possible to start and stop the Knox service and the demo LDAP from the Ambari web interface.

Test the Knox access with curl from your client server:

curl -k -u hdfs:hdfs-password "https://<your-hadoop-server>:8443/gateway/default/webhdfs/v1/user/?op=LISTSTATUS"

If your configuration is correct, this curl command returns a JSON listing of the entries in /user, including the names of all directories.
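
WebHDFS answers LISTSTATUS with a JSON document whose entries are stored under FileStatuses / FileStatus, each with a pathSuffix and a type. As a rough sketch, the directory names can be extracted from that response as follows. The function name directory_names and the sample body are illustrative; the sample is shortened to the two fields used here and is not real cluster output.

```python
import json


def directory_names(liststatus_body):
    """Return the names of all DIRECTORY entries in a LISTSTATUS response."""
    statuses = json.loads(liststatus_body)["FileStatuses"]["FileStatus"]
    return [s["pathSuffix"] for s in statuses if s["type"] == "DIRECTORY"]


# Shortened, made-up response body for illustration only.
SAMPLE_BODY = json.dumps({
    "FileStatuses": {"FileStatus": [
        {"pathSuffix": "hdfs", "type": "DIRECTORY"},
        {"pathSuffix": "notes.txt", "type": "FILE"},
        {"pathSuffix": "knox", "type": "DIRECTORY"},
    ]}
})

print(directory_names(SAMPLE_BODY))  # ['hdfs', 'knox']
```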

You can also create a new directory with this curl command.

curl -k -X PUT -u hdfs:hdfs-password "https://<your-hadoop-server>:8443/gateway/default/webhdfs/v1/user/hdfs/foo?op=MKDIRS&permission=777"
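
The same MKDIRS call can be prepared from a script. The sketch below only builds the request object with the Python standard library and does not send it. The helper name mkdirs_request and the host hadoop.example.com are placeholders; port 8443 and the topology name default are taken from the curl commands above.

```python
import base64
import urllib.parse
import urllib.request


def mkdirs_request(host, path, user, password, permission="777",
                   port=8443, topology="default"):
    """Build (but do not send) a WebHDFS MKDIRS request routed through Knox."""
    # urlencode joins the parameters with "&", so no shell quoting is involved.
    query = urllib.parse.urlencode({"op": "MKDIRS", "permission": permission})
    url = (f"https://{host}:{port}/gateway/{topology}/webhdfs/v1"
           f"{urllib.parse.quote(path)}?{query}")
    # HTTP Basic authentication, the equivalent of curl -u user:password.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="PUT")
    request.add_header("Authorization", f"Basic {token}")
    return request


req = mkdirs_request("hadoop.example.com", "/user/hdfs/foo",
                     "hdfs", "hdfs-password")
print(req.get_method(), req.full_url)
```

Sending the request with urllib.request.urlopen(req) verifies the server certificate by default, unlike curl -k. If Knox uses a self-signed certificate, an SSL context that trusts it has to be passed in.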

Create a test directory for the hdfs user on HDFS

Log in as root on your Hadoop server, switch to the hdfs user, and create a test directory:

su - hdfs
# -p creates missing parent directories and does not fail if a directory already exists
hadoop fs -mkdir -p /user/hdfs/out

SPL sample

Starting with version 5.0.0, the streamsx.hdfs toolkit supports a new parameter, credentials.
This optional parameter specifies a JSON string that contains the HDFS credentials as key/value pairs for user, password, and webhdfs.
The parameter can also be specified in an application configuration.
https://github.com/IBMStreams/streamsx.hdfs/releases/tag/v5.0.0

The JSON string must have the following format:

{
    'user'     : 'user',
    'password' : 'hdfs-password',
    'webhdfs'  : 'webhdfs://ip-address:8443'
}
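
Building this string by hand is error-prone because of the quoting. A small sketch in Python can generate it. The helper name hdfs_credentials and the host name are placeholders. Note that json.dumps emits standard double-quoted JSON, whereas the examples on this page use single quotes; that the toolkit accepts the double-quoted form is an assumption you should verify against your toolkit version.

```python
import json


def hdfs_credentials(user, password, host, port=8443):
    """Return a credentials JSON string with the three keys described above."""
    return json.dumps({
        "user": user,
        "password": password,
        "webhdfs": f"webhdfs://{host}:{port}",
    })


creds = hdfs_credentials("hdfs", "hdfs-password", "hadoop.example.com")
print(creds)
```

json.dumps also escapes quotes and backslashes in the password, which a hand-written string can easily get wrong.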

IAE credentials

The IBM Analytics Engine (IAE) on cloud is based on Hadoop 3.1 and supports WebHDFS and Knox.
https://cloud.ibm.com/catalog/services/analytics-engine?customCreatePageKey=catalog-custom-key-cfcuq9bj

It is also possible to access IAE via credentials.
The IAE user name is clsadmin, and the password is the password of the IAE cluster. Reset the IAE password, copy the newly created password, and set it as password in the SPL credentials.
How to reset the IAE password is described in:
https://cloud.ibm.com/docs/services/AnalyticsEngine?topic=AnalyticsEngine-retrieve-cluster-credentials
The webhdfs information (hostname:port) is part of the IAE credentials.
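
If the webhdfs information is available as a full HTTPS endpoint URL, the hostname:port part can be turned into the webhdfs value for the SPL credentials. The sketch below assumes such a URL as input; the function name webhdfs_uri and the host name are hypothetical, and the fallback port 8443 is taken from the examples on this page rather than from the IAE documentation.

```python
from urllib.parse import urlsplit


def webhdfs_uri(endpoint_url, default_port=8443):
    """Reduce an endpoint URL to the form webhdfs://hostname:port."""
    parts = urlsplit(endpoint_url)
    if parts.hostname is None:
        raise ValueError(f"no host name found in {endpoint_url!r}")
    port = parts.port or default_port  # assumption: 8443 when no port is given
    return f"webhdfs://{parts.hostname}:{port}"


# Placeholder URL, not a real IAE endpoint.
print(webhdfs_uri("https://iae-cluster.example.com:8443/gateway/default/webhdfs/v1/"))
```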

The following SPL sample uses credentials to access webHDFS.
Before you build the SPL application, replace the credentials properties with the credentials of your Hadoop server.

/*******************************************************************************
* Copyright (C) 2019, International Business Machines Corporation
* All Rights Reserved
*******************************************************************************/

/**
 * The webHDFS sample demonstrates how to access WebHDFS with a Knox user and password.
 * A Beacon operator generates some test lines.
 * HdfsFileSink writes every 10 lines to a new file in the /user/hdfs/out directory.
 * HdfsDirScanOut scans the given directory (out) on HDFS, which is located in the
 * user's home directory (/user/hdfs/out), and returns the file names.
 * HdfsFileSource gets the file names via its input stream from HdfsDirScanOut, reads the files, and returns their lines.
 * CopyFromHdfsToLocal copies all incoming files (/user/hdfs/out/output-xx.txt) from its input port into
 * a local directory (data).
 * The Print operators are Custom operators that print the output of the HDFS operators.
 */

namespace application ;

use com.ibm.streamsx.hdfs::* ;


composite webHDFS
{
  param
    expression<rstring> $credentials : getSubmissionTimeValue("credentials", 
    	"{ 'user': 'hdfs', 'webhdfs': 'webhdfs://<yourhadoop-server>:8443', 'password': 'hdfspassword'}" );

  graph

    // generates lines
    stream<rstring line> CreateLines = Beacon()
    {
        param
            initDelay : 1.0 ;
            iterations : 100u ;
        output
            CreateLines : line = (rstring)IterationCount() + ": This line will be written into a HDFS file." ;
    }

    // HdfsFileSink writes every 10 lines from CreateLines in a new file in /user/hdfs/out directory
    stream<rstring fileName, uint64 size> HdfsFileSink = HDFS2FileSink(CreateLines)
    {
      param
        credentials : $credentials ;
        file : "out/output-%FILENUM.txt" ;
        tuplesPerFile : 10l ;
    }

    //print out the file names and the size of file
    () as PrintHdfsFileSink = Custom(HdfsFileSink)
    {
      logic
        onTuple HdfsFileSink :
        {
          printStringLn("HdfsFileSink fileName , size : " +(rstring) HdfsFileSink) ;
        }
    }

    // HdfsDirScanOut scans the given directory on HDFS; the default is ".", which is the user's home directory
    stream<rstring hdfsFile> HdfsDirScanOut = HDFS2DirectoryScan()
    {
      param
        initDelay : 10.0 ;
        directory : "out" ;
        credentials : $credentials ;
        strictMode : false ;
    }


    //print out the names of each file found in the directory
    () as PrintHdfsDirScanOut = Custom(HdfsDirScanOut)
    {
      logic
        onTuple HdfsDirScanOut :
        {
          printStringLn("HdfsDirScanOut fileName  : " +(rstring) HdfsDirScanOut) ;
        }
    }

    // HdfsFileSource reads files and returns lines into output port
    // It uses the file name from directory scan to read the file
    stream<rstring lines> HdfsFileSource = HDFS2FileSource(HdfsDirScanOut)
    {
      param
        credentials : $credentials ;
    }

    //print out each line read from the files
    () as PrintHdfsFileSource = Custom(HdfsFileSource)
    {
      logic
        onTuple HdfsFileSource :
        {
          printStringLn("HdfsFileSource   line : " + lines) ;
        }
    }

    // copies all incoming files (/user/hdfs/out/output-xx.txt) from the input port into the local data directory.
    stream<rstring message, uint64 elapsedTime> CopyFromHdfsToLocal = HDFS2FileCopy(HdfsDirScanOut)
    {
        param
            hdfsFileAttrName : "hdfsFile" ;
            localFile : "./" ;
            deleteSourceFile : false ;
            overwriteDestinationFile : true ;
            direction : copyToLocalFile ;
            credentials : $credentials ;
    }

    //print out the message and the elapsed time  
    () as PrintCopyFromHdfsToLocal = Custom(CopyFromHdfsToLocal)
    {
      logic
        onTuple CopyFromHdfsToLocal :
        {
          printStringLn("CopyFromHdfsToLocal message,  elapsedTime : " +(rstring) CopyFromHdfsToLocal) ;
        }
    }

}