How to access webhdfs via knox
The Apache Knox Gateway is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster.
https://knox.apache.org/books/knox-1-4-0/user-guide.html
The streamsx.hdfs toolkit uses knox authentication to access HDFS via the REST API interface of webHDFS.
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
Before you start with the SPL demo application, you have to configure and start the knox service on your Hadoop server.
Log in as root on your Hadoop server and perform the following steps:
cd {GATEWAY_HOME}
cd /usr/hdp/current/knox-server/
Make a backup of the LDAP user file.
cp ./conf/users.ldif ./conf/users.ldif1
Edit the LDAP user file:
For security reasons, remove all preconfigured demo users such as guest, sam and tom from the list of users in the ldif file.
Then add your own user and password to the users.ldif file.
Here is an LDIF example for the user hdfs.
# Please replace with site specific values
dn: dc=hadoop,dc=apache,dc=org
objectclass: organization
objectclass: dcObject
o: Hadoop
dc: hadoop
# Entry for a sample people container
# Please replace with site specific values
dn: ou=people,dc=hadoop,dc=apache,dc=org
objectclass:top
objectclass:organizationalUnit
ou: people
dn: uid=hdfs,ou=people,dc=hadoop,dc=apache,dc=org
objectclass:top
objectclass:person
objectclass:organizationalPerson
objectclass:inetOrgPerson
cn: Hdfs
sn: Hdfs
uid: hdfs
userPassword:hdfs-password
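If several users have to be added, the user entry above can be generated instead of typed by hand. The following is a minimal Python sketch under stated assumptions: the helper name `ldif_user_entry` is hypothetical, the base DN is the demo value from the example, and `cn`/`sn` are simply derived from the user id as in the hdfs entry. Replace all values with site specific ones.

```python
def ldif_user_entry(uid: str, password: str,
                    base_dn: str = "dc=hadoop,dc=apache,dc=org") -> str:
    """Return an LDIF entry for one user below ou=people of base_dn.

    Hypothetical helper: it mirrors the layout of the hdfs example entry.
    """
    display_name = uid.capitalize()  # e.g. "hdfs" -> "Hdfs", as in the example
    lines = [
        f"dn: uid={uid},ou=people,{base_dn}",
        "objectclass:top",
        "objectclass:person",
        "objectclass:organizationalPerson",
        "objectclass:inetOrgPerson",
        f"cn: {display_name}",
        f"sn: {display_name}",
        f"uid: {uid}",
        f"userPassword:{password}",
    ]
    return "\n".join(lines) + "\n"
```

The returned text can be appended to conf/users.ldif after the people container entry.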
Restart the demo LDAP server so that the changes in users.ldif take effect:
cd /usr/hdp/current/knox-server/
bin/ldap.sh stop
Stopping LDAP with PID 3705 succeeded.
bin/ldap.sh start
Starting LDAP succeeded with PID 28870.
bin/ldap.sh status
LDAP is running with PID 28870.
If your Hadoop cluster is kerberized, check that the following configuration parameters are set in the hdfs-site.xml file:
<name>dfs.web.authentication.kerberos.keytab</name>
<name>dfs.web.authentication.kerberos.principal</name>
<name>dfs.webhdfs.enabled</name>
Then change the value of the following property in your core-site.xml file to *:
<property>
<name>hadoop.proxyuser.knox.groups</name>
<value>*</value>
</property>
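The check of these properties can also be scripted. The following Python sketch uses only the standard library; it assumes that the configuration files follow the usual Hadoop layout of `<property>` elements with `<name>` and `<value>` children. The function names are hypothetical, and the location of the files is site specific.

```python
import xml.etree.ElementTree as ET

# Property names taken from the text above.
REQUIRED_HDFS_SITE = (
    "dfs.web.authentication.kerberos.keytab",
    "dfs.web.authentication.kerberos.principal",
    "dfs.webhdfs.enabled",
)


def read_properties(xml_text: str) -> dict:
    """Parse the text of a Hadoop *-site.xml file into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            properties[name.strip()] = (prop.findtext("value") or "").strip()
    return properties


def missing_properties(xml_text: str, required=REQUIRED_HDFS_SITE) -> list:
    """Return the required property names that are absent or have no value."""
    properties = read_properties(xml_text)
    return [name for name in required if not properties.get(name)]
```

For example, `missing_properties(open("hdfs-site.xml").read())` returns an empty list when all three parameters are set, and `read_properties(...)["hadoop.proxyuser.knox.groups"]` can be compared with `*` for core-site.xml.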
It is also possible to start and stop the knox service and the Demo LDAP from the Ambari web interface.
Test the access to webHDFS via knox with curl:
curl -k -u hdfs:hdfs-password "https://<your-hadoop-server>:8443/gateway/default/webhdfs/v1/user/?op=LISTSTATUS"
If your configuration is correct, this curl command returns the names of all directories in /user.
You can also create a new directory with this curl command:
curl -k -X PUT -u hdfs:hdfs-password "https://<your-hadoop-server>:8443/gateway/default/webhdfs/v1/user/hdfs/foo?op=MKDIRS&permission=777"
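The URLs used by the curl commands follow a fixed pattern: gateway host and port, the knox topology (default), the webHDFS prefix, the HDFS path and the operation. The following Python sketch shows how such a URL can be assembled and how the entry names can be read from the JSON body of a LISTSTATUS response. The function names and the host `hadoop.example.com` are hypothetical; the response layout (`FileStatuses` / `FileStatus` / `pathSuffix`) is the one described in the webHDFS documentation linked above.

```python
import json
from urllib.parse import quote, urlencode


def knox_webhdfs_url(host: str, hdfs_path: str, op: str,
                     port: int = 8443, topology: str = "default",
                     **params: str) -> str:
    """Build a webHDFS REST URL that is routed through the knox gateway."""
    if not hdfs_path.startswith("/"):
        hdfs_path = "/" + hdfs_path
    # The operation comes first, extra parameters (e.g. permission) follow.
    query = urlencode({"op": op, **params})
    return (f"https://{host}:{port}/gateway/{topology}"
            f"/webhdfs/v1{quote(hdfs_path)}?{query}")


def entry_names(liststatus_body: str) -> list:
    """Extract the entry names from the JSON body of a LISTSTATUS response."""
    statuses = json.loads(liststatus_body)["FileStatuses"]["FileStatus"]
    return [status["pathSuffix"] for status in statuses]
```

`knox_webhdfs_url("hadoop.example.com", "/user/hdfs/foo", "MKDIRS", permission="777")` yields the same URL shape as the second curl command. The sketch deliberately leaves out the HTTP request itself; any client that supports basic authentication can be used with the generated URL.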
Log in as root on your Hadoop server, change the user to hdfs and create a test directory:
su - hdfs
hadoop fs -mkdir /user/hdfs
hadoop fs -mkdir /user/hdfs/out
Since version 5.0.0, the streamsx.hdfs toolkit supports a new parameter credentials.
This optional parameter specifies the JSON string that contains the hdfs credentials key/value pairs for user, password and webhdfs.
This parameter can also be specified in an application configuration.
https://github.com/IBMStreams/streamsx.hdfs/releases/tag/v5.0.0
The JSON string must have the following format:
{
'user' : 'user',
'password' : 'hdfs-password',
'webhdfs' : 'webhdfs://ip-address:8443'
}
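To avoid quoting mistakes, the credentials string can be generated rather than written by hand. The following Python sketch is only an illustration: the function names are hypothetical, and `json.dumps` emits standard JSON with double quotes, whereas the example above uses single quotes. It is an assumption of this sketch that the toolkit accepts standard JSON as well.

```python
import json

# The three keys named in the description of the credentials parameter.
REQUIRED_KEYS = ("user", "password", "webhdfs")


def make_credentials(user: str, password: str, host: str,
                     port: int = 8443) -> str:
    """Return the credentials JSON string for the given knox endpoint."""
    return json.dumps({
        "user": user,
        "password": password,
        "webhdfs": f"webhdfs://{host}:{port}",
    })


def missing_keys(credentials_json: str) -> list:
    """Return the required keys that are absent or empty in the string."""
    data = json.loads(credentials_json)
    return [key for key in REQUIRED_KEYS if not data.get(key)]
```

The result of `make_credentials("hdfs", "hdfs-password", "<your-hadoop-server>")` can be passed as the value of the credentials parameter or stored in an application configuration.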
The IBM Analytics Engine (IAE) on IBM Cloud is based on Hadoop 3.1 and supports webHDFS and knox.
https://cloud.ibm.com/catalog/services/analytics-engine?customCreatePageKey=catalog-custom-key-cfcuq9bj
It is also possible to access IAE via credentials.
The IAE user name is clsadmin, and the value of password has to be set to the IAE password.
Reset the IAE password, copy the newly created password and set it as password in the SPL credentials.
How to reset the IAE password is described in:
https://cloud.ibm.com/docs/services/AnalyticsEngine?topic=AnalyticsEngine-retrieve-cluster-credentials
The webhdfs information (hostname:port) is a part of IAE credentials.
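As an illustration, the following Python sketch derives the SPL credentials string from a webHDFS endpoint URL copied from the IAE credentials. The exact layout of the IAE credentials is not shown on this page, so the sketch takes the endpoint URL as plain input; the function names and the example host are hypothetical, and the fallback port 8443 is an assumption based on the knox port used above.

```python
import json
from urllib.parse import urlparse


def webhdfs_from_endpoint(endpoint_url: str, default_port: int = 8443) -> str:
    """Reduce an https endpoint URL to the webhdfs://hostname:port form."""
    parsed = urlparse(endpoint_url)
    port = parsed.port or default_port  # assumption: knox listens on 8443
    return f"webhdfs://{parsed.hostname}:{port}"


def iae_credentials(endpoint_url: str, password: str) -> str:
    """Build the credentials JSON string for the IAE user clsadmin."""
    return json.dumps({
        "user": "clsadmin",
        "password": password,
        "webhdfs": webhdfs_from_endpoint(endpoint_url),
    })
```

For a hypothetical endpoint `https://iae-host.example.com:8443/gateway/default/webhdfs/v1/`, the webhdfs value becomes `webhdfs://iae-host.example.com:8443`.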
The following SPL sample uses credentials to access webHDFS.
Before you build the SPL application, you have to replace the credentials properties with the credentials of your Hadoop server.
/*******************************************************************************
* Copyright (C) 2019, International Business Machines Corporation
* All Rights Reserved
*******************************************************************************/
namespace application ;
use com.ibm.streamsx.hdfs::* ;
/**
* The webHDFS sample demonstrates how to access webhdfs via knox user and password.
* A Beacon operator generates some test lines.
* HdfsFileSink writes every 10 lines into a new file in the /user/hdfs/out directory.
* HdfsDirScanOut scans the given directory (out) in HDFS, which is located in the
* user's home directory (/user/hdfs/out), and returns the file names.
* HdfsFileSource gets the file names via its input stream from HdfsDirScanOut, reads the files and returns their lines.
* CopyFromHdfsToLocal copies all incoming files (/user/hdfs/out/output-xx.txt) from its input port into
* a local directory (data).
* The Print operators are Custom operators that print the output of the HDFS operators.
*/
composite webHDFS
{
param
expression<rstring> $credentials : getSubmissionTimeValue("credentials",
"{ 'user': 'hdfs', 'webhdfs': 'webhdfs://<yourhadoop-server>:8443', 'password': 'hdfspassword'}" );
graph
// generates lines
stream<rstring line> CreateLines = Beacon()
{
param
initDelay : 1.0 ;
iterations : 100u ;
output
CreateLines : line = (rstring)IterationCount() + ": This line will be written into a HDFS file." ;
}
// HdfsFileSink writes every 10 lines from CreateLines in a new file in /user/hdfs/out directory
stream<rstring fileName, uint64 size> HdfsFileSink = HDFS2FileSink(CreateLines)
{
param
credentials : $credentials ;
file : "out/output-%FILENUM.txt" ;
tuplesPerFile : 10l ;
}
//print out the file names and the size of file
() as PrintHdfsFileSink = Custom(HdfsFileSink)
{
logic
onTuple HdfsFileSink :
{
printStringLn("HdfsFileSink fileName , size : " +(rstring) HdfsFileSink) ;
}
}
// HdfsDirScanOut scans the given directory from HDFS, default to . which is the user's home directory
stream<rstring hdfsFile> HdfsDirScanOut = HDFS2DirectoryScan()
{
param
initDelay : 10.0 ;
directory : "out" ;
credentials : $credentials ;
strictMode : false ;
}
//print out the names of each file found in the directory
() as PrintHdfsDirScanOut = Custom(HdfsDirScanOut)
{
logic
onTuple HdfsDirScanOut :
{
printStringLn("HdfsDirScanOut fileName : " +(rstring) HdfsDirScanOut) ;
}
}
// HdfsFileSource reads files and returns lines into output port
// It uses the file name from directory scan to read the file
stream<rstring lines> HdfsFileSource = HDFS2FileSource(HdfsDirScanOut)
{
param
credentials : $credentials ;
}
//print out the lines read from the files
() as PrintHdfsFileSource = Custom(HdfsFileSource)
{
logic
onTuple HdfsFileSource :
{
printStringLn("HdfsFileSource line : " + lines) ;
}
}
// copies all incoming files from input port (/user/hdfs/out/output-xx.txt) into the local data directory.
stream<rstring message, uint64 elapsedTime> CopyFromHdfsToLocal = HDFS2FileCopy(HdfsDirScanOut)
{
param
hdfsFileAttrName : "hdfsFile" ;
localFile : "./" ;
deleteSourceFile : false ;
overwriteDestinationFile : true ;
direction : copyToLocalFile ;
credentials : $credentials ;
}
//print out the message and the elapsed time
() as PrintCopyFromHdfsToLocal = Custom(CopyFromHdfsToLocal)
{
logic
onTuple CopyFromHdfsToLocal :
{
printStringLn("CopyFromHdfsToLocal message, elapsedTime : " +(rstring) CopyFromHdfsToLocal) ;
}
}
}