The Apache Hive data warehouse software facilitates reading, writing, and managing datasets residing in distributed storage using SQL.
First download the Hive JDBC driver.
Now register the driver in EXAOperation:
- Click "Software"
- Switch to tab "JDBC Drivers"
- Click "Browse..."
- Select JDBC driver file
- Click "Upload"
- Click "Add"
- In the dialog "Add EXACluster JDBC driver", configure the JDBC driver (see below)
You need to specify the following settings when adding the JDBC driver via EXAOperation.
| Parameter | Value                                |
|-----------|--------------------------------------|
| Name      | HIVE                                 |
| Main      | com.cloudera.hive.jdbc41.HS2Driver   |
| Prefix    | jdbc:hive2:                          |
| Files     | HiveJDBC41.jar                       |
- Create a bucket in BucketFS
- Upload the driver to BucketFS
This step is necessary since the UDF container in which the adapter runs has no access to the JDBC drivers installed via EXAOperation, but it can access BucketFS.
Also upload the latest available release of the Virtual Schema JDBC Adapter to BucketFS.
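As an illustration, files can be uploaded to a bucket over HTTP. The sketch below is an assumption-laden example, not the only way to do it: it assumes a BucketFS service reachable on port 2580 and a bucket named `bucket1` with write password `writepw`; host, port, bucket name, password, and the `jars` folder are placeholders you need to adapt to your setup.

```sh
# Hypothetical upload of the adapter JAR and the Hive JDBC driver to BucketFS.
# "w" is the fixed BucketFS write user; host, port, bucket name and password are placeholders.
curl -X PUT -T virtualschema-jdbc-adapter-dist-2.2.0.jar \
  "http://w:writepw@<Exasol host>:2580/bucket1/jars/virtualschema-jdbc-adapter-dist-2.2.0.jar"
curl -X PUT -T HiveJDBC41.jar \
  "http://w:writepw@<Exasol host>:2580/bucket1/jars/HiveJDBC41.jar"
```

With these placeholder names, the files would be visible inside UDFs under `/buckets/<BFS service>/bucket1/jars/`, which is the path pattern the adapter script below references.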
Then create a schema to hold the adapter script.
```sql
CREATE SCHEMA ADAPTER;
```
The SQL statement below creates the adapter script, defines the Java class that serves as the entry point, and tells the UDF framework where to find the libraries (JAR files) for the Virtual Schema and the database driver.
```sql
CREATE OR REPLACE JAVA ADAPTER SCRIPT ADAPTER.JDBC_ADAPTER AS
  %scriptclass com.exasol.adapter.RequestDispatcher;
  %jar /buckets/<BFS service>/<bucket>/jars/virtualschema-jdbc-adapter-dist-2.2.0.jar;
  %jar /buckets/<BFS service>/<bucket>/jars/HiveJDBC41.jar;
/
```
Define the connection to Hive as shown below.
```sql
CREATE OR REPLACE CONNECTION HIVE_CONNECTION
TO 'jdbc:hive2://<Hive host>:<port>'
USER '<user>'
IDENTIFIED BY '<password>';
```
Below you see how a Hive Virtual Schema is created. Please note that you have to provide the name of the Hive schema you want to map.
```sql
CREATE VIRTUAL SCHEMA <virtual schema name>
USING ADAPTER.JDBC_ADAPTER
WITH
  SQL_DIALECT     = 'HIVE'
  CONNECTION_NAME = 'HIVE_CONNECTION'
  SCHEMA_NAME     = '<schema name>';
```
Connecting to a Kerberos secured Hive service only differs in one aspect: you use a `CONNECTION` object which contains all the relevant information for the Kerberos authentication. This section describes how Kerberos authentication works and how to create such a `CONNECTION`.
Both the adapter script and the internally used `IMPORT FROM JDBC` statement support Kerberos authentication. They detect that the connection is a Kerberos connection by a special prefix in the `IDENTIFIED BY` field. In that case the authentication is performed using a Kerberos keytab and Kerberos config file (via the JAAS Java API).
The `CONNECTION` object stores all relevant information and files in its fields:

- The `TO` field contains the JDBC connection string
- The `USER` field contains the Kerberos principal
- The `IDENTIFIED BY` field contains the Kerberos configuration file and keytab file (base64 encoded) along with the internal prefix `ExaAuthType=Kerberos;` to identify the `CONNECTION` as a Kerberos `CONNECTION`.
In order to simplify the creation of Kerberos `CONNECTION` objects, the `create_kerberos_conn.py` Python script has been provided. The script requires 5 arguments:

- `CONNECTION` name (arbitrary name for the new `CONNECTION`)
- Kerberos principal for Hadoop (i.e., Hadoop user)
- Kerberos configuration file path (e.g., `krb5.conf`)
- Kerberos keytab file path, which contains keys for the Kerberos principal
- JDBC connection string
Example command:
```sh
python tools/create_kerberos_conn.py krb_conn krbuser@EXAMPLE.COM /etc/krb5.conf ./krbuser.keytab \
    'jdbc:hive2://<Hive host>:<port>;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=hive-host.example.com;KrbServiceName=hive'
```
Output:
```sql
CREATE CONNECTION krb_conn TO 'jdbc:hive2://<Hive host>:<port>;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=hive-host.example.com;KrbServiceName=hive' USER 'krbuser@EXAMPLE.COM' IDENTIFIED BY 'ExaAuthType=Kerberos;enp6Cg==;YWFhCg=='
```
You have to execute the generated `CREATE CONNECTION` statement directly in EXASOL to actually create the Kerberos `CONNECTION` object. For more detailed information about the script, use the help option:
```sh
python tools/create_kerberos_conn.py -h
```
You can now create a virtual schema using the Kerberos connection created before.
```sql
CREATE VIRTUAL SCHEMA <virtual schema name>
USING ADAPTER.JDBC_ADAPTER
WITH
  SQL_DIALECT     = 'HIVE'
  CONNECTION_NAME = 'KRB_CONN'
  SCHEMA_NAME     = '<schema name>';
```
Hive operates on schemaless data, while Virtual Schemas, as the name suggests, require a schema. For string columns this means the Hive JDBC driver cannot tell how long strings are when asked for schema metadata: to find out, it would have to scan all of the data first, which is not an option.
Instead the driver simply reports a configurable fixed value as the maximum string size. By default this is 255 characters. In the Virtual Schema this is mapped to `VARCHAR(255)` columns.
If you need larger strings, you have to override the default setting of the Hive JDBC driver by adding the following parameter to the JDBC connection string used by the connection of the Virtual Schema:

`DefaultStringColumnLength=<length>`
An example connection string then looks like this:

`jdbc:hive2://localhost:10000;DefaultStringColumnLength=32767;`
Please also note that 32767 characters is the maximum string length the driver accepts.
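For illustration, a connection that raises the limit could be defined as sketched below; the connection name, host, port, and credentials are placeholders, and the statement otherwise follows the connection definition shown earlier.

```sql
-- Sketch: Hive connection with a larger default string length.
-- Host, port, user and password are placeholders.
CREATE OR REPLACE CONNECTION HIVE_CONNECTION
TO 'jdbc:hive2://<Hive host>:<port>;DefaultStringColumnLength=32767'
USER '<user>'
IDENTIFIED BY '<password>';
```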
The dialect was tested with the Cloudera Hive JDBC driver available on the Cloudera downloads page. The driver is also available directly from Simba Technologies, who developed it. We tested with the JDBC 4.1 variant of the driver.
- `DECIMAL` with precision > 36 is cast to `VARCHAR` to prevent a loss of precision.
If you want to return a `DECIMAL` type instead, you can set the property `HIVE_CAST_NUMBER_TO_DECIMAL_WITH_PRECISION_AND_SCALE`:

`HIVE_CAST_NUMBER_TO_DECIMAL_WITH_PRECISION_AND_SCALE='36,20'`

This will cast `DECIMAL` with precision > 36 and `DECIMAL` without a specified precision to `DECIMAL(36,20)`.
Keep in mind that this will yield errors if the data in the Hive database does not fit into the specified `DECIMAL` type.
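As a sketch, the property can be supplied like the other properties in the WITH clause when creating the virtual schema; the schema and connection names below are placeholders matching the earlier examples.

```sql
-- Sketch: virtual schema that maps high-precision Hive DECIMALs to DECIMAL(36,20).
CREATE VIRTUAL SCHEMA <virtual schema name>
USING ADAPTER.JDBC_ADAPTER
WITH
  SQL_DIALECT     = 'HIVE'
  CONNECTION_NAME = 'HIVE_CONNECTION'
  SCHEMA_NAME     = '<schema name>'
  HIVE_CAST_NUMBER_TO_DECIMAL_WITH_PRECISION_AND_SCALE = '36,20';
```

For an existing virtual schema, the property can also be changed afterwards with `ALTER VIRTUAL SCHEMA <virtual schema name> SET HIVE_CAST_NUMBER_TO_DECIMAL_WITH_PRECISION_AND_SCALE = '36,20';`.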