---
title: Analyze and Process JSON documents with Hive in HDInsight | Microsoft Docs
description: Learn how to use JSON documents and analyze them using Hive in HDInsight.
services: hdinsight
documentationcenter: ''
author: mumian
manager: jhubbard
editor: cgronlun

ms.assetid: e17794e8-faae-4264-9434-67f61ea78f13
ms.service: hdinsight
ms.custom: hdinsightactive
ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: big-data
ms.date: 04/26/2017
ms.author: jgao
---

# Process and analyze JSON documents using Hive in HDInsight

Learn how to process and analyze JSON files using Hive in HDInsight. The following JSON document is used in the tutorial:

```json
{
    "StudentId": "trgfg-5454-fdfdg-4346",
    "Grade": 7,
    "StudentDetails": [
        {
            "FirstName": "Peggy",
            "LastName": "Williams",
            "YearJoined": 2012
        }
    ],
    "StudentClassCollection": [
        {
            "ClassId": "89084343",
            "ClassParticipation": "Satisfied",
            "ClassParticipationRank": "High",
            "Score": 93,
            "PerformedActivity": false
        },
        {
            "ClassId": "78547522",
            "ClassParticipation": "NotSatisfied",
            "ClassParticipationRank": "None",
            "Score": 74,
            "PerformedActivity": false
        },
        {
            "ClassId": "78675563",
            "ClassParticipation": "Satisfied",
            "ClassParticipationRank": "Low",
            "Score": 83,
            "PerformedActivity": true
        }
    ]
}
```

The file can be found at wasb://processjson@hditutorialdata.blob.core.windows.net/. For more information on using Azure Blob storage with HDInsight, see Use HDFS-compatible Azure Blob storage with Hadoop in HDInsight. You can copy the file to the default container of your cluster.

In this tutorial, you use the Hive console. For instructions on opening the Hive console, see Use Hive with Hadoop on HDInsight with Remote Desktop.

## Flatten JSON documents

The methods listed in the next section require that the JSON document be composed of a single row. So, you must flatten the JSON document into a string. If your JSON document is already flattened, you can skip this step and go straight to the next section on analyzing JSON data.

```sql
DROP TABLE IF EXISTS StudentsRaw;
-- Point an external table at the raw, unflattened JSON document.
CREATE EXTERNAL TABLE StudentsRaw (textcol string) STORED AS TEXTFILE LOCATION "wasb://processjson@hditutorialdata.blob.core.windows.net/";

DROP TABLE IF EXISTS StudentsOneLine;
CREATE EXTERNAL TABLE StudentsOneLine
(
  json_body string
)
STORED AS TEXTFILE LOCATION '/json/students';

-- Collect the lines of each source file in order, and concatenate them into a
-- single-line JSON string per file.
INSERT OVERWRITE TABLE StudentsOneLine
SELECT CONCAT_WS(' ',COLLECT_LIST(textcol)) AS singlelineJSON
      FROM (SELECT INPUT__FILE__NAME,BLOCK__OFFSET__INSIDE__FILE, textcol FROM StudentsRaw DISTRIBUTE BY INPUT__FILE__NAME SORT BY BLOCK__OFFSET__INSIDE__FILE) x
      GROUP BY INPUT__FILE__NAME;

SELECT * FROM StudentsOneLine;
```

The raw JSON file is located at wasb://processjson@hditutorialdata.blob.core.windows.net/. The StudentsRaw Hive table points to the raw, unflattened JSON document.

The StudentsOneLine Hive table stores the data in the HDInsight default file system under the /json/students/ path.

The INSERT statement populates the StudentsOneLine table with the flattened JSON data.

The SELECT statement returns only one row.

Here is the output of the SELECT statement:

*Flattening of the JSON document*

## Analyze JSON documents in Hive

Hive provides three different mechanisms to run queries on JSON documents, or you can write your own:

* Use the GET_JSON_OBJECT UDF (user-defined function).
* Use the JSON_TUPLE UDF.
* Use a custom SerDe.
* Write your own UDF using Python or other languages. See this article on running your own Python code with Hive.

### Use the GET_JSON_OBJECT UDF

Hive provides a built-in UDF called get_json_object, which queries JSON at run time. This method takes two arguments: the column that contains the flattened JSON document, and the JSON path of the field to be parsed. Let's look at an example to see how this UDF works.

Get the first name and last name for each student

```sql
SELECT
  GET_JSON_OBJECT(StudentsOneLine.json_body,'$.StudentDetails.FirstName'),
  GET_JSON_OBJECT(StudentsOneLine.json_body,'$.StudentDetails.LastName')
FROM StudentsOneLine;
```

Here is the output when running this query in the console window:

*get_json_object UDF*

There are a few limitations of the GET_JSON_OBJECT UDF:

* Because each field in the query requires reparsing of the JSON document, performance suffers.
* GET_JSON_OBJECT() returns the string representation of an array. To convert this string to a Hive array, you have to use regular expressions to replace the square brackets '[' and ']', and then call split to get the array, as sketched below.
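
As a rough sketch of that workaround (my addition, using the sample document; not part of the original walkthrough):

```sql
-- Hypothetical workaround: GET_JSON_OBJECT returns the scores as the JSON string
-- "[93,74,83]"; strip the brackets with regexp_replace, then split on commas to
-- get a Hive array of score strings.
SELECT split(
         regexp_replace(
           GET_JSON_OBJECT(json_body, '$.StudentClassCollection[*].Score'),
           '\\[|\\]', ''),
         ',') AS scores
FROM StudentsOneLine;
```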

This is why the Hive wiki recommends using json_tuple.

### Use the JSON_TUPLE UDF

Another UDF provided by Hive is called json_tuple, which performs better than get_json_object. This method takes a set of keys and a JSON string, and returns a tuple of values in a single function call. The following query returns the StudentId and the Grade from the JSON document:

```sql
SELECT q1.StudentId, q1.Grade
  FROM StudentsOneLine jt
  LATERAL VIEW JSON_TUPLE(jt.json_body, 'StudentId', 'Grade') q1
    AS StudentId, Grade;
```

The output of this script in the Hive console:

*json_tuple UDF*

JSON_TUPLE uses the LATERAL VIEW syntax in Hive, which allows json_tuple to create a virtual table by applying the UDTF (user-defined table function) to each row of the original table. Complex JSON documents become too unwieldy because of the repeated use of LATERAL VIEW. Furthermore, JSON_TUPLE cannot handle nested JSON.
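
To illustrate (a hedged sketch, not from the original article): json_tuple can only surface the nested StudentDetails array as a raw JSON string, so extracting an inner field still requires a second mechanism such as GET_JSON_OBJECT:

```sql
-- JSON_TUPLE returns StudentDetails as the JSON string of the whole array;
-- GET_JSON_OBJECT is still needed to reach the first element's FirstName.
SELECT q1.StudentId,
       GET_JSON_OBJECT(q1.Details, '$[0].FirstName') AS FirstName
FROM StudentsOneLine jt
  LATERAL VIEW JSON_TUPLE(jt.json_body, 'StudentId', 'StudentDetails') q1
    AS StudentId, Details;
```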

### Use custom SerDe

SerDe is the best choice for parsing nested JSON documents. It allows you to define the JSON schema, and then use that schema to parse the documents. In this tutorial, you use one of the more popular SerDes, developed by Roberto Congiu.

To use the custom SerDe:

  1. Install the Java SE Development Kit 7u55 (JDK 1.7.0_55). Choose the Windows x64 version of the JDK if you are going to be using the Windows deployment of HDInsight.

    > [!WARNING]
    > JDK 1.8 doesn't work with this SerDe.

    After the installation is completed, add a new user environment variable:

    1. Open View advanced system settings from the Windows screen.

    2. Click Environment Variables.

    3. Add a new JAVA_HOME environment variable pointing to C:\Program Files\Java\jdk1.7.0_55, or wherever your JDK is installed.

      *Setting up correct config values for JDK*

  2. Install Maven 3.3.1.

    Add the bin folder to your PATH by going to Control Panel > Edit environment variables for your account. The following screenshot shows you how to do this.

    *Setting up Maven*

  3. Clone the project from the Hive-JSON-SerDe GitHub site. You can do this by clicking the "Download ZIP" button as shown in the following screenshot.

    *Cloning the project*

  4. Go to the folder where you downloaded the package, and then type `mvn package`. This command creates the necessary jar files that you can then copy over to the cluster.

  5. Go to the target folder under the root folder where you downloaded the package. Upload the json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar file to the head node of your cluster. I usually put it under the Hive binary folder: C:\apps\dist\hive-0.13.0.2.1.11.0-2316\bin or something similar.

  6. In the Hive prompt, type `add jar /path/to/json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar`. Because in my case the jar is in the C:\apps\dist\hive-0.13.x\bin folder, I can directly add the jar by name alone:

    ```sql
    add jar json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar;
    ```

    *Adding JAR to your project*

Now, you are ready to use the SerDe to run queries against the JSON document.

The following statement creates a table with a defined schema. The table's location is the /json/students folder, which holds the flattened documents produced in the earlier step:

```sql
DROP TABLE IF EXISTS json_table;
CREATE EXTERNAL TABLE json_table (
  StudentId string,
  Grade int,
  StudentDetails array<struct<
      FirstName:string,
      LastName:string,
      YearJoined:int
      >
  >,
  StudentClassCollection array<struct<
      ClassId:string,
      ClassParticipation:string,
      ClassParticipationRank:string,
      Score:int,
      PerformedActivity:boolean
      >
  >
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/json/students';
```
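
As a quick sanity check (my addition, not part of the original walkthrough), you can confirm that the SerDe parses the flattened document before running the remaining queries:

```sql
-- Expect one row with the top-level scalar fields of the sample document:
-- trgfg-5454-fdfdg-4346, 7
SELECT StudentId, Grade FROM json_table;
```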

To list the first name and last name of the student:

```sql
SELECT StudentDetails.FirstName, StudentDetails.LastName FROM json_table;
```

Here is the result from the Hive console.

*SerDe Query 1*
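
Because StudentDetails is an array of structs, the preceding query returns each name wrapped in a single-element array. As a hypothetical variation (not in the original article), indexing the first element returns plain scalar values instead:

```sql
-- Indexing the array's first element yields scalar columns: Peggy, Williams.
SELECT StudentDetails[0].FirstName, StudentDetails[0].LastName FROM json_table;
```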

To calculate the sum of scores of the JSON document:

```sql
SELECT SUM(scores)
FROM json_table jt
  lateral view explode(jt.StudentClassCollection.Score) collection as scores;
```

The preceding query uses the LATERAL VIEW syntax with the explode UDTF to expand the array of scores so that they can be summed.

Here is the output from the Hive console.

*SerDe Query 2*

To find the subjects in which a given student has scored more than 80 points:

```sql
SELECT
  jt.StudentClassCollection.ClassId
FROM json_table jt
  lateral view explode(jt.StudentClassCollection.Score) collection as score
WHERE score > 80;
```

The preceding query returns a Hive array, unlike get_json_object, which returns a string. To return only the matching classes rather than the whole ClassId array, you can explode the struct array itself, as sketched after the following screenshot.

*SerDe Query 3*
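
Here is a sketch of that refinement (my addition, not from the original article): exploding the whole StudentClassCollection struct array pairs each ClassId with its own Score, so only the matching classes come back:

```sql
-- Explode the array of structs so each class becomes its own row, then filter.
-- With the sample document, this returns ClassIds 89084343 (score 93) and
-- 78675563 (score 83).
SELECT classinfo.ClassId, classinfo.Score
FROM json_table jt
  LATERAL VIEW explode(jt.StudentClassCollection) collection AS classinfo
WHERE classinfo.Score > 80;
```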

If you want to skip malformed JSON, then (as explained in the wiki page of this SerDe) you can do so by running the following statement:

```sql
ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
```

## Summary

In conclusion, the type of JSON operator in Hive that you choose depends on your scenario. If you have a simple JSON document and only one field to look up, you can use the Hive UDF get_json_object. If you have more than one key to look up, you can use json_tuple. If you have a nested document, you should use the JSON SerDe.

## Next steps

For other related articles, see: