---
title: Analyze and Process JSON documents with Hive in HDInsight | Microsoft Docs
description: Learn how to use JSON documents and analyze them using Hive in HDInsight.
services: hdinsight
documentationcenter: ''
author: mumian
manager: jhubbard
editor: cgronlun
ms.assetid: e17794e8-faae-4264-9434-67f61ea78f13
ms.service: hdinsight
ms.custom: hdinsightactive
ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: big-data
ms.date: 04/26/2017
ms.author: jgao
---
Learn how to process and analyze JSON files using Hive in HDInsight. The following JSON document is used in the tutorial:
{
"StudentId": "trgfg-5454-fdfdg-4346",
"Grade": 7,
"StudentDetails": [
{
"FirstName": "Peggy",
"LastName": "Williams",
"YearJoined": 2012
}
],
"StudentClassCollection": [
{
"ClassId": "89084343",
"ClassParticipation": "Satisfied",
"ClassParticipationRank": "High",
"Score": 93,
"PerformedActivity": false
},
{
"ClassId": "78547522",
"ClassParticipation": "NotSatisfied",
"ClassParticipationRank": "None",
"Score": 74,
"PerformedActivity": false
},
{
"ClassId": "78675563",
"ClassParticipation": "Satisfied",
"ClassParticipationRank": "Low",
"Score": 83,
"PerformedActivity": true
}
]
}
The file can be found at wasb://processjson@hditutorialdata.blob.core.windows.net/. For more information on using Azure Blob storage with HDInsight, see Use HDFS-compatible Azure Blob storage with Hadoop in HDInsight. You can copy the file to the default container of your cluster.
In this tutorial, you use the Hive console. For instructions on opening the Hive console, see Use Hive with Hadoop on HDInsight with Remote Desktop.
The methods listed in the next section require that the JSON document appear on a single row, so you must flatten the JSON document to a string. If your JSON document is already flattened, you can skip this step and go straight to the next section on analyzing JSON data.
-- Drop the tables if they already exist
DROP TABLE IF EXISTS StudentsRaw;
-- Point an external table at the raw, multi-line JSON document
CREATE EXTERNAL TABLE StudentsRaw (textcol string) STORED AS TEXTFILE LOCATION "wasb://processjson@hditutorialdata.blob.core.windows.net/";

DROP TABLE IF EXISTS StudentsOneLine;
-- This table stores each JSON document as a single-line string
CREATE EXTERNAL TABLE StudentsOneLine
(
  json_body string
)
STORED AS TEXTFILE LOCATION '/json/students';

-- Concatenate the lines of each file, in order, into one string per file
INSERT OVERWRITE TABLE StudentsOneLine
SELECT CONCAT_WS(' ',COLLECT_LIST(textcol)) AS singlelineJSON
FROM (SELECT INPUT__FILE__NAME,BLOCK__OFFSET__INSIDE__FILE, textcol FROM StudentsRaw DISTRIBUTE BY INPUT__FILE__NAME SORT BY BLOCK__OFFSET__INSIDE__FILE) x
GROUP BY INPUT__FILE__NAME;

SELECT * FROM StudentsOneLine;
The raw JSON file is located at wasb://processjson@hditutorialdata.blob.core.windows.net/. The StudentsRaw Hive table points to the raw, unflattened JSON document.
The StudentsOneLine Hive table stores the data in the HDInsight default file system under the /json/students/ path.
The INSERT statement populates the StudentsOneLine table with the flattened JSON data.
The SELECT statement returns only one row.
Here is the output of the SELECT statement:
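Derived from the sample document above, the flattened result looks similar to the following single line (whitespace may differ depending on the indentation of the source file):

```json
{ "StudentId": "trgfg-5454-fdfdg-4346", "Grade": 7, "StudentDetails": [ { "FirstName": "Peggy", "LastName": "Williams", "YearJoined": 2012 } ], "StudentClassCollection": [ { "ClassId": "89084343", "ClassParticipation": "Satisfied", "ClassParticipationRank": "High", "Score": 93, "PerformedActivity": false }, { "ClassId": "78547522", "ClassParticipation": "NotSatisfied", "ClassParticipationRank": "None", "Score": 74, "PerformedActivity": false }, { "ClassId": "78675563", "ClassParticipation": "Satisfied", "ClassParticipationRank": "Low", "Score": 83, "PerformedActivity": true } ] }
```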
Hive provides three mechanisms to run queries on JSON documents, or you can write your own:

- use the GET_JSON_OBJECT UDF (user-defined function)
- use the JSON_TUPLE UDF
- use a custom SerDe
- write your own UDF using Python or other languages. See this article on running your own Python code with Hive.
Hive provides a built-in UDF called get_json_object, which can query JSON at run time. The method takes two arguments: the column holding the flattened JSON document, and the JSON path of the field that needs to be parsed. Let's look at an example to see how this UDF works.

Get the first name and last name for each student:
SELECT
GET_JSON_OBJECT(StudentsOneLine.json_body,'$.StudentDetails.FirstName'),
GET_JSON_OBJECT(StudentsOneLine.json_body,'$.StudentDetails.LastName')
FROM StudentsOneLine;
Here is the output when running this query in the console window.
There are a few limitations of the get_json_object UDF.

- Because each field in the query requires reparsing of the JSON document, performance suffers.
- GET_JSON_OBJECT() returns the string representation of an array. To convert this string to a Hive array, you have to use regular expressions to strip the square brackets '[' and ']', and then call split to get the array.
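As a minimal sketch of that workaround, assuming the flattened StudentsOneLine table from earlier (and that your Hive version supports the [*] wildcard in JSON paths):

```hql
-- GET_JSON_OBJECT returns the scores as the string "[93,74,83]";
-- strip the brackets with a regular expression, then split on commas
-- to obtain a Hive array of strings.
SELECT split(
         regexp_replace(
           GET_JSON_OBJECT(json_body, '$.StudentClassCollection[*].Score'),
           '\\[|\\]', ''),
         ',') AS scores
FROM StudentsOneLine;
```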
This is why the Hive wiki recommends using json_tuple.
Another UDF provided by Hive is called json_tuple, which performs better than get_json_object. This method takes a set of keys and a JSON string, and returns a tuple of values in one function call. The following query returns the student ID and the grade from the JSON document:
SELECT q1.StudentId, q1.Grade
FROM StudentsOneLine jt
LATERAL VIEW JSON_TUPLE(jt.json_body, 'StudentId', 'Grade') q1
AS StudentId, Grade;
With the sample document, this query returns the student ID trgfg-5454-fdfdg-4346 and the grade 7 in the Hive console.
JSON_TUPLE uses the lateral view syntax in Hive, which allows json_tuple to create a virtual table by applying the UDTF (user-defined table-generating function) to each row of the original table. Complex JSON documents become unwieldy because of the repeated use of LATERAL VIEW. Furthermore, JSON_TUPLE cannot handle nested JSON.
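To see the nesting limitation concretely, consider this small sketch against the StudentsOneLine table: when you ask json_tuple for a nested field, it returns the value as a raw JSON string that would have to be parsed again.

```hql
-- json_tuple returns non-primitive values as plain JSON text;
-- there is no path syntax to reach StudentDetails.FirstName directly.
SELECT q1.details
FROM StudentsOneLine jt
LATERAL VIEW JSON_TUPLE(jt.json_body, 'StudentDetails') q1 AS details;
-- returns: [{"FirstName":"Peggy","LastName":"Williams","YearJoined":2012}]
```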
A SerDe is the best choice for parsing nested JSON documents. It allows you to define the JSON schema, and then use the schema to parse the documents. In this tutorial, you use one of the more popular SerDes, developed by Roberto Congiu.
To use the custom SerDe:

1. Install Java SE Development Kit 7u55 (JDK 1.7.0_55). Choose the Windows x64 version of the JDK if you are going to use the Windows deployment of HDInsight.

   > [!WARNING]
   > JDK 1.8 doesn't work with this SerDe.

   After the installation is completed, add a new user environment variable named JAVA_HOME that points to the JDK installation folder (for example, C:\Program Files\Java\jdk1.7.0_55).

2. Install Maven 3.3.1. Add the Maven bin folder to your path: open Control Panel, select Edit environment variables for your account, and append the bin folder to the Path variable.

3. Clone the project from the Hive-JSON-SerDe GitHub site. You can do this by clicking the "Download ZIP" button.

4. Go to the folder where you downloaded the package, and then type "mvn package". This creates the necessary jar files that you can then copy over to the cluster.

5. Go to the target folder under the root folder where you downloaded the package. Upload the json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar file to the head node of your cluster. I usually put it under the Hive binary folder: C:\apps\dist\hive-0.13.0.2.1.11.0-2316\bin or something similar.

6. In the Hive prompt, type "add jar /path/to/json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar". Because in my case the jar is in the C:\apps\dist\hive-0.13.x\bin folder, I can directly add the jar by name:

   add jar json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar;
Now, you are ready to use the SerDe to run queries against the JSON document.
The following statement creates a table with a defined schema:
DROP TABLE IF EXISTS json_table;
CREATE EXTERNAL TABLE json_table (
StudentId string,
Grade int,
StudentDetails array<struct<
FirstName:string,
LastName:string,
YearJoined:int
>
>,
StudentClassCollection array<struct<
ClassId:string,
ClassParticipation:string,
ClassParticipationRank:string,
Score:int,
PerformedActivity:boolean
>
>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/json/students';
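With the schema defined, nested fields can be addressed directly. For example, this short sketch (not part of the original tutorial) retrieves fields of the first element of the StudentDetails array by index:

```hql
-- StudentDetails is declared as array<struct<...>>, so individual
-- elements can be addressed with an array index.
SELECT StudentId, StudentDetails[0].FirstName, StudentDetails[0].YearJoined
FROM json_table;
```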
To list the first name and last name of the student:
SELECT StudentDetails.FirstName, StudentDetails.LastName FROM json_table;
With the sample document, this query returns ["Peggy"] and ["Williams"] in the Hive console; the values come back as arrays because StudentDetails is defined as an array of structs.
To calculate the sum of all scores in the JSON document:
SELECT SUM(scores)
FROM json_table jt
lateral view explode(jt.StudentClassCollection.Score) collection as scores;
The preceding query uses the LATERAL VIEW EXPLODE UDTF to expand the array of scores so that they can be summed.
With the sample document, the Hive console outputs 250 (93 + 74 + 83).
To find which classes a given student has scored more than 80 points in:
SELECT
jt.StudentClassCollection.ClassId
FROM json_table jt
lateral view explode(jt.StudentClassCollection.Score) collection as score where score > 80;
The preceding query returns a Hive array, unlike get_json_object, which returns a string.
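If you want one scalar row per class instead of an array, an alternative sketch (not in the original tutorial) is to explode the array of structs itself and filter on each struct's fields:

```hql
-- Explode the struct array to get one row per class,
-- then filter on the score of each individual class.
SELECT c.ClassId, c.Score
FROM json_table jt
LATERAL VIEW explode(jt.StudentClassCollection) collection AS c
WHERE c.Score > 80;
```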
If you want to skip malformed JSON, you can instruct the SerDe to ignore it, as explained on the SerDe's wiki page:
ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
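With this property set, a query no longer fails when it encounters malformed JSON; according to the SerDe's documentation, such rows are returned as NULL instead.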
In conclusion, the type of JSON operator in Hive that you choose depends on your scenario. If you have a simple JSON document and only one field to look up, you can use the Hive UDF get_json_object. If you have more than one key to look up, you can use json_tuple. If you have a nested document, you should use the JSON SerDe.
For other related articles, see Use Hive with Hadoop on HDInsight with Remote Desktop and the rest of the HDInsight Hive documentation.