title: Upload data for Hadoop jobs in HDInsight | Microsoft Docs
description: Learn how to upload and access data for Hadoop jobs in HDInsight using the Azure CLI, Azure Storage Explorer, Azure PowerShell, the Hadoop command line, or Sqoop.
keywords: etl hadoop, getting data into hadoop, hadoop load data
services: hdinsight,storage
documentationcenter: ''
tags: azure-portal
author: mumian
manager: jhubbard
editor: cgronlun
ms.assetid: 56b913ee-0f9a-4e9f-9eaf-c571f8603dd6
ms.service: hdinsight
ms.custom: hdinsightactive,hdiseo17may2017
ms.workload: big-data
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 05/12/2017
ms.author: jgao
Azure HDInsight provides a full-featured Hadoop distributed file system (HDFS) over Azure Blob storage. It is designed as an HDFS extension to provide a seamless experience to customers. It enables the full set of components in the Hadoop ecosystem to operate directly on the data it manages. Azure Blob storage and HDFS are distinct file systems that are optimized for storage of data and computations on that data. For information about the benefits of using Azure Blob storage, see Use Azure Blob storage with HDInsight.
Prerequisites
Note the following requirement before you begin:
- An Azure HDInsight cluster. For instructions, see Get started with Azure HDInsight or Provision HDInsight clusters.
Azure HDInsight clusters are typically deployed to run MapReduce jobs, and the clusters are dropped after these jobs complete. Keeping the data in the HDFS clusters after computations are complete would be an expensive way to store this data. Azure Blob storage is a highly available, highly scalable, high capacity, low cost, and shareable storage option for data that is to be processed using HDInsight. Storing data in a blob enables the HDInsight clusters that are used for computation to be safely released without losing data.
Azure Blob storage containers store data as key/value pairs, and there is no directory hierarchy. However, the "/" character can be used within the key name to make it appear as if a file is stored within a directory structure. HDInsight sees these as if they are actual directories.
For example, a blob's key may be input/log1.txt. No actual "input" directory exists, but due to the presence of the "/" character in the key name, it has the appearance of a file path.
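To make the mapping concrete, here is a sketch of the two ways the same blob appears (the container and storage account names are hypothetical):

    # Blob key as stored in the container (flat namespace):
    #   input/log1.txt
    #
    # The same blob addressed through the wasb:// scheme used by HDInsight
    # (assuming a container named "mycontainer" in account "mystorageaccount"):
    #   wasb://mycontainer@mystorageaccount.blob.core.windows.net/input/log1.txt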
Because of this, if you use Azure Explorer tools, you may notice some 0-byte files. These files serve two purposes:
- They mark the existence of empty folders. Azure Blob storage is clever enough to know that if a blob called foo/bar exists, there is a folder called foo. But the only way to signify an empty folder called foo is by having this special 0-byte file in place.
- They hold special metadata that is needed by the Hadoop file system, notably the permissions and owners for the folders.
Microsoft provides the following utilities to work with Azure Blob storage:
Tool | Linux | OS X | Windows |
---|---|---|---|
Azure Command-Line Interface | ✔ | ✔ | ✔ |
Azure PowerShell | | | ✔ |
AzCopy | | | ✔ |
Hadoop command | ✔ | ✔ | ✔ |
Note
While the Azure CLI, Azure PowerShell, and AzCopy can all be used from outside Azure, the Hadoop command is only available on the HDInsight cluster and only allows loading data from the local file system into Azure Blob storage.
The Azure CLI is a cross-platform tool that allows you to manage Azure services. Use the following steps to upload data to Azure Blob storage:
[!INCLUDE use-latest-version]
1. Install and configure the Azure CLI for Mac, Linux and Windows.
2. Open a command prompt, bash, or other shell, and use the following to authenticate to your Azure subscription.

        azure login

    When prompted, enter the user name and password for your subscription.
3. Enter the following command to list the storage accounts for your subscription:

        azure storage account list
4. Select the storage account that contains the blob you want to work with, then use the following command to retrieve the key for this account:

        azure storage account keys list <storage-account-name>

    This should return Primary and Secondary keys. Copy the Primary key value because it will be used in the next steps.
5. Use the following command to retrieve a list of blob containers within the storage account:

        azure storage container list -a <storage-account-name> -k <primary-key>
6. Use the following commands to upload and download files to the blob:

    - To upload a file:

            azure storage blob upload -a <storage-account-name> -k <primary-key> <source-file> <container-name> <blob-name>

    - To download a file:

            azure storage blob download -a <storage-account-name> -k <primary-key> <container-name> <blob-name> <destination-file>
Note
If you will always be working with the same storage account, you can set the following environment variables instead of specifying the account and key for every command:
- AZURE_STORAGE_ACCOUNT: The storage account name
- AZURE_STORAGE_ACCESS_KEY: The storage account key
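For example, in a bash session you might set the variables like this (a minimal sketch; the values shown are placeholders for your own account name and key):

    # Placeholder values; substitute your own storage account name and primary key.
    export AZURE_STORAGE_ACCOUNT=<storage-account-name>
    export AZURE_STORAGE_ACCESS_KEY=<primary-key>

    # With the variables set, the -a and -k parameters can be omitted:
    azure storage container list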
Azure PowerShell is a scripting environment that you can use to control and automate the deployment and management of your workloads in Azure. For information about configuring your workstation to run Azure PowerShell, see Install and configure Azure PowerShell.
[!INCLUDE use-latest-version]
To upload a local file to Azure Blob storage
1. Open the Azure PowerShell console as instructed in Install and configure Azure PowerShell.
2. Set the values of the first five variables in the following script:

        $resourceGroupName = "<AzureResourceGroupName>"
        $storageAccountName = "<StorageAccountName>"
        $containerName = "<ContainerName>"

        $fileName = "<LocalFileName>"
        $blobName = "<BlobName>"

        # Get the storage account key
        $storageAccountKey = (Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName -Name $storageAccountName)[0].Value

        # Create the storage context object
        $destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

        # Copy the file from local workstation to the Blob container
        Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $destContext
3. Paste the script into the Azure PowerShell console to run it and copy the file.
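To copy a blob back down to the workstation, a similar sketch uses the same storage context (this assumes the variables from the script above are still set; the local destination folder is a hypothetical example):

    # Download the blob to a local folder (C:\temp is a hypothetical destination).
    Get-AzureStorageBlobContent -Blob $blobName `
                                -Container $containerName `
                                -Destination "C:\temp" `
                                -Context $destContext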
For examples of PowerShell scripts created to work with HDInsight, see HDInsight tools.
AzCopy is a command-line tool that is designed to simplify the task of transferring data into and out of an Azure Storage account. You can use it as a standalone tool or incorporate it into an existing application. Download AzCopy.
The AzCopy syntax is:
AzCopy <Source> <Destination> [filePattern [filePattern...]] [Options]
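For example, copying a single local file into a blob container might look like the following sketch (option names such as /DestKey vary between AzCopy versions, so treat this as illustrative only):

    AzCopy C:\data https://<StorageAccountName>.blob.core.windows.net/<ContainerName>/ data.txt /DestKey:<primary-key>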
For more information, see AzCopy - Uploading/Downloading files for Azure Blobs.
The Hadoop command line is only useful for storing data into blob storage when the data is already present on the cluster head node.
To use the Hadoop command, you must first connect to the head node using one of the following methods:
- Windows-based HDInsight: Connect using Remote Desktop
- Linux-based HDInsight: Connect using SSH (the SSH command or PuTTY)
Once connected, you can use the following syntax to upload a file to storage.
hadoop fs -copyFromLocal <localFilePath> <storageFilePath>
For example, hadoop fs -copyFromLocal data.txt /example/data/data.txt
Because the default file system for HDInsight is in Azure Blob storage, /example/data/data.txt is actually in Azure Blob storage. You can also refer to the file as:
wasb:///example/data/data.txt
or
wasb://<ContainerName>@<StorageAccountName>.blob.core.windows.net/example/data/data.txt
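After the upload, you can verify that the file is visible through either style of path; for example:

    hadoop fs -ls /example/data
    hadoop fs -ls wasb:///example/data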
For a list of other Hadoop commands that work with files, see http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
Warning
On HBase clusters, the default block size used when writing data is 256 KB. While this works fine when using HBase APIs or REST APIs, using the `hadoop` or `hdfs dfs` commands to write data larger than ~12 GB results in an error. See the Storage exception for write on blob section below for more information.
There are also several applications that provide a graphical interface for working with Azure Storage. The following is a list of a few of these applications:
Client | Linux | OS X | Windows |
---|---|---|---|
Microsoft Visual Studio Tools for HDInsight | ✔ | ✔ | ✔ |
Azure Storage Explorer | ✔ | ✔ | ✔ |
Cloud Storage Studio 2 | | | ✔ |
CloudXplorer | | | ✔ |
Azure Explorer | | | ✔ |
Cyberduck | | ✔ | ✔ |
For more information, see Navigate the linked resources.
Azure Storage Explorer is a useful tool for inspecting and altering the data in blobs. It is a free, open source tool that can be downloaded from http://storageexplorer.com/. The source code is available from this link as well.
Before using the tool, you must know your Azure storage account name and account key. For instructions about getting this information, see the "How to: View, copy and regenerate storage access keys" section of Create, manage, or delete a storage account.
1. Run Azure Storage Explorer. If this is the first time you have run the Storage Explorer, you are prompted for the Storage account name and Storage account key. If you have run it before, use the Add button to add a new storage account name and key.

    Enter the name and key for the storage account used by your HDInsight cluster and then select SAVE & OPEN.
2. In the list of containers to the left of the interface, click the name of the container that is associated with your HDInsight cluster. By default, this is the name of the HDInsight cluster, but it may be different if you entered a specific name when creating the cluster.
3. From the tool bar, select the upload icon.
4. Specify a file to upload, and then click Open. When prompted, select Upload to upload the file to the root of the storage container. If you want to upload the file to a specific path, enter the path in the Destination field and then select Upload.
Once the file has finished uploading, you can use it from jobs on the HDInsight cluster.
See Mount Azure Blob Storage as Local Drive.
The Azure Data Factory service is a fully managed service for composing data storage, data processing, and data movement services into streamlined, scalable, and reliable data production pipelines.
Azure Data Factory can be used to move data into Azure Blob storage, or to create data pipelines that directly use HDInsight features such as Hive and Pig.
For more information, see the Azure Data Factory documentation.
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use it to import data from a relational database management system (RDBMS), such as SQL Server, MySQL, or Oracle into the Hadoop distributed file system (HDFS), transform the data in Hadoop with MapReduce or Hive, and then export the data back into an RDBMS.
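For example, a Sqoop import from a SQL database into the cluster's default storage might look like the following sketch (the server, database, table, and credential values are hypothetical placeholders):

    sqoop import --connect "jdbc:sqlserver://<ServerName>.database.windows.net:1433;database=<DatabaseName>" \
        --username <SqlUserName> --password <SqlPassword> \
        --table <TableName> \
        --target-dir wasb:///example/data/<TableName>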
For more information, see Use Sqoop with HDInsight.
Azure Blob storage can also be accessed using an Azure SDK from the following programming languages:
- .NET
- Java
- Node.js
- PHP
- Python
- Ruby
For more information on installing the Azure SDKs, see Azure downloads.
Symptoms: When using the `hadoop` or `hdfs dfs` commands to write files that are ~12 GB or larger on an HBase cluster, you may encounter the following error:
ERROR azure.NativeAzureFileSystem: Encountered Storage Exception for write on Blob : example/test_large_file.bin._COPYING_ Exception details: null Error Code : RequestBodyTooLarge
copyFromLocal: java.io.IOException
at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:661)
at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:366)
at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:350)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.microsoft.azure.storage.StorageException: The request body is too large and exceeds the maximum permissible limit.
at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:89)
at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:307)
at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:182)
at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlockInternal(CloudBlockBlob.java:816)
at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlock(CloudBlockBlob.java:788)
at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:354)
... 7 more
Cause: HBase on HDInsight clusters defaults to a block size of 256 KB when writing to Azure storage. While this works for HBase APIs or REST APIs, it results in an error when using the `hadoop` or `hdfs dfs` command-line utilities.
Resolution: Use `fs.azure.write.request.size` to specify a larger block size. You can do this on a per-use basis by using the `-D` parameter. The following command is an example of using this parameter with the `hadoop` command:
hadoop fs -D fs.azure.write.request.size=4194304 -copyFromLocal test_large_file.bin /example/data
You can also increase the value of `fs.azure.write.request.size` globally by using Ambari. The following steps can be used to change the value in the Ambari Web UI:
1. In your browser, go to the Ambari Web UI for your cluster. This is https://CLUSTERNAME.azurehdinsight.net, where CLUSTERNAME is the name of your cluster.

    When prompted, enter the admin name and password for the cluster.
2. From the left side of the screen, select HDFS, and then select the Configs tab.
3. In the Filter... field, enter `fs.azure.write.request.size`. This displays the field and its current value in the middle of the page.

4. Change the value from 262144 (256 KB) to the new value. For example, 4194304 (4 MB).
For more information on using Ambari, see Manage HDInsight clusters using the Ambari Web UI.
Now that you understand how to get data into HDInsight, read the following articles to learn how to perform analysis: