hdinsight-install-published-app-cask.md

title: Install published application - Cask Data Application Platform (CDAP) on Azure HDInsight | Microsoft Docs
description: Learn how to install third-party Hadoop applications on Azure HDInsight.
services: hdinsight
tags: azure-portal
ms.service: hdinsight
ms.custom: hdinsightactive
ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: big-data

Install published application - Cask Data Application Platform (CDAP) on Azure HDInsight

In this article, you will learn how to install the CDAP published Hadoop application on Azure HDInsight. Read Install third-party Hadoop applications for a list of available Independent Software Vendor (ISV) applications, as well as an overview of the HDInsight application platform. For instructions on installing your own application, see Install custom HDInsight applications.

About CDAP

Developing applications in the traditional Hadoop world is not an easy task. Listed below are some key aspects that add to the challenges faced by a Hadoop developer:

  • Over the past few years, the increased interest in the Big Data space has resulted in a technology explosion in the Hadoop ecosystem. It has become progressively difficult to keep track of all the existing technologies as well as the new ones that keep emerging.
  • Simple processes like data ingestion and ETL require a complicated setup which is not generally extensible or reusable.
  • Apart from the significant learning curve involved in using each of the different Hadoop technologies, there is a substantial amount of time spent in integrating all of them to form a data processing solution.
  • Moving from a proof-of-concept solution to a production-ready one is far from trivial; it involves multiple iterations and can make delivery times less predictable.
  • It is hard to locate data and trace its flow in an application. Collecting metrics and auditing is generally a challenge and often requires building a separate solution.

How does CDAP help?

CDAP (Cask Data Application Platform) is a unified integration platform for big data. Its main benefit is that a user can focus on building applications rather than on the underlying infrastructure and integration.

CDAP works with high-level concepts and abstractions that are familiar to developers and lets them use their existing skills to build new solutions. These abstractions hide the complexities of internal systems and encourage reusability of solutions.

An extension called Cask Hydrator is available in CDAP, which provides a rich user interface to develop and manage data pipelines. A data pipeline is composed of various plugins which perform several tasks like data acquisition, transformation, analysis, and post-run operations.

Each CDAP plugin has well-defined interfaces, which means that evaluating a different technology is just a matter of replacing one plugin with another; there is no need to touch the rest of the application.
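The plugin idea can be illustrated with a minimal Python sketch. This is not CDAP's plugin API; the names `Plugin`, `run_pipeline`, `keep_english`, `uppercase_text`, and `strip_text` are hypothetical and exist only for this illustration. It assumes every stage shares one interface (records in, records out), so one stage can be replaced without changing the others.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
# In this sketch a "plugin" is any function that maps records to records.
Plugin = Callable[[Iterable[Record]], Iterable[Record]]


def keep_english(records: Iterable[Record]) -> Iterable[Record]:
    """Filter stage: keep only records whose language is English."""
    return (r for r in records if r.get("lang") == "en")


def uppercase_text(records: Iterable[Record]) -> Iterable[Record]:
    """One transform implementation: upper-case the text field."""
    return ({**r, "text": r["text"].upper()} for r in records)


def strip_text(records: Iterable[Record]) -> Iterable[Record]:
    """An alternative transform with the same interface: trim whitespace."""
    return ({**r, "text": r["text"].strip()} for r in records)


def run_pipeline(records: Iterable[Record], stages: List[Plugin]) -> List[Record]:
    """Feed the records through each stage in order and collect the result."""
    for stage in stages:
        records = stage(records)
    return list(records)


# Hypothetical sample records, for illustration only.
sample = [
    {"lang": "en", "text": "  hello  "},
    {"lang": "fr", "text": "bonjour"},
]

# Swapping the transform changes a single entry in the stage list.
print(run_pipeline(sample, [keep_english, uppercase_text]))
print(run_pipeline(sample, [keep_english, strip_text]))
```

Because the filter stage is untouched in both runs, only the transform under evaluation differs between them.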

CDAP pipelines provide a high-level pictorial view of the data flow in your application. This lets developers easily visualize the end-to-end flow and every processing step, from ingestion, through the various transformations and analyses, to the eventual write into an external data store.

Here is an example of a data pipeline that ingests Twitter data in real time, filters out tweets based on predefined criteria, transforms and projects the data into a more readable format, groups the records according to a set of values, and writes the results into an HBase store.

CDAP pipeline

The end-to-end pipeline was completely built using the Cask Hydrator UI, utilizing its plugin interface and drag-and-drop functionality to form connections between each stage. It is easy to isolate and modify the functionality of each plugin independent of the rest of the pipeline. Using CDAP, similar pipelines can be built and validated in less than a couple of hours. In the traditional Hadoop world, constructing such solutions could easily take a few days.

Additionally, CDAP provides an extension called Cask Tracker, where you can visually trace the data as it flows through the application. Cask Tracker adds data governance to the system so that data assets are formally managed throughout the application. You can track data lineage, collect relevant metrics, and audit the data trail throughout the process.

Here is an illustration of how data flows through the above pipeline:

CDAP tracker
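To make the notion of lineage concrete, here is a small Python sketch. It does not use Cask Tracker or any CDAP API. The dataset names and the `upstream` helper are hypothetical, loosely modeled on the stages of the Twitter pipeline described earlier. Each dataset records what it was derived from, and a traversal recovers everything upstream of a given dataset.

```python
from typing import Dict, List, Set

# Hypothetical lineage records: each dataset maps to the datasets it was derived from.
lineage: Dict[str, List[str]] = {
    "raw_tweets": [],
    "filtered_tweets": ["raw_tweets"],
    "projected_tweets": ["filtered_tweets"],
    "grouped_tweets": ["projected_tweets"],
    "hbase_table": ["grouped_tweets"],
}


def upstream(dataset: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every dataset the given one depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


print(sorted(upstream("grouped_tweets", lineage)))
# prints ['filtered_tweets', 'projected_tweets', 'raw_tweets']
```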

Installing the CDAP published application

For step-by-step instructions on installing this and other available ISV applications, please read Install third-party Hadoop applications.

Prerequisites

To install this app, whether on a new HDInsight cluster or on an existing one, the cluster must have the following configuration:

  • Cluster tier: Standard
  • Cluster type: HBase
  • Cluster version: 3.4, 3.5
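The prerequisites above can be expressed as a simple check. The following Python sketch is only an illustration: `ClusterConfig` and `unmet_prerequisites` are hypothetical names, and the code does not call any Azure API. It assumes you already know the tier, type, and version of your cluster.

```python
from dataclasses import dataclass
from typing import List

# Requirements copied from the prerequisites list above.
REQUIRED_TIER = "Standard"
REQUIRED_TYPE = "HBase"
SUPPORTED_VERSIONS = {"3.4", "3.5"}


@dataclass
class ClusterConfig:
    """Hypothetical container for the three settings that matter here."""
    tier: str
    cluster_type: str
    version: str


def unmet_prerequisites(cfg: ClusterConfig) -> List[str]:
    """Return a list of human-readable problems; an empty list means the cluster qualifies."""
    problems = []
    if cfg.tier != REQUIRED_TIER:
        problems.append(f"tier must be {REQUIRED_TIER}, got {cfg.tier}")
    if cfg.cluster_type != REQUIRED_TYPE:
        problems.append(f"type must be {REQUIRED_TYPE}, got {cfg.cluster_type}")
    if cfg.version not in SUPPORTED_VERSIONS:
        supported = ", ".join(sorted(SUPPORTED_VERSIONS))
        problems.append(f"version must be one of {supported}, got {cfg.version}")
    return problems


print(unmet_prerequisites(ClusterConfig("Standard", "HBase", "3.5")))
print(unmet_prerequisites(ClusterConfig("Standard", "Spark", "3.3")))
```

The first call prints an empty list; the second reports both the cluster type and the version as unmet.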

Launching CDAP for the first time

After installation, you can launch CDAP from your cluster in the Azure portal by going to the Settings blade, then clicking Applications under the General category. The Installed Apps blade lists all the installed applications.

Installed CDAP app

When you select CDAP, you'll see a link to the web page, HTTP Endpoint, as well as the SSH endpoint path. Select the WEBPAGE link.

When prompted, enter your cluster admin credentials.

Authentication

After signing in, you will be presented with the Cask CDAP GUI home page.

Cask GUI home page

To get an idea of using the CDAP interface, click the Cask Market menu link on top of the page.

Cask Market link

Select the Access Log Sample from the list.

Access Log Sample

Click Load to confirm.

Click Load

A sample view of the included data will be displayed. Click Next.

Access Log Sample - View Data

Select Stream as the Destination Type, enter a Destination Name, then click Finish.

Access Log Sample - Select Destination

Once the datapack has been successfully loaded, click View Stream Details.

Datapack successfully uploaded

On the Access Log details page, click Enable within the Usage tab to enable metadata for the namespace.

Access Log Sample - Loaded - enable metadata

Once metadata has been enabled, you will see a graph displaying audit message information.

Access Log Sample - Metadata enabled

To explore the log data, click the Explore icon on top of the page.

Access Log Sample - Explore

You will see a sample SQL query. Modify it if desired, then click Execute.

Access Log Sample - Explore dataset with a query

After the query has finished, click the View icon under the Actions column.

Access Log Sample - View completed query

You will now see the query results.

Access Log Sample - Query results
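If you want to experiment with this kind of query outside the CDAP UI, the following Python sketch runs an aggregate SQL query over a few made-up access-log rows. Everything in it is an assumption for illustration: the `access_log` table, its columns, and the sample rows are hypothetical and do not reflect the schema of the Access Log Sample, and an in-memory SQLite database stands in for the CDAP Explore service.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the Explore query engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (ip TEXT, status INTEGER, path TEXT)")

# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?, ?)",
    [
        ("10.0.0.1", 200, "/index.html"),
        ("10.0.0.2", 200, "/about.html"),
        ("10.0.0.1", 404, "/missing.html"),
        ("10.0.0.3", 200, "/index.html"),
    ],
)

# Count requests per HTTP status code, most frequent first.
query = """
    SELECT status, COUNT(*) AS hits
    FROM access_log
    GROUP BY status
    ORDER BY hits DESC, status
"""
rows = conn.execute(query).fetchall()
for status, hits in rows:
    print(status, hits)

conn.close()
```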

Next steps