Ronen Botzer edited this page Jan 5, 2020 · 3 revisions

Overview

The Aerospike Hadoop connector provides examples of programming against Hadoop's InputFormat and OutputFormat APIs. The integration allows you to pull data directly from Hadoop clusters in order to analyze or check any form of unstructured data.

This basic but powerful tool provides a building block for integrating Aerospike with a large number of Hadoop projects, such as Hive, Pig, and Flume.

Besides the connector code itself, the repository includes classic examples such as word count and sessionizing web data.
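
For readers unfamiliar with sessionization, the core idea can be sketched in a few lines of plain Java. This is a minimal illustration, not the code from the repository: the class name `SessionizeSketch`, the method `sessionize`, and the 30-minute inactivity gap are all assumptions made for the example. In a MapReduce job, logic like this would typically run in the reduce step, once per user, after events have been grouped by user ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of the connector.
class SessionizeSketch {

    // Split one user's event timestamps (in milliseconds) into sessions.
    // A new session starts whenever the gap since the previous event
    // is larger than maxGapMillis.
    static List<List<Long>> sessionize(List<Long> timestamps, long maxGapMillis) {
        // Work on a sorted copy so the caller's list is left untouched.
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);

        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sorted) {
            boolean gapExceeded = !current.isEmpty()
                    && ts - current.get(current.size() - 1) > maxGapMillis;
            if (gapExceeded) {
                sessions.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            sessions.add(current);
        }
        return sessions;
    }

    public static void main(String[] args) {
        long thirtyMinutes = 30L * 60L * 1000L; // assumed inactivity threshold
        List<Long> clicks = Arrays.asList(4_000_000L, 0L, 60_000L, 4_030_000L);
        // Two bursts of activity separated by more than 30 minutes.
        System.out.println(sessionize(clicks, thirtyMinutes).size()); // prints 2
    }
}
```

The threshold is a design choice: a shorter gap produces more, smaller sessions, while a longer gap merges visits that a human would likely consider separate.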

When you integrate Aerospike's key-value capabilities into your architecture, you get a Hadoop system that is capable not just of streaming analytics, but also of computations that require random access. For example, imagine analyzing a customer dataset by a variety of audience segments, skipping easily through a large dataset to operate on only the data you need.

In all integrations with Hadoop, the underlying concept is twofold: enrich the operational data in Aerospike, and provide that operational data to the Hadoop ecosystem for enterprise-wide analytics, which in turn enriches the analytics data set with updates from the operational data in Aerospike. In real-time applications, Aerospike is also used as a results store, appending incremental updates from machine learning algorithms running on enterprise-wide data and providing operational data for models feeding real-time web applications.

Community

The Aerospike Hadoop connector has been turned over to the community. If you wish to contribute code, go ahead and clone the repo, modify the code, and create a pull request. Active contributors can then ask to become maintainers for the repo. The wiki can similarly be modified by any code contributor who has been granted pull permissions.