Data At Scale Hub (DASH) for Azure Storage

Overview

DASH provides scalability for very large solutions built on Azure Storage where the scale of the solution exceeds the limits of a single storage account (capacity, throughput, or IOPS). While aggregating the limits of multiple storage accounts to provide greater scalability is a relatively common pattern, the ability of DASH to do so in a completely transparent manner, while maintaining maximal network and compute efficiency, is a considerable improvement over existing practices.

An architectural overview of Dash is provided here: [Architectural Overview](Architectural Overview).

Supported Scenarios

The following scalability scenarios are supported for Azure Storage clients:

  1. Total storage capacity greater than 500TB. Applications that require very large storage capacity but do not want to build in the complexity of mapping where the data actually resides, e.g. media, backup, and genomics.
  2. Distributed analytics workloads. Distributed compute clusters (e.g. HDInsight/Hadoop, Mesos, etc.) can exceed the throughput limits of a single storage account when all nodes in the cluster converge their I/O on that account. This applies to even modestly sized clusters (>= 15 nodes). DASH can aggregate the throughput of multiple storage accounts (> 300Gbps) without introducing significant application complexity to the workload.
  3. High Performance Computing (HPC) clusters requiring high read throughput for reference datasets. HPC clusters typically consist of a very large number of VMs that need to converge on a relatively small (< 100GB) reference dataset used in the workload calculation. DASH can create many Read Replicas that effectively distribute the read load over multiple storage accounts.
  4. Workloads that require very high transactional throughput or IOPS. Very large key-value stores (e.g. HBase running on HDInsight) can require more than the permitted 20,000 operations per second for a single storage account. These workloads can use DASH to aggregate the transaction rate across multiple storage accounts.
  5. (Coming Soon) The 200GB limit for a single blob is an issue for many workloads. DASH will provide a mechanism that allows more than the standard 50,000 blocks to be written to a single blob, resulting in multi-TB blobs.
  6. (Coming Soon) Geo-distribution. Many applications utilize Azure to provide a geo-distributed footprint. While having a local point-of-presence for the web frontend delivers many important improvements, data locality is also an issue. DASH will provide a mechanism whereby blobs may be replicated over a flexible topology of storage accounts, yielding the desired data locality. DASH will also provide a policy-based mechanism whereby data may be assigned to exist ONLY in a given region, regardless of which frontend performed the write. This capability provides the data sovereignty or 'safe harbor' qualities demanded by certain jurisdictions.

Deployment

There are three ways to deploy DASH, each suited to a different audience:

  1. Deploy a pre-built binary directly to Azure. See [Deploying Pre-Built Binaries](Deploying Pre-Built Binaries).
  2. Use Visual Studio to build Dash and then deploy from within the IDE. See [Deploying from Visual Studio](Deploying from Visual Studio).
  3. Incorporate building and deploying Dash into your normal development lifecycle. See [Incorporate Dash Deployment into ALM](Incorporate Dash Deployment into ALM).

Management

Once a DASH service has been deployed, it may be managed using any of the following approaches:

  1. Use the built-in Management Portal that provides a web application to manage and monitor the service.
  2. Write your own application or tooling and call the Management API REST interface.
  3. Use the Azure Management Portal to directly manipulate the configuration for the Dash service.

How Do I Use It From My Application?

From the application's perspective, DASH looks exactly like a standard Azure Blob Storage endpoint. The same REST API is supported, so applications that use the REST API directly or any of the storage client libraries will work unmodified.

The only thing that needs to change is the connection string:

  • The standard connection string format supports the specification of 'custom domain names', as described in the Azure Storage documentation on configuring connection strings
  • Specify the DNS name for your DASH endpoint in the BlobEndpoint attribute (e.g. BlobEndpoint=http://mydashservice.cloudapp.net)
  • Include the account name/key OR shared access signature as normal
  • A complete example of a DASH connection string is:

AccountName=dashaccount;AccountKey=myBase64EncodedAccountKey;BlobEndpoint=http://mydashservice.cloudapp.net
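
With this connection string in place, no other application changes are required. As a minimal sketch, the following C# example (using the classic WindowsAzure.Storage client library) uploads a blob through a DASH endpoint; the account name, key, endpoint, container and blob names are all placeholders:

```csharp
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class DashConnectionSample
{
    static void Main()
    {
        // Placeholder values - substitute your own DASH virtual account name,
        // account key and endpoint DNS name.
        const string connectionString =
            "AccountName=dashaccount;" +
            "AccountKey=myBase64EncodedAccountKey;" +
            "BlobEndpoint=http://mydashservice.cloudapp.net";

        // The client is created exactly as it would be for a regular storage
        // account - no DASH-specific code is required.
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobClient client = account.CreateCloudBlobClient();

        CloudBlobContainer container = client.GetContainerReference("samplecontainer");
        container.CreateIfNotExists();

        CloudBlockBlob blob = container.GetBlockBlobReference("hello.txt");
        blob.UploadText("Hello from DASH");
    }
}
```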

Library Support for the Automatic Following of Redirections

A number of the standard Azure Storage libraries do not automatically follow HTTP redirections. DASH still supports clients using these libraries, but it does so in 'Passthrough Proxy' mode, which is considerably less efficient than redirection mode because all data flows through the DASH endpoint rather than directly between the client and the backing storage account.

To address this issue, we provide modified versions of the standard libraries for the following languages that automatically follow redirections (support for additional languages will be added based on demand):

  • .NET - The .NET library https://github.com/Azure/azure-storage-net/ does support automatic following of HTTP redirections. We have modified this library to support the Expect: 100-Continue header, which means that the payload is NOT sent to the DASH server for PUT requests.
  • Java - Various JREs have inconsistent support for automatic following of HTTP redirects, so the Java library for Azure Storage https://github.com/Azure/azure-storage-java explicitly prevents it. Additionally, support for the Expect: 100-Continue request header was only added in Java 8. We have modified the standard library to support both of these features and to work in the most efficient manner with DASH.
  • Python (Coming Soon) - The Python library for Azure Storage https://github.com/Azure/azure-sdk-for-python exhibits similar behavior to the Java library and will be modified to work with DASH in full redirection mode.

These modified libraries are available as pre-built binaries from our package downloader with the base URI https://www.dash-update.net/Client/Latest
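
To illustrate the two HTTP behaviors that these modified libraries take care of, the sketch below uses only the standard System.Net types; it illustrates the underlying mechanics rather than the API of the modified client libraries, and the URL is a placeholder:

```csharp
using System;
using System.Net;

class RedirectionIllustration
{
    static void Main()
    {
        // Expect: 100-Continue - for requests with a body (e.g. PUT), the client waits
        // for the server's interim response (or a redirect) before transmitting the
        // payload, so the blob data is not needlessly uploaded to the DASH endpoint.
        ServicePointManager.Expect100Continue = true;

        // Automatic redirect following - the HTTP stack transparently re-issues the
        // request against the Location header returned by DASH.
        var request = (HttpWebRequest)WebRequest.Create(
            "http://mydashservice.cloudapp.net/samplecontainer/hello.txt"); // placeholder URL
        request.Method = "GET";
        request.AllowAutoRedirect = true;

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            Console.WriteLine("Final status: {0}", response.StatusCode);
        }
    }
}
```

The modified libraries handle both of these concerns internally, so application code should not need to manage them directly.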