
v0.4

Pre-release

@jamesbak released this 30 Jul 18:51

This release adds support for a new type of workload, dramatically improving read performance. Importing and expanding data accounts, as well as SAS URLs, are now also supported.

Read Replicas

For compute-intensive workloads that must concurrently spin up a large number of compute nodes, each of which reads a relatively small reference dataset and then computes an outcome, the storage challenge is to avoid having the reference dataset readers throttled at startup. For a single Azure Storage account, the throughput limit of 30/20 Gbps (US) or 15/10 Gbps (elsewhere; see https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/) can become a constraining factor, even when reading a relatively small dataset (~10 GB), once you have more than 1,000 readers. The Azure Batch service (http://azure.microsoft.com/en-us/services/batch/) identified this as a recurring pattern in their customers' workloads.

To address this constraint, a feature has been added to Dash whereby an identified blob is asynchronously replicated to all available data accounts. Subsequent read operations randomly select one of the replicas and redirect the client to it, distributing read load across all of the available data accounts.
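
As a rough illustration of the read path described above, the following sketch picks a replica at random and builds the redirect URL returned to the client. The function, account, container and blob names are illustrative only, not Dash's actual implementation:

```python
import random

# Minimal sketch of the random replica selection described above.
# All names here are illustrative, not Dash's actual code.
def choose_read_url(replica_accounts, container, blob_name):
    """Pick one replica at random so read load spreads across data accounts."""
    account = random.choice(replica_accounts)
    # Dash redirects the client (HTTP 302) to the chosen data account
    return "https://{0}.blob.core.windows.net/{1}/{2}".format(
        account, container, blob_name)

# Example: a blob replicated to three data accounts
print(choose_read_url(["dashdata0", "dashdata1", "dashdata2"],
                      "refdata", "model/weights.bin"))
```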

Not all blobs are replicated. Blobs may be identified for replication either by attaching special metadata to the blob (see the ReplicationMetadataName and ReplicationMetadataValue configuration options) or by configuring a regular expression (see the ReplicationPathPattern configuration option) that is matched against the path name of the blob.
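
For example, the relevant service configuration settings might look like the following. The setting names are the configuration options listed above; the values shown are placeholders only:

```xml
<!-- Illustrative .cscfg fragment; all values are placeholders -->
<Setting name="ReplicationMetadataName" value="dash_replicate" />
<Setting name="ReplicationMetadataValue" value="true" />
<!-- Replicate any blob whose path matches this regular expression -->
<Setting name="ReplicationPathPattern" value="^reference/.*" />
```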

SAS URL Support

Support has been added to Dash so that Shared Access Signature (SAS) keys may be specified as query parameters on requests as a form of authentication. The SAS feature in Dash is fully compatible with the same feature in Azure Storage, including client library support. SAS URLs are fully described here: https://msdn.microsoft.com/en-us/library/azure/ee395415.aspx.
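
As a sketch, a client appends the SAS token's query parameters to the blob URL exactly as it would against Azure Storage. The endpoint and token below are placeholders; a real token is issued by the account owner and carries the standard parameters (sv, sr, sp, se, sig, ...):

```python
import requests  # third-party HTTP library: pip install requests

# Placeholder Dash endpoint and SAS token
blob_url = "https://mydash.cloudapp.net/refdata/model/weights.bin"
sas_token = "sv=2014-02-14&sr=b&sp=r&se=2015-12-31T00%3A00%3A00Z&sig=<signature>"

# Dash answers with a redirect to the data account holding the blob;
# requests follows the redirect automatically.
response = requests.get("{0}?{1}".format(blob_url, sas_token))
response.raise_for_status()
data = response.content
```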

Importing New Data Accounts

Additional data accounts may now be added to an existing Dash deployment. The data accounts to be added may be empty, in which case additional capacity is added to the virtual account. Data accounts with existing data blobs may also be imported. In this case, the blobs contained in the account are listed and added to the namespace. After the namespace has been updated, the existing blobs will be accessible via the Dash endpoint in exactly the same way as any other blob.

At this stage we have not added support for detaching a data account from Dash. Although the implementation of this feature would be trivial, we have not yet received any requests for it and so have deferred it.

One point to note about existing blobs imported from a new data account: they will NOT be automatically replicated, even if their metadata or blob name matches the replication configuration. To force an imported blob to be replicated, simply 'touch' the blob by updating a property or adding benign metadata; once this write operation is processed, the blob will be replicated.
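
One way to perform such a 'touch' is the standard Set Blob Metadata operation of the Blob service REST API (comp=metadata). The endpoint and SAS token below are placeholders:

```python
import requests

# Placeholder endpoint and write-capable SAS token
sas_token = "sv=2014-02-14&sr=b&sp=rw&se=2015-12-31T00%3A00%3A00Z&sig=<signature>"
url = "https://mydash.cloudapp.net/refdata/imported.bin?comp=metadata&" + sas_token

headers = {
    "x-ms-version": "2014-02-14",
    "x-ms-meta-touched": "1",  # benign metadata used only to trigger the write
}
# Once Dash processes this write, it queues the blob for replication
requests.put(url, headers=headers).raise_for_status()
```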

To import a data account into an existing Dash deployment, update the next available ScaleoutStorage configuration entry with the connection string of the account, add the account name to the ImportAccounts configuration entry (a comma-separated list) and restart the server. At server startup, the account will be imported. Protections ensure that the same account is imported only once.
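
For instance, if ScaleoutStorage0 through ScaleoutStorage3 are already in use, the update might look like the following. The slot number, account name and key are placeholders:

```xml
<!-- Illustrative .cscfg fragment; account name and key are placeholders -->
<Setting name="ScaleoutStorage4"
         value="DefaultEndpointsProtocol=https;AccountName=dashimport;AccountKey=<key>" />
<Setting name="ImportAccounts" value="dashimport" />
```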

In the future, the mechanism to import accounts will be included in our Management API feature.

Azure Virtual Network Support

In conjunction with our friends on the HDInsight team (http://azure.microsoft.com/en-us/services/hdinsight/), we identified a networking bottleneck when a large HDInsight cluster (> 128 data nodes) performs I/O through Dash. Because all data nodes in an HDInsight cluster have only private IP addresses, network traffic must flow through a Source Network Address Translation (SNAT) device so that Dash (or any other destination host) knows where to send its responses. Given the extreme volume of requests flowing from such a large cluster, we found that the available SNAT ports were being exhausted on the cluster and traffic was being throttled.

The solution to this limit is to deploy both the HDInsight cluster and Dash into the same Azure Virtual Network (VNet) and configure an Internal Load Balancer (ILB) for Dash so that it communicates directly with the HDInsight data nodes.

We have updated our cloud deployment configurations to support this deployment topology.
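
As an illustration, the network section of the classic cloud service configuration for this topology would resemble the following sketch. The VNet, subnet, role and load balancer names are all placeholders:

```xml
<!-- Illustrative .cscfg fragment; all names are placeholders -->
<NetworkConfiguration>
  <VirtualNetworkSite name="hdinsight-vnet" />
  <AddressAssignments>
    <InstanceAddress roleName="DashServer">
      <Subnets>
        <Subnet name="dash-subnet" />
      </Subnets>
    </InstanceAddress>
  </AddressAssignments>
  <LoadBalancers>
    <LoadBalancer name="dash-ilb">
      <!-- A private frontend IP keeps all traffic inside the VNet -->
      <FrontendIPConfiguration type="private" subnet="dash-subnet" />
    </LoadBalancer>
  </LoadBalancers>
</NetworkConfiguration>
```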

Other Features

  • The base VM SKU for Dash is now D3. Research across the available SKUs indicated that D3 provides the best combination of network and memory resources for the price.

Bug Fixes

  • Fixed an issue with handling specially encoded blob names.
  • Corrected an error in the List Blobs handler when formatting the name of a snapshot.
  • Fixed a bug in the Copy Blob handler so that page blobs are copied correctly.
  • Fixed an incorrect response from the Get Service Properties handler. This is required so that other services can identify the endpoint as supporting the Azure Blob Storage protocol.

Obtaining Dash

Pre-built binaries for this release are published here: https://www.dash-update.net/DashServer/v0.4 - this link points to a configuration manifest file, which references the artifacts required for each supported configuration (HTTP, HTTPS, ILB).

Download the files listed under the desired configuration - there are two files per configuration: the binary package (.cspkg) and the configuration file (.cscfg). Update the configuration file with appropriate values and then deploy both files to Azure as a Cloud Service.
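
For example, the deployment can be performed with the classic (Service Management) Azure PowerShell cmdlets; the service and file names below are placeholders:

```powershell
# Deploy the downloaded package and configuration as a classic Cloud Service
New-AzureDeployment -ServiceName "mydash" -Slot Production `
    -Package ".\DashServer.HTTP.cspkg" -Configuration ".\DashServer.HTTP.cscfg"
```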

Future improvements will enable updating of an existing Dash deployment with new versions as well as improved initial deployment using Azure Resource Manager (ARM) templates.

Modified Azure Storage client SDKs are available at the following locations - note that it is not necessary to use our modified versions (the official versions work), but certain operations are more performant using the modified libraries:

Alternatively, clone the code repository, update the configuration file and deploy directly from Visual Studio or using your own cloud deployment mechanism.

See readme.md for details on how to build and deploy DASH.