---
title: Compute context options for R Server on HDInsight - Azure | Microsoft Docs
description: Learn about the different compute context options available to users with R Server on HDInsight
services: HDInsight
documentationcenter: ''
author: jeffstokes72
manager: jhubbard
editor: cgronlun
ms.assetid: 0deb0b1c-4094-459b-94fc-ec9b774c1f8a
ms.service: HDInsight
ms.custom: hdinsightactive
ms.devlang: R
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: data-services
ms.date: 02/28/2017
ms.author: jeffstok
---

# Compute context options for R Server on HDInsight

Microsoft R Server on Azure HDInsight provides the latest capabilities for R-based analytics. It uses data that's stored in HDFS in a container in your Azure Blob storage account, in an Azure Data Lake Store, or on the local Linux file system. Because R Server is built on open-source R, the R-based applications you build can leverage any of the 8,000+ open-source R packages. They can also leverage the routines in ScaleR, Microsoft's big data analytics package that's included with R Server.

The edge node of a cluster provides a convenient place to connect to the cluster and run your R scripts. With an edge node, you have the option of running ScaleR’s parallelized distributed functions across the cores of the edge node server. You also have the option to run them across the nodes of the cluster by using ScaleR’s Hadoop Map Reduce or Spark compute contexts.

## Compute contexts for an edge node

In general, an R script that's run in R Server on the edge node runs within the R interpreter on that node. The exceptions are the steps that call a ScaleR function. The ScaleR calls run in a compute environment that's determined by how you set the ScaleR compute context. When you run your R script from an edge node, the possible values of the compute context are local sequential (`'local'`), local parallel (`'localpar'`), Map Reduce, and Spark.

The `'local'` and `'localpar'` options differ only in how `rxExec` calls are executed. Both execute other rx-function calls in parallel across all available cores unless you specify otherwise through the ScaleR `numCoresToUse` option, for example `rxOptions(numCoresToUse=6)`. The following table summarizes the compute context options; a short example of setting them follows the table.

| Compute context | How to set | Execution context |
| --------------- | ---------- | ------------------ |
| Local sequential | `rxSetComputeContext('local')` | Parallelized execution across the cores of the edge node server, except for `rxExec` calls, which are executed serially |
| Local parallel | `rxSetComputeContext('localpar')` | Parallelized execution across the cores of the edge node server |
| Spark | `RxSpark()` | Parallelized distributed execution via Spark across the nodes of the HDInsight cluster |
| Map Reduce | `RxHadoopMR()` | Parallelized distributed execution via Map Reduce across the nodes of the HDInsight cluster |
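The following minimal sketch shows how you might switch among these compute contexts from an R session on the edge node. The `numCoresToUse` value is only an illustrative choice, not a recommendation:

```r
# A minimal sketch of switching ScaleR compute contexts from the edge node.
rxOptions(numCoresToUse = 6)         # optional: cap the cores used by rx functions

rxSetComputeContext("local")         # local sequential: rxExec calls run serially
rxSetComputeContext("localpar")      # local parallel: rxExec calls also run across edge-node cores

rxSetComputeContext(RxSpark())       # distributed execution via Spark across the cluster
rxSetComputeContext(RxHadoopMR())    # distributed execution via Map Reduce across the cluster
```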

Assuming that you want parallelized execution for performance, there are three options: local parallel, Spark, and Map Reduce. Which option you choose depends on the nature of your analytics work, and on the size and location of your data.

## Guidelines for deciding on a compute context

Currently, there is no formula that tells you which compute context to use. There are, however, some guiding principles that can help you make the right choice, or at least help you narrow down your choices before you run a benchmark. These guiding principles include:

1. The local Linux file system is faster than HDFS.
2. Repeated analyses are faster if the data is local and stored in XDF format.
3. It's preferable to stream small amounts of data from a text data source. If the amount of data is larger, convert it to XDF prior to analysis.
4. The overhead of copying or streaming the data to the edge node for analysis becomes unmanageable for very large amounts of data.
5. Spark is faster than Map Reduce for analysis in Hadoop because it keeps data in memory across operations by using Spark RDDs.
6. The Spark compute context uses the Spark DAG to distribute work across the nodes of the cluster, and provides a number of options for persisting those tasks. Because spawning new tasks is expensive, reusing persisted tasks yields performance gains over Map Reduce for many types of workloads.
7. Spark runs under YARN for resource management, providing greater flexibility in selecting the number of nodes on which to run tasks.

Given these principles, some general rules of thumb for selecting a compute context are:

### Local

* If the amount of data to analyze is small and does not require repeated analysis, stream it directly into the analysis routine and use `'local'` or `'localpar'`.
* If the amount of data to analyze is small or medium-sized and requires repeated analysis, copy it to the local file system, import it to XDF, and analyze it via `'local'` or `'localpar'`, as shown in the sketch after this list.
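A minimal sketch of that local workflow follows. The file paths and column names are hypothetical placeholders:

```r
# Hypothetical local workflow: import a CSV on the edge node to XDF, then analyze it.
rxSetComputeContext("localpar")               # parallel execution across edge-node cores

csvFile <- "/tmp/flights.csv"                 # data already copied to the local Linux file system
xdfFile <- "/tmp/flights.xdf"

rxImport(inData = RxTextData(csvFile), outFile = xdfFile, overwrite = TRUE)

# Repeated analyses reuse the XDF file instead of re-parsing the text source.
rxSummary(~ ArrDelay + DayOfWeek, data = RxXdfData(xdfFile))
```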

### Hadoop Spark

* If the amount of data to analyze is large, import it to a Spark DataFrame by using `RxHiveData` or `RxParquetData`, or to XDF in HDFS (unless storage is an issue), and analyze it via the Spark compute context, as shown in the sketch after this list.
* SparkR provides access to native Spark capabilities, including a growing number of predictive analytics algorithms available in Spark.
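A minimal sketch of the Spark workflow, assuming a hypothetical Hive table named `flights`:

```r
# Hypothetical Spark workflow: distribute a ScaleR analysis across the cluster.
sparkContext <- RxSpark()                     # defaults usually suffice on an HDInsight edge node
rxSetComputeContext(sparkContext)

flightsData <- RxHiveData(table = "flights")  # or RxParquetData(file = "/share/flights.parquet")
rxLinMod(ArrDelay ~ DayOfWeek, data = flightsData)   # runs distributed under Spark

rxSetComputeContext("local")                  # switch back to the edge node when finished
```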

### Hadoop Map Reduce

* Use the Map Reduce compute context only if you encounter an insurmountable problem with the Spark compute context, since Map Reduce is generally slower.

## Inline help on rxSetComputeContext

For more information and examples of ScaleR compute contexts, see the inline help in R for the `rxSetComputeContext` function, for example:

```r
> ?rxSetComputeContext
```

You can also refer to the “ScaleR Distributed Computing Guide” that's available from the R Server MSDN library.

## Next steps

In this article, you learned about the compute context options that are available when you run R scripts in R Server on HDInsight, and some guidelines for choosing among them. Now you can read the following articles to discover other ways of working with R Server on HDInsight: