---
title: Compute context options for R Server on HDInsight - Azure | Microsoft Docs
description: Learn about the different compute context options available to users with R Server on HDInsight
services: HDInsight
documentationcenter: ''
author: jeffstokes72
manager: jhubbard
editor: cgronlun
ms.assetid: 0deb0b1c-4094-459b-94fc-ec9b774c1f8a
ms.service: HDInsight
ms.custom: hdinsightactive
ms.devlang: R
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: data-services
ms.date: 02/28/2017
ms.author: jeffstok
---
Microsoft R Server on Azure HDInsight provides the latest capabilities for R-based analytics. It can use data that's stored in HDFS in a container in your Azure Blob storage account, in an Azure Data Lake Store, or on the local Linux file system. Because R Server is built on open-source R, the R-based applications you build can leverage any of the 8000+ open-source R packages. They can also use the routines in ScaleR, Microsoft's big data analytics package that's included with R Server.
The edge node of a cluster provides a convenient place to connect to the cluster and run your R scripts. With an edge node, you have the option of running ScaleR’s parallelized distributed functions across the cores of the edge node server. You also have the option to run them across the nodes of the cluster by using ScaleR’s Hadoop Map Reduce or Spark compute contexts.
In general, an R script that's run in R Server on the edge node runs within the R interpreter on that node. The exceptions are those steps that call a ScaleR function. The ScaleR calls run in a compute environment that's determined by how you set the ScaleR compute context. When you run your R script from an edge node, the possible values of the compute context are local sequential ('local'), local parallel ('localpar'), Map Reduce, and Spark.
The 'local' and 'localpar' options differ only in how rxExec calls are executed. Both execute other rx-function calls in parallel across all available cores unless you specify otherwise through the ScaleR numCoresToUse option, for example rxOptions(numCoresToUse=6). The following table summarizes the compute context options:
Compute context | How to set | Execution context |
---|---|---|
Local sequential | rxSetComputeContext('local') | Parallelized execution across the cores of the edge node server, except for rxExec calls, which are executed serially |
Local parallel | rxSetComputeContext('localpar') | Parallelized execution across the cores of the edge node server |
Spark | RxSpark() | Parallelized distributed execution via Spark across the nodes of the HDI cluster |
Map Reduce | RxHadoopMR() | Parallelized distributed execution via Map Reduce across the nodes of the HDI cluster |
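The following is a minimal sketch of switching among these compute contexts from an R session on the edge node; the numCoresToUse value is only an example.

```R
# Switch among ScaleR compute contexts from an R session on the edge node.
library(RevoScaleR)   # loaded automatically in R Server sessions

# Local sequential: rx functions run in parallel across the edge node's cores,
# but rxExec calls run one at a time.
rxSetComputeContext("local")

# Local parallel: rxExec calls are also parallelized across the cores.
rxSetComputeContext("localpar")

# Optionally limit how many cores ScaleR functions may use.
rxOptions(numCoresToUse = 6)

# Spark: parallelized, distributed execution across the nodes of the cluster.
rxSetComputeContext(RxSpark())

# Map Reduce: parallelized, distributed execution via Hadoop Map Reduce.
rxSetComputeContext(RxHadoopMR())

# Check which compute context is currently active.
rxGetComputeContext()
```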
If you want parallelized execution for performance, you have three options: local parallel ('localpar'), Spark, or Map Reduce. Which option you choose depends on the nature of your analytics work, and on the size and location of your data.
Currently, there is no formula that tells you which compute context to use. There are, however, some guiding principles that can help you make the right choice, or at least help you narrow down your choices before you run a benchmark. These guiding principles include:
- The local Linux file system is faster than HDFS.
- Repeated analyses are faster if the data is local, and if it's in XDF.
- It's preferable to stream small amounts of data from a text data source; if the amount of data is larger, convert it to XDF prior to analysis (see the sketch after this list).
- The overhead of copying or streaming the data to the edge node for analysis becomes unmanageable for very large amounts of data.
- Spark is faster than Map Reduce for analysis in Hadoop because it keeps data in memory across operations by using Spark RDDs.
- The Spark compute context uses the Spark DAG to distribute work across the nodes of the cluster, and it provides a number of options for persisting intermediate results. Because spawning new tasks is an expensive process, reusing persisted results often yields performance increases over Map Reduce for many types of workloads.
- Spark runs under YARN for resource management, providing greater flexibility on selecting the number of nodes on which to run tasks.
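To illustrate the principles about local data and XDF, the following sketch imports a text file that has been copied to the edge node into XDF before repeated analysis. The file paths and the ArrDelay column are hypothetical.

```R
# A sketch of importing a local CSV file to XDF for faster repeated analysis.
# The file paths and column name are hypothetical.
library(RevoScaleR)

rxSetComputeContext("localpar")

csvFile <- "/tmp/airline.csv"   # text data copied to the edge node
xdfFile <- "/tmp/airline.xdf"   # XDF output on the local Linux file system

# Import once; the XDF file is a compact binary format optimized for ScaleR.
airlineXdf <- rxImport(inData = csvFile, outFile = xdfFile, overwrite = TRUE)

# Repeated analyses against the XDF file avoid re-parsing the CSV each time.
rxGetInfo(airlineXdf, getVarInfo = TRUE)
rxSummary(~ArrDelay, data = airlineXdf)
```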
Given these principles, some general rules of thumb for selecting a compute context are:
- If the amount of data to analyze is small and does not require repeated analysis, then stream it directly into the analysis routine and use 'local' or 'localpar'.
- If the amount of data to analyze is small or medium-sized and requires repeated analysis, then copy it to the local file system, import it to XDF, and analyze it via 'local' or 'localpar'.
- If the amount of data to analyze is large, then import it to a Spark DataFrame by using RxHiveData or RxParquetData, or to XDF in HDFS (unless storage is an issue), and analyze it by using the Spark compute context, as shown in the sketch after this list.
- SparkR provides access to native Spark capabilities, including a growing number of predictive analytics algorithms available in Spark. Use it only if you encounter an insurmountable problem with the Spark compute context, because SparkR is generally slower.
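As a sketch of the large-data case, the following shows how you might reference data that already lives on the cluster and run a distributed analysis in the Spark compute context. The Hive table, HDFS paths, and column names are hypothetical.

```R
# A sketch of analyzing a large data set in the Spark compute context.
# The Hive table, HDFS paths, and column names are hypothetical.
library(RevoScaleR)

rxSetComputeContext(RxSpark())

# Option 1: reference data that is already stored as a Hive table or Parquet file.
hiveData    <- RxHiveData(table = "flightdata")
parquetData <- RxParquetData(file = "/share/flightdata.parquet")

# Option 2: import text data in HDFS to XDF in HDFS for repeated analysis.
hdfs    <- RxHdfsFileSystem()
csvHdfs <- RxTextData("/share/flightdata.csv", fileSystem = hdfs)
xdfHdfs <- RxXdfData("/share/flightdataXdf", fileSystem = hdfs)
rxImport(inData = csvHdfs, outFile = xdfHdfs, overwrite = TRUE)

# Run a distributed analysis, for example a linear model of arrival delay.
rxLinMod(ArrDelay ~ DayOfWeek, data = xdfHdfs)
```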
For more information and examples of ScaleR compute contexts, see the inline help in R on the rxSetComputeContext method, for example:
```R
> ?rxSetComputeContext
```
You can also refer to the “ScaleR Distributed Computing Guide” that's available from the R Server MSDN library.
In this article, you learned about the compute context options that are available when you run R Server on the edge node of an HDInsight cluster, and some guidelines for choosing among them. Now you can read the following articles to discover other ways of working with R Server on HDInsight: