---
output:
html_document:
keep_md: yes
---
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(echo=TRUE, eval=TRUE, tidy=FALSE, comment="", cache=FALSE, error=FALSE)
```
# dplyr.spark
This package implements a [`spark`](http://spark.apache.org/) backend for the [`dplyr`](http://github.com/hadley/dplyr) package, providing an intuitive DSL to manipulate large datasets on a powerful big data platform. It is a simple package: simple to learn if you have any familiarity with `dplyr`, or even just R and SQL, and simple to deploy, requiring only a few packages to install on a single machine, as long as your Spark installation comes with JDBC support -- or you can build it in, following the instructions below.
The current state of the project is:
- most `dplyr` features supported
- adds some `spark`-specific goodies, like *caching* tables
- can go successfully through the `dplyr` tutorials like any other database backend^[with the exception of one bug; to avoid it you need to run Spark from trunk or wait for version 1.5, see [SPARK-9221](https://issues.apache.org/jira/browse/SPARK-9921)]
- not yet endowed with a thorough test suite; nonetheless we expect it to inherit much of its correctness, scalability and robustness from its main dependencies, `dplyr` and `spark`
- not recommended for production use yet
## Installation
You need to [download Spark](https://spark.apache.org/downloads.html) and [build it](https://spark.apache.org/docs/latest/building-spark.html) as follows:
```
cd <spark root>
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests -Phive -Phive-thriftserver clean package
```
It may work with other Hadoop versions, but Hive and Hive Thrift server support (the `-Phive -Phive-thriftserver` options) is required. The package can start the Thrift server itself, but it can also connect to an already running one.
`dplyr.spark` has a few dependencies: get them with
```
install.packages(c("RJDBC", "dplyr", "DBI", "devtools"))
devtools::install_github("hadley/purrr")
```
Indirectly, `RJDBC` needs `rJava`. Make sure that `rJava` works by running:
```{r, eval=FALSE}
library(rJava)
.jinit()
```
This is only a test; in general you don't need to run it before loading `dplyr.spark`.
----------------
#### Mac Digression
On the Mac, `rJava` required two different versions of Java to be installed, [for real](http://andrewgoldstone.com/blog/2015/02/03/rjava/), and in particular the following shell variable to be set:
```
DYLD_FALLBACK_LIBRARY_PATH=/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre/lib/server/
```
The specific path may be different, particularly the version numbers. To start RStudio (optional; you can use a different GUI or none at all), which doesn't read shell environment variables, you can enter the following command:
```
DYLD_FALLBACK_LIBRARY_PATH=/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre/lib/server/ open -a rstudio
```
----------------
The `HADOOP_JAR` environment variable needs to be set to the Spark assembly JAR produced by the build above, something like `"<spark home>/assembly/target/scala-2.10/spark-assembly-1.4.1-SNAPSHOT-hadoop2.4.0.jar"`.
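For illustration, one way to set it from within R is a `Sys.setenv()` call like the sketch below; the JAR path is just the example above and will differ depending on your build (you can equally export the variable in your shell profile).
```{r, eval=FALSE}
# Illustrative only: point HADOOP_JAR at the assembly JAR produced by your own build.
Sys.setenv(
  HADOOP_JAR =
    "<spark home>/assembly/target/scala-2.10/spark-assembly-1.4.1-SNAPSHOT-hadoop2.4.0.jar")
```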
To start the Thrift server from R, which happens by default when creating a `src_SparkSQL` object, you need one more variable set, `SPARK_HOME`, which, as the name suggests, points to the root of the Spark installation. If you are connecting to an already running server, you just need host and port information. Those can be stored in environment variables as well; see the help documentation.
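Putting it together, a minimal session might look like the following sketch (not run): `<spark root>` is a placeholder, and it assumes calling `src_SparkSQL()` with its defaults, as described above.
```{r, eval=FALSE}
# A sketch, not run: <spark root> is a placeholder; HADOOP_JAR is set as shown earlier.
Sys.setenv(SPARK_HOME = "<spark root>")
library(dplyr.spark)
my_db = src_SparkSQL() # by default this starts the Thrift server
```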
```{r, echo=FALSE}
library(httr)
version = content(GET("https://api.github.com/repos/RevolutionAnalytics/dplyr-spark/releases"))[[1]]$tag
```
Then, to install from source:
```{r, echo=FALSE, comment=""}
cat(
paste0('devtools::install_github("RevolutionAnalytics/dplyr-spark@', version, '", subdir = "pkg")
'))
```
Linux package:
```{r, echo=FALSE, comment=""}
cat(
paste0('devtools::install_url(
"https://github.com/RevolutionAnalytics/dplyr-spark/releases/download/', version, '/dplyr.spark_', version, '.tar.gz")
'))
```
<!--
A windows package will be added in the near future.
Windows package:
```{r, echo=FALSE, comment=""}
cat(
paste0('install_url(
"https://github.com/RevolutionAnalytics/dplyr-spark/releases/download/', version, '/dplyr.spark_', version, '.zip")
'))
```
-->
```{r, echo=FALSE, results='asis'}
cat("The current version is", version, ".")
```
You can find a number of examples, derived from @hadley's own tutorials for `dplyr`, under the [tests](https://github.com/RevolutionAnalytics/dplyr-spark/tree/master/pkg/tests) directory, in the files `databases.R`, `window-functions.R` and `two-table.R`.
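To give a flavor of what those examples cover, a typical `dplyr` pipeline against a Spark source looks roughly like the sketch below; `my_db` is a source as created above, while the table name `"flights"` and its columns are illustrative placeholders, not data shipped with the package.
```{r, eval=FALSE}
# Illustrative only: "flights" and its columns are placeholders, not package data.
library(dplyr)
flights = tbl(my_db, "flights")
flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay)) %>%
  arrange(desc(mean_delay))
```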
For new releases, subscribe to `dplyr.spark`'s release notes [feed](https://github.com/RevolutionAnalytics/dplyr.spark/releases.atom) or join the [RHadoop Google group](https://groups.google.com/forum/#!forum/rhadoop). The latter is also the best place to get support, together with the [issue tracker](http://github.com/RevolutionAnalytics/dplyr.spark/issues).