Manipulating Spark data using both dplyr and SQL

Once you're done with the installation from this chapter's introduction, let's create a remote dplyr data source for the Spark cluster. To do this, use the spark_connect() function, as shown:

library(sparklyr)
sc <- spark_connect(master = "local")

This creates a local Spark cluster on your computer; you can see it in RStudio's Connections tab, alongside your R environment pane. To disconnect, call spark_disconnect(sc). Stay connected for now, and copy a couple of datasets from R packages into the cluster:

library(DAAG)
dt_sugar <- copy_to(sc, sugar, "SUGAR")
dt_stVincent <- copy_to(sc, stVincent, "STVINCENT")

The preceding code uploads the DAAG::sugar and DAAG::stVincent DataFrames into your connected Spark cluster. ...
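With the tables registered in the cluster, you can manipulate them with either dplyr verbs or plain SQL through the DBI interface. Here is a minimal sketch of both routes, assuming the sc connection and dt_sugar table created above, and assuming the weight and trt columns of DAAG::sugar:

library(dplyr)
library(DBI)

# dplyr verbs on a Spark table are translated to Spark SQL behind the scenes
# (assumes sugar's weight and trt columns, as in DAAG::sugar)
dt_sugar %>%
  group_by(trt) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE)) %>%
  collect()  # collect() pulls the (small) result back into the R session

# The same aggregation written as plain SQL against the registered "SUGAR" table
dbGetQuery(sc, "SELECT trt, AVG(weight) AS mean_weight FROM SUGAR GROUP BY trt")

Both produce the same aggregated result; note that collect() is what actually moves data from Spark into R, so reserve it for results small enough to fit in memory.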
