Difference between revisions of "Distributed Computing Working Group"
(add Hossein Falaki)
|Line 47:||Line 47:|
== Minutes ==
== Minutes ==
=== 10/13/2016 ===
=== 10/13/2016 ===
Revision as of 22:33, 10 November 2016
Goals and Purpose
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.
- Michael Lawrence (Genentech)
- Indrajit Roy (HP Enterprise)
- Joe Rickert (ISC liason, RStudio)
- Bernd Bischl (LMU)
- Matt Dowle (H2O)
- Mario Inchiosa (Microsoft)
- Michael Kane (Yale)
- Javier Luraschi (RStudio)
- Edward Ma (HP Enterprise)
- Luke Tierney (University of Iowa)
- Simon Urbanek (AT&T)
- Bryan Lewis (Paradigm4)
- Hossein Falaki (databricks)
- Adopt ddR as a prototype for a standard API for distributed computing in R
Clark Fitzgerald, a PhD student in the UC Davis Statistics department, worked on ddR and Spark integration.
- Wrote sparklite and rddlist as minimal proof-of-concept R packages to connect and store general data on Spark. slides
- Patched SparkR to allow user defined functions returning binary columns. This allows implementation of different data structures in SparkR.
- Updated design documents with suggested changes to DDR's internal design and object oriented model.
- Improved testing and ddR internals.
- Agree on a final standard API for distributed computing in R
- Implement at least one scalable backend based on an open-source technology like Spark, SQL, etc
- How can we address the needs of both the end user data scientists and the algorithm implementers?
- How should we share data between R and a system like Spark?
- Is there any way to unify SparkR and sparklyr?
- Could we use the abstractions of tensorflow to partially or fully integrate with platforms like Spark?
- Slides were presented by Hossein Faliki and Shivaram from Databricks and UC Berkeley:
- SparkR was a prototype was from AMPLab in 2014. Initially it had the RDD API and was similar to PySpark API
- In 2015 the merge with upstream Spark, the decision was made to integrate with the Dataframe API, and hide the RDD API
- In 2016 onwards more MLLib algos have been exposed and new APIs have been added, and a CRAN package will be released soon
- SparkR architecture runs R on the Spark driver that communicates with the JVM processes in the driver, which are sent to the worker JVM processes, and executed as scala/java on workers.
- You can read distributed data inside the JVM from different sources such as S3, HDFS, etc.
- In the driver there is a socket based connection between SparkR and the RBackend which is on the JVM. RBackend deserializes the R code and converts them into Java calls.
- collect() and createDataFrame() are used to move data between R and JVM across sockets. createDataFrame will convert your local R data into a JVM based distributed data frame.
- The API has IO, Caching, MLLib, and SQL commands
- Since Spark 2.0, we can run R processes inside the JVM worker processes. There is no need to keep long running R processes.
- There are 3 UDF functions (1) lapply, run function on different value of a list (2) dapply, run function on each partition of a data frame. You have to careful about how data is partitioned (3) gapply, performs a grouping on differnt column names and then runs the function on each group.
- New CRAN package. install.spark() to automatically download and install Spark. Automated CRAN checks with every commit to the code. Should be available with Spark 2.1.0
- With Spark 2.0, Spark has a its off head manager where they will use Arrow. Once this is tested it on the Python integration we can use it for R. Currently trying to get zero copy dataframe between python and Spark.
- Spark dataframes gain from plan optimizations. It is not SparkR specific. R UDFs are still
treated as black boxes by the optimizer
- Spark doesn't directly support matrixes. There is no immediate intent to do so either.
One can store an array or vector as a single column of a Spark dataframe.
Detailed minutes were not taken for this meeting
- Mario Inchiosa: Microsoft's perspective on distributed computing with R
- Microsoft R Server: abstractions and algorithms for distributed computation on top of open-source R
- Desired features of a distributed API like ddR:
- Supports PEMA (initialize, processData, updateResults, processResults)
- Fast runtime
- Supports algorithm writer and data scientist
- Comes with a comprehensive set of algorithms
- Easy deployment
- ddR is making good progress but does not yet meet those requirements
- Indrajit: ddR progress report and next steps
- Recap of Clark's internship
- Next step: implement some of Clark's design suggestions: https://github.com/vertica/ddR/wiki/Design
- Spark integration will be based on sparklyr
- Should we limit Spark interaction to the DataFrame API or directly interact with RDDs?
- Consensus: will likely need flexibility of RDDs to implement everything we need, e.g., arrays and lists
- Clark and Javier raised concerns about the scalability of sharing data between R and Spark
- Michael: Spark is a platform in its own right, so interoperability is important, should figure something out
- Bryan Lewis: Why not use tensor abstraction from tensorflow? Spark supports tensorflow and an R interface is already in the works.
- Michael raised the issue of additional funding from the R Consortium to continue Clark's work
- Joe Rickert suggested that the working group develop one or more white papers summarizing the findings of the working group for presentation to the Infrastructure Steering Committee.
- Consensus was in favor of this, and several pointed out that the progress so far has been worthwhile, despite not meeting the specific goals laid out in the proposal.
- Michael: do we want to invite some external speakers, one per meeting, from groups like databricks, tensorflow, etc?
- Consensus was in favor.
Detailed minutes were not taken for this meeting
- Clark Fitzgerald: internship report
- Developed two packages for low-level Spark integration: rddlist, sparklite
- Patched a bug in Spark
- ddR needs refactoring before Spark integration is feasible:
- dlist, dframe, and darray should be formal classes.
- Partitions of data should be represented by a distributed list abstraction, and most functions (e.g., dmapply) should be implemented on top of that list.
- Javier: sparklyr update
- Preparing for CRAN release
- Mario: what happened to sparkapi?
- Javier: sparkapi has been merged into sparklyr in order to avoid overhead of maintaining two packages. ddR can do everything it needs with sparklyr.
- Luke Tierney: Update on the low-level vector abstraction, which might support interfaces like ddR and sparklyr.
- Overall approach seems feasible, but still working out a few details.
- Will land in a branch soon.
- Bernd Bischl: update on the batchtools package
- Successor to BatchJobs based on in-memory database
Meeting was canceled due to lack of availability.
- Introduced Clark who is the intern funded by R Consortium. Clark is a graduate student from UC Davis. He will work on ddR integration with Spark and improving the core ddR API as well such as adding a distributed apply() for matrices, split function, etc.
- Bernd: Can I play around with ddR now? What backend should I use? How robust is the code?
- Clark: It's in good enough shape to be played around with. We will continue to improve it. Hopefully the spark integration will be done before the end of my internship in September.
- Q: Is anyone working on using ddR to make ML scale better.
- Indrajit: We have kmeans, glm, etc. already in CRAN.
- Michael Kane: We are working on glmnet and other packages related to algorithm development.
- Javier gave a demo of sparklyr and sparkapi.
- Motivation for the pacakage: The SparkR package overrides the dplyr interface. This is an issue for RStudio. SparkR is not a CRAN package which makes it difficult to add changes. dplyr is the most popular tool by RStudio and is broken on SparkR.
- Sparklyr provides a dplyr interface. It will also support ML like interfaces, such as consuming a ML model.
- Sparklyr does not currently support any distributed computing features. Instead we can recommend ddR as the distributed computing framework on top of sparkapi. We will put the code in CRAN in a couple of weeks.
- Simon: Can you talk more about the wrapper/low level API to work with Spark?
- Javier: The package underneath the cover is called "sparkapi" it is to be used by pacakge builders. "spark_context()" and "invoke()" are the functionality to call scala methods. It does not you to currently run R user defined functions. I am currently working on enabling that feature. Depending upon the interest in using ddR with sparkapi, I can spend more time to make sparkapi feature rich.
- Indrajit: What versions of Spark are supported
- Javier: Anything after 1.6
- Bernd: How do you export data?
- Javier: We are using all the code from SparkR. So everything in SparkR should continue to work. We don't need to change SparkR. We just need to maintain compatibility.
- Bernd: What happens when the RDDs are very large?
- Javier: Spark will spill on disk.
- Michael Kane: Presented examples that he implemented on ddR.
- Talked about how the different distributed packages compare to each other in terms of functionality.
- Michael K. looked at glm and truncated SVD on ddR. Was able to implement irls on ddR by implementing two distributed functions such as "cross". In truncated SVD only needed to overload two different distributed multiplications.
- Ran these algorithms on the 1000 genome dataset.
- Overall liked ddR since it was easy to implement the algorithms in the package.
- New ideas:
- Trying to separate the data layer from the execution layer
- Create an API that works on "chunks" (which is similar to the "parts" API in ddR). Would like to add these APIs to ddR.
- Indrajit: You should be able to get some of the chunk like features by using parts and dmapply. E.g., you can call dmapply to read 10 different files, which correspond 10 chunks now. These are however wrapped as a darray or dframe. But you can continue to work on the individual chunks by using parts(i).
- Round table introduction
- (Michael) Goals for the group:
- Make a common abstraction/interfaces to make it easier to work with distributed data and R
- Unify the interface
- Working group will run for a year. Get an API defined, get at least one open source reference implementations
- not everyone needs to work hands on. We will create smaller groups to focus on those aspects.
- We tried to get a diverse group of participants
- Logistics: meet monthly, focus groups may meet more often
- R Consoritum may be able to figure ways to fund smaller projects that come out of the working group
- Michael Kane: Should we start with an inventory of what is available and people are using?
- Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.
- Joe: I will figure out a wiki space.
- Javier: Who are the end users? Simon: Common layer needed to get algorithms working. We started from algos and tried to find the minimal common api. One of the goals is to make sure everyone is on the same page and not trying to create his/her own custom interface.
- Javier: Should we try to get people with more algo expertise?
- Joe: Simon do you have a stack diagram?
- Simon: Can we get R Consortium to help write things up and draw things?
- Next meeting: Javier is going to present SparkR next time.