<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="https://wiki.r-consortium.org/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>https://wiki.r-consortium.org/index.php?action=history&amp;feed=atom&amp;title=Distributed_Computing_Working_Group_Progress_Report_2016</id>
		<title>Distributed Computing Working Group Progress Report 2016 - Revision history</title>
		<link rel="self" type="application/atom+xml" href="https://wiki.r-consortium.org/index.php?action=history&amp;feed=atom&amp;title=Distributed_Computing_Working_Group_Progress_Report_2016"/>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/index.php?title=Distributed_Computing_Working_Group_Progress_Report_2016&amp;action=history"/>
		<updated>2026-05-11T04:14:55Z</updated>
		<subtitle>Revision history for this page on the wiki</subtitle>
		<generator>MediaWiki 1.23.15</generator>

	<entry>
		<id>https://wiki.r-consortium.org/index.php?title=Distributed_Computing_Working_Group_Progress_Report_2016&amp;diff=76&amp;oldid=prev</id>
		<title>MichaelLawrence at 18:45, 1 May 2017</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/index.php?title=Distributed_Computing_Working_Group_Progress_Report_2016&amp;diff=76&amp;oldid=prev"/>
				<updated>2017-05-01T18:45:03Z</updated>
		
		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class='diff diff-contentalign-left'&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 18:45, 1 May 2017&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Authors: Michael Lawrence and Indrajit Roy&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Introduction ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Introduction ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key rconsortium_mwiki:diff:version:1.11a:oldid:75:newid:76 --&gt;
&lt;/table&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/index.php?title=Distributed_Computing_Working_Group_Progress_Report_2016&amp;diff=75&amp;oldid=prev</id>
		<title>MichaelLawrence: Created page with &quot;== Introduction ==   Data sizes continue to increase, while single core performance has stagnated. We scale computations by leveraging multiple cores and machines. Large datas...&quot;</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/index.php?title=Distributed_Computing_Working_Group_Progress_Report_2016&amp;diff=75&amp;oldid=prev"/>
				<updated>2017-05-01T18:43:10Z</updated>
		
		<summary type="html">&lt;p&gt;Created page with &amp;quot;== Introduction ==   Data sizes continue to increase, while single core performance has stagnated. We scale computations by leveraging multiple cores and machines. Large datas...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data sizes continue to increase, while single core performance has&lt;br /&gt;
stagnated. We scale computations by leveraging multiple cores and&lt;br /&gt;
machines. Large datasets are expensive to replicate, so we minimize&lt;br /&gt;
data movement by moving the computation to the data. Many systems,&lt;br /&gt;
such as Hadoop, Spark, and massively parallel processing (MPP)&lt;br /&gt;
databases, have emerged to support these strategies, and each exposes&lt;br /&gt;
its own unique interface, with little standardization. &lt;br /&gt;
&lt;br /&gt;
Developing and executing an algorithm in the distributed context is a&lt;br /&gt;
complex task that requires specific knowledge of and dependency on the&lt;br /&gt;
system storing the data. It is also a task orthogonal to the primary&lt;br /&gt;
role of a data scientist or statistician: extracting knowledge from&lt;br /&gt;
data. The task thus falls to the data analysis environment, which&lt;br /&gt;
should mask the complexity behind a familiar interface, maintaining&lt;br /&gt;
user productivity. However, it is not always feasible to automatically&lt;br /&gt;
determine the optimal strategy for a given problem, so user input is&lt;br /&gt;
often beneficial. The environment should only abstract the details to&lt;br /&gt;
the extent deemed appropriate by the user.&lt;br /&gt;
&lt;br /&gt;
R needs a standardized, layered and idiomatic abstraction for&lt;br /&gt;
computing on distributed data structures. R has many packages that&lt;br /&gt;
provide parallelism constructs as well as bridges to distributed&lt;br /&gt;
systems such as Hadoop. Unfortunately, each interface has its own&lt;br /&gt;
syntax, parallelism techniques, and supported platform(s).  As a&lt;br /&gt;
consequence, contributors are forced to learn multiple idiosyncratic&lt;br /&gt;
interfaces, and to restrict each implementation to a particular&lt;br /&gt;
interface, thus limiting the applicability and adoption of their&lt;br /&gt;
software and hampering interoperability.&lt;br /&gt;
&lt;br /&gt;
The idea of a unified interface stemmed from a cross-industry workshop&lt;br /&gt;
organized at HP Labs in early 2015. The workshop brought together&lt;br /&gt;
participants from several companies and universities, along with R-core members. Immediately&lt;br /&gt;
after the workshop, Indrajit Roy, Edward Ma, and Michael Lawrence began&lt;br /&gt;
designing an abstraction that later became known as the CRAN package&lt;br /&gt;
ddR (Distributed Data in R)[1]. It declares a unified API for distributed&lt;br /&gt;
computing in R and ensures that R programs written using the API are&lt;br /&gt;
portable across different systems, such as Distributed R, Spark, etc.&lt;br /&gt;
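&lt;br /&gt;
As a minimal sketch of this portability, here is a hedged example in the style of the ddR README (it assumes the CRAN ddR package is installed; ddR ships with a driver for R's parallel package, which serves as the default backend):&lt;br /&gt;

```r
library(ddR)  # unified distributed-computing API; parallel backend by default

# The same call runs unchanged on any registered backend:
a = dmapply(function(x) x + 1, 1:5)  # distributed multivariate apply
collect(a)                           # gather the distributed result locally
```

Switching backends (e.g., to Distributed R) is a one-line change via useBackend(), leaving the dmapply() call itself untouched.&lt;br /&gt;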
&lt;br /&gt;
The ddR package has completed its initial phase of development; the&lt;br /&gt;
first release is now on CRAN. Three ddR machine-learning algorithms&lt;br /&gt;
are also on CRAN: randomForest.ddR, glm.ddR, and kmeans.ddR. Two&lt;br /&gt;
reference backends for ddR have been completed, one for R’s parallel&lt;br /&gt;
package, and one for HP Distributed R. Example code and scripts to run&lt;br /&gt;
algorithms and code on both of these backends are available in our&lt;br /&gt;
public repository at https://github.com/vertica/ddR.&lt;br /&gt;
&lt;br /&gt;
The overarching goal of the ddR project was to serve as a starting&lt;br /&gt;
point for a collaborative effort, ultimately leading to a standard API&lt;br /&gt;
for working with distributed data in R.  We decided that it was&lt;br /&gt;
natural for the R Consortium to sponsor the collaboration, as it&lt;br /&gt;
should involve both industry and R-core members. To this end, we&lt;br /&gt;
established the R Consortium Working Group on Distributed Computing,&lt;br /&gt;
with a planned duration of a single year and the following aims:&lt;br /&gt;
&lt;br /&gt;
# Agree on the goal of the group, i.e., that we should have a unifying framework for distributed computing. Define a success metric.&lt;br /&gt;
# Brainstorm on which primitives should be included in the API. We can use ddR’s API of distributed data structures and dmapply as the starting proposal. Understand the relationship with existing packages such as parallel, foreach, etc.&lt;br /&gt;
# Explore how a ddR-like interface will interact with databases. Are there connections or redundancies with dplyr and multidplyr?&lt;br /&gt;
# Decide on a reference implementation for the API.&lt;br /&gt;
# Decide whether we should also implement a few ecosystem packages, e.g., distributed algorithms written using the API.&lt;br /&gt;
&lt;br /&gt;
We declared the following milestones:&lt;br /&gt;
&lt;br /&gt;
# Mid-year milestone: Finalize the API. Decide who will help develop the top-level implementation and the backends.&lt;br /&gt;
# End-year milestone: Summary report and reference implementation. Socialize the final package.&lt;br /&gt;
&lt;br /&gt;
This report outlines the progress we have made on the above goals and&lt;br /&gt;
milestones, and how we plan to continue progress in the second half of&lt;br /&gt;
the working group term.&lt;br /&gt;
&lt;br /&gt;
== Results and Current Status ==&lt;br /&gt;
&lt;br /&gt;
The working group has achieved the first goal by agreeing that we&lt;br /&gt;
should aim for a unifying distributed computing abstraction, and we&lt;br /&gt;
have treated ddR as an informal API proposal.&lt;br /&gt;
&lt;br /&gt;
We have discussed many of the issues related to the second goal,&lt;br /&gt;
deciding which primitives should be part of the API.  We aim for the&lt;br /&gt;
API to support three shapes of data --- lists, arrays, and data frames&lt;br /&gt;
--- and to enable the loading and basic manipulation of distributed&lt;br /&gt;
data, including multiple modes of functional iteration (e.g., apply()&lt;br /&gt;
operations). We aim to preserve consistency with base R data&lt;br /&gt;
structures and functions, so as to provide a simple path for users to&lt;br /&gt;
port computations to distributed systems.&lt;br /&gt;
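&lt;br /&gt;
A brief illustration of the intended base-R consistency (a sketch assuming the ddR package; dlist(), dmapply(), and collect() are functions exported by ddR, and the element values here are arbitrary):&lt;br /&gt;

```r
library(ddR)

# A distributed list mirrors a base R list:
dl = dlist(1, 2, 3)                       # distributed analogue of list(1, 2, 3)
doubled = dmapply(function(x) x * 2, dl)  # functional iteration, apply()-style
collect(doubled)                          # an ordinary list, as base R users expect
```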
&lt;br /&gt;
The ddR constructs permit a user to express a wide variety of&lt;br /&gt;
applications, including machine-learning algorithms, that will run on&lt;br /&gt;
different backends. We have successfully implemented distributed&lt;br /&gt;
versions of algorithms such as k-means, regression, random forest, and&lt;br /&gt;
PageRank using the ddR API. Some of these ddR algorithms are now&lt;br /&gt;
available on CRAN.  In addition, the package provides several generic&lt;br /&gt;
definitions of common operators (such as colSums) that can be invoked&lt;br /&gt;
on distributed objects residing in the supporting backends.&lt;br /&gt;
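&lt;br /&gt;
The idea behind such generics can be illustrated in plain R, without any backend: a colSums() over a row-partitioned matrix reduces to per-chunk column sums that are then combined. This sketch is purely illustrative and is not ddR's actual implementation:&lt;br /&gt;

```r
# A 4 x 5 matrix, row-partitioned into two chunks as a backend might store it.
m = matrix(1:20, nrow = 4)
chunks = list(m[1:2, ], m[3:4, ])

partial = lapply(chunks, colSums)  # each chunk computes its own column sums
combined = Reduce("+", partial)    # combine the partial results

identical(combined, colSums(m))    # TRUE: matches the base R operator
```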
&lt;br /&gt;
Each custom ddR backend is encapsulated in its own driver package. In&lt;br /&gt;
the conventional style of functional OOP, the driver registers methods&lt;br /&gt;
for generics declared by the backend API, such that ddR can dispatch&lt;br /&gt;
the backend-specific instructions by only calling the generics.&lt;br /&gt;
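&lt;br /&gt;
This dispatch pattern can be sketched with base R's S4 system; the generic, class, and slot names below are illustrative, not ddR's actual internals:&lt;br /&gt;

```r
library(methods)  # S4 machinery (loaded by default in interactive R)

# A generic the front end calls; each driver package supplies a method.
setGeneric("do_dmapply", function(driver, func, ...) standardGeneric("do_dmapply"))

# A hypothetical driver class and its registered method:
setClass("demoDriver", slots = c(workers = "numeric"))
setMethod("do_dmapply", "demoDriver", function(driver, func, ...) {
  # Stand-in for backend-specific execution: evaluate locally.
  mapply(func, ..., SIMPLIFY = FALSE)
})

drv = new("demoDriver", workers = 2)
do_dmapply(drv, function(x) x * 2, 1:3)  # dispatches to the demoDriver method
```

The front end never needs to know which backend is active; it calls the generic, and S4 dispatch routes the work to whichever driver class is in use.&lt;br /&gt;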
&lt;br /&gt;
The working group explored potential new backends with the aim of&lt;br /&gt;
broadening the applicability of the ddR interface. We hosted&lt;br /&gt;
presentations from external speakers on Spark and TensorFlow, and also&lt;br /&gt;
considered a generic SQL backend. The discussion focused on Spark&lt;br /&gt;
integration, and the R Consortium-funded intern Clark Fitzgerald took&lt;br /&gt;
on the task of developing a prototype Spark backend. The development&lt;br /&gt;
of the Spark backend encountered some obstacles, including the&lt;br /&gt;
immaturity of Spark and its R interfaces. Development is currently&lt;br /&gt;
paused, as we await additional funding.&lt;br /&gt;
&lt;br /&gt;
During the monthly meetings, the working group deliberated on&lt;br /&gt;
different design improvements for ddR itself. We list two key topics&lt;br /&gt;
that were discussed.  First, Michael Kane and Bryan Lewis argued for a&lt;br /&gt;
lower level API that directly operates on chunks of data. While ddR&lt;br /&gt;
supports chunk-wise data processing, via a combination of dmapply()&lt;br /&gt;
and parts(), its focus on distributed data structures means that&lt;br /&gt;
the chunk-based processing is exposed as the manipulation of these&lt;br /&gt;
data structures. Second, Clark Fitzgerald proposed restructuring the&lt;br /&gt;
ddR code into two layers that include chunk-wise processing while&lt;br /&gt;
retaining the emphasis on distributed data structures [2]. The lower&lt;br /&gt;
level API, which will interface with backends, will use a Map()-like&lt;br /&gt;
primitive to evaluate functions on chunks of data, while the higher&lt;br /&gt;
level ddR API will expose distributed data structures, dmapply, and&lt;br /&gt;
other convenience functions. This refactoring would facilitate the&lt;br /&gt;
implementation of additional backends.&lt;br /&gt;
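&lt;br /&gt;
A toy sketch of the proposed lower layer (illustrative only; the actual design is described in the wiki page cited as [2]):&lt;br /&gt;

```r
# The lower layer sees only chunks and a Map()-like primitive.
chunks = split(1:10, rep(1:2, each = 5))  # two chunks of five elements each

chunk_map = function(f, chunk_list) Map(f, chunk_list)  # Map()-like primitive

# A higher-level operation, e.g. a distributed sum, composes on top:
partial = chunk_map(sum, chunks)  # per-chunk partial sums
Reduce("+", partial)              # combine partials: equals sum(1:10)
```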
&lt;br /&gt;
== Discussion and Future Plans ==&lt;br /&gt;
&lt;br /&gt;
The R Consortium-funded working group and internship have helped us&lt;br /&gt;
start a conversation on distributed computing APIs for R.  The ddR&lt;br /&gt;
CRAN package is a concrete outcome of this working group, and serves&lt;br /&gt;
as a platform for exploring APIs and their integration with different&lt;br /&gt;
backends. While ddR is still maturing, we have arrived at a consensus&lt;br /&gt;
for how we should improve and finalize the ddR API.&lt;br /&gt;
&lt;br /&gt;
As part of our goal for a reference implementation, we aim to develop&lt;br /&gt;
one or more prototype backends that will make the ddR interface useful&lt;br /&gt;
in practice. A good candidate backend is any open-source system that&lt;br /&gt;
is effective for R use cases and has strong community support. Spark&lt;br /&gt;
remains a viable candidate, and we also aim to further explore&lt;br /&gt;
TensorFlow.&lt;br /&gt;
&lt;br /&gt;
We plan for a second intern to perform three tasks: (1) refactor the&lt;br /&gt;
ddR API to a more final form, (2) compare Spark and TensorFlow in&lt;br /&gt;
detail, with an eye towards the feasibility of implementing a useful&lt;br /&gt;
backend, and (3) implement a prototype backend based on Spark or&lt;br /&gt;
TensorFlow, depending on the recommendation of the working group.&lt;br /&gt;
&lt;br /&gt;
By its conclusion, the working group will have produced:&lt;br /&gt;
* A stable version of the ddR package and at least one practical backend, released on CRAN,&lt;br /&gt;
* A list of requirements that are relevant and of interest to the community but have not yet been met by ddR, including alternative implementations that remain independent,&lt;br /&gt;
* A list of topics that the group believes worthy of further investigation.&lt;br /&gt;
&lt;br /&gt;
[1] http://h30507.www3.hp.com/t5/Behind-the-scenes-Labs/Enhancing-R-for-Distributed-Computing/ba-p/6795535#.VjE1K7erQQj&lt;br /&gt;
&lt;br /&gt;
[2] Clark Fitzgerald. https://github.com/vertica/ddR/wiki/Design&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	</feed>