<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="https://wiki.r-consortium.org/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>https://wiki.r-consortium.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=MichaelLawrence</id>
		<title>R Consortium Wiki - User contributions [en]</title>
		<link rel="self" type="application/atom+xml" href="https://wiki.r-consortium.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=MichaelLawrence"/>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Special:Contributions/MichaelLawrence"/>
		<updated>2026-05-11T04:15:06Z</updated>
		<subtitle>User contributions</subtitle>
		<generator>MediaWiki 1.23.15</generator>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group_Progress_Report_2016</id>
		<title>Distributed Computing Working Group Progress Report 2016</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group_Progress_Report_2016"/>
				<updated>2017-05-01T18:45:03Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Authors: Michael Lawrence and Indrajit Roy&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data sizes continue to increase, while single core performance has&lt;br /&gt;
stagnated. We scale computations by leveraging multiple cores and&lt;br /&gt;
machines. Large datasets are expensive to replicate, so we minimize&lt;br /&gt;
data movement by moving the computation to the data. Many systems,&lt;br /&gt;
such as Hadoop, Spark, and massively parallel processing (MPP)&lt;br /&gt;
databases, have emerged to support these strategies, and each exposes&lt;br /&gt;
its own unique interface, with little standardization. &lt;br /&gt;
&lt;br /&gt;
Developing and executing an algorithm in the distributed context is a&lt;br /&gt;
complex task that requires specific knowledge of and dependency on the&lt;br /&gt;
system storing the data. It is also a task orthogonal to the primary&lt;br /&gt;
role of a data scientist or statistician: extracting knowledge from&lt;br /&gt;
data. The task thus falls to the data analysis environment, which&lt;br /&gt;
should mask the complexity behind a familiar interface, maintaining&lt;br /&gt;
user productivity. However, it is not always feasible to automatically&lt;br /&gt;
determine the optimal strategy for a given problem, so user input is&lt;br /&gt;
often beneficial. The environment should only abstract the details to&lt;br /&gt;
the extent deemed appropriate by the user.&lt;br /&gt;
&lt;br /&gt;
R needs a standardized, layered and idiomatic abstraction for&lt;br /&gt;
computing on distributed data structures. R has many packages that&lt;br /&gt;
provide parallelism constructs as well as bridges to distributed&lt;br /&gt;
systems such as Hadoop. Unfortunately, each interface has its own&lt;br /&gt;
syntax, parallelism techniques, and supported platform(s).  As a&lt;br /&gt;
consequence, contributors are forced to learn multiple idiosyncratic&lt;br /&gt;
interfaces, and to restrict each implementation to a particular&lt;br /&gt;
interface, thus limiting the applicability and adoption of their&lt;br /&gt;
software and hampering interoperability.&lt;br /&gt;
&lt;br /&gt;
The idea of a unified interface stemmed from a cross-industry workshop&lt;br /&gt;
organized at HP Labs in early 2015. The workshop was attended by&lt;br /&gt;
different companies, universities, and R-core members. Immediately&lt;br /&gt;
after the workshop, Indrajit Roy, Edward Ma, and Michael Lawrence began&lt;br /&gt;
designing an abstraction that later became known as the CRAN package&lt;br /&gt;
ddR (Distributed Data in R)[1]. It declares a unified API for distributed&lt;br /&gt;
computing in R and ensures that R programs written using the API are&lt;br /&gt;
portable across different systems, such as Distributed R, Spark, etc.&lt;br /&gt;
&lt;br /&gt;
The ddR package has completed its initial phase of development; the&lt;br /&gt;
first release is now on CRAN. Three ddR machine-learning algorithms&lt;br /&gt;
are also on CRAN: randomForest.ddR, glm.ddR, and kmeans.ddR. Two&lt;br /&gt;
reference backends for ddR have been completed, one for R’s parallel&lt;br /&gt;
package, and one for HP Distributed R. Example code and scripts to run&lt;br /&gt;
algorithms and code on both of these backends are available in our&lt;br /&gt;
public repository at https://github.com/vertica/ddR.&lt;br /&gt;
&lt;br /&gt;
The overarching goal of the ddR project was for it to be a starting&lt;br /&gt;
point in a collaborative effort, ultimately leading to a standard API&lt;br /&gt;
for working with distributed data in R.  We decided that it was&lt;br /&gt;
natural for the R Consortium to sponsor the collaboration, as it&lt;br /&gt;
should involve both industry and R-core members. To this end, we&lt;br /&gt;
established the R Consortium Working Group on Distributed Computing,&lt;br /&gt;
with a planned duration of a single year and the following aims:&lt;br /&gt;
&lt;br /&gt;
# Agree on the goal of the group, i.e., we should have a unifying framework for distributed computing. Define success metric.&lt;br /&gt;
# Brainstorm on what primitives should be included in the API. We can use ddR’s API of distributed data-structures and dmapply as the starting proposal. Understand relationship with existing packages such as parallel, foreach, etc.&lt;br /&gt;
# Explore how a ddR-like interface will interact with databases. Are there connections or redundancies with dplyr and multidplyr?&lt;br /&gt;
# Decide on a reference implementation for the API.&lt;br /&gt;
# Decide on whether we should also implement a few ecosystem packages, e.g., distributed algorithms written using the API.&lt;br /&gt;
&lt;br /&gt;
We declared the following milestones:&lt;br /&gt;
&lt;br /&gt;
# Mid-year milestone: Finalize API. Decide who will help develop the top-level implementation and backends.&lt;br /&gt;
# End-year milestone: Summary report and reference implementation. Socialize the final package.&lt;br /&gt;
&lt;br /&gt;
This report outlines the progress we have made on the above goals and&lt;br /&gt;
milestones, and how we plan to continue progress in the second half of&lt;br /&gt;
the working group term.&lt;br /&gt;
&lt;br /&gt;
== Results and Current Status ==&lt;br /&gt;
&lt;br /&gt;
The working group has achieved the first goal by agreeing that we&lt;br /&gt;
should aim for a unifying distributed computing abstraction, and we&lt;br /&gt;
have treated ddR as an informal API proposal.&lt;br /&gt;
&lt;br /&gt;
We have discussed many of the issues related to the second goal,&lt;br /&gt;
deciding which primitives should be part of the API.  We aim for the&lt;br /&gt;
API to support three shapes of data --- lists, arrays and data frames&lt;br /&gt;
--- and to enable the loading and basic manipulation of distributed&lt;br /&gt;
data, including multiple modes of functional iteration (e.g., apply()&lt;br /&gt;
operations). We aim to preserve consistency with base R data&lt;br /&gt;
structures and functions, so as to provide a simple path for users to&lt;br /&gt;
port computations to distributed systems.&lt;br /&gt;
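&lt;br /&gt;
For illustration, a minimal sketch using the released ddR API (dmapply() and collect(), with the default parallel backend that ships with the package):&lt;br /&gt;
&lt;pre&gt;
## Sketch against the ddR CRAN release; assumes the default
## 'parallel' backend.
library(ddR)

## Create a distributed list by squaring each element across partitions
dl &amp;lt;- dmapply(function(x) x^2, 1:8, nparts = 2)

## Gather the distributed result back into a local R list
collect(dl)
&lt;/pre&gt;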
&lt;br /&gt;
The ddR constructs permit a user to express a wide variety of&lt;br /&gt;
applications, including machine-learning algorithms, that will run on&lt;br /&gt;
different backends. We have successfully implemented distributed&lt;br /&gt;
versions of algorithms such as K-means, Regression, Random Forest, and&lt;br /&gt;
PageRank using the ddR API. Some of these ddR algorithms are now&lt;br /&gt;
available on CRAN.  In addition, the package provides several generic&lt;br /&gt;
definitions of common operators (such as colSums) that can be invoked&lt;br /&gt;
on distributed objects residing in the supporting backends.&lt;br /&gt;
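&lt;br /&gt;
As a sketch of these generics (assuming ddR's darray() constructor as released on CRAN):&lt;br /&gt;
&lt;pre&gt;
library(ddR)

## A 4x4 distributed array in two 2x4 partitions, filled with ones
da &amp;lt;- darray(dim = c(4, 4), psize = c(2, 4), data = 1)

## colSums() dispatches to the distributed method for the active backend
colSums(da)
&lt;/pre&gt;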
&lt;br /&gt;
Each custom ddR backend is encapsulated in its own driver package. In&lt;br /&gt;
the conventional style of functional OOP, the driver registers methods&lt;br /&gt;
for generics declared by the backend API, such that ddR can dispatch&lt;br /&gt;
the backend-specific instructions by only calling the generics.&lt;br /&gt;
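&lt;br /&gt;
Schematically, a driver package registers methods along these lines (the class and generic names below are illustrative stand-ins, not the exact ddR backend API):&lt;br /&gt;
&lt;pre&gt;
## Illustrative sketch only: 'myDriver' is a hypothetical backend class,
## and 'do_dmapply' stands in for a generic declared by the backend API.
setClass("myDriver", representation(name = "character"))

setGeneric("do_dmapply",
           function(driver, func, ...) standardGeneric("do_dmapply"))

setMethod("do_dmapply", signature(driver = "myDriver"),
          function(driver, func, ...) {
            ## A real driver would translate the call into
            ## backend-specific instructions; here we evaluate locally.
            Map(func, ...)
          })
&lt;/pre&gt;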
&lt;br /&gt;
The working group explored potential new backends with the aim of&lt;br /&gt;
broadening the applicability of the ddR interface. We hosted&lt;br /&gt;
presentations from external speakers on Spark and TensorFlow, and also&lt;br /&gt;
considered a generic SQL backend. The discussion focused on Spark&lt;br /&gt;
integration, and the R Consortium-funded intern Clark Fitzgerald took&lt;br /&gt;
on the task of developing a prototype Spark backend. The development&lt;br /&gt;
of the Spark backend encountered some obstacles, including the&lt;br /&gt;
immaturity of Spark and its R interfaces. Development is currently&lt;br /&gt;
paused, as we await additional funding.&lt;br /&gt;
&lt;br /&gt;
During the monthly meetings, the working group deliberated on&lt;br /&gt;
different design improvements for ddR itself. We list two key topics&lt;br /&gt;
that were discussed.  First, Michael Kane and Bryan Lewis argued for a&lt;br /&gt;
lower-level API that directly operates on chunks of data. While ddR&lt;br /&gt;
supports chunk-wise data processing, via a combination of dmapply()&lt;br /&gt;
and parts(), its focus on distributed data structures means that&lt;br /&gt;
the chunk-based processing is exposed as the manipulation of these&lt;br /&gt;
data structures. Second, Clark Fitzgerald proposed restructuring the&lt;br /&gt;
ddR code into two layers that include chunk-wise processing while&lt;br /&gt;
retaining the emphasis on distributed data structures[2]. The&lt;br /&gt;
lower-level API, which will interface with backends, will use a Map()-like&lt;br /&gt;
primitive to evaluate functions on chunks of data, while the&lt;br /&gt;
higher-level ddR API will expose distributed data structures, dmapply, and&lt;br /&gt;
other convenience functions. This refactoring would facilitate the&lt;br /&gt;
implementation of additional backends.&lt;br /&gt;
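&lt;br /&gt;
The chunk-wise style already expressible in ddR can be sketched as follows (assuming dmapply() and parts() as released on CRAN):&lt;br /&gt;
&lt;pre&gt;
library(ddR)

## A distributed list split into four partitions
dl &amp;lt;- dmapply(function(x) x, 1:100, nparts = 4)

## parts() exposes the partitions, so the function below receives a
## whole chunk at a time rather than a single element
chunk_sums &amp;lt;- dmapply(function(chunk) sum(unlist(chunk)), parts(dl))
collect(chunk_sums)
&lt;/pre&gt;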
&lt;br /&gt;
== Discussion and Future Plans ==&lt;br /&gt;
&lt;br /&gt;
The R Consortium-funded working group and internship have helped us&lt;br /&gt;
start a conversation on distributed computing APIs for R.  The ddR&lt;br /&gt;
CRAN package is a concrete outcome of this working group, and serves&lt;br /&gt;
as a platform for exploring APIs and their integration with different&lt;br /&gt;
backends. While ddR is still maturing, we have arrived at a consensus&lt;br /&gt;
for how we should improve and finalize the ddR API.&lt;br /&gt;
&lt;br /&gt;
As part of our goal for a reference implementation, we aim to develop&lt;br /&gt;
one or more prototype backends that will make the ddR interface useful&lt;br /&gt;
in practice. A good candidate backend is any open-source system that&lt;br /&gt;
is effective at R use cases and has strong community support. Spark&lt;br /&gt;
remains a viable candidate, and we also aim to further explore&lt;br /&gt;
TensorFlow.&lt;br /&gt;
&lt;br /&gt;
We plan for a second intern to perform three tasks: (1) refactor the&lt;br /&gt;
ddR API to a more final form, (2) compare Spark and TensorFlow in&lt;br /&gt;
detail, with an eye towards the feasibility of implementing a useful&lt;br /&gt;
backend, and (3) implement a prototype backend based on Spark or&lt;br /&gt;
TensorFlow, depending on the recommendation of the working group.&lt;br /&gt;
&lt;br /&gt;
By the conclusion of the working group, it will have produced:&lt;br /&gt;
* A stable version of the ddR package and at least one practical backend, released on CRAN,&lt;br /&gt;
* A list of requirements that are relevant and of interest to the community but have not yet been met by ddR, including alternative implementations that remain independent,&lt;br /&gt;
* A list of topics that the group believes worthy of further investigation.&lt;br /&gt;
&lt;br /&gt;
[1] http://h30507.www3.hp.com/t5/Behind-the-scenes-Labs/Enhancing-R-for-Distributed-Computing/ba-p/6795535#.VjE1K7erQQj&lt;br /&gt;
&lt;br /&gt;
[2] Clark Fitzgerald. https://github.com/vertica/ddR/wiki/Design&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>


	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2017-05-01T18:42:48Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (ISC liaison, RStudio)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
* Bryan Lewis (Paradigm4)&lt;br /&gt;
* Hossein Falaki (Databricks)&lt;br /&gt;
&lt;br /&gt;
== Reports ==&lt;br /&gt;
&lt;br /&gt;
[[Distributed Computing Working Group Progress Report 2016|2016 Progress Report]]&lt;br /&gt;
&lt;br /&gt;
== Milestones ==&lt;br /&gt;
&lt;br /&gt;
=== Achieved ===&lt;br /&gt;
&lt;br /&gt;
* Adopt ddR as a prototype for a standard API for distributed computing in R&lt;br /&gt;
&lt;br /&gt;
=== 2016 Internship ===&lt;br /&gt;
&lt;br /&gt;
Clark Fitzgerald, a PhD student in the UC Davis Statistics department, worked on ddR and Spark integration.&lt;br /&gt;
&lt;br /&gt;
* Wrote [https://github.com/clarkfitzg/sparklite sparklite] and [https://github.com/clarkfitzg/rddlist rddlist] as minimal proof-of-concept R packages to connect and store general data on Spark. [https://docs.google.com/presentation/d/1WfUQ2ockNku90GWMXonEhUEcVOWcgBmWwt5uYSSBYPY/edit?usp=sharing slides]&lt;br /&gt;
* [https://issues.apache.org/jira/browse/SPARK-16785 Patched SparkR] to allow user defined functions returning binary columns. This allows implementation of different data structures in SparkR.&lt;br /&gt;
* Updated [https://github.com/vertica/ddR/wiki/Design design documents] with suggested changes to DDR's internal design and object oriented model. &lt;br /&gt;
* Improved [https://github.com/vertica/ddR/pull/15 testing and ddR internals].&lt;br /&gt;
&lt;br /&gt;
=== Outstanding ===&lt;br /&gt;
&lt;br /&gt;
* Agree on a final standard API for distributed computing in R&lt;br /&gt;
* Implement at least one scalable backend based on an open-source technology like Spark, SQL, etc.&lt;br /&gt;
&lt;br /&gt;
== Open Questions ==&lt;br /&gt;
&lt;br /&gt;
* How can we address the needs of both the end user data scientists and the algorithm implementers?&lt;br /&gt;
* How should we share data between R and a system like Spark?&lt;br /&gt;
* Is there any way to unify SparkR and sparklyr?&lt;br /&gt;
* Could we use the abstractions of tensorflow to partially or fully integrate with platforms like Spark?&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
=== 03/09/2017 ===&lt;br /&gt;
* Talk by Bryan Lewis&lt;br /&gt;
** Has created a GitHub page with notes on Singularity.&lt;br /&gt;
** Singularity is a container technology for HPC applications&lt;br /&gt;
** No daemon. Minimal virtualization to get the application running. Lightweight, with very low overhead. &lt;br /&gt;
** Used widely in supercomputers&lt;br /&gt;
** All distributed computing platforms, even with R skins, are difficult to use. &lt;br /&gt;
** Containers make it much easier to abstract away the long tail of software dependencies and focus on R &lt;br /&gt;
** Demonstrated an example of using Singularity with Tensorflow&lt;br /&gt;
** Tried MPI and foreach's %dopar% on the 1000 Genomes data&lt;br /&gt;
** The program parses the variant data and stores chunks as files, then runs principal component analysis on each file.&lt;br /&gt;
** Overload matrix operations to use foreach/MPI underneath.&lt;br /&gt;
** Overall: use existing R operators, overloading them with the appropriate backend.&lt;br /&gt;
** Will spend time working on TensorFlow, e.g., taking a number of algorithms such as PCA and writing them on top of TensorFlow using existing R primitives. &lt;br /&gt;
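&lt;br /&gt;
The operator-overloading idea above can be sketched with foreach (a simplified illustration; the talk used MPI-backed backends, and par_matmul below is a hypothetical helper, not code from the talk):&lt;br /&gt;
&lt;pre&gt;
library(foreach)  # %dopar% uses whatever parallel backend is registered

## Block-wise matrix multiply: one block of columns of b per iteration.
## An overloaded %*% method could delegate to a function like this.
par_matmul &amp;lt;- function(a, b, blocks = 2) {
  idx &amp;lt;- split(seq_len(ncol(b)),
                cut(seq_len(ncol(b)), blocks, labels = FALSE))
  cols &amp;lt;- foreach(j = idx) %dopar% (a %*% b[, j, drop = FALSE])
  do.call(cbind, cols)
}
&lt;/pre&gt;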
&lt;br /&gt;
=== 12/08/2016 ===&lt;br /&gt;
* Yuan Tang from Uptake was the presenter&lt;br /&gt;
** Michael and Indrajit will write a status report for the working group sometime in December or January&lt;br /&gt;
** Yuan gave an overview of TensorFlow&lt;br /&gt;
** JJ, Dirk and Yuan are working on R layer for TensorFlow &lt;br /&gt;
** TensorFlow is a platform for machine learning as well as other computations (even math proofs).&lt;br /&gt;
** It is GPU optimized and distributed. &lt;br /&gt;
** It is used in search, speech recognition, Google photos, etc.&lt;br /&gt;
** TensorFlow computations are directed graphs. Nodes are operations and edges are tensors.&lt;br /&gt;
** A lot of array, matrix, etc. operations are available&lt;br /&gt;
** The backend is mostly C++. A Python front end exists. &lt;br /&gt;
** The TensorFlow R package is based on the Python front end&lt;br /&gt;
** In multi-device setting, TensorFlow figures out which devices to use and manages communication between devices. &lt;br /&gt;
** Computations are fault tolerant&lt;br /&gt;
** Yuan has previously worked on Scikit Flow, which is now TF.Learn. It’s an easy transition for scikit-learn users.&lt;br /&gt;
** Yuan gave a brief overview of the python interface&lt;br /&gt;
** TensorFlow in R handles conversion between R and Python. The syntax is very similar to the Python API&lt;br /&gt;
** Future work: Adding more examples and tutorials, integration with Kubernetes/Marathon like framework.&lt;br /&gt;
** During the Q/A there were questions related to whether R kernels can be supported in TensorFlow, and whether R dataframes are a natural wrapper for TensorFlow objects.&lt;br /&gt;
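&lt;br /&gt;
For flavor, the 1.x-era R interface discussed looks roughly like this (sketch only; exact calls vary by tensorflow package version):&lt;br /&gt;
&lt;pre&gt;
library(tensorflow)

## Build a tiny graph: nodes are operations, edges carry tensors
a &amp;lt;- tf$constant(3)
b &amp;lt;- tf$constant(4)
s &amp;lt;- a + b        # an 'add' operation node

## 1.x-style execution: a session evaluates the requested node
sess &amp;lt;- tf$Session()
sess$run(s)
&lt;/pre&gt;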
&lt;br /&gt;
=== 11/10/2016 ===&lt;br /&gt;
&lt;br /&gt;
* SparkR slides were presented by Hossein Falaki and Shivaram from Databricks and UC Berkeley:&lt;br /&gt;
** SparkR was a prototype from AMPLab (2014). Initially it had the RDD API and was similar to the PySpark API&lt;br /&gt;
** In 2015, with the merge into upstream Spark, the decision was made to integrate with the DataFrame API and hide the RDD API&lt;br /&gt;
** In 2016, more MLlib algorithms have been integrated and new APIs have been added. A CRAN package will be released soon&lt;br /&gt;
** The original SparkR architecture runs R on the master, communicating with the JVM processes in the driver. The driver sends commands to the worker JVM processes, which execute them as Scala/Java statements.&lt;br /&gt;
** The system can read distributed data inside the JVM from different sources such as S3, HDFS, etc.&lt;br /&gt;
** The driver has a socket based connection between SparkR and the RBackend. RBackend runs on the JVM, deserializes the R code, and converts the R statements into Java calls.&lt;br /&gt;
** collect() and createDataFrame() are used to move data between R and JVM processes. createDataFrame() will convert your local R data into a JVM-based distributed data frame.&lt;br /&gt;
** The API has IO, caching, MLlib, and SQL-related commands&lt;br /&gt;
** Since Spark 2.0, we can run R processes inside the JVM worker processes. There is no need to keep long-running R processes. &lt;br /&gt;
** There are three UDF functions: (1) lapply, which runs a function on each value of a list; (2) dapply, which runs a function on each partition of a data frame (you have to be careful about how the data is partitioned); and (3) gapply, which groups by the given columns and then runs the function on each group. &lt;br /&gt;
** In the new CRAN package, install.spark() will automatically download and install Spark. Automated CRAN checks have been added for every commit. Should be available with Spark 2.1.0&lt;br /&gt;
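&lt;br /&gt;
The data movement and per-partition UDF entry points above look roughly like this in SparkR (sketch; assumes a Spark 2.x session):&lt;br /&gt;
&lt;pre&gt;
library(SparkR)
sparkR.session()   # Spark 2.x entry point

## Move local R data into the JVM as a distributed DataFrame
df &amp;lt;- createDataFrame(faithful)

## dapplyCollect: run an R function on each partition and collect the
## combined result back into a local data.frame
out &amp;lt;- dapplyCollect(df, function(part) {
  part$eruptions &amp;lt;- part$eruptions * 60   # minutes to seconds
  part
})
head(out)
&lt;/pre&gt;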
&lt;br /&gt;
* Q/A &lt;br /&gt;
** Currently trying to get a zero-copy dataframe between Python and Spark. Spark 2.0 has an off-heap manager that uses Arrow. Once this feature is tested on the Python API, the next step will be integrating R.&lt;br /&gt;
** Spark dataframes benefit from plan optimizations; this is not SparkR specific. R UDFs are still treated as black boxes by the optimizer&lt;br /&gt;
** Spark doesn't directly support matrices, and there is no immediate intent to do so. One can store an array or vector as a single column of a Spark dataframe.&lt;br /&gt;
&lt;br /&gt;
=== 10/13/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Mario Inchiosa: Microsoft's perspective on distributed computing with R&lt;br /&gt;
** Microsoft R Server: abstractions and algorithms for distributed computation on top of open-source R&lt;br /&gt;
** Desired features of a distributed API like ddR:&lt;br /&gt;
*** Supports PEMA (initialize, processData, updateResults, processResults)&lt;br /&gt;
*** Cross-platform&lt;br /&gt;
*** Fast runtime&lt;br /&gt;
*** Supports algorithm writer and data scientist&lt;br /&gt;
*** Comes with a comprehensive set of algorithms&lt;br /&gt;
*** Easy deployment&lt;br /&gt;
** ddR is making good progress but does not yet meet those requirements&lt;br /&gt;
* Indrajit: ddR progress report and next steps&lt;br /&gt;
** Recap of Clark's internship&lt;br /&gt;
** Next step: implement some of Clark's design suggestions: https://github.com/vertica/ddR/wiki/Design&lt;br /&gt;
** Spark integration will be based on sparklyr&lt;br /&gt;
** Should we limit Spark interaction to the DataFrame API or directly interact with RDDs?&lt;br /&gt;
*** Consensus: will likely need flexibility of RDDs to implement everything we need, e.g., arrays and lists&lt;br /&gt;
** Clark and Javier raised concerns about the scalability of sharing data between R and Spark&lt;br /&gt;
*** Michael: Spark is a platform in its own right, so interoperability is important, should figure something out&lt;br /&gt;
*** Bryan Lewis: Why not use the tensor abstraction from TensorFlow? Spark supports TensorFlow and an R interface is already in the works.&lt;br /&gt;
** Michael raised the issue of additional funding from the R Consortium to continue Clark's work&lt;br /&gt;
*** Joe Rickert suggested that the working group develop one or more white papers summarizing the findings of the working group for presentation to the Infrastructure Steering Committee.&lt;br /&gt;
*** Consensus was in favor of this, and several pointed out that the progress so far has been worthwhile, despite not meeting the specific goals laid out in the proposal.&lt;br /&gt;
* Michael: Do we want to invite some external speakers, one per meeting, from groups like Databricks, TensorFlow, etc.?&lt;br /&gt;
** Consensus was in favor.&lt;br /&gt;
&lt;br /&gt;
=== 9/8/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Clark Fitzgerald: internship report&lt;br /&gt;
** Developed two packages for low-level Spark integration: rddlist, sparklite&lt;br /&gt;
** Patched a bug in Spark&lt;br /&gt;
** ddR needs refactoring before Spark integration is feasible:&lt;br /&gt;
*** dlist, dframe, and darray should be formal classes.&lt;br /&gt;
*** Partitions of data should be represented by a distributed list abstraction, and most functions (e.g., dmapply) should be implemented on top of that list.&lt;br /&gt;
* Javier: sparklyr update&lt;br /&gt;
** Preparing for CRAN release&lt;br /&gt;
** Mario: what happened to sparkapi?&lt;br /&gt;
*** Javier: sparkapi has been merged into sparklyr in order to avoid the overhead of maintaining two packages. ddR can do everything it needs with sparklyr.&lt;br /&gt;
* Luke Tierney: Update on the low-level vector abstraction, which might support interfaces like ddR and sparklyr.&lt;br /&gt;
** Overall approach seems feasible, but still working out a few details.&lt;br /&gt;
** Will land in a branch soon.&lt;br /&gt;
* Bernd Bischl: update on the batchtools package&lt;br /&gt;
** Successor to BatchJobs, based on an in-memory database&lt;br /&gt;
&lt;br /&gt;
=== 8/11/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Meeting was canceled due to lack of availability.''&lt;br /&gt;
&lt;br /&gt;
=== 7/14/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Introduced Clark, the intern funded by the R Consortium. Clark is a graduate student at UC Davis. He will work on ddR integration with Spark, as well as on improving the core ddR API, such as adding a distributed apply() for matrices, a split function, etc.&lt;br /&gt;
* Bernd: Can I play around with ddR now? What backend should I use? How robust is the code?&lt;br /&gt;
** Clark: It's in good enough shape to be played around with. We will continue to improve it. Hopefully the Spark integration will be done before the end of my internship in September.&lt;br /&gt;
* Q: Is anyone working on using ddR to make ML scale better?&lt;br /&gt;
** Indrajit: We have kmeans, glm, etc. already on CRAN.&lt;br /&gt;
** Michael Kane: We are working on glmnet and other packages related to algorithm development.&lt;br /&gt;
* Javier gave a demo of sparklyr and sparkapi.&lt;br /&gt;
** Motivation for the package: The SparkR package overrides the dplyr interface, which is an issue for RStudio. SparkR is not a CRAN package, which makes it difficult to contribute changes. dplyr is RStudio's most popular tool and is broken under SparkR.&lt;br /&gt;
** Sparklyr provides a dplyr interface. It will also support ML-like interfaces, such as consuming an ML model.&lt;br /&gt;
** Sparklyr does not currently support any distributed computing features. Instead, we can recommend ddR as the distributed computing framework on top of sparkapi. We will put the code on CRAN in a couple of weeks.&lt;br /&gt;
** Simon: Can you talk more about the wrapper/low level API to work with Spark?&lt;br /&gt;
*** Javier: The package underneath the covers is called &amp;quot;sparkapi&amp;quot;; it is meant to be used by package builders. &amp;quot;spark_context()&amp;quot; and &amp;quot;invoke()&amp;quot; are the functions used to call Scala methods. It does not currently allow you to run R user-defined functions; I am working on enabling that feature. Depending upon the interest in using ddR with sparkapi, I can spend more time making sparkapi feature rich.&lt;br /&gt;
** Indrajit: What versions of Spark are supported&lt;br /&gt;
*** Javier: Anything after 1.6&lt;br /&gt;
** Bernd: How do you export data?&lt;br /&gt;
*** Javier: We are using all the code from SparkR. So everything in SparkR should continue to work. We don't need to change SparkR. We just need to maintain compatibility.&lt;br /&gt;
** Bernd: What happens when the RDDs are very large?&lt;br /&gt;
*** Javier: Spark will spill to disk.&lt;br /&gt;
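The spark_context()/invoke() convention Javier describes can be sketched like this (assuming a local Spark installation; spark_connect(), spark_context(), and invoke() are the sparklyr/sparkapi entry points named in the discussion):

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# spark_context() returns a reference to the JVM SparkContext object;
# invoke() calls an arbitrary Scala/Java method on such a reference
ctx <- spark_context(sc)
invoke(ctx, "version")             # Spark version string
invoke(ctx, "defaultParallelism")  # default partition count

spark_disconnect(sc)
```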
* Michael Kane: Presented examples that he implemented on ddR.&lt;br /&gt;
** Talked about how the different distributed packages compare to each other in terms of functionality.&lt;br /&gt;
** Michael K. looked at glm and truncated SVD on ddR. He was able to implement IRLS on ddR by adding two distributed functions, such as &amp;quot;cross&amp;quot;. For truncated SVD he only needed to overload two distributed multiplications.&lt;br /&gt;
** Ran these algorithms on the 1000 Genomes dataset.&lt;br /&gt;
** Overall liked ddR since it was easy to implement the algorithms in the package.&lt;br /&gt;
** New ideas:&lt;br /&gt;
*** Trying to separate the data layer from the execution layer&lt;br /&gt;
*** Create an API that works on &amp;quot;chunks&amp;quot; (which is similar to the &amp;quot;parts&amp;quot; API in ddR). Would like to add these APIs to ddR.&lt;br /&gt;
*** Indrajit: You should be able to get some of the chunk-like features by using parts and dmapply. E.g., you can call dmapply to read 10 different files, which now correspond to 10 chunks. These are wrapped as a darray or dframe, but you can continue to work on the individual chunks by using parts(i).&lt;br /&gt;
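The parts()/dmapply() pattern Indrajit describes might look like the following sketch, using ddR's default backend built on the parallel package (exact partitioning semantics can differ across backends):

```r
library(ddR)  # default backend uses the parallel package

# A distributed list of squared values, split into 4 partitions
dl <- dmapply(function(x) x^2, 1:8, nparts = 4)

# Chunk-wise processing: dmapply over parts() runs the function once per
# partition (each chunk arrives as a list) instead of once per element
chunk_sums <- dmapply(function(chunk) sum(unlist(chunk)), parts(dl))

# Gather the per-chunk results back into the local R session
collect(chunk_sums)
```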
&lt;br /&gt;
&lt;br /&gt;
=== 6/2/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Make a common abstraction/interface to make it easier to work with distributed data in R&lt;br /&gt;
**  Unify the interface&lt;br /&gt;
**  The working group will run for a year. Get an API defined, and get at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on those aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* The R Consortium may be able to figure out ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not trying to create their own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier will present SparkR.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Ddr2016</id>
		<title>Ddr2016</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Ddr2016"/>
				<updated>2017-05-01T17:59:37Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: Created page with &amp;quot;= Distributed Computing Working Group Progress Report: 2016 =  == Introduction ==   Data sizes continue to increase, while single core performance has stagnated. We scale comp...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Distributed Computing Working Group Progress Report: 2016 =&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data sizes continue to increase, while single core performance has&lt;br /&gt;
stagnated. We scale computations by leveraging multiple cores and&lt;br /&gt;
machines. Large datasets are expensive to replicate, so we minimize&lt;br /&gt;
data movement by moving the computation to the data. Many systems,&lt;br /&gt;
such as Hadoop, Spark, and massively parallel processing (MPP)&lt;br /&gt;
databases, have emerged to support these strategies, and each exposes&lt;br /&gt;
its own unique interface, with little standardization. &lt;br /&gt;
&lt;br /&gt;
Developing and executing an algorithm in the distributed context is a&lt;br /&gt;
complex task that requires specific knowledge of and dependency on the&lt;br /&gt;
system storing the data. It is also a task orthogonal to the primary&lt;br /&gt;
role of a data scientist or statistician: extracting knowledge from&lt;br /&gt;
data. The task thus falls to the data analysis environment, which&lt;br /&gt;
should mask the complexity behind a familiar interface, maintaining&lt;br /&gt;
user productivity. However, it is not always feasible to automatically&lt;br /&gt;
determine the optimal strategy for a given problem, so user input is&lt;br /&gt;
often beneficial. The environment should only abstract the details to&lt;br /&gt;
the extent deemed appropriate by the user.&lt;br /&gt;
&lt;br /&gt;
R needs a standardized, layered and idiomatic abstraction for&lt;br /&gt;
computing on distributed data structures. R has many packages that&lt;br /&gt;
provide parallelism constructs as well as bridges to distributed&lt;br /&gt;
systems such as Hadoop. Unfortunately, each interface has its own&lt;br /&gt;
syntax, parallelism techniques, and supported platform(s).  As a&lt;br /&gt;
consequence, contributors are forced to learn multiple idiosyncratic&lt;br /&gt;
interfaces, and to restrict each implementation to a particular&lt;br /&gt;
interface, thus limiting the applicability and adoption of their&lt;br /&gt;
software and hampering interoperability.&lt;br /&gt;
&lt;br /&gt;
The idea of a unified interface stemmed from a cross-industry workshop&lt;br /&gt;
organized at HP Labs in early 2015. The workshop was attended by&lt;br /&gt;
different companies, universities, and R-core members. Immediately&lt;br /&gt;
after the workshop, Indrajit Roy, Edward Ma, and Michael Lawrence began&lt;br /&gt;
designing an abstraction that later became known as the CRAN package&lt;br /&gt;
ddR (Distributed Data in R)[1]. It declares a unified API for distributed&lt;br /&gt;
computing in R and ensures that R programs written using the API are&lt;br /&gt;
portable across different systems, such as Distributed R, Spark, etc.&lt;br /&gt;
&lt;br /&gt;
The ddR package has completed its initial phase of development; the&lt;br /&gt;
first release is now on CRAN. Three ddR machine-learning algorithms&lt;br /&gt;
are also on CRAN, randomForest.ddR, glm.ddR, and kmeans.ddR. Two&lt;br /&gt;
reference backends for ddR have been completed, one for R’s parallel&lt;br /&gt;
package, and one for HP Distributed R. Example code and scripts to run&lt;br /&gt;
algorithms and code on both of these backends are available in our&lt;br /&gt;
public repository at https://github.com/vertica/ddR.&lt;br /&gt;
&lt;br /&gt;
The overarching goal of the ddR project was for it to be a starting&lt;br /&gt;
point in a collaborative effort, ultimately leading to a standard API&lt;br /&gt;
for working with distributed data in R.  We decided that it was&lt;br /&gt;
natural for the R Consortium to sponsor the collaboration, as it&lt;br /&gt;
should involve both industry and R-core members. To this end, we&lt;br /&gt;
established the R Consortium Working Group on Distributed Computing,&lt;br /&gt;
with a planned duration of a single year and the following aims:&lt;br /&gt;
&lt;br /&gt;
# Agree on the goal of the group, i.e., we should have a unifying framework for distributed computing. Define a success metric.&lt;br /&gt;
# Brainstorm on what primitives should be included in the API. We can use ddR’s API of distributed data structures and dmapply as the starting proposal. Understand the relationship with existing packages such as parallel, foreach, etc.&lt;br /&gt;
# Explore how a ddR-like interface will interact with databases. Are there connections or redundancies with dplyr and multidplyr?&lt;br /&gt;
# Decide on a reference implementation for the API.&lt;br /&gt;
# Decide on whether we should also implement a few ecosystem packages, e.g., distributed algorithms written using the API.&lt;br /&gt;
&lt;br /&gt;
We declared the following milestones:&lt;br /&gt;
&lt;br /&gt;
# Mid-year milestone: Finalize the API. Decide who will help with developing the top-level implementation and backends.&lt;br /&gt;
# End-year milestone: Summary report and reference implementation. Socialize the final package.&lt;br /&gt;
&lt;br /&gt;
This report outlines the progress we have made on the above goals and&lt;br /&gt;
milestones, and how we plan to continue progress in the second half of&lt;br /&gt;
the working group term.&lt;br /&gt;
&lt;br /&gt;
== Results and Current Status ==&lt;br /&gt;
&lt;br /&gt;
The working group has achieved the first goal by agreeing that we&lt;br /&gt;
should aim for a unifying distributed computing abstraction, and we&lt;br /&gt;
have treated ddR as an informal API proposal.&lt;br /&gt;
&lt;br /&gt;
We have discussed many of the issues related to the second goal,&lt;br /&gt;
deciding which primitives should be part of the API.  We aim for the&lt;br /&gt;
API to support three shapes of data --- lists, arrays and data frames&lt;br /&gt;
--- and to enable the loading and basic manipulation of distributed&lt;br /&gt;
data, including multiple modes of functional iteration (e.g., apply()&lt;br /&gt;
operations). We aim to preserve consistency with base R data&lt;br /&gt;
structures and functions, so as to provide a simple path for users to&lt;br /&gt;
port computations to distributed systems.&lt;br /&gt;
&lt;br /&gt;
The ddR constructs permit a user to express a wide variety of&lt;br /&gt;
applications, including machine-learning algorithms, that will run on&lt;br /&gt;
different backends. We have successfully implemented distributed&lt;br /&gt;
versions of algorithms such as K-means, Regression, Random Forest, and&lt;br /&gt;
PageRank using the ddR API. Some of these ddR algorithms are now&lt;br /&gt;
available on CRAN.  In addition, the package provides several generic&lt;br /&gt;
definitions of common operators (such as colSums) that can be invoked&lt;br /&gt;
on distributed objects residing in the supporting backends.&lt;br /&gt;
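For example, a familiar base-R call such as colSums can be used unchanged on a distributed array. The sketch below is written from memory of the ddR darray() constructor, so the argument names may differ slightly from the released package:

```r
library(ddR)

# A 4x4 distributed array of ones, stored as 2x2 partitions
da <- darray(dim = c(4, 4), psize = c(2, 2), data = 1)

# colSums dispatches on the distributed type, so the backend computes
# the column sums across partitions without collecting the data locally
colSums(da)
```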
&lt;br /&gt;
Each custom ddR backend is encapsulated in its own driver package. In&lt;br /&gt;
the conventional style of functional OOP, the driver registers methods&lt;br /&gt;
for generics declared by the backend API, such that ddR can dispatch&lt;br /&gt;
the backend-specific instructions by only calling the generics.&lt;br /&gt;
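A hypothetical driver might register its methods along these lines (the class and generic names below are purely illustrative, not the actual ddR backend API):

```r
library(methods)

# A generic declared by the front-end API (illustrative name)
setGeneric("do_dmapply",
           function(driver, FUN, ...) standardGeneric("do_dmapply"))

# A driver package defines a class for its backend...
setClass("localDriver", representation(ncores = "numeric"))

# ...and registers a method for the generic; the front end dispatches
# through the generic without knowing which backend is active
setMethod("do_dmapply", "localDriver", function(driver, FUN, ...) {
  mapply(FUN, ..., SIMPLIFY = FALSE)  # trivial serial execution
})

drv <- new("localDriver", ncores = 2)
do_dmapply(drv, `+`, 1:3, 4:6)  # list(5, 7, 9)
```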
&lt;br /&gt;
The working group explored potential new backends with the aim of&lt;br /&gt;
broadening the applicability of the ddR interface. We hosted&lt;br /&gt;
presentations from external speakers on Spark and TensorFlow, and also&lt;br /&gt;
considered a generic SQL backend. The discussion focused on Spark&lt;br /&gt;
integration, and the R Consortium-funded intern Clark Fitzgerald took&lt;br /&gt;
on the task of developing a prototype Spark backend. The development&lt;br /&gt;
of the Spark backend encountered some obstacles, including the&lt;br /&gt;
immaturity of Spark and its R interfaces. Development is currently&lt;br /&gt;
paused, as we await additional funding.&lt;br /&gt;
&lt;br /&gt;
During the monthly meetings, the working group deliberated on&lt;br /&gt;
different design improvements for ddR itself. We list two key topics&lt;br /&gt;
that were discussed.  First, Michael Kane and Bryan Lewis argued for a&lt;br /&gt;
lower level API that directly operates on chunks of data. While ddR&lt;br /&gt;
supports chunk-wise data processing, via a combination of dmapply()&lt;br /&gt;
and parts(), its focus on distributed data structures means that&lt;br /&gt;
the chunk-based processing is exposed as the manipulation of these&lt;br /&gt;
data structures. Second, Clark Fitzgerald proposed restructuring the&lt;br /&gt;
ddR code into two layers that include chunk-wise processing while&lt;br /&gt;
retaining the emphasis on distributed data structures[2]. The lower&lt;br /&gt;
level API, which will interface with backends, will use a Map() like&lt;br /&gt;
primitive to evaluate functions on chunks of data, while the higher&lt;br /&gt;
level ddR API will expose distributed data structures, dmapply, and&lt;br /&gt;
other convenience functions. This refactoring would facilitate the&lt;br /&gt;
implementation of additional backends.&lt;br /&gt;
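The proposed two-layer split can be sketched in a few lines of plain R (all names here are hypothetical; a real backend would replace the lapply with distributed execution):

```r
# Lower layer: a Map()-like primitive over chunks that each backend
# implements; this reference version just runs serially
chunk_map <- function(chunks, f) lapply(chunks, f)

# Higher layer: a "distributed" list is a plain list of chunks plus
# metadata; dmapply-style operations are built on top of chunk_map
as_chunks <- function(x, nparts) {
  split(x, cut(seq_along(x), nparts, labels = FALSE))
}

chunks  <- as_chunks(1:10, nparts = 2)
partial <- chunk_map(chunks, sum)   # per-chunk work
Reduce(`+`, partial)                # combine: 55
```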
&lt;br /&gt;
== Discussion and Future Plans ==&lt;br /&gt;
&lt;br /&gt;
The R Consortium-funded working group and internship has helped us&lt;br /&gt;
start a conversation on distributed computing APIs for R.  The ddR&lt;br /&gt;
CRAN package is a concrete outcome of this working group, and serves&lt;br /&gt;
as a platform for exploring APIs and their integration with different&lt;br /&gt;
backends. While ddR is still maturing, we have arrived at a consensus&lt;br /&gt;
for how we should improve and finalize the ddR API.&lt;br /&gt;
&lt;br /&gt;
As part of our goal for a reference implementation, we aim to develop&lt;br /&gt;
one or more prototype backends that will make the ddR interface useful&lt;br /&gt;
in practice. A good candidate backend is any open-source system that&lt;br /&gt;
is effective at R use cases and has strong community support. Spark&lt;br /&gt;
remains a viable candidate, and we also aim to further explore&lt;br /&gt;
TensorFlow.&lt;br /&gt;
&lt;br /&gt;
We plan for a second intern to perform three tasks: (1) refactor the&lt;br /&gt;
ddR API to a more final form, (2) compare Spark and TensorFlow in&lt;br /&gt;
detail, with an eye towards the feasibility of implementing a useful&lt;br /&gt;
backend, and (3) implement a prototype backend based on Spark or&lt;br /&gt;
TensorFlow, depending on the recommendation of the working group.&lt;br /&gt;
&lt;br /&gt;
By the conclusion of the working group, it will have produced:&lt;br /&gt;
* A stable version of the ddR package and at least one practical backend, released on CRAN,&lt;br /&gt;
* A list of requirements that are relevant and of interest to the community but have not yet been met by ddR, including alternative implementations that remain independent,&lt;br /&gt;
* A list of topics that the group believes worthy of further investigation.&lt;br /&gt;
&lt;br /&gt;
[1] http://h30507.www3.hp.com/t5/Behind-the-scenes-Labs/Enhancing-R-for-Distributed-Computing/ba-p/6795535#.VjE1K7erQQj&lt;br /&gt;
&lt;br /&gt;
[2] Clark Fitzgerald. https://github.com/vertica/ddR/wiki/Design&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2017-05-01T17:56:12Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (ISC liaison, RStudio)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
* Bryan Lewis (Paradigm4)&lt;br /&gt;
* Hossein Falaki (Databricks)&lt;br /&gt;
&lt;br /&gt;
== Reports ==&lt;br /&gt;
&lt;br /&gt;
[[ddr2016|2016 Progress Report]]&lt;br /&gt;
&lt;br /&gt;
== Milestones ==&lt;br /&gt;
&lt;br /&gt;
=== Achieved ===&lt;br /&gt;
&lt;br /&gt;
* Adopt ddR as a prototype for a standard API for distributed computing in R&lt;br /&gt;
&lt;br /&gt;
=== 2016 Internship ===&lt;br /&gt;
&lt;br /&gt;
Clark Fitzgerald, a PhD student in the UC Davis Statistics department, worked on ddR and Spark integration.&lt;br /&gt;
&lt;br /&gt;
* Wrote [https://github.com/clarkfitzg/sparklite sparklite] and [https://github.com/clarkfitzg/rddlist rddlist] as minimal proof-of-concept R packages to connect and store general data on Spark. [https://docs.google.com/presentation/d/1WfUQ2ockNku90GWMXonEhUEcVOWcgBmWwt5uYSSBYPY/edit?usp=sharing slides]&lt;br /&gt;
* [https://issues.apache.org/jira/browse/SPARK-16785 Patched SparkR] to allow user defined functions returning binary columns. This allows implementation of different data structures in SparkR.&lt;br /&gt;
** Updated [https://github.com/vertica/ddR/wiki/Design design documents] with suggested changes to ddR's internal design and object-oriented model.&lt;br /&gt;
* Improved [https://github.com/vertica/ddR/pull/15 testing and ddR internals].&lt;br /&gt;
&lt;br /&gt;
=== Outstanding ===&lt;br /&gt;
&lt;br /&gt;
* Agree on a final standard API for distributed computing in R&lt;br /&gt;
* Implement at least one scalable backend based on an open-source technology like Spark, SQL, etc&lt;br /&gt;
&lt;br /&gt;
== Open Questions ==&lt;br /&gt;
&lt;br /&gt;
* How can we address the needs of both the end user data scientists and the algorithm implementers?&lt;br /&gt;
* How should we share data between R and a system like Spark?&lt;br /&gt;
* Is there any way to unify SparkR and sparklyr?&lt;br /&gt;
* Could we use the abstractions of TensorFlow to partially or fully integrate with platforms like Spark?&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
=== 03/09/2017 ===&lt;br /&gt;
* Talk by Bryan Lewis&lt;br /&gt;
** Has created a GitHub page with notes on Singularity.&lt;br /&gt;
** Singularity is a container technology for HPC applications&lt;br /&gt;
** No daemon. Minimal virtualization to get the application running. Lightweight with very low overheads.&lt;br /&gt;
** Used widely in supercomputers&lt;br /&gt;
** All distributed computing platforms, even with R skins, are difficult to use.&lt;br /&gt;
** Containers make it much easier to abstract away the long tail of software dependencies and focus on R&lt;br /&gt;
** Demonstrated an example of using Singularity with TensorFlow&lt;br /&gt;
** Tried MPI and dopar on the 1000 Genomes data&lt;br /&gt;
** The program parses the variant data and stores chunks as files, then runs principal components on each file.&lt;br /&gt;
** Overloaded matrix operations to use foreach/MPI underneath.&lt;br /&gt;
** Overall: use existing R operators and overload them with the appropriate backend.&lt;br /&gt;
** Will spend time working on TensorFlow, e.g., take a number of algorithms such as PCA and write them on top of TensorFlow using existing R primitives.&lt;br /&gt;
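The overloading approach can be illustrated with a toy S4 wrapper (serial here; a real backend would push each block product through foreach/MPI or TensorFlow):

```r
library(methods)

# Wrapper marking matrices whose operations go to a backend; blocks
# hold row-wise partitions of the underlying matrix
setClass("backedMatrix", representation(blocks = "list"))

setMethod("%*%", signature("backedMatrix", "matrix"), function(x, y) {
  # each block-row product is independent, so a backend could run
  # them on separate workers and rbind the results
  do.call(rbind, lapply(x@blocks, function(b) b %*% y))
})

m  <- matrix(1, 4, 3)
bm <- new("backedMatrix", blocks = list(m[1:2, ], m[3:4, ]))
all.equal(bm %*% matrix(1, 3, 2), m %*% matrix(1, 3, 2))  # TRUE
```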
&lt;br /&gt;
=== 12/08/2016 ===&lt;br /&gt;
* Yuan Tang from Uptake was the presenter&lt;br /&gt;
** Michael and Indrajit will write a status report for the working group sometime in December or January&lt;br /&gt;
** Yuan gave an overview of TensorFlow&lt;br /&gt;
** JJ, Dirk and Yuan are working on R layer for TensorFlow &lt;br /&gt;
** TensorFlow is a platform for machine learning as well as other computations (even math proofs).&lt;br /&gt;
** It is GPU-optimized and distributed.&lt;br /&gt;
** It is used in search, speech recognition, Google photos, etc.&lt;br /&gt;
** TensorFlow computations are directed graphs. Nodes are operations and edges are tensors.&lt;br /&gt;
** A lot of array, matrix, etc. operations are available&lt;br /&gt;
** The backend is mostly C++. A Python front end exists.&lt;br /&gt;
** The TensorFlow R package is based on the Python front end&lt;br /&gt;
** In multi-device setting, TensorFlow figures out which devices to use and manages communication between devices. &lt;br /&gt;
** Computations are fault tolerant&lt;br /&gt;
** Yuan has previously worked on Scikit Flow, which is now TF.Learn. It’s an easy transition for scikit-learn users.&lt;br /&gt;
** Yuan gave a brief overview of the python interface&lt;br /&gt;
** TensorFlow in R handles conversion between R and Python. The syntax is very similar to the Python API&lt;br /&gt;
** Future work: adding more examples and tutorials, and integration with a Kubernetes/Marathon-like framework.&lt;br /&gt;
** During the Q/A there were questions related to whether R kernels can be supported in TensorFlow, and whether R dataframes are a natural wrapper for TensorFlow objects.&lt;br /&gt;
&lt;br /&gt;
=== 11/10/2016 ===&lt;br /&gt;
&lt;br /&gt;
* SparkR slides were presented by Hossein Falaki and Shivaram from Databricks and UC Berkeley:&lt;br /&gt;
** SparkR was a prototype from AMPLab (2014). Initially it had the RDD API and was similar to the PySpark API&lt;br /&gt;
** In 2015, with the merge into upstream Spark, the decision was made to integrate with the DataFrame API and hide the RDD API&lt;br /&gt;
** In 2016, more MLlib algorithms were integrated and new APIs added. A CRAN package will be released soon&lt;br /&gt;
** The original SparkR architecture runs R on the master, which communicates with the JVM processes in the driver. The driver sends commands to the worker JVM processes, which execute them as Scala/Java statements.&lt;br /&gt;
** The system can read distributed data inside the JVM from different sources such as S3, HDFS, etc.&lt;br /&gt;
** The driver has a socket-based connection between SparkR and the RBackend. RBackend runs on the JVM, deserializes the R code, and converts the R statements into Java calls.&lt;br /&gt;
** collect() and createDataFrame() are used to move data between R and JVM processes. createDataFrame() converts local R data into a JVM-based distributed data frame.&lt;br /&gt;
** The API has I/O, caching, MLlib, and SQL-related commands&lt;br /&gt;
** Since Spark 2.0, R processes can run inside the JVM worker processes, so there is no need to keep long-running R processes.&lt;br /&gt;
** There are three UDF functions: (1) lapply, which runs a function on each element of a list; (2) dapply, which runs a function on each partition of a data frame (you have to be careful about how the data is partitioned); and (3) gapply, which groups by the given columns and then runs the function on each group.&lt;br /&gt;
** The new CRAN package's install.spark() will automatically download and install Spark. Automated CRAN checks have been added for every commit. Should be available with Spark 2.1.0&lt;br /&gt;
&lt;br /&gt;
* Q/A &lt;br /&gt;
** Currently trying to get a zero-copy dataframe between Python and Spark. Spark 2.0 has an off-heap manager that uses Arrow. Once this feature is tested on the Python API, the next step will be integrating R.&lt;br /&gt;
** Spark dataframes benefit from plan optimizations; this is not SparkR specific. R UDFs are still treated as black boxes by the optimizer&lt;br /&gt;
** Spark doesn't directly support matrices, and there is no immediate intent to do so. One can store an array or vector as a single column of a Spark dataframe.&lt;br /&gt;
&lt;br /&gt;
=== 10/13/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Mario Inchiosa: Microsoft's perspective on distributed computing with R&lt;br /&gt;
** Microsoft R Server: abstractions and algorithms for distributed computation on top of open-source R&lt;br /&gt;
** Desired features of a distributed API like ddR:&lt;br /&gt;
*** Supports PEMA (initialize, processData, updateResults, processResults)&lt;br /&gt;
*** Cross-platform&lt;br /&gt;
*** Fast runtime&lt;br /&gt;
*** Supports algorithm writer and data scientist&lt;br /&gt;
*** Comes with a comprehensive set of algorithms&lt;br /&gt;
*** Easy deployment&lt;br /&gt;
** ddR is making good progress but does not yet meet those requirements&lt;br /&gt;
* Indrajit: ddR progress report and next steps&lt;br /&gt;
** Recap of Clark's internship&lt;br /&gt;
** Next step: implement some of Clark's design suggestions: https://github.com/vertica/ddR/wiki/Design&lt;br /&gt;
** Spark integration will be based on sparklyr&lt;br /&gt;
** Should we limit Spark interaction to the DataFrame API or directly interact with RDDs?&lt;br /&gt;
*** Consensus: will likely need flexibility of RDDs to implement everything we need, e.g., arrays and lists&lt;br /&gt;
** Clark and Javier raised concerns about the scalability of sharing data between R and Spark&lt;br /&gt;
*** Michael: Spark is a platform in its own right, so interoperability is important, should figure something out&lt;br /&gt;
*** Bryan Lewis: Why not use the tensor abstraction from TensorFlow? Spark supports TensorFlow and an R interface is already in the works.&lt;br /&gt;
** Michael raised the issue of additional funding from the R Consortium to continue Clark's work&lt;br /&gt;
*** Joe Rickert suggested that the working group develop one or more white papers summarizing the findings of the working group for presentation to the Infrastructure Steering Committee.&lt;br /&gt;
*** Consensus was in favor of this, and several pointed out that the progress so far has been worthwhile, despite not meeting the specific goals laid out in the proposal.&lt;br /&gt;
* Michael: do we want to invite some external speakers, one per meeting, from groups like databricks, tensorflow, etc?&lt;br /&gt;
** Consensus was in favor.&lt;br /&gt;
&lt;br /&gt;
=== 9/8/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Clark Fitzgerald: internship report&lt;br /&gt;
** Developed two packages for low-level Spark integration: rddlist, sparklite&lt;br /&gt;
** Patched a bug in Spark&lt;br /&gt;
** ddR needs refactoring before Spark integration is feasible:&lt;br /&gt;
*** dlist, dframe, and darray should be formal classes.&lt;br /&gt;
*** Partitions of data should be represented by a distributed list abstraction, and most functions (e.g., dmapply) should be implemented on top of that list.&lt;br /&gt;
* Javier: sparklyr update&lt;br /&gt;
** Preparing for CRAN release&lt;br /&gt;
** Mario: what happened to sparkapi?&lt;br /&gt;
*** Javier: sparkapi has been merged into sparklyr in order to avoid overhead of maintaining two packages. ddR can do everything it needs with sparklyr.&lt;br /&gt;
* Luke Tierney: Update on the low-level vector abstraction, which might support interfaces like ddR and sparklyr.&lt;br /&gt;
** Overall approach seems feasible, but still working out a few details.&lt;br /&gt;
** Will land in a branch soon.&lt;br /&gt;
* Bernd Bischl: update on the batchtools package&lt;br /&gt;
** Successor to BatchJobs, based on an in-memory database&lt;br /&gt;
&lt;br /&gt;
=== 8/11/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Meeting was canceled due to lack of availability.''&lt;br /&gt;
&lt;br /&gt;
=== 7/14/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Introduced Clark, the intern funded by the R Consortium. Clark is a graduate student at UC Davis. He will work on ddR integration with Spark, as well as improving the core ddR API, such as adding a distributed apply() for matrices, a split function, etc.&lt;br /&gt;
* Bernd: Can I play around with ddR now? What backend should I use? How robust is the code?&lt;br /&gt;
** Clark: It's in good enough shape to play around with. We will continue to improve it. Hopefully the Spark integration will be done before the end of my internship in September.&lt;br /&gt;
* Q: Is anyone working on using ddR to make ML scale better?&lt;br /&gt;
** Indrajit: We have kmeans, glm, etc. already in CRAN.&lt;br /&gt;
** Michael Kane: We are working on glmnet and other packages related to algorithm development.&lt;br /&gt;
* Javier gave a demo of sparklyr and sparkapi.&lt;br /&gt;
** Motivation for the package: the SparkR package overrides the dplyr interface, which is an issue for RStudio. SparkR is not a CRAN package, which makes it difficult to contribute changes. dplyr is RStudio's most popular tool and is broken on SparkR.&lt;br /&gt;
** Sparklyr provides a dplyr interface. It will also support ML-like interfaces, such as consuming an ML model.&lt;br /&gt;
** Sparklyr does not currently support any distributed computing features. Instead, we can recommend ddR as the distributed computing framework on top of sparkapi. We will put the code on CRAN in a couple of weeks.&lt;br /&gt;
** Simon: Can you talk more about the wrapper/low level API to work with Spark?&lt;br /&gt;
*** Javier: The package under the covers is called &amp;quot;sparkapi&amp;quot;; it is intended for package builders. &amp;quot;spark_context()&amp;quot; and &amp;quot;invoke()&amp;quot; provide the functionality to call Scala methods. It does not currently allow you to run R user-defined functions; I am working on enabling that feature. Depending on the interest in using ddR with sparkapi, I can spend more time making sparkapi feature-rich.&lt;br /&gt;
** Indrajit: What versions of Spark are supported?&lt;br /&gt;
*** Javier: Anything after 1.6&lt;br /&gt;
** Bernd: How do you export data?&lt;br /&gt;
*** Javier: We are reusing all the code from SparkR, so everything in SparkR should continue to work. We don't need to change SparkR; we just need to maintain compatibility.&lt;br /&gt;
** Bernd: What happens when the RDDs are very large?&lt;br /&gt;
*** Javier: Spark will spill to disk.&lt;br /&gt;
* Michael Kane: Presented examples that he implemented on ddR.&lt;br /&gt;
** Talked about how the different distributed packages compare to each other in terms of functionality.&lt;br /&gt;
** Michael K. looked at glm and truncated SVD on ddR. He was able to implement IRLS on ddR by adding two distributed functions (e.g., &amp;quot;cross&amp;quot;). For truncated SVD he only needed to overload two different distributed multiplications.&lt;br /&gt;
** Ran these algorithms on the 1000 genome dataset.&lt;br /&gt;
** Overall liked ddR since it was easy to implement the algorithms in the package.&lt;br /&gt;
** New ideas:&lt;br /&gt;
*** Trying to separate the data layer from the execution layer&lt;br /&gt;
*** Create an API that works on &amp;quot;chunks&amp;quot; (which is similar to the &amp;quot;parts&amp;quot; API in ddR). Would like to add these APIs to ddR.&lt;br /&gt;
*** Indrajit: You should be able to get some of the chunk-like features by using parts and dmapply. E.g., you can call dmapply to read 10 different files, which then correspond to 10 chunks. These are wrapped as a darray or dframe, but you can continue to work on the individual chunks by using parts(i).&lt;br /&gt;
&lt;br /&gt;
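A minimal sketch of the parts/dmapply pattern Indrajit describes above (a hedged illustration: the file names are hypothetical, and this assumes the ddR package with its default backend):&lt;br /&gt;
&lt;pre&gt;
library(ddR)

# Read 10 hypothetical files into the 10 partitions of a distributed list
files &lt;- paste0("chunk", 1:10, ".csv")
dl &lt;- dmapply(function(f) read.csv(f), files)

# parts() exposes handles to the individual chunks;
# collect() brings one of them into local R memory
p1 &lt;- parts(dl)[[1]]
chunk1 &lt;- collect(p1)
&lt;/pre&gt;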
&lt;br /&gt;
=== 6/2/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Make a common abstraction/interfaces to make it easier to work with distributed data and R &lt;br /&gt;
**  Unify the interface  &lt;br /&gt;
**  The working group will run for a year: get an API defined and at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on those aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* The R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not trying to create his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-12-09T18:48:40Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: Yuan works at Uptake&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (ISC liaison, RStudio)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
* Bryan Lewis (Paradigm4)&lt;br /&gt;
* Hossein Falaki (Databricks)&lt;br /&gt;
&lt;br /&gt;
== Milestones ==&lt;br /&gt;
&lt;br /&gt;
=== Achieved ===&lt;br /&gt;
&lt;br /&gt;
* Adopt ddR as a prototype for a standard API for distributed computing in R&lt;br /&gt;
&lt;br /&gt;
=== 2016 Internship ===&lt;br /&gt;
&lt;br /&gt;
Clark Fitzgerald, a PhD student in the UC Davis Statistics department, worked on ddR and Spark integration.&lt;br /&gt;
&lt;br /&gt;
* Wrote [https://github.com/clarkfitzg/sparklite sparklite] and [https://github.com/clarkfitzg/rddlist rddlist] as minimal proof-of-concept R packages to connect and store general data on Spark. [https://docs.google.com/presentation/d/1WfUQ2ockNku90GWMXonEhUEcVOWcgBmWwt5uYSSBYPY/edit?usp=sharing slides]&lt;br /&gt;
* [https://issues.apache.org/jira/browse/SPARK-16785 Patched SparkR] to allow user defined functions returning binary columns. This allows implementation of different data structures in SparkR.&lt;br /&gt;
* Updated [https://github.com/vertica/ddR/wiki/Design design documents] with suggested changes to ddR's internal design and object-oriented model.&lt;br /&gt;
* Improved [https://github.com/vertica/ddR/pull/15 testing and ddR internals].&lt;br /&gt;
&lt;br /&gt;
=== Outstanding ===&lt;br /&gt;
&lt;br /&gt;
* Agree on a final standard API for distributed computing in R&lt;br /&gt;
* Implement at least one scalable backend based on an open-source technology like Spark, SQL, etc&lt;br /&gt;
&lt;br /&gt;
== Open Questions ==&lt;br /&gt;
&lt;br /&gt;
* How can we address the needs of both the end user data scientists and the algorithm implementers?&lt;br /&gt;
* How should we share data between R and a system like Spark?&lt;br /&gt;
* Is there any way to unify SparkR and sparklyr?&lt;br /&gt;
* Could we use the abstractions of tensorflow to partially or fully integrate with platforms like Spark?&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
=== 12/08/2016 ===&lt;br /&gt;
* Yuan Tang from Uptake was the presenter&lt;br /&gt;
** Michael and Indrajit will write a status report for the working group sometime in December or January&lt;br /&gt;
** Yuan gave an overview of TensorFlow&lt;br /&gt;
** JJ, Dirk and Yuan are working on R layer for TensorFlow &lt;br /&gt;
** TensorFlow is a platform for machine learning as well as other computations (even math proofs).&lt;br /&gt;
** It is GPU optimized and distributed. &lt;br /&gt;
** It is used in search, speech recognition, Google photos, etc.&lt;br /&gt;
** TensorFlow computations are directed graphs. Nodes are operations and edges are tensors.&lt;br /&gt;
** A lot of array, matrix, etc. operations are available&lt;br /&gt;
** Backend is mostly C++. Python front end exists. &lt;br /&gt;
** The TensorFlow R package is based on the Python frontend&lt;br /&gt;
** In multi-device setting, TensorFlow figures out which devices to use and manages communication between devices. &lt;br /&gt;
** Computations are fault tolerant&lt;br /&gt;
** Yuan has previously worked on Scikit Flow, which is now TF.Learn. It's an easy transition for scikit-learn users.&lt;br /&gt;
** Yuan gave a brief overview of the python interface&lt;br /&gt;
** TensorFlow in R handles conversion between R and Python. Syntax is very similar to python API&lt;br /&gt;
** Future work: Adding more examples and tutorials, integration with Kubernetes/Marathon like framework.&lt;br /&gt;
** During the Q/A there were questions related to whether R kernels can be supported in TensorFlow, and whether R dataframes are a natural wrapper for TensorFlow objects.&lt;br /&gt;
&lt;br /&gt;
=== 11/10/2016 ===&lt;br /&gt;
&lt;br /&gt;
* SparkR slides were presented by Hossein Falaki and Shivaram from Databricks and UC Berkeley:&lt;br /&gt;
** SparkR was a prototype from AMPLab (2014). Initially it had the RDD API and was similar to the PySpark API&lt;br /&gt;
** In 2015, with the merge into upstream Spark, the decision was made to integrate with the DataFrame API and hide the RDD API&lt;br /&gt;
** In 2016 more MLLib algorithms have been integrated and new APIs have been added. A CRAN package will be released soon&lt;br /&gt;
** In the original SparkR architecture, R runs on the master and communicates with the JVM processes in the driver. The driver sends commands to the worker JVM processes, which execute them as Scala/Java statements.&lt;br /&gt;
** The system can read distributed data inside the JVM from different sources such as S3, HDFS, etc.&lt;br /&gt;
** The driver has a socket based connection between SparkR and the RBackend. RBackend runs on the JVM, deserializes the R code, and converts the R statements into Java calls.&lt;br /&gt;
** collect() and createDataFrame() are used to move data between R and JVM processes. createDataFrame will convert your local R data into a JVM based distributed data frame.&lt;br /&gt;
** The API has IO, Caching, MLLib, and SQL related commands&lt;br /&gt;
** Since Spark 2.0, we can run R processes inside the JVM worker processes. There is no need to keep long running R processes. &lt;br /&gt;
** There are 3 UDF functions: (1) lapply, which runs a function on each value of a list; (2) dapply, which runs a function on each partition of a data frame (you have to be careful about how data is partitioned); and (3) gapply, which groups on the given column names and then runs the function on each group.&lt;br /&gt;
** In the new CRAN package, install.spark() will automatically download and install Spark. Automated CRAN checks now run on every commit. Should be available with Spark 2.1.0&lt;br /&gt;
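&lt;br /&gt;
A minimal sketch of the three UDF entry points above (assuming a running SparkR 2.x session, where the list version is exposed as spark.lapply; the schemas and column names here are illustrative, not from the talk):&lt;br /&gt;
&lt;pre&gt;
library(SparkR)
sparkR.session()

# Run an R function over each element of a local list
squares &lt;- spark.lapply(1:4, function(x) x^2)

# dapply: apply a function to each partition of a SparkDataFrame;
# the output schema must be declared up front
df &lt;- createDataFrame(mtcars)
schema &lt;- structType(structField("mpg", "double"),
                     structField("mpg2", "double"))
df2 &lt;- dapply(df, function(part) {
  data.frame(mpg = part$mpg, mpg2 = part$mpg * 2)
}, schema)

# gapply: group on a column, then apply a function to each group
res &lt;- gapply(df, "cyl", function(key, grp) {
  data.frame(key, avg_mpg = mean(grp$mpg))
}, structType(structField("cyl", "double"),
              structField("avg_mpg", "double")))
&lt;/pre&gt;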
&lt;br /&gt;
* Q/A &lt;br /&gt;
** Currently trying to get zero-copy dataframes between Python and Spark. Spark 2.0 has an off-heap manager that uses Arrow. Once this feature is tested on the Python API, the next step will be integrating R.&lt;br /&gt;
** Spark dataframes benefit from plan optimizations; this is not SparkR-specific. R UDFs are still treated as black boxes by the optimizer&lt;br /&gt;
** Spark doesn't directly support matrices, and there is no immediate intent to do so. One can store an array or vector as a single column of a Spark dataframe.&lt;br /&gt;
&lt;br /&gt;
=== 10/13/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Mario Inchiosa: Microsoft's perspective on distributed computing with R&lt;br /&gt;
** Microsoft R Server: abstractions and algorithms for distributed computation on top of open-source R&lt;br /&gt;
** Desired features of a distributed API like ddR:&lt;br /&gt;
*** Supports PEMA (initialize, processData, updateResults, processResults)&lt;br /&gt;
*** Cross-platform&lt;br /&gt;
*** Fast runtime&lt;br /&gt;
*** Supports algorithm writer and data scientist&lt;br /&gt;
*** Comes with a comprehensive set of algorithms&lt;br /&gt;
*** Easy deployment&lt;br /&gt;
** ddR is making good progress but does not yet meet those requirements&lt;br /&gt;
* Indrajit: ddR progress report and next steps&lt;br /&gt;
** Recap of Clark's internship&lt;br /&gt;
** Next step: implement some of Clark's design suggestions: https://github.com/vertica/ddR/wiki/Design&lt;br /&gt;
** Spark integration will be based on sparklyr&lt;br /&gt;
** Should we limit Spark interaction to the DataFrame API or directly interact with RDDs?&lt;br /&gt;
*** Consensus: will likely need flexibility of RDDs to implement everything we need, e.g., arrays and lists&lt;br /&gt;
** Clark and Javier raised concerns about the scalability of sharing data between R and Spark&lt;br /&gt;
*** Michael: Spark is a platform in its own right, so interoperability is important, should figure something out&lt;br /&gt;
*** Bryan Lewis: Why not use the tensor abstraction from TensorFlow? Spark supports TensorFlow, and an R interface is already in the works.&lt;br /&gt;
** Michael raised the issue of additional funding from the R Consortium to continue Clark's work&lt;br /&gt;
*** Joe Rickert suggested that the working group develop one or more white papers summarizing the findings of the working group for presentation to the Infrastructure Steering Committee.&lt;br /&gt;
*** Consensus was in favor of this, and several pointed out that the progress so far has been worthwhile, despite not meeting the specific goals laid out in the proposal.&lt;br /&gt;
* Michael: do we want to invite some external speakers, one per meeting, from groups like databricks, tensorflow, etc?&lt;br /&gt;
** Consensus was in favor.&lt;br /&gt;
&lt;br /&gt;
=== 9/8/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Clark Fitzgerald: internship report&lt;br /&gt;
** Developed two packages for low-level Spark integration: rddlist, sparklite&lt;br /&gt;
** Patched a bug in Spark&lt;br /&gt;
** ddR needs refactoring before Spark integration is feasible:&lt;br /&gt;
*** dlist, dframe, and darray should be formal classes.&lt;br /&gt;
*** Partitions of data should be represented by a distributed list abstraction, and most functions (e.g., dmapply) should be implemented on top of that list.&lt;br /&gt;
* Javier: sparklyr update&lt;br /&gt;
** Preparing for CRAN release&lt;br /&gt;
** Mario: what happened to sparkapi?&lt;br /&gt;
*** Javier: sparkapi has been merged into sparklyr in order to avoid overhead of maintaining two packages. ddR can do everything it needs with sparklyr.&lt;br /&gt;
* Luke Tierney: Update on the low-level vector abstraction, which might support interfaces like ddR and sparklyr.&lt;br /&gt;
** Overall approach seems feasible, but still working out a few details.&lt;br /&gt;
** Will land in a branch soon.&lt;br /&gt;
* Bernd Bischl: update on the batchtools package&lt;br /&gt;
** Successor to BatchJobs, based on an in-memory database&lt;br /&gt;
&lt;br /&gt;
=== 8/11/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Meeting was canceled due to lack of availability.''&lt;br /&gt;
&lt;br /&gt;
=== 7/14/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Introduced Clark, the intern funded by the R Consortium. Clark is a graduate student at UC Davis. He will work on ddR integration with Spark, as well as improving the core ddR API, such as adding a distributed apply() for matrices, a split function, etc.&lt;br /&gt;
* Bernd: Can I play around with ddR now? What backend should I use? How robust is the code?&lt;br /&gt;
** Clark: It's in good enough shape to play around with. We will continue to improve it. Hopefully the Spark integration will be done before the end of my internship in September.&lt;br /&gt;
* Q: Is anyone working on using ddR to make ML scale better?&lt;br /&gt;
** Indrajit: We have kmeans, glm, etc. already in CRAN.&lt;br /&gt;
** Michael Kane: We are working on glmnet and other packages related to algorithm development.&lt;br /&gt;
* Javier gave a demo of sparklyr and sparkapi.&lt;br /&gt;
** Motivation for the package: the SparkR package overrides the dplyr interface, which is an issue for RStudio. SparkR is not a CRAN package, which makes it difficult to contribute changes. dplyr is RStudio's most popular tool and is broken on SparkR.&lt;br /&gt;
** Sparklyr provides a dplyr interface. It will also support ML-like interfaces, such as consuming an ML model.&lt;br /&gt;
** Sparklyr does not currently support any distributed computing features. Instead, we can recommend ddR as the distributed computing framework on top of sparkapi. We will put the code on CRAN in a couple of weeks.&lt;br /&gt;
** Simon: Can you talk more about the wrapper/low level API to work with Spark?&lt;br /&gt;
*** Javier: The package under the covers is called &amp;quot;sparkapi&amp;quot;; it is intended for package builders. &amp;quot;spark_context()&amp;quot; and &amp;quot;invoke()&amp;quot; provide the functionality to call Scala methods. It does not currently allow you to run R user-defined functions; I am working on enabling that feature. Depending on the interest in using ddR with sparkapi, I can spend more time making sparkapi feature-rich.&lt;br /&gt;
** Indrajit: What versions of Spark are supported?&lt;br /&gt;
*** Javier: Anything after 1.6&lt;br /&gt;
** Bernd: How do you export data?&lt;br /&gt;
*** Javier: We are reusing all the code from SparkR, so everything in SparkR should continue to work. We don't need to change SparkR; we just need to maintain compatibility.&lt;br /&gt;
** Bernd: What happens when the RDDs are very large?&lt;br /&gt;
*** Javier: Spark will spill to disk.&lt;br /&gt;
* Michael Kane: Presented examples that he implemented on ddR.&lt;br /&gt;
** Talked about how the different distributed packages compare to each other in terms of functionality.&lt;br /&gt;
** Michael K. looked at glm and truncated SVD on ddR. He was able to implement IRLS on ddR by adding two distributed functions (e.g., &amp;quot;cross&amp;quot;). For truncated SVD he only needed to overload two different distributed multiplications.&lt;br /&gt;
** Ran these algorithms on the 1000 genome dataset.&lt;br /&gt;
** Overall liked ddR since it was easy to implement the algorithms in the package.&lt;br /&gt;
** New ideas:&lt;br /&gt;
*** Trying to separate the data layer from the execution layer&lt;br /&gt;
*** Create an API that works on &amp;quot;chunks&amp;quot; (which is similar to the &amp;quot;parts&amp;quot; API in ddR). Would like to add these APIs to ddR.&lt;br /&gt;
*** Indrajit: You should be able to get some of the chunk-like features by using parts and dmapply. E.g., you can call dmapply to read 10 different files, which then correspond to 10 chunks. These are wrapped as a darray or dframe, but you can continue to work on the individual chunks by using parts(i).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 6/2/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Make a common abstraction/interfaces to make it easier to work with distributed data and R &lt;br /&gt;
**  Unify the interface  &lt;br /&gt;
**  The working group will run for a year: get an API defined and at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on those aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* The R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not trying to create his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-11-10T21:14:05Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: add Hossein Falaki&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (ISC liaison, RStudio)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
* Bryan Lewis (Paradigm4)&lt;br /&gt;
* Hossein Falaki (Databricks)&lt;br /&gt;
&lt;br /&gt;
== Milestones ==&lt;br /&gt;
&lt;br /&gt;
=== Achieved ===&lt;br /&gt;
&lt;br /&gt;
* Adopt ddR as a prototype for a standard API for distributed computing in R&lt;br /&gt;
&lt;br /&gt;
=== 2016 Internship ===&lt;br /&gt;
&lt;br /&gt;
Clark Fitzgerald, a PhD student in the UC Davis Statistics department, worked on ddR and Spark integration.&lt;br /&gt;
&lt;br /&gt;
* Wrote [https://github.com/clarkfitzg/sparklite sparklite] and [https://github.com/clarkfitzg/rddlist rddlist] as minimal proof-of-concept R packages to connect and store general data on Spark. [https://docs.google.com/presentation/d/1WfUQ2ockNku90GWMXonEhUEcVOWcgBmWwt5uYSSBYPY/edit?usp=sharing slides]&lt;br /&gt;
* [https://issues.apache.org/jira/browse/SPARK-16785 Patched SparkR] to allow user defined functions returning binary columns. This allows implementation of different data structures in SparkR.&lt;br /&gt;
* Updated [https://github.com/vertica/ddR/wiki/Design design documents] with suggested changes to ddR's internal design and object-oriented model.&lt;br /&gt;
* Improved [https://github.com/vertica/ddR/pull/15 testing and ddR internals].&lt;br /&gt;
&lt;br /&gt;
=== Outstanding ===&lt;br /&gt;
&lt;br /&gt;
* Agree on a final standard API for distributed computing in R&lt;br /&gt;
* Implement at least one scalable backend based on an open-source technology like Spark, SQL, etc&lt;br /&gt;
&lt;br /&gt;
== Open Questions ==&lt;br /&gt;
&lt;br /&gt;
* How can we address the needs of both the end user data scientists and the algorithm implementers?&lt;br /&gt;
* How should we share data between R and a system like Spark?&lt;br /&gt;
* Is there any way to unify SparkR and sparklyr?&lt;br /&gt;
* Could we use the abstractions of tensorflow to partially or fully integrate with platforms like Spark?&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
&lt;br /&gt;
=== 10/13/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Mario Inchiosa: Microsoft's perspective on distributed computing with R&lt;br /&gt;
** Microsoft R Server: abstractions and algorithms for distributed computation on top of open-source R&lt;br /&gt;
** Desired features of a distributed API like ddR:&lt;br /&gt;
*** Supports PEMA (initialize, processData, updateResults, processResults)&lt;br /&gt;
*** Cross-platform&lt;br /&gt;
*** Fast runtime&lt;br /&gt;
*** Supports algorithm writer and data scientist&lt;br /&gt;
*** Comes with a comprehensive set of algorithms&lt;br /&gt;
*** Easy deployment&lt;br /&gt;
** ddR is making good progress but does not yet meet those requirements&lt;br /&gt;
* Indrajit: ddR progress report and next steps&lt;br /&gt;
** Recap of Clark's internship&lt;br /&gt;
** Next step: implement some of Clark's design suggestions: https://github.com/vertica/ddR/wiki/Design&lt;br /&gt;
** Spark integration will be based on sparklyr&lt;br /&gt;
** Should we limit Spark interaction to the DataFrame API or directly interact with RDDs?&lt;br /&gt;
*** Consensus: will likely need flexibility of RDDs to implement everything we need, e.g., arrays and lists&lt;br /&gt;
** Clark and Javier raised concerns about the scalability of sharing data between R and Spark&lt;br /&gt;
*** Michael: Spark is a platform in its own right, so interoperability is important, should figure something out&lt;br /&gt;
*** Bryan Lewis: Why not use the tensor abstraction from TensorFlow? Spark supports TensorFlow, and an R interface is already in the works.&lt;br /&gt;
** Michael raised the issue of additional funding from the R Consortium to continue Clark's work&lt;br /&gt;
*** Joe Rickert suggested that the working group develop one or more white papers summarizing the findings of the working group for presentation to the Infrastructure Steering Committee.&lt;br /&gt;
*** Consensus was in favor of this, and several pointed out that the progress so far has been worthwhile, despite not meeting the specific goals laid out in the proposal.&lt;br /&gt;
* Michael: do we want to invite some external speakers, one per meeting, from groups like databricks, tensorflow, etc?&lt;br /&gt;
** Consensus was in favor.&lt;br /&gt;
&lt;br /&gt;
=== 9/8/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Clark Fitzgerald: internship report&lt;br /&gt;
** Developed two packages for low-level Spark integration: rddlist, sparklite&lt;br /&gt;
** Patched a bug in Spark&lt;br /&gt;
** ddR needs refactoring before Spark integration is feasible:&lt;br /&gt;
*** dlist, dframe, and darray should be formal classes.&lt;br /&gt;
*** Partitions of data should be represented by a distributed list abstraction, and most functions (e.g., dmapply) should be implemented on top of that list.&lt;br /&gt;
* Javier: sparklyr update&lt;br /&gt;
** Preparing for CRAN release&lt;br /&gt;
** Mario: what happened to sparkapi?&lt;br /&gt;
*** Javier: sparkapi has been merged into sparklyr in order to avoid overhead of maintaining two packages. ddR can do everything it needs with sparklyr.&lt;br /&gt;
* Luke Tierney: Update on the low-level vector abstraction, which might support interfaces like ddR and sparklyr.&lt;br /&gt;
** Overall approach seems feasible, but still working out a few details.&lt;br /&gt;
** Will land in a branch soon.&lt;br /&gt;
* Bernd Bischl: update on the batchtools package&lt;br /&gt;
** Successor to BatchJobs, based on an in-memory database&lt;br /&gt;
&lt;br /&gt;
=== 8/11/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Meeting was canceled due to lack of availability.''&lt;br /&gt;
&lt;br /&gt;
=== 7/14/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Introduced Clark, the intern funded by the R Consortium. Clark is a graduate student at UC Davis. He will work on ddR integration with Spark, as well as improving the core ddR API, such as adding a distributed apply() for matrices, a split function, etc.&lt;br /&gt;
* Bernd: Can I play around with ddR now? What backend should I use? How robust is the code?&lt;br /&gt;
** Clark: It's in good enough shape to play around with. We will continue to improve it. Hopefully the Spark integration will be done before the end of my internship in September.&lt;br /&gt;
* Q: Is anyone working on using ddR to make ML scale better?&lt;br /&gt;
** Indrajit: We have kmeans, glm, etc. already in CRAN.&lt;br /&gt;
** Michael Kane: We are working on glmnet and other packages related to algorithm development.&lt;br /&gt;
* Javier gave a demo of sparklyr and sparkapi.&lt;br /&gt;
** Motivation for the pacakage: The SparkR package overrides the dplyr interface. This is an issue for RStudio. SparkR is not a CRAN package which makes it difficult to add changes. dplyr is the most popular tool by RStudio and is broken on SparkR.&lt;br /&gt;
** Sparklyr provides a dplyr interface. It will also support ML like interfaces, such as consuming a ML model.&lt;br /&gt;
** Sparklyr does not currently support any distributed computing features. Instead, we can recommend ddR as the distributed computing framework on top of sparkapi. We will put the code on CRAN in a couple of weeks.&lt;br /&gt;
** Simon: Can you talk more about the wrapper/low level API to work with Spark?&lt;br /&gt;
*** Javier: The package under the covers is called &amp;quot;sparkapi&amp;quot;; it is meant for package builders. &amp;quot;spark_context()&amp;quot; and &amp;quot;invoke()&amp;quot; provide the functionality to call Scala methods. It does not currently allow you to run R user-defined functions; I am working on enabling that feature. Depending on the interest in using ddR with sparkapi, I can spend more time making sparkapi feature rich.&lt;br /&gt;
** Indrajit: What versions of Spark are supported?&lt;br /&gt;
*** Javier: Anything after 1.6&lt;br /&gt;
** Bernd: How do you export data?&lt;br /&gt;
*** Javier: We are using all the code from SparkR. So everything in SparkR should continue to work. We don't need to change SparkR. We just need to maintain compatibility.&lt;br /&gt;
** Bernd: What happens when the RDDs are very large?&lt;br /&gt;
*** Javier: Spark will spill to disk.&lt;br /&gt;
* Michael Kane: Presented examples that he implemented on ddR.&lt;br /&gt;
** Talked about how the different distributed packages compare to each other in terms of functionality.&lt;br /&gt;
** Michael K. looked at glm and truncated SVD on ddR. He was able to implement IRLS on ddR by adding two distributed functions, such as &amp;quot;cross&amp;quot;. For truncated SVD he only needed to overload two different distributed multiplications.&lt;br /&gt;
** Ran these algorithms on the 1000 Genomes dataset.&lt;br /&gt;
** Overall liked ddR since it was easy to implement the algorithms in the package.&lt;br /&gt;
** New ideas:&lt;br /&gt;
*** Trying to separate the data layer from the execution layer&lt;br /&gt;
*** Create an API that works on &amp;quot;chunks&amp;quot; (which is similar to the &amp;quot;parts&amp;quot; API in ddR). Would like to add these APIs to ddR.&lt;br /&gt;
*** Indrajit: You should be able to get some of the chunk-like features by using parts and dmapply. E.g., you can call dmapply to read 10 different files, which then correspond to 10 chunks. These are wrapped as a darray or dframe, but you can continue to work on the individual chunks by using parts(i).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 6/2/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Create a common abstraction/interface to make it easier to work with distributed data in R&lt;br /&gt;
**  Unify the interface&lt;br /&gt;
**  Working group will run for a year. Get an API defined, and get at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on specific aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and what people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from the algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not creating his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-10-18T16:52:07Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: change Joe's affiliation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (ISC liaison, RStudio)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
* Bryan Lewis (Paradigm4)&lt;br /&gt;
&lt;br /&gt;
== Milestones ==&lt;br /&gt;
&lt;br /&gt;
=== Achieved ===&lt;br /&gt;
&lt;br /&gt;
* Adopt ddR as a prototype for a standard API for distributed computing in R&lt;br /&gt;
* Explore the interfacing of R and Spark in the context of the ddR package (Clark Fitzgerald)&lt;br /&gt;
&lt;br /&gt;
=== Outstanding ===&lt;br /&gt;
&lt;br /&gt;
* Agree on a final standard API for distributed computing in R&lt;br /&gt;
* Implement at least one scalable backend based on an open-source technology like Spark, SQL, etc.&lt;br /&gt;
&lt;br /&gt;
== Open Questions ==&lt;br /&gt;
&lt;br /&gt;
* How can we address the needs of both the end user data scientists and the algorithm implementers?&lt;br /&gt;
* How should we share data between R and a system like Spark?&lt;br /&gt;
* Is there any way to unify SparkR and sparklyr?&lt;br /&gt;
* Could we use the abstractions of tensorflow to partially or fully integrate with platforms like Spark?&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
&lt;br /&gt;
=== 10/13/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Mario Inchiosa: Microsoft's perspective on distributed computing with R&lt;br /&gt;
** Microsoft R Server: abstractions and algorithms for distributed computation on top of open-source R&lt;br /&gt;
** Desired features of a distributed API like ddR:&lt;br /&gt;
*** Supports PEMA (initialize, processData, updateResults, processResults)&lt;br /&gt;
*** Cross-platform&lt;br /&gt;
*** Fast runtime&lt;br /&gt;
*** Supports algorithm writer and data scientist&lt;br /&gt;
*** Comes with a comprehensive set of algorithms&lt;br /&gt;
*** Easy deployment&lt;br /&gt;
** ddR is making good progress but does not yet meet those requirements&lt;br /&gt;
* Indrajit: ddR progress report and next steps&lt;br /&gt;
** Recap of Clark's internship&lt;br /&gt;
** Next step: implement some of Clark's design suggestions: https://github.com/vertica/ddR/wiki/Design&lt;br /&gt;
** Spark integration will be based on sparklyr&lt;br /&gt;
** Should we limit Spark interaction to the DataFrame API or directly interact with RDDs?&lt;br /&gt;
*** Consensus: will likely need flexibility of RDDs to implement everything we need, e.g., arrays and lists&lt;br /&gt;
** Clark and Javier raised concerns about the scalability of sharing data between R and Spark&lt;br /&gt;
*** Michael: Spark is a platform in its own right, so interoperability is important, should figure something out&lt;br /&gt;
*** Bryan Lewis: Why not use tensor abstraction from tensorflow? Spark supports tensorflow and an R interface is already in the works.&lt;br /&gt;
** Michael raised the issue of additional funding from the R Consortium to continue Clark's work&lt;br /&gt;
*** Joe Rickert suggested that the working group develop one or more white papers summarizing the findings of the working group for presentation to the Infrastructure Steering Committee.&lt;br /&gt;
*** Consensus was in favor of this, and several pointed out that the progress so far has been worthwhile, despite not meeting the specific goals laid out in the proposal.&lt;br /&gt;
* Michael: do we want to invite some external speakers, one per meeting, from groups like databricks, tensorflow, etc?&lt;br /&gt;
** Consensus was in favor.&lt;br /&gt;
&lt;br /&gt;
=== 9/8/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Clark Fitzgerald: internship report&lt;br /&gt;
** Developed two packages for low-level Spark integration: rddlist, sparklite&lt;br /&gt;
** Patched a bug in Spark&lt;br /&gt;
** ddR needs refactoring before Spark integration is feasible:&lt;br /&gt;
*** dlist, dframe, and darray should be formal classes.&lt;br /&gt;
*** Partitions of data should be represented by a distributed list abstraction, and most functions (e.g., dmapply) should be implemented on top of that list.&lt;br /&gt;
* Javier: sparklyr update&lt;br /&gt;
** Preparing for CRAN release&lt;br /&gt;
** Mario: what happened to sparkapi?&lt;br /&gt;
*** Javier: sparkapi has been merged into sparklyr in order to avoid overhead of maintaining two packages. ddR can do everything it needs with sparklyr.&lt;br /&gt;
* Luke Tierney: Update on the low-level vector abstraction, which might support interfaces like ddR and sparklyr.&lt;br /&gt;
** Overall approach seems feasible, but still working out a few details.&lt;br /&gt;
** Will land in a branch soon.&lt;br /&gt;
* Bernd Bischl: update on the batchtools package&lt;br /&gt;
** Successor to BatchJobs, based on an in-memory database&lt;br /&gt;
&lt;br /&gt;
=== 8/11/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Meeting was canceled due to lack of availability.''&lt;br /&gt;
&lt;br /&gt;
=== 7/14/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Introduced Clark, the intern funded by the R Consortium. Clark is a graduate student at UC Davis. He will work on ddR integration with Spark and on improving the core ddR API, e.g., adding a distributed apply() for matrices, a split function, etc.&lt;br /&gt;
* Bernd: Can I play around with ddR now? What backend should I use? How robust is the code?&lt;br /&gt;
** Clark: It's in good enough shape to be played around with. We will continue to improve it. Hopefully the Spark integration will be done before the end of my internship in September.&lt;br /&gt;
* Q: Is anyone working on using ddR to make ML scale better?&lt;br /&gt;
** Indrajit: We have kmeans, glm, etc. already in CRAN.&lt;br /&gt;
** Michael Kane: We are working on glmnet and other packages related to algorithm development.&lt;br /&gt;
* Javier gave a demo of sparklyr and sparkapi.&lt;br /&gt;
** Motivation for the package: The SparkR package overrides the dplyr interface, which is an issue for RStudio. SparkR is not a CRAN package, which makes it difficult to contribute changes. dplyr is RStudio's most popular tool, and it is broken on SparkR.&lt;br /&gt;
** Sparklyr provides a dplyr interface. It will also support ML like interfaces, such as consuming a ML model.&lt;br /&gt;
** Sparklyr does not currently support any distributed computing features. Instead, we can recommend ddR as the distributed computing framework on top of sparkapi. We will put the code on CRAN in a couple of weeks.&lt;br /&gt;
** Simon: Can you talk more about the wrapper/low level API to work with Spark?&lt;br /&gt;
*** Javier: The package under the covers is called &amp;quot;sparkapi&amp;quot;; it is meant for package builders. &amp;quot;spark_context()&amp;quot; and &amp;quot;invoke()&amp;quot; provide the functionality to call Scala methods. It does not currently allow you to run R user-defined functions; I am working on enabling that feature. Depending on the interest in using ddR with sparkapi, I can spend more time making sparkapi feature rich.&lt;br /&gt;
** Indrajit: What versions of Spark are supported?&lt;br /&gt;
*** Javier: Anything after 1.6&lt;br /&gt;
** Bernd: How do you export data?&lt;br /&gt;
*** Javier: We are using all the code from SparkR. So everything in SparkR should continue to work. We don't need to change SparkR. We just need to maintain compatibility.&lt;br /&gt;
** Bernd: What happens when the RDDs are very large?&lt;br /&gt;
*** Javier: Spark will spill to disk.&lt;br /&gt;
* Michael Kane: Presented examples that he implemented on ddR.&lt;br /&gt;
** Talked about how the different distributed packages compare to each other in terms of functionality.&lt;br /&gt;
** Michael K. looked at glm and truncated SVD on ddR. He was able to implement IRLS on ddR by adding two distributed functions, such as &amp;quot;cross&amp;quot;. For truncated SVD he only needed to overload two different distributed multiplications.&lt;br /&gt;
** Ran these algorithms on the 1000 Genomes dataset.&lt;br /&gt;
** Overall liked ddR since it was easy to implement the algorithms in the package.&lt;br /&gt;
** New ideas:&lt;br /&gt;
*** Trying to separate the data layer from the execution layer&lt;br /&gt;
*** Create an API that works on &amp;quot;chunks&amp;quot; (which is similar to the &amp;quot;parts&amp;quot; API in ddR). Would like to add these APIs to ddR.&lt;br /&gt;
*** Indrajit: You should be able to get some of the chunk-like features by using parts and dmapply. E.g., you can call dmapply to read 10 different files, which then correspond to 10 chunks. These are wrapped as a darray or dframe, but you can continue to work on the individual chunks by using parts(i).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 6/2/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Create a common abstraction/interface to make it easier to work with distributed data in R&lt;br /&gt;
**  Unify the interface&lt;br /&gt;
**  Working group will run for a year. Get an API defined, and get at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on specific aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and what people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from the algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not creating his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-10-18T16:27:16Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (Microsoft)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
* Bryan Lewis (Paradigm4)&lt;br /&gt;
&lt;br /&gt;
== Milestones ==&lt;br /&gt;
&lt;br /&gt;
=== Achieved ===&lt;br /&gt;
&lt;br /&gt;
* Adopt ddR as a prototype for a standard API for distributed computing in R&lt;br /&gt;
* Explore the interfacing of R and Spark in the context of the ddR package (Clark Fitzgerald)&lt;br /&gt;
&lt;br /&gt;
=== Outstanding ===&lt;br /&gt;
&lt;br /&gt;
* Agree on a final standard API for distributed computing in R&lt;br /&gt;
* Implement at least one scalable backend based on an open-source technology like Spark, SQL, etc.&lt;br /&gt;
&lt;br /&gt;
== Open Questions ==&lt;br /&gt;
&lt;br /&gt;
* How can we address the needs of both the end user data scientists and the algorithm implementers?&lt;br /&gt;
* How should we share data between R and a system like Spark?&lt;br /&gt;
* Is there any way to unify SparkR and sparklyr?&lt;br /&gt;
* Could we use the abstractions of tensorflow to partially or fully integrate with platforms like Spark?&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
&lt;br /&gt;
=== 10/13/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Mario Inchiosa: Microsoft's perspective on distributed computing with R&lt;br /&gt;
** Microsoft R Server: abstractions and algorithms for distributed computation on top of open-source R&lt;br /&gt;
** Desired features of a distributed API like ddR:&lt;br /&gt;
*** Supports PEMA (initialize, processData, updateResults, processResults)&lt;br /&gt;
*** Cross-platform&lt;br /&gt;
*** Fast runtime&lt;br /&gt;
*** Supports algorithm writer and data scientist&lt;br /&gt;
*** Comes with a comprehensive set of algorithms&lt;br /&gt;
*** Easy deployment&lt;br /&gt;
** ddR is making good progress but does not yet meet those requirements&lt;br /&gt;
* Indrajit: ddR progress report and next steps&lt;br /&gt;
** Recap of Clark's internship&lt;br /&gt;
** Next step: implement some of Clark's design suggestions: https://github.com/vertica/ddR/wiki/Design&lt;br /&gt;
** Spark integration will be based on sparklyr&lt;br /&gt;
** Should we limit Spark interaction to the DataFrame API or directly interact with RDDs?&lt;br /&gt;
*** Consensus: will likely need flexibility of RDDs to implement everything we need, e.g., arrays and lists&lt;br /&gt;
** Clark and Javier raised concerns about the scalability of sharing data between R and Spark&lt;br /&gt;
*** Michael: Spark is a platform in its own right, so interoperability is important, should figure something out&lt;br /&gt;
*** Bryan Lewis: Why not use tensor abstraction from tensorflow? Spark supports tensorflow and an R interface is already in the works.&lt;br /&gt;
** Michael raised the issue of additional funding from the R Consortium to continue Clark's work&lt;br /&gt;
*** Joe Rickert suggested that the working group develop one or more white papers summarizing the findings of the working group for presentation to the Infrastructure Steering Committee.&lt;br /&gt;
*** Consensus was in favor of this, and several pointed out that the progress so far has been worthwhile, despite not meeting the specific goals laid out in the proposal.&lt;br /&gt;
* Michael: do we want to invite some external speakers, one per meeting, from groups like databricks, tensorflow, etc?&lt;br /&gt;
** Consensus was in favor.&lt;br /&gt;
&lt;br /&gt;
=== 9/8/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Clark Fitzgerald: internship report&lt;br /&gt;
** Developed two packages for low-level Spark integration: rddlist, sparklite&lt;br /&gt;
** Patched a bug in Spark&lt;br /&gt;
** ddR needs refactoring before Spark integration is feasible:&lt;br /&gt;
*** dlist, dframe, and darray should be formal classes.&lt;br /&gt;
*** Partitions of data should be represented by a distributed list abstraction, and most functions (e.g., dmapply) should be implemented on top of that list.&lt;br /&gt;
* Javier: sparklyr update&lt;br /&gt;
** Preparing for CRAN release&lt;br /&gt;
** Mario: what happened to sparkapi?&lt;br /&gt;
*** Javier: sparkapi has been merged into sparklyr in order to avoid overhead of maintaining two packages. ddR can do everything it needs with sparklyr.&lt;br /&gt;
* Luke Tierney: Update on the low-level vector abstraction, which might support interfaces like ddR and sparklyr.&lt;br /&gt;
** Overall approach seems feasible, but still working out a few details.&lt;br /&gt;
** Will land in a branch soon.&lt;br /&gt;
* Bernd Bischl: update on the batchtools package&lt;br /&gt;
** Successor to BatchJobs, based on an in-memory database&lt;br /&gt;
&lt;br /&gt;
=== 8/11/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Meeting was canceled due to lack of availability.''&lt;br /&gt;
&lt;br /&gt;
=== 7/14/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Introduced Clark, the intern funded by the R Consortium. Clark is a graduate student at UC Davis. He will work on ddR integration with Spark and on improving the core ddR API, e.g., adding a distributed apply() for matrices, a split function, etc.&lt;br /&gt;
* Bernd: Can I play around with ddR now? What backend should I use? How robust is the code?&lt;br /&gt;
** Clark: It's in good enough shape to be played around with. We will continue to improve it. Hopefully the Spark integration will be done before the end of my internship in September.&lt;br /&gt;
* Q: Is anyone working on using ddR to make ML scale better?&lt;br /&gt;
** Indrajit: We have kmeans, glm, etc. already in CRAN.&lt;br /&gt;
** Michael Kane: We are working on glmnet and other packages related to algorithm development.&lt;br /&gt;
* Javier gave a demo of sparklyr and sparkapi.&lt;br /&gt;
** Motivation for the package: The SparkR package overrides the dplyr interface, which is an issue for RStudio. SparkR is not a CRAN package, which makes it difficult to contribute changes. dplyr is RStudio's most popular tool, and it is broken on SparkR.&lt;br /&gt;
** Sparklyr provides a dplyr interface. It will also support ML like interfaces, such as consuming a ML model.&lt;br /&gt;
** Sparklyr does not currently support any distributed computing features. Instead, we can recommend ddR as the distributed computing framework on top of sparkapi. We will put the code on CRAN in a couple of weeks.&lt;br /&gt;
** Simon: Can you talk more about the wrapper/low level API to work with Spark?&lt;br /&gt;
*** Javier: The package under the covers is called &amp;quot;sparkapi&amp;quot;; it is meant for package builders. &amp;quot;spark_context()&amp;quot; and &amp;quot;invoke()&amp;quot; provide the functionality to call Scala methods. It does not currently allow you to run R user-defined functions; I am working on enabling that feature. Depending on the interest in using ddR with sparkapi, I can spend more time making sparkapi feature rich.&lt;br /&gt;
** Indrajit: What versions of Spark are supported?&lt;br /&gt;
*** Javier: Anything after 1.6&lt;br /&gt;
** Bernd: How do you export data?&lt;br /&gt;
*** Javier: We are using all the code from SparkR. So everything in SparkR should continue to work. We don't need to change SparkR. We just need to maintain compatibility.&lt;br /&gt;
** Bernd: What happens when the RDDs are very large?&lt;br /&gt;
*** Javier: Spark will spill to disk.&lt;br /&gt;
* Michael Kane: Presented examples that he implemented on ddR.&lt;br /&gt;
** Talked about how the different distributed packages compare to each other in terms of functionality.&lt;br /&gt;
** Michael K. looked at glm and truncated SVD on ddR. He was able to implement IRLS on ddR by adding two distributed functions, such as &amp;quot;cross&amp;quot;. For truncated SVD he only needed to overload two different distributed multiplications.&lt;br /&gt;
** Ran these algorithms on the 1000 Genomes dataset.&lt;br /&gt;
** Overall liked ddR since it was easy to implement the algorithms in the package.&lt;br /&gt;
** New ideas:&lt;br /&gt;
*** Trying to separate the data layer from the execution layer&lt;br /&gt;
*** Create an API that works on &amp;quot;chunks&amp;quot; (which is similar to the &amp;quot;parts&amp;quot; API in ddR). Would like to add these APIs to ddR.&lt;br /&gt;
*** Indrajit: You should be able to get some of the chunk-like features by using parts and dmapply. E.g., you can call dmapply to read 10 different files, which then correspond to 10 chunks. These are wrapped as a darray or dframe, but you can continue to work on the individual chunks by using parts(i).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 6/2/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Create a common abstraction/interface to make it easier to work with distributed data in R&lt;br /&gt;
**  Unify the interface&lt;br /&gt;
**  Working group will run for a year. Get an API defined, and get at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on specific aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and what people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from the algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not creating his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-10-18T11:52:48Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: October minutes&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (Microsoft)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
&lt;br /&gt;
=== 10/13/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Mario Inchiosa: Microsoft's perspective on distributed computing with R&lt;br /&gt;
** Microsoft R Server: abstractions and algorithms for distributed computation on top of open-source R&lt;br /&gt;
** Desired features of a distributed API like ddR:&lt;br /&gt;
*** Supports PEMA (initialize, processData, updateResults, processResults)&lt;br /&gt;
*** Cross-platform&lt;br /&gt;
*** Fast runtime&lt;br /&gt;
*** Supports algorithm writer and data scientist&lt;br /&gt;
*** Comes with a comprehensive set of algorithms&lt;br /&gt;
*** Easy deployment&lt;br /&gt;
** ddR is making good progress but does not yet meet those requirements&lt;br /&gt;
* Indrajit: ddR progress report and next steps&lt;br /&gt;
** Recap of Clark's internship&lt;br /&gt;
** Next step: implement some of Clark's design suggestions: https://github.com/vertica/ddR/wiki/Design&lt;br /&gt;
** Spark integration will be based on sparklyr&lt;br /&gt;
** Should we limit Spark interaction to the DataFrame API or directly interact with RDDs?&lt;br /&gt;
*** Consensus: will likely need flexibility of RDDs to implement everything we need, e.g., arrays and lists&lt;br /&gt;
** Clark and Javier raised concerns about the scalability of sharing data between R and Spark&lt;br /&gt;
*** Michael: Spark is a platform in its own right, so interoperability is important, should figure something out&lt;br /&gt;
*** Bryan Lewis: Why not use tensor abstraction from tensorflow? Spark supports tensorflow and an R interface is already in the works.&lt;br /&gt;
** Michael raised the issue of additional funding from the R Consortium to continue Clark's work&lt;br /&gt;
*** Joe Rickert suggested that the working group develop one or more white papers summarizing the findings of the working group for presentation to the Infrastructure Steering Committee.&lt;br /&gt;
*** Consensus was in favor of this, and several pointed out that the progress so far has been worthwhile, despite not meeting the specific goals laid out in the proposal.&lt;br /&gt;
* Michael: do we want to invite some external speakers, one per meeting, from groups like databricks, tensorflow, etc?&lt;br /&gt;
** Consensus was in favor.&lt;br /&gt;
&lt;br /&gt;
=== 9/8/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed minutes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Clark Fitzgerald: internship report&lt;br /&gt;
** Developed two packages for low-level Spark integration: rddlist, sparklite&lt;br /&gt;
** Patched a bug in Spark&lt;br /&gt;
** ddR needs refactoring before Spark integration is feasible:&lt;br /&gt;
*** dlist, dframe, and darray should be formal classes.&lt;br /&gt;
*** Partitions of data should be represented by a distributed list abstraction, and most functions (e.g., dmapply) should be implemented on top of that list.&lt;br /&gt;
* Javier: sparklyr update&lt;br /&gt;
** Preparing for CRAN release&lt;br /&gt;
** Mario: what happened to sparkapi?&lt;br /&gt;
*** Javier: sparkapi has been merged into sparklyr in order to avoid overhead of maintaining two packages. ddR can do everything it needs with sparklyr.&lt;br /&gt;
* Luke Tierney: Update on the low-level vector abstraction, which might support interfaces like ddR and sparklyr.&lt;br /&gt;
** Overall approach seems feasible, but still working out a few details.&lt;br /&gt;
** Will land in a branch soon.&lt;br /&gt;
* Bernd Bischl: update on the batchtools package&lt;br /&gt;
** Successor to BatchJobs, based on an in-memory database&lt;br /&gt;
&lt;br /&gt;
=== 8/11/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Meeting was canceled due to lack of availability.''&lt;br /&gt;
&lt;br /&gt;
=== 7/14/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Introduced Clark, the intern funded by the R Consortium. Clark is a graduate student at UC Davis. He will work on integrating ddR with Spark, as well as on improving the core ddR API, e.g., adding a distributed apply() for matrices, a split function, etc.&lt;br /&gt;
* Bernd: Can I play around with ddR now? What backend should I use? How robust is the code?&lt;br /&gt;
** Clark: It's in good enough shape to be played around with. We will continue to improve it. Hopefully the Spark integration will be done before the end of my internship in September.&lt;br /&gt;
* Q: Is anyone working on using ddR to make ML scale better?&lt;br /&gt;
** Indrajit: We have kmeans, glm, etc. already in CRAN.&lt;br /&gt;
** Michael Kane: We are working on glmnet and other packages related to algorithm development.&lt;br /&gt;
* Javier gave a demo of sparklyr and sparkapi.&lt;br /&gt;
** Motivation for the package: The SparkR package overrides the dplyr interface, which is an issue for RStudio. SparkR is not a CRAN package, which makes it difficult to contribute changes. dplyr is RStudio's most popular tool, and it is broken on SparkR.&lt;br /&gt;
** Sparklyr provides a dplyr interface. It will also support ML-like interfaces, such as consuming an ML model.&lt;br /&gt;
** Sparklyr does not currently support any distributed computing features. Instead, we can recommend ddR as the distributed computing framework on top of sparkapi. We will put the code on CRAN in a couple of weeks.&lt;br /&gt;
** Simon: Can you talk more about the wrapper/low level API to work with Spark?&lt;br /&gt;
*** Javier: The package under the covers is called &amp;quot;sparkapi&amp;quot;; it is meant to be used by package builders. &amp;quot;spark_context()&amp;quot; and &amp;quot;invoke()&amp;quot; provide the functionality to call Scala methods. It does not currently allow you to run R user-defined functions; I am working on enabling that feature. Depending on the interest in using ddR with sparkapi, I can spend more time making sparkapi feature-rich.&lt;br /&gt;
** Indrajit: What versions of Spark are supported?&lt;br /&gt;
*** Javier: Anything after 1.6&lt;br /&gt;
** Bernd: How do you export data?&lt;br /&gt;
*** Javier: We are using all the code from SparkR. So everything in SparkR should continue to work. We don't need to change SparkR. We just need to maintain compatibility.&lt;br /&gt;
** Bernd: What happens when the RDDs are very large?&lt;br /&gt;
*** Javier: Spark will spill to disk.&lt;br /&gt;
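The low-level calls Javier describes could look roughly like the following. This is a hedged sketch, not code from the meeting: it assumes the sparklyr/sparkapi API (spark_connect(), spark_context(), invoke()), and the input path is hypothetical.&lt;br /&gt;

```r
library(sparklyr)

# Connect to a local Spark instance (illustrative; a cluster master URL works too).
sc = spark_connect(master = "local")

# spark_context() returns a handle to the JVM SparkContext;
# invoke() calls a Scala/JVM method on such a handle by name.
ctx = spark_context(sc)
ver = invoke(ctx, "version")                   # the Spark version string
rdd = invoke(ctx, "textFile", "data.txt", 2L)  # hypothetical input file

spark_disconnect(sc)
```

As discussed above, this layer only brokers Scala method calls; it does not by itself run R user-defined functions on the workers.&lt;br /&gt;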
* Michael Kane: Presented examples that he implemented on ddR.&lt;br /&gt;
** Talked about how the different distributed packages compare to each other in terms of functionality.&lt;br /&gt;
** Michael K. looked at glm and truncated SVD on ddR. He was able to implement IRLS on ddR by implementing two distributed functions, such as &amp;quot;cross&amp;quot;. For truncated SVD he only needed to overload two different distributed multiplications.&lt;br /&gt;
** Ran these algorithms on the 1000 Genomes dataset.&lt;br /&gt;
** Overall, he liked ddR, since it was easy to implement the algorithms in the package.&lt;br /&gt;
** New ideas:&lt;br /&gt;
*** Trying to separate the data layer from the execution layer&lt;br /&gt;
*** Create an API that works on &amp;quot;chunks&amp;quot; (which is similar to the &amp;quot;parts&amp;quot; API in ddR). Would like to add these APIs to ddR.&lt;br /&gt;
*** Indrajit: You should be able to get some of the chunk-like features by using parts and dmapply. E.g., you can call dmapply to read 10 different files, which now correspond to 10 chunks. These are, however, wrapped as a darray or dframe, but you can continue to work on the individual chunks by using parts(i).&lt;br /&gt;
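The chunk pattern Indrajit describes could be sketched in ddR as follows. A minimal illustration, assuming ddR's default backend; the file names and reader function are hypothetical, not from the minutes.&lt;br /&gt;

```r
library(ddR)

# 10 hypothetical input files, one per chunk.
files = sprintf("chunk%02d.csv", 1:10)

# dmapply reads each file into its own partition; the result is wrapped
# as a distributed object (a dlist here), as noted above.
dl = dmapply(function(f) read.csv(f), files, nparts = 10)

# parts() exposes the individual partitions, so per-chunk work remains
# possible; collect() pulls one partition back to the master.
chunk1 = collect(parts(dl)[[1]])
```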
&lt;br /&gt;
&lt;br /&gt;
=== 6/2/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Make a common abstraction/interface to make it easier to work with distributed data in R&lt;br /&gt;
**  Unify the interface&lt;br /&gt;
**  Working group will run for a year. Get an API defined, and get at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on those aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* The R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not trying to create his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-10-18T11:29:29Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: September minutes&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (Microsoft)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
&lt;br /&gt;
=== 9/8/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Detailed notes were not taken for this meeting''&lt;br /&gt;
&lt;br /&gt;
* Clark Fitzgerald: internship report&lt;br /&gt;
** Developed two packages for low-level Spark integration: rddlist, sparklite&lt;br /&gt;
** Patched a bug in Spark&lt;br /&gt;
** ddR needs refactoring before Spark integration is feasible:&lt;br /&gt;
*** dlist, dframe, and darray should be formal classes.&lt;br /&gt;
*** Partitions of data should be represented by a distributed list abstraction, and most functions (e.g., dmapply) should be implemented on top of that list.&lt;br /&gt;
* Javier: sparklyr update&lt;br /&gt;
** Preparing for CRAN release&lt;br /&gt;
** Mario: what happened to sparkapi?&lt;br /&gt;
*** Javier: sparkapi has been merged into sparklyr to avoid the overhead of maintaining two packages. ddR can do everything it needs with sparklyr.&lt;br /&gt;
* Luke Tierney: Update on the low-level vector abstraction, which might support interfaces like ddR and sparklyr.&lt;br /&gt;
** Overall approach seems feasible, but still working out a few details.&lt;br /&gt;
** Will land in a branch soon.&lt;br /&gt;
* Bernd Bischl: update on the batchtools package&lt;br /&gt;
** Successor to BatchJobs, based on an in-memory database&lt;br /&gt;
&lt;br /&gt;
=== 8/11/2016 ===&lt;br /&gt;
&lt;br /&gt;
''Meeting was canceled due to lack of availability.''&lt;br /&gt;
&lt;br /&gt;
=== 7/14/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Introduced Clark, the intern funded by the R Consortium. Clark is a graduate student at UC Davis. He will work on integrating ddR with Spark, as well as on improving the core ddR API, e.g., adding a distributed apply() for matrices, a split function, etc.&lt;br /&gt;
* Bernd: Can I play around with ddR now? What backend should I use? How robust is the code?&lt;br /&gt;
** Clark: It's in good enough shape to be played around with. We will continue to improve it. Hopefully the Spark integration will be done before the end of my internship in September.&lt;br /&gt;
* Q: Is anyone working on using ddR to make ML scale better?&lt;br /&gt;
** Indrajit: We have kmeans, glm, etc. already in CRAN.&lt;br /&gt;
** Michael Kane: We are working on glmnet and other packages related to algorithm development.&lt;br /&gt;
* Javier gave a demo of sparklyr and sparkapi.&lt;br /&gt;
** Motivation for the package: The SparkR package overrides the dplyr interface, which is an issue for RStudio. SparkR is not a CRAN package, which makes it difficult to contribute changes. dplyr is RStudio's most popular tool, and it is broken on SparkR.&lt;br /&gt;
** Sparklyr provides a dplyr interface. It will also support ML-like interfaces, such as consuming an ML model.&lt;br /&gt;
** Sparklyr does not currently support any distributed computing features. Instead, we can recommend ddR as the distributed computing framework on top of sparkapi. We will put the code on CRAN in a couple of weeks.&lt;br /&gt;
** Simon: Can you talk more about the wrapper/low level API to work with Spark?&lt;br /&gt;
*** Javier: The package under the covers is called &amp;quot;sparkapi&amp;quot;; it is meant to be used by package builders. &amp;quot;spark_context()&amp;quot; and &amp;quot;invoke()&amp;quot; provide the functionality to call Scala methods. It does not currently allow you to run R user-defined functions; I am working on enabling that feature. Depending on the interest in using ddR with sparkapi, I can spend more time making sparkapi feature-rich.&lt;br /&gt;
** Indrajit: What versions of Spark are supported?&lt;br /&gt;
*** Javier: Anything after 1.6&lt;br /&gt;
** Bernd: How do you export data?&lt;br /&gt;
*** Javier: We are using all the code from SparkR. So everything in SparkR should continue to work. We don't need to change SparkR. We just need to maintain compatibility.&lt;br /&gt;
** Bernd: What happens when the RDDs are very large?&lt;br /&gt;
*** Javier: Spark will spill to disk.&lt;br /&gt;
* Michael Kane: Presented examples that he implemented on ddR.&lt;br /&gt;
** Talked about how the different distributed packages compare to each other in terms of functionality.&lt;br /&gt;
** Michael K. looked at glm and truncated SVD on ddR. He was able to implement IRLS on ddR by implementing two distributed functions, such as &amp;quot;cross&amp;quot;. For truncated SVD he only needed to overload two different distributed multiplications.&lt;br /&gt;
** Ran these algorithms on the 1000 Genomes dataset.&lt;br /&gt;
** Overall, he liked ddR, since it was easy to implement the algorithms in the package.&lt;br /&gt;
** New ideas:&lt;br /&gt;
*** Trying to separate the data layer from the execution layer&lt;br /&gt;
*** Create an API that works on &amp;quot;chunks&amp;quot; (which is similar to the &amp;quot;parts&amp;quot; API in ddR). Would like to add these APIs to ddR.&lt;br /&gt;
*** Indrajit: You should be able to get some of the chunk-like features by using parts and dmapply. E.g., you can call dmapply to read 10 different files, which now correspond to 10 chunks. These are, however, wrapped as a darray or dframe, but you can continue to work on the individual chunks by using parts(i).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 6/2/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Make a common abstraction/interface to make it easier to work with distributed data in R&lt;br /&gt;
**  Unify the interface&lt;br /&gt;
**  Working group will run for a year. Get an API defined, and get at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on those aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* The R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not trying to create his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-10-18T11:11:44Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: add July minutes&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (Microsoft)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
&lt;br /&gt;
=== 7/14/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Introduced Clark, the intern funded by the R Consortium. Clark is a graduate student at UC Davis. He will work on integrating ddR with Spark, as well as on improving the core ddR API, e.g., adding a distributed apply() for matrices, a split function, etc.&lt;br /&gt;
* Bernd: Can I play around with ddR now? What backend should I use? How robust is the code?&lt;br /&gt;
** Clark: It's in good enough shape to be played around with. We will continue to improve it. Hopefully the Spark integration will be done before the end of my internship in September.&lt;br /&gt;
* Q: Is anyone working on using ddR to make ML scale better?&lt;br /&gt;
** Indrajit: We have kmeans, glm, etc. already in CRAN.&lt;br /&gt;
** Michael Kane: We are working on glmnet and other packages related to algorithm development.&lt;br /&gt;
* Javier gave a demo of sparklyr and sparkapi.&lt;br /&gt;
** Motivation for the package: The SparkR package overrides the dplyr interface, which is an issue for RStudio. SparkR is not a CRAN package, which makes it difficult to contribute changes. dplyr is RStudio's most popular tool, and it is broken on SparkR.&lt;br /&gt;
** Sparklyr provides a dplyr interface. It will also support ML-like interfaces, such as consuming an ML model.&lt;br /&gt;
** Sparklyr does not currently support any distributed computing features. Instead, we can recommend ddR as the distributed computing framework on top of sparkapi. We will put the code on CRAN in a couple of weeks.&lt;br /&gt;
** Simon: Can you talk more about the wrapper/low level API to work with Spark?&lt;br /&gt;
*** Javier: The package under the covers is called &amp;quot;sparkapi&amp;quot;; it is meant to be used by package builders. &amp;quot;spark_context()&amp;quot; and &amp;quot;invoke()&amp;quot; provide the functionality to call Scala methods. It does not currently allow you to run R user-defined functions; I am working on enabling that feature. Depending on the interest in using ddR with sparkapi, I can spend more time making sparkapi feature-rich.&lt;br /&gt;
** Indrajit: What versions of Spark are supported?&lt;br /&gt;
*** Javier: Anything after 1.6&lt;br /&gt;
** Bernd: How do you export data?&lt;br /&gt;
*** Javier: We are using all the code from SparkR. So everything in SparkR should continue to work. We don't need to change SparkR. We just need to maintain compatibility.&lt;br /&gt;
** Bernd: What happens when the RDDs are very large?&lt;br /&gt;
*** Javier: Spark will spill to disk.&lt;br /&gt;
* Michael Kane: Presented examples that he implemented on ddR.&lt;br /&gt;
** Talked about how the different distributed packages compare to each other in terms of functionality.&lt;br /&gt;
** Michael K. looked at glm and truncated SVD on ddR. He was able to implement IRLS on ddR by implementing two distributed functions, such as &amp;quot;cross&amp;quot;. For truncated SVD he only needed to overload two different distributed multiplications.&lt;br /&gt;
** Ran these algorithms on the 1000 Genomes dataset.&lt;br /&gt;
** Overall, he liked ddR, since it was easy to implement the algorithms in the package.&lt;br /&gt;
** New ideas:&lt;br /&gt;
*** Trying to separate the data layer from the execution layer&lt;br /&gt;
*** Create an API that works on &amp;quot;chunks&amp;quot; (which is similar to the &amp;quot;parts&amp;quot; API in ddR). Would like to add these APIs to ddR.&lt;br /&gt;
*** Indrajit: You should be able to get some of the chunk-like features by using parts and dmapply. E.g., you can call dmapply to read 10 different files, which now correspond to 10 chunks. These are, however, wrapped as a darray or dframe, but you can continue to work on the individual chunks by using parts(i).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 6/2/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Make a common abstraction/interface to make it easier to work with distributed data in R&lt;br /&gt;
**  Unify the interface&lt;br /&gt;
**  Working group will run for a year. Get an API defined, and get at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on those aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* The R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not trying to create his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-07-14T12:39:43Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (Microsoft)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
&lt;br /&gt;
=== 6/2/2016 ===&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Make a common abstraction/interface to make it easier to work with distributed data in R&lt;br /&gt;
**  Unify the interface&lt;br /&gt;
**  Working group will run for a year. Get an API defined, and get at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on those aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* The R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not trying to create his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-06-24T15:44:58Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (Microsoft)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
**  Make a common abstraction/interface to make it easier to work with distributed data in R&lt;br /&gt;
**  Unify the interface&lt;br /&gt;
**  Working group will run for a year. Get an API defined, and get at least one open-source reference implementation&lt;br /&gt;
**  Not everyone needs to work hands-on. We will create smaller groups to focus on those aspects.&lt;br /&gt;
**  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* The R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and people are using?&lt;br /&gt;
** Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
** Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not trying to create his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-06-24T15:43:59Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Goals and Purpose ==&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
== Members ==&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (Microsoft)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
&lt;br /&gt;
== Minutes ==&lt;br /&gt;
&lt;br /&gt;
* Round table introduction&lt;br /&gt;
&lt;br /&gt;
* (Michael) Goals for the group:&lt;br /&gt;
  *  Make a common abstraction/interface to make it easier to work with distributed data in R&lt;br /&gt;
  *  Unify the interface&lt;br /&gt;
  *  Working group will run for a year. Get an API defined, and get at least one open-source reference implementation&lt;br /&gt;
  *  Not everyone needs to work hands-on. We will create smaller groups to focus on those aspects.&lt;br /&gt;
  *  We tried to get a diverse group of participants&lt;br /&gt;
* Logistics: meet monthly, focus groups may meet more often&lt;br /&gt;
* The R Consortium may be able to find ways to fund smaller projects that come out of the working group&lt;br /&gt;
* Michael Kane: Should we start with an inventory of what is available and people are using?&lt;br /&gt;
   * Michael Lawrence: Yes, we should find the collection of tools as well as the use cases that are common.&lt;br /&gt;
   * Joe: I will figure out a wiki space.&lt;br /&gt;
* Javier: Who are the end users? Simon: A common layer is needed to get algorithms working. We started from algorithms and tried to find the minimal common API. One of the goals is to make sure everyone is on the same page and not trying to create his/her own custom interface.&lt;br /&gt;
* Javier: Should we try to get people with more algo expertise?&lt;br /&gt;
* Joe: Simon do you have a stack diagram?&lt;br /&gt;
* Simon: Can we get R Consortium to help write things up and draw things?&lt;br /&gt;
* Next meeting: Javier is going to present SparkR next time.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-06-24T15:34:25Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Goals and Purpose&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
Members&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (Microsoft)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
&lt;br /&gt;
Minutes&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-06-24T15:34:06Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Goals and Purpose&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
Members&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* ''''Indrajit Roy'''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (Microsoft)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
&lt;br /&gt;
Minutes&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-06-24T15:33:49Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Goals and Purpose&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
Members&lt;br /&gt;
&lt;br /&gt;
* '''Michael Lawrence''' (Genentech)&lt;br /&gt;
* '''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
* ''Joe Rickert'' (Microsoft)&lt;br /&gt;
* Bernd Bischl (LMU)&lt;br /&gt;
* Matt Dowle (H2O)&lt;br /&gt;
* Mario Inchiosa (Microsoft)&lt;br /&gt;
* Michael Kane (Yale)&lt;br /&gt;
* Javier Luraschi (RStudio)&lt;br /&gt;
* Edward Ma (HP Enterprise)&lt;br /&gt;
* Luke Tierney (University of Iowa)&lt;br /&gt;
* Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
&lt;br /&gt;
Minutes&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group</id>
		<title>Distributed Computing Working Group</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Distributed_Computing_Working_Group"/>
				<updated>2016-06-24T15:32:47Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: Created page with &amp;quot;Goals and Purpose  The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one op...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Goals and Purpose&lt;br /&gt;
&lt;br /&gt;
The Distributed Computing Working Group will endorse the design of a common abstraction for distributed data structures in R. We aim to have at least one open-source implementation, as well as a SQL implementation, released within a year of forming the group.&lt;br /&gt;
&lt;br /&gt;
Members&lt;br /&gt;
&lt;br /&gt;
'''Michael Lawrence''' (Genentech)&lt;br /&gt;
'''Indrajit Roy''' (HP Enterprise)&lt;br /&gt;
''Joe Rickert'' (Microsoft)&lt;br /&gt;
Bernd Bischl (LMU)&lt;br /&gt;
Matt Dowle (H2O)&lt;br /&gt;
Mario Inchiosa (Microsoft)&lt;br /&gt;
Michael Kane (Yale)&lt;br /&gt;
Javier Luraschi (RStudio)&lt;br /&gt;
Edward Ma (HP Enterprise)&lt;br /&gt;
Luke Tierney (University of Iowa)&lt;br /&gt;
Simon Urbanek (AT&amp;amp;T)&lt;br /&gt;
&lt;br /&gt;
Minutes&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Main_Page</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Main_Page"/>
				<updated>2016-06-24T15:22:57Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Welcome to the R Consortium Wiki ==&lt;br /&gt;
&lt;br /&gt;
The R Consortium, Inc. is a group organized under an open source governance and foundation model to provide support to the R community, the R Foundation, and groups and individuals using, maintaining, and distributing R software.&lt;br /&gt;
&lt;br /&gt;
This wiki space is for working groups and ISC Project collaboration and documentation. &lt;br /&gt;
&lt;br /&gt;
== Get Involved ==&lt;br /&gt;
&lt;br /&gt;
* [https://www.r-consortium.org/ R Consortium Website]&lt;br /&gt;
* [https://www.r-consortium.org/about/isc/proposals Looking to Submit a Proposal]&lt;br /&gt;
* [https://twitter.com/RConsortium Follow us on Twitter]&lt;br /&gt;
&lt;br /&gt;
== Working Groups ==&lt;br /&gt;
&lt;br /&gt;
* [[R Native API|Native APIs for R]]&lt;br /&gt;
* [[Distributed Computing Working Group|Distributed Computing]]&lt;br /&gt;
&lt;br /&gt;
'''''This wiki supports Linux Foundation ID single-sign-on and registration with the link at the top of this page. Other components of the R Consortium that do not support single-sign-on will directly request your Linux Foundation ID username and password for login.'''''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Consult the [//meta.wikimedia.org/wiki/Help:Contents User's Guide] for information on using the wiki software.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	<entry>
		<id>https://wiki.r-consortium.org/view/Main_Page</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.r-consortium.org/view/Main_Page"/>
				<updated>2016-06-24T15:20:11Z</updated>
		
		<summary type="html">&lt;p&gt;MichaelLawrence: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Welcome to the R Consortium Wiki ==&lt;br /&gt;
&lt;br /&gt;
The R Consortium, Inc. is a group organized under an open source governance and foundation model to provide support to the R community, the R Foundation, and groups and individuals using, maintaining, and distributing R software.&lt;br /&gt;
&lt;br /&gt;
This wiki space is for working groups and ISC Project collaboration and documentation. &lt;br /&gt;
&lt;br /&gt;
== Get Involved ==&lt;br /&gt;
&lt;br /&gt;
* [https://www.r-consortium.org/ R Consortium Website]&lt;br /&gt;
* [https://www.r-consortium.org/about/isc/proposals Looking to Submit a Proposal]&lt;br /&gt;
* [https://twitter.com/RConsortium Follow us on Twitter]&lt;br /&gt;
&lt;br /&gt;
== Working Groups ==&lt;br /&gt;
&lt;br /&gt;
* [[R Native API|Native APIs for R]]&lt;br /&gt;
* [[Distributed Computing]]&lt;br /&gt;
&lt;br /&gt;
'''''This wiki supports Linux Foundation ID single-sign-on and registration with the link at the top of this page. Other components of the R Consortium that do not support single-sign-on will directly request your Linux Foundation ID username and password for login.'''''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Consult the [//meta.wikimedia.org/wiki/Help:Contents User's Guide] for information on using the wiki software.&lt;/div&gt;</summary>
		<author><name>MichaelLawrence</name></author>	</entry>

	</feed>