R Native API call 2016-06-20

From R Consortium Wiki

This was an initial meeting, intended to give an overview of everybody's motivation and views on the matter.

A round of introductions

  • Short introduction by Stephen Kaluzny (TIBCO, R-consortium ISC member, sponsor for this WG)
  • Lukas Stadler (Oracle Labs, FastR project, WG lead)
  • Is there consensus that the R native API could use an overhaul, more documentation, etc.?
  • Some basic points to consider: small changes vs. big changes, C/C++, separation into different modules
  • Initial Survey of API Usage - some discussions about the % of API needed by most packages and applications - 10%, 80%, 90%?
  • Simon Urbanek (AT&T Labs)
  • Experience from the Aleph project
  • You can get a long way with a small portion of the API
  • Luke Tierney (University of Iowa)
  • Has to take care of this if it gets into GNUR
  • Generally interested, implements optimizations in R core
  • Alexander Bertram (BeDataDriven, renjin)
  • A lot of the API is just BLAS, etc. - which parts are R specific, how big is the actual interface?
  • APIs should also provide guidance to package developers
  • R core already moved some code out (graphics) which is useful and improves quality and separation
  • Radford Neal (University of Toronto)
  • PQR, which naturally sees less reason for drastic change
  • There are two sides: R->C and C->R
  • .C/.Fortran is IMO the preferred way: small surface, implementation performance can be improved, can, e.g., run in parallel in PQR
  • .Call/.External/...: not very well defined, PROTECT is hard to get right, how to use NAMED (or whatever will replace NAMED)
  • What about the embedded R interface
  • The interface between base R and the included packages (stats, …) should also be well-defined
  • The interface contains much more than just header files, e.g., config files, databases, ...
  • Discussion: Alex, Simon, Lukas, …
  • data.table - very unique dependency on how internals of GNUR work
  • Validity of .Call functions that modify their arguments without checking NAMED, many uses of this only work given a very specific behavior of the R runtime
  • Mick Jordan (Oracle Labs, FastR)
  • The current R API is an interface to a GNUR-like system, and not to an R runtime
  • A small number of function impls have gotten FastR a long way, most tricky parts are in the "callbacks" (the C->R part)
  • Missing documentation is a problem, makes it hard to implement API
  • (Lukas) Documentation is also a contract that defines what can be expected, which behavior can be depended upon. Otherwise, users will assume that all observed behavior is part of the API.
  • There should be one header file, with documentation
  • Java was in a similar position with the first native API, JNI is the result of that, very well-defined interface that stood the test of time
  • Gregory Warnes (Boehringer Ingelheim)
  • RPy, provides a Python interface, generally interested in R native APIs
  • Edzer Pebesma (University of Münster)
  • Looking into the rho project, extensions
  • Interested to see how R is used in the greater world outside GNUR
  • Michael Sannella (TIBCO, TERR, owner of the R-C-level API)
  • Naturally interested in this
  • data.table as an extreme case in native API usage:
  • Managed to get it to work on TERR (talk about this at RIOT)
  • There are no contracts (in the form of documentation, assertions, ...) in the API, data.table uses this to the extreme
  • It exports some of these extreme uses (e.g., changing attributes) to its users
  • A very interesting/challenging case: it’s a very important package, how to handle this? Make all "API" it uses available?
  • IMO, the real problem is that the functions are not well-defined enough, not that there are too many: "whatever you can get away with is defined"
  • Few package authors use the API to the max; the average package author has probably not delved that deep into the interface or used all of it, because the lack of documentation is a barrier
  • Michael Lawrence (Genentech)
  • Trying to enable package-level (C-level) extensibility of base R packages
  • E.g., new int-vector implementation as a package
  • (Alex) Pushed for this in renjin, e.g., provide new implementation of int vectors
  • Some more discussion:
  • This would clearly not be possible with the current API because of things like “REAL”
  • Gabriel Becker (works with Michael Lawrence) talks at DSC about a modified GNUR
  • (Alex) renjin has DBI-compatible package that simulates a data.frame from a rolling cursor in a DB
  • Should this be “exchange int vector impl in runtime” (for all int vectors) or “create int vector with this implementation” (operator overloading?)
  • (Mick) R is a very complex system that allows modifying many basic assumptions, should there be more or less complexity?
  • Indrajit Roy (HP labs)
  • Extending R with distributed data structures
  • Making API compatible with what people write in the future
  • Questions, playing the “bad guy”:
  • Many here want to make changes to the R internal, to the packages, so that R can be run by alternative implementations
  • A lot of the points - are they about coding practices? or about making R internal code more modular?
  • Maybe we just need to deprecate all the unused functions? What’s the real goal?
  • (Alex) It’s not so much about being able to implement it, it works already, but to make it easier, more efficient, etc.
  • (Mick) There should be an unbiased party in this, with a view not only from R core
  • (Simon) A lot of the discussions on r-devel are about what is the API and what not
  • A couple independent views:
  • Documentation, what is the contract of the API?
  • People have been using internal API and calling for it to be made external
  • High-level stuff: replacing high-level pieces
  • How to make things more flexible
  • The call is a lot about what people think about the API
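The concern raised above about .Call functions that modify their arguments without checking NAMED can be illustrated with a toy model. The sketch below is not R's API: `toy_intvec`, its `shared` flag, and all `toy_*` functions are hypothetical names invented for this illustration, and the flag only loosely stands in for the question a NAMED check answers ("could anyone else observe this value?"). Under those assumptions, a native routine that copies before writing to a shared value might look roughly like this:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an integer vector owned by a runtime.
 * `shared` is a hypothetical flag: nonzero means some other binding
 * may still refer to this value, so it must not be changed in place. */
typedef struct {
    int shared;
    size_t length;
    int *data;
} toy_intvec;

/* Allocate a zero-filled, unshared vector; returns NULL on failure. */
toy_intvec *toy_alloc(size_t n) {
    toy_intvec *v = malloc(sizeof *v);
    if (v == NULL) return NULL;
    v->data = calloc(n ? n : 1, sizeof *v->data);
    if (v->data == NULL) { free(v); return NULL; }
    v->shared = 0;
    v->length = n;
    return v;
}

void toy_free(toy_intvec *v) {
    if (v != NULL) { free(v->data); free(v); }
}

/* Return a fresh, unshared copy of src. */
toy_intvec *toy_duplicate(const toy_intvec *src) {
    toy_intvec *copy = toy_alloc(src->length);
    if (copy == NULL) return NULL;
    if (src->length > 0)
        memcpy(copy->data, src->data, src->length * sizeof *src->data);
    return copy;
}

/* Add one to every element. The argument is modified in place only
 * when it is not shared; otherwise the work is done on a copy and the
 * caller's value stays untouched. */
toy_intvec *toy_add_one(toy_intvec *x) {
    assert(x != NULL);
    toy_intvec *target = x;
    if (x->shared) {
        target = toy_duplicate(x);
        if (target == NULL) return NULL;
    }
    for (size_t i = 0; i < target->length; i++)
        target->data[i] += 1;
    return target;
}
```

A routine that skips the `shared` check still appears to work whenever the runtime happens to pass unshared values, which is the kind of dependence on one specific runtime behavior that the notes describe.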

Additional discussions

  • Is there someone in this call from the rho project?
  • Karl Millar is listening on the mailing list
  • They, e.g., did GC with stack scanning to avoid the need for PROTECT
  • Discussion about whether the API should include GC aspects
  • (Lukas) Why is a large part of the interface duplicated on both the R and C side?
  • The interface could be a lot smaller if eval(...) was used in all cases where there's no performance bottleneck (connection functions, etc.)
  • Functions like “as.vector(…)” and R_asVector: sometimes mismatch between R and C version, sometimes similar
  • (Simon) Historical reasons, stems from the R API being taken from the implementation (which is very powerful, but dangerous)
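To make the GC discussion above more concrete, here is a toy model of why explicit protection is easy to get wrong. It is only a sketch under strong assumptions: the collector below treats a protect stack as its sole set of roots and frees everything else, which is far simpler than the collector in GNUR, and every name (`toy_obj`, `toy_protect`, `toy_gc`, ...) is hypothetical rather than part of any R header.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_PROTECT 64

typedef struct toy_obj {
    int value;
    int marked;
    struct toy_obj *next;   /* link in the list of all heap objects */
} toy_obj;

static toy_obj *toy_heap = NULL;                    /* every live object */
static toy_obj *toy_protect_stack[TOY_MAX_PROTECT]; /* the only GC roots */
static int toy_protect_top = 0;

/* Allocate an object and register it with the toy heap. */
toy_obj *toy_new(int value) {
    toy_obj *o = malloc(sizeof *o);
    if (o == NULL) return NULL;
    o->value = value;
    o->marked = 0;
    o->next = toy_heap;
    toy_heap = o;
    return o;
}

/* Push an object onto the protect stack so the collector keeps it. */
toy_obj *toy_protect(toy_obj *o) {
    assert(toy_protect_top < TOY_MAX_PROTECT);
    toy_protect_stack[toy_protect_top++] = o;
    return o;
}

/* Pop n entries; callers must balance every toy_protect themselves. */
void toy_unprotect(int n) {
    assert(n >= 0 && n <= toy_protect_top);
    toy_protect_top -= n;
}

/* Mark everything on the protect stack, then free all unmarked
 * objects. Returns the number of objects freed. */
int toy_gc(void) {
    int freed = 0;
    for (int i = 0; i < toy_protect_top; i++)
        if (toy_protect_stack[i] != NULL)
            toy_protect_stack[i]->marked = 1;
    toy_obj **link = &toy_heap;
    while (*link != NULL) {
        toy_obj *o = *link;
        if (o->marked) {
            o->marked = 0;          /* reset for the next collection */
            link = &o->next;
        } else {
            *link = o->next;        /* unlink and release */
            free(o);
            freed++;
        }
    }
    return freed;
}

/* Count the objects currently on the toy heap. */
int toy_live_count(void) {
    int n = 0;
    for (toy_obj *o = toy_heap; o != NULL; o = o->next) n++;
    return n;
}
```

In this model, an object created with `toy_new` but never passed to `toy_protect` is released by the next `toy_gc`, while C code may still hold a pointer to it. A collector that scans the native stack for such pointers, as mentioned for rho, would not need this bookkeeping from package authors.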

Additional (in-person) discussions will be scheduled for useR! and RIOT

Wrap-up

(Lukas): One important question to answer for this WG is: how far do we go?

  • Enhance/add documentation
  • Trim down the interface (by looking at its current usage, by looking at what makes sense as an API)
  • Extend by replacing tricky parts, with a gradual switchover
  • Introducing a consistent API, with breaking changes
  • Introduce new APIs for parts that are not covered at the moment (or: include provisions for adding new API in the future)

The big tradeoff is between the payoff for GNUR and alternative implementations (more efficient, easier to maintain, …) and the increased effort (and lower adoption) on the package side.