R Pre-processing Plan

Future plan, project structure, and more.

Kirill Müller (https://github.com/krlmlr), cynkra (https://cynkra.com); John Coene (https://github.com/JohnCoene), Opifex (https://opifex.org); Joe Thorley (https://github.com/joethorley), Poisson (https://www.poissonconsulting.ca); Antoine Fabri (https://github.com/moodymudskipper), cynkra (https://cynkra.com); Mark Padgham (https://github.com/mpadge), rOpenSci (https://ropensci.org/)
2021-10-16

This post delves into the R pre-processing project, its structure, future plans, and more.

Problem

The R programming language has evolved a great deal since its inception. While it still excels as an interactive statistical programming language, it is now used in many other contexts.

R is used to build and deploy solutions at scale (“in production”), such as APIs with plumber (Schloerke and Allen 2021), web applications with shiny (Chang et al. 2021), extract, transform, and load (ETL) pipelines using tidyverse (Wickham 2021) and pointblank (Iannone and Vargas 2021), and much more.

The R programming language itself has no issue dealing with large code bases. However, type violations or unsatisfied assertions may be buried deep inside large projects and surface as errors far from the original source of the problem. Tracking down such problems may take a lot of time. In a typed language, the developer is notified of such problems immediately during compilation, or at the very latest during execution, at the point of the violation.
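
To make this concrete, here is a minimal base R sketch. The functions `make_config()`, `make_config_checked()`, and `rescale()` are hypothetical names invented for this illustration; they are not part of rpp or any other package. A string is passed where a number is expected, and the error only appears later, inside a different function. The checked variant fails right where the wrong value enters.

```r
# Hypothetical constructor: silently accepts any input and stores it.
make_config <- function(scale) {
  list(scale = scale)
}

# Hypothetical consumer, possibly far away in the code base.
rescale <- function(values, config) {
  values * config$scale
}

# Mistake: a string instead of a number. Nothing fails here.
config <- make_config("2")

# The error only surfaces when the value is finally used.
late <- tryCatch(rescale(c(1, 2, 3), config), error = function(e) e)

# Checked variant: the assertion fails at the point of the violation.
make_config_checked <- function(scale) {
  stopifnot(is.numeric(scale), length(scale) == 1L)
  list(scale = scale)
}

early <- tryCatch(make_config_checked("2"), error = function(e) e)

# The call recorded in each condition shows where the failure happened.
print(conditionCall(late))
print(conditionCall(early))
```

In the unchecked version the recorded call points at the multiplication inside `rescale()`, which says nothing about where the bad value came from. A type system would move that report to the call that introduced the string.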

Ultimate Goal

The ultimate goal of the project is to create a superset of the R language so R code can be pre-processed. Pre-processing code offers a multitude of possibilities, such as:

Decoupling the code we write from the code we run as well as the above bullet points.

We realise this is ambitious and have therefore split the project into manageable “slices.” Read on to discover where we are now, where we’re going with the project, and how we intend to get there.

Prototype

We’re currently at the prototype stage but you may find it useful nonetheless. This prototype currently consists of the rpp package (Müller 2021).

This prototype preprocesses code inline and works with packages. That is, it reads the R code in R/, processes it, and writes the result back to the same files in the same directory.
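
The inline idea can be sketched in a few lines of base R. This is an illustration under stated assumptions, not how rpp is implemented: `preprocess_inline()` and `transform_lines()` are hypothetical names, and the transformation (stripping trailing whitespace) merely stands in for whatever a real preprocessor would do.

```r
# Hypothetical transformation standing in for real preprocessing:
# strip trailing spaces and tabs from every line.
transform_lines <- function(lines) {
  sub("[ \t]+$", "", lines)
}

# Read every R file under <pkg_dir>/R, transform it, and write the
# result back to the same file. Files that do not change are left alone.
preprocess_inline <- function(pkg_dir, transform = transform_lines) {
  files <- list.files(
    file.path(pkg_dir, "R"),
    pattern = "[.][Rr]$",
    full.names = TRUE
  )
  for (file in files) {
    old <- readLines(file, warn = FALSE)
    new <- transform(old)
    if (!identical(old, new)) {
      writeLines(new, file)
    }
  }
  invisible(files)
}

# Usage on a throwaway package skeleton in the temporary directory.
pkg <- file.path(tempdir(), "demo_pkg")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("f <- function(x) {  ", "  x + 1", "}"), file.path(pkg, "R", "f.R"))

preprocess_inline(pkg)
processed <- readLines(file.path(pkg, "R", "f.R"))
```

Because the output lands in the same R/ directory, anything that reads that directory afterwards sees ordinary R files.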

The main reason for doing things inline rather than in a parallel directory is that we can still leverage the existing toolchain for developing R packages: testing, documentation, etc.

First slice: Minimum viable product

First, the typed package (Fabri 2021) would have to be enhanced so that it provides a more complete type system. It currently includes basic types such as integers, characters, lists, and data frames. However, it is not yet possible to specify, for example, the structure of a data frame.
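
As a rough idea of what specifying the structure of a data frame could mean, the sketch below checks column names and classes with plain base R at run time. The helper `check_df_structure()` and the shape of its `spec` argument are assumptions made for this example; they are not part of typed and do not reflect a planned syntax.

```r
# Hypothetical helper: `spec` is a named character vector that maps
# column names to the class each column must inherit from.
check_df_structure <- function(df, spec) {
  stopifnot(is.data.frame(df), is.character(spec), !is.null(names(spec)))

  missing_cols <- setdiff(names(spec), names(df))
  if (length(missing_cols) > 0L) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }

  for (col in names(spec)) {
    if (!inherits(df[[col]], spec[[col]])) {
      stop("column `", col, "` must inherit from class ", spec[[col]])
    }
  }
  invisible(df)
}

spec <- c(id = "integer", name = "character")

# Conforms to the specification.
ok <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
check_df_structure(ok, spec)

# Violates it: `id` is a double, not an integer.
bad <- data.frame(id = c(1.5, 2), name = c("a", "b"), stringsAsFactors = FALSE)
outcome <- tryCatch(check_df_structure(bad, spec), error = function(e) e)
```

A run-time check like this only fires when the function is called; the point of a richer type system would be to express the same constraint declaratively.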

The minimum viable product should be included in some real-world applications to clearly communicate its benefits to the R community. This includes the dm (Schieferdecker, Müller, and Bergant 2021) and packer (Coene 2021) packages.

The project should replicate the existing infrastructure to build R packages with functions such as rpp::load_all(), rpp::test_local(), and rpp::build(), analogous to pkgload::load_all(), testthat::test_local(), and devtools::build() respectively. This greatly reduces the barrier to entry for users already familiar with the existing package development structure.

Base R and its family of packages (stats, etc.) should be annotated with type information. This implies developing an infrastructure to infer types. This infrastructure could then be extended to derive types for other packages on CRAN. This is a complex topic that spans all slices: no concrete results are anticipated for this slice, but at least a plan.

Second slice: Useful product

While the previous version of the product mimicked the infrastructure, this version should properly integrate with it so as to require no effort from the community. This includes integration with R CMD build for instance.

Base R and its family of packages (stats, etc.) should be type-annotated. This implies developing an infrastructure to infer types; promising work in this direction builds on the instrumentr package. A prototype for inferring types for base and other packages should be part of a useful product, with annotations for at least some base and other popular packages.
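
One simple way to think about inferring types is to observe them while code runs. The sketch below wraps a function and records the class of each argument and of the return value. It is an assumption-laden toy: `trace_types()` is a hypothetical helper written for this post, and it is not how instrumentr works.

```r
# Hypothetical helper: returns a wrapped version of `fun` plus an
# accessor for the classes observed across all calls so far.
trace_types <- function(fun) {
  log <- list()

  wrapped <- function(...) {
    args <- list(...)
    result <- fun(...)
    # Append one record per call to the log in the enclosing environment.
    log[[length(log) + 1L]] <<- list(
      args = vapply(args, function(a) class(a)[1L], character(1)),
      value = class(result)[1L]
    )
    result
  }

  list(fun = wrapped, observed = function() log)
}

# Usage: trace base::nchar() over two calls.
traced <- trace_types(nchar)
traced$fun("hello")
traced$fun(c("a", "bc"), type = "chars")

observed <- traced$observed()
str(observed)
```

Observations like these only cover the calls that were actually made, so a real inference infrastructure would need a large and varied corpus of calls.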

Simple but robust static type checking should be enabled, supported by a custom syntax and a faster lexer and parser.

A platform to request user feedback.

Third slice: A new language

The basic static type checking implemented in the previous slice should be much improved in this version, and type annotations should be provided for the recommended packages as well as a curated list of popular packages.

The new syntax and workflow should be integrated with popular integrated development environments so that syntax highlighting, shortcuts, and more work as expected.

To improve the management of the code, it should be properly integrated with GitHub and Sourcegraph.

This is also a step where we hope to push for greater adoption of the project.

Stable Product

Types

The usefulness of types needs to be increased by:

  1. Annotating the entirety of base R.
  2. Allowing R users to define custom interfaces/structs.
  3. Implementing static type checking, which decoupling rpp code from R code requires.

Lexer and Parser

Static type checking requires lexing and parsing the rpp code to transpile it to R.
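
For a feel of what lexing and parsing produce, base R already exposes both for ordinary R code through `parse()` and `utils::getParseData()`. The sketch below only illustrates the kind of token and tree information a transpiler works with; it handles plain R, not rpp syntax, and says nothing about how the project's own lexer and parser would be built.

```r
code <- "add <- function(x, y) x + y"

# Parsing yields an expression object; keep.source = TRUE retains the
# information needed to recover the individual tokens.
parsed <- parse(text = code, keep.source = TRUE)

# One row per node of the parse tree, including the lexer's tokens.
tokens <- utils::getParseData(parsed)

# Terminal nodes are the tokens themselves, in source order.
terminal <- tokens[tokens$terminal, c("token", "text")]
print(terminal)

# The parsed expression is a regular R call that can be inspected.
top_call <- parsed[[1]]
print(as.character(top_call[[1]]))
```

The `terminal` data frame pairs each piece of source text with a token type, and `top_call` shows that the whole line is a call to the assignment operator.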

Toolchain

One reason for currently doing inline preprocessing is that the MVP then still benefits from the existing toolchain that accompanies R packages: testing, loading, building, etc.

Decoupling rpp and R code also requires integrating rpp with said toolchain.

Community Feedback

Set up the environment required to receive feedback from users and to allow the community to participate in the testing, development, and design of the extension.

References

Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for R. https://shiny.rstudio.com/.

Coene, John. 2021. Packer: An Opinionated Framework for Using JavaScript.

Fabri, Antoine. 2021. Typed: Support Types for Variables, Arguments, and Return Values. https://github.com/moodymudskipper/typed.

Iannone, Richard, and Mauricio Vargas. 2021. Pointblank: Data Validation and Organization of Metadata for Local and Remote Tables. https://CRAN.R-project.org/package=pointblank.

Müller, Kirill. 2021. Rpp: R Preprocessor.

Schieferdecker, Tobias, Kirill Müller, and Darko Bergant. 2021. Dm: Relational Data Models. https://CRAN.R-project.org/package=dm.

Schloerke, Barret, and Jeff Allen. 2021. Plumber: An API Generator for R. https://CRAN.R-project.org/package=plumber.

Wickham, Hadley. 2021. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.

Citation

For attribution, please cite this work as

Müller, et al. (2021, Oct. 16). Pre-processing R code: R Pre-processing Plan. Retrieved from https://rpreprocess.netlify.app/posts/2021-10-16-project/

BibTeX citation

@misc{müller2021r,
  author = {Müller, Kirill and Coene, John and Thorley, Joe and Fabri, Antoine and Padgham, Mark},
  title = {Pre-processing R code: R Pre-processing Plan},
  url = {https://rpreprocess.netlify.app/posts/2021-10-16-project/},
  year = {2021}
}