A trustworthy Coq formalisation of the R programming language
This project is a Coq formalisation of the R programming language, or more accurately, of the core of a R programming language. It follows principles strongly inspired from the JSCert project:
- the code is as close as possible from a reference specification (in this case, the source of GNU R—a mirror can be found here). The formalisation has been written in a monadic form, which enables a line-to-line correspondance with the original source. We believe that anyone can check the code similarity after a short training period.
- the formalisation is runnable, and can be run against test suites.
We tested it against GNU R in random tests
(see the
gen/
folder), custom tests (see thetests/
folder), as well as other test suites for R (see thecompare/RTests/
folder).
The goal of this project is to provide a strong basis for anyone wanting to certify in Coq a higher-level formalisation of R, as well as code analyses, etc. Due to its double certification (both code similarity and testing), this project provides a high amount of trust to any proofs built on it. This trust comes with the cost of complexity: R’s semantics is complex, and this project aims at formalising this complexity. This is why linking this project to higher-level specifications would be a huge addition.
Some parts of the original R source code have not been formalised.
The main example is the garbage collector, which is not supposed to
change the semantics of R programs.
R is furthermore composed of several parts, some in C, and some in R it self.
Our Coq translation focusses the C part,
but we did our best to be able to run the R parts.
These parts can be found in the folder Rlib/base
.
We have tested our formalisation by extracting it to OCaml (see how to build the project below).
We have tested it on several test suites, including random tests
(see the gen/
folder after building th project),
customs tests with an emphase on corner cases
(referred as “Corners” in our publications, see the tests/
folder),
as well as other test suites
(from GNU R and two different versions of the FastR project,
available here and here,
see the compare/RTests/
folder).
Test suites for the R programming language are quite large.
We have filtered out some tests which we consider irrelevant for the current state of this project.
More precisely, we have filtered out:
- tests generating PDFs;
- tests requiring additional packages and libraries (PCRE, JIT, MASS, etc);
- tests about character encodings;
- tests requiring Internet connection;
- tests generating plots in graphical interfaces;
- Windows-only tests;
- tests related to S4 classes;
- tests about input and outputs (typically manipulating files or sockets);
- tests related to limiting CPU time for computations;
- generated tests.
R is a project moving fast. We believe that the approach of sticking as close as possible to R’s source code enables to update our formalisation in the same time than R. But this means that if you are using this repository several years in the future, there might be differences that were not present at the writing time. Most of the project is based on the versions 3.4.2, 3.4.3, and 3.4.4 of R.
More information about this project can be found in the following papers:
- A Coq Formalisation of a Core of R, Martin Bodin, CoqPL, 2018.
- A Trustworthy Mechanized Formalization of R, Martin Bodin, Tomás Diaz, and Éric Tanter, Dynamic Languages Symposium (DLS), 2018.
Here follows a quick description of the main files of interest:
src/Rinternals.v
describes the internal structures used by R thorough its source code;- the
src/core/
folder contains translations of the main functions of R (these functions were called “nucleus functions” in our second paper); - the
src/features/
folder contains translations of additional functions of R (typically present in the default heap); src/Invariants.v
defines some invariants of R’s structures;src/InvariantProofs.v
contains proofs that the invariants defined insrc/Invariants.v
are conserved along the execution.
Note that as we try to be as close as possible to the original source code of R,
we are limited in term of how much comment we can write:
most of the algorithm of this project were by construction not designed by us!
This means that most comments in this project are merely pointers to the original source code of GNU R.
For instance, the file src/features/FArithmetic.v
corresponds
to the file `arithmetic.c`
from the actual GNU R source code.
The simplest way is to use esy
.
One way to install it is as follows.
sudo apt install npm
sudo npm install --global esy@latest # Tested with version 0.6.2 of esy.
One can then clone the project.
git clone https://github.com/Mbodin/CoqR.git CoqR
cd CoqR
git submodule update --init
Then, typing esy
should install all dependencies and build the project.
Once compiled, the following commands are available:
esy lrun
runs the CoqR interpreter with the base libraries loaded (the loading of the base library will take some time the first time).esy run
runs a lighter version of the CoqR interpreter without the base libraries.esy lrun_bisect
andesy run_bisect
are similar to the other ones, but the program is instrumented by the Bisect coverage tool. Once the program is run on some test suites, one can callesy report
to create a coverage report inbisect/report/index.html
.
Another way is to use opam
.
In Debian, installing Opam, Git, as well as the other needed libraries,
can be done as follows.
More information for other systems can be found
here for Opam
and here for Git.
sudo apt install opam aspcud m4 perl git
opam init # This is not needed if you have already installed and initialised Opam before.
It is advised to add the following line into your .bashrc
file.
eval `opam config env`
We recommand a version of OCaml of at least 4.03.
opam switch 4.04.0
The following lines ensure that the Coq repository is known by Opam.
opam repo add coq-released https://coq.inria.fr/opam/released
opam update
The version of Coq used to test this repository is the version 8.6. We can thus pin it in Opam.
opam pin add coq 8.6
We can now install the required packages.
opam install ocamlbuild menhir coq coq-flocq coq-paco extlib bisect coq-tlc
eval `opam config env`
We can now clone the project.
git clone https://github.com/Mbodin/CoqR.git CoqR
cd CoqR
git submodule update --init
The Makefile should now work.
make
This should create the following program.
src/runR.native
As-is, this program should run more or less similarly to R (although most features have not been implemented). It accepts a large number of argument to customize its output, which are (briefly) described by the following invocation:
src/runR.native -expert-mode -help
Initialising the program can take some time. To save computations time, the program enables to save and load previous states from memory. A faster-to-load version is available with the following invocation.
make run
The first make
command also created a html/
folder,
with an arguably more readable version of the files and their documentations.
See for instance the table of contents in html/toc.html
.
The Bisect tool is a coverage tool for OCaml. The following invocation uses it in the context of this R interpreter.
make run_bisect
This will generate bisect.out
files during the computation.
To translate these files into a readable report, run the following command.
This will create some files in the bisect/
folder,
and in particular bisect/report/index.html
providing a coverage report.
make report
Note that make run
does not include the R libraries:
it only includes what is defined in C in GNU R.
Executing the R files of the base library can take some time, but it will lead to the usual R environment.
To include this libraries, use the following invocation.
Note that the first time that this invocation will be performed, all of the base library will be executed, which will take quite a large amount of time.
Once this library will have been executed, the final state will be stored and cached.
make lrun
If you want to also run Bisect with the R libraries, use the following invocation.
make lrun_bisect
This folder contains is a low-level formalisation of R. It is the closest possible to the R reference interpreter (written in C).
This file describes the data structures of R as they are in its C code.
Auxiliary definitions and properties about the data structures defined in
Rinternals.v
.
This file may take some time to compile.
This file provides a model for the C memory.
This file describes various internal data types used in the source of R.
This file lists all global variables used in the C source code of R, that are initialised, then never changed.
This file may take some time to compile.
This file provides monads to manipulate R objects easily.
In some ways, this file provides notations enabling to write imperative code in Coq with respect to memory.
This file provides looping monads to easily manipulate R objects. It also defines the structure used in the interpreter to deal with infinite execution.
In some ways, this file provides notations enabling to write imperative loops in Coq.
This folder contains Coq translations of various core functions (that has been called “nucleus functions” at some point) of R from its C code. This folder is organised by the corresponding file name in the C source code of R.
The core functions of R are the ones used thoroughly in R source code.
They are internal C functions and are not linked to the initial R environment
(defined in the names.c
file in C).
Due to some circular dependencies between C files (which is accepted in C, but not in Coq),
some functions have been put in a special file Conflicts.v
.
Note that despite this special file, there still are some functions separated from their
original file in other file. They are however always preceeded by a comment warning for
the exceptional location of the function definition.
This file wraps up all the files present in the core/
folder.
Initialises global variables.
This folder contains Coq translations of various non-core functions of R from its C code.
This folder is organised by the corresponding file name in the C source code of R.
Most of the functions defined in the files of this folder appear in the initial
environment of R, initialised in the file names.c
in C.
This file wraps up all the files present in the features/
folder.
This file also wraps all these functions into the main function array of names.c
.
Finally, this files defines the runs
structure, enabling Coq functions to loop
without explicitly manipulating a fuel.
Contains useful lemmae about runs
.
This file may take some time to compile.
This file formalises functions for parsing R expressions.
Provides tactics to easily manipulate the monads defined in Monads.v
and Loops.v
.
Provides abstractions to reason about the heap.
States some invariants of R’s heap.
Contains lemmae about the invariants stated in Invariants.v
.
Defines tactics to manipulate the invariants defined in Invariants.v
.
Proofs of some invariants respected by the functions defined in Rcore.v
,
Rinit.v
, and Rfeatures.v
.
Extract R interpreter into OCaml.
These files are extracted by Extraction.v
.
Contains various OCaml function to print Coq objects.
Contains some impure functions used by the Coq extracted code (mainly input and output).
Main interpreter file. It calls the various functions defined in extract.ml.
The following command provides some help about it.
src/runR.native -help
This folder is a random test generator.
It builds random tests from the gram
file, which could in theory be
used for other languages than just R.
The tests generated by this program are however of very relative interest, most of them just throwing an error. It was useful during the development to test non-frequent programs.
The file testExtra.R
has been build by hand during the development.
It consists of more than 500 unit tests, checking for very specific features of R,
subjectively considered difficult.
Instances of these tests include implicit type conversions or some specific
rules for variable look-ahead.
The files in the src/
folder are under the GPL license version 2.
See the file COPYING
for more informations.
As this license is the same than R’s license, as well as most programs about R,
it seemed the best to help development.
If you want to contribute to this project, consider reading the file CODE_OF_CONDUCT.org
.
Some suggestions for future directions can be found in the file CONTRIBUTING.org
.