Daniel Jacob edited this page Mar 9, 2020 · 27 revisions

ODAM: Open Data, Access to data Mining

Give open access to your data and make them ready to be mined

Purpose

Here, we propose a simple way to make research data broadly accessible and fully available for reuse, including from a scripting language such as R. The main purpose is to make a dataset accessible online with minimal effort from the data provider, and to allow any scientist or bioinformatician to explore the dataset and then extract part or all of the data according to their needs.

Whenever data from a common experimental design are to be shared, the classic obstacles to rapid use of the data by every partner are data storage and data access. We propose an approach for sharing project data throughout the project's development phase, from the setup of the experimental design to the acquisition of data from the various sample analyses, so that all data are readily available as soon as they are generated. The approach is based on the following criteria:

  • Centrally manage identifiers (plants, harvests, samples, ...) so that they are unique and shared by all partners
  • Avoid implementing a complex data management system (requiring a data model), given that many changes can occur during the project (new analyses, new measurements, others abandoned, ...)
  • Facilitate the subsequent publication of data: either the data can be used to populate an existing database, or they can be made available through a web-service approach with the associated metadata

For this work, we chose to keep the scientists' familiar practice of working with spreadsheets, so the same tool serves for both data files and metadata definition files. Moreover, our approach gives access to the data through web services, which provides a good way to connect distributed data. This approach should be regarded as complementary to publishing the data online in an institutional data repository, such as those listed on re3data.org (e.g. INRAE Data Portal, https://data.inra.fr/), whether or not the data are associated with a scientific paper. Whereas institutional data repositories focus on describing the experiment with the corresponding descriptive metadata, our approach adds some minimal but relevant structural metadata and thereby gives access to the data themselves, with the possibility to explore and mine them.

Data Type

Whatever the kind of experiment, this approach assumes a design of experiment (DoE) involving individuals, samples or other entities as the main objects of study. It also assumes the observation of dependent variables resulting from the effects of controlled experimental factors. Moreover, each object of study usually has its own identifier, and the variables can be quantitative or qualitative.

Figure 1: Each data subset is linked to the subset from which it originates, so that links can be interpreted as 'obtained from'. Each column within a data subset must be associated with an experimental data type (called a category); this matters especially for identifier columns, since the links are based on them.

An ODAM dataset is a bundle that contains a set of TSV files. The TSV files are simple tables containing the data of the dataset. Two specific TSV files, namely s_subsets.tsv and a_attributes.tsv, describe the metadata of the dataset, including informational metadata like descriptions of measures, as well as structural metadata like references between tables. The metadata lets non-expert users explore and visualize your data.
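As an illustration of how subsets linked by shared identifiers can be recombined, here is a minimal Python sketch. The subset contents, the column names (PlantID, SampleID, Treatment, FreshWeight) and the helper functions are all hypothetical and are not taken from an actual ODAM dataset; the sketch only shows the general idea of joining a subset to the subset it was obtained from.

```python
import csv
import io

# Hypothetical subset contents, inlined for the example.
# A real dataset would store each subset as a separate TSV file.
PLANTS_TSV = "PlantID\tTreatment\nP1\tControl\nP2\tStress\n"
SAMPLES_TSV = (
    "SampleID\tPlantID\tFreshWeight\n"
    "S1\tP1\t12.5\n"
    "S2\tP1\t11.0\n"
    "S3\tP2\t8.2\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_on_identifier(child_rows, parent_rows, key):
    """Attach the parent row's columns to each child row sharing the same key."""
    parents = {row[key]: row for row in parent_rows}
    merged = []
    for child in child_rows:
        parent = parents.get(child[key])
        if parent is None:
            continue  # child has no matching parent identifier; skip it
        merged.append({**parent, **child})
    return merged


plants = read_tsv(PLANTS_TSV)
samples = read_tsv(SAMPLES_TSV)

# Samples were 'obtained from' plants, so PlantID is the linking identifier.
merged = join_on_identifier(samples, plants, "PlantID")
for row in merged:
    print(row["SampleID"], row["PlantID"], row["Treatment"], row["FreshWeight"])
```

Each sample row now carries the experimental factor of the plant it came from, which is the kind of combined table a data miner would typically want to extract.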

Figure 2: s_subsets.tsv: a file that associates each data subset with a key concept corresponding to the main entity of the subset, and defines the "obtainedFrom" relations between these concepts

Figure 3: a_attributes.tsv: a metadata file in which each attribute (concept/variable) is annotated with some minimal but relevant metadata
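To make the role of the "obtainedFrom" relations more concrete, the following Python sketch walks such relations from one subset back to its origin. The metadata table and its column names (subset, obtained_from, identifier, file) are a simplified, hypothetical layout chosen for illustration; they are not the actual columns of s_subsets.tsv.

```python
import csv
import io

# Hypothetical, simplified structural metadata (illustrative columns only).
SUBSETS_TSV = (
    "subset\tobtained_from\tidentifier\tfile\n"
    "plants\t\tPlantID\tplants.tsv\n"
    "samples\tplants\tSampleID\tsamples.tsv\n"
    "compounds\tsamples\tSampleID\tcompounds.tsv\n"
)


def load_subsets(text):
    """Index the metadata rows by subset name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["subset"]: row for row in reader}


def lineage(subsets, name):
    """Return the chain of subsets from `name` back to the root subset."""
    chain = []
    seen = set()
    while name:
        if name not in subsets:
            raise KeyError(f"unknown subset: {name}")
        if name in seen:
            raise ValueError(f"cyclic 'obtained from' relation at: {name}")
        seen.add(name)
        chain.append(name)
        # An empty 'obtained_from' cell marks the root and ends the loop.
        name = subsets[name]["obtained_from"]
    return chain


subsets = load_subsets(SUBSETS_TSV)
print(lineage(subsets, "compounds"))  # ['compounds', 'samples', 'plants']
```

Knowing this chain tells a client which subsets have to be joined, and in which order, to relate a measurement to the experimental factors recorded at the root.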

Web-Services

The web services are REST services that follow a resource naming convention: understandable resource names lead to a web service API that is easy to use (identification and querying of resources) and easy to call from R. Output formats are TSV, JSON and XML. Although the web service outputs are not intended for human readers (scripting languages such as R are the typical clients), the XML outputs can be read in a web browser thanks to an XSL transformation that converts them to HTML.

Figure 4: REST Services: hierarchical tree of resource naming (URL)
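The idea of hierarchical resource naming can be sketched in Python as follows. The base URL, the position of the output format in the path and the dataset and subset names are assumptions made for the example; the actual URL layout is the one shown in Figure 4, which this sketch does not reproduce. No request is sent; the code only assembles a URL.

```python
from urllib.parse import quote

# Output formats named in the text above.
SUPPORTED_FORMATS = {"tsv", "json", "xml"}


def build_resource_url(base_url, fmt, *segments):
    """Assemble a hierarchical resource URL (hypothetical path layout)."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {fmt}")
    # Percent-encode each segment so that names with spaces or slashes
    # cannot break the hierarchy.
    parts = [base_url.rstrip("/"), fmt]
    parts.extend(quote(str(segment), safe="") for segment in segments)
    return "/".join(parts)


# Hypothetical server, dataset and subset names.
url = build_resource_url("https://example.org/api", "tsv", "my_dataset", "samples")
print(url)  # https://example.org/api/tsv/my_dataset/samples
```

Because every resource is addressed by a readable path, a client script only needs string concatenation and a TSV or JSON parser to query the service.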

Data Explorer

The Data Explorer makes the data easy to explore and visualize, and thereby helps users better understand the data as a whole. A range of univariate, bivariate and multivariate approaches have been implemented so that they can easily be used interactively.

Figure 5: The Data Explorer
