
undefined/optional values in DLite #661

Open
sygout opened this issue Oct 2, 2023 · 6 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@sygout
Collaborator

sygout commented Oct 2, 2023

It would be practical to be able to handle undefined values in DLite (as distinct from the default initialisation). The user should be able to detect that a value is undefined.

The idea would be to reserve, for each type, a value that indicates that the property is undefined.

sygout added the enhancement and help wanted labels Oct 2, 2023
@CasperWA
Collaborator

CasperWA commented Oct 2, 2023

A use case could be retrieving data from a NoSQL database collection using a single data model/entity.

@jesper-friis
Collaborator

jesper-friis commented Oct 2, 2023

It is definitely possible, but one of the strengths of the DLite and SOFT concepts is simplicity. I think that we should have a strong and clear need for such a feature before implementing it.

DLite instances are currently initialised with zeros. For most types it is currently impossible to know whether that zero means uninitialised or is a real zero, but for pointer types, like string and ref types, the uninitialised value will be NULL, and non-NULL once initialised.
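The distinction can be illustrated with a toy sketch (plain Python, not the actual DLite API): a zero-initialised numeric value is indistinguishable from a real zero, while a pointer-like value initialised to None (NULL in C) can be tested for.

```python
# Toy model of zero-initialised instance properties (not the DLite API).
temperature = 0.0   # zero-initialised float: ambiguous -- real 0.0 or unset?
name = None         # pointer-like (string) property: NULL until assigned


def is_initialised(value):
    """Only meaningful for pointer-like properties, mirroring the C behaviour."""
    return value is not None


print(is_initialised(name))   # False: clearly uninitialised
name = "sample-42"
print(is_initialised(name))   # True
print(temperature == 0.0)     # True, but we cannot tell whether it was ever set
```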

@jesper-friis
Collaborator

jesper-friis commented Dec 20, 2023

Adding support for fill values may be a general way to address this issue. A "fill value" is a property value with a special meaning. We could allow attaching a list of (fill value, key, description) triples to a property, where:

  • fill value is a pointer to the exact bit sequence corresponding to the fill value
  • key is a positive integer key associated with the fill value (for use in switch statements in C, etc.)
  • description is a human-readable description of the meaning of the fill value

We probably want to share a set of fill values for a given datatype and reuse it for many properties in different data models. In this case it may be smart to create a ref-counted C struct for a set of fill values.

Since uninitialised is a common fill value that needs special handling by DLite during initialisation, one could impose the convention that a fill value with key=1 always means uninitialised.
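In Python terms, the proposed structure could be sketched as below (all names here, FillValue and FillValueSet included, are hypothetical; in C this would be a ref-counted struct so many properties can share one set):

```python
from dataclasses import dataclass, field

UNINITIALISED_KEY = 1  # proposed convention: key=1 always means "uninitialised"


@dataclass(frozen=True)
class FillValue:
    """One (fill value, key, description) triple. Names are hypothetical."""
    value: bytes      # exact bit sequence corresponding to the fill value
    key: int          # positive integer key, usable in C switch statements
    description: str  # human-readable meaning


@dataclass
class FillValueSet:
    """A shareable set of fill values for a given datatype."""
    fills: list = field(default_factory=list)

    def lookup(self, raw: bytes) -> int:
        """Return the key if `raw` matches a fill value, otherwise 0."""
        for f in self.fills:
            if f.value == raw:
                return f.key
        return 0


# Example set for a 32-bit signed integer datatype (values invented):
int32_fills = FillValueSet([
    FillValue((-999).to_bytes(4, "little", signed=True),
              UNINITIALISED_KEY, "uninitialised"),
    FillValue((-1000).to_bytes(4, "little", signed=True),
              2, "no sensor at this grid point"),
])

print(int32_fills.lookup((-999).to_bytes(4, "little", signed=True)))  # 1
print(int32_fills.lookup((42).to_bytes(4, "little", signed=True)))    # 0
```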

In the Python interface we could add the methods:

# Proposed methods to be added to dlite.Instance:

def is_fill_value(self, name: str, index=None) -> int:
    """Check the value of property `name`. If it is a fill value, return the
    fill value key, otherwise return zero. `index` is an optional index of
    the element to check for array properties.
    """

def fill_value_description(self, name: str, index=None) -> str:
    """Check the value of property `name`. If it is a fill value, return the
    fill value description, otherwise return None. `index` is an optional
    index of the element to check for array properties.
    """

to check if the value of a property is a fill value. In addition we would need an API for defining and assigning fill values.
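How such methods might behave can be sketched with a stub (this Instance class is a stand-in for dlite.Instance, which does not yet have these methods; the dict-based fill registry is an assumption for illustration):

```python
# Hypothetical stub illustrating the semantics of the proposed methods.
class Instance:
    def __init__(self, values, fill_values):
        self._values = values      # property name -> value (or list of values)
        self._fills = fill_values  # fill value -> (key, description)

    def _get(self, name, index):
        return self._values[name] if index is None else self._values[name][index]

    def is_fill_value(self, name, index=None):
        """Return the fill value key, or 0 if the value is a real value."""
        return self._fills.get(self._get(name, index), (0, None))[0]

    def fill_value_description(self, name, index=None):
        """Return the fill value description, or None for a real value."""
        return self._fills.get(self._get(name, index), (0, None))[1]


inst = Instance(
    values={"temperature": [12.3, -999.0, -1000.0]},
    fill_values={-999.0: (1, "uninitialised"), -1000.0: (2, "sensor error")},
)
print(inst.is_fill_value("temperature", index=0))            # 0: real value
print(inst.is_fill_value("temperature", index=1))            # 1: uninitialised
print(inst.fill_value_description("temperature", index=2))   # sensor error
```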

@quaat
Collaborator

quaat commented Jan 22, 2024

There are multiple possible conventions that we may encounter. A fill value is one: a special value used to represent missing or irrelevant data. For example, in a temperature dataset, if a particular day's data is missing, a fill value like -999 might be used to indicate this. In oceanographic datasets things can become more complicated. Say we have a grid of water surface temperatures, but parts of the grid are on land rather than ocean. Here one fill value can indicate that we are on land, i.e. there is no temperature to measure, while a different fill value is used where the temperature was simply not measured (a real missing value). In this case we have multiple fill values to consider.
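The oceanographic case with two distinct fill values can be made concrete with a small sketch (grid values and fill codes are invented for illustration):

```python
# Toy water-surface temperature grid with two fill values of different meaning.
LAND = -999.0      # grid point is on land: there is nothing to measure
MISSING = -1000.0  # grid point is ocean, but the measurement is missing

grid = [
    [12.3, 12.1, LAND],
    [11.8, MISSING, LAND],
]

# Separate real measurements from the two kinds of fill values.
measured = [v for row in grid for v in row if v not in (LAND, MISSING)]
n_land = sum(v == LAND for row in grid for v in row)
n_missing = sum(v == MISSING for row in grid for v in row)

print(measured)            # [12.3, 12.1, 11.8]
print(n_land, n_missing)   # 2 1
```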

Valid min/max/range is another possible constraint: limits that define which values in the data are acceptable. Any value outside this range might be considered an error or unusual.

To save space or increase efficiency, data is sometimes stored in a simplified form and needs to be multiplied by a certain number (the scale factor) to get the actual value. For example, if temperatures are recorded as the short int 123 instead of 12.3, the scale factor would be 0.1. This is often combined with an offset value, which is added to the scaled number to get the final value.
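The scale-factor/offset convention above is just arithmetic (the offset in the second call is a made-up example, as the comment gives none):

```python
# Decode a packed value: actual = stored * scale_factor + offset.
def decode(stored: int, scale: float, offset: float = 0.0) -> float:
    return stored * scale + offset


print(decode(123, 0.1))          # ~12.3, the example from the comment
print(decode(123, 0.1, -10.0))   # with a hypothetical offset of -10
```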

Strictly speaking, all these properties are data - and we could just treat them as such by adding them as properties. In the mapping we could then align them with concepts defining the different constraints, and the interoperability framework (soft/dlite/+) will need to interpret the data correctly.

Another method would be to just use the knowledge base to document these constraints. This would be useful when the conventions are not documented as part of the dataset, but are implicitly understood.

I would suggest making a simple test dataset where we try this out. This could be useful as a future dataset for our test-system.

@quaat
Collaborator

quaat commented Jan 22, 2024

Another (pragmatic) approach could be to say that missing data is the responsibility of the parser or generator, and simply add it as a special configuration parameter. This would simplify things a lot, and it would also be stored as part of the data documentation (the partial pipeline).

@jesper-friis
Collaborator

jesper-friis commented Feb 2, 2024

Yet another, very general, approach is to add an optional "relations" section to the data models (alongside "dimensions" and "properties") and define a standardised way to document uninitialised and fill values as RDF triples.

Examples of such relations could be:

http://onto-ns.com/meta/0.1/MyDataModel#prop onto:hasInitialValue -999 (xsd:int)
http://onto-ns.com/meta/0.1/MyDataModel#prop onto:hasFill _:fill1
_:fill1 onto:hasFillValue -1000 (xsd:int)
_:fill1 onto:hasDescription "No sensor at this grid point." (@en)
http://onto-ns.com/meta/0.1/MyDataModel#prop onto:hasFill _:fill2
_:fill2 onto:hasFillValue -1001 (xsd:int)
_:fill2 onto:hasDescription "Sensor error." (@en)

Since Relation is already an implemented basic type in DLite, this would only require a trivial update of the entity schema (increasing its version number to 0.4). For backward compatibility, we could add a trivial instance mapping between http://onto-ns.com/meta/0.3/EntitySchema and http://onto-ns.com/meta/0.4/EntitySchema. The only part of DLite that would need to interpret this information is instance creation, which should check whether the metadata defines a fill value for uninitialised values for each property and, if so, initialise the property with that fill value.
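The instance-creation check could be sketched as follows (the triple structure mirrors the example above; the `onto:` predicate names are placeholders from that example, and `initial_value` is a hypothetical helper):

```python
# Datamodel-level relations as (subject, predicate, object) string triples.
PROP = "http://onto-ns.com/meta/0.1/MyDataModel#prop"

relations = [
    (PROP, "onto:hasInitialValue", "-999"),
    (PROP, "onto:hasFill", "_:fill1"),
    ("_:fill1", "onto:hasFillValue", "-1000"),
    ("_:fill1", "onto:hasDescription", "No sensor at this grid point."),
]


def initial_value(prop_iri, relations, default=0):
    """Return the declared initial (uninitialised) value for a property,
    falling back to the current zero-initialisation if none is declared."""
    for s, p, o in relations:
        if s == prop_iri and p == "onto:hasInitialValue":
            return int(o)
    return default


print(initial_value(PROP, relations))      # -999
print(initial_value("#other", relations))  # 0 (no declaration: zero-init)
```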

A benefit of this approach is that these datamodel-level relations are very generic and could be used to semantically provide other kinds of additional information about a data model. However, it should be limited to expressing only things that will always be true for all use cases of the data model. An example of such an additional use case could be to state that the property "mass_std" is the standard deviation of the property "mass".

A possible drawback of this approach is that the syntax of listing RDF triples feels complex. This might not be a real issue, since in the future novice users will work with data models via SOFT Studio, which could provide a simple graphical interface for common usages of datamodel-level relations.
