undefined/optional values in DLite #661
A use case could be retrieving data from a NoSQL database collection using a single data model/entity.
It is definitely possible, but one of the strengths of the DLite and SOFT concepts is simplicity. I think we should have a strong and clear need for such a feature before implementing it. DLite instances are currently initialised with zeros. For most types it is impossible to know whether that zero means uninitialised or is a real zero, but for pointer types, like string and ref types, the uninitialised value will be NULL and the initialised value non-NULL.
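For pointer-typed properties this can already be checked from Python today. A minimal sketch, assuming a hypothetical data model with a string property `name` and that the Python bindings expose the unset (NULL) string as `None`:

```python
import dlite

# Hypothetical data model URI; assumed to define an optional string
# property "name" and to have no dimensions.
inst = dlite.Instance.from_metaid("http://onto-ns.com/meta/0.1/Person", [])

# For pointer types (string, ref) the zero-initialised value is NULL,
# which is assumed to surface as None in the Python bindings.
if inst.name is None:
    print("name is uninitialised")
```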
Adding support for fill values may be a general way to address this issue. A "fill value" is a property value with a special meaning. We could allow attaching a list of value-key-description triples to a property, where:

- *value* is the special property value itself (e.g. -999),
- *key* is a short identifier for its meaning (e.g. "missing" or "uninitialised"),
- *description* is a human-readable explanation of the meaning.
We probably want to share a set of fill values for a given datatype and reuse it for many properties in different data models. In this case it may be smart to create a ref-counted C struct for a set of fill values. Since uninitialised is a common fill value that needs special handling by DLite during initialisation, one could impose the convention that a fill value with the key "uninitialised" is reserved for this purpose. In the Python interface we could add methods
to check whether the value of a property is a fill value. In addition we would need an API for defining and assigning fill values.
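A hypothetical sketch of what such a Python API could look like; none of these classes or methods exist in DLite today, and the names (as well as the `meta` and `inst` variables) are only illustrative:

```python
# Hypothetical API -- nothing here exists in DLite yet.

# Define a reusable set of fill values (value-key-description triples)
# that can be shared between properties and data models.
missing_data = dlite.FillValues([
    (-999.0, "missing",        "value was never measured"),
    (-998.0, "not_applicable", "no value exists at this point"),
])

# Assign the set to a property of a data model.
meta.set_fill_values("temperature", missing_data)

# Query an instance.
inst.is_fill_value("temperature")       # True if current value is a fill value
inst.get_fill_value_key("temperature")  # e.g. "missing", or None
```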
There are multiple possible conventions that we may encounter:

- *Fill value*: a special value used to represent missing or irrelevant data. For example, in a temperature dataset, if a particular day's data is missing, a fill value like -999 might be used to indicate this. In oceanographic datasets things can become more complicated. Say we have a water surface grid of temperatures, but parts of the grid are on land rather than ocean. Here one fill value can indicate that we are on land, i.e. there is no temperature to measure, while a different fill value is used where the temperature simply was not measured (a real missing value). In this case we have multiple fill values to consider.
- *Valid min/max/range*: limits that define which values in the data are acceptable. Any value outside this range might be considered an error or unusual.
- *Scale factor*: to save space or increase efficiency, data is stored in a simplified form and needs to be multiplied by a certain number (the scale factor) to get the actual value. For example, if temperatures are recorded as tenths of a degree, the stored values must be multiplied by 0.1.

Strictly speaking, all these constraints are data, and we could just treat them as such by adding them as properties. In the mapping we could then align them with concepts defining the different constraints, and the interoperability framework (SOFT/DLite/+) would need to interpret the data correctly; a sketch of this is given below. Another method would be to just use the knowledge base to document these constraints. This would be useful when the conventions are not documented as part of the dataset, but implicitly understood. I would suggest making a simple test dataset where we try this out. This could be useful as a future dataset for our test system.
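To make this concrete, a small self-contained sketch (plain Python, with hypothetical convention values) of how a consumer would interpret the data when the constraints are stored as ordinary properties:

```python
import math

# Hypothetical convention values, stored alongside the data as ordinary
# properties of the data model.
FILL_MISSING = -999   # measurement is missing
FILL_ON_LAND = -888   # grid point is on land, nothing to measure
SCALE_FACTOR = 0.1    # stored integers are tenths of a degree
VALID_RANGE = (-5.0, 40.0)

def decode(stored: int) -> float:
    """Return the physical value, or NaN for any kind of fill value."""
    if stored in (FILL_MISSING, FILL_ON_LAND):
        return math.nan
    value = stored * SCALE_FACTOR
    if not VALID_RANGE[0] <= value <= VALID_RANGE[1]:
        raise ValueError(f"value {value} outside valid range {VALID_RANGE}")
    return value

print(decode(215))    # 21.5 degrees
print(decode(-999))   # nan (missing measurement)
```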
Another (pragmatic) approach could be to say that missing data is the responsibility of the parser or generator, and simply add it as a special configuration parameter. This would simplify things a lot, and it would also be stored as part of the data documentation (the partial pipeline).
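For example, the fill value could be passed in the driver's options string when loading the data. The `from_location()` call itself exists in DLite, but the `fill_value` option below is hypothetical; no driver currently understands it:

```python
import dlite

# Hypothetical: a "fill_value" option interpreted by the csv driver.
inst = dlite.Instance.from_location(
    "csv", "ocean_temperatures.csv", options="fill_value=-999"
)
```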
Yet another and very general approach is to add an optional "relations" section to the data models (like "dimensions" and "properties") and define a standardised way to document uninitialised and fill values as RDF triples. An example of such a relations section is sketched below.
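A sketch of what this could look like in a JSON data model. The data model URI, the triple layout, and the predicate names (`dm:...`) are made up for illustration and not part of any existing vocabulary:

```json
{
  "uri": "http://onto-ns.com/meta/0.1/OceanTemperature",
  "description": "Hypothetical example with a datamodel-level relations section.",
  "dimensions": {"N": "Number of grid points."},
  "properties": {
    "temperature": {"type": "float64", "shape": ["N"], "unit": "degC"}
  },
  "relations": [
    ["temperature", "dm:hasUninitialisedValue", "-999"],
    ["temperature", "dm:hasFillValue", "-888"]
  ]
}
```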
Since Relation is already an implemented basic type in DLite, this would only require a trivial update of the entity schema (increasing its version number to 0.4). For backward compatibility, we could add trivial instance mappings between http://onto-ns.com/meta/0.3/EntitySchema and http://onto-ns.com/meta/0.4/EntitySchema. The only part of DLite that would need to interpret this information is instance creation, which should check whether the metadata defines a fill value for uninitialised values for each property and, if so, initialise the property with the given fill value (see the sketch below).

A benefit of this approach is that these datamodel-level relations are very generic and could be used to semantically provide other kinds of additional information about a data model. However, it should be limited to expressing things that will always be true for all use cases of the data model. An example of such an additional use case could be to state that the property "mass_std" is the standard deviation of the property "mass". A possible drawback of this approach is that the syntax of listing RDF triples feels complex. This might not be a real issue, since in the future novice users will work with data models via SOFT Studio, which could provide a simple graphical interface for common usages of datamodel-level relations.
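The required initialisation step in Python-like pseudocode; `find_relations()` and the predicate name are made up to illustrate the idea, none of this exists in DLite today:

```python
# Hypothetical sketch -- helper names and predicate are illustrative only.
def apply_uninitialised_values(inst, meta):
    """After zero-initialisation, overwrite properties that declare an
    explicit uninitialised value in the datamodel-level relations."""
    for prop in meta.properties:
        # find_relations() is assumed to yield matching (s, p, o) triples
        for s, p, o in meta.find_relations(prop.name, "dm:hasUninitialisedValue"):
            inst[prop.name] = float(o)  # assuming a numeric property here
```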
It would be practical to be able to handle undefined values in DLite (as distinct from the default initialisation).
The user should be able to detect that a value is undefined.
The idea would be to set a value for each type that indicates that it is undefined.
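For floating point properties, one possible realisation of such a per-type value is NaN; a minimal sketch, assuming `inst` has a float property `temperature`:

```python
import math

inst.temperature = math.nan           # mark the value as undefined
if math.isnan(inst.temperature):      # the user can detect it later
    print("temperature is undefined")
```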