XML predicate mapping repeating child elements getting concatenated if reference includes concatenation #235

Open
schivmeister opened this issue Apr 17, 2024 · 3 comments

Comments

@schivmeister

schivmeister commented Apr 17, 2024

Environment

rmlmapper v6.5.1 (reproducible also as far back as v6.1.3)
Linux/WSL2
Java 17, 11

Namespaces

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix ex: <http://data.example.org/resource/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix fnml:   <http://semweb.mmlab.be/ns/fnml#> .
@prefix fno: <https://w3id.org/function/ontology#> .
@prefix idlab-fn: <http://example.com/idlab/function/> .

Problem

Given the following kind of input XML with two Organization elements, where the first has two child Name elements:

<Directory>
	<Organization>
		<ID>123</ID>
		<Name>ABC Fast Company</Name>
		<Name>ABC FastCo</Name>
	</Organization>
	<Organization>
		<ID>456</ID>
		<Name>XYZ Inc.</Name>
	</Organization>	
</Directory>

and the following kind of RML mapping involving a custom concatenated value in the source reference:

ex:Organizations a rr:TriplesMap;
    rml:logicalSource [
        rml:source "test.xml";
        rml:iterator "/Directory/Organization";
        rml:referenceFormulation ql:XPath
    ];
    rr:subjectMap [
        rr:template "http://data.example.org/resource/Organization_{ID}";
        rr:class org:Organization
    ];
    rr:predicateObjectMap [
        rr:predicate org:name;
        rr:objectMap
            [
                rml:reference "'CustomPrefix ' || Name || ' CustomSuffix'"
            ];
    ]
.

Actual

This produces unexpected output: the first resource's name is a single literal with both repeating values run together between the prefix and suffix, instead of multiple comma-separated RDF/Turtle values:

ex:Organization_123 a org:Organization;
  org:name "CustomPrefix ABC Fast CompanyABC FastCo CustomSuffix" . # these are values from two Name elements

# the second resource remains unaffected (correctly formed)
ex:Organization_456 a org:Organization;
  org:name "CustomPrefix XYZ Inc. CustomSuffix" .

Expected

This should result in multiple comma-separated values, one per XML child element, each following the pattern given in the reference:

ex:Organization_123 a org:Organization;
  org:name "CustomPrefix ABC Fast Company CustomSuffix", "CustomPrefix ABC FastCo CustomSuffix" .

Workaround

Template bypassing XPath expressions

This is perhaps the closest thing to an actual solution (if you don't need additional XPath complexity):

        rr:objectMap
            [
                rr:template "CustomPrefix {Name} CustomSuffix" ;
                rr:datatype xsd:string ; # an explicit type is required otherwise termType IRI is inferred and error raised
            ];

producing the correct result:

ex:Organization_123 a org:Organization;
  org:name "CustomPrefix ABC Fast Company CustomSuffix", "CustomPrefix ABC FastCo CustomSuffix" .

ex:Organization_456 a org:Organization;
  org:name "CustomPrefix XYZ Inc. CustomSuffix" .
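
One way to picture why the template form yields one literal per Name is placeholder expansion: each `{...}` reference resolves to a list of values, and one string is emitted per combination. The Python sketch below is only a model of that idea, not RMLMapper's code. `expand_template` is a hypothetical helper, and combining several multi-valued placeholders as a Cartesian product is an assumption.

```python
import itertools
import re
import xml.etree.ElementTree as ET

XML = """<Directory>
  <Organization>
    <ID>123</ID>
    <Name>ABC Fast Company</Name>
    <Name>ABC FastCo</Name>
  </Organization>
  <Organization>
    <ID>456</ID>
    <Name>XYZ Inc.</Name>
  </Organization>
</Directory>"""


def expand_template(template, record):
    """Hypothetical helper: one output string per combination of placeholder values."""
    refs = re.findall(r"\{([^}]+)\}", template)
    # Each placeholder resolves to the text of every matching child element.
    values = [[el.text for el in record.findall(ref)] for ref in refs]
    results = []
    for combo in itertools.product(*values):
        out = template
        for ref, val in zip(refs, combo):
            out = out.replace("{" + ref + "}", val, 1)
        results.append(out)
    return results


root = ET.fromstring(XML)
for org in root.findall("Organization"):
    # Organization 123 gets two strings, Organization 456 gets one.
    print(expand_template("CustomPrefix {Name} CustomSuffix", org))
```

A placeholder with no matching element produces no strings at all in this model, which is also an assumption rather than documented behaviour.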

Plain reference with out-of-band strategies

One could skip the concatenating reference altogether and replicate the desired outcome with something external, e.g. (custom) functions, or even a mapping-table lookup via a parentTriplesMap.

Removing the concatenation obviously makes it work:

        rr:objectMap
            [
                rml:reference "Name"
            ];

resulting in:

ex:Organization_123 a org:Organization;
  org:name "ABC Fast Company", "ABC FastCo" .

Reoriented iterator

Iterating over the repeating child element and building the subject from its parent element appears to work:

ex:Organizations a rr:TriplesMap;
    rml:logicalSource [
        rml:source "test.xml";
        rml:iterator "/Directory/Organization/Name";
        rml:referenceFormulation ql:XPath
    ];
    rr:subjectMap [
        rr:template "http://data.example.org/resource/Organization_{../ID}";
        rr:class org:Organization
    ];
    rr:predicateObjectMap [
        rr:predicate org:name;
        rr:objectMap
            [
                rml:reference "'CustomPrefix ' || . || ' CustomSuffix'"
            ];
    ]
.

However, this is unintuitive and convoluted. The correct behaviour would be for repeating child elements to be repeated as values for a predicateObjectMap as well, as they normally are with a plain reference (or template).
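
To illustrate why the reoriented iterator avoids the problem, here is a rough Python model of what that mapping does: each iteration sees exactly one Name node, so `.` is never a multi-item sequence, and the subject is built from the parent's ID. This is a sketch under assumptions, not RMLMapper's implementation. `map_names` and the tuple layout are hypothetical, and the parent lookup is needed only because ElementTree has no `..` axis.

```python
import xml.etree.ElementTree as ET

XML = """<Directory>
  <Organization>
    <ID>123</ID>
    <Name>ABC Fast Company</Name>
    <Name>ABC FastCo</Name>
  </Organization>
  <Organization>
    <ID>456</ID>
    <Name>XYZ Inc.</Name>
  </Organization>
</Directory>"""

BASE = "http://data.example.org/resource/Organization_"


def map_names(root):
    """Hypothetical model: iterate over Name nodes, subject from the parent's ID."""
    # ElementTree keeps no parent pointers, so build a child -> parent lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    triples = []
    for name in root.findall("Organization/Name"):  # iterator on the child
        org = parent_of[name]                        # stands in for '..'
        subject = BASE + org.findtext("ID")
        # '.' is a single node here, so concatenation is unambiguous.
        obj = "CustomPrefix " + name.text + " CustomSuffix"
        triples.append((subject, "org:name", obj))
    return triples


for triple in map_names(ET.fromstring(XML)):
    print(triple)
```

The trade-off is the one noted above: the mapping is now organised around Name rather than Organization, so an Organization without any Name would produce no triples from this map at all.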

MWE

rml-mwe-concat-multivalue.zip (excludes template example)

Context

This may or may not be related to #227 #228.

@bjdmeest
Collaborator

Thanks for the very detailed bug report! I'm afraid this is an old RML spec issue: the spec is underspecified on how to handle multi-valued references, which sometimes produces very weird results like the ones you've detailed here, e.g. in combination with rr:template or a function. We're working on improving the new version of the spec and on a more global solution using the Logical Views extension, with a PoC implementation available (and a paper being presented next month). However, that's all still in alpha stage.

So, there are actually 3 paths that can be taken in parallel, I think:

  • you keep following the unintuitive solution, as that's probably most mature
  • you try the logical views PoC as an experiment to see whether that solves this and more problems, and help us with feedback on the spec
  • we double-check this bug report to see whether this is an edge case that should also be fixed in the core RML spec, and could thus result in a bug fix in the current RMLMapper-JAVA

We'll check when we can dedicate some time to this bug report, but as you can imagine, as an academic institution we're always trying to find a balance with our research roadmaps and paid projects. If this is really blocking you, feel free to reach out at [email protected] to see how we can prioritize this!

@schivmeister
Author

schivmeister commented Apr 18, 2024

Thank you for the swift response @bjdmeest! It already helps a lot to know that I'm not (likely) making a mistake somewhere. I understand that offering a resolution is not always possible, which is totally fine. We will reach out if it indeed turns out to be a blocker.

There are potentially other solutions depending on the use case, e.g. in our case it was originally related to a lookup based on modified source values, but we decided to encode certain values in the lookup table as a workaround instead, so that we need not modify the reference.

Otherwise, I took a look again at the Logical Views extension, which I did check out briefly once before for tabular lookups. However, I don't see XML as a supported source format in the reference/PoC implementation, and I also think it attacks a different problem.

Nevertheless, I took the liberty of trying to figure out where in the code this is likely happening. It appears to be an issue with the dataio library's XMLRecord.get() implementation as called in ReferenceExtractor::extract(). When reproducing the issue, record.get() yields:

[CustomPrefix ABC Fast CompanyABC FastCo CustomSuffix]

instead of

[CustomPrefix ABC Fast Company CustomSuffix, CustomPrefix ABC FastCo CustomSuffix]

or in the case of a plain reference:

[ABC Fast Company, ABC FastCo]

It could very well be that the concatenation causes unexpected behaviour in the evaluation of the XPaths (using Saxon?), as a direct concat on repeating elements would otherwise raise an error of the form:

error: A sequence of more than one item is not allowed as the second argument of fn:concat() ...

Tested using:

java -cp saxon-he-12.4.jar net.sf.saxon.Query -s:test.xml -qs:"concat('CustomPrefix', /Directory/Organization[1]/Name, ' CustomSuffix')"

But we are not getting an error in the mapping itself, just unexpected concatenation, which indicates that the function works but is being evaluated on the entire set of XPath query results.
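
The hypothesis can be modelled in a few lines of Python. This is only a sketch of the two candidate semantics, not the dataio or Saxon code path. Both function names are hypothetical, and the silent joining of string values in the first one is an assumption about what happens internally.

```python
import xml.etree.ElementTree as ET

ORG = ET.fromstring(
    "<Organization>"
    "<ID>123</ID>"
    "<Name>ABC Fast Company</Name>"
    "<Name>ABC FastCo</Name>"
    "</Organization>"
)


def concat_over_sequence(record, prefix, ref, suffix):
    """Model of the observed output: all matches collapse into one string."""
    joined = "".join(el.text for el in record.findall(ref))
    return [prefix + joined + suffix]


def concat_per_item(record, prefix, ref, suffix):
    """Model of the expected output: one result per matching node."""
    return [prefix + el.text + suffix for el in record.findall(ref)]


print(concat_over_sequence(ORG, "CustomPrefix ", "Name", " CustomSuffix"))
print(concat_per_item(ORG, "CustomPrefix ", "Name", " CustomSuffix"))
```

The first call reproduces the single merged value seen from record.get(), and the second reproduces the two-element list that was expected.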

@schivmeister
Author

schivmeister commented Apr 18, 2024

I realized after all that we also have template, which works:

        rr:objectMap
            [
                rr:template "CustomPrefix {Name} CustomSuffix" ;
                rr:datatype xsd:string ; # an explicit type is required otherwise termType IRI is inferred and error raised
            ];

So, this is a very valid alternative for simple cases not involving other XPath expressions (added as a workaround in the original post).

schivmeister added a commit to OP-TED/ted-rdf-mapping-eforms that referenced this issue Oct 11, 2024
Extract languageMaps into their own TriplesMaps as RML is
"underspecified" for multi-valued properties, leading to odd behaviour
in beyond-basic use cases (in this case language code transformation via
CSV lookup function).

Effectively fixes gh-66 but implements an unintuitive mapping approach
(predicate at its own iterator).

See also RMLio/rmlmapper-java#235.

Affects:

- ChangeInformation
- ContactPoint
- Lot
- LotGroup
- Organization
- Procedure
- BT-75-Lot versioned, with fix for version range and dupe conflict
- BT-772-Lot versioned

ChangeInformation tested with 673305-2023.xml.

Many of the Lot fields could not be tested due to unavailability of data
(not easily searchable or requirements expressible through currently
available means).

The change for BT-75-Lot fixes the situation where there was a redundant
1.4+ mapping in the common Lot RML file, and a mistaken version
annotation for 1.3-1.3 (being marked as min 1.4). Because later versions
are less restrictive, this cannot easily be caught, but was otherwise
wrong (the common would override).

Additionally, remove some conflicting dupe references in Procedure:

- BT-01(d)-Procedure
- BT-1351-Procedure

This may or may not have led to undefined behaviour in the mapping
(there were no change in outputs so hopefully this had no impact).
schivmeister added a commit to OP-TED/ted-rdf-mapping-eforms that referenced this issue Oct 11, 2024
Extract languageMaps into their own TriplesMaps as RML is
"underspecified" for multi-valued properties, leading to odd behaviour
in beyond-basic use cases (in this case language code transformation via
CSV lookup function).

Effectively fixes gh-66 but implements an unintuitive mapping approach
(predicate at its own iterator).

See also RMLio/rmlmapper-java#235.

Affects:

- Contract
- Procedure CAN
- BT-554-Tender versioned