
Create transformers to support subextractors #35

Open
andimou opened this issue May 14, 2013 · 8 comments
@andimou
Member

andimou commented May 14, 2013

Problem

Sometimes the data tdt/input receives is already linked: a URI pointing to another resource, or a URL to a file with further explanation, can be worth including in our harvester as well. Some of the resources referenced in these files are structured according to another extractor's format, for instance a CSV file referencing an ICAL file.

Solution

We should add our first transformer to our ETML workflow. The configuration might then look like this:

{
    "job" : {
        "extract" : {...},
        "transform" : [
            {
                "type" : "subextractor",
                "field" : "icalurl", // the field that holds a URL to an ICAL file
                "extract" : {
                    "type" : "ICAL"
                }
            }
        ],
        "map" : {...},
        "load" : {...}
    }
}

The extracted ICAL data will then be placed inside the row hierarchy and can be mapped as well.
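A minimal sketch of what such a subextractor transform step could do, in Python for illustration only (tdt/input is PHP, and all names here, `apply_subextractor`, `parse_ical`, the stubbed `fetch`, are hypothetical): fetch the URL found in the configured field and replace it with the extracted hierarchy.

```python
def parse_ical(text):
    """Tiny illustrative ICAL reader: collect properties per VEVENT."""
    events, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            events.append(current)
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key] = value
    return events

def apply_subextractor(row, field, fetch):
    """Replace the URL in `field` with the extracted ICAL hierarchy."""
    url = row[field]
    row[field] = {"url": url, "events": parse_ical(fetch(url))}
    return row

# Usage with a stubbed fetcher (no network access):
ICS = "BEGIN:VEVENT\nSUMMARY:Opening hours\nEND:VEVENT\n"
row = {"name": "Town hall", "icalurl": "http://example.org/hours.ics"}
out = apply_subextractor(row, "icalurl", lambda url: ICS)
print(out["icalurl"]["events"])
# [{'SUMMARY': 'Opening hours'}]
```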

@ghost ghost assigned pietercolpaert May 14, 2013
@mielvds
Member

mielvds commented May 16, 2013

I wanted to comment on this, but I can only say: I agree :) We have to make sure the extractor gives the right feedback when it is unable to read the file (in case of dirty values in the field).

@mielvds
Member

mielvds commented May 16, 2013

OK, after giving it some thought, there is more. A problem arises in the mapping: we need to link resources created in the parent job to those created in the child job, so this relation needs to be embedded in the mapping file.

Since each subfile is linked with a row in the original file, I suggest an execution strategy like this:
Job -> file -> row1
    -> SubJob -> file -> row1
       ...
Job -> file -> row2
    -> SubJob -> file -> row1
       ...
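This depth-first strategy can be sketched as follows (a Python illustration under assumed names, `run`, `subjobs`, `read`; the real implementation would live in tdt/input's PHP codebase): every parent row is emitted, then its sub-file is processed in full before the next parent row.

```python
def run(job, rows, emit):
    """Depth-first execution: finish each row's subjobs before the next row."""
    for row in rows:
        emit(job["name"], row)
        for sub in job.get("subjobs", []):
            # each parent row links to its own sub-file
            run(sub, sub["read"](row), emit)

order = []
subjob = {"name": "SubJob", "read": lambda row: row["subrows"], "subjobs": []}
job = {"name": "Job", "subjobs": [subjob]}
rows = [{"id": 1, "subrows": [{"id": "1a"}]},
        {"id": 2, "subrows": [{"id": "2a"}]}]
run(job, rows, lambda name, row: order.append((name, row["id"])))
print(order)
# [('Job', 1), ('SubJob', '1a'), ('Job', 2), ('SubJob', '2a')]
```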

Then extend the mapping language to refer to child fields, e.g.:

<#Event> a :Resource ;
    :relationship [
        :property ex:child ;
        :object_from <#child!OpeningHours>
    ] .

Implementation-wise, you will need to pass the URI of the object (what comes out of <#Event>) and the field of the child (<#OpeningHours>) to the mapping of the child job. This info is then added to the parsed mapping file, and the linking triples are generated automatically.
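The linking step described here can be sketched like this (hedged Python illustration; `link_triple`, `map_child_rows`, and `mint` are hypothetical names, and the example URIs are made up): the parent passes the URI it minted plus the linking property into the child mapping, which emits one linking triple per child resource.

```python
def link_triple(parent_uri, prop, child_uri):
    """Serialise one N-Triples-style linking statement."""
    return "<%s> <%s> <%s> ." % (parent_uri, prop, child_uri)

def map_child_rows(parent_uri, prop, child_rows, mint):
    """Generate the parent-to-child links alongside the child mapping."""
    triples = []
    for row in child_rows:
        child_uri = mint(row)  # URI produced by the child job's mapping
        triples.append(link_triple(parent_uri, prop, child_uri))
    return triples

triples = map_child_rows(
    "http://example.org/event/1",
    "http://example.org/ns#child",
    [{"id": "oh1"}],
    lambda row: "http://example.org/openinghours/%s" % row["id"],
)
print(triples[0])
# <http://example.org/event/1> <http://example.org/ns#child> <http://example.org/openinghours/oh1> .
```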

@pietercolpaert
Member

I started creating transformers in the branch https://github.com/tdt/input/tree/transformers

A problem occurs when the referenced file is huge. The conclusion is that we need to stream these chunks as well: for every new row inside the subjob, we need a new Vertere execution.

The only question now is how we will tell the mapping file that this chunk should be mapped according to the subjob of clause X?
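The streaming idea can be illustrated with a small generator-based sketch (Python, purely illustrative; `stream_rows`, `run_subjob`, and `map_row` are hypothetical names): rows are read lazily and each one triggers its own mapping execution, so a huge sub-file never has to sit in memory.

```python
def stream_rows(lines):
    """Yield parsed CSV-like rows one at a time (lazy, constant memory)."""
    for line in lines:
        yield line.strip().split(",")

def run_subjob(lines, map_row):
    """Trigger one mapping execution per streamed row."""
    results = []
    for row in stream_rows(lines):
        results.append(map_row(row))  # one mapping run per chunk/row
    return results

out = run_subjob(iter(["a,1", "b,2"]),
                 lambda row: dict(zip(["key", "val"], row)))
print(out)
# [{'key': 'a', 'val': '1'}, {'key': 'b', 'val': '2'}]
```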

@coreation
Member

Can this be closed? @pietercolpaert

@pietercolpaert
Member

No. But it's no priority atm, as @andimou is putting effort into RML.

@mielvds
Member

mielvds commented Feb 6, 2014

This is within the scope of, and natively solved by, RML. Wontfix?

@pietercolpaert
Member

Maybe someone else out there wants to fix it and pull request the fix?

@coreation
Member

It's not really a fix now, is it? More of a very large enhancement/feature.
