
Create transformers to support subextractors #35

Open
andimou opened this issue May 14, 2013 · 8 comments
@andimou
Member

andimou commented May 14, 2013

Problem

Sometimes the data tdt/input receives is already linked: a URI pointing to another resource, or a URL to a file with further explanation, can be worth including in our harvester as well. Some of the resources referenced in these files are structured according to another extractor's format, for instance a CSV file referencing an ICAL file.

Solution

We should add our first transformer to our ETML workflow. The configuration might then look like this:

{
    "job" : {
        "extract" : {...},
        "transform" : [
            {
                "type" : "subextractor",
                "field" : "icalurl", // the field that holds a URL to an ICAL file
                "extract" : {
                    "type" : "ICAL"
                }
            }
        ],
        "map" : {...},
        "load" : {...}
    }
}

The extracted ICAL data will then be placed inside the row hierarchy and can be mapped as well.
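A minimal sketch of what such a subextractor transform step could do, in Python for illustration only (tdt/input is PHP, and all names here, `apply_subextractor`, `parse_ical`, the stubbed `fetch`, are hypothetical): fetch the URL found in the configured field and replace it with the extracted hierarchy.

```python
def parse_ical(text):
    """Tiny illustrative ICAL reader: collect properties per VEVENT."""
    events, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            events.append(current)
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key] = value
    return events

def apply_subextractor(row, field, fetch):
    """Replace the URL in `field` with the extracted ICAL hierarchy."""
    url = row[field]
    row[field] = {"url": url, "events": parse_ical(fetch(url))}
    return row

# Usage with a stubbed fetcher (no network access):
ICS = "BEGIN:VEVENT\nSUMMARY:Opening hours\nEND:VEVENT\n"
row = {"name": "Town hall", "icalurl": "http://example.org/hours.ics"}
out = apply_subextractor(row, "icalurl", lambda url: ICS)
print(out["icalurl"]["events"])
# [{'SUMMARY': 'Opening hours'}]
```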

@ghost ghost assigned pietercolpaert May 14, 2013
@mielvds
Member

mielvds commented May 16, 2013

I wanted to comment on this, but I can only say: I agree :) We have to make sure the extractor gives the right feedback when it is unable to read the file (in case of dirty values in the field).

@mielvds
Member

mielvds commented May 16, 2013

OK, after giving it some thought, there is more. A problem arises in the mapping: we need to link resources created in the parent job to those created in the child job, so this relation needs to be embedded in the mapping file.

Since each subfile is linked with a row in the original file, I suggest an execution strategy like this:
Job -> file -> row1
    -> SubJob -> file -> row1
       ...
Job -> file -> row2
    -> SubJob -> file -> row1
       ...
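This depth-first strategy can be sketched as follows (a Python illustration under assumed names, `run`, `subjobs`, `read`; the real implementation would live in tdt/input's PHP codebase): every parent row is emitted, then its sub-file is processed in full before the next parent row.

```python
def run(job, rows, emit):
    """Depth-first execution: finish each row's subjobs before the next row."""
    for row in rows:
        emit(job["name"], row)
        for sub in job.get("subjobs", []):
            # each parent row links to its own sub-file
            run(sub, sub["read"](row), emit)

order = []
subjob = {"name": "SubJob", "read": lambda row: row["subrows"], "subjobs": []}
job = {"name": "Job", "subjobs": [subjob]}
rows = [{"id": 1, "subrows": [{"id": "1a"}]},
        {"id": 2, "subrows": [{"id": "2a"}]}]
run(job, rows, lambda name, row: order.append((name, row["id"])))
print(order)
# [('Job', 1), ('SubJob', '1a'), ('Job', 2), ('SubJob', '2a')]
```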

Then extend the mapping language to refer to child fields, e.g.:

<#Event> a :Resource ;
    :relationship [
        :property ex:child ;
        :object_from <#child!OpeningHours>
    ] .

Implementation-wise, you will need to pass the URI of the object (what comes out of <#Event>) and the field of the child (<#OpeningHours>) to the mapping of the child job. This info is then added to the parsed mapping file, and the linking triples are generated automatically.
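The linking step described here can be sketched like this (hedged Python illustration; `link_triple`, `map_child_rows`, and `mint` are hypothetical names, and the example URIs are made up): the parent passes the URI it minted plus the linking property into the child mapping, which emits one linking triple per child resource.

```python
def link_triple(parent_uri, prop, child_uri):
    """Serialise one N-Triples-style linking statement."""
    return "<%s> <%s> <%s> ." % (parent_uri, prop, child_uri)

def map_child_rows(parent_uri, prop, child_rows, mint):
    """Generate the parent-to-child links alongside the child mapping."""
    triples = []
    for row in child_rows:
        child_uri = mint(row)  # URI produced by the child job's mapping
        triples.append(link_triple(parent_uri, prop, child_uri))
    return triples

triples = map_child_rows(
    "http://example.org/event/1",
    "http://example.org/ns#child",
    [{"id": "oh1"}],
    lambda row: "http://example.org/openinghours/%s" % row["id"],
)
print(triples[0])
# <http://example.org/event/1> <http://example.org/ns#child> <http://example.org/openinghours/oh1> .
```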

@pietercolpaert
Member

I started creating transformers in the branch https://github.com/tdt/input/tree/transformers

A problem occurs when the referenced file is huge. The conclusion is that we need to stream these chunks as well: for every new row inside the subjob, we need a new Vertere execution.

The only question now is how we will tell the mapping file that this chunk should be mapped according to the subjob of clause X?
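The streaming idea can be illustrated with a small generator-based sketch (Python, purely illustrative; `stream_rows`, `run_subjob`, and `map_row` are hypothetical names): rows are read lazily and each one triggers its own mapping execution, so a huge sub-file never has to sit in memory.

```python
def stream_rows(lines):
    """Yield parsed CSV-like rows one at a time (lazy, constant memory)."""
    for line in lines:
        yield line.strip().split(",")

def run_subjob(lines, map_row):
    """Trigger one mapping execution per streamed row."""
    results = []
    for row in stream_rows(lines):
        results.append(map_row(row))  # one mapping run per chunk/row
    return results

out = run_subjob(iter(["a,1", "b,2"]),
                 lambda row: dict(zip(["key", "val"], row)))
print(out)
# [{'key': 'a', 'val': '1'}, {'key': 'b', 'val': '2'}]
```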

@coreation
Member

Can this be closed? @pietercolpaert

@pietercolpaert
Member

No. But it's no priority atm, as @andimou is putting effort into RML.

@mielvds
Member

mielvds commented Feb 6, 2014

This is within the scope of, and natively solved by, RML. Wontfix?

@pietercolpaert
Member

Maybe someone else out there wants to fix it and pull request the fix?

@coreation
Member

It's not really a fix now, is it? More of a very large enhancement/feature.
