Open lineage observer #5709

Draft: wants to merge 2 commits into base: master



jorgee (Contributor) commented Jan 24, 2025

This PR adds an observer for OpenLineage.

For each workflow task and publish notification, it generates an OpenLineage RunEvent and emits it to the OpenLineage server. In this case I used Marquez as the server, which stores and visualizes the data lineage.
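As a rough illustration of the flow described above, the Python sketch below builds a RunEvent payload for one finished task and prepares the HTTP request that would deliver it to Marquez. It is not the code in this PR: the job namespace, the `file` dataset namespace, the producer URI, the schema version in `schemaURL`, and the helper names are all assumptions made for the example. The endpoint is Marquez's `/api/v1/lineage` on its default port 5000.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request

# Assumed values for illustration only; not taken from the PR.
PRODUCER = "https://github.com/nextflow-io/nextflow"
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def build_run_event(event_type, job_name, run_id, inputs, outputs, namespace="nextflow"):
    """Build an OpenLineage RunEvent as a plain dict (hypothetical helper)."""
    def as_datasets(paths):
        # Each file path becomes one dataset; the "file" namespace is an assumption.
        return [{"namespace": "file", "name": p} for p in paths]

    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": as_datasets(inputs),
        "outputs": as_datasets(outputs),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


def build_request(event, base_url="http://localhost:5000"):
    """Prepare (but do not send) the POST request carrying the event."""
    return request.Request(
        f"{base_url}/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


event = build_run_event(
    "COMPLETE",
    "align.task-1",
    str(uuid.uuid4()),
    inputs=["/data/sample1.fastq"],
    outputs=["/results/sample1.bam"],
)
req = build_request(event)
# request.urlopen(req) would emit the event; it is left out so the
# sketch runs without a Marquez server.
```

Keeping the payload construction separate from the transport makes it easy to inspect the generated events before pointing them at a real server.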

Limitations

  • Marquez only shows the last execution of a job. So, when a workflow runs a process over several datasets and generates a set of datasets, only the latest execution is visualized and only the last dataset used is linked to the job. The other generated datasets appear as orphans in the visualization: they are connected to runs, but they are not visualized correctly. The Marquez maintainers are aware of this (#2543) and call it static lineage. To work around it, I had to create a job per task. The same happened when publishing files, where I had to add a counter to be able to visualize all the file publications. Although this makes a single execution visible, if we run the workflow again and generate new datasets, Marquez does not correctly visualize the lineage of the previous datasets, for the same reason.
  • I haven’t seen a way to view the whole graph in Marquez, but I think this is not the intention of the framework.
  • The Datasets and Jobs sections in Marquez are not showing anything, and there are no errors in the logs.
  • We can detect dependencies when there is a direct dependency through files, but we lose the lineage when data is processed in operators (such as reading file lines). Maybe this could be solved by adding an observer for operator runs and managing the inputs and outputs in the same way as in Full execution provenance resolution #5639.
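The workaround in the first limitation (one job per task, plus a counter for file publications) amounts to giving every event a distinct job name. A minimal sketch of that idea follows; the `JobNamer` class and the `workflow.process.id` naming scheme are hypothetical and only illustrate the approach, not the names the PR actually produces.

```python
from itertools import count


class JobNamer:
    """Hands out one distinct job name per task and per publication (hypothetical)."""

    def __init__(self, workflow):
        self.workflow = workflow
        # Running counter so that every file publication gets its own job.
        self._publish_counter = count(1)

    def task_job(self, process, task_id):
        # One job per task instead of one job per process, so that
        # Marquez shows every task's inputs and outputs.
        return f"{self.workflow}.{process}.{task_id}"

    def publish_job(self):
        return f"{self.workflow}.publish.{next(self._publish_counter)}"


namer = JobNamer("rnaseq")
names = [namer.task_job("align", i) for i in (1, 2)]
names += [namer.publish_job(), namer.publish_job()]
print(names)
```

Because the names are only unique within one execution, a second run of the same workflow would reuse them, which matches the remaining limitation described above.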


netlify bot commented Jan 24, 2025

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit 76ff6d8
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67939545001ca80008556d77
😎 Deploy Preview https://deploy-preview-5709--nextflow-docs-staging.netlify.app
