Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify schema modification of child tables #1647

Open
akelad opened this issue Jul 29, 2024 · 2 comments · May be fixed by #1855
Open

Simplify schema modification of child tables #1647

akelad opened this issue Jul 29, 2024 · 2 comments · May be fixed by #1855
Assignees
Labels
enhancement New feature or request sprint Marks group of tasks with core team focus at this moment support This issue is monitored by Solution Engineer
Milestone

Comments

@akelad
Copy link
Contributor

akelad commented Jul 29, 2024

Feature description

You should be able to use apply_hints on child tables just like you can on the parent tables. Right now there's no way to modify the schema of child tables without explicitly doing so in a schema.yml

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

When child tables are generated all of the field types are text . This makes them essentially unusable for analysis/aggregation without further transformations. I also can't partition them.

Slack conversation: https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1721944874168369

Proposed solution

No response

Related issues

No response

@akelad akelad added enhancement New feature or request community This issue came from slack community workspace support This issue is monitored by Solution Engineer labels Jul 29, 2024
@sh-rp
Copy link
Collaborator

sh-rp commented Jul 29, 2024

What might work is to add a path parameter of type List[str] to apply_table_hints which would enable support for setting hints on subtables. The DltResource (or rather the DltResourceHints class) would have to be able to hold multiple TTableSchemas which then will be merged into the full schema at the end of the extraction or during normalization.

Example:

@dlt.resource(table_name="items")
def my_resource():
   yield [{
      "id": 1,
      "sub_items": [
         {"id": 11}, 
         {"id": 12}
      ]
      }]
   
# applies primary key hint on main table
my_resource.apply_hints(primary_key="id")

# applies primary key hin on sub_items table
my_resource.apply_hints(path=["sub_items"], primary_key="id")

# ever further nested subtable
my_resource.apply_hints(path=["sub_items", "subsub_items"], primary_key="id")

@burnash burnash added this to the 1.0 release milestone Aug 8, 2024
@rudolfix rudolfix self-assigned this Aug 25, 2024
@rudolfix rudolfix added sprint Marks group of tasks with core team focus at this moment and removed community This issue came from slack community workspace labels Aug 25, 2024
@rudolfix
Copy link
Collaborator

Interface proposal (based on path idea above).

  1. extend the ResourceHints to include nested_hints dictionary that represent nested hints. Note that DltResource derives from hints so what I show below
resource = my_resource()

resource.nested_hints["sub_items"] = make_hints(primary_key="id")
resource.nested_hints["sub_items"]["subsub_items"] = make_hints(primary_key="id")

# works like filesystem.walk, where path is a tuple ie ("sub_items", "subsub_items") that may be converted into path
# the operation is recursive
for path, hint in resource.nested_hints.items():
   ...
  1. extend the @dlt.resource to accept nested_hints ie as dictionary {path: hints} or as ResourceHints instance

Some implementation details:

  1. There's just one resource here - the root one. All the others are hints that share data with the root. Mind that only root table(s) created may have resource name attached (again: there's just one resource name here)
  2. There are a few places where we compute_table_schema. Now we need to do this but also we need to walk over nested hints and generate a table chain that we'll later add to schema
  3. Mind that table schemas are dynamically created. And nested tables must know the name of parent table. You have two options here: or you store the parent hint and compute the parent table name recursively or you pass the previously generated parent table schema to compute_table_schema (where I think the latter is way more efficient)
  4. You still may allow users to specify table_name on the nested hint. If you do so, you'll need to modify the normalizer so it maps paths to those names. IMO this is for another ticket and bigger overhaul of the schema

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request sprint Marks group of tasks with core team focus at this moment support This issue is monitored by Solution Engineer
Projects
Status: Planned
Development

Successfully merging a pull request may close this issue.

4 participants