docs: Customize materialized fields (#1119)
* new guide first draft

* refine initial draft

* tweaks per edits
oliviamiannone authored Jul 24, 2023
1 parent a33500a commit 05df14c
Showing 1 changed file with 140 additions and 0 deletions.
140 changes: 140 additions & 0 deletions site/docs/guides/customize-materialization-fields.md
@@ -0,0 +1,140 @@
# Customize materialized fields

When you first materialize a collection to an endpoint like a database or data warehouse,
the resulting table columns might not be formatted how you want.
You might notice missing columns, extra columns, or columns with names you don't like.
This happens when the collection's JSON schema doesn't map to a table schema appropriate for your use case.

You can control the shape and appearance of materialized tables using a two-step process.

First, you modify the source collection **schema**.
You can change column names by adding **[projections](../concepts/advanced/projections.md)**:
JSON pointers that turn locations in a document's JSON structure into custom named fields.

Then, you add the `fields` stanza to the materialization specification, telling Flow which fields to materialize.

The following sections break down the process in more detail.

:::info Hint
If you just need to add a field that isn't included by default and it's already present in the schema
with a name you like, skip ahead to [include desired fields in your materialization](#include-desired-fields-in-your-materialization).
:::

## Capture desired fields and generate projections

Any field you eventually want to materialize must be included in the collection's schema.
It's fine if the field is nested in the JSON structure; you'll flatten the structure with **projections**.
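
As a minimal sketch, a collection specification with one nested field and a projection that flattens it might look like the following. The collection name `acmeCo/orders`, the schema properties, and the field name `customer_email` are hypothetical and chosen only for illustration; your own specification will differ.

```yaml
collections:
  # Hypothetical collection name, for illustration only.
  acmeCo/orders:
    schema:
      type: object
      required: [id]
      properties:
        id: { type: integer }
        customer:
          type: object
          properties:
            # Nested location; its JSON pointer is /customer/email.
            email: { type: string }
    key: [/id]
    projections:
      # Custom field name on the left, JSON pointer on the right.
      customer_email: /customer/email
```

With a projection like this, the nested value can be materialized under the name `customer_email` rather than a slash-separated name derived from its pointer.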

:::caution
In this workflow, you'll edit a collection. This change can impact other downstream materializations and derivations.
Use caution and be mindful of any edit's consequences before publishing.
:::

### Captured collections

If the collection you're using was captured directly, follow these steps.

1. Go to the [Captures](https://dashboard.estuary.dev/captures) page of the Flow web app
and locate the capture that produced the collection.

2. Click the **Options** button and choose **Edit Specification**.

3. Under **Output Collections**, choose the binding that corresponds to the collection.
Then, click the **Collection** tab.

4. In the list of fields, look for the fields you want to materialize.
If they're present and correctly named, you can skip to
[including them in the materialization](#include-desired-fields-in-your-materialization).

:::info Hint
Compare the field name and pointer.
For nested pointers, you'll probably want to change the field name to omit slashes.
:::

5. If your desired fields aren't present or need to be renamed, edit the collection schema manually:

   1. Click **Edit**.

   2. Add missing fields to the schema in the correct location based on the source data structure.

   3. Click **Close**.

6. Generate projections for new or incorrectly named fields.

   1. If available, click the **Schema Inference** button. The Schema Inference window appears. Flow cleans up your schema and adds projections for new fields.

   2. Manually change the names of projected fields. These names will be used by the materialization and shown in the endpoint system as column names or the equivalent.

   3. Click **Next**.

:::info
Schema Inference isn't available for all capture types.
You can also add projections manually with `flowctl`.
Refer to the guide to [editing with flowctl](./flowctl/edit-specification-locally.md) and
[how to format projections](../concepts/collections.md#projections).
:::

7. Repeat steps 3 through 6 with other collections, if necessary.

8. Click **Save and Publish**.

### Derived collections

If the collection you're using came from a derivation, follow these steps.

1. [Pull the derived collection's specification locally](./flowctl/edit-specification-locally.md#pull-specifications-locally) using `flowctl`.

```
flowctl catalog pull-specs --name <yourOrg/full/collectionName>
```

2. Review the collection's schema to see if the fields of interest are included. If they're present, you can skip to
[including them in the materialization](#include-desired-fields-in-your-materialization).

3. If your desired fields aren't present or are incorrectly named, add any missing fields to the schema in the correct location based on the source data structure.

4. Use schema inference to generate projections for the fields.

```
flowctl preview --infer-schema --source <full\path\to\flow.yaml> --collection <yourOrg/full/collectionName>
```

5. Review the updated schema. Manually change the names of projected fields. These names will be used by the materialization and shown in the endpoint system as column names or the equivalent.

6. [Re-publish the collection specification](./flowctl/edit-specification-locally.md#edit-source-files-and-re-publish-specifications).

## Include desired fields in your materialization

Now that all your fields are present in the collection schema as projections,
you can choose which ones to include in the materialization.

Every included field will be mapped to a table column or equivalent in the endpoint system.

1. If you haven't created the materialization, [begin the process](./create-dataflow.md#create-a-materialization). Pause once you've selected the collections to materialize.

If your materialization already exists, navigate to the [edit materialization](./edit-data-flows.md#edit-a-materialization) page.

2. In the Collection Selector, choose the collection whose output fields you want to change. Click its **Collection** tab.

3. Review the listed fields.

In most cases, Flow automatically detects all fields to materialize, projected or otherwise. However, a projected field may still be missing, or you may want to exclude other fields.

4. If you want to make changes, click **Edit**.

5. Use the editor to add the `fields` stanza to the collection's binding specification.

[Learn more about configuring `fields` and view a sample specification](../concepts/materialization.md#projected-fields).

6. Choose whether to start with Flow's recommended fields. Under `fields`, set `recommended` to `true` or `false`. If you choose `true`, you can exclude fields later.

7. Use `include` to add missing projections, or `exclude` to remove fields.

8. Click **Close**.

9. Repeat steps 2 through 8 with other collections, if necessary.

10. Click **Save and Publish**.

The named, included fields will be reflected in the endpoint system.
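
For reference, a binding with a `fields` stanza might look roughly like the sketch below. The materialization name, collection name, table name, and field names (`customer_email`, `internal_notes`) are hypothetical, and the endpoint configuration is omitted because this workflow doesn't change it.

```yaml
materializations:
  # Hypothetical materialization name, for illustration only.
  acmeCo/orders-to-warehouse:
    # Endpoint configuration omitted; it is unchanged by this workflow.
    bindings:
      - source: acmeCo/orders          # hypothetical collection
        resource: { table: orders }    # resource shape depends on the connector
        fields:
          # Start from the fields Flow selects by default.
          recommended: true
          # Add projections that weren't selected automatically.
          include:
            customer_email: {}
          # Leave out fields you don't want as columns.
          exclude:
            - internal_notes
```

Setting `recommended` to `false` instead would mean that only the fields listed under `include` (plus any the endpoint requires, such as collection keys) are candidates for materialization, which is an assumption worth verifying against your connector's behavior before publishing.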
