Feature: Add support for Data Profiling Scan #1392

syou6162 · 2024-11-03T23:08:29Z

resolves #1330

Problem

Dataplex data profiling lets you identify common statistical characteristics of the columns in your BigQuery tables. This information helps you to understand and analyze your data more effectively.

https://cloud.google.com/dataplex/docs/data-profiling-overview?hl=en

If you manage tables with dbt, it is natural to want to configure a Data Profile Scan in a YAML file. Being able to set up data profiling from dbt once the table has been created would make the feature easier for dbt users to adopt.

Solution

This pull request adds support for Data Profile Scan. If you add the following to dbt_project.yml and then run dbt run, the Data Profile Scan settings are configured automatically.

models:
  +on_schema_change: "sync_all_columns"
  my_project:
    +persist_docs:
      relation: true
      columns: true
    sandbox:
      +schema: sandbox
      +materialized: table
      +data_profile_scan:
        location: us-central1
        sampling_percent: 10
        enabled: "{{ target.name == 'prod' }}"

You can also specify Data Profile Scan settings per model in a properties YAML file, rather than in dbt_project.yml.

version: 2
models:
  - name: my_table
    config:
      data_profile_scan:
        location: us-central1
        scan_id: my_profile_scan
        sampling_percent: 10
        row_filter: "TRUE"
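
As a rough illustration of how the keys in the two examples above might be validated before a scan is created, here is a minimal sketch. It is not the code in this pull request: the names `DataProfileScanSetting` and `parse_data_profile_scan` are hypothetical, and treating `location` as required, `sampling_percent` as a value in (0, 100], and `enabled` as defaulting to true are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Keys that appear in the YAML examples above.
KNOWN_KEYS = {"location", "scan_id", "sampling_percent", "row_filter", "enabled"}


@dataclass(frozen=True)
class DataProfileScanSetting:
    """Hypothetical container for the `data_profile_scan` config keys."""

    location: str
    scan_id: Optional[str] = None
    sampling_percent: Optional[float] = None
    row_filter: Optional[str] = None
    enabled: bool = True


def parse_data_profile_scan(raw: Dict[str, Any]) -> DataProfileScanSetting:
    """Validate a raw `data_profile_scan` mapping (assumed rules, see lead-in)."""
    unknown = set(raw) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"unknown data_profile_scan keys: {sorted(unknown)}")
    if "location" not in raw:
        # Assumption: a scan cannot be created without a location.
        raise ValueError("data_profile_scan requires 'location'")

    sampling = raw.get("sampling_percent")
    if sampling is not None:
        sampling = float(sampling)
        if not 0 < sampling <= 100:
            raise ValueError("sampling_percent must be in (0, 100]")

    enabled = raw.get("enabled", True)
    if isinstance(enabled, str):
        # A rendered Jinja expression such as "{{ target.name == 'prod' }}"
        # may arrive as the string "True" or "False".
        enabled = enabled.strip().lower() == "true"

    return DataProfileScanSetting(
        location=raw["location"],
        scan_id=raw.get("scan_id"),
        sampling_percent=sampling,
        row_filter=raw.get("row_filter"),
        enabled=bool(enabled),
    )
```

With the per-model example above, `parse_data_profile_scan` would return a setting with `scan_id="my_profile_scan"` and `sampling_percent=10.0`; a mapping without `location` would raise `ValueError` under the assumptions stated here.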

Checklist

I have read the contributing guide and understand what's expected of me
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

syou6162 · 2024-11-04T00:01:14Z

dbt/adapters/bigquery/impl.py

@@ -999,3 +1022,142 @@ def validate_sql(self, sql: str) -> AdapterResponse:
        :param str sql: The sql to validate
        """
        return self.connections.dry_run(sql)
+
+    # If the label `dataplex-dp-published-*` is not assigned, we cannot view the results of the Data Profile Scan from BigQuery
+    def _update_labels_with_data_profile_scan_labels(


Data Profile Scans are sometimes created outside dbt. When updating or deleting scans programmatically with the CLI or SDK, it is important to be able to tell whether a given scan was created via dbt. The scan_id could serve this purpose, but I added the managed_by label because structured labels are easier to handle.
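
The labeling idea in this comment can be sketched as follows. This is an illustration, not the implementation in impl.py: the function names are hypothetical, the label value `dbt` for `managed_by` is an assumption, and the three `dataplex-dp-published-*` keys are the ones I understand Dataplex to expect on a BigQuery table for publishing results, so check them against the Dataplex documentation before relying on them.

```python
from typing import Dict, Mapping

# Assumed label marking a scan as created through dbt (value is hypothetical).
MANAGED_BY_KEY = "managed_by"
MANAGED_BY_VALUE = "dbt"


def scan_labels(existing: Mapping[str, str]) -> Dict[str, str]:
    """Return scan labels with the managed_by marker added."""
    merged = dict(existing)
    merged[MANAGED_BY_KEY] = MANAGED_BY_VALUE
    return merged


def is_managed_by_dbt(labels: Mapping[str, str]) -> bool:
    """Tell whether a scan's labels say it was created via dbt."""
    return labels.get(MANAGED_BY_KEY) == MANAGED_BY_VALUE


def table_labels_with_scan(
    table_labels: Mapping[str, str],
    project: str,
    location: str,
    scan_id: str,
) -> Dict[str, str]:
    """Add the dataplex-dp-published-* labels to a table's existing labels.

    Existing labels are preserved; the published-scan labels are overwritten
    so the table always points at the current scan.
    """
    merged = dict(table_labels)
    merged.update(
        {
            "dataplex-dp-published-project": project,
            "dataplex-dp-published-location": location,
            "dataplex-dp-published-scan": scan_id,
        }
    )
    return merged
```

A cleanup script could then list scans with the SDK and act only on those for which `is_managed_by_dbt(scan.labels)` is true, leaving scans created by other tools untouched.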

syou6162 · 2024-11-04T14:14:35Z

@colin-rogers-dbt @VersusFacit Could you review this pull request?

I can also open a pull request to update the BigQuery configurations documentation, so please let me know if you would like that 👍. If so, it would help to know whether you need it before this pull request is merged or whether after the merge is fine.

syou6162 added 2 commits September 29, 2024 02:55

Merge remote-tracking branch 'origin/main'

b1b5183

Merge remote-tracking branch 'origin/main'

c3754ed

cla-bot bot added the cla:yes label Nov 3, 2024

install google-cloud-dataplex

8c49667

syou6162 force-pushed the feature/introduce_data_profile_scan branch 3 times, most recently from fd42a67 to 524a19a Compare November 3, 2024 23:25

syou6162 added 2 commits November 4, 2024 08:27

implement create_or_update_data_profile_scan method

6e65508

add create_or_update_data_profile_scan to table materialization

e191796

syou6162 force-pushed the feature/introduce_data_profile_scan branch from 524a19a to e191796 Compare November 3, 2024 23:27

Added changie release

7f80a97

syou6162 force-pushed the feature/introduce_data_profile_scan branch 2 times, most recently from 88f64a4 to 38f1e8c Compare November 3, 2024 23:49

syou6162 force-pushed the feature/introduce_data_profile_scan branch 4 times, most recently from 03f68e3 to b59a087 Compare November 4, 2024 02:35

Add tests for data profile scan

9937855

syou6162 force-pushed the feature/introduce_data_profile_scan branch 2 times, most recently from 7d9e7c5 to 9fa2586 Compare November 4, 2024 03:30

syou6162 added 2 commits November 4, 2024 12:34

Extract DataProfileScan to a separate module

6073a62

fix test

8a99bfe

syou6162 force-pushed the feature/introduce_data_profile_scan branch from 9fa2586 to 8a99bfe Compare November 4, 2024 03:35

syou6162 changed the title ~~Feature/introduce data profile scan~~ Feature: Add support for Data Profiling Scan Nov 4, 2024

syou6162 marked this pull request as ready for review November 4, 2024 03:52

syou6162 requested a review from a team as a code owner November 4, 2024 03:52

syou6162 mentioned this pull request Nov 4, 2024

[Feature] Support Data Profiling in dbt #1330

Open

3 tasks

add case for incremental model

cd6965b

syou6162 added 3 commits November 6, 2024 12:15

Merge remote-tracking branch 'origin/main' into feature/introduce_dat…

6d63864

…a_profile_scan

Merge remote-tracking branch 'origin/main' into feature/introduce_dat…

4bba7df

…a_profile_scan

Merge branch 'main' into feature/introduce_data_profile_scan

ff42a57
