Recover partitions for Spark #136

Open · wants to merge 3 commits into base: main
6 changes: 6 additions & 0 deletions integration_tests/models/plugins/spark/spark_external.yml
@@ -41,6 +41,12 @@ sources:
         columns: *cols-of-the-people
         tests: *equal-to-the-people
 
+      - name: people_csv_partitioned_inferred_using
+        external:
+          <<: *csv-people-using
+          recover_partitions: true
+        tests: *equal-to-the-people
+
       # ----- TODO: hive format
 
       # - name: people_csv_unpartitioned_hive_format
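For context, the '<<: *csv-people-using' line merges a YAML anchor defined earlier in this fixture file. A minimal sketch of the kind of mapping such an anchor provides, with illustrative values (the real fixture's location and options are not visible in this hunk); presumably it supplies 'using' but no explicit 'partitions' list, since this test exercises inferred partitions and only the new 'recover_partitions' flag should trigger recovery:

      # Hypothetical anchor definition elsewhere in the file (values are illustrative):
      - name: people_csv_unpartitioned_using
        external: &csv-people-using
          location: 's3://some-bucket/csv/people/'   # assumed path, not the real fixture
          using: csv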
3 changes: 2 additions & 1 deletion macros/plugins/spark/helpers/recover_partitions.sql
@@ -1,7 +1,8 @@
 {% macro spark__recover_partitions(source_node) %}
 {# https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-alter-table.html #}
 
-{%- if source_node.external.partitions and source_node.external.using and source_node.external.using|lower != 'delta' -%}
+{%- if (source_node.external.partitions or source_node.external.recover_partitions)
+    and source_node.external.using and source_node.external.using|lower != 'delta' -%}
 {% set ddl %}
 ALTER TABLE {{ source(source_node.source_name, source_node.name) }} RECOVER PARTITIONS
 {% endset %}
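After this change, the macro issues the DDL whenever the source either declares explicit 'partitions' or sets 'recover_partitions: true', provided the table has a 'using' format that is not Delta (Delta manages its own partition metadata). For the 'people_csv_partitioned_inferred_using' source above, the rendered statement would be along these lines, with the schema depending on how the source() call resolves (names here are illustrative):

    ALTER TABLE dbt_test_schema.people_csv_partitioned_inferred_using RECOVER PARTITIONS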
12 changes: 12 additions & 0 deletions sample_sources/spark.yml
@@ -37,3 +37,15 @@ sources:
           - name: contexts
             data_type: string
             description: "Contexts attached to event by Tracker"
+
+      - name: event_inferred_schema
+        description: "Snowplow events stored as partitioned parquet files in HDFS with inferred schema"
+        external:
+          # File path can contain partitions such as: hdfs://.../events/my_partition=2022-03-01/events1.parquet
+          # These partitions are excluded from 'location'.
+          location: 'hdfs://.../events/'
+          using: parquet
+
+          # Setting recover_partitions to true causes partitions to be refreshed,
+          # even though partitions are not explicitly specified.
+          recover_partitions: true
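Since the macro paths ('macros/plugins/spark/...') indicate this is the dbt-external-tables package, the new flag takes effect the next time external sources are staged. The standard invocations from the package docs would be:

    dbt run-operation stage_external_sources

    # or, to drop and recreate the external tables from scratch:
    dbt run-operation stage_external_sources --vars "ext_full_refresh: true"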