Recover partitions for Spark
Add 'recover_partitions' option for Spark to run ALTER TABLE RECOVER PARTITIONS even if partitions are not explicitly specified. This makes using inferred partitions possible.
Jarno Rajala committed Mar 1, 2022
1 parent 4ea5203 commit fbc766f
Showing 3 changed files with 19 additions and 1 deletion.
6 changes: 6 additions & 0 deletions integration_tests/models/plugins/spark/spark_external.yml
@@ -41,6 +41,12 @@ sources:
         columns: *cols-of-the-people
         tests: *equal-to-the-people

+      - name: people_csv_partitioned_inferred_using
+        external:
+          <<: *csv-people-using
+          recover_partitions: true
+        tests: *equal-to-the-people
+
       # ----- TODO: hive format

       # - name: people_csv_unpartitioned_hive_format
2 changes: 1 addition & 1 deletion macros/plugins/spark/helpers/recover_partitions.sql
@@ -2,7 +2,7 @@
 {# https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-alter-table.html #}

 {% set ddl %}
-{%- if source_node.external.partitions and source_node.external.using and source_node.external.using|lower != 'delta' -%}
+{%- if (source_node.external.partitions or source_node.external.recover_partitions) and source_node.external.using and source_node.external.using|lower != 'delta' -%}
     ALTER TABLE {{ source(source_node.source_name, source_node.name) }} RECOVER PARTITIONS
 {%- endif -%}
 {% endset %}
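
For readers who want to check the new condition outside a dbt run, here is a minimal Python sketch of the same boolean logic. The function name `should_recover_partitions` and the plain-dict shape of `external` are assumptions made for illustration; they are not part of the macro or of dbt.

```python
def should_recover_partitions(external: dict) -> bool:
    """Hypothetical mirror of the macro's `if` condition (not dbt code).

    Recovery is emitted when partitions are declared OR recover_partitions
    is set, a `using` format is given, and that format is not Delta.
    """
    using = external.get("using")
    wants_recovery = bool(external.get("partitions")) or bool(
        external.get("recover_partitions")
    )
    # `using|lower != 'delta'` in the macro is a case-insensitive comparison.
    return wants_recovery and bool(using) and using.lower() != "delta"


# New behaviour: no explicit partitions, but the flag is set.
print(should_recover_partitions({"using": "parquet", "recover_partitions": True}))
# Delta tables are still skipped, whatever the flag says.
print(should_recover_partitions({"using": "Delta", "recover_partitions": True}))
```

Under this reading, a source that sets neither `partitions` nor `recover_partitions` behaves exactly as it did before the commit.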
12 changes: 12 additions & 0 deletions sample_sources/spark.yml
@@ -30,3 +30,15 @@ sources:
           - name: contexts
             data_type: string
             description: "Contexts attached to event by Tracker"
+
+      - name: event_inferred_schema
+        description: "Snowplow events stored as partitioned parquet files in HDFS with inferred schema"
+        external:
+          # File path can contain partitions such as: hdfs://.../events/my_partition=2022-03-01/events1.parquet
+          # These partitions are excluded from 'location'.
+          location: 'hdfs://.../events/'
+          using: parquet
+
+          # Setting recover_partitions to true causes partitions to be refreshed,
+          # even though partitions are not explicitly specified.
+          recover_partitions: true
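
To illustrate what "inferred" partitions means for the path layout in the comment above, the following Python sketch pulls `key=value` directory segments out of a file path. It is only an approximation for illustration: the helper name `partitions_from_path` is hypothetical, and Spark's own partition discovery does more than this (for example, it also infers column types).

```python
def partitions_from_path(path: str) -> dict:
    """Collect key=value directory segments from a file path (illustrative only)."""
    parts = {}
    # Drop the last segment, which is the file name rather than a directory.
    for segment in path.split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


# The example path from the sample source, with its elided host left as-is.
print(partitions_from_path("hdfs://.../events/my_partition=2022-03-01/events1.parquet"))
```

Because these directories sit below `location`, they are not listed anywhere in the source definition, which is why the sample relies on `recover_partitions: true` rather than an explicit `partitions` block.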
