To support task display_name #1278

t0momi219 · 2024-10-23T16:43:56Z

Description

When running models whose names contain multibyte characters, runtime errors occur in Airflow environments where statsd is enabled (e.g., MWAA uses statsd to collect metrics in CloudWatch).

Related Issue: apache/airflow#18010

To address this, Airflow 2.9 introduced the ability to render tasks using display_name, which allows task names to be rendered separately from their task_id.

Reference: https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/example_dags/example_display_name.html

This PR adds support for display_name, so that users whose native language uses non-ASCII characters can display task names in their own language, even in environments like MWAA.

Details

The normalize_task_id parameter is added to RenderConfig.
This option accepts a function to generate a task ID from a node. This allows users to generate arbitrary task IDs from models. If a function is passed to this option, Cosmos will use the model name as the display_name for tasks while rendering them.

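from cosmos import RenderConfig


def normalize_task_id(node):
    """
    Take a dbt node and return the task_id to use for it.
    """
    if node.name == "ＭＵＬＴＩＢＹＴＥ＿ＭＯＤＥＬ＿ＮＡＭＥ":
        return "MULTIBYTE_MODEL_NAME"
    # Fall back to the node name so every other node still gets a task_id.
    return node.name


render_config = RenderConfig(
    normalize_task_id=normalize_task_id
)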
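
The relationship between the callable, the task ID, and the display name can be pictured with a small stand-alone sketch. This is not Cosmos's actual implementation: `resolve_task_id` is a hypothetical helper written for illustration, and `SimpleNamespace` stands in for a dbt node. It assumes only what is described above, namely that the callable decides the task ID and the model name is kept as the display name.

```python
from types import SimpleNamespace
from typing import Callable, Optional, Tuple


def resolve_task_id(
    node,
    normalize_task_id: Optional[Callable[[object], str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (task_id, display_name) for a node (hypothetical helper)."""
    if normalize_task_id is None:
        # No callable supplied: build the task_id from the node name, no display name.
        return f"{node.name}_run", None
    # Callable supplied: it decides the task_id; the model name is kept for display.
    return normalize_task_id(node), node.name


# SimpleNamespace is a stand-in for a dbt node; only `name` is used here.
node = SimpleNamespace(name="ＭＵＬＴＩＢＹＴＥ＿ＭＯＤＥＬ＿ＮＡＭＥ")
task_id, display_name = resolve_task_id(node, lambda n: "MULTIBYTE_MODEL_NAME")
```

In this sketch the task ID stays ASCII-only while the original multibyte name remains available for the UI.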

Related Issue(s)

closes #1277

Breaking Change?

Checklist

I have made corresponding changes to the documentation (if required)
I have added tests that prove my fix is effective or that my feature works

netlify · 2024-10-23T16:44:12Z

❌ Deploy Preview for sunny-pastelito-5ecb04 failed.

Name	Link
🔨 Latest commit	`03076bc`
🔍 Latest deploy log	https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/67233876fddb0000086c8612

t0momi219 · 2024-10-26T14:52:01Z

Hi team, ( @tatiana @pankajkoti )
This PR is ready. Could you please review this?

tatiana

Hi @t0momi219, thank you very much for the detailed explanation of the problem and for proposing a fix.

I was surprised by the number of lines changed to fix the issue. It feels like you tried to do two things at once: fix the problem while refactoring the code. Would it be possible to solve the bug with fewer changes to the code?

As an example, what if we:

Introduced a function (e.g. normalize_node_name) that takes a node_name and normalizes it, handling non-ASCII characters
Replaced the occurrences of node.name with normalize_node_name(node.name)
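
One possible reading of this suggestion is sketched below, using only the standard library. `normalize_node_name` is the hypothetical name from the comment, and the NFKD-plus-ASCII strategy is an assumption made for illustration, not something proposed in the PR.

```python
import unicodedata


def normalize_node_name(node_name: str) -> str:
    """Mechanically reduce a name to ASCII (hypothetical helper)."""
    # NFKD maps compatibility characters (e.g. full-width letters) to their ASCII equivalents.
    decomposed = unicodedata.normalize("NFKD", node_name)
    # Characters that still have no ASCII form are silently dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


full_width = normalize_node_name("ＭＵＬＴＩＢＹＴＥ＿ＭＯＤＥＬ＿ＮＡＭＥ")
customers = normalize_node_name("顧客")
orders = normalize_node_name("注文")
```

With this stand-in, the full-width name becomes `MULTIBYTE_MODEL_NAME`, but 顧客 and 注文 both collapse to an empty string, so a purely mechanical approach can yield empty or colliding task IDs.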

tatiana · 2024-10-26T20:09:41Z

cosmos/airflow/graph.py

-    args = {**args, **{"models": node.resource_name}}
-
-    if DbtResourceType(node.resource_type) in DEFAULT_DBT_RESOURCES and node.resource_type in dbt_resource_to_class:
-        extra_context = {
-            "dbt_node_config": node.context_dict,
-            "dbt_dag_task_group_identifier": dbt_dag_task_group_identifier,
-        }
-        if node.resource_type == DbtResourceType.MODEL:
-            task_id = f"{node.name}_run"
-            if use_task_group is True:
-                task_id = "run"
-        elif node.resource_type == DbtResourceType.SOURCE:
-            if (source_rendering_behavior == SourceRenderingBehavior.NONE) or (
-                source_rendering_behavior == SourceRenderingBehavior.WITH_TESTS_OR_FRESHNESS
-                and node.has_freshness is False
-                and node.has_test is False
-            ):
-                return None
-            # TODO: https://github.com/astronomer/astronomer-cosmos
-            # pragma: no cover
-            task_id = f"{node.name}_source"
-            args["select"] = f"source:{node.resource_name}"
-            args.pop("models")
-            if use_task_group is True:
-                task_id = node.resource_type.value
-            if node.has_freshness is False and source_rendering_behavior == SourceRenderingBehavior.ALL:
-                # render sources without freshness as empty operators
-                return TaskMetadata(id=task_id, operator_class="airflow.operators.empty.EmptyOperator")
-        else:
-            task_id = f"{node.name}_{node.resource_type.value}"
-            if use_task_group is True:
-                task_id = node.resource_type.value
-
-        task_metadata = TaskMetadata(
-            id=task_id,
-            owner=node.owner,
-            operator_class=calculate_operator_class(
-                execution_mode=execution_mode, dbt_class=dbt_resource_to_class[node.resource_type]
-            ),
-            arguments=args,
-            extra_context=extra_context,
-        )
-        return task_metadata
-    else:


This is a critical part of the code, and I feel that the bug fix does not justify changing all these lines.

t0momi219 · 2024-10-27

Thank you for reviewing this over the weekend. I apologize for the lack of clarity in my pull request.

The purpose of these changes is as follows:

1. Adding a parameter to enable the use of display_name

2. Addressing Ruff errors that occurred as a result of point 1

3. Updating the documentation (I also noticed that some information about "source_rendering_behavior" was missing and have added that as well).

tatiana · 2024-10-26T20:11:18Z

cosmos/config.py

@@ -62,6 +62,8 @@ class RenderConfig:
    :param dbt_ls_path: Configures the location of an output of ``dbt ls``. Required when using ``load_method=LoadMode.DBT_LS_FILE``.
    :param enable_mock_profile: Allows to enable/disable mocking profile. Enabled by default. Mock profiles are useful for parsing Cosmos DAGs in the CI, but should be disabled to benefit from partial parsing (since Cosmos 1.4).
    :param source_rendering_behavior: Determines how source nodes are rendered when using cosmos default source node rendering (ALL, NONE, WITH_TESTS_OR_FRESHNESS). Defaults to "NONE" (since Cosmos 1.6).
+    :param airflow_vars_to_purge_dbt_ls_cache: Specify Airflow variables that will affect the LoadMode.DBT_LS cache.
+    :param set_task_id_by_node: A callable that takes a dbt node as input and returns the task ID. This allows users to assign a custom node ID separate from the display name.


We received feedback from end-users that Cosmos already has too many configurations. Would it be possible for us to handle non-ASCII names the way other libraries do, without forcing users to define a new configuration?

t0momi219 · 2024-10-27

The reason for letting users specify the task_id themselves, rather than generating it for them, is that automatically generating a task_id for names written in non-ASCII characters is a highly challenging task.

For example, while slugify (as mentioned in the documentation) can work in some cases, it’s not suitable for use in actual production code.

Examples:

slugify converts names based on pronunciation, which makes it difficult to keep Task IDs unique due to homophones. In Japanese, for instance, "accounting" and "finance" are represented by the same word.

Chinese and Japanese both use similar kanji characters, but these characters have different pronunciations in each language.

In other words, what’s needed is "translation" rather than a mechanical "conversion" like slugify for the sake of automation.

To ensure unique task IDs, some form of mapping would likely be necessary.
For example: { "顧客": "customers", "注文": "orders" }.
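
A minimal sketch of such a mapping-based approach follows. The table `TASK_ID_BY_MODEL_NAME` and the helper `check_unique` are hypothetical names introduced for illustration, and `SimpleNamespace` stands in for a dbt node. The resulting `normalize_task_id` has the shape of the callable passed to `RenderConfig` in the PR description.

```python
from types import SimpleNamespace

# Hypothetical, hand-maintained translation table from model names to task IDs.
TASK_ID_BY_MODEL_NAME = {"顧客": "customers", "注文": "orders"}


def check_unique(mapping: dict) -> None:
    """Raise ValueError if two model names map to the same task ID."""
    seen = {}
    for name, task_id in mapping.items():
        if task_id in seen:
            raise ValueError(f"{name!r} and {seen[task_id]!r} both map to {task_id!r}")
        seen[task_id] = name


def normalize_task_id(node) -> str:
    """Look up the task ID for a node, falling back to the node name itself."""
    return TASK_ID_BY_MODEL_NAME.get(node.name, node.name)


# Validate the table once, then resolve a node (SimpleNamespace is a stand-in).
check_unique(TASK_ID_BY_MODEL_NAME)
customers_id = normalize_task_id(SimpleNamespace(name="顧客"))
```

Because the table is written by someone who knows the language, uniqueness depends on an explicit human choice rather than on pronunciation, and the check catches accidental duplicates early.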

I understand the concern about having too many options for the end users. However, I believe it would be challenging to automatically generate task_id on behalf of the users.

May I please propose adding an option for this purpose? If that’s acceptable, I’d like to work on a revision that minimizes code changes.

tatiana · 2024-10-29

Hi @t0momi219, thanks a lot for explaining these differences. I speak neither Chinese nor Japanese, so your explanation was extremely helpful. I now understand the need.

Yes, please. It would be amazing if you could make this customizable without changing the codebase.

pankajkoti · 2024-10-29

Yes, given the explanation, I'm inclined to provide and expose it as a configuration. That way, anyone who really needs it can use it.

Also +1 to minimising the code change; it is easier to review and keeps the scope to this particular issue. Thanks for working on this @t0momi219

tatiana · 2024-10-26T20:12:07Z

docs/configuration/task-display-name.rst

+    from slugify import slugify
+
+
+    def set_task_id_by_node(node):
+        return slugify(node.name)


Why not do this on behalf of the users?

t0momi219 · 2024-10-27

As I mentioned in my previous comment.

t0momi219 · 2024-10-31T08:19:21Z

Hi @tatiana @pankajkoti ,
Thank you for the review, and I appreciate your acceptance of this proposal. I have revised the implementation to avoid impacting the core logic as much as possible.

PR Changes

Renamed the added parameter to normalize_task_id.
Modified the code with minimal changes.
Updated the documentation.

To support task display name.

8dc46be

dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Oct 23, 2024

t0momi219 had a problem deploying to external October 23, 2024 16:44 — with GitHub Actions Error

🎨 [pre-commit.ci] Auto format from pre-commit.com hooks

889861d

dosubot bot added the area:rendering Related to rendering, like Jinja, Airflow tasks, etc label Oct 23, 2024

pre-commit-ci bot had a problem deploying to external October 23, 2024 16:44 Error

add tests

28dc04d

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Oct 26, 2024

t0momi219 had a problem deploying to external October 26, 2024 09:48 — with GitHub Actions Error

🎨 [pre-commit.ci] Auto format from pre-commit.com hooks

c3a1ee8

pre-commit-ci bot had a problem deploying to external October 26, 2024 09:48 Error

add docs.

9170c3a

t0momi219 had a problem deploying to external October 26, 2024 10:35 — with GitHub Actions Error

fix mypy error

49363b0

t0momi219 had a problem deploying to external October 26, 2024 11:31 — with GitHub Actions Error

t0momi219 and others added 2 commits October 26, 2024 23:46

fix document

1ed6e10

🎨 [pre-commit.ci] Auto format from pre-commit.com hooks

946e183

pre-commit-ci bot had a problem deploying to external October 26, 2024 14:47 Error

fix document

32586dc

t0momi219 temporarily deployed to external October 26, 2024 14:50 — with GitHub Actions Inactive

tatiana reviewed Oct 26, 2024

minimize changes

218bbc5

t0momi219 had a problem deploying to external October 31, 2024 05:09 — with GitHub Actions Error

🎨 [pre-commit.ci] Auto format from pre-commit.com hooks

fa6b2d9

pre-commit-ci bot temporarily deployed to external October 31, 2024 05:09 Inactive

fix document

03076bc

t0momi219 temporarily deployed to external October 31, 2024 07:58 — with GitHub Actions Inactive
