Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

To support task display_name #1278

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

t0momi219
Copy link
Contributor

@t0momi219 t0momi219 commented Oct 23, 2024

Description

When running models that have names containing multibyte characters, runtime errors occur in Airflow environments where statsd is enabled (e.g., MWAA uses this statsd metric for collecting metrics in Cloudwatch).

Related Issue: apache/airflow#18010

To address this, Airflow 2.9 introduced the ability to render tasks using display_name, which allows task names to be rendered separately from their task_id.

Reference: https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/example_dags/example_display_name.html

This PR adds support for display_name, enabling users who use non-ASCII characters as their native language to display task names in their own language, even in environments like MWAA.

Details

The normalize_task_id parameter is added to RenderConfig.
This option accepts a function to generate a task ID from a node. This allows users to generate arbitrary task IDs from models. If a function is passed to this option, Cosmos will use the model name as the display_name for tasks while rendering them.

def normalize_task_id(node):
    """
    This function takes a node and returns a new task_id.
    """
    if node.name == "MULTIBYTE_MODEL_NAME":
        return "MULTIBYTE_MODEL_NAME"

render_config = RenderConfig(
    normalize_task_id=normalize_task_id
)

Related Issue(s)

closes #1277

Breaking Change?

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Oct 23, 2024
Copy link

netlify bot commented Oct 23, 2024

Deploy Preview for sunny-pastelito-5ecb04 failed.

Name Link
🔨 Latest commit 03076bc
🔍 Latest deploy log https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/67233876fddb0000086c8612

@dosubot dosubot bot added the area:rendering Related to rendering, like Jinja, Airflow tasks, etc label Oct 23, 2024
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Oct 26, 2024
@t0momi219
Copy link
Contributor Author

Hi team, ( @tatiana @pankajkoti )
This PR is ready. Could you please review this?

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI @t0momi219, thank you very much for the detailed explanation of the problem and for proposing a fix.

I was surprised with the amount of lines changed to fix the issue. It feels like you tried to do two things at once: fix the problem while refactoring the code. Would it be possible to solve the bug with less changes to the code?

As an example, what if we:

  1. Introduced a function (e.g. normalize_node_name that takes a node_name, normalizes it, handling non-ASCII characters)
  2. Replaced the ocurrences of node.name by normalize_node_name(node.name)

Comment on lines 158 to 201
args = {**args, **{"models": node.resource_name}}

if DbtResourceType(node.resource_type) in DEFAULT_DBT_RESOURCES and node.resource_type in dbt_resource_to_class:
extra_context = {
"dbt_node_config": node.context_dict,
"dbt_dag_task_group_identifier": dbt_dag_task_group_identifier,
}
if node.resource_type == DbtResourceType.MODEL:
task_id = f"{node.name}_run"
if use_task_group is True:
task_id = "run"
elif node.resource_type == DbtResourceType.SOURCE:
if (source_rendering_behavior == SourceRenderingBehavior.NONE) or (
source_rendering_behavior == SourceRenderingBehavior.WITH_TESTS_OR_FRESHNESS
and node.has_freshness is False
and node.has_test is False
):
return None
# TODO: https://github.com/astronomer/astronomer-cosmos
# pragma: no cover
task_id = f"{node.name}_source"
args["select"] = f"source:{node.resource_name}"
args.pop("models")
if use_task_group is True:
task_id = node.resource_type.value
if node.has_freshness is False and source_rendering_behavior == SourceRenderingBehavior.ALL:
# render sources without freshness as empty operators
return TaskMetadata(id=task_id, operator_class="airflow.operators.empty.EmptyOperator")
else:
task_id = f"{node.name}_{node.resource_type.value}"
if use_task_group is True:
task_id = node.resource_type.value

task_metadata = TaskMetadata(
id=task_id,
owner=node.owner,
operator_class=calculate_operator_class(
execution_mode=execution_mode, dbt_class=dbt_resource_to_class[node.resource_type]
),
arguments=args,
extra_context=extra_context,
)
return task_metadata
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a critical part of the code, and I feel that the bug fix does not justify changing all these lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviewing this over the weekend. I apologize for the lack of clarity in my pull request.

The purpose of these changes is as follows:

  1. Adding a parameter to enable the use of display_name
  2. Addressing Ruff errors that occurred as a result of point 1
  3. Updating the documentation (I also noticed that there was some missing information about "source_rendering_behavior" and have added that as well).

cosmos/config.py Outdated
@@ -62,6 +62,8 @@ class RenderConfig:
:param dbt_ls_path: Configures the location of an output of ``dbt ls``. Required when using ``load_method=LoadMode.DBT_LS_FILE``.
:param enable_mock_profile: Allows to enable/disable mocking profile. Enabled by default. Mock profiles are useful for parsing Cosmos DAGs in the CI, but should be disabled to benefit from partial parsing (since Cosmos 1.4).
:param source_rendering_behavior: Determines how source nodes are rendered when using cosmos default source node rendering (ALL, NONE, WITH_TESTS_OR_FRESHNESS). Defaults to "NONE" (since Cosmos 1.6).
:param airflow_vars_to_purge_dbt_ls_cache: Specify Airflow variables that will affect the LoadMode.DBT_LS cache.
:param set_task_id_by_node: A callable that takes a dbt node as input and returns the task ID. This allows users to assign a custom node ID separate from the display name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We received feedback from end-users that Cosmos already has too many configurations. Would it be possible for us to handle non-ASCII following how other libraries handle this, without enforcing users to define a new configuration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for not providing users with an option to specify task_id themselves is that automatically generating task_id for names written in non-ASCII characters is a highly challenging task.

For example, while slugify (as mentioned in the documentation) can work in some cases, it’s not suitable for use in actual production code.

Examples:

  • slugify converts names based on pronunciation, which makes it difficult to keep Task IDs unique due to homophones. In Japanese, for instance, "accounting" and "finance" are represented by the same word.
  • Chinese and Japanese both use similar kanji characters, but these characters have different pronunciations in each language.

In other words, what’s needed is "translation" rather than a mechanical "conversion" like slugify for the sake of automation.

To ensure unique task IDs, some form of mapping would likely be necessary.
For example: { "顧客": "customers", "注文": "orders" }.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the concern about having too many options for the end users. However, I believe it would be challenging to automatically generate task_id on behalf of the users.

May I please propose adding an option for this purpose? If that’s acceptable, I’d like to work on a revision that minimizes code changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @t0momi219, Thanks a lot for explaining these differences. I don't speak Chinese nor Japanese, so your explanation was extremely helpful. I now understand the need

Yes, please. It would be amazing if you could make changes to make this customizable without changing the codebase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, given the explanation I'm inclined to provide & expose it as a configuration. Like that someone who really needs to use it can use.

Also +1 to minimising the code change, easier to review & address the scope of a particular issue. Thanks for working on this @t0momi219

Comment on lines 19 to 23
from slugify import slugify


def set_task_id_by_node(node):
return slugify(node.name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to do this on behalf of the users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in my previous comment.

@t0momi219
Copy link
Contributor Author

Hi @tatiana @pankajkoti ,
Thank you for the review, and I appreciate your acceptance of this proposal. I have revised the implementation to avoid impacting the core logic as much as possible.

PR Changes

  • Renamed the added parameter to normalize_task_id.
  • Modified the code with minimal changes.
  • Updated the documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:rendering Related to rendering, like Jinja, Airflow tasks, etc size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] To support task display_name
3 participants