partitioner: avoid inserting duplicates into heap #145082
base: gh/bdhirsh/637/base
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145082
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Cancelled Job, 3 Unrelated Failures as of commit f38f555 with merge base 727ae13.
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: d8f90b1650ad8c3abd3ddc6874bf160023aa5f71
Pull Request resolved: #145082
```python
for benchmark in all:
    benchmark.enable_compile_time_instruction_count().collect_all().append_results(
```
I tested locally and this PR cuts the instruction count of this microbenchmark from 61B -> 4B instructions (it grows much larger if you increase len(tmps) from 16 to 32).
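To see where that blowup comes from, here is a hypothetical, self-contained sketch (not the actual partitioner or benchmark code; all names are made up) that counts heap insertions when parallel fusible chains all feed into a shared chain, with and without the duplicate guard this PR adds:

```python
import heapq

def count_heap_pushes(roots, users, rank, dedup):
    """Walk nodes in increasing `rank` via a min-heap, pushing each
    popped node's users, and return how many pushes happened.
    With dedup=False, a shared user is re-pushed once per predecessor
    pop -- the duplicate behavior this PR removes."""
    heap = [(rank[n], n) for n in roots]
    pushes = len(heap)
    heapq.heapify(heap)
    seen = set(roots)
    while heap:
        _, node = heapq.heappop(heap)
        for u in users.get(node, ()):
            if dedup and u in seen:
                continue  # the fix: skip nodes already inserted
            seen.add(u)
            heapq.heappush(heap, (rank[u], u))
            pushes += 1
    return pushes

def parallel_chains(k, length, shared):
    """k parallel chains of `length` nodes, all feeding a shared chain."""
    users, rank, roots = {}, {}, []
    for i in range(k):
        for j in range(length):
            n = f"c{i}_{j}"
            rank[n] = j * k + i
            if j == 0:
                roots.append(n)
            users[n] = [f"c{i}_{j+1}"] if j + 1 < length else ["s0"]
    for j in range(shared):
        rank[f"s{j}"] = k * length + j
        if j + 1 < shared:
            users[f"s{j}"] = [f"s{j+1}"]
    return roots, users, rank

roots, users, rank = parallel_chains(k=4, length=2, shared=3)
print(count_heap_pushes(roots, users, rank, dedup=False))  # 20: shared nodes re-pushed
print(count_heap_pushes(roots, users, rank, dedup=True))   # 11: one push per node
```

Without the guard, each shared node is pushed once per parallel chain, so the push count scales with (number of chains) x (length of the shared suffix) rather than with the node count, which is consistent with the cost growing sharply as len(tmps) increases.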
```python
f(self.x)

def main():
```
hey @laithsakka - is there anything else I need to do to ensure this new compile time benchmark runs in CI? (do I need to update one of the expected-instruction-count files locally?)
Replied privately, but posting here for others:
(1) When you land this diff the benchmark will run, but it won't fail on regressions.
(2) You have two benchmarks with the same name; you probably want to rename one of them.
(3) To enable failing on regressions, you need to add a line to /data/users/lsakka/fbsource/fbcode/caffe2/benchmarks/dynamo/pr_time_benchmarks/expected_results.csv. You can get the results from the logs: https://github.com/pytorch/pytorch/actions/runs/12832763265/job/35794624737
(4) I usually do (3) as a separate step: first land the diff that enables running the benchmark, then monitor it on https://fburl.com/unidash/vblyya4c for a day or two to make sure it's stable and not noisy, then update the expected results file above.
Fixes #145081
This looks like it was a source of quadratic compile times in the torchtitan CP graphs. There is some code in the partitioner that iteratively adds the users of a node to a heap and pops the earliest user. If you have long parallel chains of fusible ops that all eventually feed into some shared ops, this can result in:
(1) a node getting added to the heap many times
(2) each time we pop that node, adding (duplicates of) each of that node's users to the heap
(3) repeating with each of those users
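The idea behind the fix can be sketched in a few lines (hypothetical names, not the partitioner's actual code): track which nodes have already been inserted, so each node enters the heap and is expanded at most once.

```python
import heapq

def pop_users_in_rank_order(roots, users, rank):
    """Repeatedly pop the lowest-ranked node and push its users,
    guarding pushes with a `seen` set so that a node reachable
    through many fusible chains enters the heap only once."""
    heap = [(rank[n], n) for n in roots]
    heapq.heapify(heap)
    seen = set(roots)
    order = []
    while heap:
        _, node = heapq.heappop(heap)
        order.append(node)
        for u in users.get(node, ()):
            if u not in seen:  # avoid inserting duplicates into the heap
                seen.add(u)
                heapq.heappush(heap, (rank[u], u))
    return order

# Diamond graph: a feeds b and c, which both feed d.
# Without the `seen` guard, d would be pushed (and expanded) twice.
users = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
rank = {"a": 0, "b": 1, "c": 2, "d": 3}
print(pop_users_in_rank_order(["a"], users, rank))  # ['a', 'b', 'c', 'd']
```

With the guard, the traversal does O(E) pushes total instead of letting duplicates multiply down each chain of shared users.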
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames