
partitioner: avoid inserting duplicates into heap #145082

Open · wants to merge 1 commit into base: gh/bdhirsh/637/base

Conversation

@bdhirsh (Contributor) commented Jan 17, 2025

Fixes #145081

This looks like it was a source of quadratic compile times in the torchtitan CP graphs. There is code in the partitioner that iteratively adds the users of a node to a heap and pops the earliest user. If you have long parallel chains of fusible ops that all eventually feed into some shared ops, this can result in:
(1) a node getting added to the heap many times
(2) each time we pop that node, adding (duplicates of) each of that node's users to the heap
(3) the same thing repeating for each of those users
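
One way to avoid the duplicates is to track which nodes have already been inserted. A minimal sketch of the pattern (names are illustrative, not the actual partitioner code; it assumes a `node_order` map giving each node a unique topological index, so heap entries never tie and nodes are never compared directly):

```python
import heapq

def iterate_users_in_order(start_nodes, node_order):
    # node_order: dict mapping each node to a unique integer index.
    heap = [(node_order[n], n) for n in start_nodes]
    heapq.heapify(heap)
    seen = set(start_nodes)  # every node ever pushed onto the heap
    while heap:
        _, node = heapq.heappop(heap)
        yield node
        for user in node.users:
            # Without this membership check, a shared user is pushed
            # once per incoming chain, and each pop then re-expands its
            # own users -- the quadratic blowup described above.
            if user not in seen:
                seen.add(user)
                heapq.heappush(heap, (node_order[user], user))
```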

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

@pytorch-bot (bot) commented Jan 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145082

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 3 Unrelated Failures

As of commit f38f555 with merge base 727ae13:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

bdhirsh added a commit that referenced this pull request Jan 17, 2025
ghstack-source-id: d8f90b1650ad8c3abd3ddc6874bf160023aa5f71
Pull Request resolved: #145082

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@bdhirsh changed the title from "partitioner: avoid inserting duplicate node in heap" to "partitioner: avoid inserting duplicates into heap" on Jan 17, 2025
@bdhirsh (Contributor, Author) commented on the benchmark collection loop (diff excerpt):

```python
for benchmark in all:
    benchmark.enable_compile_time_instruction_count().collect_all().append_results(
```
I tested locally, and this PR cuts the instruction count of this microbenchmark from 61B to 4B instructions (the unpatched count grows much larger if you increase len(tmps) from 16 to 32)
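
For intuition, here is a hypothetical sketch of the kind of graph such a microbenchmark stresses (not the actual benchmark added in this PR; the `tmps` name echoes the comment above, everything else is illustrative): many parallel chains of fusible pointwise ops that all feed into a shared op.

```python
import torch

def f(x):
    # Parallel chains of fusible pointwise ops...
    tmps = [(x + i).sin().cos() for i in range(16)]  # try 32 to see the growth
    # ...that all eventually feed into a shared op.
    return sum(tmps)

# requires_grad so the joint graph goes through the partitioner
x = torch.randn(64, requires_grad=True)
torch.compile(f)(x)
```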

@bdhirsh (Contributor, Author) commented on the benchmark entry point (diff excerpt):

```python
f(self.x)


def main():
```
hey @laithsakka - is there anything else I need to do to ensure this new compile-time benchmark runs in CI? (Do I need to update one of the expected-instruction-count files locally?)

@laithsakka (Contributor) replied:

Replied privately, but posting here for others:
(1) When you land this diff the benchmark will run, but it won't fail on regressions.
(2) You have two benchmarks with the same name; you probably want to rename one of them.
(3) To enable failing on regressions, you need to add a line to /data/users/lsakka/fbsource/fbcode/caffe2/benchmarks/dynamo/pr_time_benchmarks/expected_results.csv; you can get the results from the logs: https://github.com/pytorch/pytorch/actions/runs/12832763265/job/35794624737
(4) I usually do (3) as a separate step: first land the diff that enables running the benchmark, then monitor it on https://fburl.com/unidash/vblyya4c for a day or two to make sure it is stable and not noisy, then update the expected-results file above.
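
For reference, a sketch of what such a row might look like (the benchmark name and numbers here are placeholders, and the column layout is my assumption -- mirror the existing rows in expected_results.csv and take the measured count from the CI logs linked above):

```
my_partitioner_benchmark,compile_time_instruction_count,4000000000,0.015
```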

@tianyu-l linked an issue Jan 17, 2025 that may be closed by this pull request
@bdhirsh requested a review from Chillee January 21, 2025 19:15
@albanD removed their request for review January 22, 2025 21:12
Projects: None yet

Development: successfully merging this pull request may close these issues:
- CP 32 + torch.compile hang (#145081)

3 participants