
fix(cc): fix message passing when nghost is 0 and send list is empty #4237

Merged
9 commits, merged into deepmodeling:devel on Oct 23, 2024

Conversation

CaRoLZhangxy
Collaborator

@CaRoLZhangxy commented Oct 22, 2024

Fixes the errors reported in the following PRs:
#4220
#4209
#4144

Summary by CodeRabbit

  • New Features

    • Enhanced message passing logic in the computation process for improved efficiency.
    • Added new test functions to evaluate DeepMD model performance under various conditions.
  • Bug Fixes

    • Improved error handling and assertions in test cases to ensure robustness.
  • Refactor

    • Streamlined tensor operations in the communication process to enhance clarity and reduce unnecessary computations.
    • Removed outdated test cases related to neighbor list handling in the DeepPot class.

Contributor

coderabbitai bot commented Oct 22, 2024

📝 Walkthrough

The changes in this pull request primarily focus on the DeepPotPT class in the DeepPotPT.cc file, where the compute method has been simplified by removing checks for ghost atoms and modifying how communication data is prepared. The test suite for the LAMMPS DeepMD integration has been enhanced with new tests and improved error handling. Additionally, the Border class in comm.cc has been updated to optimize tensor operations related to MPI communication by conditionally initializing tensors based on the number of elements to send or receive.

Changes

  • source/api_cc/src/DeepPotPT.cc — Simplified the compute method by removing checks for nghost; updated the method signature by removing the nghost and lmp_list parameters.
  • source/lmp/tests/test_lammps_dpa_pt_nopbc.py — Added new test functions for DeepMD model performance; modified existing tests for parameterization and error handling; enhanced assertions and data handling.
  • source/op/pt/comm.cc — Updated the forward_t and backward_t methods in the Border class to conditionally initialize tensors based on nsend and nswap; improved clarity of tensor operations.
  • source/api_cc/tests/test_deeppot_dpa_pt.cc — Removed the test cases cpu_lmp_nlist and cpu_lmp_nlist_atomic from TestInferDeepPotDpaPtNopbc.

Possibly related PRs

Suggested reviewers

  • njzjz
  • wanghan-iapcm
  • iProzd

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
source/op/pt/comm.cc (1)

255-268: Efficient optimization for backward pass

The changes effectively optimize the backward pass by conditionally allocating tensors only when nswap != 0. This improvement aligns with the PR objective and enhances memory efficiency.

Consider adding a brief comment explaining the purpose of max_recvnum for improved code clarity:

// Allocate receive buffer large enough for the maximum number of atoms received in any swap
int max_recvnum = sendnum_tensor.max().item<int>();
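For illustration only, a minimal self-contained sketch of that conditional-allocation pattern; the names nswap, sendnum_tensor, and feature_dim are placeholders, not the actual identifiers in comm.cc:

#include <torch/torch.h>
#include <cstdint>

// Sketch: only materialize the receive buffer when at least one swap occurs.
// Placeholder names; this is not the upstream comm.cc code.
torch::Tensor make_recv_buffer(int nswap,
                               const torch::Tensor& sendnum_tensor,
                               std::int64_t feature_dim) {
  torch::Tensor recv_buffer;  // left undefined when there is nothing to receive
  if (nswap != 0) {
    // Allocate a receive buffer large enough for the maximum number of atoms
    // received in any swap.
    int max_recvnum = sendnum_tensor.max().item<int>();
    recv_buffer = torch::empty({max_recvnum, feature_dim}, torch::kFloat64);
  }
  return recv_buffer;
}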
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between b4701da and 90d0648.

📒 Files selected for processing (3)
  • source/api_cc/src/DeepPotPT.cc (2 hunks)
  • source/lmp/tests/test_lammps_dpa_pt_nopbc.py (0 hunks)
  • source/op/pt/comm.cc (2 hunks)
💤 Files with no reviewable changes (1)
  • source/lmp/tests/test_lammps_dpa_pt_nopbc.py
🧰 Additional context used
🔇 Additional comments (4)
source/op/pt/comm.cc (2)

123-131: Excellent optimization for empty send lists

The changes effectively address the PR objective by introducing a conditional check for nsend != 0. This optimization prevents unnecessary tensor operations when there's nothing to send, improving efficiency and avoiding potential issues with empty tensors.
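For readers unfamiliar with the pattern, a hedged sketch of such a guard follows; the identifiers are illustrative and not the actual comm.cc code:

#include <torch/torch.h>

// Sketch: skip the gather entirely when the send list is empty, so no
// zero-sized tensor operation is issued. Placeholder names throughout.
torch::Tensor gather_rows_to_send(int nsend,
                                  const torch::Tensor& features,
                                  const torch::Tensor& sendlist_tensor) {
  torch::Tensor send_buffer;
  if (nsend != 0) {
    // Select only the rows that actually need to be communicated.
    send_buffer = features.index_select(0, sendlist_tensor);
  }
  return send_buffer;
}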


Line range hint 1-431: Summary: Effective optimization of message passing

The changes in this file successfully address the PR objective of fixing message passing when the send list is empty. The optimizations in both the forward and backward passes improve efficiency by avoiding unnecessary operations and tensor allocations. These changes enhance the overall performance and robustness of the Border class, particularly in edge cases with empty send lists or no swaps.

source/api_cc/src/DeepPotPT.cc (2)

Line range hint 307-332: Verify all call sites for the updated compute method signature

The compute method signature has been modified to remove the nghost and lmp_list parameters. This change suggests that the method no longer relies on external neighbor list information. While this simplification might improve the interface, it's crucial to ensure that all call sites in the codebase have been updated accordingly.

Please run the following verification to check for any remaining calls with the old signature:

#!/bin/bash
# Search for calls to compute with the old signature
rg "compute\s*\([^)]*nghost" source/
rg "compute\s*\([^)]*lmp_list" source/

If any results are found, they will need to be updated to match the new signature. Additionally, consider updating the documentation to reflect this change in the method's interface.

Also applies to: 431-456


Line range hint 171-221: Verify the impact of simplified message passing logic

The changes to the message passing logic have simplified the conditions, which aligns with the PR objective. However, this simplification might have some implications:

  1. The removal of the nghost > 0 check could lead to unnecessary operations when there are no ghost atoms.
  2. The behavior in edge cases (e.g., when nghost is 0 and the send list is empty) might have changed.

To ensure these changes don't introduce any regressions, please run the following verification:

Additionally, consider adding more test cases to cover these edge scenarios if they don't already exist.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
source/api_cc/src/DeepPotPT.cc (1)

Line range hint 171-189: LGTM! Consider adding a comment for clarity.

The changes effectively address the PR objective by simplifying the message passing condition and handling communication data more robustly. This should resolve issues when nghost is 0 and the send list is empty.

Consider adding a brief comment explaining the purpose of the communicator_tensor initialization, especially the reason for the empty tensor when lmp_list.world == 0. This would improve code readability and maintainability.
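One possible shape of that comment plus the guarded initialization, sketched under the assumption that lmp_list.world is either null (serial run) or points at a 64-bit communicator handle; the struct below is a stand-in, not the real neighbor-list type:

#include <torch/torch.h>

// Stand-in for the real neighbor-list type; only the `world` field matters here.
struct CommInfoSketch {
  const void* world = nullptr;  // MPI communicator handle, or null in serial runs
};

torch::Tensor make_communicator_tensor(const CommInfoSketch& lmp_list) {
  if (lmp_list.world == nullptr) {
    // No communicator attached (e.g. a serial run): hand the model a placeholder
    // tensor so it can detect this case instead of dereferencing a null pointer.
    return torch::empty({1}, torch::kInt64);
  }
  // Zero-copy view over the 64-bit handle; from_blob does not take ownership.
  return torch::from_blob(const_cast<void*>(lmp_list.world), {1}, torch::kInt64);
}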

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 90d0648 and d8541da.

📒 Files selected for processing (1)
  • source/api_cc/src/DeepPotPT.cc (3 hunks)
🧰 Additional context used
🔇 Additional comments (2)
source/api_cc/src/DeepPotPT.cc (2)

Line range hint 1-577: Overall changes align well with PR objectives.

The modifications in this file are focused on the compute method, particularly on simplifying the message passing logic and tensor operations. These changes directly address the PR objective of fixing message passing when nghost is 0 and the send list is empty.

The targeted nature of the changes, along with the unchanged error handling and type mapping, suggests a careful and focused approach to resolving the specific issues mentioned in the PR description.


225-225: Verify the initialization of mapping_tensor.

The simplification of the condition for calling different run_method variants is consistent with the earlier changes. However, mapping_tensor is being used here without a visible initialization in the provided code segment.

Please ensure that mapping_tensor is properly initialized before this line. If it's initialized elsewhere in the file, consider moving the initialization closer to its usage or adding a comment to clarify where it's set up.
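If it helps, a hedged sketch of building the tensor right at the call site; the vector name `mapping` and the {1, N} shape are assumptions for illustration, not the file's actual code:

#include <torch/torch.h>
#include <cstdint>
#include <vector>

// Sketch: construct mapping_tensor next to where it is consumed so readers see
// its provenance. `mapping` stands in for the per-atom map prepared earlier.
torch::Tensor make_mapping_tensor(std::vector<std::int64_t>& mapping,
                                  const torch::Device& device) {
  // Zero-copy view over the host vector, then moved to the model's device.
  return torch::from_blob(mapping.data(),
                          {1, static_cast<std::int64_t>(mapping.size())},
                          torch::kInt64)
      .to(device);
}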

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
source/api_cc/src/DeepPotPT.cc (2)

Confirmed: const_cast Usage in DeepPotPT.cc

The use of const_cast in source/api_cc/src/DeepPotPT.cc lines 180-189 is confirmed and may lead to undefined behavior. Please refactor to eliminate the need for const_cast.

🔗 Analysis chain

Line range hint 171-186: Review the new tensor operations for MPI communication

The new tensor operations for MPI communication data look good and should improve performance by using zero-copy tensor creation with torch::from_blob. However, there's a potential issue with the use of const_cast for lmp_list.world.

The use of const_cast in line 184-185 could lead to undefined behavior if the underlying data is modified. Consider refactoring to avoid the need for const_cast. For example:

-      communicator_tensor = torch::from_blob(const_cast<void*>(lmp_list.world),
-                                             {1}, torch::kInt64);
+      communicator_tensor = torch::from_blob(const_cast<std::int64_t*>(static_cast<const std::int64_t*>(lmp_list.world)),
+                                             {1}, torch::kInt64);

This change makes the cast more explicit and type-safe. However, it's crucial to ensure that lmp_list.world is indeed of type std::int64_t*.

To ensure that this change doesn't introduce any issues, please run the following script to check for any other instances of const_cast in the codebase:

This will help identify any other potential issues related to const-correctness in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other instances of const_cast
rg "const_cast" --type cpp

Length of output: 570
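Following up on the const_cast suggestion above, another option (sketched under the assumption that lmp_list.world points at a value that fits in 64 bits, which is not verified against the real headers) is to copy the handle into an owning tensor, removing both the const_cast and any lifetime coupling to lmp_list:

#include <torch/torch.h>
#include <cstdint>

// Hypothetical alternative: copy the communicator handle instead of aliasing it.
torch::Tensor copy_communicator_handle(const void* world) {
  const std::int64_t handle =
      *static_cast<const std::int64_t*>(world);  // read-only access, no const_cast
  return torch::tensor({handle}, torch::dtype(torch::kInt64));  // owning copy
}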


Remaining nghost References Detected

The search found multiple instances of nghost across various files in the codebase. These remnants indicate that the removal of ghost atom checks was incomplete and may lead to inconsistencies.

  • source/op/pt/comm.cc
  • source/op/tf/pairwise.cc
  • source/lmp/pair_deepmd.cpp
  • source/lmp/fix_dplr.cpp
  • source/lmp/pppm_dplr.cpp
  • source/lmp/compute_deeptensor_atom.cpp
  • source/api_c/tests/test_select_by_type.cc
  • source/api_c/tests/test_dipolecharge.cc
  • source/api_cc/src/common.cc
  • source/api_cc/src/DeepTensorTF.cc
  • source/api_cc/tests/test_select_by_type.cc
  • source/api_cc/src/DeepPot.cc
  • source/api_cc/src/DataModifierTF.cc
  • source/api_cc/src/DataModifier.cc
  • source/api_cc/src/DeepTensor.cc
  • source/api_c/include/deepmd.hpp
  • source/api_c/src/c_api.cc
  • source/api_cc/include/commonTF.h
  • source/api_cc/include/common.h
  • source/api_cc/include/DeepTensorTF.h
  • source/api_cc/include/DeepPotTF.h
  • source/api_cc/include/DeepPotPT.h
  • source/api_cc/include/DeepPot.h
  • source/api_cc/include/DataModifierTF.h
  • source/api_cc/include/DeepTensor.h
  • source/api_cc/include/DataModifier.h

Please address these remaining references to ensure the removal is thorough and the system behaves as expected.

🔗 Analysis chain

Line range hint 171-186: Verify the impact of removing ghost atom checks

The removal of nghost checks simplifies the code, but it may affect the system's behavior, especially in multi-node simulations or when dealing with boundary conditions.

Please confirm that this change doesn't introduce any issues in the following scenarios:

  1. Single-node simulations with periodic boundary conditions
  2. Multi-node simulations with domain decomposition

To assist in verification, you can run the following script to check for any remaining references to nghost in the codebase:

This will help ensure that the removal of ghost atom checks is consistent across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for remaining references to nghost
rg "nghost" --type cpp

Length of output: 43957

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between d8541da and 389b914.

📒 Files selected for processing (2)
  • source/api_cc/src/DeepPotPT.cc (3 hunks)
  • source/api_cc/tests/test_deeppot_dpa_pt.cc (0 hunks)
💤 Files with no reviewable changes (1)
  • source/api_cc/tests/test_deeppot_dpa_pt.cc
🧰 Additional context used
🔇 Additional comments (1)
source/api_cc/src/DeepPotPT.cc (1)

Line range hint 221-230: Review changes in the module's forward method call

The modification to the forward method call, which now includes a check for do_message_passing, is a good improvement. It allows for conditional execution of message passing in the neural network model, which can potentially improve performance when message passing is not required.

The change aligns well with the removal of ghost atom checks, as it moves the communication handling to a higher level within the model.

To ensure that this change is consistently applied and doesn't introduce any issues, please run the following script:

#!/bin/bash
# Search for other instances of forward_lower method calls
rg "forward_lower" --type cpp

# Check for any remaining references to ghost atoms in forward calls
rg "forward.*ghost" --type cpp

This will help verify that the changes are consistent across the codebase and that there are no remaining references to ghost atoms in forward method calls that might have been missed.


codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 5 lines in your changes missing coverage. Please review.

Project coverage is 84.55%. Comparing base (b4701da) to head (e3f9433).
Report is 5 commits behind head on devel.

Files with missing lines (patch % / lines):
  • source/op/pt/comm.cc — 85.71%, 0 missing and 4 partials ⚠️
  • source/api_cc/src/DeepPotPT.cc — 85.71%, 0 missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #4237      +/-   ##
==========================================
- Coverage   84.55%   84.55%   -0.01%     
==========================================
  Files         537      537              
  Lines       51237    51244       +7     
  Branches     3047     3051       +4     
==========================================
+ Hits        43324    43327       +3     
- Misses       6965     6967       +2     
- Partials      948      950       +2     


Member

@njzjz left a comment

This PR seems to remove #4144, the minimal DPA-2 support for the old interface? I do need it to work for AmberTools.

@njzjz linked an issue Oct 22, 2024 that may be closed by this pull request
Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
source/api_cc/src/DeepPotPT.cc (1)

Line range hint 225-231: Refactor the ternary operator for improved readability

Using a ternary operator with a complex multi-line function call can reduce code readability and maintainability. It can make the code harder to understand and debug.

Consider refactoring this code segment to use an if-else statement for better clarity.

Apply this diff to refactor the code:

-c10::Dict<c10::IValue, c10::IValue> outputs =
-    (do_message_passing == 1)
-        ? module
-              .run_method("forward_lower", coord_wrapped_Tensor, atype_Tensor,
-                          firstneigh_tensor, mapping_tensor, fparam_tensor,
-                          aparam_tensor, do_atom_virial_tensor, comm_dict)
-              .toGenericDict()
-        : module
-              .run_method("forward_lower", coord_wrapped_Tensor, atype_Tensor,
-                          firstneigh_tensor, mapping_tensor, fparam_tensor,
-                          aparam_tensor, do_atom_virial_tensor)
-              .toGenericDict();
+c10::Dict<c10::IValue, c10::IValue> outputs;
+if (do_message_passing == 1) {
+    outputs = module
+        .run_method("forward_lower", coord_wrapped_Tensor, atype_Tensor,
+                    firstneigh_tensor, mapping_tensor, fparam_tensor,
+                    aparam_tensor, do_atom_virial_tensor, comm_dict)
+        .toGenericDict();
+} else {
+    outputs = module
+        .run_method("forward_lower", coord_wrapped_Tensor, atype_Tensor,
+                    firstneigh_tensor, mapping_tensor, fparam_tensor,
+                    aparam_tensor, do_atom_virial_tensor)
+        .toGenericDict();
+}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 389b914 and 410da82.

📒 Files selected for processing (2)
  • source/api_cc/src/DeepPotPT.cc (3 hunks)
  • source/op/pt/comm.cc (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • source/op/pt/comm.cc
🧰 Additional context used
🔇 Additional comments (1)
source/api_cc/src/DeepPotPT.cc (1)

Line range hint 171-190: Validate the safety of casting lmp_list.world

The use of const_cast<void*>(lmp_list.world) when creating communicator_tensor may lead to undefined behavior if lmp_list.world is not a valid pointer or if it is nullptr. This could potentially cause segmentation faults or other runtime errors.

Please ensure that lmp_list.world is a valid and appropriately typed pointer before casting. Consider adding a check to verify its validity or using safer casting mechanisms.

@iProzd added this pull request to the merge queue Oct 23, 2024
Merged via the queue into deepmodeling:devel with commit 18026eb Oct 23, 2024
60 checks passed
Development

Successfully merging this pull request may close these issues.

[BUG] DPA2 Lammps on nopbc systems causes torchscript error
4 participants