upgrade version 0.3.8 #1280

Closed
76 changes: 76 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,76 @@
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <[email protected]>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
5 changes: 5 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,5 @@
# Contributing

## This repo is part of [DLRover](https://github.com/intelligent-machine-learning/dlrover)

For details on how to contribute to DLRover, please see the main [contributing document](https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/developer_guide.md).
3 changes: 3 additions & 0 deletions MAINTAINERS.md
@@ -0,0 +1,3 @@
# Maintainers

For details please see [CODEOWNERS](https://github.com/majieyue/dlrover/blob/lfai-20240919/.github/CODEOWNERS).
95 changes: 95 additions & 0 deletions RELEASES.md
@@ -0,0 +1,95 @@
# Release Notes

The DLRover project follows the semantic versioning scheme and maintains a separate branch for each minor version. The main branch always represents the next upcoming minor or major version.

For the latest news about DLRover, see: https://github.com/intelligent-machine-learning/dlrover?tab=readme-ov-file#latest-news=

## Release 0.3.7 on May 13

Features:
* Flash Checkpoint supports deleting old checkpoints.

BugFix:
* Save/load the non-params-related variables of the distributed optimizer in Megatron-LM models.
* The agent waits for asynchronous checkpoint saving to finish before exiting.

## Release 0.3.6 on Apr 24

Features:
* Flash checkpoint provides FlashCkptTrainer to support the HuggingFace transformers.Trainer.
* Flash checkpoint supports loading the checkpoint of Megatron-LM from the memory.
* Flash Checkpoint supports saving and loading FSDP checkpoints with full state dict.
* The job master can sort node ranks by the access switches of the nodes.

BugFix:
* Fix the segmentation fault when restarting the training process.

## Release 0.3.5 on Mar 29

Features:
* Flash checkpoint supports saving and loading Megatron-LM MOE models. #1042
* APIs to extend the module that checks nodes with different chips. #1023
* Automatically mark the node as unschedulable if the node fails. #1025

BugFix:
* Fix the MNIST DDP example to save and load checkpoints. #1051
* Fix the checkpoint name of DDP. #1034

## Release 0.3.4 on Feb 21

Features:
* Flash checkpoint enables saving and loading Megatron-LM models from multiple ranks in parallel.
* `dlrover-run --auto-config` automatically configures the number of nodes and the number of processes per node.
* Users can customize the storage APIs to save the checkpoint to different file systems.
* A deletion strategy to clean up old checkpoint files.

BugFix:
* Fix the bug where the shared memory does not exist if the size of the checkpoint changes.

## Release 0.3.3 on Jan 25

Features:
* Support Python > 3.10.
* Support restarting the training process on Ascend NPU.
* Support asynchronously saving the checkpoint of the distributed optimizer of Megatron-LM to the storage.

BugFix:
* Fix the checkpoint shard inconsistency of all ranks.
* Fix the bug in asynchronously saving the Megatron-LM checkpoint for jobs with multiple GPUs on multiple nodes.
* Fix the bug in loading the Megatron-LM checkpoint.

## Release 0.3.1 on Jan 10

Feature:
* Users can use flash checkpoint with `torchrun` or `python -m torch.distributed.launch`.

Bugfix:
* Fix the issue that the dlrover master cannot print the error message of the faulty node in a Kubeflow PyTorchJob.

## Release 0.3.0 on Jan 3

Features:
* Flash Checkpoint to asynchronously persist checkpoint to storage.
* Flash Checkpoint recovers failure in memory.
* Flash Checkpoint supports DDP/FSDP/DeepSpeed/Megatron.
* Node detection supports NPU.

Examples:
* The example of training nanoGPT using DeepSpeed.
* The example to save/load a sharded FSDP checkpoint.


## Release 0.2.2 on Nov 21, 2023

Features:
* `dlrover-run` can run any distributed job with `NODE_RANK` and `DLROVER_MASTER_ADDR` set in the environment.
* DLRover can asynchronously save the checkpoint to storage, which blocks training for only a short time.

BugFix:
* Fix the bug in loading the FSDP checkpoint.

## Release 0.2.1 on Oct 11, 2023

* Autotuning batch size without restarting the job.
* Automatically detect the straggler (slow worker).
* TFPlus: TFPlus 0.1.0 has been released; see details at https://github.com/intelligent-machine-learning/dlrover/tree/master/tfplus
3 changes: 2 additions & 1 deletion dlrover/python/elastic_agent/diagnosis/diagnosis_agent.py
@@ -20,6 +20,7 @@
from dlrover.python.common.constants import TrainingExceptionLevel
from dlrover.python.common.error import ProcessError
from dlrover.python.common.log import default_logger as logger
from dlrover.python.common.singleton import Singleton
from dlrover.python.common.worker import WorkerContext
from dlrover.python.diagnosis.common.constants import (
DiagnoseAction,
@@ -41,7 +42,7 @@
from dlrover.python.elastic_agent.master_client import MasterClient


-class DiagnosisAgent:
+class DiagnosisAgent(Singleton):
def __init__(self, training_log_file: str, errors: str):
self._client = MasterClient.singleton_instance()
self._training_log_file = training_log_file
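
Note on the change above: `DiagnosisAgent` now derives from `Singleton`, which is what lets the test further below fetch it via `DiagnosisAgent.singleton_instance(...)`. The following is a minimal sketch of how such a mixin typically works; the class name `DiagnosisAgentSketch` and the locking details are illustrative assumptions, not DLRover's actual implementation.

```python
import threading


class Singleton:
    """Minimal singleton mixin sketch; DLRover's real class may differ."""

    _instance = None
    _instance_lock = threading.Lock()

    @classmethod
    def singleton_instance(cls, *args, **kwargs):
        # Create the instance on first use; later calls return the same object.
        if cls._instance is None:
            with cls._instance_lock:
                if cls._instance is None:
                    cls._instance = cls(*args, **kwargs)
        return cls._instance


class DiagnosisAgentSketch(Singleton):
    """Hypothetical stand-in for DiagnosisAgent, used only to show the pattern."""

    def __init__(self, training_log_file: str, errors: str):
        self._training_log_file = training_log_file
        self._errors = errors


first = DiagnosisAgentSketch.singleton_instance("/tmp/train.log", "error code is 11111")
second = DiagnosisAgentSketch.singleton_instance()
assert first is second  # both names refer to the same instance
```
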
2 changes: 1 addition & 1 deletion dlrover/python/elastic_agent/monitor/resource.py
@@ -117,7 +117,7 @@ def start(self):

# The first time called cpu_percent will return a meaningless 0.0
# value which we are supposed to ignore. So, here we call it at
-# the begining of monitor and the next value is valid.
+# the beginning of monitor and the next value is valid.
get_process_cpu_percent()

def stop(self):
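
The comment fix above refers to a real quirk of psutil: the first call to `cpu_percent(interval=None)` has no previous sample to compare against and returns a meaningless 0.0, so the monitor calls it once up front and ignores the result. A small standalone illustration of that behavior, using plain psutil rather than DLRover's `get_process_cpu_percent` wrapper:

```python
import time

import psutil

proc = psutil.Process()

# The first sample has no baseline, so psutil returns a meaningless 0.0.
first = proc.cpu_percent(interval=None)

time.sleep(1)  # let the process accumulate CPU time between samples

# Later samples are computed against the previous call and are meaningful.
second = proc.cpu_percent(interval=None)
print(f"first={first} (ignored), second={second}")
```
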
10 changes: 7 additions & 3 deletions dlrover/python/master/elastic_training/rdzv_manager.py
@@ -200,11 +200,15 @@ def _get_lacking_ranks(self) -> List[int]:
"""

lacking_ranks: List[int] = []
-if self._rdzv_params is None or self._rdzv_params.min_nodes <= 0:
+if (
+    self._rdzv_params is None
+    or self._rdzv_params.min_nodes <= 0
+    or self._rdzv_params.max_nodes <= 0
+):
return lacking_ranks

-min_required = self._rdzv_params.min_nodes
-min_ranks = set([i for i in range(min_required)])
+max_required = self._rdzv_params.max_nodes
+min_ranks = set([i for i in range(max_required)])
if self._waiting_nodes:
waiting_ranks = set(self._waiting_nodes.keys())
else:
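
The change above makes `_get_lacking_ranks` build the expected rank set from `max_nodes` instead of `min_nodes` and return early when either bound is non-positive. Below is a standalone sketch of the resulting logic, with a plain dict standing in for the manager's `_waiting_nodes`; the function and parameter names are illustrative, not the actual class layout.

```python
from typing import Dict, List, Optional


def get_lacking_ranks(
    min_nodes: Optional[int],
    max_nodes: Optional[int],
    waiting_nodes: Dict[int, int],
) -> List[int]:
    """Return ranks in [0, max_nodes) that have not joined the rendezvous yet."""
    if min_nodes is None or max_nodes is None or min_nodes <= 0 or max_nodes <= 0:
        return []
    expected_ranks = set(range(max_nodes))
    waiting_ranks = set(waiting_nodes.keys())
    return sorted(expected_ranks - waiting_ranks)


# Mirrors the updated unit test: 4 expected nodes, all 4 waiting -> nothing lacking.
assert get_lacking_ranks(4, 4, {0: 0, 1: 1, 2: 2, 3: 3}) == []
# 6 expected nodes, only ranks 0-3 waiting -> ranks 4 and 5 are lacking.
assert get_lacking_ranks(6, 6, {0: 0, 1: 1, 2: 2, 3: 3}) == [4, 5]
# No waiting nodes at all -> every expected rank is lacking.
assert get_lacking_ranks(4, 4, {}) == [0, 1, 2, 3]
```
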
12 changes: 9 additions & 3 deletions dlrover/python/master/node/dist_job_manager.py
@@ -134,9 +134,9 @@ def __init__(
raise ValueError(
f"Distribution strategy {job_args.distribution_strategy} "
"is not supported. You can specify it with "
"ParameterServerStrategy/AllreduceStrategy."
"ParameterServerStrategy/AllReduceStrategy."
)
logger.info("New job optimizer : %s", self._job_optimizer.__class__)
logger.info(f"New job optimizer: {self._job_optimizer.__class__}")

worker_restart_count = node_restart_count.get(NodeType.WORKER, 0)
ps_restart_count = node_restart_count.get(NodeType.PS, 0)
@@ -150,6 +150,12 @@ def __init__(
self._ps_relaunch_max_num = min(
ps_restart_count, _MAX_POD_RELAUNCH_COUNT
)
logger.info(
f"Worker relaunch number: {self._relaunch_on_worker_failure}; "
f"PS relaunch number: {self._ps_relaunch_max_num}; "
f"Critical worker index: {self._critical_worker_index}."
)

self._node_event_callbacks: List[NodeEventCallback] = []

# Protects followed variables, which are accessed from event_cb.
@@ -449,7 +455,7 @@ def _monitor_node_heart_beat(self):
logger.warning(detail_trace_back)
time.sleep(15)

-def _get_dead_node_event(self, window_interval=600) -> List[NodeEvent]:
+def _get_dead_node_event(self, window_interval=900) -> List[NodeEvent]:
now = time.time()
dead_events: List[NodeEvent] = []
logger.debug(f"Current job nodes are: {self._job_nodes}.")
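
Raising `window_interval` from 600 to 900 seconds gives nodes more time to report a heartbeat before a dead-node event is emitted. A simplified sketch of that kind of window check follows; the real `_get_dead_node_event` also builds `NodeEvent` objects and inspects node status, so treat the names and structure here as assumptions.

```python
import time
from typing import Dict, List


def get_dead_nodes(
    last_heartbeat: Dict[str, float], window_interval: float = 900
) -> List[str]:
    """Return node names whose last heartbeat is older than the window."""
    now = time.time()
    return [
        name
        for name, ts in last_heartbeat.items()
        if now - ts > window_interval
    ]


heartbeats = {
    "worker-0": time.time() - 30,    # reported recently, considered alive
    "worker-1": time.time() - 1200,  # silent for 20 minutes, considered dead
}
print(get_dead_nodes(heartbeats))  # ['worker-1']
```
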
9 changes: 6 additions & 3 deletions dlrover/python/master/node/worker.py
@@ -368,12 +368,12 @@ def is_training_hang_by_pending(self, total_node_num) -> bool:
pending_nodes
):
logger.debug(
"Skip for no required nodes info " "and not all nodes pending."
"Skip for no required nodes info and not all nodes pending."
)
return False
elif 0 < len(pending_nodes) == total_node_num:
# all nodes pending
logger.debug(f"All nodes pending: {pending_nodes}.")
logger.info(f"All nodes pending: {pending_nodes}.")
else:
# partial nodes pending
# with condition 1 + 2
@@ -404,7 +404,10 @@ def is_training_hang_by_pending(self, total_node_num) -> bool:
if now - first_pending_node.create_time.timestamp() > timeout:
logger.warning(
f"Node {first_pending_node.name} "
f"exceeded pending timeout: {timeout}s."
f"exceeded pending timeout: {timeout}s, "
f"running nodes(size:{len(running_nodes)}): {running_nodes}, "
f"pending nodes(size:{len(pending_nodes)}): {pending_nodes}, "
f"min required nodes size: {self.get_min_nodes_required()}."
)
return True

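
The extra fields in the warning above make the pending-timeout diagnosis easier to read: the check fires when the earliest pending node has been waiting longer than the timeout. A minimal sketch of that check, with a hypothetical `PendingNode` record standing in for DLRover's node objects and an illustrative default timeout:

```python
import time
from dataclasses import dataclass
from typing import List


@dataclass
class PendingNode:
    name: str
    create_timestamp: float


def is_hang_by_pending(
    pending_nodes: List[PendingNode], timeout: float = 600  # illustrative default
) -> bool:
    """Report a hang if the earliest pending node has waited longer than timeout."""
    if not pending_nodes:
        return False
    first = min(pending_nodes, key=lambda n: n.create_timestamp)
    return time.time() - first.create_timestamp > timeout


nodes = [
    PendingNode("worker-2", time.time() - 700),  # pending for ~12 minutes
    PendingNode("worker-3", time.time() - 100),
]
print(is_hang_by_pending(nodes))  # True: worker-2 exceeded the 600s timeout
```
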
2 changes: 1 addition & 1 deletion dlrover/python/tests/test_diagnosis_agent.py
@@ -60,7 +60,7 @@ def test_diagnose_training(self):
file_path = os.path.join(path, file)

errors = "error code is 11111"
-agent = DiagnosisAgent(file_path, errors)
+agent = DiagnosisAgent.singleton_instance(file_path, errors)

spec = _create_worker_spec(
node_rank=0,
11 changes: 6 additions & 5 deletions dlrover/python/tests/test_rdzv_manager.py
@@ -172,23 +172,24 @@ def test_get_lacking_ranks(self):
rdzv_manager = ElasticTrainingRendezvousManager(error_monitor)

rdzv_manager._rdzv_params.min_nodes = 4
rdzv_manager._rdzv_params.max_nodes = 4
rdzv_manager._waiting_nodes = {0: 0, 1: 1, 2: 2, 3: 3}
self.assertEqual(rdzv_manager._get_lacking_ranks(), [])

rdzv_manager._rdzv_params.min_nodes = 5
rdzv_manager._rdzv_params.max_nodes = 5
self.assertEqual(rdzv_manager._get_lacking_ranks(), [4])

rdzv_manager._rdzv_params.min_nodes = 3
rdzv_manager._rdzv_params.max_nodes = 3
self.assertEqual(rdzv_manager._get_lacking_ranks(), [])

rdzv_manager._rdzv_params.min_nodes = 6
rdzv_manager._rdzv_params.max_nodes = 6
self.assertEqual(rdzv_manager._get_lacking_ranks(), [4, 5])

rdzv_manager._rdzv_params.min_nodes = 4
rdzv_manager._rdzv_params.max_nodes = 4
rdzv_manager._waiting_nodes = {}
self.assertEqual(rdzv_manager._get_lacking_ranks(), [0, 1, 2, 3])

rdzv_manager._rdzv_params.min_nodes = 0
rdzv_manager._rdzv_params.max_nodes = 0
self.assertEqual(rdzv_manager._get_lacking_ranks(), [])

def test_multi_updating_waiting_nodes(self):
2 changes: 1 addition & 1 deletion setup.py
@@ -32,7 +32,7 @@

setup(
name="dlrover",
version="0.3.7rc0",
version="0.3.8",
description="An Automatic Distributed Deep Learning Framework",
long_description="DLRover helps model developers focus on model algorithm"
" itself, without taking care of any engineering stuff,"