Pythae 0.1.0
New features 🚀
Pythae now supports distributed training (built on top of PyTorch DDP). A distributed training run can be launched from a training script in which all of the distributed environment variables are passed to a BaseTrainerConfig instance as follows:
training_config = BaseTrainerConfig(
    num_epochs=10,
    learning_rate=1e-3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    dist_backend="nccl",  # distributed backend
    world_size=8,  # number of gpus to use (n_nodes x n_gpus_per_node)
    rank=0,  # global process/gpu id
    local_rank=0,  # gpu id within the node
    master_addr="localhost",  # master address
    master_port="12345",  # master port
)
The script can then be launched with a launcher such as srun. This module was tested in both mono-node-multi-gpu and multi-node-multi-gpu settings.
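For reference, the sketch below shows one way such a script could fill the distributed fields from the launcher's environment rather than hard-coding them; the SLURM_* variable names and the assumption that MASTER_ADDR and MASTER_PORT are exported beforehand are illustrative choices, not part of the release.

import os

from pythae.trainers import BaseTrainerConfig

# Read the distributed environment set up by the launcher
# (SLURM-style variables here -- an assumption for this sketch).
training_config = BaseTrainerConfig(
    num_epochs=10,
    learning_rate=1e-3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    dist_backend="nccl",
    world_size=int(os.environ["SLURM_NTASKS"]),   # n_nodes x n_gpus_per_node
    rank=int(os.environ["SLURM_PROCID"]),         # global process id
    local_rank=int(os.environ["SLURM_LOCALID"]),  # gpu id within the node
    master_addr=os.environ["MASTER_ADDR"],        # assumed exported by the batch script
    master_port=os.environ["MASTER_PORT"],
)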
- Thanks to @ravih18, MSSSIM_VAE now supports 3D images 🚀
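As an illustration, here is a minimal sketch of a 3D setup following the usual Pythae model/config pattern; the volume shape (channels, depth, height, width) and latent size are arbitrary placeholders.

from pythae.models import MSSSIM_VAE, MSSSIM_VAEConfig

# Placeholder 3D volume shape (channels, depth, height, width) and latent size
model_config = MSSSIM_VAEConfig(
    input_dim=(1, 64, 64, 64),
    latent_dim=16,
)
model = MSSSIM_VAE(model_config=model_config)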
Major Changes
- The way custom optimizers and schedulers are selected and defined has changed. It is no longer needed to build the optimizer (resp. scheduler) and pass it to the Trainer. As of v0.1.0, the choice and parameters of the optimizer and scheduler can be passed directly to the TrainerConfig. See the changes below:
As of v0.1.0
my_model = VAE(model_config=model_config)

# Specify optimizer/scheduler choices and params directly in the Trainer config
training_config = BaseTrainerConfig(
    ...,
    optimizer_cls="AdamW",
    optimizer_params={"betas": (0.91, 0.995)},
    scheduler_cls="MultiStepLR",
    scheduler_params={"milestones": [10, 20, 30], "gamma": 10**(-1/5)}
)

trainer = BaseTrainer(
    model=my_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    training_config=training_config
)

# Launch training
trainer.train()
Before v0.1.0
my_model = VAE(model_config=model_config)

training_config = BaseTrainerConfig(...)

### Optimizer
optimizer = torch.optim.AdamW(my_model.parameters(), lr=training_config.learning_rate, betas=(0.91, 0.995))

### Scheduler
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 20, 30], gamma=10**(-1/5))

# Pass the built instances to the Trainer
trainer = BaseTrainer(
    model=my_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    training_config=training_config,
    optimizer=optimizer,
    scheduler=scheduler
)

# Launch training
trainer.train()
- The batch_size key is no longer available in the Trainer configurations. It is replaced by the keys per_device_train_batch_size and per_device_eval_batch_size, which specify the batch size per device. Please note that if you are in a distributed setting with, for instance, 4 GPUs and specify per_device_train_batch_size=64, this is equivalent to training on a single GPU with a batch size of 4*64.
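In other words, the effective global batch size is the per-device batch size multiplied by world_size; a quick sketch of the arithmetic for that 4-GPU example:

world_size = 4                       # n_nodes x n_gpus_per_node
per_device_train_batch_size = 64
effective_batch_size = world_size * per_device_train_batch_size
print(effective_batch_size)          # 256: the batch size a single-GPU run would need to match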
Minor changes
- Added the ability to specify the desired number of workers for data loading in the Trainer configuration under the keys train_dataloader_num_workers and eval_dataloader_num_workers (see the sketch after this list)
- Cleaned up __init__ of Trainers and moved sanity checks from train method to __init__
- Moved checks on optimizers and schedulers to TrainerConfig __post_init_post_parse__
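For the data-loading workers mentioned in the first item above, a minimal sketch of the new keys (the worker counts are arbitrary):

from pythae.trainers import BaseTrainerConfig

training_config = BaseTrainerConfig(
    num_epochs=10,
    train_dataloader_num_workers=8,   # workers for the train DataLoader
    eval_dataloader_num_workers=4,    # workers for the eval DataLoader
)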