Pythae 0.1.0
New features 🚀
Pythae now supports distributed training (built on top of PyTorch DDP). A distributed training run can be launched from a training script in which all of the distributed environment variables are passed to a BaseTrainerConfig instance as follows:
training_config = BaseTrainerConfig(
    num_epochs=10,
    learning_rate=1e-3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    dist_backend="nccl",  # distributed backend
    world_size=8,  # number of gpus to use (n_nodes x n_gpus_per_node)
    rank=0,  # global process/gpu id
    local_rank=0,  # gpu id within the node
    master_addr="localhost",  # master address
    master_port="12345",  # master port
)
The script can then be launched with a launcher such as srun. This module was tested in both mono-node-multi-gpu and multi-node-multi-gpu settings.
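For reference, the sketch below shows one way such a script could fill the distributed fields from the launcher's environment rather than hard-coding them; the SLURM_* variable names and the assumption that MASTER_ADDR and MASTER_PORT are exported beforehand are illustrative choices, not part of the release.

import os

from pythae.trainers import BaseTrainerConfig

# Read the distributed environment set up by the launcher
# (SLURM-style variables here -- an assumption for this sketch).
training_config = BaseTrainerConfig(
    num_epochs=10,
    learning_rate=1e-3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    dist_backend="nccl",
    world_size=int(os.environ["SLURM_NTASKS"]),   # n_nodes x n_gpus_per_node
    rank=int(os.environ["SLURM_PROCID"]),         # global process id
    local_rank=int(os.environ["SLURM_LOCALID"]),  # gpu id within the node
    master_addr=os.environ["MASTER_ADDR"],        # assumed exported by the batch script
    master_port=os.environ["MASTER_PORT"],
)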
- Thanks to @ravih18, MSSSIM_VAE now supports 3D images 🚀
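As an illustration, here is a minimal sketch of a 3D setup following the usual Pythae model/config pattern; the volume shape (channels, depth, height, width) and latent size are arbitrary placeholders.

from pythae.models import MSSSIM_VAE, MSSSIM_VAEConfig

# Placeholder 3D volume shape (channels, depth, height, width) and latent size
model_config = MSSSIM_VAEConfig(
    input_dim=(1, 64, 64, 64),
    latent_dim=16,
)
model = MSSSIM_VAE(model_config=model_config)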
Major Changes
- The way custom optimizers and schedulers are selected and defined has changed. It is no longer needed to build the optimizer (resp. scheduler) and pass it to the Trainer. As of v0.1.0, the choice and parameters of the optimizer and scheduler can be passed directly to the TrainerConfig. See the changes below:
As of v0.1.0
my_model = VAE(model_config=model_config)

# Specify optimizer/scheduler choices and params directly in the Trainer config
training_config = BaseTrainerConfig(
    ...,
    optimizer_cls="AdamW",
    optimizer_params={"betas": (0.91, 0.995)},
    scheduler_cls="MultiStepLR",
    scheduler_params={"milestones": [10, 20, 30], "gamma": 10**(-1/5)}
)

trainer = BaseTrainer(
    model=my_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    training_config=training_config
)

# Launch training
trainer.train()
Before v0.1.0
my_model = VAE(model_config=model_config)

training_config = BaseTrainerConfig(...)

### Optimizer
optimizer = torch.optim.AdamW(my_model.parameters(), lr=training_config.learning_rate, betas=(0.91, 0.995))

### Scheduler
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 20, 30], gamma=10**(-1/5))

# Pass the built instances to the Trainer
trainer = BaseTrainer(
    model=my_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    training_config=training_config,
    optimizer=optimizer,
    scheduler=scheduler
)

# Launch training
trainer.train()
- The batch_size key is no longer available in the Trainer configurations. It is replaced by the keys per_device_train_batch_size and per_device_eval_batch_size, which specify the batch size per device. Please note that if you are in a distributed setting with, for instance, 4 GPUs and specify per_device_train_batch_size=64, this is equivalent to training on a single GPU with a batch size of 4*64.
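In other words, the effective global batch size is the per-device batch size multiplied by world_size; a quick sketch of the arithmetic for that 4-GPU example:

world_size = 4                       # n_nodes x n_gpus_per_node
per_device_train_batch_size = 64
effective_batch_size = world_size * per_device_train_batch_size
print(effective_batch_size)          # 256: the batch size a single-GPU run would need to match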
Minor changes
- Added the ability to specify the desired number of workers for data loading in the Trainer configuration under the keys train_dataloader_num_workers and eval_dataloader_num_workers (see the sketch after this list)
- Cleaned up __init__ of Trainers and moved sanity checks from train method to __init__
- Moved checks on optimizers and schedulers to TrainerConfig __post_init_post_parse__
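For the data-loading workers mentioned in the first item above, a minimal sketch of the new keys (the worker counts are arbitrary):

from pythae.trainers import BaseTrainerConfig

training_config = BaseTrainerConfig(
    num_epochs=10,
    train_dataloader_num_workers=8,   # workers for the train DataLoader
    eval_dataloader_num_workers=4,    # workers for the eval DataLoader
)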