Parallel GPU Training
Automatic
Sapsan relies on Catalyst to implement Distributed Data Parallel (DDP). You can specify ddp=True in ModelConfig, which in turn sets the ddp=True parameter for Catalyst's runner.train(). Let's take a look at how this can be done by adjusting cnn_example:
Example notebook: cnn_example.ipynb
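Below is a minimal sketch of that adjustment, assuming the CNN3d estimator and CNN3dConfig from the CNN example; the import path, the other keyword arguments, and the pre-built loaders are assumptions, and only ddp=True comes from this page:

```python
from sapsan.lib.estimator import CNN3d, CNN3dConfig

# `loaders` are assumed to have been prepared as in cnn_example;
# config keywords other than ddp are illustrative placeholders
estimator = CNN3d(config=CNN3dConfig(n_epochs=5, ddp=True),
                  loaders=loaders)
```

Keep in mind that this only takes effect once the notebook is exported and run as a script, per the warning below.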
DDP is not supported in Jupyter notebooks! You will have to prepare a script.
Thus, it is advised to start off developing and testing your model on a single CPU or GPU in a Jupyter notebook, then download it as a Python script to run on multiple GPUs locally or on HPC. In addition, you will have to add the following statement at the beginning of your script in order for torch.multiprocessing to work correctly:
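The statement itself is not reproduced above. The usual requirement for torch.multiprocessing (which DDP uses to spawn workers) is that the script's entry point be wrapped in a main-module guard, so the sketch below assumes that is what is meant; main() is a hypothetical placeholder for your training code:

```python
# guard the entry point so torch.multiprocessing can safely re-import
# the script in each spawned worker process
if __name__ == "__main__":
    main()  # placeholder for the training code exported from the notebook
```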
Even though training will be performed on the GPUs, evaluation will be done on the CPU.
Customizing
For more information and further customization of your parallel setup, see the DDP Tutorial from Catalyst. It might come in handy if you want, among other things, to take control over what portion of the data is copied onto which node. The runner itself, torch_backend.py, can be copied to the project directory and accessed by invoking the --get_torch_backend or --gtb flag when creating a new project, as such:
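The exact command is not preserved here. Assuming Sapsan's project-creation command accepts a project name, the invocation would look roughly as follows; the create subcommand and --name flag are assumptions, while --get_torch_backend/--gtb comes from this page:

```shell
# hypothetical invocation; the project-creation syntax may differ in your Sapsan version
sapsan create --name my_project --get_torch_backend
```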
torch_backend.py contains many important functions, but for customizing DDP you will need to focus on TorchBackend.torch_train(), as shown below. Most likely you will need to adjust self.runner to either another Catalyst runner or your own custom runner. Next, you will need to edit the self.runner.train() parameters accordingly.
- Adjust the Runner here. Check Catalyst's documentation
- Controls automatic Distributed Data Parallel (DDP)
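The original listing of TorchBackend.torch_train() is not reproduced above, so here is a minimal sketch of the relevant portion, based only on what this page describes (self.runner, self.runner.train(), and the ddp flag); attribute names such as self.model, self.loaders, self.loss_func, self.optimizer, and self.config are assumptions:

```python
from catalyst import dl

class TorchBackend:
    def torch_train(self):
        # Adjust the Runner here; swap in another Catalyst runner
        # or your own custom runner (check Catalyst's documentation)
        self.runner = dl.SupervisedRunner()

        # Edit the train() parameters below to match your setup
        self.runner.train(
            model=self.model,                 # assumed attribute names
            loaders=self.loaders,
            criterion=self.loss_func,
            optimizer=self.optimizer,
            num_epochs=self.config.n_epochs,
            ddp=self.config.ddp,              # controls automatic Distributed Data Parallel (DDP)
            verbose=True,
        )
```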