# MLflow Tracking

## Default MLflow Tracking in Sapsan

### Starting MLflow Server
The `mlflow ui` server will automatically start locally if the designated port is open. If not, Sapsan assumes the `mlflow ui` server is already running on that local port and will direct MLflow to write to it. You can also start `mlflow ui` manually via:

```bash
mlflow ui
```
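In practice, tracking is enabled by passing an MLflow backend to your experiment. Below is a minimal sketch, assuming the `MLflowBackend(name, host, port)` interface seen in Sapsan's example scripts; the import path and signature may differ between versions:

```python
from sapsan.lib.backends import MLflowBackend  # path assumed; check your version

# If the port is free, Sapsan starts `mlflow ui` there automatically;
# otherwise it assumes a server is already listening and logs to it.
tracking_backend = MLflowBackend(name="CNN_experiment",
                                 host="localhost",
                                 port=9999)
```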
### Structure
By default, Sapsan will keep the following structure in MLflow:
- Train 1
  - Evaluate 1
  - Evaluate 2
- Train 2
  - Evaluate 1
  - Evaluate 2
where all evaluation runs are nested under the corresponding Train run entry. This way, all evaluations are grouped together under the model that was just trained.
Every Train run first checks for other active runs, terminates them, and starts a new run. At the end of the Train method the run is left open, awaiting Evaluate runs to be nested under it. Evaluate runs, on the other hand, are started and ended within the Evaluate method itself. However, one can still add extra metrics, artifacts, etc. by resuming a previously closed run and writing to it, as discussed in the After Training or Evaluation section below.
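This grouping uses MLflow's standard nested-run mechanism. Here is a minimal sketch of the pattern with the plain MLflow API (illustrative only, not Sapsan's internal code):

```python
import mlflow

# Parent run stays open while evaluations are logged under it
with mlflow.start_run(run_name="Train 1"):
    mlflow.log_param("n_epochs", 10)

    # Each evaluation is opened and closed as a nested child run
    with mlflow.start_run(run_name="Evaluate 1", nested=True):
        mlflow.log_metric("eval - MSE Loss", 0.042)

    with mlflow.start_run(run_name="Evaluate 2", nested=True):
        mlflow.log_metric("eval - MSE Loss", 0.040)
```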
### Tracked Parameters

Evaluation runs also include the training model's parameters and metrics to make them easier to parse through. Here is a complete list of what is tracked by default after running a Train or Evaluate loop.
Parameter | Train | Evaluate |
---|---|---|
Everything passed to `ModelConfig()` (including new parameters passed to `kwargs`) | ✅ | ✅ |
`model - {parameter}` - device, logdir, lr, min_delta, min_lr, n_epochs, patience | ✅ | ✅ |
`data - {parameter}` - features, features_label, target, target_label, axis, path, shuffle | ✅ | ✅ |
`chkpnt - {parameter}` - initial_size, sample_to_size, batch_size, batch_num, time, time_granularity | ✅ | ✅ |
Since Train metrics are also recorded for Evaluate runs, they are prefixed as `train - {metric}`. Subsequently, all Evaluate metrics are written as `eval - {metric}`.
Metrics | Train | Evaluate |
---|---|---|
`eval - MSE Loss` - Mean Squared Error (if the target is provided) | | ✅ |
`eval - KS Stat` - Kolmogorov-Smirnov Statistic (if the target is provided) | | ✅ |
`train - final epoch` - final training epoch | ✅ | ✅ |
All model metrics `model.metrics()` (provided by Catalyst and PyTorch) | ✅ | ✅ |
Runtime | ✅ | ✅ |
Artifacts | Train | Evaluate |
---|---|---|
`model_details.txt` - model layer init & optimizer settings | ✅ | ✅ |
`model_forward.txt` - `Model.forward()` function | ✅ | ✅ |
`runtime_log.html` - loss vs. epoch training progress | ✅ | ✅ |
`pdf_cdf.png` - Probability Density Function (PDF) & Cumulative Distribution Function (CDF) plots | | ✅ |
`slices_plot.png` - 2D Spatial Distribution (slice snapshots) | | ✅ |
## Adding extra parameters

### Before Training
To add a `new_parameter` to be tracked with MLflow for your run, simply pass it to the config like so: `ModelConfig(new_parameter=value)`. Since it will be initialized under `ModelConfig().kwargs['new_parameter']`, the parameter name can be anything. You will see it recorded in MLflow as `model - new_parameter`.
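For instance, with the CNN3d estimator this might look as follows; this is a hedged sketch where the import path may differ between Sapsan versions, and `my_note` is a hypothetical extra parameter:

```python
from sapsan.lib.estimator import CNN3d, CNN3dConfig  # path assumed

# `my_note` is not a declared CNN3dConfig argument, so it lands in
# kwargs and will show up in MLflow as `model - my_note`
estimator = CNN3d(config=CNN3dConfig(n_epochs=5, my_note="baseline"))
```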
Internally, everything in `ModelConfig().parameters` gets recorded to MLflow. By default, all `ModelConfig()` variables, including `kwargs`, are passed to it. Here is the implementation from the CNN3d estimator:
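A minimal sketch of that implementation pattern, assuming the general shape of Sapsan's estimator configs; attribute names and defaults are approximate, so see the actual CNN3d source for the authoritative version:

```python
from sapsan.lib.estimator import EstimatorConfig  # Sapsan's base config class (path assumed)

class CNN3dConfig(EstimatorConfig):
    def __init__(self, n_epochs=1, patience=10, min_delta=1e-5,
                 logdir="./logs/", lr=1e-3, min_lr=None, *args, **kwargs):
        self.n_epochs = n_epochs
        self.patience = patience
        self.min_delta = min_delta
        self.logdir = logdir
        self.lr = lr
        self.min_lr = min_lr
        self.kwargs = kwargs

        # Everything in self.parameters gets recorded to MLflow;
        # kwargs are folded in, so new parameters need no code changes.
        self.parameters = {f"model - {k}": v for k, v in self.__dict__.items()
                           if k != "kwargs"}
        self.parameters.update({f"model - {k}": v
                                for k, v in self.kwargs.items()})
```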
Note: MLflow doesn't like labels that contain the `/` symbol. Please avoid it, or you might encounter an error.
### After Training or Evaluation
If you want to perform some extra analysis of your model or predictions and record additional metrics after you have called `Train.run()` or `Evaluation.run()`, Sapsan has an interface to do so in 3 steps (a sketch follows the list):

1. Resume the MLflow run

   Since the MLflow run is closed at the end of `Evaluation.run()`, it will need to be resumed before attempting to write to it. For that reason, both the Train and Evaluate classes have a parameter `run_id`, which contains the MLflow run_id. You can use it to resume the run and record new metrics.

2. Record new parameters

   To add extra parameters to the most recent Train or Evaluate entry in MLflow, simply use either the `backend()` interface or the standard MLflow interface.

3. End the run

   In order to keep MLflow tidy, it is advised to call `backend.end()` after you are done.
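For example, using the standard MLflow interface directly (a sketch; the run_id value comes from your Train or Evaluate instance as described in step 1):

```python
import mlflow

# 1. Resume the run that Train/Evaluate closed; substitute the run_id
#    stored on your Train or Evaluate instance
mlflow.start_run(run_id="your_run_id_here")

# 2. Record extra parameters and metrics
mlflow.log_param("analysis - note", "post-hoc check")
mlflow.log_metric("eval - custom score", 0.93)

# 3. End the run to keep MLflow tidy (analogous to calling backend.end())
mlflow.end_run()
```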