Customizing model hyperparameter configurations in Neptune ML
When you start a Neptune ML model-training job, Neptune ML automatically uses the information inferred from the preceding data-processing job. It uses the information to generate hyperparameter configuration ranges that are used to create a SageMaker AI hyperparameter tuning job to train multiple models for your task. That way, you don’t have to specify a long list of hyperparameter values for the models to be trained with. Instead, the model hyperparameter ranges and defaults are selected based on the task type, graph type, and the tuning-job settings.
However, you can also override the default hyperparameter configuration and provide custom hyperparameters by modifying a JSON configuration file that the data-processing job generates.
Using the Neptune ML modelTraining API, you can control several high-level hyperparameter tuning job settings, such as maxHPONumberOfTrainingJobs, maxHPOParallelTrainingJobs, and trainingInstanceType. For more fine-grained control over the model hyperparameters, you can customize the model-HPO-configuration.json file that the data-processing job generates. The file is saved in the Amazon S3 location that you specified for processing-job output.
You can download the file, edit it to override the default hyperparameter configurations, and upload it back to the same Amazon S3 location. Do not change the name of the file, and be careful to follow these instructions as you edit.
To download the file from Amazon S3:
aws s3 cp \
    s3://(bucket name)/(path to output folder)/model-HPO-configuration.json \
    ./
When you have finished editing, upload the file back to where it was:
aws s3 cp \
    model-HPO-configuration.json \
    s3://(bucket name)/(path to output folder)/model-HPO-configuration.json
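The download, edit, and upload round trip can also be scripted. The following is a minimal sketch, assuming the AWS CLI is installed and configured. The helper names (s3_uri, cp_commands, edit_config_file, round_trip) and their parameters are hypothetical, not part of Neptune ML.

```python
import json
import subprocess
from pathlib import Path

# The file name must stay unchanged when the file is uploaded again.
FILE_NAME = "model-HPO-configuration.json"


def s3_uri(bucket, output_path):
    """Build the S3 URI of the configuration file (hypothetical helper)."""
    return f"s3://{bucket}/{output_path.strip('/')}/{FILE_NAME}"


def cp_commands(bucket, output_path):
    """Return the download and upload `aws s3 cp` argument lists."""
    uri = s3_uri(bucket, output_path)
    download = ["aws", "s3", "cp", uri, "./"]
    upload = ["aws", "s3", "cp", FILE_NAME, uri]
    return download, upload


def edit_config_file(path, edit):
    """Load the local JSON file, apply `edit(config)` in place, and save it."""
    file_path = Path(path)
    config = json.loads(file_path.read_text())
    edit(config)
    file_path.write_text(json.dumps(config, indent=2))


def round_trip(bucket, output_path, edit):
    """Download the file, edit it, and upload it to the same location."""
    download, upload = cp_commands(bucket, output_path)
    subprocess.run(download, check=True)
    edit_config_file(FILE_NAME, edit)
    subprocess.run(upload, check=True)
```

Only round_trip touches Amazon S3. The other helpers can be tried locally on a copy of the file.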
Structure of the model-HPO-configuration.json file
The model-HPO-configuration.json file specifies the model to be trained, the machine learning task_type, and the hyperparameters that should be varied or fixed for the various runs of model training.
The hyperparameters are categorized as belonging to various tiers that signify the precedence given to the hyperparameters when the hyperparameter tuning job is invoked:
- Tier-1 hyperparameters have the highest precedence. If you set maxHPONumberOfTrainingJobs to a value less than 10, only Tier-1 hyperparameters are tuned, and the rest take their default values.
- Tier-2 hyperparameters have lower precedence. If a tuning job has more than 10 but fewer than 50 total training jobs, both Tier-1 and Tier-2 hyperparameters are tuned.
- Tier-3 hyperparameters are tuned together with Tier-1 and Tier-2 only if you have more than 50 total training jobs.
- Finally, fixed hyperparameters are not tuned at all, and always take their default values.
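The tier rules can be summarized in a few lines of Python. This is only an illustrative sketch: the function name tuned_tiers is hypothetical, and the behavior at exactly 10 and exactly 50 training jobs is an assumption based on the "greater than 10" and "greater than 50" wording used for the Tier-2 and Tier-3 arrays.

```python
from typing import List


def tuned_tiers(max_hpo_number_of_training_jobs: int) -> List[int]:
    """Return the tiers that are tuned for a given training-job budget."""
    tiers = [1]  # Tier-1 hyperparameters always have the highest precedence
    if max_hpo_number_of_training_jobs > 10:
        tiers.append(2)  # assumption: exactly 10 jobs still tunes Tier-1 only
    if max_hpo_number_of_training_jobs > 50:
        tiers.append(3)  # assumption: exactly 50 jobs tunes Tier-1 and Tier-2
    return tiers
```

For example, a budget of 20 training jobs tunes Tier-1 and Tier-2, while Tier-3 and fixed hyperparameters keep their default values.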
Example of a model-HPO-configuration.json file
      The following is a sample model-HPO-configuration.json file:
{
  "models": [
    {
      "model": "rgcn",
      "task_type": "node_class",
      "eval_metric": {
        "metric": "acc"
      },
      "eval_frequency": {
        "type": "evaluate_every_epoch",
        "value": 1
      },
      "1-tier-param": [
        {
          "param": "num-hidden",
          "range": [16, 128],
          "type": "int",
          "inc_strategy": "power2"
        },
        {
          "param": "num-epochs",
          "range": [3, 30],
          "inc_strategy": "linear",
          "inc_val": 1,
          "type": "int",
          "node_strategy": "perM"
        },
        {
          "param": "lr",
          "range": [0.001, 0.01],
          "type": "float",
          "inc_strategy": "log"
        }
      ],
      "2-tier-param": [
        {
          "param": "dropout",
          "range": [0.0, 0.5],
          "inc_strategy": "linear",
          "type": "float",
          "default": 0.3
        },
        {
          "param": "layer-norm",
          "type": "bool",
          "default": true
        }
      ],
      "3-tier-param": [
        {
          "param": "batch-size",
          "range": [128, 4096],
          "inc_strategy": "power2",
          "type": "int",
          "default": 1024
        },
        {
          "param": "fanout",
          "type": "int",
          "options": [[10, 30], [15, 30], [15, 30]],
          "default": [10, 15, 15]
        },
        {
          "param": "num-layer",
          "range": [1, 3],
          "inc_strategy": "linear",
          "inc_val": 1,
          "type": "int",
          "default": 2
        },
        {
          "param": "num-bases",
          "range": [0, 8],
          "inc_strategy": "linear",
          "inc_val": 2,
          "type": "int",
          "default": 0
        }
      ],
      "fixed-param": [
        {
          "param": "concat-node-embed",
          "type": "bool",
          "default": true
        },
        {
          "param": "use-self-loop",
          "type": "bool",
          "default": true
        },
        {
          "param": "low-mem",
          "type": "bool",
          "default": true
        },
        {
          "param": "l2norm",
          "type": "float",
          "default": 0
        }
      ]
    }
  ]
}
Elements of a model-HPO-configuration.json file
The file contains a JSON object with a single top-level array named models that contains a single model-configuration object. When customizing the file, make sure the models array has only one model-configuration object in it. If your file contains more than one model-configuration object, the tuning job will fail with a warning.
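Before uploading an edited file, it can help to check this constraint locally. The following is a minimal sketch, not a Neptune ML utility: the function name get_model_config is hypothetical, and the required keys are the four arrays that this page marks as required.

```python
REQUIRED_ARRAYS = ("1-tier-param", "2-tier-param", "3-tier-param", "fixed-param")


def get_model_config(config: dict) -> dict:
    """Return the single model-configuration object or raise ValueError."""
    models = config.get("models")
    if not isinstance(models, list) or len(models) != 1:
        raise ValueError("'models' must be an array with exactly one object")
    model_config = models[0]
    # Each tier array must be present, although it may be empty.
    missing = [key for key in REQUIRED_ARRAYS if key not in model_config]
    if missing:
        raise ValueError(f"missing required arrays: {missing}")
    return model_config
```

Calling get_model_config on the parsed JSON (for example, the result of json.load) returns the object to edit, or fails early instead of failing in the tuning job.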
The model-configuration object contains the following top-level elements:
- model – (String) The model type to be trained (do not modify). Valid values are:
    - "rgcn" – This is the default for node classification and regression tasks, and for heterogeneous link prediction tasks.
    - "transe" – This is the default for KGE link prediction tasks.
    - "distmult" – This is an alternative model type for KGE link prediction tasks.
    - "rotate" – This is an alternative model type for KGE link prediction tasks.
  As a rule, don't directly modify the model value, because different model types often have substantially different applicable hyperparameters, which can result in a parsing error after the training job has started.
  To change the model type, use the modelName parameter in the modelTraining API rather than changing it in the model-HPO-configuration.json file.
  A way to change the model type and make fine-grained hyperparameter changes is to copy the default model configuration template for the model that you want to use and paste that into the model-HPO-configuration.json file. There is a folder named hpo-configuration-templates in the same Amazon S3 location as the model-HPO-configuration.json file if the inferred task type supports multiple models. This folder contains all the default hyperparameter configurations for the other models that are applicable to the task.
  For example, if you want to change the model and hyperparameter configurations for a KGE link-prediction task from the default transe model to a distmult model, simply paste the contents of the hpo-configuration-templates/distmult.json file into the model-HPO-configuration.json file and then edit the hyperparameters as necessary.
  Note: If you set the modelName parameter in the modelTraining API and also change the model and hyperparameter specification in the model-HPO-configuration.json file, and these are different, the model value in the model-HPO-configuration.json file takes precedence, and the modelName value is ignored.
- task_type – (String) The machine learning task type inferred by or passed directly to the data-processing job (do not modify). Valid values are:
    - "node_class"
    - "node_regression"
    - "link_prediction"
  The data-processing job infers the task type by examining the exported dataset and the generated training-job configuration file for properties of the dataset. This value should not be changed. If you want to train a different task, you need to run a new data-processing job. If the task_type value is not what you were expecting, you should check the inputs to your data-processing job to make sure that they are correct. This includes parameters to the modelTraining API, as well as in the training-job configuration file generated by the data-export process.
- eval_metric – (Object) The evaluation metric to use for evaluating model performance and for selecting the best-performing model across HPO runs. Valid values for its "metric" key are:
    - "acc" – Standard classification accuracy. This is the default for single-label classification tasks, unless imbalanced labels are found during data processing, in which case the default is "F1".
    - "acc_topk" – The number of times the correct label is among the top k predictions. You can also set the value k by passing in topk as an extra key.
    - "F1" – The F1 score.
    - "mse" – Mean-squared error metric, for regression tasks.
    - "mrr" – Mean reciprocal rank metric.
    - "precision" – The model precision, calculated as the ratio of true positives to predicted positives: true-positives / (true-positives + false-positives).
    - "recall" – The model recall, calculated as the ratio of true positives to actual positives: true-positives / (true-positives + false-negatives).
    - "roc_auc" – The area under the ROC curve. This is the default for multi-label classification.
  For example, to change the metric to F1, change the eval_metric value as follows:
    "eval_metric": { "metric": "F1" },
  Or, to change the metric to a topk accuracy score, change eval_metric as follows:
    "eval_metric": { "metric": "acc_topk", "topk": 2 },
- eval_frequency – (Object) Specifies how often during training the performance of the model on the validation set should be checked. Based on the validation performance, early stopping can then be initiated and the best model can be saved.
  The eval_frequency object contains two elements, "type" and "value". For example:
    "eval_frequency": { "type": "evaluate_every_pct", "value": 0.1 },
  Valid type values are:
    - evaluate_every_pct – Specifies the percentage of training to be completed for each evaluation. For evaluate_every_pct, the "value" field contains a floating-point number between zero and one that expresses that percentage as a fraction.
    - evaluate_every_batch – Specifies the number of training batches to be completed for each evaluation. For evaluate_every_batch, the "value" field contains an integer that expresses that batch count.
    - evaluate_every_epoch – Specifies the number of epochs per evaluation, where an epoch is a complete training pass through the graph. For evaluate_every_epoch, the "value" field contains an integer that expresses that epoch count.
  The default setting for eval_frequency is:
    "eval_frequency": { "type": "evaluate_every_epoch", "value": 1 },
- 1-tier-param – (Required) An array of Tier-1 hyperparameters.
  If you don't want to tune any hyperparameters, you can set this to an empty array. This does not affect the total number of training jobs launched by the SageMaker AI hyperparameter tuning job. It just means that all training jobs, if there is more than 1 but fewer than 10, will run with the same set of hyperparameters.
  On the other hand, if you want to treat all your tunable hyperparameters with equal significance, you can put all the hyperparameters in this array.
- 2-tier-param – (Required) An array of Tier-2 hyperparameters.
  These parameters are only tuned if maxHPONumberOfTrainingJobs has a value greater than 10. Otherwise, they are fixed to the default values.
  If you have a training budget of at most 10 training jobs or don't want Tier-2 hyperparameters for any other reason, but you want to tune all tunable hyperparameters, you can set this to an empty array.
- 3-tier-param – (Required) An array of Tier-3 hyperparameters.
  These parameters are only tuned if maxHPONumberOfTrainingJobs has a value greater than 50. Otherwise, they are fixed to the default values.
  If you don't want Tier-3 hyperparameters, you can set this to an empty array.
- fixed-param – (Required) An array of fixed hyperparameters that take only their default values and do not vary in different training jobs.
  If you want to vary all hyperparameters, you can set this to an empty array and either set the value for maxHPONumberOfTrainingJobs large enough to vary all tiers or make all hyperparameters Tier-1.
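As an illustration of editing the top-level elements, the sketch below sets eval_metric and eval_frequency on a model-configuration object that has already been loaded as a Python dict. The function names are hypothetical. Two bounds are assumptions rather than documented rules: the evaluate_every_pct value is checked as greater than 0 and at most 1, and the batch and epoch counts are checked as at least 1.

```python
VALID_METRICS = {"acc", "acc_topk", "F1", "mse", "mrr",
                 "precision", "recall", "roc_auc"}
VALID_FREQUENCY_TYPES = {"evaluate_every_pct", "evaluate_every_batch",
                         "evaluate_every_epoch"}


def set_eval_metric(model_config, metric, topk=None):
    """Set eval_metric; topk is only meaningful for "acc_topk"."""
    if metric not in VALID_METRICS:
        raise ValueError(f"unknown metric: {metric}")
    entry = {"metric": metric}
    if topk is not None:
        if metric != "acc_topk":
            raise ValueError("topk applies only to acc_topk")
        entry["topk"] = topk
    model_config["eval_metric"] = entry


def set_eval_frequency(model_config, freq_type, value):
    """Set eval_frequency after a basic sanity check on the value."""
    if freq_type not in VALID_FREQUENCY_TYPES:
        raise ValueError(f"unknown frequency type: {freq_type}")
    if freq_type == "evaluate_every_pct":
        if not 0 < value <= 1:  # assumed bounds for the fraction
            raise ValueError("value must be a fraction between 0 and 1")
    elif not (isinstance(value, int) and value >= 1):  # assumed lower bound
        raise ValueError("value must be a positive integer")
    model_config["eval_frequency"] = {"type": freq_type, "value": value}
```

For example, set_eval_metric(model_config, "acc_topk", topk=2) produces the same eval_metric object as the topk example above.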
The JSON object that represents each hyperparameter in 1-tier-param,
        2-tier-param, 3-tier-param, and fixed-param
        contains the following elements:
- param – (String) The name of the hyperparameter (do not change).
- type – (String) The hyperparameter type (do not change). Valid types are: bool, int, and float.
- default – (String) The default value for the hyperparameter. You can set a new default value.
Tunable hyperparameters can also contain the following elements:
- range – (Array) The range for a continuous tunable hyperparameter. This should be an array with two values, namely the minimum and maximum of the range ([min, max]).
- options – (Array) The options for a categorical tunable hyperparameter. This array should contain all the options to consider: "options" : [value1, value2, ... valuen]
- inc_strategy – (String) The type of incremental change for continuous tunable hyperparameter ranges (do not change). Valid values are log, linear, and power2. This applies only when the range key is set. Modifying this may result in not using the full range of your hyperparameter for tuning.
- inc_val – (Float) The amount by which successive increments differ for continuous tunable hyperparameters (do not change). This applies only when the range key is set. Modifying this may result in not using the full range of your hyperparameter for tuning.
- node_strategy – (String) Indicates that the effective range for this hyperparameter should change based on the number of nodes in the graph (do not change). Valid values are "perM" (per million), "per10M" (per 10 million), and "per100M" (per 100 million). Rather than change this value, change the range instead.
- edge_strategy – (String) Indicates that the effective range for this hyperparameter should change based on the number of edges in the graph (do not change). Valid values are "perM" (per million), "per10M" (per 10 million), and "per100M" (per 100 million). Rather than change this value, change the range instead.
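The element rules above can be turned into a small local check for a single hyperparameter object. This is a sketch under stated assumptions, not the validation that Neptune ML itself performs: the function name check_hyperparameter is hypothetical, and it only checks the constraints listed on this page.

```python
VALID_TYPES = {"bool", "int", "float"}
VALID_INC_STRATEGIES = {"log", "linear", "power2"}
VALID_SCALE_STRATEGIES = {"perM", "per10M", "per100M"}


def check_hyperparameter(entry: dict) -> list:
    """Return a list of problems found in one hyperparameter object."""
    problems = []
    if "param" not in entry:
        problems.append("missing 'param'")
    if entry.get("type") not in VALID_TYPES:
        problems.append("'type' must be bool, int, or float")
    if "range" in entry:
        rng = entry["range"]
        if not (isinstance(rng, list) and len(rng) == 2 and rng[0] <= rng[1]):
            problems.append("'range' must be [min, max]")
    if "inc_strategy" in entry and entry["inc_strategy"] not in VALID_INC_STRATEGIES:
        problems.append("'inc_strategy' must be log, linear, or power2")
    # inc_strategy and inc_val apply only when the range key is set.
    for key in ("inc_strategy", "inc_val"):
        if key in entry and "range" not in entry:
            problems.append(f"'{key}' applies only when 'range' is set")
    for key in ("node_strategy", "edge_strategy"):
        if key in entry and entry[key] not in VALID_SCALE_STRATEGIES:
            problems.append(f"'{key}' must be perM, per10M, or per100M")
    return problems
```

An empty list means the object passed these checks. It does not guarantee that the hyperparameter is applicable to the selected model.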
List of all the hyperparameters in Neptune ML
The following list contains all the hyperparameters that can be set anywhere in
        Neptune ML, for any model type and task. Because they are not all applicable to
        every model type, it is important that you only set hyperparameters in
        the model-HPO-configuration.json file that appear in the template for
        the model you're using.
- batch-size – The size of the batch of target nodes used in one forward pass. Type: int.
  Setting this to a much larger value can cause memory issues for training on GPU instances.
- concat-node-embed – Indicates whether to get the initial representation of a node by concatenating its processed features with learnable initial node embeddings in order to increase the expressivity of the model. Type: bool.
- dropout – The dropout probability applied to dropout layers. Type: float.
- edge-num-hidden – The hidden layer size or number of units for the edge feature module. Only used when use-edge-features is set to true. Type: int.
- enable-early-stop – Toggles whether or not to use the early stopping feature. Type: bool. Default: true.
  Use this Boolean parameter to turn off the early stop feature.
- fanout – The number of neighbors to sample for a target node during neighbor sampling. Type: int.
  This value is tightly coupled with num-layer and should always be in the same hyperparameter tier. This is because you can specify a fanout for each potential GNN layer.
  Because this hyperparameter can cause model performance to vary widely, it should be fixed or set as a Tier-2 or Tier-3 hyperparameter. Setting it to a large value can cause memory issues for training on GPU instances.
- gamma – The margin value in the score function. Type: float.
  This applies to KGE link-prediction models only.
- l2norm – The weight decay value used in the optimizer, which imposes an L2 normalization penalty on the weights. Type: float.
- layer-norm – Indicates whether to use layer normalization for rgcn models. Type: bool.
- low-mem – Indicates whether to use a low-memory implementation of the relation message passing function at the expense of speed. Type: bool.
- lr – The learning rate. Type: float.
  This should be set as a Tier-1 hyperparameter.
- neg-share – In link prediction, indicates whether positive sampled edges can share negative edge samples. Type: bool.
- num-bases – The number of bases for basis decomposition in an rgcn model. Using a value of num-bases that is less than the number of edge types in the graph acts as a regularizer for the rgcn model. Type: int.
- num-epochs – The number of epochs of training to run. Type: int.
  An epoch is a complete training pass through the graph.
- num-hidden – The hidden layer size or number of units. Type: int.
  This also sets the initial embedding size for featureless nodes. Setting this to a much larger value without reducing batch-size can cause out-of-memory issues for training on GPU instances.
- num-layer – The number of GNN layers in the model. Type: int.
  This value is tightly coupled with the fanout parameter and should come after fanout is set in the same hyperparameter tier.
  Because this can cause model performance to vary widely, it should be fixed or set as a Tier-2 or Tier-3 hyperparameter.
- num-negs – In link prediction, the number of negative samples per positive sample. Type: int.
- per-feat-name-embed – Indicates whether to embed each feature by independently transforming it before combining features. Type: bool.
  When set to true, each feature per node is independently transformed to a fixed dimension size before all the transformed features for the node are concatenated and further transformed to the num-hidden dimension.
  When set to false, the features are concatenated without any feature-specific transformations.
- regularization-coef – In link prediction, the coefficient of regularization loss. Type: float.
- rel-part – Indicates whether to use relation partition for KGE link prediction. Type: bool.
- sparse-lr – The learning rate for learnable node embeddings. Type: float.
  Learnable initial node embeddings are used for nodes without features or when concat-node-embed is set. The parameters of the sparse learnable node embedding layer are trained using a separate optimizer, which can have a separate learning rate.
- use-class-weight – Indicates whether to apply class weights for imbalanced classification tasks. If set to true, the label counts are used to set a weight for each class label. Type: bool.
- use-edge-features – Indicates whether to use edge features during message passing. If set to true, a custom edge feature module is added to the RGCN layer for edge types that have features. Type: bool.
- use-self-loop – Indicates whether to include self loops in training an rgcn model. Type: bool.
- window-for-early-stop – Controls the number of latest validation scores to average to decide on an early stop. Type: int. Default: 3.
  See also Early stopping of the model training process in Neptune ML.
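Several entries in this list carry placement guidance: lr should be Tier-1, and fanout and num-layer should share a tier and be fixed or placed in Tier-2 or Tier-3. A rough sketch of a local check for that guidance follows. The function names are hypothetical, and the check reflects only the recommendations on this page, not rules enforced by Neptune ML.

```python
TIER_KEYS = ("1-tier-param", "2-tier-param", "3-tier-param", "fixed-param")


def tier_of(model_config: dict) -> dict:
    """Map each hyperparameter name to the array it currently sits in."""
    return {entry["param"]: tier
            for tier in TIER_KEYS
            for entry in model_config.get(tier, [])}


def placement_warnings(model_config: dict) -> list:
    """Return warnings where the file departs from the guidance above."""
    where = tier_of(model_config)
    warnings = []
    if "lr" in where and where["lr"] != "1-tier-param":
        warnings.append("lr should be a Tier-1 hyperparameter")
    fanout_tier = where.get("fanout")
    layer_tier = where.get("num-layer")
    if fanout_tier and layer_tier and fanout_tier != layer_tier:
        warnings.append("fanout and num-layer should be in the same tier")
    for name in ("fanout", "num-layer"):
        if where.get(name) == "1-tier-param":
            warnings.append(f"{name} should be fixed or in Tier-2 or Tier-3")
    return warnings
```

Applied to the sample file shown earlier, where lr is Tier-1 and both fanout and num-layer are Tier-3, the function returns an empty list.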
Customizing hyperparameters in Neptune ML
When you are editing the model-HPO-configuration.json file, the
      following are the most common kinds of changes to make:
- Edit the minimum and/or maximum values of range hyperparameters.
- Set a hyperparameter to a fixed value by moving it to the fixed-param section and setting its default value to the fixed value you want it to take.
- Change the priority of a hyperparameter by placing it in a particular tier, editing its range, and making sure that its default value is set appropriately.
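These three kinds of edits can be sketched as small helpers that operate on the model-configuration object after it has been loaded as a Python dict. The helper names are hypothetical. Keeping only param, type, and default in a fixed entry is an assumption based on how fixed-param entries look in the sample file.

```python
TIER_KEYS = ("1-tier-param", "2-tier-param", "3-tier-param", "fixed-param")


def find_param(model_config, name):
    """Return (tier_key, entry) for the named hyperparameter."""
    for tier in TIER_KEYS:
        for entry in model_config.get(tier, []):
            if entry["param"] == name:
                return tier, entry
    raise KeyError(name)


def set_range(model_config, name, low, high):
    """Edit the minimum and maximum of a range hyperparameter."""
    _, entry = find_param(model_config, name)
    if "range" not in entry:
        raise ValueError(f"{name} has no range to edit")
    entry["range"] = [low, high]


def fix_value(model_config, name, value):
    """Move a hyperparameter to fixed-param with the given default."""
    tier, entry = find_param(model_config, name)
    model_config[tier].remove(entry)
    # Assumption: a fixed entry needs only param, type, and default.
    fixed = {"param": entry["param"], "type": entry["type"], "default": value}
    model_config.setdefault("fixed-param", []).append(fixed)


def move_to_tier(model_config, name, new_tier, default=None):
    """Change priority by moving a hyperparameter to another tier.

    A hyperparameter moved out of fixed-param still needs a range or
    options added by hand before it can be tuned.
    """
    if new_tier not in TIER_KEYS[:3]:
        raise ValueError(f"not a tunable tier: {new_tier}")
    tier, entry = find_param(model_config, name)
    model_config[tier].remove(entry)
    if default is not None:
        entry["default"] = default
    model_config.setdefault(new_tier, []).append(entry)
```

For example, fix_value(model_config, "dropout", 0.2) removes dropout from its tier and adds it to fixed-param with a default of 0.2.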