TrainRegression - Train a regression model

Train a classifier from multiple images to perform regression.

Detailed description

This application trains a classifier from multiple input images or a CSV file, in order to perform regression. Predictors are composed of the pixel values in each band, optionally centered and reduced using an XML statistics file produced by the ComputeImagesStatistics application.
The output value for each predictor is assumed to be the last band (or the last column for CSV files). Training and validation predictor lists are built so that their sizes stay below the maximum bounds given by the user, and so that their proportion matches the training and validation sample ratio. Several classifier parameters can be set depending on the chosen classifier. In the validation process, the mean square error is computed. This application is based on LibSVM and on OpenCV Machine Learning classifiers, and is compatible with OpenCV 2.3.1 and later.
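
As a minimal sketch of the data layout described above, the following snippet parses CSV text whose last column is the output value, then centers and reduces each predictor column. The helper names and the sample values are hypothetical, and the statistics are computed inline here instead of being read from the XML file produced by ComputeImagesStatistics.

```python
import csv
import io
import statistics


def split_rows(csv_text):
    """Return (predictors, outputs); the last column holds the output value."""
    predictors, outputs = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip empty lines
        values = [float(v) for v in row]
        predictors.append(values[:-1])
        outputs.append(values[-1])
    return predictors, outputs


def center_and_reduce(predictors, means, stddevs):
    """Subtract the per-column mean and divide by the per-column standard deviation."""
    return [
        [(v - m) / s for v, m, s in zip(row, means, stddevs)]
        for row in predictors
    ]


# Hypothetical sample: two predictor columns followed by the output value.
sample_csv = "1.0,10.0,0.5\n2.0,20.0,1.0\n3.0,30.0,1.5\n"
predictors, outputs = split_rows(sample_csv)

columns = list(zip(*predictors))
means = [statistics.mean(c) for c in columns]
stddevs = [statistics.pstdev(c) for c in columns]
normalized = center_and_reduce(predictors, means, stddevs)
```

After this step each predictor column has zero mean and unit standard deviation, while the output values are left untouched.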

Parameters

This section describes in detail the parameters available for this application. Table [1] presents a summary of these parameters and the parameter keys to be used in command-line and programming languages. The application key is TrainRegression.

[1] Table: Parameters table for Train a regression model.
Parameter Key Parameter Name Parameter Type
io Input and output data Group
io.il Input Image List Input image list
io.csv Input CSV file Input File name
io.imstat Input XML image statistics file Input File name
io.out Output regression model Output File name
io.mse Mean Square Error Float
sample Training and validation samples parameters Group
sample.mt Maximum training predictors Int
sample.mv Maximum validation predictors Int
sample.vtr Training and validation sample ratio Float
classifier Classifier to use for the training Choices
classifier libsvm Choice LibSVM classifier
classifier dt Choice Decision Tree classifier
classifier gbt Choice Gradient Boosted Tree classifier
classifier ann Choice Artificial Neural Network classifier
classifier rf Choice Random forests classifier
classifier knn Choice KNN classifier
classifier.libsvm.k SVM Kernel Type Choices
classifier.libsvm.k linear Choice Linear
classifier.libsvm.k rbf Choice Gaussian radial basis function
classifier.libsvm.k poly Choice Polynomial
classifier.libsvm.k sigmoid Choice Sigmoid
classifier.libsvm.m SVM Model Type Choices
classifier.libsvm.m epssvr Choice Epsilon Support Vector Regression
classifier.libsvm.m nusvr Choice Nu Support Vector Regression
classifier.libsvm.c Cost parameter C Float
classifier.libsvm.opt Parameters optimization Boolean
classifier.libsvm.prob Probability estimation Boolean
classifier.libsvm.eps Epsilon Float
classifier.libsvm.nu Nu Float
classifier.dt.max Maximum depth of the tree Int
classifier.dt.min Minimum number of samples in each node Int
classifier.dt.ra Termination criteria for regression tree Float
classifier.dt.cat Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split Int
classifier.dt.f K-fold cross-validations Int
classifier.dt.r Set Use1seRule flag to false Boolean
classifier.dt.t Set TruncatePrunedTree flag to false Boolean
classifier.gbt.t Loss Function Type Choices
classifier.gbt.t sqr Choice Squared Loss
classifier.gbt.t abs Choice Absolute Loss
classifier.gbt.t hub Choice Huber Loss
classifier.gbt.w Number of boosting algorithm iterations Int
classifier.gbt.s Regularization parameter Float
classifier.gbt.p Portion of the whole training set used for each algorithm iteration Float
classifier.gbt.max Maximum depth of the tree Int
classifier.ann.t Train Method Type Choices
classifier.ann.t reg Choice RPROP algorithm
classifier.ann.t back Choice Back-propagation algorithm
classifier.ann.sizes Number of neurons in each intermediate layer String list
classifier.ann.f Neuron activation function type Choices
classifier.ann.f ident Choice Identity function
classifier.ann.f sig Choice Symmetrical Sigmoid function
classifier.ann.f gau Choice Gaussian function (Not completely supported)
classifier.ann.a Alpha parameter of the activation function Float
classifier.ann.b Beta parameter of the activation function Float
classifier.ann.bpdw Strength of the weight gradient term in the BACKPROP method Float
classifier.ann.bpms Strength of the momentum term (the difference between weights on the 2 previous iterations) Float
classifier.ann.rdw Initial value Delta_0 of update-values Delta_{ij} in RPROP method Float
classifier.ann.rdwm Update-values lower limit Delta_{min} in RPROP method Float
classifier.ann.term Termination criteria Choices
classifier.ann.term iter Choice Maximum number of iterations
classifier.ann.term eps Choice Epsilon
classifier.ann.term all Choice Max. iterations + Epsilon
classifier.ann.eps Epsilon value used in the Termination criteria Float
classifier.ann.iter Maximum number of iterations used in the Termination criteria Int
classifier.rf.max Maximum depth of the tree Int
classifier.rf.min Minimum number of samples in each node Int
classifier.rf.ra Termination Criteria for regression tree Float
classifier.rf.cat Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split Int
classifier.rf.var Size of the randomly selected subset of features at each tree node Int
classifier.rf.nbtrees Maximum number of trees in the forest Int
classifier.rf.acc Sufficient accuracy (OOB error) Float
classifier.knn.k Number of Neighbors Int
classifier.knn.rule Decision rule Choices
classifier.knn.rule mean Choice Mean of neighbors values
classifier.knn.rule median Choice Median of neighbors values
rand set user defined seed Int
inxml Load otb application from xml file XML input parameters file
outxml Save otb application to xml file XML output parameters file

[Input and output data]: This group of parameters allows setting input and output data.

  • Input Image List: A list of input images. The first (n-1) bands should contain the predictors. The last band should contain the output value to predict.
  • Input CSV file: Input CSV file containing the predictors, and the output values in last column. Only used when no input image is given.
  • Input XML image statistics file: Input XML file containing the mean and the standard deviation of the input images.
  • Output regression model: Output file containing the estimated model (.txt format).
  • Mean Square Error: Mean square error computed with the validation predictors.
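
For reference, the mean square error is the average of the squared differences between predicted and expected output values. The sketch below shows that computation on hypothetical values; the function name is an assumption and this is not the application's own code.

```python
def mean_square_error(predicted, expected):
    """Average of squared differences between predictions and reference values."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)


# Hypothetical validation predictions and reference output values.
mse = mean_square_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
```

A lower value indicates that the model output is closer to the validation values; the result is expressed in squared units of the output band.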

[Training and validation samples parameters]: This group of parameters allows you to set training and validation sample lists parameters.

  • Maximum training predictors: Maximum number of training predictors (default = 1000) (no limit = -1).
  • Maximum validation predictors: Maximum number of validation predictors (default = 1000) (no limit = -1).
  • Training and validation sample ratio: Ratio between training and validation samples (0.0 = all training, 1.0 = all validation) (default = 0.5).
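
One way to reason about how these three parameters interact is sketched below. It is an illustration under stated assumptions, not the application's actual sampling code: the function name is hypothetical, and the rule used (shrink the total until both bounds hold while keeping the requested ratio) is only one plausible reading of the description.

```python
def split_sizes(n_samples, vtr=0.5, max_training=1000, max_validation=1000):
    """Return (n_training, n_validation) honouring both bounds and the ratio.

    vtr is the validation share (0.0 = all training, 1.0 = all validation).
    A bound of -1 means no limit.
    """
    if not 0.0 <= vtr <= 1.0:
        raise ValueError("vtr must lie in [0, 1]")
    total = float(n_samples)
    # Largest total for which the training share stays within its bound.
    if max_training >= 0 and vtr < 1.0:
        total = min(total, max_training / (1.0 - vtr))
    # Largest total for which the validation share stays within its bound.
    if max_validation >= 0 and vtr > 0.0:
        total = min(total, max_validation / vtr)
    # The small epsilon guards against floating-point truncation.
    n_validation = int(total * vtr + 1e-9)
    n_training = int(total * (1.0 - vtr) + 1e-9)
    return n_training, n_validation


default_split = split_sizes(10000)
bounded_split = split_sizes(10000, vtr=0.25, max_training=300, max_validation=1000)
```

With the default values, a large sample set is reduced to at most 1000 training and 1000 validation predictors. In the second call the training bound is the limiting one, so the validation list shrinks as well to keep the 0.25 ratio.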

Classifier to use for the training: Choice of the classifier to use for the training. Available choices are:

  • LibSVM classifier: This group of parameters allows setting SVM classifier parameters.
      • SVM Kernel Type: SVM Kernel Type. Available choices are:
          • Linear
          • Gaussian radial basis function
          • Polynomial
          • Sigmoid
      • SVM Model Type: Type of SVM formulation. Available choices are:
          • Epsilon Support Vector Regression
          • Nu Support Vector Regression
      • Cost parameter C: SVM models have a cost parameter C (1 by default) to control the trade-off between training errors and forcing rigid margins.
      • Parameters optimization: SVM parameters optimization flag.
      • Probability estimation: Probability estimation flag.
      • Epsilon
      • Nu
  • Decision Tree classifier: This group of parameters allows setting Decision Tree classifier parameters.
      • Maximum depth of the tree: The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.
      • Minimum number of samples in each node: If the number of samples in a node is smaller than this parameter, then the node will not be split.
      • Termination criteria for regression tree: If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.
      • Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split: Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.
      • K-fold cross-validations: If cv_folds > 1, then the tree is pruned with K-fold cross-validation, where K is equal to cv_folds.
      • Set Use1seRule flag to false: If true, then pruning will be harsher. This will make the tree more compact and more resistant to training data noise, but a bit less accurate.
      • Set TruncatePrunedTree flag to false: If true, then pruned branches are physically removed from the tree.
  • Gradient Boosted Tree classifier: This group of parameters allows setting Gradient Boosted Tree classifier parameters.
      • Loss Function Type: Type of loss function used for training. Available choices are:
          • Squared Loss
          • Absolute Loss
          • Huber Loss
      • Number of boosting algorithm iterations: Number "w" of boosting algorithm iterations, with w*K being the total number of trees in the GBT model, where K is the output number of classes.
      • Regularization parameter: Regularization parameter.
      • Portion of the whole training set used for each algorithm iteration: Portion of the whole training set used for each algorithm iteration. The subset is generated randomly.
      • Maximum depth of the tree: The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.
  • Artificial Neural Network classifier: This group of parameters allows setting Artificial Neural Network classifier parameters.
      • Train Method Type: Type of training method for the multilayer perceptron (MLP) neural network. Available choices are:
          • RPROP algorithm
          • Back-propagation algorithm
      • Number of neurons in each intermediate layer: The number of neurons in each intermediate layer (excluding input and output layers).
      • Neuron activation function type: Neuron activation function. Available choices are:
          • Identity function
          • Symmetrical Sigmoid function
          • Gaussian function (Not completely supported)
      • Alpha parameter of the activation function: Alpha parameter of the activation function (used only with sigmoid and gaussian functions).
      • Beta parameter of the activation function: Beta parameter of the activation function (used only with sigmoid and gaussian functions).
      • Strength of the weight gradient term in the BACKPROP method: Strength of the weight gradient term in the BACKPROP method. The recommended value is about 0.1.
      • Strength of the momentum term (the difference between weights on the 2 previous iterations): Strength of the momentum term (the difference between weights on the 2 previous iterations). This parameter provides some inertia to smooth the random fluctuations of the weights. It can vary from 0 (the feature is disabled) to 1 and beyond. A value of about 0.1 is good enough.
      • Initial value Delta_0 of update-values Delta_{ij} in RPROP method: Initial value Delta_0 of update-values Delta_{ij} in RPROP method (default = 0.1).
      • Update-values lower limit Delta_{min} in RPROP method: Update-values lower limit Delta_{min} in RPROP method. It must be positive (default = 1e-7).
      • Termination criteria: Termination criteria. Available choices are:
          • Maximum number of iterations
          • Epsilon
          • Max. iterations + Epsilon
      • Epsilon value used in the Termination criteria: Epsilon value used in the Termination criteria.
      • Maximum number of iterations used in the Termination criteria: Maximum number of iterations used in the Termination criteria.
  • Random forests classifier: This group of parameters allows setting Random Forests classifier parameters.
      • Maximum depth of the tree: The depth of the tree. A low value will likely underfit and, conversely, a high value will likely overfit. The optimal value can be obtained using cross-validation or other suitable methods.
      • Minimum number of samples in each node: If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data, e.g. 1 percent.
      • Termination Criteria for regression tree: If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.
      • Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split: Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.
      • Size of the randomly selected subset of features at each tree node: The size of the subset of features, randomly selected at each tree node, that are used to find the best split(s). If you set it to 0, then the size will be set to the square root of the total number of features.
      • Maximum number of trees in the forest: The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Also keep in mind that increasing the number of trees increases the prediction time linearly.
      • Sufficient accuracy (OOB error): Sufficient accuracy (OOB error).
  • KNN classifier: This group of parameters allows setting KNN classifier parameters.
      • Number of Neighbors: The number of neighbors to use.
      • Decision rule: Decision rule for regression output. Available choices are:
          • Mean of neighbors values: Returns the mean of neighbors values.
          • Median of neighbors values: Returns the median of neighbors values.
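
To illustrate the two KNN decision rules, the sketch below predicts an output value from the nearest training predictors using either the mean or the median of their values. It is a plain-Python illustration with hypothetical data and a hypothetical function name, not the OpenCV implementation used by the application.

```python
import statistics


def knn_regress(train_x, train_y, query, k=3, rule="mean"):
    """Predict a value from the k nearest training predictors."""
    # Squared Euclidean distance is enough for ranking the neighbors.
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    neighbor_values = [y for _, y in ranked[:k]]
    if rule == "mean":
        return statistics.mean(neighbor_values)
    if rule == "median":
        return statistics.median(neighbor_values)
    raise ValueError("rule must be 'mean' or 'median'")


# Hypothetical one-band predictors and their output values.
train_x = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 1.0, 5.0, 100.0]

mean_prediction = knn_regress(train_x, train_y, [1.0], k=3, rule="mean")
median_prediction = knn_regress(train_x, train_y, [1.0], k=3, rule="median")
```

The median rule is less sensitive to a single neighbor with an unusual output value, which is why the two predictions differ on this sample.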

set user defined seed: Set a specific seed with an integer value.

Load otb application from xml file: Load otb application from xml file.

Save otb application to xml file: Save otb application to xml file.

Example

To run this example in command-line, use the following:

otbcli_TrainRegression -io.il training_dataset.tif -io.out regression_model.txt -io.imstat training_statistics.xml -classifier libsvm
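
Other parameter keys from Table [1] follow the same "-key value" pattern. As a hedged sketch, the helper below assembles such a command line from a dictionary; the helper name, the file names and the chosen values are hypothetical, and the command is only built here, not executed.

```python
import shlex


def build_command(params, application="TrainRegression"):
    """Assemble an otbcli argument list from {parameter key: value}."""
    argv = ["otbcli_" + application]
    for key, value in params.items():
        argv.append("-" + key)
        if isinstance(value, (list, tuple)):
            # List parameters (e.g. io.il) take several values after one key.
            argv.extend(str(v) for v in value)
        else:
            argv.append(str(value))
    return argv


# Hypothetical run: CSV input, random forests classifier with 50 trees.
command = build_command({
    "io.csv": "training_samples.csv",
    "io.out": "regression_model.txt",
    "classifier": "rf",
    "classifier.rf.nbtrees": 50,
})
command_line = " ".join(shlex.quote(arg) for arg in command)
```

The resulting list can be passed to subprocess.run on a machine where the OTB command-line launchers are installed.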

To run this example from Python, use the following code snippet:

#!/usr/bin/python

# Import the otb applications package
import otbApplication

# The following line creates an instance of the TrainRegression application
TrainRegression = otbApplication.Registry.CreateApplication("TrainRegression")

# The following lines set all the application parameters:
TrainRegression.SetParameterStringList("io.il", ['training_dataset.tif'])

TrainRegression.SetParameterString("io.out", "regression_model.txt")

TrainRegression.SetParameterString("io.imstat", "training_statistics.xml")

TrainRegression.SetParameterString("classifier","libsvm")

# The following line executes the application
TrainRegression.ExecuteAndWriteOutput()
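
When many parameters are set, the calls above can be driven from a dictionary. The sketch below is one possible way to do that: it picks the setter from the Python type of each value. The helper and the RecordingApplication stand-in are hypothetical; the stand-in only records calls so that the snippet runs without OTB installed.

```python
def apply_parameters(app, params):
    """Call the setter matching each value's Python type."""
    for key, value in params.items():
        if isinstance(value, bool):
            # Booleans are rejected here to avoid treating them as integers.
            raise TypeError("boolean values are not handled in this sketch")
        if isinstance(value, (list, tuple)):
            app.SetParameterStringList(key, [str(v) for v in value])
        elif isinstance(value, int):
            app.SetParameterInt(key, value)
        elif isinstance(value, float):
            app.SetParameterFloat(key, value)
        else:
            app.SetParameterString(key, str(value))


class RecordingApplication:
    """Hypothetical stand-in for an OTB application object; records setter calls."""

    def __init__(self):
        self.calls = []

    def SetParameterString(self, key, value):
        self.calls.append(("String", key, value))

    def SetParameterStringList(self, key, values):
        self.calls.append(("StringList", key, values))

    def SetParameterInt(self, key, value):
        self.calls.append(("Int", key, value))

    def SetParameterFloat(self, key, value):
        self.calls.append(("Float", key, value))


app = RecordingApplication()
apply_parameters(app, {
    "io.il": ["training_dataset.tif"],
    "io.out": "regression_model.txt",
    "classifier": "knn",
    "classifier.knn.k": 5,
    "sample.vtr": 0.5,
})
```

With OTB installed, the object returned by otbApplication.Registry.CreateApplication("TrainRegression") would be passed instead of the stand-in, followed by ExecuteAndWriteOutput(). The io.mse value should then be readable with GetParameterFloat("io.mse"), assuming the installed version exposes that accessor for output parameters.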

Limitations

None

Authors

This application has been written by OTB-Team.

See Also

These additional resources can be useful for further information:

OpenCV documentation for machine learning http://docs.opencv.org/modules/ml/doc/ml.html