TrainRegression

Train a classifier from multiple images to perform regression.

Description

This application trains a classifier from multiple input images or a CSV file in order to perform regression. Predictors are composed of the pixel values in each band, optionally centered and reduced using an XML statistics file produced by the ComputeImagesStatistics application.
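
As an illustration of what "centered and reduced" usually means, the sketch below subtracts a per-band mean and divides by a per-band standard deviation. The function name and the statistics values are hypothetical, and the code does not read the XML statistics file; it only shows the arithmetic under that assumption.

```python
def center_and_reduce(sample, means, stddevs):
    """Return (value - mean) / stddev for each band of one pixel sample."""
    if not (len(sample) == len(means) == len(stddevs)):
        raise ValueError("sample, means and stddevs must have the same length")
    return [(v - m) / s for v, m, s in zip(sample, means, stddevs)]


# Hypothetical per-band statistics for a 3-band predictor.
band_means = [100.0, 50.0, 10.0]
band_stddevs = [20.0, 5.0, 2.0]

print(center_and_reduce([120.0, 45.0, 10.0], band_means, band_stddevs))
# prints [1.0, -1.0, 0.0]
```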

The output value for each predictor is assumed to be the last band (or the last column for CSV files). Training and validation predictor lists are built such that their sizes do not exceed the maximum bounds given by the user, and their relative proportion follows the balance parameter. Several classifier parameters can be set depending on the chosen classifier. In the validation process, the mean square error is computed between the ground truth and the values estimated by the model.

This application is based on LibSVM and on OpenCV Machine Learning classifiers, and is compatible with OpenCV 2.3.1 and later.

Parameters

Input and output data

This group of parameters allows setting input and output data.

Input Image List -io.il image1 image2... Mandatory
A list of input images. The first (n-1) bands should contain the predictors. The last band should contain the output value to predict.

Input CSV file -io.csv filename [dtype]
Input CSV file containing the predictors, with the output values in the last column. Only used when no input image is given.

Input XML image statistics file -io.imstat filename [dtype]
Input XML file containing the mean and the standard deviation of the input images.

Output regression model -io.out filename [dtype] Mandatory
Output file containing the estimated model (.txt format).

Mean Square Error -io.mse float Mandatory
Mean square error computed with the validation predictors.
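
For reference, the mean square error is the average of the squared differences between the ground-truth values and the predicted values. The sketch below uses a hypothetical helper name and made-up numbers; it is not the application's code.

```python
def mean_square_error(ground_truth, predicted):
    """Average of squared differences between two equally sized sequences."""
    if not ground_truth or len(ground_truth) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(ground_truth, predicted)]
    return sum(squared) / len(squared)


# Made-up validation values: only the last prediction is off, by 2.
print(mean_square_error([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]))
# prints 1.0
```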

Training and validation samples parameters

This group of parameters sets the size and balance of the training and validation sample lists.

Maximum training predictors -sample.mt int Default value: 1000
Maximum number of training predictors (default = 1000) (no limit = -1).

Maximum validation predictors -sample.mv int Default value: 1000
Maximum number of validation predictors (default = 1000) (no limit = -1).

Training and validation sample ratio -sample.vtr float Default value: 0.5
Ratio between training and validation samples (0.0 = all training, 1.0 = all validation) (default = 0.5).
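
The sketch below shows one plausible reading of how the three sample parameters could interact. It is an assumption for illustration, not the application's actual selection procedure: samples are shuffled, the ratio decides how many go to validation, and each list is then truncated to its maximum size unless that maximum is -1. Note that truncating one list alone changes the effective proportion.

```python
import random


def split_samples(samples, ratio=0.5, max_train=1000, max_valid=1000, seed=0):
    """Split samples into (training, validation) lists.

    ratio = 0.0 keeps everything for training, 1.0 for validation.
    A maximum of -1 means "no limit".
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_valid = round(len(shuffled) * ratio)
    valid = shuffled[:n_valid]
    train = shuffled[n_valid:]

    # Apply the upper bounds; -1 disables a bound.
    if max_valid >= 0:
        valid = valid[:max_valid]
    if max_train >= 0:
        train = train[:max_train]
    return train, valid


train, valid = split_samples(range(10), ratio=0.5, max_train=3, max_valid=-1)
print(len(train), len(valid))
# prints 3 5
```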


Classifier to use for the training -classifier [libsvm|dt|ann|rf|knn|sharkrf|sharkkm] Default value: libsvm
Choice of the classifier to use for the training.

LibSVM classifier options

SVM Kernel Type -classifier.libsvm.k [linear|rbf|poly|sigmoid] Default value: linear
SVM Kernel Type.

  • Linear
    Linear Kernel, no mapping is done, this is the fastest option.
  • Gaussian radial basis function
    This kernel is a good choice in most cases. It is an exponential function of the squared Euclidean distance between the vectors.
  • Polynomial
    Polynomial Kernel, the mapping is a polynomial function.
  • Sigmoid
    The kernel is a hyperbolic tangent function of the vectors.
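
The four kernels can be written out as small functions. The sketch below follows the usual LibSVM definitions; the parameters gamma, coef0 and degree are LibSVM names that are not exposed in the parameter list above, and the default values used here are arbitrary assumptions. Note that the RBF kernel uses the squared Euclidean distance.

```python
import math


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def rbf_kernel(u, v, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))                 # prints 11.0
print(polynomial_kernel(u, v, degree=2))   # prints 121.0
print(rbf_kernel(u, u))                    # prints 1.0
```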

SVM Model Type -classifier.libsvm.m [epssvr|nusvr] Default value: epssvr
Type of SVM formulation.

  • Epsilon Support Vector Regression
    The distance between feature vectors from the training set and the fitting hyper-plane must be less than Epsilon. For outliers, the penalty multiplier C is used.
  • Nu Support Vector Regression
    Same as the epsilon regression, except that the bounded parameter nu is used instead of epsilon.
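
The role of Epsilon can be illustrated with the epsilon-insensitive loss commonly associated with support vector regression: errors smaller than Epsilon cost nothing, and larger errors are penalized by the amount they exceed it (weighted by C in the full objective). This is a minimal sketch with a hypothetical function name, not LibSVM code.

```python
def epsilon_insensitive_loss(target, prediction, epsilon=0.001):
    """Zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(target - prediction) - epsilon)


print(epsilon_insensitive_loss(1.0, 1.0005))             # inside the tube: prints 0.0
print(epsilon_insensitive_loss(2.0, 1.0, epsilon=0.5))   # outside the tube: prints 0.5
```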

Cost parameter C -classifier.libsvm.c float Default value: 1
SVM models have a cost parameter C (1 by default) to control the trade-off between training errors and forcing rigid margins.

Cost parameter Nu -classifier.libsvm.nu float Default value: 0.5
Cost parameter Nu, in the range 0..1. The larger the value, the smoother the decision.

Parameters optimization -classifier.libsvm.opt bool Default value: false
SVM parameters optimization flag.

Probability estimation -classifier.libsvm.prob bool Default value: false
Probability estimation flag.

Epsilon -classifier.libsvm.eps float Default value: 0.001
The distance between feature vectors from the training set and the fitting hyper-plane must be less than Epsilon. For outliers, the penalty multiplier is set by C.

Decision Tree classifier options

Maximum depth of the tree -classifier.dt.max int Default value: 65535
The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.

Minimum number of samples in each node -classifier.dt.min int Default value: 10
If the number of samples in a node is smaller than this parameter, then this node will not be split.

Termination criteria for regression tree -classifier.dt.ra float Default value: 0.01
If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split further.
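
The three stopping rules described so far (maximum depth, minimum number of samples, regression accuracy) can be combined as in the sketch below. It assumes that the estimated value of a node is the mean of its training values, which is an assumption about the underlying implementation; the function name is hypothetical.

```python
def should_stop_splitting(values, depth, max_depth=65535, min_samples=10,
                          regression_accuracy=0.01):
    """Return True if a node holding `values` should not be split further."""
    if depth >= max_depth:
        return True
    if not values or len(values) < min_samples:
        return True
    estimate = sum(values) / len(values)  # assumed node estimate: the mean
    return all(abs(v - estimate) < regression_accuracy for v in values)


print(should_stop_splitting([1.0] * 20, depth=0))                # prints True
print(should_stop_splitting([0.0] * 10 + [1.0] * 10, depth=0))   # prints False
```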

Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split -classifier.dt.cat int Default value: 10
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.

K-fold cross-validations -classifier.dt.f int Default value: 10
If cv_folds > 1, the tree is pruned using K-fold cross-validation, where K is equal to cv_folds.

Set Use1seRule flag to false -classifier.dt.r bool Default value: false
If true, pruning is harsher. This makes the tree more compact and more resistant to noise in the training data, but slightly less accurate.

Set TruncatePrunedTree flag to false -classifier.dt.t bool Default value: false
If true, then pruned branches are physically removed from the tree.

Artificial Neural Network classifier options

Train Method Type -classifier.ann.t [back|reg] Default value: reg
Type of training method for the multilayer perceptron (MLP) neural network.

  • Back-propagation algorithm
    Method to compute the gradient of the loss function and adjust weights in the network to optimize the result.
  • Resilient Back-propagation algorithm
    Almost the same as the Back-prop algorithm except that it does not take into account the magnitude of the partial derivative (coordinate of the gradient) but only its sign.

Number of neurons in each intermediate layer -classifier.ann.sizes string1 string2... Mandatory
The number of neurons in each intermediate layer (excluding input and output layers).

Neuron activation function type -classifier.ann.f [ident|sig|gau] Default value: sig
This function determines whether the output of the node is positive or not, depending on the output of the transfer function.

  • Identity function
  • Symmetrical Sigmoid function
  • Gaussian function (Not completely supported)

Alpha parameter of the activation function -classifier.ann.a float Default value: 1
Alpha parameter of the activation function (used only with sigmoid and gaussian functions).

Beta parameter of the activation function -classifier.ann.b float Default value: 1
Beta parameter of the activation function (used only with sigmoid and gaussian functions).

Strength of the weight gradient term in the BACKPROP method -classifier.ann.bpdw float Default value: 0.1
Strength of the weight gradient term in the BACKPROP method. The recommended value is about 0.1.

Strength of the momentum term (the difference between weights on the 2 previous iterations) -classifier.ann.bpms float Default value: 0.1
Strength of the momentum term (the difference between weights on the 2 previous iterations). This parameter provides some inertia to smooth the random fluctuations of the weights. It can vary from 0 (the feature is disabled) to 1 and beyond. The value 0.1 or so is good enough.
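
The two BACKPROP parameters can be read as the coefficients of a classic gradient step with momentum. The sketch below shows that textbook update for a single weight; the function and argument names are hypothetical, and the exact formula used by the underlying library may differ in detail.

```python
def backprop_step(weight, gradient, previous_delta,
                  dw_scale=0.1, moment_scale=0.1):
    """One weight update: gradient term plus momentum term.

    dw_scale plays the role of -classifier.ann.bpdw,
    moment_scale the role of -classifier.ann.bpms.
    """
    delta = -dw_scale * gradient + moment_scale * previous_delta
    return weight + delta, delta


weight, delta = 1.0, 0.0
for _ in range(3):
    # Constant gradient of 2.0, chosen only to make the momentum visible.
    weight, delta = backprop_step(weight, 2.0, delta)
    print(round(weight, 4), round(delta, 4))
```

With a constant gradient, each step is slightly larger than the previous one because the momentum term adds a fraction of the last update.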

Initial value Delta_0 of update-values Delta_{ij} in RPROP method -classifier.ann.rdw float Default value: 0.1
Initial value Delta_0 of update-values Delta_{ij} in RPROP method (default = 0.1).

Update-values lower limit Delta_{min} in RPROP method -classifier.ann.rdwm float Default value: 1e-07
Update-values lower limit Delta_{min} in RPROP method. It must be positive (default = 1e-7).
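
To show how Delta_0 and Delta_{min} enter the RPROP method, here is a simplified sketch of a resilient back-propagation step for a single weight (a variant without weight backtracking). Only the sign of the gradient is used. The increase factor, decrease factor and upper limit are assumptions chosen for illustration and are not parameters of this application.

```python
def rprop_step(weight, gradient, previous_gradient, step,
               step_min=1e-7, step_max=50.0, increase=1.2, decrease=0.5):
    """One simplified Rprop update; `step` starts at Delta_0."""
    agreement = gradient * previous_gradient
    if agreement > 0:        # same sign as before: speed up
        step = min(step * increase, step_max)
    elif agreement < 0:      # sign flipped: slow down, never below step_min
        step = max(step * decrease, step_min)

    # Move against the gradient by `step`, ignoring the gradient magnitude.
    if gradient > 0:
        weight -= step
    elif gradient < 0:
        weight += step
    return weight, step


# A large and a tiny gradient with the same sign give the same update.
print(rprop_step(0.0, 1000.0, 1000.0, step=0.1) == rprop_step(0.0, 0.001, 0.001, step=0.1))
# prints True
```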

Termination criteria -classifier.ann.term [iter|eps|all] Default value: all
Termination criteria.

  • Maximum number of iterations
    Set the number of iterations allowed to the network for its training. Training will stop regardless of the result when this number is reached.
  • Epsilon
    Training will focus on the result and will stop once the precision is at most epsilon.
  • Max. iterations + Epsilon
    Both termination criteria are used. Training stops as soon as either one is reached.

Epsilon value used in the Termination criteria -classifier.ann.eps float Default value: 0.01
Epsilon value used in the Termination criteria.

Maximum number of iterations used in the Termination criteria -classifier.ann.iter int Default value: 1000
Maximum number of iterations used in the Termination criteria.
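
The three termination modes can be pictured as a training loop with two possible exit conditions. This is a minimal sketch: the callable `step` is a hypothetical stand-in for one training iteration that returns the quantity compared against epsilon, and what exactly the underlying library monitors is an assumption here.

```python
def train_until(step, term="all", max_iter=1000, epsilon=0.01):
    """Run step() until the chosen criterion is met.

    Returns (iterations, last_error). With term="eps" the loop only
    ends if the error actually falls to epsilon or below.
    """
    if term not in ("iter", "eps", "all"):
        raise ValueError("term must be 'iter', 'eps' or 'all'")
    iterations = 0
    while True:
        error = step()
        iterations += 1
        if term in ("iter", "all") and iterations >= max_iter:
            break
        if term in ("eps", "all") and error <= epsilon:
            break
    return iterations, error


def make_halving_step(start=1.0):
    """Toy iteration whose error halves on every call."""
    state = {"error": start}

    def step():
        state["error"] /= 2.0
        return state["error"]

    return step


print(train_until(make_halving_step(), term="all"))
# prints (7, 0.0078125)
```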

Random forests classifier options

Maximum depth of the tree -classifier.rf.max int Default value: 5
The depth of the tree. A low value will likely underfit and conversely a high value will likely overfit. The optimal value can be obtained using cross validation or other suitable methods.

Minimum number of samples in each node -classifier.rf.min int Default value: 10
If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.

Termination Criteria for regression tree -classifier.rf.ra float Default value: 0
If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.

Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split -classifier.rf.cat int Default value: 10
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.

Size of the randomly selected subset of features at each tree node -classifier.rf.var int Default value: 0
The size of the subset of features, randomly selected at each tree node, that are used to find the best split(s). If you set it to 0, then the size will be set to the square root of the total number of features.

Maximum number of trees in the forest -classifier.rf.nbtrees int Default value: 100
The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Keep in mind that increasing the number of trees also increases the prediction time linearly.

Sufficient accuracy (OOB error) -classifier.rf.acc float Default value: 0.01
Sufficient accuracy (OOB error).

KNN classifier options

Number of Neighbors -classifier.knn.k int Default value: 32
The number of neighbors to use.

Decision rule -classifier.knn.rule [mean|median] Default value: mean
Decision rule for regression output.

  • Mean of neighbors values
    Returns the mean of the neighbors' values.
  • Median of neighbors values
    Returns the median of the neighbors' values.
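
The difference between the two decision rules shows up when one neighbor carries an outlying value. The sketch below is a plain-Python illustration with hypothetical names and made-up data; it assumes Euclidean distance for the neighbor search, which is not stated above.

```python
import math
import statistics


def knn_predict(train_x, train_y, query, k=32, rule="mean"):
    """Predict a value from the k nearest training samples."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    values = [train_y[i] for i in order[:k]]
    if rule == "mean":
        return statistics.mean(values)
    if rule == "median":
        return statistics.median(values)
    raise ValueError("rule must be 'mean' or 'median'")


train_x = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 1.0, 2.0, 100.0]   # the last value is an outlier

print(knn_predict(train_x, train_y, [1.0], k=4, rule="mean"))    # prints 25.75
print(knn_predict(train_x, train_y, [1.0], k=4, rule="median"))  # prints 1.5
```

The median is far less affected by the outlying neighbor than the mean.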

Shark Random forests classifier options

Maximum number of trees in the forest -classifier.sharkrf.nbtrees int Default value: 100
The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Keep in mind that increasing the number of trees also increases the prediction time linearly.

Min size of the node for a split -classifier.sharkrf.nodesize int Default value: 25
If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.

Number of features tested at each node -classifier.sharkrf.mtry int Default value: 0
The number of features (variables) which will be tested at each node in order to compute the split. If set to zero, the square root of the number of features is used.

Out of bound ratio -classifier.sharkrf.oobr float Default value: 0.66
Set the fraction of the original training dataset to use as the out-of-bag sample. A good default value is 0.66.

Shark kmeans classifier options

Maximum number of iterations for the kmeans algorithm -classifier.sharkkm.maxiter int Default value: 10
The maximum number of iterations for the kmeans algorithm (0 = unlimited).

Number of classes for the kmeans algorithm -classifier.sharkkm.k int Default value: 2
The number of classes used for the kmeans algorithm. The default is 2 classes.


Random seed -rand int
Set a specific random seed with integer value.

Load parameters from XML -inxml filename.xml
Load application parameters from an XML file.

Save parameters to XML -outxml filename.xml
Save application parameters to an XML file.

Examples

From the command-line:

otbcli_TrainRegression -io.il training_dataset.tif -io.out regression_model.txt -io.imstat training_statistics.xml -classifier libsvm

From Python:

import otbApplication

app = otbApplication.Registry.CreateApplication("TrainRegression")

app.SetParameterStringList("io.il", ['training_dataset.tif'])
app.SetParameterString("io.out", "regression_model.txt")
app.SetParameterString("io.imstat", "training_statistics.xml")
app.SetParameterString("classifier","libsvm")

app.ExecuteAndWriteOutput()

See also

OpenCV documentation for machine learning http://docs.opencv.org/modules/ml/doc/ml.html