TrainImagesClassifier - Train a classifier from multiple images

Train a classifier from multiple pairs of images and training vector data.

Detailed description

This application trains a classifier from multiple pairs of input images and training vector data. Samples are composed of pixel values in each band, optionally centered and reduced using an XML statistics file produced by the ComputeImagesStatistics application.
The training vector data must contain polygons with a positive integer field representing the class label. The name of this field can be set using the “Class label field” parameter. Training and validation sample lists are built such that each class is equally represented in both lists. One parameter controls the ratio between the number of samples in the training and validation sets. Two parameters control the size of the training and validation sets per class and per image. Several classifier parameters can be set depending on the chosen classifier. In the validation process, the confusion matrix is organized as follows: rows = reference labels, columns = produced labels. In the header of the optional confusion matrix output file, the validation (reference) and predicted (produced) class labels are ordered according to the rows/columns of the confusion matrix. This application is based on LibSVM, OpenCV Machine Learning (2.3.1 and later), and Shark ML. The output of this application is a text model file, whose format corresponds to the ML model type chosen. There is no image or vector data output.
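
As an illustration of the layout described above (rows = reference labels, columns = produced labels), the following minimal Python sketch builds such a matrix from two label lists. The function name and the sample labels are hypothetical and are not part of the application.

```python
def confusion_matrix(reference, produced):
    """Return (labels, matrix) with rows = reference labels, columns = produced labels."""
    labels = sorted(set(reference) | set(produced))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for ref, prod in zip(reference, produced):
        # Row is selected by the reference label, column by the produced label.
        matrix[index[ref]][index[prod]] += 1
    return labels, matrix


# Hypothetical validation samples: reference class vs. class produced by the model.
reference = [1, 1, 1, 2, 2, 3]
produced = [1, 1, 2, 2, 2, 1]
labels, matrix = confusion_matrix(reference, produced)
for label, row in zip(labels, matrix):
    print(label, row)
```

Diagonal cells count correctly classified samples; every other cell in a row shows which class a reference label was confused with.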

Parameters

This section describes in detail the parameters available for this application. Table [1] presents a summary of these parameters and the parameter keys to be used on the command line and in programming languages. The application key is TrainImagesClassifier.

[1]Table: Parameters table for Train a classifier from multiple images.
Parameter Key Parameter Name Parameter Type
io Input and output data Group
io.il Input Image List Input image list
io.vd Input Vector Data List Input vector data list
io.valid Validation Vector Data List Input vector data list
io.imstat Input XML image statistics file Input File name
io.out Output model Output File name
io.confmatout Output confusion matrix or contingency table Output File name
cleanup Temporary files cleaning Boolean
sample Training and validation samples parameters Group
sample.mt Maximum training sample size per class Int
sample.mv Maximum validation sample size per class Int
sample.bm Bound sample number by minimum Int
sample.vtr Training and validation sample ratio Float
sample.vfn Field containing the class integer label for supervision List
ram Available RAM (Mb) Int
elev Elevation management Group
elev.dem DEM directory Directory
elev.geoid Geoid File Input File name
elev.default Default elevation Float
classifier Classifier to use for the training Choices
classifier libsvm LibSVM classifier Choice
classifier boost Boost classifier Choice
classifier dt Decision Tree classifier Choice
classifier gbt Gradient Boosted Tree classifier Choice
classifier ann Artificial Neural Network classifier Choice
classifier bayes Normal Bayes classifier Choice
classifier rf Random forests classifier Choice
classifier knn KNN classifier Choice
classifier sharkrf Shark Random forests classifier Choice
classifier sharkkm Shark kmeans classifier Choice
classifier.libsvm.k SVM Kernel Type Choices
classifier.libsvm.k linear Linear Choice
classifier.libsvm.k rbf Gaussian radial basis function Choice
classifier.libsvm.k poly Polynomial Choice
classifier.libsvm.k sigmoid Sigmoid Choice
classifier.libsvm.m SVM Model Type Choices
classifier.libsvm.m csvc C support vector classification Choice
classifier.libsvm.m nusvc Nu support vector classification Choice
classifier.libsvm.m oneclass Distribution estimation (One Class SVM) Choice
classifier.libsvm.c Cost parameter C Float
classifier.libsvm.nu Cost parameter Nu Float
classifier.libsvm.opt Parameters optimization Boolean
classifier.libsvm.prob Probability estimation Boolean
classifier.boost.t Boost Type Choices
classifier.boost.t discrete Discrete AdaBoost Choice
classifier.boost.t real Real AdaBoost (technique using confidence-rated predictions and working well with categorical data) Choice
classifier.boost.t logit LogitBoost (technique producing good regression fits) Choice
classifier.boost.t gentle Gentle AdaBoost (technique setting less weight on outlier data points and, for that reason, being often good with regression data) Choice
classifier.boost.w Weak count Int
classifier.boost.r Weight Trim Rate Float
classifier.boost.m Maximum depth of the tree Int
classifier.dt.max Maximum depth of the tree Int
classifier.dt.min Minimum number of samples in each node Int
classifier.dt.ra Termination criteria for regression tree Float
classifier.dt.cat Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split Int
classifier.dt.f K-fold cross-validations Int
classifier.dt.r Set Use1seRule flag to false Boolean
classifier.dt.t Set TruncatePrunedTree flag to false Boolean
classifier.gbt.w Number of boosting algorithm iterations Int
classifier.gbt.s Regularization parameter Float
classifier.gbt.p Portion of the whole training set used for each algorithm iteration Float
classifier.gbt.max Maximum depth of the tree Int
classifier.ann.t Train Method Type Choices
classifier.ann.t back Back-propagation algorithm Choice
classifier.ann.t reg Resilient Back-propagation algorithm Choice
classifier.ann.sizes Number of neurons in each intermediate layer String list
classifier.ann.f Neuron activation function type Choices
classifier.ann.f ident Identity function Choice
classifier.ann.f sig Symmetrical Sigmoid function Choice
classifier.ann.f gau Gaussian function (Not completely supported) Choice
classifier.ann.a Alpha parameter of the activation function Float
classifier.ann.b Beta parameter of the activation function Float
classifier.ann.bpdw Strength of the weight gradient term in the BACKPROP method Float
classifier.ann.bpms Strength of the momentum term (the difference between weights on the 2 previous iterations) Float
classifier.ann.rdw Initial value Delta_0 of update-values Delta_{ij} in RPROP method Float
classifier.ann.rdwm Update-values lower limit Delta_{min} in RPROP method Float
classifier.ann.term Termination criteria Choices
classifier.ann.term iter Maximum number of iterations Choice
classifier.ann.term eps Epsilon Choice
classifier.ann.term all Max. iterations + Epsilon Choice
classifier.ann.eps Epsilon value used in the Termination criteria Float
classifier.ann.iter Maximum number of iterations used in the Termination criteria Int
classifier.rf.max Maximum depth of the tree Int
classifier.rf.min Minimum number of samples in each node Int
classifier.rf.ra Termination Criteria for regression tree Float
classifier.rf.cat Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split Int
classifier.rf.var Size of the randomly selected subset of features at each tree node Int
classifier.rf.nbtrees Maximum number of trees in the forest Int
classifier.rf.acc Sufficient accuracy (OOB error) Float
classifier.knn.k Number of Neighbors Int
classifier.sharkrf.nbtrees Maximum number of trees in the forest Int
classifier.sharkrf.nodesize Min size of the node for a split Int
classifier.sharkrf.mtry Number of features tested at each node Int
classifier.sharkrf.oobr Out of bound ratio Float
classifier.sharkkm.maxiter Maximum number of iteration for the kmeans algorithm. Int
classifier.sharkkm.k The number of class used for the kmeans algorithm. Int
rand set user defined seed Int
inxml Load otb application from xml file XML input parameters file
outxml Save otb application to xml file XML output parameters file

[Input and output data]: This group of parameters allows setting input and output data.

  • Input Image List: A list of input images.
  • Input Vector Data List: A list of vector data to select the training samples.
  • Validation Vector Data List: A list of vector data to select the validation samples.
  • Input XML image statistics file: XML file containing mean and variance of each feature.
  • Output model: Output file containing the model estimated (.txt format).
  • Output confusion matrix or contingency table: Output file containing the confusion matrix or contingency table (.csv format). The contingency table is output when an unsupervised algorithm is used; otherwise the confusion matrix is output.
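
Centering and reducing a sample with the per-feature mean and variance mentioned above can be sketched as follows. This only illustrates the arithmetic, assuming the statistics have already been read from the XML file. The function name and the numeric values are hypothetical, and a zero variance is not handled.

```python
import math


def center_and_reduce(sample, means, variances):
    """Subtract the per-band mean and divide by the per-band standard deviation."""
    return [(value - mean) / math.sqrt(variance)
            for value, mean, variance in zip(sample, means, variances)]


# Hypothetical 3-band pixel and per-band statistics.
pixel = [120.0, 80.0, 200.0]
means = [100.0, 90.0, 150.0]
variances = [400.0, 25.0, 2500.0]
print(center_and_reduce(pixel, means, variances))
```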

Temporary files cleaning: If activated, the application will try to clean all temporary files it created.

[Training and validation samples parameters]: This group of parameters allows you to set training and validation sample lists parameters.

  • Maximum training sample size per class: Maximum size per class (in pixels) of the training sample list (default = 1000) (no limit = -1). If equal to -1, then the maximal size of the available training sample list per class will be equal to the surface area of the smallest class multiplied by the training sample ratio.
  • Maximum validation sample size per class: Maximum size per class (in pixels) of the validation sample list (default = 1000) (no limit = -1). If equal to -1, then the maximal size of the available validation sample list per class will be equal to the surface area of the smallest class multiplied by the validation sample ratio.
  • Bound sample number by minimum: Bound the number of samples for each class by the number of samples available in the smallest class. Proportions between training and validation are respected. Default is true (=1).
  • Training and validation sample ratio: Ratio between training and validation samples (0.0 = all training, 1.0 = all validation) (default = 0.5).
  • Field containing the class integer label for supervision: Field containing the class id for supervision. The values in this field shall be cast into integers.
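
How the sample parameters above interact can be sketched with a small Python function. This is a simplified reading of the parameter descriptions, not the application's actual sampling code: the function name, its argument names and the class counts are hypothetical, and the caps are applied independently to each list.

```python
def per_class_sizes(available, vtr=0.5, max_train=1000, max_valid=1000, bound_min=True):
    """Return {label: (n_train, n_valid)} from {label: available sample count}.

    vtr is the share of samples sent to validation (0.0 = all training,
    1.0 = all validation); a maximum of -1 means "no limit".
    """
    smallest = min(available.values())
    sizes = {}
    for label, count in available.items():
        # Optionally bound every class by the size of the smallest class.
        usable = smallest if bound_min else count
        n_valid = int(usable * vtr)
        n_train = usable - n_valid
        if max_train >= 0:
            n_train = min(n_train, max_train)
        if max_valid >= 0:
            n_valid = min(n_valid, max_valid)
        sizes[label] = (n_train, n_valid)
    return sizes


# Hypothetical number of available pixels per class.
available = {1: 5000, 2: 300, 3: 1200}
print(per_class_sizes(available, max_train=100, max_valid=100))
print(per_class_sizes(available, max_train=-1, max_valid=-1))
```

With no limit (-1), each list is sized from the smallest class multiplied by the corresponding share, which matches the behaviour described for the two maximum-size parameters.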

Available RAM (Mb): Available memory for processing (in MB).

[Elevation management]: This group of parameters allows managing elevation values. Supported formats are SRTM, DTED or any GeoTIFF. The DownloadSRTMTiles application can be a useful tool to list/download tiles related to a product.

  • DEM directory: This parameter allows selecting a directory containing Digital Elevation Model files. Note that this directory should contain only DEM files. Unexpected behaviour might occur if other images are found in this directory.
  • Geoid File: Use a geoid grid to get the height above the ellipsoid in case there is no DEM available, no coverage for some points or pixels with no_data in the DEM tiles. A version of the geoid can be found on the OTB website (https://gitlab.orfeo-toolbox.org/orfeotoolbox/otb-data/blob/master/Input/DEM/egm96.grd).
  • Default elevation: This parameter allows setting the default height above ellipsoid when there is no DEM available, no coverage for some points or pixels with no_data in the DEM tiles, and no geoid file has been set. This is also used by some applications as an average elevation value.

Classifier to use for the training: Choice of the classifier to use for the training. Available choices are:

  • LibSVM classifier: This group of parameters allows setting SVM classifier parameters.
  • SVM Kernel Type: SVM Kernel Type. Available choices are:
  • Linear: Linear Kernel, no mapping is done, this is the fastest option.
  • Gaussian radial basis function: This kernel is a good choice in most cases. It is an exponential function of the Euclidean distance between the vectors.
  • Polynomial: Polynomial Kernel, the mapping is a polynomial function.
  • Sigmoid: The kernel is a hyperbolic tangent function of the vectors.
  • SVM Model Type: Type of SVM formulation. Available choices are:
  • C support vector classification: This formulation allows imperfect separation of classes. The penalty is set through the cost parameter C.
  • Nu support vector classification: This formulation allows imperfect separation of classes. The penalty is set through the cost parameter Nu. As compared to C, Nu is harder to optimize, and may not be as fast.
  • Distribution estimation (One Class SVM): All the training data are from the same class; the SVM builds a boundary that separates the class from the rest of the feature space.
  • Cost parameter C: SVM models have a cost parameter C (1 by default) to control the trade-off between training errors and forcing rigid margins.
  • Cost parameter Nu: Cost parameter Nu, in the range 0..1. The larger the value, the smoother the decision.
  • Parameters optimization: SVM parameters optimization flag.
  • Probability estimation: Probability estimation flag.
  • Boost Type: Type of Boosting algorithm. Available choices are:
  • Discrete AdaBoost: This procedure trains the classifiers on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and then the final classifier is defined as a linear combination of the classifiers from each stage.
  • Real AdaBoost (technique using confidence-rated predictions and working well with categorical data): Adaptation of the Discrete Adaboost algorithm with Real value.
  • LogitBoost (technique producing good regression fits): This procedure is an adaptive Newton algorithm for fitting an additive logistic regression model. Beware it can produce numeric instability.
  • Gentle AdaBoost (technique setting less weight on outlier data points and, for that reason, being often good with regression data): A modified version of the Real Adaboost algorithm, using Newton stepping rather than exact optimization at each step.
  • Weak count: The number of weak classifiers.
  • Weight Trim Rate: A threshold between 0 and 1 used to save computational time. Samples with summary weight <= (1 - weight_trim_rate) do not participate in the next iteration of training. Set this parameter to 0 to turn off this functionality.
  • Maximum depth of the tree: Maximum depth of the tree.
  • Maximum depth of the tree: The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.
  • Minimum number of samples in each node: If the number of samples in a node is smaller than this parameter, then this node will not be split.
  • Termination criteria for regression tree: If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split further.
  • Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split: Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.
  • K-fold cross-validations: If cv_folds > 1, then it prunes a tree with K-fold cross-validation where K is equal to cv_folds.
  • Set Use1seRule flag to false: If true, then pruning will be harsher. This will make the tree more compact and more resistant to noise in the training data, but a bit less accurate.
  • Set TruncatePrunedTree flag to false: If true, then pruned branches are physically removed from the tree.
  • Number of boosting algorithm iterations: Number “w” of boosting algorithm iterations, with w*K being the total number of trees in the GBT model, where K is the output number of classes.
  • Regularization parameter: Regularization parameter.
  • Portion of the whole training set used for each algorithm iteration: Portion of the whole training set used for each algorithm iteration. The subset is generated randomly.
  • Maximum depth of the tree: The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.
  • Train Method Type: Type of training method for the multilayer perceptron (MLP) neural network. Available choices are:
  • Back-propagation algorithm: Method to compute the gradient of the loss function and adjust weights in the network to optimize the result.
  • Resilient Back-propagation algorithm: Almost the same as the Back-prop algorithm except that it does not take into account the magnitude of the partial derivative (coordinate of the gradient) but only its sign.
  • Number of neurons in each intermediate layer: The number of neurons in each intermediate layer (excluding input and output layers).
  • Neuron activation function type: This function determines whether the output of the node is positive or not depending on the output of the transfer function. Available choices are:
  • Identity function
  • Symmetrical Sigmoid function
  • Gaussian function (Not completely supported)
  • Alpha parameter of the activation function: Alpha parameter of the activation function (used only with sigmoid and gaussian functions).
  • Beta parameter of the activation function: Beta parameter of the activation function (used only with sigmoid and gaussian functions).
  • Strength of the weight gradient term in the BACKPROP method: Strength of the weight gradient term in the BACKPROP method. The recommended value is about 0.1.
  • Strength of the momentum term (the difference between weights on the 2 previous iterations): Strength of the momentum term (the difference between weights on the 2 previous iterations). This parameter provides some inertia to smooth the random fluctuations of the weights. It can vary from 0 (the feature is disabled) to 1 and beyond. The value 0.1 or so is good enough.
  • Initial value Delta_0 of update-values Delta_{ij} in RPROP method: Initial value Delta_0 of update-values Delta_{ij} in RPROP method (default = 0.1).
  • Update-values lower limit Delta_{min} in RPROP method: Update-values lower limit Delta_{min} in RPROP method. It must be positive (default = 1e-7).
  • Termination criteria: Termination criteria. Available choices are:
  • Maximum number of iterations: Set the number of iterations allowed to the network for its training. Training will stop regardless of the result when this number is reached.
  • Epsilon: Training will focus on the result and will stop once the precision is at most epsilon.
  • Max. iterations + Epsilon: Both termination criteria are used. Training stops at the first one reached.
  • Epsilon value used in the Termination criteria: Epsilon value used in the Termination criteria.
  • Maximum number of iterations used in the Termination criteria: Maximum number of iterations used in the Termination criteria.
  • Maximum depth of the tree: The depth of the tree. A low value will likely underfit and conversely a high value will likely overfit. The optimal value can be obtained using cross validation or other suitable methods.
  • Minimum number of samples in each node: If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.
  • Termination Criteria for regression tree: If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.
  • Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split: Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.
  • Size of the randomly selected subset of features at each tree node: The size of the subset of features, randomly selected at each tree node, that are used to find the best split(s). If you set it to 0, then the size will be set to the square root of the total number of features.
  • Maximum number of trees in the forest: The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Also keep in mind that increasing the number of trees increases the prediction time linearly.
  • Sufficient accuracy (OOB error): Sufficient accuracy (OOB error).
  • Number of Neighbors: The number of neighbors to use.
  • Maximum number of trees in the forest: The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Also keep in mind that increasing the number of trees increases the prediction time linearly.
  • Min size of the node for a split: If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.
  • Number of features tested at each node: The number of features (variables) which will be tested at each node in order to compute the split. If set to zero, the square root of the number of features is used.
  • Out of bound ratio: Set the fraction of the original training dataset to use as the out of bag sample. A good default value is 0.66.
  • Maximum number of iteration for the kmeans algorithm.: The maximum number of iterations for the kmeans algorithm. 0=unlimited.
  • The number of class used for the kmeans algorithm.: The number of classes used for the kmeans algorithm. The default is 2 classes.
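
The four SVM kernel types listed above for the LibSVM classifier can be written out as a minimal Python sketch. The formulas follow the usual LibSVM conventions. The names gamma, coef0 and degree, and their values, are assumptions made for illustration: they do not appear in the parameter table of this application.

```python
import math


def dot(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    # No mapping is applied: the kernel is the plain inner product.
    return dot(u, v)


def rbf_kernel(u, v, gamma=1.0):
    # Exponential of the negative squared Euclidean distance.
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


# Hypothetical 2-band samples.
u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), rbf_kernel(u, v), polynomial_kernel(u, v), sigmoid_kernel(u, v))
```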

set user defined seed: Set a specific seed with an integer value.

Load otb application from xml file: Load otb application from xml file.

Save otb application to xml file: Save otb application to xml file.

Example

To run this example in command-line, use the following:

otbcli_TrainImagesClassifier -io.il QB_1_ortho.tif -io.vd VectorData_QB1.shp -io.imstat EstimateImageStatisticsQB1.xml -sample.mv 100 -sample.mt 100 -sample.vtr 0.5 -sample.vfn Class -classifier libsvm -classifier.libsvm.k linear -classifier.libsvm.c 1 -classifier.libsvm.opt false -io.out svmModelQB1.txt -io.confmatout svmConfusionMatrixQB1.csv

To run this example from Python, use the following code snippet:

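#!/usr/bin/python

# Import the otb applications package
import otbApplication

# The following line creates an instance of the TrainImagesClassifier application
TrainImagesClassifier = otbApplication.Registry.CreateApplication("TrainImagesClassifier")

# The following lines set the application parameters, mirroring the command line above:
TrainImagesClassifier.SetParameterStringList("io.il", ['QB_1_ortho.tif'])

TrainImagesClassifier.SetParameterStringList("io.vd", ['VectorData_QB1.shp'])

TrainImagesClassifier.SetParameterString("io.imstat", "EstimateImageStatisticsQB1.xml")

TrainImagesClassifier.SetParameterInt("sample.mv", 100)

TrainImagesClassifier.SetParameterInt("sample.mt", 100)

TrainImagesClassifier.SetParameterFloat("sample.vtr", 0.5)

# Note: sample.vfn (set to Class on the command line) and the boolean
# classifier.libsvm.opt are not set here; the setter to use for the
# list-typed sample.vfn parameter may depend on the OTB version.

# Classifier choice and LibSVM settings (same values as the command line)
TrainImagesClassifier.SetParameterString("classifier", "libsvm")
TrainImagesClassifier.SetParameterString("classifier.libsvm.k", "linear")
TrainImagesClassifier.SetParameterFloat("classifier.libsvm.c", 1.0)

# Output model and confusion matrix files
TrainImagesClassifier.SetParameterString("io.out", "svmModelQB1.txt")
TrainImagesClassifier.SetParameterString("io.confmatout", "svmConfusionMatrixQB1.csv")

# The following line executes the application
TrainImagesClassifier.ExecuteAndWriteOutput()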
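
When several training runs are scripted, the command line shown earlier can also be assembled from a dictionary of parameter keys. The following sketch is only a convenience and is not part of OTB: the helper name build_cli_args is hypothetical, and it relies only on the "-key value" convention visible in the command-line example.

```python
def build_cli_args(app_name, params):
    """Turn a {parameter key: value} mapping into an otbcli argument list."""
    args = ["otbcli_" + app_name]
    for key, value in params.items():
        args.append("-" + key)
        # List-valued parameters (for example io.il) expand to several arguments.
        values = value if isinstance(value, (list, tuple)) else [value]
        args.extend(str(item) for item in values)
    return args


params = {
    "io.il": ["QB_1_ortho.tif"],
    "io.vd": ["VectorData_QB1.shp"],
    "io.imstat": "EstimateImageStatisticsQB1.xml",
    "sample.mv": 100,
    "sample.mt": 100,
    "sample.vtr": 0.5,
    "sample.vfn": "Class",
    "classifier": "libsvm",
    "classifier.libsvm.k": "linear",
    "classifier.libsvm.c": 1,
    "classifier.libsvm.opt": "false",
    "io.out": "svmModelQB1.txt",
    "io.confmatout": "svmConfusionMatrixQB1.csv",
}
args = build_cli_args("TrainImagesClassifier", params)
print(" ".join(args))
# To launch the command (requires OTB on the PATH):
#   import subprocess; subprocess.run(args, check=True)
```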

Limitations

None

Authors

This application has been written by OTB-Team.

See Also

These additional resources can be useful for further information:
OpenCV documentation for machine learning http://docs.opencv.org/modules/ml/doc/ml.html