Train a classifier from multiple images

Train a classifier from multiple pairs of images and training vector data.

Detailed description

This application trains a classifier from multiple pairs of input images and training vector data. Samples are composed of the pixel values in each band, optionally centered and reduced using an XML statistics file produced by the ComputeImagesStatistics application.

The training vector data must contain polygons with a positive integer field representing the class label. The name of this field can be set using the “Name of the discrimination field” parameter (sample.vfn). Training and validation sample lists are built such that each class is equally represented in both lists. One parameter controls the ratio between the number of samples in the training and validation sets. Two parameters control the size of the training and validation sets per class and per image. Several classifier parameters can be set depending on the chosen classifier.

In the validation process, the confusion matrix is organized as follows: rows = reference labels, columns = produced labels. In the header of the optional confusion matrix output file, the validation (reference) and predicted (produced) class labels are ordered according to the rows/columns of the confusion matrix.

This application is based on LibSVM and on OpenCV Machine Learning classifiers, and is compatible with OpenCV 2.3.1 and later.
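To illustrate the confusion matrix convention described above (rows = reference labels, columns = produced labels), here is a minimal sketch in plain Python. The function name and the sample labels are invented for the example; it does not reproduce the application's internal code.

```python
def confusion_matrix(reference, produced):
    """Return (labels, matrix) with rows = reference labels, columns = produced labels."""
    labels = sorted(set(reference) | set(produced))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for ref, prod in zip(reference, produced):
        # Row is chosen by the reference label, column by the produced label.
        matrix[index[ref]][index[prod]] += 1
    return labels, matrix


# Hypothetical validation samples: reference labels and the labels a model produced.
reference = [1, 1, 1, 2, 2, 3]
produced = [1, 1, 2, 2, 2, 1]
labels, matrix = confusion_matrix(reference, produced)
for label, row in zip(labels, matrix):
    print(label, row)
```

Reading along a row shows how the samples of one reference class were distributed over the produced classes.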

Parameters

This section describes in detail the parameters available for this application. Table [1] presents a summary of these parameters and the parameter keys to be used in command-line and programming languages. The application key is TrainImagesClassifier.

[1]	Table: Parameters table for Train a classifier from multiple images.

Parameter Key	Parameter Type	Parameter Description
io	Group	Input and output data
io.il	Input image list	Input Image List
io.vd	Input vector data list	Input Vector Data List
io.imstat	Input File name	Input XML image statistics file
io.confmatout	Output File name	Output confusion matrix
io.out	Output File name	Output model
elev	Group	Elevation management
elev.dem	Directory	DEM directory
elev.geoid	Input File name	Geoid File
elev.default	Float	Default elevation
sample	Group	Training and validation samples parameters
sample.mt	Int	Maximum training sample size per class
sample.mv	Int	Maximum validation sample size per class
sample.bm	Int	Bound sample number by minimum
sample.edg	Boolean	On edge pixel inclusion
sample.vtr	Float	Training and validation sample ratio
sample.vfn	String	Name of the discrimination field
classifier	Choices	Classifier to use for the training
classifier libsvm	Choice	LibSVM classifier
classifier boost	Choice	Boost classifier
classifier dt	Choice	Decision Tree classifier
classifier gbt	Choice	Gradient Boosted Tree classifier
classifier ann	Choice	Artificial Neural Network classifier
classifier bayes	Choice	Normal Bayes classifier
classifier rf	Choice	Random forests classifier
classifier knn	Choice	KNN classifier
classifier.libsvm.k	Choices	SVM Kernel Type
classifier.libsvm.k linear	Choice	Linear
classifier.libsvm.k rbf	Choice	Gaussian radial basis function
classifier.libsvm.k poly	Choice	Polynomial
classifier.libsvm.k sigmoid	Choice	Sigmoid
classifier.libsvm.m	Choices	SVM Model Type
classifier.libsvm.m csvc	Choice	C support vector classification
classifier.libsvm.m nusvc	Choice	Nu support vector classification
classifier.libsvm.m oneclass	Choice	Distribution estimation (One Class SVM)
classifier.libsvm.c	Float	Cost parameter C
classifier.libsvm.opt	Boolean	Parameters optimization
classifier.libsvm.prob	Boolean	Probability estimation
classifier.boost.t	Choices	Boost Type
classifier.boost.t discrete	Choice	Discrete AdaBoost
classifier.boost.t real	Choice	Real AdaBoost (technique using confidence-rated predictions and working well with categorical data)
classifier.boost.t logit	Choice	LogitBoost (technique producing good regression fits)
classifier.boost.t gentle	Choice	Gentle AdaBoost (technique setting less weight on outlier data points and, for that reason, being often good with regression data)
classifier.boost.w	Int	Weak count
classifier.boost.r	Float	Weight Trim Rate
classifier.boost.m	Int	Maximum depth of the tree
classifier.dt.max	Int	Maximum depth of the tree
classifier.dt.min	Int	Minimum number of samples in each node
classifier.dt.ra	Float	Termination criteria for regression tree
classifier.dt.cat	Int	Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split
classifier.dt.f	Int	K-fold cross-validations
classifier.dt.r	Boolean	Set Use1seRule flag to false
classifier.dt.t	Boolean	Set TruncatePrunedTree flag to false
classifier.gbt.w	Int	Number of boosting algorithm iterations
classifier.gbt.s	Float	Regularization parameter
classifier.gbt.p	Float	Portion of the whole training set used for each algorithm iteration
classifier.gbt.max	Int	Maximum depth of the tree
classifier.ann.t	Choices	Train Method Type
classifier.ann.t reg	Choice	RPROP algorithm
classifier.ann.t back	Choice	Back-propagation algorithm
classifier.ann.sizes	String list	Number of neurons in each intermediate layer
classifier.ann.f	Choices	Neuron activation function type
classifier.ann.f ident	Choice	Identity function
classifier.ann.f sig	Choice	Symmetrical Sigmoid function
classifier.ann.f gau	Choice	Gaussian function (Not completely supported)
classifier.ann.a	Float	Alpha parameter of the activation function
classifier.ann.b	Float	Beta parameter of the activation function
classifier.ann.bpdw	Float	Strength of the weight gradient term in the BACKPROP method
classifier.ann.bpms	Float	Strength of the momentum term (the difference between weights on the 2 previous iterations)
classifier.ann.rdw	Float	Initial value Delta_0 of update-values Delta_{ij} in RPROP method
classifier.ann.rdwm	Float	Update-values lower limit Delta_{min} in RPROP method
classifier.ann.term	Choices	Termination criteria
classifier.ann.term iter	Choice	Maximum number of iterations
classifier.ann.term eps	Choice	Epsilon
classifier.ann.term all	Choice	Max. iterations + Epsilon
classifier.ann.eps	Float	Epsilon value used in the Termination criteria
classifier.ann.iter	Int	Maximum number of iterations used in the Termination criteria
classifier.rf.max	Int	Maximum depth of the tree
classifier.rf.min	Int	Minimum number of samples in each node
classifier.rf.ra	Float	Termination Criteria for regression tree
classifier.rf.cat	Int	Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split
classifier.rf.var	Int	Size of the randomly selected subset of features at each tree node
classifier.rf.nbtrees	Int	Maximum number of trees in the forest
classifier.rf.acc	Float	Sufficient accuracy (OOB error)
classifier.knn.k	Int	Number of Neighbors
rand	Int	Set user defined seed
inxml	XML input parameters file	Load otb application from xml file
outxml	XML output parameters file	Save otb application to xml file

Input and output data: This group of parameters allows setting input and output data.

Input Image List: A list of input images.
Input Vector Data List: A list of vector data to select the training samples.
Input XML image statistics file: Input XML file containing the mean and the standard deviation of the input images.
Output confusion matrix: Output file containing the confusion matrix (.csv format).
Output model: Output file containing the model estimated (.txt format).
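As an illustration of how the statistics file is meant to be used, the sketch below centers and reduces one pixel with per-band mean and standard deviation values. The statistics and pixel values are made up, and the function name is hypothetical; parsing of the XML file itself is not shown.

```python
def center_and_reduce(pixel, means, stddevs):
    """Standardize one pixel: subtract the per-band mean, divide by the per-band std."""
    return [(value - mean) / std for value, mean, std in zip(pixel, means, stddevs)]


# Hypothetical statistics for a 4-band image and one pixel of that image.
means = [100.0, 200.0, 150.0, 300.0]
stddevs = [10.0, 20.0, 50.0, 100.0]
sample = center_and_reduce([110.0, 180.0, 150.0, 500.0], means, stddevs)
print(sample)
```

After this step every band has a comparable scale, which matters for classifiers that rely on distances or margins.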

Elevation management: This group of parameters allows managing elevation values. Supported formats are SRTM, DTED or any GeoTIFF. The DownloadSRTMTiles application can be a useful tool to list or download tiles related to a product.

DEM directory: This parameter allows selecting a directory containing Digital Elevation Model files. Note that this directory should contain only DEM files. Unexpected behaviour might occur if other images are found in this directory.
Geoid File: Use a geoid grid to get the height above the ellipsoid in case there is no DEM available, no coverage for some points or pixels with no_data in the DEM tiles. A version of the geoid can be found on the OTB website (http://hg.orfeo-toolbox.org/OTB-Data/raw-file/404aa6e4b3e0/Input/DEM/egm96.grd).
Default elevation: This parameter allows setting the default height above ellipsoid when there is no DEM available, no coverage for some points or pixels with no_data in the DEM tiles, and no geoid file has been set. This is also used by some applications as an average elevation value.

Training and validation samples parameters: This group of parameters allows you to set training and validation sample lists parameters.

Maximum training sample size per class: Maximum size per class (in pixels) of the training sample list (default = 1000) (no limit = -1). If equal to -1, then the maximal size of the available training sample list per class will be equal to the surface area of the smallest class multiplied by the training sample ratio.
Maximum validation sample size per class: Maximum size per class (in pixels) of the validation sample list (default = 1000) (no limit = -1). If equal to -1, then the maximal size of the available validation sample list per class will be equal to the surface area of the smallest class multiplied by the validation sample ratio.
Bound sample number by minimum: Bound the number of samples for each class by the number of samples available in the smallest class. Proportions between training and validation are respected. Default is true (=1).
On edge pixel inclusion: Takes pixels on polygon edge into consideration when building training and validation samples.
Training and validation sample ratio: Ratio between training and validation samples (0.0 = all training, 1.0 = all validation) (default = 0.5).
Name of the discrimination field: Name of the field used to discriminate class labels in the input vector data files.
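The interaction between the size limits and the ratio can be sketched as follows. This is only one reading of the parameter descriptions above, not the application's actual code: it assumes that bounding by the smallest class is enabled, that the training share is 1 minus the validation ratio, and that sizes are truncated to integers. The function name and the pixel counts are hypothetical.

```python
def sample_sizes(class_pixel_counts, vtr=0.5, max_train=1000, max_valid=1000):
    """Return (n_train, n_valid) per class under the assumptions stated above."""
    smallest = min(class_pixel_counts.values())
    # Share of the smallest class available to each list (vtr = validation share).
    train_available = int(smallest * (1.0 - vtr))
    valid_available = int(smallest * vtr)
    # A limit of -1 means "no limit": only the availability bound applies.
    n_train = train_available if max_train == -1 else min(max_train, train_available)
    n_valid = valid_available if max_valid == -1 else min(max_valid, valid_available)
    return n_train, n_valid


# Hypothetical number of labelled pixels per class.
counts = {1: 5000, 2: 800, 3: 12000}
sizes_default = sample_sizes(counts)
sizes_unbounded = sample_sizes(counts, vtr=0.25, max_train=-1, max_valid=-1)
print(sizes_default, sizes_unbounded)
```

In this sketch the smallest class (800 pixels) drives both lists, so the default limit of 1000 is never reached.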

Classifier to use for the training: Choice of the classifier to use for the training. Available choices are:

LibSVM classifier : This group of parameters allows setting SVM classifier parameters.

SVM Kernel Type : SVM Kernel Type.

SVM Model Type : Type of SVM formulation.

Cost parameter C : SVM models have a cost parameter C (1 by default) to control the trade-off between training errors and forcing rigid margins.

Parameters optimization : SVM parameters optimization flag.

Probability estimation : Probability estimation flag.
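For reference, the four kernel types listed above correspond to the standard LibSVM kernel formulas, sketched here in plain Python. The parameters gamma, coef0 and degree come from LibSVM's own documentation and do not appear in the parameter table above; the default values used here are arbitrary.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def rbf_kernel(u, v, gamma=1.0):
    # Gaussian radial basis function: exp(-gamma * ||u - v||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), rbf_kernel(u, v), poly_kernel(u, v), sigmoid_kernel(u, v))
```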

Boost classifier : This group of parameters allows setting Boost classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/boosting.html.

Boost Type : Type of Boosting algorithm.

Weak count : The number of weak classifiers.

Weight Trim Rate : A threshold between 0 and 1 used to save computational time. Samples with summary weight <= (1 - weight_trim_rate) do not participate in the next iteration of training. Set this parameter to 0 to turn off this functionality.

Maximum depth of the tree : Maximum depth of the tree.
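The Weight Trim Rate rule can be pictured with the following sketch: samples are visited from the lightest to the heaviest, and those whose cumulative normalized weight stays at or below (1 - weight_trim_rate) are left out of the next iteration. This is an illustrative interpretation of the description above, not OpenCV's implementation; the function name and the weights are hypothetical.

```python
def trim_samples(weights, weight_trim_rate):
    """Return the indices of the samples kept for the next boosting iteration."""
    if weight_trim_rate <= 0:
        return list(range(len(weights)))  # trimming turned off
    total = sum(weights)
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    cumulative = 0.0
    dropped = set()
    for i in order:
        cumulative += weights[i] / total
        if cumulative <= 1.0 - weight_trim_rate:
            dropped.add(i)  # lightest samples: skipped in the next iteration
        else:
            break
    return [i for i in range(len(weights)) if i not in dropped]


# Hypothetical sample weights after a few boosting rounds.
weights = [0.01, 0.01, 0.02, 0.46, 0.5]
kept = trim_samples(weights, 0.95)
kept_all = trim_samples(weights, 0)
print(kept, kept_all)
```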

Decision Tree classifier : This group of parameters allows setting Decision Tree classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/decision_trees.html.

Maximum depth of the tree : The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.

Minimum number of samples in each node : If the number of samples in a node is smaller than this parameter, then the node will not be split.

Termination criteria for regression tree : If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.

Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split : Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.

K-fold cross-validations : If cv_folds > 1, then it prunes a tree with K-fold cross-validation where K is equal to cv_folds.

Set Use1seRule flag to false : If true, then pruning will be harsher. This makes the tree more compact and more resistant to noise in the training data, but a bit less accurate.

Set TruncatePrunedTree flag to false : If true, then pruned branches are physically removed from the tree.

Gradient Boosted Tree classifier : This group of parameters allows setting Gradient Boosted Tree classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/gradient_boosted_trees.html.

Number of boosting algorithm iterations : Number “w” of boosting algorithm iterations, with w*K being the total number of trees in the GBT model, where K is the output number of classes.

Regularization parameter : Regularization parameter.

Portion of the whole training set used for each algorithm iteration : Portion of the whole training set used for each algorithm iteration. The subset is generated randomly.

Maximum depth of the tree : The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.
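The arithmetic behind two of the Gradient Boosted Tree parameters can be sketched as follows: the total number of trees is w*K, and each iteration draws a random subset of the training set. The helper names, the seed handling and the numeric values are assumptions made for illustration; they are not the application's defaults.

```python
import random


def gbt_total_trees(w, num_classes):
    """Total number of trees in the model: w iterations times K classes."""
    return w * num_classes


def draw_iteration_subset(n_samples, portion, seed=0):
    """Randomly pick the sample indices used by one boosting iteration."""
    rng = random.Random(seed)
    size = max(1, int(round(n_samples * portion)))
    return sorted(rng.sample(range(n_samples), size))


# Arbitrary illustrative values.
total = gbt_total_trees(w=200, num_classes=4)
subset = draw_iteration_subset(n_samples=1000, portion=0.8)
print(total, len(subset))
```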

Artificial Neural Network classifier : This group of parameters allows setting Artificial Neural Network classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/neural_networks.html.

Train Method Type : Type of training method for the multilayer perceptron (MLP) neural network.

Number of neurons in each intermediate layer : The number of neurons in each intermediate layer (excluding input and output layers).

Neuron activation function type : Neuron activation function.

Alpha parameter of the activation function : Alpha parameter of the activation function (used only with sigmoid and gaussian functions).

Beta parameter of the activation function : Beta parameter of the activation function (used only with sigmoid and gaussian functions).

Strength of the weight gradient term in the BACKPROP method : Strength of the weight gradient term in the BACKPROP method. The recommended value is about 0.1.

Strength of the momentum term (the difference between weights on the 2 previous iterations) : Strength of the momentum term (the difference between weights on the 2 previous iterations). This parameter provides some inertia to smooth the random fluctuations of the weights. It can vary from 0 (the feature is disabled) to 1 and beyond. The value 0.1 or so is good enough.

Initial value Delta_0 of update-values Delta_{ij} in RPROP method : Initial value Delta_0 of update-values Delta_{ij} in RPROP method (default = 0.1).

Update-values lower limit Delta_{min} in RPROP method : Update-values lower limit Delta_{min} in RPROP method. It must be positive (default = 1e-7).

Termination criteria : Termination criteria.

Epsilon value used in the Termination criteria : Epsilon value used in the Termination criteria.

Maximum number of iterations used in the Termination criteria : Maximum number of iterations used in the Termination criteria.
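To show what the Alpha and Beta parameters do, the sketch below writes out the three activation functions in the form given by the OpenCV multilayer perceptron documentation, as far as I recall it: the symmetrical sigmoid is beta * (1 - exp(-alpha*x)) / (1 + exp(-alpha*x)) and the Gaussian is beta * exp(-alpha*x*x). Treat the exact formulas as an assumption to check against the linked documentation.

```python
import math


def identity(x):
    return x


def symmetric_sigmoid(x, alpha, beta):
    # Output ranges over (-beta, beta); alpha controls the steepness.
    return beta * (1.0 - math.exp(-alpha * x)) / (1.0 + math.exp(-alpha * x))


def gaussian(x, alpha, beta):
    # Bell curve of height beta centred on 0; alpha controls the width.
    return beta * math.exp(-alpha * x * x)


for x in (-2.0, 0.0, 2.0):
    print(x, identity(x), symmetric_sigmoid(x, 1.0, 1.0), gaussian(x, 1.0, 1.0))
```

With alpha = beta = 1, the symmetrical sigmoid is the same curve as tanh(x / 2).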

Normal Bayes classifier : Use a Normal Bayes Classifier. See the complete documentation at http://docs.opencv.org/modules/ml/doc/normal_bayes_classifier.html.

Random forests classifier : This group of parameters allows setting Random Forests classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/random_trees.html.

Maximum depth of the tree : The depth of the tree. A low value will likely underfit and conversely a high value will likely overfit. The optimal value can be obtained using cross validation or other suitable methods.

Minimum number of samples in each node : If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.

Termination Criteria for regression tree : If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.

Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split : Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.

Size of the randomly selected subset of features at each tree node : The size of the subset of features, randomly selected at each tree node, that are used to find the best split(s). If you set it to 0, then the size will be set to the square root of the total number of features.

Maximum number of trees in the forest : The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Also keep in mind that increasing the number of trees increases the prediction time linearly.

Sufficient accuracy (OOB error) : Sufficient accuracy (OOB error).
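The rule for the size of the feature subset (a value of 0 means the square root of the total number of features) can be written as a small helper. The function name is hypothetical, and rounding the square root to the nearest integer is an assumption of this sketch.

```python
import math


def effective_subset_size(var, n_features):
    """Number of features examined at each tree node for a given 'var' setting."""
    if var == 0:
        # Rounding to the nearest integer is an assumption of this sketch.
        return max(1, int(round(math.sqrt(n_features))))
    return var


size_auto = effective_subset_size(0, 16)
size_fixed = effective_subset_size(3, 16)
print(size_auto, size_fixed)
```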

KNN classifier : This group of parameters allows setting KNN classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/k_nearest_neighbors.html.

Number of Neighbors : The number of neighbors to use.
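As a reminder of what the number of neighbors controls, here is a minimal k-nearest-neighbors sketch in plain Python using squared Euclidean distance and a majority vote. It is a generic illustration with made-up samples, not the OpenCV implementation used by the application.

```python
from collections import Counter


def knn_predict(train_samples, train_labels, query, k):
    """Return the majority label among the k training samples closest to query."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(sample, query)), label)
        for sample, label in zip(train_samples, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-band training samples with their class labels.
train_samples = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]]
train_labels = [1, 1, 2, 2, 2]
predicted = knn_predict(train_samples, train_labels, [0.2, 0.1], k=3)
print(predicted)
```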

Set user defined seed: Set a specific seed with an integer value.

Load otb application from xml file: Load otb application from xml file.

Save otb application to xml file: Save otb application to xml file.

Example

To run this example in command-line, use the following:

otbcli_TrainImagesClassifier -io.il QB_1_ortho.tif -io.vd VectorData_QB1.shp -io.imstat EstimateImageStatisticsQB1.xml -sample.mv 100 -sample.mt 100 -sample.vtr 0.5 -sample.edg false -sample.vfn Class -classifier libsvm -classifier.libsvm.k linear -classifier.libsvm.c 1 -classifier.libsvm.opt false -io.out svmModelQB1.txt -io.confmatout svmConfusionMatrixQB1.csv
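When the command line has to be assembled by a script, the parameter keys can be kept in a dictionary and turned into an argument list. The helper below is a hypothetical convenience function, not part of OTB; it only builds and prints the command shown above.

```python
import shlex


def build_otbcli_command(application, parameters):
    """Turn a {key: value} mapping into an otbcli argument list."""
    args = ["otbcli_" + application]
    for key, value in parameters.items():
        args.append("-" + key)
        if isinstance(value, (list, tuple)):
            args.extend(str(v) for v in value)  # list parameters take several values
        else:
            args.append(str(value))
    return args


params = {
    "io.il": ["QB_1_ortho.tif"],
    "io.vd": ["VectorData_QB1.shp"],
    "io.imstat": "EstimateImageStatisticsQB1.xml",
    "sample.mv": 100,
    "sample.mt": 100,
    "sample.vtr": 0.5,
    "sample.edg": "false",
    "sample.vfn": "Class",
    "classifier": "libsvm",
    "classifier.libsvm.k": "linear",
    "classifier.libsvm.c": 1,
    "classifier.libsvm.opt": "false",
    "io.out": "svmModelQB1.txt",
    "io.confmatout": "svmConfusionMatrixQB1.csv",
}
command = build_otbcli_command("TrainImagesClassifier", params)
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where the OTB command-line tools are installed.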

To run this example from Python, use the following code snippet:

#!/usr/bin/python

# Import the otb applications package
import otbApplication

# The following line creates an instance of the TrainImagesClassifier application
TrainImagesClassifier = otbApplication.Registry.CreateApplication("TrainImagesClassifier")

# The following lines set all the application parameters:
TrainImagesClassifier.SetParameterStringList("io.il", ['QB_1_ortho.tif'])

TrainImagesClassifier.SetParameterStringList("io.vd", ['VectorData_QB1.shp'])

TrainImagesClassifier.SetParameterString("io.imstat", "EstimateImageStatisticsQB1.xml")

TrainImagesClassifier.SetParameterInt("sample.mv", 100)

TrainImagesClassifier.SetParameterInt("sample.mt", 100)

TrainImagesClassifier.SetParameterFloat("sample.vtr", 0.5)

TrainImagesClassifier.SetParameterString("sample.edg","1")

TrainImagesClassifier.SetParameterString("sample.vfn", "Class")

TrainImagesClassifier.SetParameterString("classifier","libsvm")

TrainImagesClassifier.SetParameterString("classifier.libsvm.k","linear")

TrainImagesClassifier.SetParameterFloat("classifier.libsvm.c", 1)

TrainImagesClassifier.SetParameterString("classifier.libsvm.opt","1")

TrainImagesClassifier.SetParameterString("io.out", "svmModelQB1.txt")

TrainImagesClassifier.SetParameterString("io.confmatout", "svmConfusionMatrixQB1.csv")

# The following line executes the application
TrainImagesClassifier.ExecuteAndWriteOutput()

Limitations

None

Authors

This application has been written by OTB-Team.