# TrainVectorClassifier

Train a classifier based on labeled geometries and a list of features to consider.

## Description

This application trains a classifier from labeled geometries and a list of features to use for classification. It is based on LibSVM, OpenCV Machine Learning (2.3.1 and later), and Shark ML. The output is a text model file whose format depends on the chosen ML model type. No image or vector data outputs are created.

This application has several output images and supports “multi-writing”. Instead of computing and writing each image independently, the streamed image blocks are written synchronously for each output. The output images are computed strip by strip, using the available RAM to determine the strip size, and a user-defined streaming mode can be specified using the streaming extended filenames (type, mode and value). Multi-writing can be disabled with the multi-write extended filename option `&multiwrite=false`; in this case the output images are written one by one. Multi-writing is not supported for MPI writers.

## Parameters

### Input and output data

This group of parameters allows setting input and output data.

**Input Vector Data** `-io.vd vectorfile1 vectorfile2...`

*Mandatory*

Input geometries used for training (note: all geometries from the layer will be used)

**Input XML image statistics file** `-io.stats filename [dtype]`

XML file containing mean and variance of each feature.

**Output model** `-io.out filename [dtype]`

*Mandatory*

Output file containing the model estimated (.txt format).

**Output confusion matrix or contingency table** `-io.confmatout filename [dtype]`

Output file containing the confusion matrix or contingency table (.csv format). The contingency table is output when an unsupervised algorithm is used; otherwise the confusion matrix is output.
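As a rough illustration of what a confusion matrix holds, here is a minimal pure-Python sketch. It does not reproduce the CSV layout written by the application, which this page does not specify; the function name `confusion_matrix` and the sample labels are hypothetical.

```python
from collections import Counter

def confusion_matrix(reference, predicted):
    """Count (reference, predicted) label pairs; rows are reference labels."""
    labels = sorted(set(reference) | set(predicted))
    counts = Counter(zip(reference, predicted))
    matrix = [[counts[(r, p)] for p in labels] for r in labels]
    return labels, matrix

# Hypothetical integer class labels, for illustration only
reference = [1, 1, 2, 2, 2, 3]
predicted = [1, 2, 2, 2, 3, 3]
labels, matrix = confusion_matrix(reference, predicted)
```

Each row corresponds to a reference class and each column to a predicted class, so correctly classified samples sit on the diagonal.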

**Layer Index** `-layer int`

*Default value: 0*

Index of the layer to use in the input vector file.

**Field names for training features** `-feat string1 string2...`

List of field names in the input vector data to be used as features for training.

### Validation data

This group of parameters defines validation data.

**Validation Vector Data** `-valid.vd vectorfile1 vectorfile2...`

Geometries used for validation (must contain the same fields used for training, all geometries from the layer will be used)

**Layer Index** `-valid.layer int`

*Default value: 0*

Index of the layer to use in the validation vector file.

**Field containing the class integer label for supervision** `-cfield string`

Field containing the class id for supervision. The values in this field shall be cast into integers. Only geometries with this field available will be taken into account.

**Verbose mode** `-v bool`

*Default value: true*

Verbose mode, display the contingency table result.

**Classifier to use for the training** `-classifier [libsvm|boost|dt|ann|bayes|rf|knn|sharkrf|sharkkm]`

*Default value: libsvm*

Choice of the classifier to use for the training.

**LibSVM classifier**

This group of parameters allows setting SVM classifier parameters.

**Boost classifier**

http://docs.opencv.org/modules/ml/doc/boosting.html

**Decision Tree classifier**

http://docs.opencv.org/modules/ml/doc/decision_trees.html

**Artificial Neural Network classifier**

http://docs.opencv.org/modules/ml/doc/neural_networks.html

**Normal Bayes classifier**

http://docs.opencv.org/modules/ml/doc/normal_bayes_classifier.html

**Random forests classifier**

http://docs.opencv.org/modules/ml/doc/random_trees.html

**KNN classifier**

http://docs.opencv.org/modules/ml/doc/k_nearest_neighbors.html

**Shark Random forests classifier**

http://image.diku.dk/shark/doxygen_pages/html/classshark_1_1_r_f_trainer.html

It is noteworthy that training is parallel.

**Shark kmeans classifier**

http://image.diku.dk/shark/sphinx_pages/build/html/rest_sources/tutorials/algorithms/kmeans.html

### LibSVM classifier options

**SVM Kernel Type** `-classifier.libsvm.k [linear|rbf|poly|sigmoid]`

*Default value: linear*

SVM Kernel Type.

**Linear**

Linear kernel: no mapping is done. This is the fastest option.

**Gaussian radial basis function**

This kernel is a good choice in most cases. It is an exponential function of the Euclidean distance between the vectors.

**Polynomial**

Polynomial kernel: the mapping is a polynomial function.

**Sigmoid**

The kernel is a hyperbolic tangent function of the vectors.
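The four kernels can be written out as plain functions. This is a minimal sketch using the usual LibSVM kernel definitions; `gamma`, `coef0` and `degree` are not parameters listed on this page, and the default values chosen here are arbitrary assumptions for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    # No mapping: the kernel is the dot product itself
    return dot(u, v)

def rbf_kernel(u, v, gamma=1.0):
    # Exponential of the negative squared Euclidean distance
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    # Polynomial function of the dot product
    return (gamma * dot(u, v) + coef0) ** degree

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    # Hyperbolic tangent of the (scaled, shifted) dot product
    return math.tanh(gamma * dot(u, v) + coef0)

u, v = [1.0, 2.0], [2.0, 0.0]
values = {
    "linear": linear_kernel(u, v),
    "rbf": rbf_kernel(u, v),
    "poly": poly_kernel(u, v),
    "sigmoid": sigmoid_kernel(u, v),
}
```

The linear kernel is the cheapest to evaluate, which is consistent with it being described as the fastest option.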

**SVM Model Type** `-classifier.libsvm.m [csvc|nusvc|oneclass]`

*Default value: csvc*

Type of SVM formulation.

**C support vector classification**

This formulation allows imperfect separation of classes. The penalty is set through the cost parameter C.

**Nu support vector classification**

This formulation allows imperfect separation of classes. The penalty is set through the cost parameter Nu. Compared to C, Nu is harder to optimize and may not be as fast.

**Distribution estimation (One Class SVM)**

All the training data are from the same class. The SVM builds a boundary that separates the class from the rest of the feature space.

**Cost parameter C** `-classifier.libsvm.c float`

*Default value: 1*

SVM models have a cost parameter C (1 by default) to control the trade-off between training errors and forcing rigid margins.

**Cost parameter Nu** `-classifier.libsvm.nu float`

*Default value: 0.5*

Cost parameter Nu, in the range 0..1. The larger the value, the smoother the decision.

**Parameters optimization** `-classifier.libsvm.opt bool`

*Default value: false*

SVM parameters optimization flag.

**Probability estimation** `-classifier.libsvm.prob bool`

*Default value: false*

Probability estimation flag.

### Boost classifier options

**Boost Type** `-classifier.boost.t [discrete|real|logit|gentle]`

*Default value: real*

Type of Boosting algorithm.

**Discrete AdaBoost**

This procedure trains the classifiers on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and the final classifier is then defined as a linear combination of the classifiers from each stage.

**Real AdaBoost (technique using confidence-rated predictions and working well with categorical data)**

Adaptation of the Discrete AdaBoost algorithm with real-valued predictions.

**LogitBoost (technique producing good regression fits)**

This procedure is an adaptive Newton algorithm for fitting an additive logistic regression model. Beware that it can produce numeric instability.

**Gentle AdaBoost (technique setting less weight on outlier data points and, for that reason, being often good with regression data)**

A modified version of the Real AdaBoost algorithm, using Newton stepping rather than exact optimization at each step.
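The reweighting idea behind Discrete AdaBoost can be sketched for a single round. This uses the textbook update (coefficient `0.5 * ln((1 - err) / err)`) and is an assumption for illustration; the OpenCV implementation used by the application may scale things differently. The function name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One Discrete AdaBoost round: return (alpha, updated weights).

    labels and predictions hold +1 / -1 values; weights must sum to 1
    and the weighted error must lie strictly between 0 and 1.
    """
    # Weighted error of the current weak classifier
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    # Coefficient of this weak classifier in the final linear combination
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Raise the weight of misclassified samples, lower the others
    raw = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, labels, predictions)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]  # the second sample is misclassified
weights = [0.25] * 4
alpha, new_weights = adaboost_round(weights, labels, predictions)
```

After this round the misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.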

**Weak count** `-classifier.boost.w int`

*Default value: 100*

The number of weak classifiers.

**Weight Trim Rate** `-classifier.boost.r float`

*Default value: 0.95*

A threshold between 0 and 1 used to save computational time. Samples with summary weight <= (1 - weight_trim_rate) do not participate in the next iteration of training. Set this parameter to 0 to turn off this functionality.

**Maximum depth of the tree** `-classifier.boost.m int`

*Default value: 1*

Maximum depth of the tree.

### Decision Tree classifier options

**Maximum depth of the tree** `-classifier.dt.max int`

*Default value: 10*

The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.

**Minimum number of samples in each node** `-classifier.dt.min int`

*Default value: 10*

If the number of samples in a node is smaller than this parameter, then this node will not be split.

**Termination criteria for regression tree** `-classifier.dt.ra float`

*Default value: 0.01*

If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split further.

**Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split** `-classifier.dt.cat int`

*Default value: 10*

Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.

**Set Use1seRule flag to false** `-classifier.dt.r bool`

*Default value: false*

If true, pruning will be harsher. This makes the tree more compact and more resistant to noise in the training data, but a bit less accurate.

**Set TruncatePrunedTree flag to false** `-classifier.dt.t bool`

*Default value: false*

If true, then pruned branches are physically removed from the tree.

### Artificial Neural Network classifier options

**Train Method Type** `-classifier.ann.t [back|reg]`

*Default value: reg*

Type of training method for the multilayer perceptron (MLP) neural network.

**Back-propagation algorithm**

Method to compute the gradient of the loss function and adjust weights in the network to optimize the result.

**Resilient Back-propagation algorithm**

Almost the same as the Back-prop algorithm except that it does not take into account the magnitude of the partial derivative (coordinate of the gradient) but only its sign.

**Number of neurons in each intermediate layer** `-classifier.ann.sizes string1 string2...`

*Mandatory*

The number of neurons in each intermediate layer (excluding input and output layers).

**Neuron activation function type** `-classifier.ann.f [ident|sig|gau]`

*Default value: sig*

This function determines whether the output of the node is positive or not, depending on the output of the transfer function.

**Identity function**

**Symmetrical Sigmoid function**

**Gaussian function (Not completely supported)**

**Alpha parameter of the activation function** `-classifier.ann.a float`

*Default value: 1*

Alpha parameter of the activation function (used only with sigmoid and gaussian functions).

**Beta parameter of the activation function** `-classifier.ann.b float`

*Default value: 1*

Beta parameter of the activation function (used only with sigmoid and gaussian functions).

**Strength of the weight gradient term in the BACKPROP method** `-classifier.ann.bpdw float`

*Default value: 0.1*

Strength of the weight gradient term in the BACKPROP method. The recommended value is about 0.1.

**Strength of the momentum term (the difference between weights on the 2 previous iterations)** `-classifier.ann.bpms float`

*Default value: 0.1*

Strength of the momentum term (the difference between weights on the 2 previous iterations). This parameter provides some inertia to smooth the random fluctuations of the weights. It can vary from 0 (the feature is disabled) to 1 and beyond. The value 0.1 or so is good enough.
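To make the roles of the two BACKPROP strengths concrete, here is a minimal sketch of a gradient step with momentum. It uses the standard textbook form and is an assumption for illustration, not a copy of the OpenCV internals; `backprop_step`, `dw_scale` and `moment_scale` are hypothetical names standing in for the `bpdw` and `bpms` values.

```python
def backprop_step(weight, grad, prev_update, dw_scale=0.1, moment_scale=0.1):
    """Return (new weight, update) for one weight.

    dw_scale weighs the gradient term; moment_scale weighs the
    previous update, which adds inertia to the trajectory.
    """
    update = -dw_scale * grad + moment_scale * prev_update
    return weight + update, update

# Two consecutive steps on f(w) = w**2, whose gradient is 2 * w
w, update = 1.0, 0.0
history = []
for _ in range(2):
    w, update = backprop_step(w, 2.0 * w, update)
    history.append(w)
```

With `moment_scale` set to 0 the second term vanishes and the step reduces to plain gradient descent.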

**Initial value Delta_0 of update-values Delta_{ij} in RPROP method** `-classifier.ann.rdw float`

*Default value: 0.1*

Initial value Delta_0 of update-values Delta_{ij} in RPROP method (default = 0.1).

**Update-values lower limit Delta_{min} in RPROP method** `-classifier.ann.rdwm float`

*Default value: 1e-07*

Update-values lower limit Delta_{min} in RPROP method. It must be positive (default = 1e-7).
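The sketch below shows how an initial update-value and a lower limit might interact in a simplified RPROP step for a single weight. The increase and decrease factors (`eta_plus`, `eta_minus`) and the upper limit `delta_max` are assumptions taken from common RPROP practice, not parameters listed on this page, and the variant shown omits weight backtracking.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def rprop_step(weight, grad, prev_grad, delta,
               eta_plus=1.2, eta_minus=0.5, delta_min=1e-7, delta_max=50.0):
    """One simplified RPROP update; only the sign of grad is used."""
    if grad * prev_grad > 0:
        # Same direction as before: grow the step, capped at delta_max
        delta = min(delta * eta_plus, delta_max)
    elif grad * prev_grad < 0:
        # Direction flipped: shrink the step, never below delta_min
        delta = max(delta * eta_minus, delta_min)
    weight -= sign(grad) * delta
    return weight, delta

# Minimise f(w) = w**2 starting from w = 1.0 with Delta_0 = 0.1
w, delta, prev_grad = 1.0, 0.1, 0.0
for _ in range(50):
    grad = 2.0 * w
    w, delta = rprop_step(w, grad, prev_grad, delta)
    prev_grad = grad
```

Because only the sign of the gradient enters the update, scaling the gradient by any positive factor leaves the step unchanged.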

**Termination criteria** `-classifier.ann.term [iter|eps|all]`

*Default value: all*

Termination criteria.

**Maximum number of iterations**

Sets the number of iterations allowed for training the network. Training stops when this number is reached, regardless of the result.

**Epsilon**

Training focuses on the result and stops once the precision is at most epsilon.

**Max. iterations + Epsilon**

Both termination criteria are used. Training stops as soon as either one is reached.

**Epsilon value used in the Termination criteria** `-classifier.ann.eps float`

*Default value: 0.01*

Epsilon value used in the Termination criteria.

**Maximum number of iterations used in the Termination criteria** `-classifier.ann.iter int`

*Default value: 1000*

Maximum number of iterations used in the Termination criteria.
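The three termination modes can be summarized as a small predicate. This is a sketch under assumptions: the page does not say which quantity is compared against epsilon, so `change` is a hypothetical placeholder for it, and `should_stop` is not part of the application.

```python
def should_stop(iteration, change, mode="all", max_iter=1000, eps=0.01):
    """Decide whether training stops under the iter, eps or all mode."""
    hit_iter = iteration >= max_iter
    hit_eps = change <= eps
    if mode == "iter":
        return hit_iter
    if mode == "eps":
        return hit_eps
    if mode == "all":
        # Stop at whichever criterion is reached first
        return hit_iter or hit_eps
    raise ValueError("unknown termination mode: %r" % mode)

# Early in training, with a precision already below epsilon
decisions = {m: should_stop(10, 0.001, mode=m) for m in ("iter", "eps", "all")}
```

The defaults mirror the values documented above (1000 iterations, epsilon 0.01, mode `all`).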

### Random forests classifier options

**Maximum depth of the tree** `-classifier.rf.max int`

*Default value: 5*

The depth of the tree. A low value will likely underfit and conversely a high value will likely overfit. The optimal value can be obtained using cross validation or other suitable methods.

**Minimum number of samples in each node** `-classifier.rf.min int`

*Default value: 10*

If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.

**Termination Criteria for regression tree** `-classifier.rf.ra float`

*Default value: 0*

If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.

**Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split** `-classifier.rf.cat int`

*Default value: 10*

Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.

**Size of the randomly selected subset of features at each tree node** `-classifier.rf.var int`

*Default value: 0*

The size of the subset of features, randomly selected at each tree node, that are used to find the best split(s). If you set it to 0, then the size will be set to the square root of the total number of features.

**Maximum number of trees in the forest** `-classifier.rf.nbtrees int`

*Default value: 100*

The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Keep in mind that increasing the number of trees also increases the prediction time linearly.

**Sufficient accuracy (OOB error)** `-classifier.rf.acc float`

*Default value: 0.01*

Sufficient accuracy (OOB error).

### KNN classifier options

**Number of Neighbors** `-classifier.knn.k int`

*Default value: 32*

The number of neighbors to use.
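For intuition about the role of the neighbor count, here is a minimal k-nearest-neighbors sketch using Euclidean distance and a majority vote. It is an illustration under those assumptions, not the OpenCV implementation; `knn_predict` and the sample data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(samples, labels, query, k=32):
    """Return the most frequent label among the k samples nearest to query."""
    order = sorted(range(len(samples)),
                   key=lambda i: math.dist(samples[i], query))
    nearest = [labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Two well-separated hypothetical classes
samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [1, 1, 1, 2, 2, 2]
near_origin = knn_predict(samples, labels, (0.2, 0.1), k=3)
near_corner = knn_predict(samples, labels, (5.5, 5.5), k=3)
```

When k exceeds the number of training samples, every sample votes, so the result collapses to the overall majority class.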

### Shark Random forests classifier options

**Maximum number of trees in the forest** `-classifier.sharkrf.nbtrees int`

*Default value: 100*

The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Keep in mind that increasing the number of trees also increases the prediction time linearly.

**Min size of the node for a split** `-classifier.sharkrf.nodesize int`

*Default value: 25*

If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.

**Number of features tested at each node** `-classifier.sharkrf.mtry int`

*Default value: 0*

The number of features (variables) which will be tested at each node in order to compute the split. If set to zero, the square root of the number of features is used.

**Out of bound ratio** `-classifier.sharkrf.oobr float`

*Default value: 0.66*

Set the fraction of the original training dataset to use as the out of bag sample. A good default value is 0.66.

### Shark kmeans classifier options

**Maximum number of iterations for the kmeans algorithm** `-classifier.sharkkm.maxiter int`

*Default value: 10*

The maximum number of iterations for the kmeans algorithm (0 = unlimited).

**Number of classes for the kmeans algorithm** `-classifier.sharkkm.k int`

*Default value: 2*

The number of classes used for the kmeans algorithm. The default is 2 classes.
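The interplay between the iteration limit and the number of classes can be sketched with a plain Lloyd-style k-means loop. This is an illustration under assumptions, not the Shark implementation: initial centroids are passed in explicitly, the number of classes is simply the number of centroids, and `maxiter=0` is treated as unlimited, as described above.

```python
import math

def kmeans(points, centroids, maxiter=10):
    """Refine centroids until they stop moving or maxiter is reached."""
    centroids = [list(c) for c in centroids]
    iteration = 0
    while maxiter == 0 or iteration < maxiter:
        # Assignment step: index of the nearest centroid for each point
        assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: mean of the members, or the old centroid if empty
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)
        iteration += 1
        if new_centroids == centroids:
            break  # converged: centroids did not move
        centroids = new_centroids
    return centroids

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
result = kmeans(points, [(1, 1), (9, 1)], maxiter=10)
```

With `maxiter=0` the loop only ends on convergence, so a finite limit is the safer choice on large or noisy data.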

**User defined input centroids** `-classifier.sharkkm.incentroids filename [dtype]`

Input text file containing centroid positions used to initialize the algorithm. Each centroid must be described by p parameters, p being the number of features in the input vector data, and the number of centroids must be equal to the number of classes (one centroid per line with values separated by spaces).
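The text layout described here (one centroid per line, values separated by spaces) can be produced and checked with a few lines of Python. This is a minimal sketch; `format_centroids` and `parse_centroids` are hypothetical helper names and the values are made up.

```python
def format_centroids(centroids):
    """Serialize centroids: one per line, values separated by spaces."""
    return "\n".join(" ".join(str(v) for v in c) for c in centroids) + "\n"

def parse_centroids(text, n_features, n_classes):
    """Parse centroid text and check its shape against p and k."""
    rows = [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) != n_classes:
        raise ValueError("expected %d centroids, got %d" % (n_classes, len(rows)))
    if any(len(row) != n_features for row in rows):
        raise ValueError("each centroid needs %d values" % n_features)
    return rows

# Two classes, three features (hypothetical values)
text = format_centroids([[0.5, 1.0, 2.0], [3.0, 4.0, 5.5]])
centroids = parse_centroids(text, n_features=3, n_classes=2)
```

Validating the shape before launching the application avoids a mismatch between the centroid file, the `-feat` list and the number of classes.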

**Statistics file** `-classifier.sharkkm.cstats filename [dtype]`

An XML file containing the mean and standard deviation used to center and reduce the input centroids before the KMeans algorithm, produced by the ComputeImagesStatistics application.

**Output centroids text file** `-classifier.sharkkm.outcentroids filename [dtype]`

Output text file containing centroids after the kmeans algorithm.

**Random seed** `-rand int`

Set a specific random seed with integer value.

## Examples

From the command-line:

```
otbcli_TrainVectorClassifier -io.vd vectorData.shp -io.stats meanVar.xml -io.out svmModel.svm -feat perimeter area width -cfield predicted
```

From Python:

```
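import otbApplication

# Create the application instance from the OTB registry
app = otbApplication.Registry.CreateApplication("TrainVectorClassifier")

# Training geometries, feature statistics and output model
app.SetParameterStringList("io.vd", ['vectorData.shp'])
app.SetParameterString("io.stats", "meanVar.xml")
app.SetParameterString("io.out", "svmModel.svm")
# String-list parameters take a Python list, one entry per field name
app.SetParameterStringList("feat", ['perimeter', 'area', 'width'])
app.SetParameterString("cfield", "predicted")

# Run the training and write the model file
app.ExecuteAndWriteOutput()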
```
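When several classifier options need to be set, it can help to assemble the command line from a dictionary. The sketch below only builds the command string and does not call OTB; `build_command` is a hypothetical helper, the file name `rfModel.txt` is made up, and the parameter keys are the ones documented on this page.

```python
import shlex

def build_command(params):
    """Build an otbcli_TrainVectorClassifier command line from a dict.

    List values are expanded into several arguments after the key.
    """
    args = ["otbcli_TrainVectorClassifier"]
    for key, value in params.items():
        args.append("-" + key)
        if isinstance(value, (list, tuple)):
            args.extend(str(v) for v in value)
        else:
            args.append(str(value))
    return shlex.join(args)

# Random forest training with two non-default options (illustrative values)
command = build_command({
    "io.vd": ["vectorData.shp"],
    "io.out": "rfModel.txt",
    "feat": ["perimeter", "area", "width"],
    "cfield": "predicted",
    "classifier": "rf",
    "classifier.rf.nbtrees": 200,
    "classifier.rf.max": 8,
})
print(command)
```

The resulting string can be pasted into a shell; `shlex.join` quotes any argument that contains spaces or other special characters.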