TrainVectorClassifier - Train Vector Classifier¶
Train a classifier based on labeled geometries and a list of features to consider.
Detailed description¶
This application trains a classifier based on labeled geometries and a list of features to consider for classification.
Parameters¶
This section describes in detail the parameters available for this application. Table [1] summarizes these parameters and the parameter keys to use on the command line and in programming languages. The application key is TrainVectorClassifier.
[1] | Table: Parameters table for Train Vector Classifier. |
Parameter Key | Parameter Name | Parameter Type |
---|---|---|
io | Input and output data | Group |
io.vd | Input Vector Data | Input vector data list |
io.stats | Input XML image statistics file | Input File name |
io.out | Output model | Output File name |
io.confmatout | Output confusion matrix or contingency table | Output File name |
layer | Layer Index | Int |
feat | Field names for training features. | List |
valid | Validation data | Group |
valid.vd | Validation Vector Data | Input vector data list |
valid.layer | Layer Index | Int |
cfield | Field containing the class id for supervision | List |
classifier | Classifier to use for the training | Choices |
classifier libsvm | LibSVM classifier | Choice |
classifier boost | Boost classifier | Choice |
classifier dt | Decision Tree classifier | Choice |
classifier gbt | Gradient Boosted Tree classifier | Choice |
classifier ann | Artificial Neural Network classifier | Choice |
classifier bayes | Normal Bayes classifier | Choice |
classifier rf | Random forests classifier | Choice |
classifier knn | KNN classifier | Choice |
classifier.libsvm.k | SVM Kernel Type | Choices |
classifier.libsvm.k linear | Linear | Choice |
classifier.libsvm.k rbf | Gaussian radial basis function | Choice |
classifier.libsvm.k poly | Polynomial | Choice |
classifier.libsvm.k sigmoid | Sigmoid | Choice |
classifier.libsvm.m | SVM Model Type | Choices |
classifier.libsvm.m csvc | C support vector classification | Choice |
classifier.libsvm.m nusvc | Nu support vector classification | Choice |
classifier.libsvm.m oneclass | Distribution estimation (One Class SVM) | Choice |
classifier.libsvm.c | Cost parameter C | Float |
classifier.libsvm.opt | Parameters optimization | Boolean |
classifier.libsvm.prob | Probability estimation | Boolean |
classifier.boost.t | Boost Type | Choices |
classifier.boost.t discrete | Discrete AdaBoost | Choice |
classifier.boost.t real | Real AdaBoost (technique using confidence-rated predictions and working well with categorical data) | Choice |
classifier.boost.t logit | LogitBoost (technique producing good regression fits) | Choice |
classifier.boost.t gentle | Gentle AdaBoost (technique setting less weight on outlier data points and, for that reason, being often good with regression data) | Choice |
classifier.boost.w | Weak count | Int |
classifier.boost.r | Weight Trim Rate | Float |
classifier.boost.m | Maximum depth of the tree | Int |
classifier.dt.max | Maximum depth of the tree | Int |
classifier.dt.min | Minimum number of samples in each node | Int |
classifier.dt.ra | Termination criteria for regression tree | Float |
classifier.dt.cat | Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split | Int |
classifier.dt.f | K-fold cross-validations | Int |
classifier.dt.r | Set Use1seRule flag to false | Boolean |
classifier.dt.t | Set TruncatePrunedTree flag to false | Boolean |
classifier.gbt.w | Number of boosting algorithm iterations | Int |
classifier.gbt.s | Regularization parameter | Float |
classifier.gbt.p | Portion of the whole training set used for each algorithm iteration | Float |
classifier.gbt.max | Maximum depth of the tree | Int |
classifier.ann.t | Train Method Type | Choices |
classifier.ann.t reg | RPROP algorithm | Choice |
classifier.ann.t back | Back-propagation algorithm | Choice |
classifier.ann.sizes | Number of neurons in each intermediate layer | String list |
classifier.ann.f | Neuron activation function type | Choices |
classifier.ann.f ident | Identity function | Choice |
classifier.ann.f sig | Symmetrical Sigmoid function | Choice |
classifier.ann.f gau | Gaussian function (Not completely supported) | Choice |
classifier.ann.a | Alpha parameter of the activation function | Float |
classifier.ann.b | Beta parameter of the activation function | Float |
classifier.ann.bpdw | Strength of the weight gradient term in the BACKPROP method | Float |
classifier.ann.bpms | Strength of the momentum term (the difference between weights on the 2 previous iterations) | Float |
classifier.ann.rdw | Initial value Delta_0 of update-values Delta_{ij} in RPROP method | Float |
classifier.ann.rdwm | Update-values lower limit Delta_{min} in RPROP method | Float |
classifier.ann.term | Termination criteria | Choices |
classifier.ann.term iter | Maximum number of iterations | Choice |
classifier.ann.term eps | Epsilon | Choice |
classifier.ann.term all | Max. iterations + Epsilon | Choice |
classifier.ann.eps | Epsilon value used in the Termination criteria | Float |
classifier.ann.iter | Maximum number of iterations used in the Termination criteria | Int |
classifier.rf.max | Maximum depth of the tree | Int |
classifier.rf.min | Minimum number of samples in each node | Int |
classifier.rf.ra | Termination Criteria for regression tree | Float |
classifier.rf.cat | Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split | Int |
classifier.rf.var | Size of the randomly selected subset of features at each tree node | Int |
classifier.rf.nbtrees | Maximum number of trees in the forest | Int |
classifier.rf.acc | Sufficient accuracy (OOB error) | Float |
classifier.knn.k | Number of Neighbors | Int |
rand | set user defined seed | Int |
inxml | Load otb application from xml file | XML input parameters file |
outxml | Save otb application to xml file | XML output parameters file |
[Input and output data]: This group of parameters allows setting input and output data.
- Input Vector Data: Input geometries used for training (note: all geometries from the layer will be used).
- Input XML image statistics file: XML file containing mean and variance of each feature.
- Output model: Output file containing the estimated model (.txt format).
- Output confusion matrix or contingency table: Output file containing the confusion matrix or contingency table (.csv format). The contingency table is output when an unsupervised algorithm is used; otherwise the confusion matrix is output.
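A statistics file holding the mean and variance of each feature is typically used to put features on a comparable scale before training. The exact normalization applied by the application is not stated here, so the following is only a minimal sketch that assumes plain standardization (subtract the mean, divide by the standard deviation). The function name `standardize` and its arguments are hypothetical and not part of the OTB API.

```python
import math


def standardize(sample, means, variances):
    """Center each feature on its mean and scale it by its standard deviation.

    Assumption: this mirrors a common use of per-feature mean/variance
    statistics; it is not taken from the OTB implementation.
    """
    if not (len(sample) == len(means) == len(variances)):
        raise ValueError("sample, means and variances must have the same length")
    scaled = []
    for value, mean, variance in zip(sample, means, variances):
        if variance <= 0:
            raise ValueError("variance must be positive")
        scaled.append((value - mean) / math.sqrt(variance))
    return scaled
```

For instance, a sample `[12.0, 30.0]` with means `[10.0, 20.0]` and variances `[4.0, 25.0]` would be mapped to `[1.0, 2.0]` under this assumption.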
Layer Index: Index of the layer to use in the input vector file.
Field names for training features: List of field names in the input vector data to be used as features for training.
[Validation data]: This group of parameters defines validation data.
- Validation Vector Data: Geometries used for validation (must contain the same fields used for training, all geometries from the layer will be used).
- Layer Index: Index of the layer to use in the validation vector file.
Field containing the class id for supervision: Field containing the class id for supervision. Only geometries with this field available will be taken into account.
Classifier to use for the training: Choice of the classifier to use for the training. Available choices are:
- LibSVM classifier: This group of parameters allows setting SVM classifier parameters.
- SVM Kernel Type: SVM Kernel Type. Available choices are:
- Linear
- Gaussian radial basis function
- Polynomial
- Sigmoid
- SVM Model Type: Type of SVM formulation. Available choices are:
- C support vector classification
- Nu support vector classification
- Distribution estimation (One Class SVM)
- Cost parameter C: SVM models have a cost parameter C (1 by default) to control the trade-off between training errors and forcing rigid margins.
- Parameters optimization: SVM parameters optimization flag.
- Probability estimation: Probability estimation flag.
- Boost classifier: This group of parameters allows setting Boost classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/boosting.html
- Boost Type: Type of Boosting algorithm. Available choices are:
- Discrete AdaBoost
- Real AdaBoost (technique using confidence-rated predictions and working well with categorical data)
- LogitBoost (technique producing good regression fits)
- Gentle AdaBoost (technique setting less weight on outlier data points and, for that reason, being often good with regression data)
- Weak count: The number of weak classifiers.
- Weight Trim Rate: A threshold between 0 and 1 used to save computational time. Samples with summary weight <= (1 - weight_trim_rate) do not participate in the next iteration of training. Set this parameter to 0 to turn off this functionality.
- Maximum depth of the tree: Maximum depth of the tree.
- Decision Tree classifier: This group of parameters allows setting Decision Tree classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/decision_trees.html
- Maximum depth of the tree: The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.
- Minimum number of samples in each node: If the number of samples in a node is smaller than this parameter, then the node will not be split.
- Termination criteria for regression tree: If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.
- Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split: Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.
- K-fold cross-validations: If cv_folds > 1, the tree is pruned using K-fold cross-validation, where K is equal to cv_folds.
- Set Use1seRule flag to false: If true, pruning will be harsher. This makes the tree more compact and more resistant to noise in the training data, but slightly less accurate.
- Set TruncatePrunedTree flag to false: If true, then pruned branches are physically removed from the tree.
- Gradient Boosted Tree classifier: This group of parameters allows setting Gradient Boosted Tree classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/gradient_boosted_trees.html
- Number of boosting algorithm iterations: Number “w” of boosting algorithm iterations, with w*K being the total number of trees in the GBT model, where K is the output number of classes.
- Regularization parameter: Regularization parameter.
- Portion of the whole training set used for each algorithm iteration: Portion of the whole training set used for each algorithm iteration. The subset is generated randomly.
- Maximum depth of the tree: The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.
- Artificial Neural Network classifier: This group of parameters allows setting Artificial Neural Network classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/neural_networks.html
- Train Method Type: Type of training method for the multilayer perceptron (MLP) neural network. Available choices are:
- RPROP algorithm
- Back-propagation algorithm
- Number of neurons in each intermediate layer: The number of neurons in each intermediate layer (excluding input and output layers).
- Neuron activation function type: Neuron activation function. Available choices are:
- Identity function
- Symmetrical Sigmoid function
- Gaussian function (Not completely supported)
- Alpha parameter of the activation function: Alpha parameter of the activation function (used only with sigmoid and gaussian functions).
- Beta parameter of the activation function: Beta parameter of the activation function (used only with sigmoid and gaussian functions).
- Strength of the weight gradient term in the BACKPROP method: Strength of the weight gradient term in the BACKPROP method. The recommended value is about 0.1.
- Strength of the momentum term (the difference between weights on the 2 previous iterations): Strength of the momentum term (the difference between weights on the 2 previous iterations). This parameter provides some inertia to smooth the random fluctuations of the weights. It can vary from 0 (the feature is disabled) to 1 and beyond. The value 0.1 or so is good enough.
- Initial value Delta_0 of update-values Delta_{ij} in RPROP method: Initial value Delta_0 of update-values Delta_{ij} in RPROP method (default = 0.1).
- Update-values lower limit Delta_{min} in RPROP method: Update-values lower limit Delta_{min} in RPROP method. It must be positive (default = 1e-7).
- Termination criteria: Termination criteria. Available choices are:
- Maximum number of iterations
- Epsilon
- Max. iterations + Epsilon
- Epsilon value used in the Termination criteria: Epsilon value used in the Termination criteria.
- Maximum number of iterations used in the Termination criteria: Maximum number of iterations used in the Termination criteria.
- Normal Bayes classifier: Use a Normal Bayes Classifier. See the complete documentation at http://docs.opencv.org/modules/ml/doc/normal_bayes_classifier.html
- Random forests classifier: This group of parameters allows setting Random Forests classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/random_trees.html
- Maximum depth of the tree: The depth of the tree. A low value will likely underfit and conversely a high value will likely overfit. The optimal value can be obtained using cross validation or other suitable methods.
- Minimum number of samples in each node: If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.
- Termination Criteria for regression tree: If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.
- Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split: Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.
- Size of the randomly selected subset of features at each tree node: The size of the subset of features, randomly selected at each tree node, that are used to find the best split(s). If you set it to 0, then the size will be set to the square root of the total number of features.
- Maximum number of trees in the forest: The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote beyond a certain number of trees. Keep in mind that increasing the number of trees also increases the prediction time linearly.
- Sufficient accuracy (OOB error): Sufficient accuracy (OOB error).
- KNN classifier: This group of parameters allows setting KNN classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/k_nearest_neighbors.html
- Number of Neighbors: The number of neighbors to use.
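Two of the numeric rules stated in the classifier descriptions above can be worked through in a few lines: the total number of trees in a GBT model (w*K) and the feature subset size used by random forests when the parameter is set to 0 (square root of the total number of features). This is an illustrative sketch only. The function names are hypothetical, and how the square root is rounded internally is not stated above, so the raw value is returned.

```python
import math


def gbt_total_trees(w, num_classes):
    """Total number of trees in a GBT model: w iterations times K output classes."""
    return w * num_classes


def rf_feature_subset_size(var, num_features):
    """Feature subset size per tree node for random forests.

    A value of 0 means "use the square root of the total number of features".
    Assumption: any rounding of the square root is left to the library.
    """
    if var == 0:
        return math.sqrt(num_features)
    return var
```

With 200 boosting iterations and 5 classes, `gbt_total_trees(200, 5)` gives 1000 trees. With 16 features and the subset size left at 0, `rf_feature_subset_size(0, 16)` gives 4.0.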
set user defined seed: Set a specific seed with an integer value.
Load otb application from xml file: Load otb application from xml file.
Save otb application to xml file: Save otb application to xml file.
Example¶
To run this example in command-line, use the following:
otbcli_TrainVectorClassifier -io.vd vectorData.shp -io.stats meanVar.xml -io.out svmModel.svm -feat perimeter area width -cfield predicted
To run this example from Python, use the following code snippet:
#!/usr/bin/python
# Import the otb applications package
import otbApplication
# The following line creates an instance of the TrainVectorClassifier application
TrainVectorClassifier = otbApplication.Registry.CreateApplication("TrainVectorClassifier")
# The following lines set all the application parameters:
TrainVectorClassifier.SetParameterStringList("io.vd", ['vectorData.shp'])
TrainVectorClassifier.SetParameterString("io.stats", "meanVar.xml")
TrainVectorClassifier.SetParameterString("io.out", "svmModel.svm")
# Feature fields and class field, matching the command-line example
TrainVectorClassifier.SetParameterStringList("feat", ['perimeter', 'area', 'width'])
TrainVectorClassifier.SetParameterStringList("cfield", ['predicted'])
# The following line executes the application
TrainVectorClassifier.ExecuteAndWriteOutput()
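The example above relies on the default classifier settings. As a hedged sketch of how the other parameter keys from the table might be combined, the following selects the random forests classifier, adds validation data and a confusion matrix output, and fixes the random seed. The file names (`validationData.shp`, `rfModel.txt`, `confusion.csv`) and the numeric values are placeholders chosen for illustration, and the helper `apply_parameters` is hypothetical; only the parameter keys and the `SetParameter...` calls come from the application and its Python bindings.

```python
# Placeholder values: only the parameter keys come from the parameters table.
RF_PARAMS = {
    "io.vd": ["vectorData.shp"],
    "io.stats": "meanVar.xml",
    "io.out": "rfModel.txt",
    "io.confmatout": "confusion.csv",
    "valid.vd": ["validationData.shp"],
    "feat": ["perimeter", "area", "width"],
    "cfield": ["predicted"],
    "classifier": "rf",
    "classifier.rf.nbtrees": 100,
    "classifier.rf.max": 5,
    "classifier.rf.acc": 0.01,
    "rand": 42,
}


def apply_parameters(app, params):
    """Set each parameter with the setter that matches its Python type."""
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            app.SetParameterStringList(key, list(value))
        elif isinstance(value, int):
            app.SetParameterInt(key, value)
        elif isinstance(value, float):
            app.SetParameterFloat(key, value)
        else:
            app.SetParameterString(key, str(value))


def train(params):
    """Create the application, configure it and run the training."""
    import otbApplication  # imported here so the helper above stays importable without OTB

    app = otbApplication.Registry.CreateApplication("TrainVectorClassifier")
    apply_parameters(app, params)
    app.ExecuteAndWriteOutput()
```

Calling `train(RF_PARAMS)` would run the training with these settings. Keeping the parameters in a dictionary makes it easy to switch classifiers by changing the `classifier` entry and its associated keys.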
Authors¶
This application was written by the OTB Team.