Updating Trained Models
Updating Already Trained Models in Apache Ignite
The model-updating interface in Ignite ML lets you retrain an already trained model on a new portion of data, starting from the state learned earlier. The interface is exposed by the DatasetTrainer class and mirrors the training interface, taking the already learned model as the first parameter:
M update(M mdl, DatasetBuilder<K, V> datasetBuilder, IgniteBiFunction<K, V, Vector> featureExtractor, IgniteBiFunction<K, V, L> lbExtractor)
M update(M mdl, Ignite ignite, IgniteCache<K, V> cache, IgniteBiFunction<K, V, Vector> featureExtractor, IgniteBiFunction<K, V, L> lbExtractor)
M update(M mdl, Ignite ignite, IgniteCache<K, V> cache, IgniteBiPredicate<K, V> filter, IgniteBiFunction<K, V, Vector> featureExtractor, IgniteBiFunction<K, V, L> lbExtractor)
M update(M mdl, Map<K, V> data, int parts, IgniteBiFunction<K, V, Vector> featureExtractor, IgniteBiFunction<K, V, L> lbExtractor)
M update(M mdl, Map<K, V> data, IgniteBiPredicate<K, V> filter, int parts, IgniteBiFunction<K, V, Vector> featureExtractor, IgniteBiFunction<K, V, L> lbExtractor)
The interface supports both online learning and online batch learning. Online learning means that when you get a new training example, such as a click on a website, you can update the model as if it had been trained on this example too. Online batch learning requires a batch of examples instead of a single example for each model update. Some models allow both update strategies, while others allow only batch updating; it depends upon the learning algorithm. Further details of each model's update capabilities in terms of online and online batch learning can be found below.
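To make this concrete, here is a minimal runnable sketch of online batch learning using the Map-based overload, which works locally without starting an Ignite node. It is only a sketch: it assumes an Ignite 2.x ML API matching the signatures above, and the data layout (feature in v[0], label in v[1]) is an assumption of this example. LinearRegressionLSQRTrainer is used here simply because it needs no extra configuration.

import java.util.HashMap;
import java.util.Map;

import org.apache.ignite.ml.math.functions.IgniteBiFunction;
import org.apache.ignite.ml.math.primitives.vector.Vector;
import org.apache.ignite.ml.math.primitives.vector.VectorUtils;
import org.apache.ignite.ml.regressions.linear.LinearRegressionLSQRTrainer;
import org.apache.ignite.ml.regressions.linear.LinearRegressionModel;

public class ModelUpdateSketch {
    public static void main(String[] args) {
        // Toy data: key -> {feature, label}; v[0] is the feature, v[1] the label.
        Map<Integer, double[]> historicalBatch = new HashMap<>();
        for (int i = 0; i < 100; i++)
            historicalBatch.put(i, new double[] {i, 2.0 * i + 1.0});

        IgniteBiFunction<Integer, double[], Vector> featureExtractor =
            (k, v) -> VectorUtils.of(v[0]);
        IgniteBiFunction<Integer, double[], Double> lbExtractor = (k, v) -> v[1];

        LinearRegressionLSQRTrainer trainer = new LinearRegressionLSQRTrainer();

        // Initial training; 2 is the number of local dataset partitions.
        LinearRegressionModel mdl = trainer.fit(historicalBatch, 2, featureExtractor, lbExtractor);

        // A fresh batch arrives later: refine the existing model
        // instead of retraining from scratch.
        Map<Integer, double[]> freshBatch = new HashMap<>();
        for (int i = 100; i < 150; i++)
            freshBatch.put(i, new double[] {i, 2.0 * i + 1.0});

        mdl = trainer.update(mdl, freshBatch, 2, featureExtractor, lbExtractor);

        System.out.println("Prediction at x = 200: " + mdl.predict(VectorUtils.of(200.0)));
    }
}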
The new portion of data should be compatible with the first trainer's parameters and with the previously used dataset in terms of feature vector size and feature value distributions. For example, if you train an ANN model, you should provide the trainer with the same distance measure and candidates count as at the first learning stage. If you update KMeans, the new dataset should contain at least k rows.
Each model has its own implementation of this interface. Read the sections below for more information about the updating process for each algorithm.
KMeans
Model updating takes the already learned centroids and updates them with the new rows. We recommend batch online learning for this model: first, the dataset should contain at least k rows; second, a dataset with only a few rows can move centroids to invalid positions.
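A short sketch of such a batch update, reusing the data and extractors from the first sketch above (withDistance and withAmountOfClusters follow a recent Ignite 2.x release and may be named differently in older versions):

KMeansTrainer trainer = new KMeansTrainer()
    .withDistance(new EuclideanDistance())
    .withAmountOfClusters(2);

// Initial training on the historical batch.
KMeansModel mdl = trainer.fit(historicalBatch, 2, featureExtractor, lbExtractor);

// The fresh batch should contain at least k rows (here k = 2); a tiny
// batch risks moving the learned centroids to invalid positions.
mdl = trainer.update(mdl, freshBatch, 2, featureExtractor, lbExtractor);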
KNN
Model updating just adds the new dataset to the old one. In this case, model updating isn't restricted.
ANN
As in the case of KNN, a new trainer should provide the same distance measure and k-value. These parameters matter because internally ANN uses KMeans and keeps statistics over the centroids KMeans produces. During an update, the trainer takes the centroid statistics from the last learning and updates them with the new observations. From this point of view, ANN allows "mini-batch" online learning where the batch size equals the k-parameter.
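A sketch of an ANN update under these constraints, reusing the extractors from the first sketch (the trainer configuration follows the shape of Ignite's ANN classification example; 0/1 class labels in the data are an assumption here):

ANNClassificationTrainer trainer = new ANNClassificationTrainer()
    .withDistance(new EuclideanDistance())
    .withK(10)
    .withMaxIterations(100)
    .withEpsilon(1e-4);

ANNClassificationModel mdl = trainer.fit(historicalBatch, 2, featureExtractor, lbExtractor);

// The update reuses the centroid statistics of the previous learning, so
// the trainer must keep the same distance measure and k as at the initial
// training stage.
mdl = trainer.update(mdl, freshBatch, 2, featureExtractor, lbExtractor);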
Neural Network (NN)
NN updating takes the current neural network state and updates it according to the gradient of the error on the new dataset. In this case, the NN requires only feature vector compatibility between datasets.
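A sketch of such an update with MLPTrainer, shaped after Ignite's multilayer perceptron example (the constructor arguments are architecture, loss function, update strategy, max iterations, batch size, local iterations, and seed). Note that MLPTrainer's label extractor returns double[] because network outputs are vectors; 0/1 labels are assumed:

// 1 input -> hidden layer of 10 ReLU neurons -> 1 sigmoid output.
MLPArchitecture arch = new MLPArchitecture(1)
    .withAddedLayer(10, true, Activators.RELU)
    .withAddedLayer(1, false, Activators.SIGMOID);

MLPTrainer<SimpleGDParameterUpdate> trainer = new MLPTrainer<>(
    arch,
    LossFunctions.MSE,
    new UpdatesStrategy<>(
        new SimpleGDUpdateCalculator(0.1),
        SimpleGDParameterUpdate::sumLocal,
        SimpleGDParameterUpdate::avg
    ),
    3000, // max iterations
    4,    // batch size
    50,   // local iterations
    123L  // seed
);

// Labels as one-element vectors, matching the single output neuron.
IgniteBiFunction<Integer, double[], double[]> mlpLbExtractor =
    (k, v) -> new double[] {v[1]};

MultilayerPerceptron mlp = trainer.fit(historicalBatch, 2, featureExtractor, mlpLbExtractor);

// Gradient descent continues from the current weights on the new batch;
// only the feature vector layout has to match between datasets.
mlp = trainer.update(mlp, freshBatch, 2, featureExtractor, mlpLbExtractor);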
Logistic Regression
Logistic regression inherits all restrictions from the neural network trainer because it uses a perceptron internally.
Linear Regression
The LinearRegressionSGD trainer inherits all restrictions from the neural network trainer. LinearRegressionLSQRTrainer restores the state from the last learning and uses it as the first approximation when learning on a new dataset. In this way, LinearRegressionLSQRTrainer also requires only feature vector compatibility.
SVM
The SVM trainer uses the state of the learned model as the first approximation during training. From this point of view, the algorithm requires only feature vector compatibility.
Decision Tree
There is no correct implementation of updating for decision trees. Updating just learns a new model on the given dataset.
GDB
GDB trainer updating takes the already learned models from the composition and tries to minimize the error gradient on the given dataset by learning new models that predict the gradient. It also uses a convergence checker: if the error on the new dataset is already small, GDB skips the update stage. From this point of view, GDB requires only feature vector compatibility.
Note: every update can increase the model composition size. The models depend upon each other, so frequent updates based upon small datasets can produce an enormous model that requires a lot of memory.
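A sketch with GDB on trees, reusing the extractors from the first sketch (the constructor arguments are gradient step size, iteration count, max tree depth, and min impurity decrease; the convergence checker factory follows the shape of Ignite's GDB example, and 0/1 class labels are assumed):

GDBTrainer trainer = new GDBBinaryClassifierOnTreesTrainer(0.5, 300, 3, 0.0)
    .withCheckConvergenceStgyFactory(new MeanAbsValueConvergenceCheckerFactory(0.001));

ModelsComposition mdl = trainer.fit(historicalBatch, 2, featureExtractor, lbExtractor);

// If the existing composition already predicts the fresh batch well, the
// convergence checker turns the update into a no-op; otherwise new trees
// predicting the error gradient are appended, growing the composition.
mdl = trainer.update(mdl, freshBatch, 2, featureExtractor, lbExtractor);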
Random Forest (RF)
The RF trainer just learns new decision trees on the given dataset and adds them to the already learned composition. In this way, RF requires feature vector compatibility, and the dataset should contain more than one element because a decision tree cannot be trained on a smaller dataset. In contrast to GDB, the models in a trained RF composition don't depend upon each other, and if the composition grows too big, a user can manually remove some models.
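A sketch of an RF update, again reusing the extractors from the first sketch (RandomForestClassifierTrainer is configured with one FeatureMeta per input column, as in Ignite's random forest example; 0/1 class labels are assumed):

// One FeatureMeta per feature column; 'false' marks a feature as non-categorical.
List<FeatureMeta> meta = IntStream.range(0, 1)
    .mapToObj(i -> new FeatureMeta("f" + i, i, false))
    .collect(Collectors.toList());

RandomForestClassifierTrainer trainer = new RandomForestClassifierTrainer(meta)
    .withAmountOfTrees(50);

ModelsComposition mdl = trainer.fit(historicalBatch, 2, featureExtractor, lbExtractor);

// update() learns additional trees on the fresh batch and appends them to
// the composition; the trees don't depend on each other, so an oversized
// composition can be pruned manually.
mdl = trainer.update(mdl, freshBatch, 2, featureExtractor, lbExtractor);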