Random Forest in Apache Ignite
Random forest is an ensemble learning method for solving classification and regression problems. Random forest training builds an ensemble (model composition) of models of one type and combines their individual answers with an aggregation algorithm. Each model is trained on a part of the training dataset. The part is chosen according to the bagging and feature subspace methods. More information about these concepts can be found here: https://en.wikipedia.org/wiki/Random_forest, https://en.wikipedia.org/wiki/Bootstrap_aggregating and https://en.wikipedia.org/wiki/Random_subspace_method.
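As an illustration of the two sampling ideas mentioned above, here is a minimal sketch in plain Java. It is not the Ignite implementation: the class BaggingSketch and its methods are hypothetical names used only for this example. It draws row indices uniformly with replacement (bagging) and picks a random subset of feature indices (feature subspace) for one tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative only: hypothetical helper, not part of the Apache Ignite API. */
class BaggingSketch {
    /** Bagging: draws sampleSize row indices uniformly with replacement. */
    static int[] bootstrapRows(int rowCount, int sampleSize, Random rnd) {
        int[] rows = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++)
            rows[i] = rnd.nextInt(rowCount); // the same row may be drawn more than once
        return rows;
    }

    /** Feature subspace: picks featuresToUse distinct feature indices at random. */
    static List<Integer> randomFeatures(int featureCount, int featuresToUse, Random rnd) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < featureCount; i++)
            all.add(i);
        Collections.shuffle(all, rnd);
        return new ArrayList<>(all.subList(0, featuresToUse));
    }

    public static void main(String[] args) {
        Random rnd = new Random(0); // fixed seed so the run is repeatable
        System.out.println("rows for one tree: " + Arrays.toString(bootstrapRows(10, 10, rnd)));
        System.out.println("features for one tree: " + randomFeatures(13, 4, rnd));
    }
}
```

Each tree in the ensemble would get its own row sample and feature subset, which is what makes the individual models differ from one another.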
There are several implementations of aggregation algorithms in Apache Ignite ML:
MeanValuePredictionsAggregator - computes the answer of a random forest as the mean value of the predictions from all models in the given composition. This is often used for regression tasks.
OnMajorityPredictionsAggregator - returns the mode of the predictions from all models in the given composition. This can be useful for classification tasks. NOTE: This aggregator supports multi-class classification tasks.
The random forest algorithm is implemented in Ignite ML as a special case of a model composition with specific aggregators for different problems (MeanValuePredictionsAggregator for regression, OnMajorityPredictionsAggregator for classification).
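The two aggregation rules described above can be sketched in plain Java as follows. This is an assumption-laden illustration, not the code of the Ignite aggregator classes: AggregationSketch, mean and majority are hypothetical names, and class labels are assumed to be encoded as double values.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: hypothetical helper, not the Ignite aggregator classes. */
class AggregationSketch {
    /** Regression-style aggregation: arithmetic mean of the per-model predictions. */
    static double mean(double[] predictions) {
        double sum = 0.0;
        for (double p : predictions)
            sum += p;
        return sum / predictions.length;
    }

    /** Classification-style aggregation: the most frequent predicted label (mode). */
    static double majority(double[] predictions) {
        Map<Double, Integer> votes = new HashMap<>();
        for (double p : predictions)
            votes.merge(p, 1, Integer::sum); // count one vote per model
        double best = predictions[0];
        for (Map.Entry<Double, Integer> e : votes.entrySet())
            if (e.getValue() > votes.get(best))
                best = e.getKey();
        return best; // on a tie, which of the tied labels wins is unspecified in this sketch
    }

    public static void main(String[] args) {
        System.out.println(mean(new double[] {1.0, 2.0, 3.0, 6.0}));          // prints 3.0
        System.out.println(majority(new double[] {0.0, 2.0, 2.0, 1.0, 2.0})); // prints 2.0
    }
}
```

Because the mode is taken over arbitrary label values, the same voting rule works for two classes or for many.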
Here is an example of model usage:
ModelsComposition randomForest = …;
double prediction = randomForest.apply(featuresVector);
The random forest training algorithm is implemented with RandomForestRegressionTrainer and RandomForestClassifierTrainer trainers with the following parameters:
meta - feature metadata, a list of feature descriptions, each containing:
featureId - index of the feature in the features vector.
isCategoricalFeature - flag that is true if the feature is categorical.
This meta-information is important for the random forest training algorithm because it builds feature histograms, and categorical features should be represented in the histograms for all feature values.
featuresCountSelectionStrgy - sets the strategy that defines the number of random features used to learn one tree. The SQRT, LOG2, ALL and ONE_THIRD strategies are implemented in the FeaturesCountSelectionStrategies class.
maxDepth - sets the maximum tree depth.
minImpurityDelta - a node in a decision tree is split into two nodes only if the impurity of the two resulting nodes is lower than the impurity of the unsplit node by more than this value.
subSampleSize - a value in the interval [0; MAX_DOUBLE]. This parameter defines the count of sample repetitions in uniform sampling with replacement.
seed - seed value used in random generators.
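To make two of the parameters above more concrete, here is a small plain-Java sketch of how a feature-count strategy and an impurity threshold could be evaluated. It rests on assumptions: ForestParamsSketch and its members are hypothetical names, the rounding (always down) is a choice made for this example, and the split rule is one reading of the minImpurityDelta description rather than a copy of the Ignite code.

```java
/** Illustrative only: hypothetical helper, not the Ignite FeaturesCountSelectionStrategies class. */
class ForestParamsSketch {
    enum Strategy { SQRT, LOG2, ALL, ONE_THIRD }

    /** Number of features offered to one tree; rounding down is an assumption of this sketch. */
    static int featuresPerTree(Strategy strategy, int featureCount) {
        switch (strategy) {
            case SQRT:
                return (int) Math.sqrt(featureCount);
            case LOG2:
                // floor(log2(n)) computed with integer arithmetic to avoid floating-point error
                return 31 - Integer.numberOfLeadingZeros(featureCount);
            case ONE_THIRD:
                return featureCount / 3;
            default:
                return featureCount; // ALL
        }
    }

    /** Assumed split rule: split only if impurity drops by more than minImpurityDelta. */
    static boolean shouldSplit(double parentImpurity, double childrenImpurity, double minImpurityDelta) {
        return parentImpurity - childrenImpurity > minImpurityDelta;
    }

    public static void main(String[] args) {
        for (Strategy s : Strategy.values())
            System.out.println(s + " with 13 features -> " + featuresPerTree(s, 13));
        System.out.println(shouldSplit(0.5, 0.3, 0.1));
    }
}
```

With 13 input features this sketch offers 3 features per tree under SQRT and LOG2, 4 under ONE_THIRD and all 13 under ALL. Smaller counts give more varied trees, while ALL reduces the method to plain bagging.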
Random forest training may be used as follows:
RandomForestClassifierTrainer trainer = new RandomForestClassifierTrainer(featuresMeta)
    .withCountOfTrees(101)
    .withFeaturesCountSelectionStrgy(FeaturesCountSelectionStrategies.ONE_THIRD)
    .withMaxDepth(4)
    .withMinImpurityDelta(0.)
    .withSubSampleSize(0.3)
    .withSeed(0);

ModelsComposition rfModel = trainer.fit(
    ignite,
    dataCache,
    vectorizer
);
To see how the Random Forest Classifier can be used in practice, try this example that is available on GitHub and delivered with every Apache Ignite distribution. The example uses the Wine recognition dataset. A description of this dataset and the data itself are available from the UCI Machine Learning Repository.