Apache Ignite 2.5: Scaling to 1000s Nodes Clusters

Apache Ignite was always appreciated by its users for two primary things it delivers - scalability and performance. Throughout the lifetime many distributed systems tend to do performance optimizations from a release to release while making scalability related improvements just a couple of times. It's not because the scalability is of no interest. Usually, scalability requirements are set and solved once by a distributed system and don't require significant additional interventions by engineers.

However, Apache Ignite grew to the point when the community decided to revisit its discovery subsystem that influences how well and far Ignite scales out. The goal was pretty clear - Ignite has to scale to 1000s of nodes as good as it scales to 100s now.

It took many months to get the task implemented. So, please join me in welcoming Apache Ignite 2.5 that now can be scaled easily to 1000s of nodes and goes with other exciting capabilities. Let's check out the most prominent ones.

Massive Scalability

There are two components of Ignite that were modified in Ignite 2.5 to improve its scalability capabilities. The first one is related to 1000s nodes clusters while the other is related to the way we train machine learning (ML) models in Ignite. Let's start with the first.

Marrying Apache Ignite and ZooKeeper

Right, that 1000s nodes scalability goal was solved with the help of Apache ZooKeeper. Why did we turn to it?

Apache Ignite default TCP/IP Discovery organizes cluster nodes into a ring-topology form that has its advantages and disadvantages. For instance, on topologies with hundreds of cluster nodes, it can take many seconds why a system message traverse through all the nodes. As a result, necessary processing of events such as joining of new nodes or detecting of failed ones can take a while affecting overall cluster responsiveness and performance. That is a big deal if you'd like to run 1000s nodes clusters.

The new ZooKeeper Discovery uses ZooKeeper as a single point of synchronization where Ignite nodes are exchanging discovery events through it. It solved the issue with long-to-be-processed discovery messages and, as a result, allowed Ignite scaling to large cluster topologies.

As a rule of thumb, keep using default TCP/IP Discovery if it's unlikely that your Ignite cluster scales beyond 300s nodes and switch to ZooKeeper Discovery if that's the case.

Machine Learning: Partition-Based Datasets

That's the second prominent feature of Ignite 2.5 that improves the way of how far you can scale your Ignite clusters to train ML models over terabytes or petabytes of data. The partition-based datasets moved us closer to the implementation of Zero-ETL concept which implies that Ignite can be used as a single storage where ML models and algorithms are being improved iteratively and online without ETLing data back and forth between Ignite and another storage.

Read more about the datasets from this documentation page.

Genetic Algorithms

Ignite's ML component is ramping up and in the version 2.5 it accepted a contribution of genetic algorithms (GAs) which help to solve optimization problems by simulating the process of biological evolution. GAs are excellent for searching through large and complex data sets for an optimal solution. Real world applications of GAs include automotive design, computer gaming, robotics, investments, traffic/shipment routing and more.

Refer to excessive articles of my community-mates Turik Campbell and Akmal B. Chaudhri which cover main benefits of GAs:

Continuous Self-Healing and Consistency Checks

It's a known fact that many companies and businesses trusted Ignite its mission-critical deployments and solutions. As a result, sometimes Ignite doesn't even have a right to "misfire" and should be able to handle critical or unpredictable situations automatically or provide facilities to do deal with them manually.

With Ignite 2.5, we've kicked off the realization of continuous self-healing concept that implies that no matter what happens with Ignite in production it should be able to tolerate unexpected failures and stay up and running. The following was done in 2.5:

SQL: Security and Fast Data Loading

The community stays strong and determined in its goal of making Ignite SQL engine undistinguishable from SQL engines of famous and mature SQL database. What's the purpose? We want to make it easy for you to migrate from a relational database to Ignite, so that you can reuse all your skills gained before. Overall, this is what our SQL engine got in 2.5:

Fast data loading with COPY command and streaming mode using SQL APIs.
Long running transactions monitoring and termination
Secured Ignite cluster. Use CREATE USER, DROP USER and ALTER USER commands to manage who is allowed to connect to your clusters.

In-place Execution of Spark DataFrame Queries

Apache Spark users can applaud because the following ticket got merged in 2.5. In short, it means that from now on Ignite will be able to execute as many DataFrames SQL queries as it can in-place on Ignite servers side avoiding data movement from Ignite to Spark. The performance of your DataFrames queries should boost significantly. Enjoy!

DEB and RPM packages

Last but not least, if you're a Linux user, now you can install the latest Ignite versions directly from DEB and RPM repositories. Refer to how-to and share your feedback with us.

Finally, I have no more paper left to cover other optimizations and improvements. So, go ahead and check out our release notes.