Apache Ignite® enables real-time analytics across operational and historical Apache™ Hadoop® data silos. The Ignite in-memory computing platform provides low-latency, high-throughput operations, while Hadoop continues to serve long-running OLAP workloads.
As the architecture diagram on the right suggests, you can accelerate Hadoop-based systems by deploying Ignite as a separate distributed storage layer that maintains the data sets required for your low-latency operations and real-time reports.
First, depending on the data volume and available memory capacity, you can enable Ignite native persistence to store historical data sets on disk while dedicating a memory space for operational records. You can continue to use Hadoop as storage for less frequently used data or for long-running and ad-hoc analytical queries.
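As a minimal sketch of the first step, the snippet below enables Ignite native persistence so that the full data set is stored on disk while a bounded region of RAM holds the operational records. The region size and the explicit activation call are illustrative choices, not recommendations; tune them to your data volume.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PersistenceConfigExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Persist the default data region to disk; frequently accessed
        // records stay in memory up to the configured region size.
        storageCfg.getDefaultDataRegionConfiguration()
            .setPersistenceEnabled(true)
            .setMaxSize(4L * 1024 * 1024 * 1024); // 4 GB of RAM, illustrative

        cfg.setDataStorageConfiguration(storageCfg);

        try (Ignite ignite = Ignition.start(cfg)) {
            // With native persistence enabled, the cluster must be
            // activated explicitly before it can serve requests.
            ignite.cluster().state(ClusterState.ACTIVE);
        }
    }
}
```

With this configuration in place, data sets that exceed the in-memory region are transparently served from disk, so Ignite can hold historical records without requiring that everything fit in RAM.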
Next, your applications and services should use Ignite native APIs to process the data residing in the in-memory cluster. Ignite provides SQL, compute (a.k.a. MapReduce), and machine learning APIs for various data processing needs.
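The sketch below shows two of those native APIs side by side, assuming a hypothetical "Person" cache; the query text and cache name are placeholders for your own schema.

```java
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class NativeApiExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Hypothetical cache; in practice it would be configured
            // with query entities or created via CREATE TABLE.
            IgniteCache<Long, String> cache = ignite.getOrCreateCache("Person");

            // SQL API: run a low-latency query against in-memory data.
            List<List<?>> rows =
                cache.query(new SqlFieldsQuery("SELECT 1")).getAll();

            // Compute (MapReduce) API: broadcast a closure to every node
            // so the processing runs where the data lives.
            ignite.compute().broadcast(
                () -> System.out.println("Processing on a cluster node"));
        }
    }
}
```

The same cluster can also be queried over JDBC/ODBC, so existing reporting tools can point at Ignite without code changes.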
Finally, consider using Apache Spark DataFrames APIs if an application needs to run federated or cross-database queries across Ignite and Hadoop clusters. Ignite is integrated with Spark, which natively supports Hive/Hadoop. Cross-database queries should be considered only for a limited number of scenarios when neither Ignite nor Hadoop contains the entire data set.
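A federated query of this kind might look like the following sketch: one DataFrame reads operational data through the Ignite data source for Spark, another reads a Hive table backed by Hadoop, and Spark performs the join. The table names, join key, and configuration path are illustrative assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FederatedQueryExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("ignite-hadoop-federation")
            .enableHiveSupport()
            .getOrCreate();

        // Operational data served by the Ignite cluster.
        Dataset<Row> hot = spark.read()
            .format("ignite")
            .option("table", "RECENT_ORDERS")              // hypothetical table
            .option("config", "/path/to/ignite-config.xml") // hypothetical path
            .load();

        // Historical data from a Hive table stored in Hadoop.
        Dataset<Row> cold = spark.table("orders_archive");  // hypothetical table

        // Cross-database join executed by Spark across both clusters.
        hot.join(cold, "customer_id").show();
    }
}
```

Because Spark pulls data out of both systems to execute the join, such queries are slower than native Ignite queries, which is why they should be reserved for the limited scenarios described above.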
How to split data and operations between Ignite and Hadoop?
Consider using this approach:
- Use Apache Ignite for tasks that require low-latency response times (microseconds, milliseconds, seconds), high-throughput operations (thousands to millions of operations per second), and real-time processing.
- Continue using Apache Hadoop for high-latency operations (tens of seconds, minutes, hours) and batch processing.
Getting Started Checklist
Follow the steps below to implement the discussed architecture in practice:
- Download and install Apache Ignite on your system.
- Select a list of operations/reports to be executed against Ignite. The best candidates are operations that require low-latency response time, high-throughput, and real-time analytics.
- Depending on the data volume and available memory space, consider using Ignite native persistence. Alternatively, you can use Ignite as a pure in-memory cache or in-memory data grid that persists changes to Hadoop or another external database.
- Update your applications to ensure they use Ignite native APIs to process Ignite data and Spark for federated queries.
- If you need to replicate changes between Ignite and Hadoop clusters, consider using existing change-data-capture solutions such as Debezium, Kafka, GridGain Data Lake Accelerator, or Oracle GoldenGate. If you'd like Ignite to write changes through to Hadoop directly, implement Ignite's CacheStore interface.
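The last step above can be sketched as follows. This is a skeleton CacheStore, assuming a hypothetical `hadoopClient` helper in place of real HDFS/Hive access code; only the method signatures come from Ignite's API.

```java
import javax.cache.Cache;
import javax.cache.integration.CacheLoaderException;
import javax.cache.integration.CacheWriterException;
import org.apache.ignite.cache.store.CacheStoreAdapter;

public class HadoopCacheStore extends CacheStoreAdapter<Long, String> {
    @Override
    public String load(Long key) throws CacheLoaderException {
        // Read-through: fetch a record from Hadoop on a cache miss.
        // return hadoopClient.read(key); — hypothetical helper
        return null;
    }

    @Override
    public void write(Cache.Entry<? extends Long, ? extends String> entry)
            throws CacheWriterException {
        // Write-through: propagate the cache update to Hadoop.
        // hadoopClient.write(entry.getKey(), entry.getValue()); — hypothetical
    }

    @Override
    public void delete(Object key) throws CacheWriterException {
        // Remove the record from Hadoop as well.
        // hadoopClient.delete(key); — hypothetical
    }
}
```

To activate the store, register a factory for it via `CacheConfiguration.setCacheStoreFactory(...)` and enable `setWriteThrough(true)` (and `setReadThrough(true)` if cache misses should fall back to Hadoop).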