Cluster Snapshots | Ignite Documentation

Apache Ignite 2.11 Released: See Why You Need to Upgrade →

Edit

Cluster Snapshots

Overview

Ignite provides an ability to create full cluster snapshots for deployments using Ignite Persistence. An Ignite snapshot includes a consistent cluster-wide copy of all data records persisted on disk and some other files needed for a restore procedure.

The snapshot structure is similar to the layout of the Ignite Persistence storage directory, with several exceptions. Let’s take this snapshot as an example to review the structure:

work
└── snapshots
    └── backup23012020
        └── db
            ├── binary_meta
            │         ├── node1
            │         ├── node2
            │         └── node3
            ├── marshaller
            │         ├── node1
            │         ├── node2
            │         └── node3
            ├── node1
            │    └── my-sample-cache
            │        ├── cache_data.dat
            │        ├── part-3.bin
            │        ├── part-4.bin
            │        └── part-6.bin
            ├── node2
            │    └── my-sample-cache
            │        ├── cache_data.dat
            │        ├── part-1.bin
            │        ├── part-5.bin
            │        └── part-7.bin
            └── node3
                └── my-sample-cache
                    ├── cache_data.dat
                    ├── part-0.bin
                    └── part-2.bin
  • The snapshot is located under the work\snapshots directory and named as backup23012020 where work is Ignite’s work directory.

  • The snapshot is created for a 3-node cluster with all the nodes running on the same machine. In this example, the nodes are named as node1, node2, and node3, while in practice, the names are equal to nodes' consistent IDs.

  • The snapshot keeps a copy of the my-sample-cache cache.

  • The db folder keeps a copy of data records in part-N.bin and cache_data.dat files. Write-ahead and checkpointing are not added into the snapshot as long as those are not required for the current restore procedure.

  • The binary_meta and marshaller directories store metadata and marshaller-specific information.

Note

Usually Snapshot is Spread Across the Cluster

The previous example shows the snapshot created for the cluster running on the same physical machine. Thus, the whole snapshot is located in a single place. While in practice, all the nodes will be running on different machines having the snapshot data spread across the cluster. Each node keeps a segment of the snapshot with the data belonging to this particular node. The restore procedure explains how to tether together all the segments during recovery.

Configuring Snapshot Directory

By default, a segment of the snapshot is stored in the work directory of a respective Ignite node and uses the same storage media where Ignite Persistence keeps data, index, WAL, and other files. Since the snapshot can consume as much space as already taken by the persistence files and can affect your applications' performance by sharing the disk I/O with the Ignite Persistence routines, it’s suggested to store the snapshot and persistence files on different media.

See the Configuring Snapshot Directory page for configuration examples.

Creating Snapshot

Ignite provides several APIs for the snapshot creation. Let’s review all the options.

Using Control Script

Ignite ships the Control Script that supports snapshots-related commands listed below:

#Create a cluster snapshot:
control.(sh|bat) --snapshot create snapshot_name

#Cancel a running snapshot:
control.(sh|bat) --snapshot cancel snapshot_name

#Kill a running snapshot:
control.(sh|bat) --kill SNAPSHOT snapshot_name

Using JMX

Use the SnapshotMXBean interface to perform the snapshot-specific procedures via JMX:

Method Description

createSnapshot(String snpName)

Create a snapshot.

cancelSnapshot(String snpName)

Cancel a snapshot on the node initiated its creation.

Using Java API

Also, it’s possible to create a snapshot programmatically in Java:

CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<>("snapshot-cache");

try (IgniteCache<Integer, String> cache = ignite.getOrCreateCache(ccfg)) {
    cache.put(1, "Maxim");

    // Start snapshot operation.
    ignite.snapshot().createSnapshot("snapshot_02092020").get();
}
finally {
    ignite.destroyCache(ccfg.getName());
}

Checking Snapshot Consistency

Usually all the cluster nodes run on different machines and have the snapshot data spread across the cluster. Each node stores its own snapshot segment, so in some cases it may be necessary to check the snapshot for completeness of data and for data consistency across the cluster before restoring from the snapshot.

For such cases, Apache Ignite is delivered with built-in snapshot consistency check commands that enable you to verify internal data consistency, calculate data partitions hashes and pages checksums, and print out the result if a problem is found. The check command also compares hashes of a primary partitions with corresponding backup partitions and reports any differences.

See the Control Script that supports snapshots-related checking commands.

Restoring From Snapshot

A snapshot can be restored either manually on a stopped cluster or automatically on an active cluster. Both procedures are described below.

Manual Snapshot Restore Procedure

Stop the cluster, then replace persistence data and other files with the data from the snapshot, and restart the nodes.

The detailed procedure looks as follows:

  1. Stop the cluster you intend to restore

  2. Remove all files from the checkpoint $IGNITE_HOME/work/cp directory

  3. Do the following on each node:

    • Remove the files related to the {nodeId} from the $IGNITE_HOME/work/db/binary_meta directory.

    • Remove the files related to the {nodeId} from the $IGNITE_HOME/work/db/marshaller directory.

    • Remove the files and sub-directories related to the {nodeId} under your $IGNITE_HOME/work/db directory. Clean the db/{node_id} directory separately if it’s not located under the Ignite work dir.

    • Copy the files belonging to a node with the {node_id} from the snapshot into the $IGNITE_HOME/work/ directory. If the db/{node_id} directory is not located under the Ignite work dir then you need to copy data files there.

  4. Restart the cluster

Restore On Cluster of Different Topology

You may want to create a snapshot of an N-node cluster and use it to restore on an M-node cluster. The table below explains what options are supported:

Condition Description

N == M

The recommended case. Create and use the snapshot on clusters of a similar topology.

N < M

Start the first N nodes of the M-node cluster and apply the snapshot. Add the rest of the M-cluster nodes to the topology and wait while the data gets rebalanced and indexes are rebuilt.

N > M

Unsupported.

Automatic Snapshot Restore Procedure

The automatic restore procedure allows the user to restore cache groups from a snapshot on an active cluster by using the Java API or command line script.

Currently, this procedure has several limitations, that will be resolved in future releases:

  • Restoring is possible only if all parts of the snapshot are present in the cluster. Each node looks for a local snapshot data in the configured snapshot path by the given snapshot name and consistent node ID.

  • The restore procedure can be applied only to cache groups created by the user.

  • Cache groups to be restored from the snapshot must not be present in the cluster. If they are present, they must be destroyed by the user before starting this operation.

  • Concurrent restore operations are not allowed. Thus, if one operation has been started, the other can only be started after the first is completed.

Restoring Cache Group from the Snapshot

The following code snippet demonstrates how to restore an individual cache group from a snapshot.

// Restore cache named "snapshot-cache" from the snapshot "snapshot_02092020".
ignite.snapshot().restoreSnapshot("snapshot_02092020", Collections.singleton("snapshot-cache")).get();
# Restore cache group "snapshot-cache" from the snapshot "snapshot_02092020".
control.(sh|bat) --snapshot restore snapshot_02092020 --start snapshot-cache

Using CLI to control restore operation

The control.sh|bat script provides the ability to start, stop, and get the status of the restore operation.

# Start restoring all user-created cache groups from the snapshot "snapshot_09062021".
control.(sh|bat) --snapshot restore snapshot_09062021 --start

# Start restoring only "cache-group1" and "cache-group2" from the snapshot "snapshot_09062021".
control.(sh|bat) --snapshot restore snapshot_09062021 --start cache-group1,cache-group2

# Get the status of the restore operation for "snapshot_09062021".
control.(sh|bat) --snapshot restore snapshot_09062021 --status

# Cancel the restore operation for "snapshot_09062021".
control.(sh|bat) --snapshot restore snapshot_09062021 --cancel

Consistency Guarantees

All snapshots are fully consistent in terms of concurrent cluster-wide operations as well as ongoing changes with Ignite. Persistence data, index, schema, binary metadata, marshaller and other files on nodes.

The cluster-wide snapshot consistency is achieved by triggering the Partition-Map-Exchange procedure. By doing that, the cluster will eventually get to the point in time when all previously started transactions are completed, and new ones are paused. Once this happens, the cluster initiates the snapshot creation procedure. The PME procedure ensures that the snapshot includes primary and backup in a consistent state.

The consistency between the Ignite Persistence files and their snapshot copies is achieved by copying the original files to the destination snapshot directory with tracking all concurrent ongoing changes. The tracking of the changes might require extra space on the Ignite Persistence storage media (up to the 1x size of the storage media).

Current Limitations

The snapshot procedure has some limitations that you should be aware of before using the feature in your production environment:

  • Snapshotting of specific caches/tables is unsupported. You always create a full cluster snapshot.

  • Caches/tables that are not persisted in Ignite Persistence are not included into the snapshot.

  • Encrypted caches are not included in the snapshot.

  • You can have only one snapshotting operation running at a time.

  • The snapshot procedure is interrupted if a server node leaves the cluster.

  • Snapshot may be restored only at the same cluster topology with the same node IDs;

  • The automatic restore procedure is not available yet. You have to restore it manually.

If any of these limitations prevent you from using Apache Ignite, then select alternate snapshotting implementations for Ignite provided by enterprise vendors.