Metrics System | Ignite Documentation
Edit

Metrics System

Overview

This section explains the metrics system and how you can use it to monitor your cluster.

Let’s explore the basic concepts of the metrics system in Ignite. First, there are different metrics. Each metric has a name and a return value. The return value can be a simple value like String, long, or double, or can represent a Java object. Some metrics represent Histograms.

And then there are different ways to export the metrics — what we call exporters. To put it another way, the exporter are different ways you can access the metrics. Each exporter always gives access to all available metrics.

Ignite includes the following exporters:

  • JMX (default)

  • SQL Views

  • Log files

  • OpenCensus

You can create a custom exporter by implementing the MetricExporterSpi interface.

Metric Registers

Metrics are grouped into categories (called registers). Each register has a name. The full name of a specific metric within the register consists of the register name followed by a dot, followed by the name of the metric: <register_name>.<metric_name>. For example, the register for data storage metrics is called io.datastorage. The metric that return the storage size is called io.datastorage.StorageSize.

The list of all registers and the metrics they contain are described here.

Metric Exporters

If you want to enable metrics, configure one or multiple metric exporters in the node configuration. This is a node-specific configuration, which means it enables metrics only on the node where it is specified.

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="metricExporterSpi">
        <list>
            <bean class="org.apache.ignite.spi.metric.jmx.JmxMetricExporterSpi"/>
            <bean class="org.apache.ignite.spi.metric.log.LogExporterSpi"/>
            <bean class="org.apache.ignite.spi.metric.opencensus.OpenCensusMetricExporterSpi"/>
        </list>
    </property>
</bean>
IgniteConfiguration cfg = new IgniteConfiguration();

// Change metric exporter.
cfg.setMetricExporterSpi(new LogExporterSpi());

Ignite ignite = Ignition.start(cfg);
This API is not presently available for C++. You can use XML configuration.

The following sections describe the exporters available in Ignite by default.

JMX

org.apache.ignite.spi.metric.jmx.JmxMetricExporterSpi exposes metrics via JMX beans.

IgniteConfiguration cfg = new IgniteConfiguration();

// Create configured JMX metrics exporter.
JmxMetricExporterSpi jmxExporter = new JmxMetricExporterSpi();

//export cache metrics only
jmxExporter.setExportFilter(mreg -> mreg.name().startsWith("cache."));

cfg.setMetricExporterSpi(jmxExporter);
This API is not presently available for C++.
Note

This exporter is enabled by default if nothing is set with IgniteConfiguration.setMetricExporterSpi(…​).

IgniteConfiguration cfg = new IgniteConfiguration();

// Disable default JMX metrics exporter. Also could be disabled with the system property
// '-DIGNITE_MBEANS_DISABLED=true' or by setting other metrics exporter.
cfg.setMetricExporterSpi(new NoopMetricExporterSpi());
This API is not presently available for C++.

Enabling JMX for Ignite

By default, the JMX automatic configuration is disabled. To enable it, configure the following environment variables:

  • For control.sh, configure the CONTROL_JVM_OPTS variable

  • For ignite.sh, configure the JVM_OPTS variable

For example:

JVM_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=${JMX_PORT} \
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"

Understanding MBean’s ObjectName

Every JMX Mbean has an ObjectName. The ObjectName is used to identify the bean. The ObjectName consists of a domain and a list of key properties, and can be represented as a string as follows:

domain: key1 = value1 , key2 = value2

All Ignite metrics have the same domain: org.apache.<classloaderId> where the classloader ID is optional (omitted if you set IGNITE_MBEAN_APPEND_CLASS_LOADER_ID=false). In addition, each metric has two properties: group and name. For example:

org.apache:group=SPIs,name=TcpDiscoverySpi

This MBean provides various metrics related to node discovery.

The MBean ObjectName can be used to identify the bean in UI tools like JConsole. For example, JConsole displays MBeans in a tree-like structure where all beans are first grouped by domain and then by the 'group' property:

jconsole


SQL View

SqlViewMetricExporterSpi is enabled by default, SqlViewMetricExporterSpi exposes metrics via the SYS.METRICS view. Each metric is displayed as a single record. You can use any supported SQL tool to view the metrics:

> select name, value from SYS.METRICS where name LIKE 'cache.myCache.%';
+-----------------------------------+--------------------------------+
|                NAME               |             VALUE              |
+-----------------------------------+--------------------------------+
| cache.myCache.CacheTxRollbacks    | 0                              |
| cache.myCache.OffHeapRemovals     | 0                              |
| cache.myCache.QueryCompleted      | 0                              |
| cache.myCache.QueryFailed         | 0                              |
| cache.myCache.EstimatedRebalancingKeys | 0                         |
| cache.myCache.CacheEvictions      | 0                              |
| cache.myCache.CommitTime          | [J@2eb66498                    |
....

Log

org.apache.ignite.spi.metric.log.LogExporterSpi prints the metrics to the log file at regular intervals (1 min by default) at INFO level.

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<beans xmlns="http://www.springframework.org/schema/beans" xmlns:util="http://www.springframework.org/schema/util" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="         http://www.springframework.org/schema/beans         http://www.springframework.org/schema/beans/spring-beans.xsd         http://www.springframework.org/schema/util         http://www.springframework.org/schema/util/spring-util.xsd">
    <bean class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="metricExporterSpi">
            <list>
                <bean class="org.apache.ignite.spi.metric.log.LogExporterSpi"/>
            </list>
        </property>
    </bean>
</beans>

If you use programmatic configuration, you can change the print frequency as follows:

IgniteConfiguration cfg = new IgniteConfiguration();

LogExporterSpi logExporter = new LogExporterSpi();
logExporter.setPeriod(600_000);

//export cache metrics only
logExporter.setExportFilter(mreg -> mreg.name().startsWith("cache."));

cfg.setMetricExporterSpi(logExporter);

Ignite ignite = Ignition.start(cfg);

OpenCensus

org.apache.ignite.spi.metric.opencensus.OpenCensusMetricExporterSpi adds integration with the OpenCensus library.

To use the OpenCensus exporter:

  1. Enable the 'ignite-opencensus' module.

  2. Add org.apache.ignite.spi.metric.opencensus.OpenCensusMetricExporterSpi to the list of exporters in the node configuration.

  3. Configure OpenCensus StatsCollector to export to a specific system. See OpenCensusMetricsExporterExample.java for an example and OpenCensus documentation for additional information.

Configuration parameters:

  • filter - predicate that filters metrics.

  • period - export period.

  • sendInstanceName - if enabled, a tag with the Ignite instance name is added to each metric.

  • sendNodeId - if enabled, a tag with the Ignite node id is added to each metric.

  • sendConsistentId - if enabled, a tag with the Ignite node consistent id is added to each metric.

Histograms

The metrics that represent histograms are available in the JMX exporter only. Histogram metrics are exported as a set of values where each value corresponds to a specific bucket and is available through a separate JMX bean attribute. The attribute names of a histogram metric have the following format:

{metric_name}_{low_bound}_{high_bound}

where

  • {metric_name} - the name of the metric.

  • {low_bound} - start of the bound. 0 for the first bound.

  • {high_bound} - end of the bound. inf for the last bound.

Example of the metric names if the bounds are [10,100]:

  • histogram_0_10 - less than 10.

  • histogram_10_100 - between 10 and 100.

  • histogram_100_inf - more than 100.

Common monitoring tasks

Monitoring the Amount of Data

If you do not use Native persistence (i.e., all your data is kept in memory), you would want to monitor RAM usage. If you use Native persistence, in addition to RAM, you should monitor the size of the data storage on disk.

The size of the data loaded into a node is available at different levels of aggregation. You can monitor for:

  • The total size of the data the node keeps on disk or in RAM. This amount is the sum of the size of each configured data region (in the simplest case, only the default data region) plus the sizes of the system data regions.

  • The size of a specific data region on that node. The data region size is the sum of the sizes of all cache groups.

  • The size of a specific cache/cache group on that node, including the backup partitions.

Allocated Space vs. Actual Size of Data

There is no way to get the exact size of the data (neither in RAM nor on disk). Instead, there are two ways to estimate it.

You can get the size of the space allocated for storing the data. (The "space" here refers either to the space in RAM or on disk depending on whether you use Native persistence or not.) Space is allocated when the size of the storage gets full and more entries need to be added. However, when you remove entries from caches, the space is not deallocated. It is reused when new entries need to be added to the storage on subsequent write operations. Therefore, the allocated size does not decrease when you remove entries from the caches. The allocated size is available at the level of data storage, data region, and cache group metrics. The metric is called TotalAllocatedSize.

You can also get an estimate of the actual size of data by multiplying the number of data pages in use by the fill factor. The fill factor is the ratio of the size of data in a page to the page size, averaged over all pages. The number of pages in use and the fill factor are available at the level of data region metrics.

Add up the estimated size of all data regions to get the estimated total amount of data on the node.

Monitoring RAM Memory Usage

The amount of data in RAM can be monitored for each data region through the following metrics:

Attribute Type Description Scope

PagesFillFactor

float

The average size of data in pages as a ratio of the page size. When Native persistence is enabled, this metric is applicable only to the persistent storage (i.e. pages on disk).

Node

TotalUsedPages

long

The number of data pages that are currently in use. When Native persistence is enabled, this metric is applicable only to the persistent storage (i.e. pages on disk).

Node

PhysicalMemoryPages

long

The number of the allocated pages in RAM.

Node

PhysicalMemorySize

long

The size of the allocated space in RAM in bytes.

Node

If you have multiple data regions, add up the sizes of all data regions to get the total size of the data on the node.

Monitoring Storage Size

Persistent storage, when enabled, saves all application data on disk. The total amount of data each node keeps on disk consists of the persistent storage (application data), the WAL files, and WAL Archive files.

Persistent Storage Size

To monitor the size of the persistent storage on disk, use the following metrics:

Attribute

Type

Description

Scope

TotalAllocatedSize

long

The size of the space allocated on disk for the entire data storage (in bytes). Note that when Native persistence is disabled, this metric shows the total size of the allocated space in RAM.

Node

WalTotalSize

long

Total size of the WAL files in bytes, including the WAL archive files.

Node

WalArchiveSegments

int

The number of WAL segments in the archive.

Node

Data Region Size

For each configured data region, Metrics collection for data regions are disabled by default. You can link:monitoring-metrics/configuring-metrics#enabling-data-region-metrics[enable it in the data region configuration.

The size of the data region on a node comprises the size of all partitions (including backup partitions) that this node owns for all caches in that data region.

Attribute Type Description Scope

TotalAllocatedSize

long

The size of the space allocated for this data region (in bytes). Note that when Native persistence is disabled, this metric shows the total size of the allocated space in RAM.

Node

PagesFillFactor

float

The average amount of data in pages as a ratio of the page size.

Node

TotalUsedPages

long

The number of data pages that are currently in use.

Node

PhysicalMemoryPages

long

The number of data pages in this data region held in RAM.

Node

PhysicalMemorySize

long

The size of the allocated space in RAM in bytes.

Node

Cache Group Size

If you don’t use cache groups, each cache will be its own group.

Attribute

Type

Description

Scope

TotalAllocatedSize

long

The amount of space allocated for the cache group on this node.

Node

Monitoring Checkpointing Operations

Checkpointing may slow down cluster operations. You may want to monitor how much time each checkpoint operation takes, so that you can tune the properties that affect checkpointing. You may also want to monitor the disk performance to see if the slow-down is caused by external reasons.

See Pages Writes Throttling and Checkpointing Buffer Size for performance tips.

Attribute

Type

Description

Scope

DirtyPages

long

The number of pages in memory that have been changed but not yet synchronized to disk. Those will be written to disk during next checkpoint.

Node

LastCheckpointDuration

long

The time in milliseconds it took to create the last checkpoint.

Node

CheckpointBufferSize

long

The size of the checkpointing buffer.

Global

Monitoring Rebalancing

Rebalancing is the process of moving partitions between the cluster nodes so that the data is always distributed in a balanced manner. Rebalancing is triggered when a new node joins, or an existing node leaves the cluster.

If you have multiple caches, they will be rebalanced sequentially. There are several metrics that you can use to monitor the progress of the rebalancing process for a specific cache.

In the metric system, Cache metrics:

Attribute

Type

Description

Scope

RebalancingStartTime

long

This metric shows the time when rebalancing of local partitions started for the cache. This metric will return 0 if the local partitions do not participate in the rebalancing. The time is returned in milliseconds.

Node

EstimatedRebalancingFinishTime

long

Expected time of completion of the rebalancing process.

Node

KeysToRebalanceLeft

long

The number of keys on the node that remain to be rebalanced. You can monitor this metric to learn when the rebalancing process finishes.

Node

Monitoring Topology

Topology refers to the set of nodes in a cluster. There are a number of metrics that expose the information about the topology of the cluster. If the topology changes too frequently or has a size that is different from what you expect, you may want to look into whether there are network problems.

Attribute

Type

Description

Scope

TotalServerNodes

long

The number of server nodes in the cluster.

Global

TotalClientNodes

long

The number of client nodes in the cluster.

Global

TotalBaselineNodes

long

The number of nodes that are registered in the baseline topology. When a node goes down, it remains registered in the baseline topology and you need to remote it manually.

Global

ActiveBaselineNodes

long

The number of nodes that are currently active in the baseline topology.

Global

Attribute

Type

Description

Scope

Coordinator

String

The node ID of the current coordinator node.

Global

CoordinatorNodeFormatted

String

Detailed information about the coordinator node.

TcpDiscoveryNode [id=e07ad289-ff5b-4a73-b3d4-d323a661b6d4,
consistentId=fa65ff2b-e7e2-4367-96d9-fd0915529c25,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.25.4.200],
sockAddrs=[mymachine.local/172.25.4.200:47500,
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500,
order=2, intOrder=2, lastExchangeTime=1568187777249, loc=false,
ver=8.7.5#20190520-sha1:d159cd7a, isClient=false]

Global

Monitoring Caches

See the new metric system, Cache metrics.

Monitoring Build and Rebuild Indexes

To get an estimate on how long it takes to rebuild cache indexes, you can use one of the metrics listed below:

  1. IsIndexRebuildInProgress - tells whether indexes are being built or rebuilt at the moment;

  2. IndexBuildCountPartitionsLeft - gives the remaining number of partitions (by cache group) for indexes to rebuild.

Note that the IndexBuildCountPartitionsLeft metric allows to estimate only an approximate number of indexes left to rebuild. For a more accurate estimate, use the IndexRebuildKeyProcessed cache metric:

  • Use isIndexRebuildInProgress to know whether the indexes are being rebuilt for the cache.

  • Use IndexRebuildKeysProcessed to know the number of keys with rebuilt indexes. If the rebuilding is in progress, it gives a number of keys with indexes being rebuilt at the current moment. Otherwise, it gives a total number of the of keys with rebuilt indexes. The values are reset before the start of each rebuilding.

Monitoring Transactions

Note that if a transaction spans multiple nodes (i.e., if the keys that are changed as a result of the transaction execution are located on multiple nodes), the counters will increase on each node. For example, the 'TransactionsCommittedNumber' counter will increase on each node where the keys affected by the transaction are stored.

Attribute

Type

Description

Scope

LockedKeysNumber

long

The number of keys locked on the node.

Node

TransactionsCommittedNumber

long

The number of transactions that have been committed on the node

Node

TransactionsRolledBackNumber

long

The number of transactions that were rolled back.

Node

OwnerTransactionsNumber

long

The number of transactions initiated on the node.

Node

TransactionsHoldingLockNumber

long

The number of open transactions that hold a lock on at least one key on the node.

Node

Monitoring Snapshots

Attribute

Type

Description

Scope

LastSnapshotOperation

LastSnapshotStartTime

SnapshotInProgress

Monitoring Client Connections

Metrics related to JDBC/ODBC or thin client connections.

Attribute

Type

Description

Scope

Connections

java.util.List<String>

A list of strings, each string containing information about a connection:

JdbcClient [id=4294967297, user=<anonymous>,
rmtAddr=127.0.0.1:39264, locAddr=127.0.0.1:10800]

Node

Monitoring Message Queues

When thread pools queues' are growing, it means that the node cannot keep up with the load, or there was an error while processing messages in the queue. Continuous growth of the queue size can lead to OOM errors.

Communication Message Queue

The queue of outgoing communication messages contains communication messages that are waiting to be sent to other nodes. If the size is growing, it means there is a problem.

Attribute

Type

Description

Scope

OutboundMessagesQueueSize

int

The size of the queue of outgoing communication messages.

Node

Discovery Messages Queue

The queue of discovery messages.

Attribute

Type

Description

Scope

MessageWorkerQueueSize

int

The size of the queue of discovery messages that are waiting to be sent to other nodes.

Node

AvgMessageProcessingTime

long

Average message processing time.

Node