SQL Statistics
Overview
Apache Ignite can calculate statistics by request and use it to build an optimal SQL query plan. This enables you to significantly speed up the SQL query execution.
Without statistics, SQL query execution planner tries to guess about the query conditions selectivity using only general heuristics. To get better plans, make sure that statistics usage is enabled and configure statistics collection for tables participated in the query. See the example for more details.
Statistics are checked and updated each time after one of the following actions:
-
Node start
-
Topology change
-
Configuration change
A node checks partitions and collects statistics for each of them, which can be used in SQL query optimization.
Configuring Statistics
By default, statistics functionality is enabled.
Statistics are stored locally, while statistics configuration parameters are stored cluster-wide.
To enable or disable statistics in cluster use, execute the following command, specifying ON
, OFF
, or NO_UPDATE
values:
control.sh --property set --name 'statistics.usage.state' --val 'ON'
control.bat --property set --name 'statistics.usage.state' --val 'ON'
To see the statistics usage state, execute the command below:
control.sh --property get --name 'statistics.usage.state'
control.bat --property get --name 'statistics.usage.state'
For more details on configuring SQL statistics, see the System Views and Data Definition Language pages.
Statistics Overriding
Gathered values can be overridden by specifying additional parameters in the ANALYZE command. The specified values override gathered one on each node in STATISTICS_LOCAL_DATA (this data is used by SQL query optimizer), but not in STATISTICS_PARTITION_DATA (keeps real statistics info by partitions). Then, overriden values are used by SQL query optimizer.
Each ANALYZE command overrides all such values for its targets. For example, if there is an already overridden TOTAL value and a need to override DISTINCT value, one should use both parameters in one ANALYZE command. To set different values for different columns, use several ANALYZE commands as follows:
ANALYZE MY_TABLE(COL_A) WITH 'DISTINCT=5,NULLS=6';
ANALYZE MY_TABLE(COL_B) WITH 'DISTINCT=500,NULLS=1000,TOTAL=10000';
Statistics Obsolescence
Each partition has a specialized counter to track the total number of modified (inserted, removed, or updated) rows. If the total number of modified rows is greater than MAX_CHANGED_PARTITION_ROWS_PERCENT
, a partition is analyzed again. After that, a node aggregates statistics once again to get new statistics.
To adjust the MAX_CHANGED_PARTITION_ROWS_PERCENT parameter, run the ANALYZE
command with a desired parameter value once again.
By default, the DEFAULT_OBSOLESCENCE_MAX_PERCENT = 15
parameter is used.
These parameters are applied for all specified targets.
Note
|
As statistics are aggregated through the full scan of each partition, we recommend to disable statistics obsolescence functionality when working with a small number of changing rows. This is particularly true in the case of working with large amounts of data, when a full scan may lead to a performance drop. |
To save CPU resources on obsolescence tracking, use the NO_UPDATE
state:
control.sh --property set --name 'statistics.usage.state' --val 'NO_UPDATE'
control.bat --property set --name 'statistics.usage.state' --val 'NO_UPDATE'
Obtaining a Better Execution Plan Using Statistics
The steps below show an example of obtaining an optimized execution plan for basic query.
-
Create a table and insert data into it:
CREATE TABLE statistics_test(col1 int PRIMARY KEY, col2 varchar, col3 date); INSERT INTO statistics_test(col1, col2, col3) VALUES(1, 'val1', '2019-01-01'); INSERT INTO statistics_test(col1, col2, col3) VALUES(2, 'val2', '2019-03-01'); INSERT INTO statistics_test(col1, col2, col3) VALUES(3, 'val3', '2019-06-01'); INSERT INTO statistics_test(col1, col2, col3) VALUES(4, 'val4', '2019-09-01'); INSERT INTO statistics_test(col1, col2, col3) VALUES(5, 'val5', '2019-12-01'); INSERT INTO statistics_test(col1, col2, col3) VALUES(6, 'val6', '2020-02-01'); INSERT INTO statistics_test(col1, col2, col3) VALUES(7, 'val7', '2020-05-01'); INSERT INTO statistics_test(col1, col2, col3) VALUES(8, 'val8', '2020-08-01'); INSERT INTO statistics_test(col1, col2, col3) VALUES(9, 'val9', '2020-11-01');
-
Create indexes for each column:
CREATE INDEX st_col1 ON statistics_test(col1); CREATE INDEX st_col2 ON statistics_test(col2); CREATE INDEX st_col3 ON statistics_test(col3);
-
Get execution plan for basic query:
Note that the
col2
value is less than the max value in the table, while thecol3
value is higher than the maximum one. So it is highly probable that the second condition returns no result, which makes its selectivity higher. Therefore the database should usest_col3
index.EXPLAIN SELECT * FROM statistics_test WHERE col2 > 'val2' AND col3 > '2020-12-01' SELECT "__Z0"."COL1" AS "__C0_0", "__Z0"."COL2" AS "__C0_1", "__Z0"."COL3" AS "__C0_2" FROM "PUBLIC"."STATISTICS_TEST" "__Z0" /* PUBLIC.ST_COL2: COL2 > 'val2' */ WHERE ("__Z0"."COL2" > 'val2') AND ("__Z0"."COL3" > DATE '2020-12-01')
Note that without collected statistics the database does not have enough information to choose the right index (as both indexes have the same selectivity from the planner’s point of view). This issue is fixed below.
-
Collect statistics for
statistics_test table
:ANALYZE statistics_test;
-
Get execution plan again and make sure that
st_col3 index
is chosen:EXPLAIN SELECT * FROM statistics_test WHERE col2 > 'val2' AND col3 > '2020-12-01' SELECT "__Z0"."COL1" AS "__C0_0", "__Z0"."COL2" AS "__C0_1", "__Z0"."COL3" AS "__C0_2" FROM "PUBLIC"."STATISTICS_TEST" "__Z0" /* PUBLIC.ST_COL3: COL3 > DATE '2020-12-01' */ WHERE ("__Z0"."COL2" > 'val2') AND ("__Z0"."COL3" > DATE '2020-12-01')
Apache, Apache Ignite, the Apache feather and the Apache Ignite logo are either registered trademarks or trademarks of The Apache Software Foundation.