The PostgreSQL planner relies on statistical information about the contents of tables in order to generate the most efficient execution plan for queries. These statistics include a list of some of the most common values in each column and a histogram showing the approximate data distribution in each column. For large tables, a random sample of the table contents is taken, rather than examining every row. This enables large tables to be analyzed in a small amount of time. The statistical information is gathered by the ANALYZE
command and stored in the pg_statistic
catalog table. Since different kinds of statistics may be appropriate for different kinds of data, the table only stores very general statistics (such as number of null values) in dedicated columns. Everything else is stored in five “slots”, which are couples of array columns that store the statistics for a column of an arbitrary type.
The statistics collected for time types and temporal types are based on those collected by PostgreSQL for scalar types and range types. For scalar types, like float
, the following statistics are collected:
fraction of null values,
average width, in bytes, of non-null values,
number of different non-null values,
array of most common values and array of their frequencies,
histogram of values, where the most common values are excluded,
correlation between physical and logical row ordering.
For range types, like tstzrange
, three additional histograms are collected:
length histogram of non-empty ranges,
histograms of lower and upper bounds.
For geometries, in addition to (1)–(3), the following statistics are collected:
number of dimensions of the values, N-dimensional bounding box, number of rows in the table, number of rows in the sample, number of non-null values,
N-dimensional histogram that divides the bounding box into a number of cells and keeps the proportion of values that intersects with each cell.
The statistics collected for the new time types timestampset
, period
, and periodset
replicate those collected by PostgreSQL for the tstzrange
. This is clear for the period
type, which is equivalent to tszrange
, excepted that periods cannot be empty. For the timestampset
and the periodset
types, a value is converted into its bounding box which is a period
, then the period
statistics are collected.
The statistics collected for temporal types depend on their duration and their base type. In addition to statistics (1)–(3) that are collected for all temporal types, statistics are collected for the value dimension and the time dimension independently. More precisely, the following statistics are collected for the time dimension:
For temporal instant values, the statistics (4)–(6) are collected for the timestamps.
For all other durations, the statistics (7)–(8) are collected for the (bounding box) periods.
The following statistics are collected for the value dimension:
For temporal types with stepwise interpolation (that is, temporal types whose base type is tbool
, ttext
, or tint
):
For the instant duration, the statistics (4)–(6) are collected for the values.
For all other durations, the statistics (7)–(8) are collected for the values.
For temporal float types (that is, tfloat
):
For instant values, the statistics (4)–(6) are collected for the values.
For all other durations, the statistics (7)–(8) are collected for the (bounding) value ranges.
For temporal point types (that is, tgeompoint
and tgeogpoint
) the statistics (9)–(10) are collected for the points.
Boolean operators in PostgreSQL can be associated with two selectivity functions, which compute how likely a value of a given type will match a given criterion. These selectivity functions rely on the statistics collected. There are two types of selectivity functions. The restriction selectivity functions try to estimate the percentage of the rows in a table that satisfy a WHERE
-clause condition of the form column OP constant
. The join selectivity functions try to estimate the percentage of the rows in a table that satisfy a WHERE
-clause condition of the form table1.column1 OP table2.column2
.
MobilityDB defines 23 classes of Boolean operators (such as =
, <
, &&
, <<
, etc.), each of which can have as left or right arguments a built-in type (such as int
, timestamptz
, etc.) or a new type (such as period
, tintseq
, etc.). As a consequence, there is a very high number of operators with different arguments to be considered for the selectivity functions. The approach taken was to group these combinations into classes corresponding to the value and temporal features. The classes correspond to the type of statistics collected as explained in the previous section.
Currently, only restriction selectivity functions are implemented for temporal types, while join selectivity functions give a default selectivity value depending on the operator. It is planned to implement joint selectivity functions in the future.