The PostgreSQL planner relies on statistical information about the contents of tables in order to generate the most efficient execution plan for queries. These statistics include a list of some of the most common values in each column and a histogram showing the approximate data distribution in each column. For large tables, a random sample of the table contents is taken, rather than examining every row. This enables large tables to be analyzed in a small amount of time. The statistical information is gathered by the
ANALYZE command and stored in the
pg_statistic catalog table. Since different kinds of statistics may be appropriate for different kinds of data, the table only stores very general statistics (such as number of null values) in dedicated columns. Everything else is stored in five “slots”, which are couples of array columns that store the statistics for a column of an arbitrary type.
The statistics collected for time types and temporal types are based on those collected by PostgreSQL for scalar types and range types. For scalar types, like
float, the following statistics are collected:
fraction of null values,
average width, in bytes, of non-null values,
number of different non-null values,
array of most common values and array of their frequencies,
histogram of values, where the most common values are excluded,
correlation between physical and logical row ordering.
For range types, like
tstzrange, three additional histograms are collected:
length histogram of non-empty ranges,
histograms of lower and upper bounds.
For geometries, in addition to (1)–(3), the following statistics are collected:
number of dimensions of the values, N-dimensional bounding box, number of rows in the table, number of rows in the sample, number of non-null values,
N-dimensional histogram that divides the bounding box into a number of cells and keeps the proportion of values that intersects with each cell.
The statistics collected for the new time types
periodset replicate those collected by PostgreSQL for the
tstzrange. This is clear for the
period type, which is equivalent to
tszrange, excepted that periods cannot be empty. For the
timestampset and the
periodset types, a value is converted into its bounding box which is a
period, then the
period statistics are collected.
The statistics collected for temporal types depend on their duration and their base type. In addition to statistics (1)–(3) that are collected for all temporal types, statistics are collected for the value dimension and the time dimension independently. More precisely, the following statistics are collected for the time dimension:
For temporal instant values, the statistics (4)–(6) are collected for the timestamps.
For all other durations, the statistics (7)–(8) are collected for the (bounding box) periods.
The following statistics are collected for the value dimension:
For temporal types with stepwise interpolation (that is, temporal types whose base type is
For the instant duration, the statistics (4)–(6) are collected for the values.
For all other durations, the statistics (7)–(8) are collected for the values.
For temporal float types (that is,
For instant values, the statistics (4)–(6) are collected for the values.
For all other durations, the statistics (7)–(8) are collected for the (bounding) value ranges.
For temporal point types (that is,
tgeogpoint) the statistics (9)–(10) are collected for the points.
Boolean operators in PostgreSQL can be associated with two selectivity functions, which compute how likely a value of a given type will match a given criterion. These selectivity functions rely on the statistics collected. There are two types of selectivity functions. The restriction selectivity functions try to estimate the percentage of the rows in a table that satisfy a
WHERE-clause condition of the form
column OP constant. The join selectivity functions try to estimate the percentage of the rows in a table that satisfy a
WHERE-clause condition of the form
table1.column1 OP table2.column2.
MobilityDB defines 23 classes of Boolean operators (such as
<<, etc.), each of which can have as left or right arguments a built-in type (such as
timestamptz, etc.) or a new type (such as
tintseq, etc.). As a consequence, there is a very high number of operators with different arguments to be considered for the selectivity functions. The approach taken was to group these combinations into classes corresponding to the value and temporal features. The classes correspond to the type of statistics collected as explained in the previous section.
Currently, only restriction selectivity functions are implemented for temporal types, while join selectivity functions give a default selectivity value depending on the operator. It is planned to implement joint selectivity functions in the future.