PostgreSQL provides partitioning mechanisms so that large tables can be split into smaller physical tables. This may improve performance when querying and manipulating large tables. We will split the Trips table given in the previous section using list partitioning, where each partition will contain all the trips that start on a particular date. To do this, we use the function given next, which automatically creates the partitions for a date range.
CREATE OR REPLACE FUNCTION create_partitions_by_date(TableName TEXT, StartDate DATE,
  EndDate DATE)
RETURNS void AS $$
DECLARE
  d DATE;
  PartitionName TEXT;
BEGIN
  IF NOT EXISTS (
    SELECT 1 FROM information_schema.tables WHERE table_name = lower(TableName))
  THEN
    RAISE EXCEPTION 'Table % does not exist', TableName;
  END IF;
  IF StartDate >= EndDate THEN
    RAISE EXCEPTION 'The start date % must be before the end date %', StartDate, EndDate;
  END IF;
  d = StartDate;
  WHILE d <= EndDate
  LOOP
    PartitionName = TableName || '_' || to_char(d, 'YYYY_MM_DD');
    IF NOT EXISTS (
      SELECT 1 FROM information_schema.tables WHERE table_name = lower(PartitionName))
    THEN
      EXECUTE format('CREATE TABLE %s PARTITION OF %s FOR VALUES IN (''%s'');',
        PartitionName, TableName, to_char(d, 'YYYY-MM-DD'));
      RAISE NOTICE 'Partition % has been created', PartitionName;
    END IF;
    d = d + '1 day'::interval;
  END LOOP;
  RETURN;
END
$$ LANGUAGE plpgsql;
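To make the effect of the function concrete, the following sketch shows the statements it would execute for a two-day range. It assumes that a table Trips partitioned by list on a date column already exists (it is created later in this section); the dates are chosen only for illustration.

```sql
-- Equivalent to: SELECT create_partitions_by_date('Trips', '2020-06-01', '2020-06-02');
-- Assumes Trips already exists and was declared with PARTITION BY LIST on a date column.
CREATE TABLE Trips_2020_06_01 PARTITION OF Trips FOR VALUES IN ('2020-06-01');
CREATE TABLE Trips_2020_06_02 PARTITION OF Trips FOR VALUES IN ('2020-06-02');
```

Since the function checks information_schema.tables before each CREATE TABLE, calling it again with an overlapping date range only creates the partitions that are still missing.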
In order to partition table Trips by date we need to add an additional column TripDate to table TripsInput.
ALTER TABLE TripsInput ADD COLUMN TripDate DATE;
UPDATE TripsInput T1
SET TripDate = T2.TripDate
FROM (SELECT DISTINCT TripId, date_trunc('day', MIN(T) OVER (PARTITION BY TripId)) AS TripDate
  FROM TripsInput) T2
WHERE T1.TripId = T2.TripId;
Notice that the UPDATE statement above takes into account the fact that a trip may finish on a later day than the one on which it starts: every row of a trip is assigned the date of the earliest timestamp of that trip.
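A quick sanity check of this step could look as follows. It is only a sketch and assumes, as in the statements above, that column T of TripsInput holds the timestamp of each observation. The first query lists the trips that span more than one calendar day; the second one should return no rows once the UPDATE has been executed.

```sql
-- Trips whose observations span more than one calendar day
SELECT TripId, MIN(T) AS StartTime, MAX(T) AS EndTime
FROM TripsInput
GROUP BY TripId
HAVING MIN(T)::date <> MAX(T)::date;

-- Trips that do not have exactly one TripDate (expected to be empty after the UPDATE)
SELECT TripId, COUNT(DISTINCT TripDate) AS NumDates
FROM TripsInput
GROUP BY TripId
HAVING COUNT(DISTINCT TripDate) <> 1;
```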
The following statements create table Trips
partitioned by date and the associated partitions.
DROP TABLE Trips CASCADE;
CREATE TABLE Trips (
  TripId integer,
  TripDate date,
  VehId integer NOT NULL REFERENCES Vehicles(VehId),
  Trip tgeompoint NOT NULL,
  Traj geometry,
  PRIMARY KEY (TripId, TripDate)
) PARTITION BY LIST(TripDate);

SELECT create_partitions_by_date('Trips', (SELECT MIN(TripDate) FROM TripsInput),
  (SELECT MAX(TripDate) FROM TripsInput));
To see the partitions that have been created automatically we can use the following statement.
SELECT I.inhrelid::regclass AS child
FROM pg_inherits I
WHERE I.inhparent = 'trips'::regclass;
In our case this would result in the following output.
trips_2020_06_01
trips_2020_06_02
trips_2020_06_03
trips_2020_06_04
trips_2020_06_05
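If the bound of each partition is also of interest, a variant of the query above can join pg_class and decode the stored bound with pg_get_expr. This is a sketch that relies on the system catalog column relpartbound, which is available in PostgreSQL versions supporting declarative partitioning; for list partitions the bound is displayed in the form FOR VALUES IN (...).

```sql
-- List each partition of Trips together with its partition bound
SELECT C.relname AS child, pg_get_expr(C.relpartbound, C.oid) AS bound
FROM pg_inherits I JOIN pg_class C ON C.oid = I.inhrelid
WHERE I.inhparent = 'trips'::regclass
ORDER BY C.relname;
```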
We modify the query that loads table Trips
from the data in table TripsInput
as follows.
INSERT INTO Trips
SELECT TripId, TripDate, VehId, tgeompoint_seq(array_agg(tgeompoint_inst(
  ST_Transform(ST_SetSRID(ST_MakePoint(PosX,PosY), 4326), 5676), T) ORDER BY T))
FROM TripsInput
GROUP BY TripId, TripDate, VehId;
We can see how many trips are in each partition of table Trips as follows.
SELECT COUNT(*) FROM trips_2020_06_01;
-- 423
SELECT COUNT(*) FROM trips_2020_06_02;
-- 411
SELECT COUNT(*) FROM trips_2020_06_03;
-- 415
SELECT COUNT(*) FROM trips_2020_06_04;
-- 419
SELECT COUNT(*) FROM trips_2020_06_05;
-- 4
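The same counts can probably be obtained with a single query on the parent table, which avoids naming every partition. The sketch below uses the system column tableoid, which identifies the physical table (here, the partition) that stores each row; the aliases child and NumTrips are chosen only for illustration.

```sql
-- Number of trips per partition, computed from the partitioned table itself
SELECT tableoid::regclass::text AS child, COUNT(*) AS NumTrips
FROM Trips
GROUP BY tableoid
ORDER BY child;
```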
Then, we can define the indexes and the views on the table Trips
as shown in the previous section.
An important advantage of the partitioning mechanism in PostgreSQL is that the constraints and the indexes defined on the Trips table are propagated to the partitions as shown next.
INSERT INTO Trips VALUES (1, '2020-06-01', 10,
  '[POINT(2389629.8979609837 5626986.483650829)@2020-06-02 08:00]');
-- ERROR: duplicate key value violates unique constraint "trips_2020_06_01_pkey"
-- DETAIL: Key (tripid, tripdate)=(1, 2020-06-01) already exists.
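One way to inspect what has been propagated to a given partition is to query the pg_indexes system view, as sketched below for the first partition. The result should include at least the primary key index trips_2020_06_01_pkey mentioned in the error message above, plus any index defined on Trips.

```sql
-- Indexes that exist on a single partition of Trips
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'trips_2020_06_01';
```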
Similarly, queries on the Trips
table are propagated to the partitions as shown next.
EXPLAIN SELECT COUNT(*) FROM Trips WHERE Trip && period '[2020-06-02, 2020-06-03)';
If there is no index defined on the Trip
column, the execution plan of the query is as follows:
Aggregate (cost=63.64..63.65 rows=1 width=8)
  -> Append (cost=0.00..63.62 rows=5 width=0)
    -> Seq Scan on trips_2020_06_01 trips_1 (cost=0.00..11.29 rows=1 width=0)
      Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::period)
    -> Seq Scan on trips_2020_06_02 trips_2 (cost=0.00..11.14 rows=1 width=0)
      Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::period)
    -> Seq Scan on trips_2020_06_03 trips_3 (cost=0.00..11.19 rows=1 width=0)
      Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::period)
    -> Seq Scan on trips_2020_06_04 trips_4 (cost=0.00..10.24 rows=1 width=0)
      Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::period)
    -> Seq Scan on trips_2020_06_05 trips_5 (cost=0.00..19.75 rows=1 width=0)
      Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::period)
After defining an index on the Trip
column as follows
CREATE INDEX Trips_Trip_gist_Idx ON Trips USING gist (Trip);
the execution plan of the query is as follows
Aggregate (cost=33.73..33.74 rows=1 width=8)
  -> Append (cost=0.14..33.71 rows=5 width=0)
    -> Index Scan using trips_2020_06_01_trip_idx on trips_2020_06_01 trips_1 (cost=0.14..8.16 rows=1 width=0)
      Index Cond: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::period)
    -> Index Scan using trips_2020_06_02_trip_idx on trips_2020_06_02 trips_2 (cost=0.14..8.16 rows=1 width=0)
      Index Cond: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::period)
    -> Index Scan using trips_2020_06_03_trip_idx on trips_2020_06_03 trips_3 (cost=0.14..8.16 rows=1 width=0)
      Index Cond: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::period)
    -> Index Scan using trips_2020_06_04_trip_idx on trips_2020_06_04 trips_4 (cost=0.14..8.16 rows=1 width=0)
      Index Cond: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::period)
    -> Seq Scan on trips_2020_06_05 trips_5 (cost=0.00..1.05 rows=1 width=0)
      Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::period)
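The plans above access every partition because the condition involves only the Trip column and not the partitioning key. As a hedged illustration, adding a condition on TripDate should allow the planner to skip the partitions whose list value cannot match (partition pruning). Note that the extra condition changes the meaning of the query: it counts only the trips that start on 2020-06-02, so a trip that starts on the previous day and is still running on 2020-06-02 is excluded.

```sql
-- Same query restricted on the partitioning key, so that only the matching
-- partition needs to be scanned (assumes partition pruning is enabled, the default)
EXPLAIN
SELECT COUNT(*)
FROM Trips
WHERE TripDate = '2020-06-02' AND Trip && period '[2020-06-02, 2020-06-03)';
```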